A/B testing tools have created a golden age of shitty statistical practices in business

doctorow · July 25, 2018, 2:45pm

Originally published at: https://boingboing.net/2018/07/25/behavioral-statistical-economi.html

…

anon61833566 · July 25, 2018, 3:30pm

We do A/B testing constantly with web content, and I kept trying to explain that the results are garbage because the process is totally invalid.

I gave up two years ago out of futility.

LurkingGrue · July 25, 2018, 4:52pm

I always wondered why Microsoft claimed their data suggested nobody was using the Start button and so they could remove it from Windows 8. I suspected they were just reading their telemetry data wrong to support their ideas.

Nice to know this is probably the case.

fredtal · July 25, 2018, 5:00pm

Somewhat related is the book I’m currently trying to read, “Fooled by Randomness”. It’s supposed to be about how we find order when there isn’t any.

Dioptase1 · July 25, 2018, 5:00pm

A better reason for ending tests early is they cost money. Both in running the test and, in theory, implementing the results.

On the surface, it seems perfectly valid to look at early results, run statistics, and if the results are clear, why bother with more results? Move to the next stage! And sometimes this is true (Hey, every patient we try this drug on dies! Stop the testing!) and sometimes not (Both late stage pancreatic cancer patients we tried this on died … because that’s what normally happens. Maybe we shouldn’t give up hope yet.).

The rigor of the statistics should match the importance of the results. Doing A/B testing of a new font for a “Which superhero are you?” quiz? Cutting the testing short to save a few bucks and getting it wrong is OK. Measuring rainfall for only a few weeks to set water conservation policy because it rained? Maybe not quite so smart.
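The risk of "look at early results and stop if they are clear" can be made concrete with a small simulation. The sketch below is illustrative only: it runs A/A tests (both arms share the same conversion rate, so any "winner" is a false positive) and compares checking once at the end against checking every `step` visitors and stopping at the first significant result. The function names, the 10% conversion rate, the sample sizes, and the 1.96 cutoff are all assumptions chosen for the example, not anything taken from a particular A/B tool.

```python
import math
import random


def z_stat(conv_a, conv_b, n):
    """Two-proportion z statistic for two arms of equal size n."""
    p_pool = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * p_pool * (1 - p_pool) / n)
    if se == 0:
        return 0.0
    return (conv_b / n - conv_a / n) / se


def false_positive_rate(peek, trials=1000, n_max=1000, step=100,
                        rate=0.10, seed=1):
    """Fraction of A/A tests declared 'significant' at |z| > 1.96.

    peek=False: test once, after n_max visitors per arm.
    peek=True:  test every `step` visitors and stop at the first hit.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        conv_a = conv_b = 0
        significant = False
        for n in range(1, n_max + 1):
            # Both arms convert at the same rate: there is no real effect.
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            check_now = (n % step == 0) if peek else (n == n_max)
            if check_now and abs(z_stat(conv_a, conv_b, n)) > 1.96:
                significant = True
                break
        hits += significant
    return hits / trials


if __name__ == "__main__":
    print("single look at the end:", false_positive_rate(peek=False))
    print("peek every 100 visitors:", false_positive_rate(peek=True))
```

Under these assumptions the single-look rate should sit near the nominal 5%, while the peeking rate should come out well above it, which is the cost being traded against the money saved by stopping early.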

stanestane · July 25, 2018, 7:36pm

At my work we do a lot of A/B testing. In my experience its main use is to post hoc justify business decisions that were already made. Because, hey… science…

fuzzyfungus · July 25, 2018, 8:46pm

Is it an overwhelming triumph or an overwhelming failure for the UI/UX guys that, while ‘garbage in, garbage out’ remains an iron law, we are increasingly able to format the output garbage into something that 68% of users find more trustworthy than truth?

anon30760835 · July 25, 2018, 11:03pm

That’s a nice euphemism.

anon30760835 · July 25, 2018, 11:09pm

That’s a better reason for not-assing them at all, instead of half-assing them.

In a word, “Science”. Clarity comes only -after- all the analysis. Not before. That’s bias.

As long as it’s made clear up front that only half the paid for work is being done…

This.

Justification:Explanation::Might:Right

anon30760835 · July 25, 2018, 11:11pm

reusing this:

FGD135 · July 26, 2018, 4:53am

Or this:

bolamig · July 26, 2018, 9:50am

I got it when the psychologists had their reproducibility crisis. But computer engineers getting this wrong is a sign of the apocalypse. I have to suspect that the AIs that will replace us aren’t so lazy about doing statistics.

Dioptase1 · July 26, 2018, 4:43pm

You have a messed up view of testing. I end tests early all the time. Why finish a test of 298 devices to demonstrate 90 at 95 when two of the first 50 broke in the middle of the test? Science? Wrong wrong wrong.
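For readers unfamiliar with the shorthand, "90 at 95" is read here as demonstrating 90% reliability with 95% confidence; that reading is an assumption. A minimal sketch of how such a pass/fail plan is commonly sized with a binomial model follows. The function names are hypothetical, and the poster's actual plan may have been built differently.

```python
import math


def pass_probability(n, max_failures, reliability):
    """Chance that n units with the given true reliability show at most
    max_failures failures (binomial model, independent units)."""
    q = 1.0 - reliability
    return sum(
        math.comb(n, k) * q**k * reliability ** (n - k)
        for k in range(max_failures + 1)
    )


def min_sample_size(reliability, confidence, max_failures=0):
    """Smallest n such that a product that only just meets `reliability`
    would pass the test with probability at most 1 - confidence."""
    n = max_failures + 1
    while pass_probability(n, max_failures, reliability) > 1.0 - confidence:
        n += 1
    return n


if __name__ == "__main__":
    print(min_sample_size(0.90, 0.95))                  # zero failures allowed
    print(min_sample_size(0.90, 0.95, max_failures=1))  # one failure allowed
```

With zero failures allowed, `min_sample_size(0.90, 0.95)` returns 29. The early-stopping logic in the post follows from the same model: once more units have failed than the plan allows, the remaining units cannot rescue the result, so running them only costs money.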

We also run what are called screening tests. Basically, test the test. You run the test to gather some preliminary data and check to see what needs to be changed in the procedure. Sometimes things run so smoothly that you quit early. This feeds into your sample size and plan.

But these are all decided based on the risk. Continuous analysis based on results so far vs risk/benefit. What you seem to miss is me saying that can be misleading. Even complete testing can be misleading. A P-value threshold of 0.05 means that you are accepting a 5% chance of declaring an effect when nothing real is there. That’s not clarity after analysis. That’s statistics. No such thing as clarity. Just an assessment of risk.

jimmoffet · July 27, 2018, 12:20am

“a tendency to terminate sooner when observing effects small rather than large in magnitude”

This is a sound practice.

If you have a given budget, reasonable sample sizes, behavior that is not wildly volatile, and you can observe that early performance is positively correlated with performance at completion, then you should absolutely be stopping tests with small effects early. The goal is to discover new, large effect sizes as quickly as possible, not to build a catalog of true effect sizes regardless of magnitude. The risk/cost of stopping early depends on the strength of the correlation between early and late results (just don’t skimp on that one).
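One way to check the correlation this argument rests on is to compare the interim estimate with the final estimate across many tests. The sketch below does this on simulated data only; the baseline rate, the spread of true lifts, the sample sizes, and the function names are all hypothetical choices for illustration.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def early_late_correlation(n_tests=300, n_early=500, n_full=2000,
                           base=0.10, seed=7):
    """Correlate the observed lift after n_early visitors per arm with
    the observed lift after n_full visitors, across simulated tests."""
    rng = random.Random(seed)
    early, late = [], []
    for _ in range(n_tests):
        lift = rng.gauss(0.0, 0.02)  # hypothetical true lift for this test
        p_b = min(max(base + lift, 0.0), 1.0)
        diff_sum = 0  # running (B conversions - A conversions)
        for i in range(1, n_full + 1):
            diff_sum += (rng.random() < p_b) - (rng.random() < base)
            if i == n_early:
                early.append(diff_sum / n_early)
        late.append(diff_sum / n_full)
    return pearson(early, late)


if __name__ == "__main__":
    print("early vs. final lift correlation:", early_late_correlation())
```

Part of any correlation measured this way is mechanical, because the early visitors are a subset of the final sample. A stricter check would correlate the early estimate with the lift measured only on the visitors who arrived after the interim look.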

doctorow · July 30, 2018, 2:45pm

This topic was automatically closed after 5 days. New replies are no longer allowed.
