A/B testing tools have created a golden age of shitty statistical practices in business

Originally published at: https://boingboing.net/2018/07/25/behavioral-statistical-economi.html

4 Likes

We do A/B testing constantly with web content, and I kept trying to explain that the results are garbage because the process is totally invalid.

I gave up two years ago out of futility.

6 Likes

I always wondered why Microsoft claimed their data suggested nobody was using the Start button and so they could remove it from Windows 8. I suspected they were just looking at their telemetry data in whatever way supported their ideas.

Nice to know this is probably the case.

9 Likes

Somewhat related is the book I'm currently trying to read, "Fooled by Randomness." It's supposed to be about how we find order where there isn't any.

1 Like

A better reason for ending tests early is that they cost money, both in running the test and, in theory, in implementing the results.

On the surface, it seems perfectly valid to look at early results, run statistics, and if the results are clear, why bother with more results? Move to the next stage! And sometimes this is true (Hey, every patient we try this drug on dies! Stop the testing!) and sometimes not (Both late stage pancreatic cancer patients we tried this on died … because that’s what normally happens. Maybe we shouldn’t give up hope yet.).

The rigor of the statistics should match the importance of the results. Doing A/B testing of a new font for a “Which superhero are you?” quiz? Saving a few bucks to cut the testing short and getting it wrong is ok. Measuring rainfall for only a few weeks to set water conservation policy because it rained … maybe not quite so smart.
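
To see why that early peek is riskier than it looks, here is a minimal sketch in Python (all rates, batch sizes, and counts are made up for illustration): it simulates A/B tests where the two variants are identical, and compares how often a single test at the planned sample size declares a winner versus a procedure that peeks after every batch and stops at the first p < 0.05.

```python
# Minimal sketch (hypothetical rates and sizes): two identical variants, so any
# declared "winner" is a false positive. Compare one look at the planned sample
# size vs. peeking after every batch and stopping at the first p < 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
TRUE_RATE = 0.05   # both variants convert at 5%
BATCH = 500        # visitors per variant between peeks
PEEKS = 20         # number of interim looks
ALPHA = 0.05
SIMS = 2000

def p_value(a, b):
    """Two-proportion z-test on the data seen so far."""
    n = len(a)
    pooled = (a.sum() + b.sum()) / (2 * n)
    se = np.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 1.0
    z = (a.mean() - b.mean()) / se
    return 2 * stats.norm.sf(abs(z))

fixed_fp = peeking_fp = 0
for _ in range(SIMS):
    a = rng.binomial(1, TRUE_RATE, BATCH * PEEKS)
    b = rng.binomial(1, TRUE_RATE, BATCH * PEEKS)

    # peek-and-stop: test after every batch, stop at the first "significant" result
    for k in range(1, PEEKS + 1):
        if p_value(a[:k * BATCH], b[:k * BATCH]) < ALPHA:
            peeking_fp += 1
            break

    # fixed horizon: one test on the full, planned sample
    if p_value(a, b) < ALPHA:
        fixed_fp += 1

print(f"false positives, single look at the end: {fixed_fp / SIMS:.1%}")    # ~5%
print(f"false positives, stop at first p < 0.05: {peeking_fp / SIMS:.1%}")  # much higher
```

With identical variants, the single look stays near the nominal 5%, while the peek-and-stop run declares a "winner" several times as often; that gap is exactly the "getting it wrong" being traded for a cheaper test.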

2 Likes

At my work we do a lot of A/B testing. In my experience, its main use is to justify, post hoc, business decisions that were already made. Because, hey… science…

6 Likes

Is it an overwhelming triumph or an overwhelming failure for the UI/UX guys that, while ‘garbage in, garbage out’ remains an iron law, we are increasingly able to format the output garbage into something that 68% of users find more trustworthy than the truth?

5 Likes

That’s a nice euphemism.

That’s a better reason for not-assing them at all, instead of half-assing them.

In a word, “Science”. Clarity comes only -after- all the analysis. Not before. That’s bias.

As long as it’s made clear up front that only half the paid-for work is being done…

This.

Justification:Explanation::Might:Right

1 Like

I got it when the psychologists had their reproducibility crisis. But computer engineers getting this wrong is a sign of the apocalypse. I have to suspect that the AIs that will replace us won’t be so lazy about doing statistics.

You have a messed-up view of testing. I end tests early all the time. Why finish a test of 298 devices, run to demonstrate 90% reliability at 95% confidence, when two of the first 50 broke in the middle of the test? Science? Wrong, wrong, wrong.

We also run what are called screening tests. Basically, you test the test: you run it to gather some preliminary data and check what needs to be changed in the procedure. Sometimes things run so smoothly that you quit early. This feeds into your sample size and plan.

But these are all decided based on risk: continuous analysis of the results so far, weighed against risk and benefit. What you seem to have missed is that I said this can be misleading. Even complete testing can be misleading. A p-value cutoff of 0.05 means you are accepting a 5% chance that your result is BS. That’s not clarity after analysis; that’s statistics. There’s no such thing as clarity, just an assessment of risk.
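
For anyone who wants to see that accepted risk directly, here is a minimal sketch (Python, with made-up sample sizes): it runs many t-tests on pairs of samples drawn from the same distribution, so every "significant" result is a false alarm, and roughly one in twenty still comes back under 0.05.

```python
# Minimal sketch: at a 0.05 threshold, samples drawn from the SAME distribution
# still come back "significant" about one time in twenty. That is the accepted risk.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
TRIALS = 10_000
false_alarms = 0
for _ in range(TRIALS):
    x = rng.normal(0.0, 1.0, 30)   # two samples from an identical population,
    y = rng.normal(0.0, 1.0, 30)   # so any observed "difference" is pure noise
    _, p = stats.ttest_ind(x, y)
    if p < 0.05:
        false_alarms += 1

print(f"'significant' results with no real effect: {false_alarms / TRIALS:.1%}")  # ~5%
```

Whether that one-in-twenty rate is acceptable is exactly the risk assessment being described above.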

1 Like

“a tendency to terminate sooner when observing effects small rather than large in magnitude”

This is a sound practice.

If you have a given budget, reasonable sample sizes, behavior that is not wildly volatile, and you can observe that early performance correlates positively with performance at completion, then you should absolutely be stopping tests with small effects early. The goal is to discover new, large effect sizes as quickly as possible, not to build a catalog of true effect sizes regardless of magnitude. The risk and cost of stopping early depend on the strength of the correlation between early and late results (just don’t skimp on verifying that one).
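
Here is a minimal sketch of that tradeoff (Python; the base rate, lift sizes, and futility threshold are all hypothetical): many tests with mostly-zero true effects get an interim look, anything with a small observed lift is dropped, and the output shows the traffic saved against the share of genuinely large effects that get thrown away.

```python
# Minimal sketch (hypothetical numbers): drop tests whose interim lift looks small,
# assuming early results correlate with final results. Reports the traffic saved
# and the share of genuinely large effects thrown away in the process.
import numpy as np

rng = np.random.default_rng(2)
BASE = 0.05                # control conversion rate
INTERIM_N = 2_000          # visitors per arm at the interim look
FULL_N = 20_000            # visitors per arm at completion
MIN_INTERIM_LIFT = 0.005   # drop the test if the observed lift is below this
BIG_EFFECT = 0.01          # what we'd call a "large" true lift
SIMS = 5_000

dropped = big_effects = big_effects_lost = traffic = 0
for _ in range(SIMS):
    true_lift = rng.choice([0.0, 0.002, 0.01], p=[0.7, 0.2, 0.1])  # most ideas do nothing
    is_big = true_lift >= BIG_EFFECT
    big_effects += is_big

    # observed conversion rates at the interim look
    a = rng.binomial(INTERIM_N, BASE) / INTERIM_N
    b = rng.binomial(INTERIM_N, BASE + true_lift) / INTERIM_N

    if b - a < MIN_INTERIM_LIFT:
        dropped += 1
        big_effects_lost += is_big
        traffic += 2 * INTERIM_N   # stopped at the interim look
    else:
        traffic += 2 * FULL_N      # run to completion

print(f"tests dropped at the interim look: {dropped / SIMS:.1%}")
print(f"large true effects thrown away:    {big_effects_lost / max(big_effects, 1):.1%}")
print(f"average traffic per test:          {traffic / SIMS:,.0f} visitors")
```

Raising MIN_INTERIM_LIFT or shrinking INTERIM_N saves more traffic but throws away more of the rare large effects; that tradeoff is the correlation-strength caveat in practice.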

This topic was automatically closed after 5 days. New replies are no longer allowed.