# Understanding spurious correlation in data-mining

**doctorow**#1

**Nylund**#3

“Anyone who thinks it’s possible to draw truthful conclusions from data analysis without really learning statistics needs to read this.”

That’s all good and true. You probably can’t trust the statistical results presented by someone who never really learned statistics. Similarly, though, you shouldn’t put too much weight on critiques of statistical techniques by people who never really learned statistics either. I’m not saying the latter applies here.

I think a clearer message would be, “Dear non-statisticians, please realize that other non-statisticians do bad statistical analysis.”

There really isn’t much here for statisticians to learn from. They already know this stuff. It’s more about educating laypeople and/or creating a sense of superiority in one group of non-statisticians over another group of non-statisticians.

The bad thing about “geek chic” is that it’s vastly increased the number of dilettantes who like to lecture people about math and science when their credentials don’t extend much past having watched Battlestar Galactica or whatever it is they think gives them nerd-cred.

**SamSam**#4

I don’t see the cause of confusion.

It’s one thing to say “If you take a humongous pile of Big Data and just randomly run regressions, you are going to find correlations that don’t exist.”

It’s another thing to say “if you have a hypothesis about two things, and gather the data and see that there is a correlation between the two things, then that adds some evidence that there is a connection between the two things.”
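To make the first of those two statements concrete, here is a minimal Python sketch of the "pile of data, random regressions" situation. It is an illustration under stated assumptions, not anything taken from the linked post: the names `pearson` and `count_spurious`, the 40 variables, the 30 observations, and the 0.36 cutoff (roughly the two-sided 5% critical value of r for 30 observations) are all hypothetical choices. Every column is independent noise, so any pair that clears the cutoff is a correlation that doesn't exist.

```python
import random
from itertools import combinations


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def count_spurious(n_vars=40, n_obs=30, threshold=0.36, seed=0):
    """Correlate every pair of pure-noise columns; count pairs above the cutoff.

    All parameter values are illustrative assumptions.
    Returns (number of pairs with |r| > threshold, total number of pairs).
    """
    rng = random.Random(seed)
    # Each column is independent standard-normal noise: no real relationships.
    columns = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
    pairs = list(combinations(range(n_vars), 2))
    hits = sum(
        1 for i, j in pairs
        if abs(pearson(columns[i], columns[j])) > threshold
    )
    return hits, len(pairs)


if __name__ == "__main__":
    hits, total = count_spurious()
    print(f"{hits} of {total} noise-only pairs look 'significant'")
```

With 40 columns there are 780 pairs to test, so a 5% false-positive rate should turn up dozens of "findings" by chance alone. Testing one hypothesis chosen in advance doesn't have that problem, which is the distinction being drawn here.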

In the case of the guns study, there are numerous plausible explanations. Off the top of my head: there are more people with guns in certain states, and there are more people who got a higher score on that “symbolic racism” test in those states (maybe because of the wording of the test), so the two are therefore correlated. Just because this one emotionally got your goat (because it involves guns) doesn’t mean you have to say that all statistics are bullshit.

**Nylund**#5

As the post states:

Given two measurements $x_i \in X$ and $y_i \in Y$ on a set of points $p_1, \dots, p_n \in P$, if the value of $x_i + y_i$ increases the chance that $p_i$ will be sampled, it will introduce a phantom correlation between $X$ and $-Y$.

For the gun/racism thing that would translate to a “phantom correlation” if your gun-ownership status plus your racism score made it more likely that you would show up in the American National Election Study. If that is the case, then this particular issue is one to be aware of.
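The selection effect quoted above is easy to check with a small simulation. The sketch below is a hedged illustration, not the method from the post: `simulate_selection`, the logistic sampling rule, its slope of 2, and the population size are all assumptions made for the example. X and Y are drawn independently, then each point is kept with a probability that rises with x + y.

```python
import math
import random


def pearson(pairs):
    """Sample Pearson correlation of a list of (x, y) tuples."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / (var_x * var_y) ** 0.5


def simulate_selection(n=20000, slope=2.0, seed=1):
    """Compare the X-Y correlation before and after selecting on x + y.

    The logistic inclusion rule and its slope are illustrative assumptions.
    Returns (correlation in full population, correlation in selected sample).
    """
    rng = random.Random(seed)
    # X and Y are independent, so the true correlation is zero.
    population = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
    # A point is more likely to be sampled the larger x + y is.
    sampled = [
        (x, y) for x, y in population
        if rng.random() < 1.0 / (1.0 + math.exp(-slope * (x + y)))
    ]
    return pearson(population), pearson(sampled)


if __name__ == "__main__":
    r_population, r_sampled = simulate_selection()
    print(f"full population r = {r_population:+.3f}")
    print(f"selected sample r = {r_sampled:+.3f}")
```

In the full population the correlation should sit near zero, while in the selected sample X and Y come out negatively correlated, which is the same thing as a positive phantom correlation between X and -Y. The intuition: among points that made it into the sample, a low x had to be compensated by a high y, and vice versa.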

Out of all the potential statistical problems with the gun/racism study, that’s a pretty minor one to worry about though.
