Big Data's "theory-free" analysis is statistical malpractice




Big data seems to be throwing up repeated examples of analysis without insight, coming up against the same problems that have been known about (and solved) for years. This is why Statistics is an entire field of study, people!


This takes on an extra wrinkle when you consider all of the smart city proposals. A lot of these systems are premised on the idea that collecting enough data will reshape cities in a positive manner.


The hard part in research is always the testing. It requires questioning your assumptions and accepting that the world is always right, whether or not it agrees with your idea. Data mining for ideas works if afterwards those ideas are rigorously tested. Most consumers of statistics (including many professionals) just see a “significant” finding and assume causality without question.


I’d suggest a friendly reminder for any city considering becoming “smarter” that if the people who design a system don’t understand it then more data won’t help.


Does “endemic” mean “epidemic, but cooler?”


Data analysis like the “Google Flu” example is perfectly valid for generating hypotheses.

You can’t be “theory free”, but if a hypothesis built from the top 50 search terms in one year predicts the flu the next year, then you’ve found something that might be useful.
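A minimal sketch of that discipline (Python, with entirely synthetic stand-in data — no real flu or search numbers here): fit the search-term relationship on year one only, then score the frozen model on year two before trusting it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: weekly flu rates and search-term volumes over two years,
# sharing a common underlying signal plus independent noise.
weeks = 52
true_signal = rng.normal(size=weeks * 2)
flu = true_signal + rng.normal(scale=0.5, size=weeks * 2)
searches = true_signal + rng.normal(scale=0.5, size=weeks * 2)

# "Discover" the relationship on year one only...
year1 = slice(0, weeks)
slope, intercept = np.polyfit(searches[year1], flu[year1], 1)

# ...then test the frozen model on year two, data it has never seen.
year2 = slice(weeks, weeks * 2)
pred = slope * searches[year2] + intercept
r = np.corrcoef(pred, flu[year2])[0, 1]
print(f"Out-of-sample correlation: {r:.2f}")
```

If the out-of-sample correlation holds up, the mined pattern has earned some trust; if it collapses (as Google Flu's predictions eventually did), it was noise.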


You can’t have an empirical or material conclusion without the rigorous application of theory. Otherwise all you have is observations, and then you’re right back to thinking that lightning is caused by angry buffet gods, and thunder is the clanking of their forks.


Looking for patterns is a valid first step. Then you have to check that the pattern keeps holding, and form hypotheses about the reasons behind it.


Obligatory XKCD:


This doesn’t follow, but I do think it’s the case that the statistical mechanisms need to be understood (which they most often aren’t).


As a biologist, I chuckled.

As someone who knows medical usage of the term, I shrugged.


The old idea of p>.01 being the gold standard falls down when you are checking thousands of parameters. Perhaps p>.01/Nparameters needs to be the standard. But you would need a lot of data to get that, and the more data the more parameters you probably have to check.


Eh, just dump all the data into an AI black box and it’ll tell us whether it means anything or not.


Everyone who has paid attention to more than one lecture in inferential statistics knows that you have to correct for the number of tests you’re doing (.01/N parameters is the simplest correction, but there are many more sophisticated methods). But that doesn’t solve the problem: even if you do that, there are still going to be significant outcomes and interesting-looking correlations, if you just look at enough parameters, purely by chance! p values only work if you had a hypothesis beforehand; it’s in their very nature. This doesn’t get taught very eagerly in courses nowadays, although that’s changing in the wake of the replication crisis/credibility revolution in psychology and other fields.
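A quick simulation makes the "significant by chance" point concrete (Python; the counts and seed are arbitrary, just for illustration): run a z-test on 1,000 parameters of pure noise, and roughly 1% clear p < .01 anyway.

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(42)

n_params, n_obs = 1000, 50
# Pure noise: none of these "parameters" has any real effect.
data = rng.normal(size=(n_params, n_obs))

# Two-sided z-test of "mean differs from 0" for each parameter.
# (The sample mean of n_obs standard normals has std 1/sqrt(n_obs),
# and erfc(|z|/sqrt(2)) is the two-sided p-value under the null.)
z = data.mean(axis=1) * sqrt(n_obs)
p_values = np.array([erfc(abs(zi) / sqrt(2)) for zi in z])

hits = int(np.sum(p_values < 0.01))
print(f"'Significant' at p < .01: {hits} of {n_params} (expect ~10 by chance)")
```

Every one of those hits would look publishable in isolation, and none of them is real.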

Philosophers of science and even some psychologists have talked about this problem for decades, but nothing much has changed, because how do you publish your scientific articles if you just don’t find anything interesting, and finding no significant p values doesn’t count as interesting (although it might well be)? This is how you get publication bias.

I guess the same thing applies to big data: If you’re a marketing consultancy and your clients pay you to find patterns in some user data and all you find is a big chunk of uninteresting noise, well, you’re sure as hell going to squeeze blood from a stone to present something “groundbreaking” (and get paid).

(Also, pedantry: it’s p < .01 not >)


If by statistical mechanisms you mean which correlations are causal/predictive, then we are saying the same thing. No human population data is ever from a controlled experiment, which means that it is impossible to tell from the data alone which relationships are causal and what the result of future manipulations will be. That depends on adding the data to your model/conceptual framework of the system. Our ability to apply the data is based on our understanding of the system. If your conceptual framework of the system is off, simply adding more data doesn’t help anything.


Unfortunately, no one understands the systems of cities. We have a lot of fancy terms for them in the academic literature, but the biggest lesson to come out of any study of the urban studies/planning literature is humility; that, and the massive effects of racism.


No, I mean the properties of the statistical tools need to be understood. e.g. what happens to our inference if we perturb the inputs?

In case you haven’t come across it, I have it on good authority that The Book of Why is an important contribution to the science of cause and effect. I confess I still need to read it, but here’s a quick overview from someone competent.


Indeed. You are describing Bonferroni correction, but multiple hypothesis correction is an ongoing topic of investigation. Honestly, all this “people don’t know the pitfalls of statistical inference” stuff in the media is getting pretty tedious. Yes, “people” as in the general public, perhaps. But people who actually do statistics as part of their job know the issues and deal with them.
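For anyone who hasn't seen it written out, Bonferroni is the one-liner version of the ".01/N" idea from upthread — a sketch in Python with made-up p-values:

```python
# Bonferroni correction: to keep the family-wise error rate at alpha
# across n tests, compare each p-value against alpha / n instead of alpha.
def bonferroni_significant(p_values, alpha=0.01):
    n = len(p_values)
    return [p < alpha / n for p in p_values]

# Hypothetical p-values from 5 tests; the corrected threshold is .01/5 = .002,
# so only the first one survives.
ps = [0.0005, 0.003, 0.011, 0.2, 0.048]
print(bonferroni_significant(ps))
```

It's deliberately conservative; methods like Holm or false-discovery-rate control trade some of that strictness for power, which is why multiple-testing correction is still an active topic.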


The author’s book titles sound interesting, but after reading the article I don’t think they’ll go on my reading list. The examples he gives seem weak, or obvious, or just plain uninteresting, and he doesn’t explain them well.

For example he cites a stock picker’s methodology. There are literally thousands of these books, so it’s not hard to find one that did poorly going forward. About half will probably do worse than average, nothing amazing here.

Google Flu was meant to be an experiment. They made a thing, and said “let’s see if it can predict the flu”. It didn’t. Experiment wasn’t a success, so what?

Even the Feynman example I don’t get. It seems Feynman was really telling the students: if you want to solve a problem, do it the obvious way. To get the odds of a certain license plate existing in a parking lot, go out and LOOK in the parking lot for that plate. If it’s not there, the odds are zero. If it is there, the odds are 1. Instead the silly students started figuring out probabilities. What does computing a probability have to do with data mining?

Also a student put a fish in an MRI, disproving MRI machines I guess.

Maybe it’s his writing style but I left the article a cheerleader for big data mining. At least it comes up with interesting results, true or not!