Big Data's "theory-free" analysis is a statistical malpractice

Originally published at: https://boingboing.net/2019/01/11/start-with-a-hypothesis.html

2 Likes

Big data keeps throwing up examples of analysis without insight, running into the same problems that were identified (and solved) years ago. This is why Statistics is an entire field of study, people!

15 Likes

This takes on an extra wrinkle when you consider all of the smart city proposals. A lot of these systems are premised on the idea that collecting enough data will, by itself, be enough to reshape cities in a positive manner.

6 Likes

The hard part in research is always the testing. It requires questioning your assumptions and accepting that the world is always right, whether it agrees with your idea or not. Data mining for ideas works if those ideas are rigorously tested afterwards. Most consumers of statistics (including many professionals) just see a “significant” finding and assume causality without question.

3 Likes

I’d suggest a friendly reminder for any city considering becoming “smarter”: if the people who design a system don’t understand it, more data won’t help.

4 Likes

Does “endemic” mean “epidemic, but cooler?”

Data analysis like the “Google Flu” example is perfectly valid for generating hypotheses.

You can’t be “theory-free”, but if a hypothesis built from one year’s top 50 search terms predicts the flu the next year, then you’ve found something that might be useful.
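Something like this toy version of that workflow (numpy only, with entirely synthetic “search term” data, nothing Google actually did): mine one year for the best-correlated term, then treat that as a hypothesis and test it on the next year, which the mining step never saw.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in: weekly flu rates and counts for 50 candidate search terms.
weeks = 104  # two years
flu = np.sin(np.linspace(0, 4 * np.pi, weeks)) + rng.normal(0, 0.3, weeks)
terms = rng.normal(size=(weeks, 50))
terms[:, 0] += flu  # only term 0 genuinely tracks the flu

year1, year2 = slice(0, 52), slice(52, 104)

# "Mine" year 1: pick the term most correlated with flu rates...
corrs = [np.corrcoef(terms[year1, j], flu[year1])[0, 1] for j in range(50)]
best = int(np.argmax(np.abs(corrs)))

# ...then test that single hypothesis on year 2, which it never saw.
holdout_corr = np.corrcoef(terms[year2, best], flu[year2])[0, 1]
print(f"picked term {best}, year-2 correlation: {holdout_corr:.2f}")
```

A genuinely predictive term survives the held-out year; a spurious one found by trawling 50 candidates usually doesn’t.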

You can’t have an empirical or material conclusion without the rigorous application of theory. Otherwise all you have is observations, and then you’re right back to thinking that lightning is caused by angry buffet gods, and thunder is the clanking of their forks.

6 Likes

Looking for patterns is a valid first step. Then you have to check that the pattern keeps holding, and form hypotheses about the reasons behind it.

8 Likes

Obligatory XKCD:

11 Likes

This doesn’t follow, but I do think it’s the case that the statistical mechanisms need to be understood (which they most often aren’t).

As a biologist, I chuckled.

As someone who knows medical usage of the term, I shrugged.

3 Likes

The old idea of p>.01 being the gold standard falls down when you are checking thousands of parameters. Perhaps p>.01/Nparameters needs to be the standard. But you would need a lot of data to get that, and the more data the more parameters you probably have to check.
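For what it’s worth, here’s a quick toy simulation (numpy + scipy, completely made-up data) of why an uncorrected .01 cutoff falls apart across a thousand parameters, and what a .01/N-style cutoff does about it:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_params, n_obs, alpha = 1000, 100, 0.01

# Pure noise: the outcome has nothing to do with any of the "parameters" we test.
outcome = rng.normal(size=n_obs)
params = rng.normal(size=(n_params, n_obs))

# One correlation test per parameter.
p_values = np.array([stats.pearsonr(params[j], outcome)[1] for j in range(n_params)])

print("'significant' at p < .01:        ", (p_values < alpha).sum())             # ~10, all false
print("surviving .01/N_params threshold:", (p_values < alpha / n_params).sum())  # almost always 0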

Eh, just dump all the data into an AI black box and it’ll tell us whether it means anything or not.

Everyone who has paid attention to more than one lecture on inferential statistics knows that you have to correct for the number of tests you’re doing (.01/N parameters is the simplest correction, but there are many more sophisticated methods). But that doesn’t solve the problem: even if you do that, there are still going to be significant outcomes and interesting-looking correlations, just by chance, if you look at enough parameters! P values only work if you had a hypothesis beforehand; it’s in their very nature. This doesn’t get taught very eagerly in courses nowadays, although that’s changing in the wake of the replication crisis / credibility revolution in psychology and other fields.

Philosophers of science and even some psychologists have talked about this problem for decades, but nothing much has changed, because how do you publish scientific articles if you just don’t find anything interesting? And finding no significant p values doesn’t count as interesting (although it might well be). This is how you get publication bias.

I guess the same thing applies to big data: If you’re a marketing consultancy and your clients pay you to find patterns in some user data and all you find is a big chunk of uninteresting noise, well, you’re sure as hell going to squeeze blood from a stone to present something “groundbreaking” (and get paid).

(Also, pedantry: it’s p < .01 not >)

8 Likes

If by statistical mechanisms you mean which correlations are causal/predictive, then we are saying the same thing. No human population data ever comes from a controlled experiment, which means it is impossible to tell from the data alone which relationships are causal or what the result of future manipulations will be. That depends on adding the data to your model/conceptual framework of the system. Our ability to apply the data rests on our understanding of the system. If your conceptual framework of the system is off, simply adding more data doesn’t help anything.
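A minimal numpy sketch of that point (purely synthetic data): a hidden confounder makes two unrelated variables look strongly related, and more rows of the same data won’t fix it; only modelling the confounder does.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000

# Hidden confounder z drives both "treatment" x and outcome y; x has no effect of its own.
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = 2 * z + rng.normal(size=n)

# Naive regression of y on x finds a strong "effect"...
naive = np.polyfit(x, y, 1)[0]

# ...which disappears once the confounder is included in the model.
X = np.column_stack([x, z, np.ones(n)])
adjusted = np.linalg.lstsq(X, y, rcond=None)[0][0]

print(f"naive slope: {naive:.2f}, slope adjusting for z: {adjusted:.2f}")
```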

Unfortunately, no one understands the systems of cities. We have a lot of fancy terms for them in the academic literature, but the biggest lesson to come out of the urban studies/planning literature is humility: that, and the massive effects of racism.

No, I mean the properties of the statistical tools need to be understood. e.g. what happens to our inference if we perturb the inputs?
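For example, something along these lines (a toy numpy sketch, not any particular tool): fit a slope, jiggle the inputs a little, and see whether the inference survives. If the conclusion moves around a lot under tiny perturbations, it wasn’t much of a conclusion.

```python
import numpy as np

rng = np.random.default_rng(3)

# A small, noisy dataset and the slope an OLS fit reports for it.
x = rng.normal(size=30)
y = 0.5 * x + rng.normal(size=30)
base_slope = np.polyfit(x, y, 1)[0]

# Perturb the inputs slightly, many times, and watch how the estimate moves.
slopes = []
for _ in range(1000):
    x_p = x + rng.normal(0, 0.1, size=x.shape)  # small measurement noise
    slopes.append(np.polyfit(x_p, y, 1)[0])

print(f"slope on original data:   {base_slope:.2f}")
print(f"slope under perturbation: {np.mean(slopes):.2f} ± {np.std(slopes):.2f}")
```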

In case you haven’t come across it, I have it on good authority that The Book of Why is an important contribution to the science of cause and effect. I confess I still need to read it, but here’s a quick overview from someone competent.

1 Like

Indeed. You are describing the Bonferroni correction, but multiple-hypothesis correction is an ongoing topic of investigation. Honestly, all this “people don’t know the pitfalls of statistical inference” stuff in the media is getting pretty tedious. Yes, “people” as in the general public, perhaps. But people who actually do statistics as part of their job know the issues and deal with them.
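If anyone’s curious what “ongoing topic” looks like in practice, here’s a rough sketch (assuming statsmodels is installed; the p-values are simulated) comparing Bonferroni with the Benjamini–Hochberg false-discovery-rate procedure, which is much less conservative when some real effects exist:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)

# 950 tests where nothing is going on (uniform p-values) plus 50 genuine effects,
# simulated here as very small p-values.
pvals = np.concatenate([rng.uniform(size=950), rng.uniform(0, 1e-3, size=50)])

for method in ("bonferroni", "fdr_bh"):
    reject, *_ = multipletests(pvals, alpha=0.05, method=method)
    print(f"{method:10s} rejections: {reject.sum()}")
```

Bonferroni keeps only a handful of the genuine effects; Benjamini–Hochberg recovers nearly all of them while still controlling the false-discovery rate.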

The author’s book titles sound interesting, but after reading the article I don’t think they’ll go on my reading list. The examples he gives seem weak, or obvious, or just plain uninteresting, and he doesn’t explain them well.

For example, he cites a stock picker’s methodology. There are literally thousands of these books, so it’s not hard to find one that did poorly going forward. About half will probably do worse than average; nothing amazing here.
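You can see this with a back-of-the-envelope simulation (numpy, made-up return numbers): generate a pile of coin-flip “pickers”, crown the one with the best track record, and watch it do nothing special the next year.

```python
import numpy as np

rng = np.random.default_rng(11)

# 1000 "stock pickers" whose yearly returns are pure coin flips around the market.
pickers, years = 1000, 10
past = rng.normal(0, 0.05, size=(pickers, years))  # ten years of track record
future = rng.normal(0, 0.05, size=pickers)          # the year after the book comes out

best = past.mean(axis=1).argmax()
print(f"star picker's past average return: {past[best].mean():+.3f}")
print(f"star picker's next-year return:    {future[best]:+.3f}")
print(f"share of pickers beating the average historically: {(past.mean(axis=1) > 0).mean():.0%}")
```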

Google Flu was meant to be an experiment. They made a thing and said “let’s see if it can predict the flu”. It didn’t. The experiment wasn’t a success, so what?

Even the Feynman example I don’t get. It seems Feynman was really telling the students: if you want to solve a problem, do it the obvious way. To get the odds of a certain car plate existing in a parking lot, go out and LOOK in the parking lot for that plate. If it’s not there, the odds are zero. If it is there, the odds are 1. Instead the silly students started figuring out probabilities. What does computing a probability have to do with data mining?

Also a student put a fish in an MRI, disproving MRI machines I guess.

Maybe it’s his writing style, but I left the article a cheerleader for big data mining. At least it comes up with interesting results, true or not!

1 Like