Big Data's "theory-free" analysis is a statistical malpractice


I think it is hard to say that people who use statistics as part of their job understand this stuff, given all the replication issues we’ve seen in a lot of fields and the very existence of a lot of big data outfits. All of those people are using statistics, poorly, in a way that has very real effects.


Well, there are a lot that don’t.


Is this related to law enforcement’s parallel construction?


It’s called a ‘familywise error rate’ and, yeah, correcting for a number of tests is an ancient statistics trick. I’ve used, ooh, at least half-a-dozen various corrective measures to deal with it when doing post hoc tests. It’s basic statistics.
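For anyone who hasn't met these corrections, here is a minimal sketch of two common ones, Bonferroni and Holm, plus the familywise error rate for independent tests. The function names and the example p-values are made up for illustration, and the poster doesn't say which corrections they actually used.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject the null for test i only if p_i <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm step-down: compare the k-th smallest p-value to alpha / (m - k)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, every larger p-value fails too
    return reject


def familywise_error_rate(m, alpha=0.05):
    """Chance of at least one false positive in m independent true-null tests."""
    return 1 - (1 - alpha) ** m


# Hypothetical p-values from five post hoc tests.
p_values = [0.001, 0.012, 0.03, 0.04, 0.20]
print(bonferroni(p_values))  # [True, False, False, False, False]
print(holm(p_values))        # [True, True, False, False, False]
print(round(familywise_error_rate(20), 3))  # 0.642
```

Uncorrected, four of the five hypothetical tests would pass at 0.05. With 20 independent tests at that threshold, the chance of at least one false positive is already about 64%.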

However, the issue isn’t just the number of tests, it’s the sample size. A huge sample size covereth a multitude of sins, and if you get it stupidly huge enough you can produce results with almost arbitrarily small p-values even after the most stringent corrections.

And these results won’t be the result of chance necessarily, but they still might be spurious. A statistical method is, in many ways, like a picky filter amplifying one particular signal. However, like with any amplifier, if you crank it up too high the noise will get amplified alongside the signal. Thus, a lot of results are absolutely true and correct and perfect in every way except they don’t show real-world effects but, instead, detect with magnificent precision defects in the experimental protocol or statistical method used.
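A rough illustration of the sample-size point, as a sketch under stated assumptions: suppose a protocol defect shifts every measurement by a hypothetical 1% of a standard deviation, and the sample mean lands exactly on that shift. The function name and the numbers are invented for the example; the test is a plain two-sided z-test.

```python
import math


def two_sided_p(mean_shift, sd, n):
    """Two-sided z-test p-value, assuming the sample mean equals mean_shift."""
    z = mean_shift * math.sqrt(n) / sd
    return math.erfc(abs(z) / math.sqrt(2))


bias = 0.01  # hypothetical defect: 1% of a standard deviation
for n in (100, 10_000, 1_000_000):
    print(f"n={n:>9,}  p={two_sided_p(bias, 1.0, n):.3g}")

# Even a Bonferroni correction for a million tests does not rescue this:
strict_alpha = 0.05 / 1_000_000
print(two_sided_p(bias, 1.0, 1_000_000) < strict_alpha)
```

At n = 100 the shift is invisible. At n = 1,000,000 the same shift gives a z-score of 10, which clears even a threshold corrected for a million comparisons, although nothing about the real-world effect has changed.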

There are ways to handle this, incidentally, and they aren’t even that complicated. You use multiple approaches at the same time that share as little of the underlying structure as possible to make sure your statistical methods aren’t screwing with you. You always always cross-validate to correct for overfitting on the available data set[1], you run all your experiments triple-blind to avoid any sign of bias, and once you’ve crossed all the t’s and dotted all the i’s you then hand the whole mess to a completely different research group to replicate, ideally using a completely different data set.
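To make the cross-validation step concrete, here is a small sketch on pure noise, where there is nothing to learn by construction. The 1-nearest-neighbour "model", the data sizes and the helper names are all assumptions chosen for illustration, not anything the poster described. A model that memorises its training data looks perfect in-sample and falls back to roughly coin-flip accuracy on held-out folds.

```python
import random


def nearest_label(train, point):
    """1-nearest-neighbour: return the label of the closest training row."""
    closest = min(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], point)),
    )
    return closest[1]


def accuracy(train, test):
    """Fraction of test rows whose label the model predicts correctly."""
    hits = sum(nearest_label(train, x) == y for x, y in test)
    return hits / len(test)


def k_fold_accuracy(data, k=5):
    """Average held-out accuracy over k folds."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i, held_out in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        scores.append(accuracy(train, held_out))
    return sum(scores) / k


rng = random.Random(0)
# Pure noise: 200 rows, 10 random features each, coin-flip labels.
data = [([rng.gauss(0, 1) for _ in range(10)], rng.randint(0, 1))
        for _ in range(200)]

print("in-sample accuracy:", accuracy(data, data))
print("5-fold CV accuracy:", k_fold_accuracy(data))
```

The in-sample score is perfect because each row is its own nearest neighbour. Only the held-out score says anything about whether a real relationship exists.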

It’s a lot more work, however, and costs a lot more, and nobody is going to do it much because the present system encourages the opposite behavior. In industry, either the company’s worth is predicated on producing seemingly impressive results as quickly and cheaply as possible[2], in which case that’s what you have to do or be outcompeted by someone who will, or you’ve commissioned the research and the company doing the actual work is, again, incentivized to get results as cheaply as possible.

In academia the problem is the publishing rat-race. You need that publication, badly, and no journal is going to publish your impeccably researched null result. But they might just publish your very-marginal-if-you-don’t-look-at-this-one-test-we’ll-not-do research since it produces a result.

I do a hell of a lot of peer review and if I had a dollar for every bit of mathematical legerdemain I’ve had to call authors out on I could… well okay, a dollar isn’t that much, but I could afford a damned fine bottle of whisky.

The industry stuff you can’t fix: that’s your basic flaws-of-capitalism stuff. The academic stuff you could fix by requiring that all studies be pre-registered and by according full credit to replications, but what sort of institution could make such a sweeping change stick I can’t imagine.

[1] In fact, you run multiple-ply cross-validations for added security. Mind, that gets expensive when you have to run your analyses on supercomputers, but that’s just how the cookie crumbles.
[2] And as an occasional statistician, I assure you, the only thing stopping me from pulling rabbits out of my hat on command is professional integrity and honesty.


Very well put, @LapsedPacifist.

On top of all your points, the very idea of executing a null hypothesis significance test (NHST) and calculating p-values is becoming ever more questionable in my view as a statistician. Seeing hypothesis tests and p-values as the essence of scientific reasoning, as has been taught for decades, is dangerous and flawed.


Thank you kindly.

And yes, I agree with your assessment. There are still situations where NHST works remarkably well, certainly, but it can’t be the only tool in the toolbox, especially not applied as badly as this.


That video is interesting, but is really about an entirely different topic. Ali Rahimi is making the point that machine learning/data mining is currently “alchemy” because it is a collection of ideas that often work in practice even if we don’t always understand why and that’s a problem because we have no recourse when they fail. That’s different than arguing that people are using statistics incorrectly out of incompetence or dishonesty.


Statisticians, yes.

Research scientists who use inferential statistics…not so much. The drive to publish often overcomes statistical rigour, as does the peer reviewers’ frequent insistence upon traditional but outdated statistical methods.


I think most people using “big data” are doing exactly what Rahimi is talking about.

edit: My point really is that Rahimi’s video highlights one way in which big data is problematic through ignorance.


Are there any smart city proposals in the western hemisphere being pushed by someone other than data mining companies and the politicians they’ve “lobbied”?

Hypothesis: reshaping cities to be attractive to pedestrians will be the “insight” of this data mining, and the “ideal” city will look suspiciously like what city planners suggest today, plagiarized, not replicated.

Meh…I actually wonder if getting enough people on board to reshape cities for people, instead of cars, would be worth the loss of privacy? I mean … Even if big data were valuable to the people being mined, city liveability is something that is studied and ignored already.

“Give me all your data and I will consider following 30-year-old suggestions” – politician who is obviously not collecting money from anybody


Isn’t the enlightenment something like 300 years old by now?


From my perspective as a grad student, I see a lot of this getting propagated in academia through the direction and example of advisors/PIs/mentors. The case at Cornell described in the article is definitely one of the more egregious, but by no means an isolated incident. A lot of grad students know when they’ve been sent on a fishing expedition (particularly since gradual improvements in statistical training mean that in some branches of social science, your average research assistant will sometimes have a better handle on the analysis than their PI) but there is little that they can do about it without jeopardizing their career: There can be tremendous pressure not only to build out their CV by attaining publishable results, but to not disappoint or undermine senior faculty by questioning their approach. By the same token, those same senior researchers have the opportunity to encourage better practices.

This is illustrated for me in the contrasting experiences of two colleagues (let’s call them Student A and Student B), both of whom were working on their PhDs and conducting survey research as part of groups led by senior faculty in their respective institutions: Around the same time, Advisor A and Advisor B got access to relatively large and high-quality sets of survey responses that were relevant to their groups’ interests. The problem (or opportunity, depending on your ethics, I suppose) was that the questionnaires were very extensive, yielding ~1,000 data points per respondent – mining would almost certainly turn up something and thanks to the size and quality of the dataset, any relationships they found would almost certainly make for a published journal article.

Needless to say, Advisor A succumbed to temptation, turned over the entire dataset to Student A and his peers and essentially instructed them each to “find a significant relationship and find me a theory that accounts for it and we’ll write it up,” which is exactly what they did – not exactly a “theory-free” approach, but it may as well have been. Student A was fully aware of why this was bad statistical practice but didn’t want to give up the opportunity to work with the data or the line on his CV, or to make an enemy of his advisor.

Meanwhile, Advisor B took a different approach. Instead of giving his group access to the data right off the bat, he gave them access to a list of variables and told them each to come back with a single, specific hypothesis, a brief written review of the theoretical basis for it, and a list of the data points they would need in order to test it. He then gave each student access to only that data, forcing them to think carefully beforehand about what exactly they were looking for and why. After analysis, the students who found preliminary support for their hypothesis using the restricted data wrote up their results, and students who didn’t were re-tasked with cross-checking and trying to pick holes in the others’ work. In some cases, this resulted in the initial conclusion falling apart and the whole group being re-tasked again, but in the end, the manuscripts that emerged were robust and well-polished – one was even published without revision at a well-regarded journal in their field, which is basically unheard of.

In the end, Advisor B’s group published fewer articles than Advisor A’s, but I would hazard that B’s publications were of greater merit. Meanwhile, his students actually gained something from the experience beyond a CV line, and I know from conversations with both that Student B was much happier with her output than Student A, since she felt she had actually contributed something of value to her field rather than just “playing the game.” Meanwhile, Student A has generally become more cynical regarding his work, and justifiably so. The distinction had really nothing to do with the students’ respective knowledge of statistics but with the leadership and support of their mentors (or lack thereof) in taking a more considered approach to their research.

