Big Data's "theory-free" analysis is a statistical malpractice

From my perspective as a grad student, I see a lot of this getting propagated in academia through the direction and example of advisors/PIs/mentors. The case at Cornell described in the article is definitely one of the more egregious, but it is by no means an isolated incident. A lot of grad students know when they’ve been sent on a fishing expedition (particularly since gradual improvements in statistical training mean that in some branches of social science, the average research assistant sometimes has a better handle on the analysis than their PI), but there is little they can do about it without jeopardizing their careers. There can be tremendous pressure not only to build out a CV with publishable results, but also not to disappoint or undermine senior faculty by questioning their approach. By the same token, those same senior researchers have the opportunity to encourage better practices.

This is illustrated for me in the contrasting experiences of two colleagues (let’s call them Student A and Student B), both of whom were working on their PhDs and conducting survey research as part of groups led by senior faculty at their respective institutions. Around the same time, Advisor A and Advisor B got access to relatively large and high-quality sets of survey responses that were relevant to their groups’ interests. The problem (or opportunity, depending on your ethics, I suppose) was that the questionnaires were very extensive, yielding ~1,000 data points per respondent. With that many variables, mining would almost certainly turn up something, and thanks to the size and quality of the dataset, any relationship they found would almost certainly make for a published journal article.
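To put a rough number on why mining that many variables "would almost certainly turn up something," here is a minimal sketch. It assumes 1,000 independent tests of relationships that are all truly null, at a conventional 0.05 significance threshold. Both the independence and the threshold are my assumptions for illustration, not details of the actual surveys, and the function names are made up for this example.

```python
import random


def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent null tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


def simulate_false_positives(num_tests: int, alpha: float = 0.05, seed: int = 0) -> int:
    """Count 'significant' results when every tested relationship is truly null.

    Under a true null hypothesis, a p-value is uniformly distributed on [0, 1],
    so each test is modeled here as a single uniform random draw.
    """
    rng = random.Random(seed)
    return sum(rng.random() < alpha for _ in range(num_tests))


if __name__ == "__main__":
    num_tests = 1000  # assumed: one test per survey variable
    print("P(at least one false positive):", familywise_error_rate(num_tests))
    print("False positives in one simulated run:", simulate_false_positives(num_tests))
```

Under these assumptions, the expected number of spurious "findings" is simply num_tests * alpha, or about 50 out of 1,000 tests, even if nothing real is going on in the data. Testing one pre-specified hypothesis keeps that number at one test and a 5% false-positive rate.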

Needless to say, Advisor A succumbed to temptation, turned over the entire dataset to Student A and his peers and essentially instructed them each to “find a significant relationship and find me a theory that accounts for it and we’ll write it up,” which is exactly what they did – not exactly a “theory-free” approach, but it may as well have been. Student A was fully aware of why this was bad statistical practice but didn’t want to give up the opportunity to work with the data or the line on his CV, or to make an enemy of his advisor.

Meanwhile, Advisor B took a different approach. Rather than giving his group access to the data right off the bat, he gave them only a list of variables and told each of them to come back with a single, specific hypothesis, a brief written review of the theoretical basis for it, and a list of the data points they would need in order to test it. He then gave each student access to only that data, forcing them to think carefully beforehand about what exactly they were looking for and why. After analysis, the students who found preliminary support for their hypotheses using the restricted data wrote up their results, and the students who didn’t were re-tasked with cross-checking that work and trying to pick holes in it. In some cases, this resulted in the initial conclusion falling apart and the whole group being re-tasked again, but the manuscripts that eventually emerged were robust and well-polished – one was even published without revision in a well-regarded journal in their field, which is basically unheard of.

In the end, Advisor B’s group published fewer articles than Advisor A’s, but I would hazard that B’s publications were of greater merit. His students also gained something from the experience beyond a CV line, and I know from conversations with both that Student B was much happier with her output than Student A was with his, since she felt she had actually contributed something of value to her field rather than just “playing the game.” Student A, for his part, has generally become more cynical regarding his work, and justifiably so. The distinction really had nothing to do with the students’ respective knowledge of statistics; it came down to the leadership and support of their mentors (or lack thereof) in taking a more considered approach to their research.
