Statcheck: a data-fakery algorithm that flagged 50,000 articles



triggered 50,000 retractions

Wait, what? That’s a huge claim, but it doesn’t seem to be based on any of the articles, unless I’m missing something.

What does it mean, exactly? 50,000 papers were retracted?

The article just seems to say that they’re going to look at 50,000 papers, and many of them had no flags.


Yeah, no. No 50,000 retractions.

I love the idea behind statcheck and I will be using it on my own papers. But there are two problems, one with the reporting and one with the way the software is used.

The reporting: What statcheck does is extract all p-values and the corresponding test statistics (F, Z, chi-squared, what have you) from a text. It then checks whether they fit: an F-value of 2.5 with 1 numerator and 31 denominator degrees of freedom corresponds to a p of about 0.12, which you can simply look up in a table. This is a good thing; it finds typos and errors made when copying values from the output of a statistical package. Such mistakes do get made, and researchers as well as journal reviewers and editors should be on the lookout for them. Statcheck makes this a lot easier.
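The core of this check can be sketched in a few lines. To be clear, this is not statcheck's actual code (statcheck is an R package); it is a minimal stdlib-only Python illustration of the idea: parse a reported result string, recompute the p-value from the test statistic (here for F(1, df), which is the square of t(df), so its p equals the two-tailed t probability), and flag a mismatch. The regex, the helper names, and the 0.01 tolerance are my own assumptions.

```python
import math
import re

def t_sf(t, df, steps=20000, upper=60.0):
    """One-sided survival P(T > t) for Student's t with `df` degrees of
    freedom, via trapezoidal integration of the t density (stdlib only)."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda x: c * (1 + x * x / df) ** (-(df + 1) / 2)
    h = (upper - t) / steps
    area = 0.5 * (pdf(t) + pdf(upper))
    for i in range(1, steps):
        area += pdf(t + i * h)
    return area * h

def check_f_result(reported):
    """Parse a reported 'F(1, df) = value, p = value' string and flag it
    when the recomputed p differs from the reported p by more than .01."""
    m = re.match(r"F\(1,\s*(\d+)\)\s*=\s*([\d.]+),\s*p\s*=\s*([\d.]+)", reported)
    df, f_val, p_reported = int(m.group(1)), float(m.group(2)), float(m.group(3))
    # F(1, df) is the square of t(df), so its p is the two-tailed t probability.
    p_computed = 2 * t_sf(math.sqrt(f_val), df)
    return p_computed, abs(p_computed - p_reported) > 0.01

p, flagged = check_f_result("F(1, 31) = 2.5, p = .07")
print(round(p, 3), flagged)  # recomputed p is ~0.124, so a reported .07 gets flagged
```

A miscopied p-value is caught immediately, which is exactly the typo-hunting strength of this approach; as argued below, it is also pretty much all it can catch.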

However, I am not convinced that the tool is an effective measure against p-hacking. “Real” p-hacking is much more subtle. Suppose my p-value is 0.07 and thus misses “statistical significance” (current publishing practices make it much easier to publish a paper when p is at least below .05). If I am unscrupulous, I will not just change the value in the paper; I will change my analysis to make it look plausible that p is below .05. I can add covariates, report only the outcome measures that do reach significance and leave out the rest, remove the study participants that don’t fit the picture, and so on. And statcheck will never know.

On the other hand, there are also false positives: statcheck sometimes misinterprets values (for example, it can assume a test is two-sided when it was really one-sided, in which case the correct p-value is half of what it recomputes), and it reports errors even when they don’t change the significance of the results.

And therein lies the problem: statcheck doesn’t really find p-hacking, only (mostly) typos. But it is being reported as finding p-hacking. This leads us straight into problem no. 2: the initiative reported in the article is automatically checking tens of thousands of papers and publishing its results on PubPeer (an open post-publication peer-review platform for scientific papers) without any human intervention. Now, this wouldn’t be much of a problem if statcheck results weren’t associated with questionable research practices by reports like Cory’s (and many others!).

Of course, diligence should be expected from us researchers, and statistical errors are unfortunate; however, they are seldom deliberate and usually have nothing to do with fraud or bad science. But now thousands of researchers fall into undeserved disrepute just because there’s a typo in their paper or (just as likely) because statcheck’s algorithm misread the results section. Suddenly one of my papers counts among “50,000 retractions” when really it is just one of 50,000 papers that statcheck has scanned.

Sorry if this was too technical, but it is a complicated topic and it’s just a little frustrating. Yes, much of science seems to be in a sorry state (psychology just got called out first), and we’re trying hard to drag ourselves out of the mud. But when stories like these come along and make even honest, hard-working researchers look bad because of a knee-jerk initiative, I can’t support that. And I am someone you would probably call an open-science, good-stats advocate…


I wonder if this technique could be applied to discovering data truthfulness?


I think you’re right. We should have tools available that flag suspicious analyses - those tools used to be called “conscientious and statistically-trained referees” - but they should be wielded by humans during the peer-review process.

Unfortunately, the volume arising from the competition to publish has made it much harder for journals to ensure quality refereeing than it was decades ago.


Not really, because it doesn’t look at the data, only at the statistical values in published manuscripts. And it only makes quite simple calculations, albeit automatically. For results from a line of work, there’s p-curve analysis, which can give insight into whether the results of a number of studies might have been manipulated.


Thanks Spook


This topic was automatically closed after 5 days. New replies are no longer allowed.