Psychology's reproducibility crisis: why statisticians are publicly calling out social scientists


Originally published at:


I agree that his rebuttal was an interesting and important read, and if Fiske’s article was the firestorm that he needed to attach to in order to get his point across, I suppose the ends may justify the means. However, the original speech is really about nastiness in discourse among a loud, rude minority of commenters. She talks specifically about critics who attack researchers and their motives, rather than the quality of the work itself, and who turn off young academics from even entering the field. Perhaps she is hiding an underlying desire to be protected from any criticism behind a veneer of complaints about abuse, but that’s suspiciously close to the argument made by GGers.

Sure, the Sciences need a good, deep, ongoing critiquing. I was a Soc major, and a lot of bullshit is thrown around, though I think it’s pretty unclear that this is limited to social sciences. To me, the addition of “Science” to Social Studies represents, not so much an achievement, but rather an aspiration to add rigor to a field that is constantly battling against the widespread, casual arm-chair social science of “everybody knows people X act like Y” that goes on on the street every day.


The problem is not specific to the social sciences - it applies to any empirical work that uses controlled experiments with statistical analysis of a null hypothesis. Many scientists in these fields have succumbed to p-hacking, and this needs to change. There’s also a plethora of low quality journals out there that will publish almost anything. But that doesn’t mean science is broken, it just means that you have to be careful to check the credibility of published work. It was always this way - communities of scholars spend many years being trained how to test the credibility of work in their field, to pick out the solid results from the crap. It’s just that now non-experts have instant access to the crap that was previously locked away in university libraries. So we need that domain expertise even more. Unfortunately, many people seem to believe the era of expertise is over, just when we need it more than ever.

Calling it a “reproducibility crisis” doesn’t help either. Most people banging on about reproducibility advocate that more scientists should spend time re-running each other’s experiments, and scientists should make it easier for others to do so, e.g. by releasing all their experimental materials, software, etc. There’s no harm in releasing it all, but it’s a complete waste of time for scientists to replicate crappy experiments, especially as repeating it won’t tell you it’s crappy. What we really need is for scientists in experimental fields to be more explicit about the theoretical frameworks for their experiments, and we need other scientists critiquing those theoretical frameworks by running different experiments, attacking the theory from a different angle. That’s always preferable to re-running a flawed experiment.


There is an argument that the “hard” sciences (Physics, chemistry) are actually the soft sciences because a great deal has been found out with (relatively) little effort. The really hard sciences are psychology and sociology, where it sometimes seems that there has been almost no progress in the last fifty years.
I think there are two reasons for this. First, the process of decoupling religious thinking is much harder for the social sciences than for physics and chemistry. Once Aristotle and his Christian misinterpreters had been debunked, physics and chemistry were relatively free. I would suggest that the pivotal moment was when the professor at Padua refused to look into Galileo’s telescope and it became obvious to every inquiring mind in Europe that something had changed. But social sciences are still subject to and informed by religious thinking*. It has been very hard for psychologists in some countries to accept that human minds are part of a spectrum, not a discontinuous step from the other primates and mammals; and of course the results of social science investigations into things like sexual behaviour are relentlessly attacked by people with religious axes to grind.
Second, to a large extent we are in the position of being part of what we are trying to understand. The experimenter is always part of the experiment in a way that is not true of, say, analysing a chemical reaction and its products. Interpretation is always there.

But the reproducibility crisis has been simmering for a very long time. It was an active subject when I was briefly involved in experimental psychology in the 1970s, and my then supervisor was very unpopular with a number of his colleagues because of his statistical work in debunking flawed research - especially with the drug companies, as he was being paid to review the test results for certain psychoactive drugs and was finding experimental flaws that made the results invalid. The desire not to admit that the emperor is unclothed springs only partly from academic protectionism. There are large vested interests in drug companies, prisons, drug policy and education that really don’t want accepted dogma challenged - and I don’t think it is tinfoil hattery to say so.

*edit - I include non-theistic religions like Marxism-Leninism in this. Marxism (not the economics) is after all based partly in messianic eschatological Judaism and can be considered the most recent Abrahamic religion to gain traction.
**edit edit - I don’t mean the artificial “Messianic Judaism” construct, which AFAIK wasn’t around in Marx’s day, but the actual eschatological wing of Jewish thought that awaits the coming of the Messiah and, for instance, doesn’t recognise the State of Israel because it was founded by people and not by the Messiah. @bibliophile20 has drawn my attention to what seems to be a weird US sect of that name which I’d never heard of before.

☭ Sup Marxists? ☭

The best part about fights like this is the make up sex afterwards. Later, spooning.



I’m a research psychologist and very much in favour of the open science initiatives and I think it’s important to finally fix our field. (It is also very true that this is by far not psychology’s problem alone but our field was the one that got called out first. Medical research, ostensibly a much “harder” science, also has serious replication problems. I guess you can say, the harder publication pressure, the more serious the problem, as this really encourages unethical research practices and “flexibility” in data analysis.)

Anyway: Unfortunately, the tone among parts of the open science/better statistics crowd is sometimes less than civil, which regrettably undermines their (important!) point. The biyearly congress of the German Psychology Association just ended yesterday. Susan Fiske gave a, in my opinion, very good keynote about her current research. However, one of the first comments after the speech went straight to the letter mentioned in the article. She said that she would only respond briefly because it wasn’t the subject of her keynote, and now she’s getting a whole bunch of Twitter flak for that, noticeably all of it talking about her but not even to her. I wonder whether one of those on the open science front would have reacted differently, if they had just given a talk on their research (and not on open science) and then gotten a question like that… These people are mad about her tone but they fail at setting a good example themselves.

I think this shows that the current discussion about research practices indeed has a civility problem which doesn’t help our science (or any other) at all which was more or less her (poorly made) point.


The problem is, the experiment doesn’t need to be crappy for it not to replicate. In fact, it can be planned and performed flawlessly but you can just be wrong about the hypothesis you’re testing. The problem is measurement error (and the hard sciences have this problem just as much as the “softer” ones). You can do the experiment once and find an effect and then you’ll run it again (the very same thing) and not find one. This has nothing to do with the quality of the experiment at all.

Of course, running a slightly different experiment, what you’d call a conceptual replication, is important as well, as it’ll test under which conditions an effect still holds and under which it doesn’t.


The problem here being crappy statistical tools. I thoroughly recommend the late, great David MacKay as a good reference on this topic (the example on page 458 is very informative).


Isn’t that exactly the crux of the reproducibility problem? Many of these experiments are literally textbook examples of “good” experiments. The researchers were doing all the correct technical things, but yet the experiments failed to reproduce.

Does that make a bad experiment? No. Does that make bad science? Perhaps. At the very least, it implies that the questions asked and experiments performed were not important or interesting ones since they are not generally applicable.

And that’s what I think the problem is. It’s that what was once considered a general and universal finding turns out to be specific to the experiment being run. To me, it seems much easier to see this in retrospect once reproducibility is tried.


You had me up until religion and “non-theistic religions” i.e. Religious thinking, being a cause for lack of progress in psychology and sociology.

Just thought you should know.


A couple of things.

  1. This is not “calling out”. As the article says, this is an academic argument that utilises evidence, to push forward its conclusions, taking place in public. Not a tiresome social media bun fight.

  2. Reproducibility is an important problem, and it’s good that people are working on it. It’s actually a sign of a healthy and vital scientific community - people are dissatisfied with the status quo and are looking to improve the quality of the science that gets done. The correct conclusion is “we need more tests and better rigour” rather than “Oh look, some conclusions are wrong, so let’s throw out everything”.

  3. It probably seems really unfair to single out Psychology for this, because it affects all fields. Well, apart from the fields that don’t hold with silly ideas like evidence, or falsifiability. Their obscurantist praxeology of despair is not affected by this.

  4. We really need a “Journal of negative results” to start to deal with the issue of publication bias.


That’s what I meant by measurement error. The researchers didn’t really have to do anything wrong (although in some cases, they might have, see the p-hacking/questionable practices discussion). But whether or not a hypothesis is really true or false, any single experiment might come out either way. So you have to replicate it, not once or twice, but many times, to see how the results pan out. That’s what statistical inference is for: not for giving you a “true” result every time, but for giving you an estimate of how likely your hypothesis is to be right and how sure you can be about it (and Bayesian statistics might be better at this kind of thing than the classical, p-value, frequentist sort).
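
To make the "any single experiment might come out either way" point concrete, here is a rough simulation sketch. Everything in it is an assumption picked for illustration, not taken from any real study: a true effect of 0.3 standard deviations, 50 people per group, a known standard deviation of 1, and a plain two-sided z-test. The function names are made up too.

```python
import random
from statistics import NormalDist, mean


def one_experiment(true_effect, n, rng):
    """Run one two-group experiment and return a two-sided z-test p-value.

    Assumes both groups have a known standard deviation of 1 (a simplification).
    """
    control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    treated = [rng.gauss(true_effect, 1.0) for _ in range(n)]
    diff = mean(treated) - mean(control)
    se = (2.0 / n) ** 0.5  # standard error of the difference when sd = 1
    z = diff / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def share_significant(true_effect, n, runs, alpha=0.05, seed=0):
    """Fraction of identical replications that reach p < alpha."""
    rng = random.Random(seed)
    hits = sum(one_experiment(true_effect, n, rng) < alpha for _ in range(runs))
    return hits / runs


if __name__ == "__main__":
    # Hypothetical numbers: a real but modest effect, replicated 2000 times.
    print("real effect:", share_significant(0.3, 50, 2000))
    # Same design with no effect at all, for comparison.
    print("no effect:  ", share_significant(0.0, 50, 2000))
```

Under these assumptions the effect is real, yet only around a third of the flawlessly run replications come out "significant" (the analytic power here is about 0.32), while the no-effect version still "finds" something in roughly 5% of runs. So one success followed by one failure tells you very little on its own.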


OK. I see your point now and I mostly agree.

Part of the problem, I think, is that researchers were (and are) asking the wrong kinds of questions. These questions and experiments may have statistical validity, but lack a connection to how people actually behave in the wild.

I can only really give an example from Software Engineering (the area that I did my PhD in). Researchers would perform some very well-designed experiments on how college students program, other researchers would perform similar experiments on other students that would largely reproduce the original experiment. And a new general rule is discovered.

The problem is that giving 20 college freshmen a toy problem to work on bears almost no relation to how things get done outside of the university.

It’s not exactly the same problem that is being talked about here, but strongly related. The problem is not with the experiment, or the reproducibility, but that the sample population is so restricted that the results don’t generally apply.


If it was only the tools, we’d be in a much easier position. The Bayesian statistics that you linked to have already gained a lot of traction in the community (I’m actually finally going to a workshop in two weeks). But improving our statistical methods is only one necessary step. Better statistics will not keep people from making mistakes or coming to wrong conclusions (willingly or inadvertently). More transparent science will make mistakes more easily identifiable by others, that’s why open science (releasing all material, data, and analysis protocols) is a good idea.

But what I see as the much bigger problem is the incentive system of research: The only thing that really, really matters for getting jobs, reputation and grant money, is how much you publish. That puts a lot of pressure, especially on junior researchers. And since positive (i.e. statistically significant, “We found an effect!” ones) and catchy results are, at the moment, better publishable than boring ones (“We did a similar experiment to those other guys but we didn’t find the effect/found the same thing, well duh.”) this incentivises us to KEEP RE-DOING ANALYSES UNTIL P < .05!! Just so the journals will take it. And that is really independent of the statistical method used. If you don’t have to disclose what exactly you did, you can find something in your data with any method you use and get away with it, even if you didn’t consciously intend to.
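
One flavour of "keep going until p < .05" is peeking at the data as it comes in and stopping as soon as the test comes out significant. A minimal sketch of what that does to the false positive rate, assuming there is no effect at all, a one-sample z-test with known standard deviation 1, and made-up batch sizes (all names and numbers here are illustrative, not anyone's actual procedure):

```python
import random
from statistics import NormalDist


def p_value(total, n):
    """Two-sided z-test p-value for a sample mean of zero, sd assumed to be 1."""
    z = total / n ** 0.5  # equals mean / (1 / sqrt(n))
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def false_positive_rate(looks, batch, runs, alpha=0.05, seed=0):
    """Share of null experiments declared significant when testing after every batch."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        total, n = 0.0, 0
        for _ in range(looks):
            # Collect another batch of pure noise (the true effect is zero).
            total += sum(rng.gauss(0.0, 1.0) for _ in range(batch))
            n += batch
            if p_value(total, n) < alpha:
                hits += 1  # stop early and "publish"
                break
    return hits / runs


if __name__ == "__main__":
    # One look at 100 participants versus ten looks, every 10 participants.
    print("single analysis:", false_positive_rate(1, 100, 2000))
    print("peek 10 times:  ", false_positive_rate(10, 10, 2000))
```

With a single planned analysis the false positive rate sits near the nominal 5%; with repeated peeking it ends up well above that, even though every individual test is done "correctly". Nothing in the final paper has to reveal which of the two you did.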

Obligatory xkcd.

Well, ideally, non-significant results would be just as publishable as significant ones. But there are, luckily, already journals like that, this one for example. I’m not aware of any psychological journals of that type that are indexed in any of the big databases (EBSCO etc.).


I agree but I think it’s a whole different problem. It’s about the dichotomy between basic and applied research as well as lab and field studies. But, you know, getting more diverse samples is really hard. But yeah, I don’t think I’m exaggerating when I say that about 70% of psychology studies use psychology students as their sample. And my research is only a small exception because I have easy “access” to pre-service teachers, who are at least a little more diverse (and I do educational psychology research so maybe at least some ecological validity there).


Good for you for not giving up on research (like I did). I was just too disillusioned by the banality of the questions being asked and the inability to gain even basic answers. Of course, I was in a very different field from you.


Yeah, the subjects of psychological studies tend to be WEIRD.


Oh, I just ignore that :wink: .
I think, at the end of the day, whether I will give up or not will not be my decision. I will either be the one out of hundreds of aspirants who gets the position with tenure or get kicked out of the system like most of the rest.


Princeton University psych prof Susan Fiske published an open letter denouncing the practice of using social media to call out statistical errors in psychology research, describing the people who do this as “terrorists” and arguing that this was toxic because of the structure of social science scholarship, having an outsized effect on careers.

And Princeton used to be such a good school…


Christ, what an asshole.

It’s not just social sciences, either. The medical sciences are almost as bad. As someone who is not a social scientist or a medical researcher, but does understand statistics and design of experiments, I can’t tell you how many times I’ve followed some claim to the source, looked at their experimental design, and been appalled that something like that got by peer review in that field. It’s not just things like fishing for p-values (looking at all possible correlations, and revising your hypothesis after the fact to favor the one that had the best p-value, without also revising that p-value to reflect the number of correlations you fish through) that are hard to detect just by reading the article, but the huge number of times, in both the social sciences and the medical sciences, when they fail to control for things that, even at first glance, seem much more likely to be the cause of the effect that they’re describing than what they’re proposing.
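
For anyone wondering how much fishing through correlations matters, the arithmetic is short. This is only a sketch under a simplifying assumption that the tests are independent and every null hypothesis is true; the Bonferroni correction shown is one common (and conservative) way of revising the threshold, not the only one, and the function names are mine.

```python
def familywise_error(alpha, m):
    """Chance of at least one false positive among m independent null tests."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_threshold(alpha, m):
    """Per-test threshold that keeps the familywise error at or below alpha."""
    return alpha / m


if __name__ == "__main__":
    alpha, m = 0.05, 20  # hypothetical: 20 correlations, each tested at .05
    print("uncorrected:", round(familywise_error(alpha, m), 4))
    corrected = bonferroni_threshold(alpha, m)
    print("per-test threshold:", corrected)
    print("corrected:  ", round(familywise_error(corrected, m), 4))
```

Under those assumptions, looking at 20 correlations at p < .05 each gives you about a 64% chance of at least one spurious "finding". Dividing the threshold by the number of tests brings that back under 5%, which is the revision that gets skipped when the hypothesis is rewritten after the fact.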