The real problem is that p-values are the fig leaf. People across many disciplines use statistics improperly. Even highly educated people lazily bandy things like p-values about, and the speed-reading masses skip to the part where a magical number means “SCIENCE HAS PROCLAIMED IT TO BE TRUE!” All a p-value of 0.05 really means is that, if chance and chance alone were at work, results at least as extreme as the ones observed would still turn up 5% of the time. That’s still a pretty high probability if you think about it. Anyhow, the point I want to make is that the choice of statistical test, and of course the raw data itself, are far more important.
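To make the 5% figure concrete, here is a rough Python sketch that runs many experiments in which nothing at all is going on and counts how often p < 0.05 shows up anyway. It assumes a two-sided z-test with a known standard deviation of 1, and the function names are made up for illustration.

```python
import random
from statistics import NormalDist, mean


def null_p_value(n, rng):
    """Two-sided z-test p-value for n draws from N(0, 1), i.e. no real effect."""
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # With a known standard deviation of 1, the standard error of the mean is 1/sqrt(n).
    z = mean(sample) * n ** 0.5
    return 2.0 * NormalDist().cdf(-abs(z))


def false_positive_rate(experiments=4000, n=30, alpha=0.05, seed=1):
    """Fraction of no-effect experiments that still come out 'significant'."""
    rng = random.Random(seed)
    hits = sum(null_p_value(n, rng) < alpha for _ in range(experiments))
    return hits / experiments


print(false_positive_rate())
```

The printed fraction should land close to 0.05: roughly one no-effect experiment in twenty clears the conventional bar on luck alone.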
Can you blame them? There’s no money to be made proving the null hypothesis.
I use “hitch in my/his/her giddy up” often. Scientifically speaking, that is.
Didn’t phdcomics do something like this? All I could find was
And of course there’s https://xkcd.com/882/ .
P values are not magic.
Let’s suppose that someone does a study showing that a harmless drug extends life tenfold, whitens your teeth permanently, and tans you seasonally and appropriately. And that this study has a p-value of 0.2.
Significant? Not by our conventional significance of p<0.05. Which is just that, a convention. But it still means that there’s an 80% chance this magic sauce is legit.
But – you interject – that doesn’t pass the eye test. It’s too good to be true. There must be methodological flaws.
Probably. A p-value is just a posterior probability. And if your pretest probability of this thing being true is very, very low, you may suspect that the study includes assumptions or errors that lead to pretty good certainty about something wrong. We encounter this all the time when we argue with people who have excellent formal logic with weird initial assumptions. That combination leads to crazy talk. Viz. Ayn Rand. So you go back and do an independent study, and another, and a few more, and if five in a row support it, the odds are now exponentially less that all the studies with different methodologies leading to the same results are due to chance.
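The “exponentially less” part is easy to put numbers on. A back-of-the-envelope sketch in Python, assuming the studies are truly independent and that each one would come out significant by luck with probability 0.05 if there were no real effect (the function name is made up for illustration):

```python
def chance_all_false_positives(alpha, studies):
    """Probability that every one of `studies` independent null studies is significant by luck."""
    return alpha ** studies


for k in range(1, 6):
    print(k, chance_all_false_positives(0.05, k))
```

Each extra independent replication multiplies the odds of a pure fluke by another 0.05. This says nothing about flaws the studies share, which is why the different methodologies matter.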
P<0.05 is not magic. As in all things, you use it as a way to calibrate your uncertainty, and to hedge your bets.
There is nothing wrong with publishing negative results, and, in fact, it is actually a good idea. When researchers engage in “file drawer bias” by not submitting negative results (or when journals engage in “publication bias” by declining to accept negative studies) the published record becomes skewed in favor of positive results, whether they are representative or not, as XKCD illustrated so well:
The issue is not the publishing of negative results; it is the misleading claim that results short of the threshold chosen for statistical significance are statistically significant. This relates to your other post about scientific publishing:
Replication, or reproducibility, is a key to the methodology of science. Not only are many initial studies not replicable, journals often refuse to publish negative results of attempts to replicate a study.
Evidence for precognition is big news, but leading journals won’t touch repeat studies that fail to replicate the results.
The bias against negative studies is, I think, one of the reasons that people try to claim that negative results are positive. We should encourage the publication of high quality negative studies. Doing so may be a way to reduce the perceived incentives to make misleading claims about the statistical significance of study results.
This is why registered studies and registered clinical trials are so important, and need to be further enforced.
The FDA ought to require registration of all trials for submitted drugs and devices. Otherwise the file drawer effect runs rampant.
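For anyone who wants to see the file drawer effect in numbers, here is a small Python sketch. It assumes the effect under study does not exist, that each study's p-value is therefore uniform on [0, 1], and that only results with p < 0.05 get published. The function names are hypothetical.

```python
import random


def chance_of_at_least_one_hit(alpha, studies):
    """Probability that at least one of `studies` independent null studies is significant."""
    return 1.0 - (1.0 - alpha) ** studies


def published_record(studies=1000, alpha=0.05, seed=2):
    """Simulate studies of a nonexistent effect; only 'significant' ones get published."""
    rng = random.Random(seed)
    # Assumption: under a true null, p-values are uniform on [0, 1].
    p_values = [rng.random() for _ in range(studies)]
    published = sum(p < alpha for p in p_values)
    return published, studies - published


print(chance_of_at_least_one_hit(0.05, 20))
print(published_record())
```

With 20 null studies there is already about a 64% chance that at least one is “significant”. If only those reach print, the published record is 100% positive for an effect that is not there, and the rest sit in the drawer.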
Pre-registered studies in regular journals are also a boon to the researchers themselves. The current widespread model is to “generate results”, and interesting ones at that, so that they get published. But in the few places that have started pre-registering, publication of the study is guaranteed, so researchers don’t feel forced to get the outcomes they want, and the journals themselves are less prone to ignoring all the important replication work.
Of course, this only works if the journals are willing to do pre-registration at all, and willing to do it for replication studies as well.
On the other hand this piece of magnificent gobbledygook was actually accepted by a real math journal. The references at the end are what should have given it away, but…
See also: blog post about the thing.
Life gets interesting when you edit translations of abstracts for articles on the beneficial effects of Buddhism in sociology, poli sci, etc., by Buddhist monks at a large Southeast Asian seminary university that shall remain nameless.
"Weasel words for p > 0.05? We don’t need no stinkin’ weasel words for p > 0.05. We don’t get p > 0.05.
“Repeat after us: qualitative methods…”
Phra Marasatsana has this science thing figured out.
There are, of course, several different standards for statistical significance. The p < .05 threshold is typically used in the social sciences; p < .001 is the standard in other sciences. You can set the boundary anywhere you want, except that it has to be accepted by the rest of the scientists in your discipline.
What did we know and when did we know it?
Knowledge is a tricky thing. Each individual has a different standard for deciding to accept something as knowledge or not. Some people accept hearsay. Some people accept witnessing. Some people accept “science” as it is reported in the news. Some people accept knowledge they can verify in their own labs. My point is that social and individual factors play a huge role in any person’s decision about whether something is known or not.
Is there an objectively correct standard for knowledge? I doubt it, but I’m willing to entertain arguments (or better yet, evidence) to the contrary.
For all that it is good to be skeptical, too much skepticism paralyzes people. So at some point, a different point for every individual, we must take a leap of faith in deciding how much evidence of what quality is necessary to decide whether we know enough to take action. Uncertainty is rampant. And still, we survive.
I would also like to add that “insignificant” is not an accurate way to describe a non-significant result, and “very significant” or “of strong significance” make equally little sense when a small p value is detected. I was taught that there is only one way to report a result greater than the threshold you chose for the experiment: non-significant. A significant p value only signals the beginning of a discussion; it is not the end of the discussion.
Calling that a real journal is a stretch. It’s a scam, not a journal.
p < .05 is standard for medical research as well.
Also wanted to support the publishing of negatives. “Trend towards significant” etc. are valid responses as long as they are clearly stated and the suggested action based on them is further research. A study that returns a p of .055 may have had an anomaly or a poorly controlled confounding variable. One that returns a p of .89 is almost certainly negative if the methodology is sound.
I’m surprised that the publishing of negative results hasn’t become more popular in the age of the internet. Shooting down other people’s arguments and evidence is the universal pastime of internet chatter. Sure, compared to buzz news, scientific forums would move at a pitch-drop pace, but if they learned to incorporate memes when letting each other know that their results were not reproducible, it might garner more public attention.
“ermagerd intenervning veriabers!”
*awkward penguin image* “results couldn’t be reproduced independently by 3 other labs in 6 additional trials”
*willy wonka patronizing face* “Oh really, tell me more about your p < 0.05 significance…”
“I’m in yur methodologies section, pointing out flaws”
That doesn’t make sense. p-values are a continuum, describing our confidence that a correlation is real, and that confidence approaches certainty as p approaches 0.
A p-value of 0.000001 is “very significant.” It means that you are almost absolutely certain that the correlation you found is real and not due to chance (regardless of methodological flaws in your study, which is an orthogonal question).
I’m going to start The Journal of Negative Results. I’m not entirely unserious about this.
We’ll have to disagree there. I was always taught, and continue to believe, that a p value is a binary type of value in the way we use it. You either reach significance or you don’t. Getting a smaller p value than you were hoping to get doesn’t mean your results are any truer or better or “significant” in a real-world context. That is the discussion reserved for effect size or another practical measure, such as how many people you have to treat in order to get a positive result. Perhaps it’s like being “very” or “more” pregnant. It just doesn’t make sense (BTW I hate that pregnant or not pregnant binary I just used to illustrate the point).
BTW, I hate p values and the frequentist statistical approach. I would much prefer social scientists made more use of regression analysis where variance can be measured across multiple variables. I think this is much more useful than rejection of the null hypothesis of various single hypotheses.
Is there any reason why significance should be a strictly binary proposition? I’d think that saying so is a somewhat arbitrary proposition, grounded in convention rather than in any inherently binary nature to the idea of significance. However, agreeing on a threshold for significance beforehand, then trying to change it post hoc is goal post moving and post hoc rationalization, that is, bad science.
I think that there’s a confusion of the binary fact of whether there IS or ISN’T a correlation with our certainty about whether there is one.
Whether or not there exists a correlation is binary, as you say. It may be a strong correlation or a weak correlation, but there either is or isn’t a correlation between them. And this is what we’re trying to determine with p-values, that’s true.
However, we cannot be certain of whether there is a correlation, or whether all the results we’ve seen are the result of luck. The p-value represents our confidence that there is indeed a real correlation. This confidence can never be 100% (we can never be absolutely certain that there is a correlation), but we can measure it.
This is what we mean by the results being “strongly” significant — that our confidence of whether it really is significant is higher. But you’re right that it’s a poor phrase: the significance itself is still binary, but our confidence about it has increased.
That said, “you either reach significance or you don’t,” while true in practice (articles tend to just say they’re “significant” when they hit that 0.05 level), isn’t really true. The 0.05 boundary is completely arbitrary. It really is meaningful if your p-value is 0.00001 vs 0.05. 1 out of 20 studies with a p-value of 0.05 are likely to get overturned when someone repeats the study, even though they claimed to be “significant,” while those with a p-value of 0.00001 are extremely unlikely to be overturned (using the same methodology, that is). That’s useful information to know.
Right, the effect size is completely separate from the p-value (except that to even say there’s any effect size at all means you’re assuming any correlation you see is real), and too often it’s ignored. (Witness the kerfuffle over eating bacon having a “significant” effect on cancer rates, when the actual effect size was tiny.)
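A quick Python sketch of how far apart effect size and p-value can drift. It is idealized: it assumes a two-group z-test with a known standard deviation of 1 and pretends the observed difference equals the stated effect size exactly. The function name and the numbers are made up for illustration, not taken from any real study.

```python
from statistics import NormalDist


def two_group_p_value(effect_size, n_per_group):
    """Two-sided p-value for a mean difference of `effect_size` standard deviations."""
    # The standard error of a difference between two group means is sqrt(2 / n).
    z = effect_size * (n_per_group / 2) ** 0.5
    return 2.0 * NormalDist().cdf(-abs(z))


# Tiny effect, enormous sample: highly "significant".
print(two_group_p_value(0.02, 100_000))
# Large effect, small sample: misses the 0.05 cutoff.
print(two_group_p_value(0.8, 10))
```

Under these assumptions the tiny effect gets a far smaller p-value than the large one, purely because of sample size. The p-value tells you how surprised to be, not how much the thing matters.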