In ScienceNews this month, there’s a controversial article exposing the fact that results claimed to be “statistically significant” in scientific articles aren’t always what they’re cracked up to be. The article, titled “Odds Are, It’s Wrong”, is interesting, but I take a bit of an issue with the sub-headline, “Science fails to face the shortcomings of Statistics”. As it happens, the examples in the article are mostly cases of scientists behaving badly and abusing statistical techniques and results:
- Authors abusing P-values to conflate statistical significance with practical significance. For example, a drug may uncritically be described as “significantly” reducing the risk of some outcome, but the actual scale of the statistically significant difference is so small that it has no real clinical implication.
- Not accounting for multiple-comparisons bias. By definition, a test at the 5% significance level (“significant at the 95% level”) has a 5% chance of coming out significant by random chance alone when there is no real effect. Do enough tests, and you’ll find some that indeed indicate significant differences, but there will be some fluke events in that batch. There are so many studies, experiments and tests being done today (oftentimes, all in the same paper) that the “false discovery rate” may be higher than we think, especially given that most nonsignificant results go unreported.
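To put a rough number on the multiple-comparisons point, here is a minimal Python sketch. It assumes the tests are independent and that no real effect exists in any of them, which is a simplification (tests within one paper are often correlated). The function name is just an illustrative choice.

```python
def prob_at_least_one_false_positive(num_tests, alpha=0.05):
    """Chance that at least one of `num_tests` independent tests comes out
    "significant" at level `alpha` when every null hypothesis is true."""
    # Each test stays (correctly) nonsignificant with probability 1 - alpha,
    # so all of them stay quiet with probability (1 - alpha) ** num_tests.
    return 1.0 - (1.0 - alpha) ** num_tests


for num_tests in (1, 5, 20, 100):
    chance = prob_at_least_one_false_positive(num_tests)
    print(f"{num_tests:>3} tests: {chance:.1%} chance of at least one fluke")
```

Under these assumptions, a paper that runs 20 such tests is more likely than not to report at least one fluke "discovery".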
Statisticians, in general, are aware of these problems and have offered solutions: there’s a vast field of literature on multiple comparisons tests, reporting bias, and alternatives (such as Bayesian methods) to P-value tests. But more often than not, these “arcane” issues (which are actually part of any statistical training) go ignored in scientific journals. You don’t need to be a cynic to understand the motives of the authors for doing so — hey, a publication is a publication, right? — but the cooperation of the peer reviewers and editorial boards is disturbing.
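As an illustration of what such solutions look like in practice, the sketch below applies two standard multiple-comparisons corrections, the Bonferroni correction and the Benjamini-Hochberg false discovery rate procedure, to a list of P-values. The P-values are made up for the example and do not come from the ScienceNews article or any real study. The function names are my own, and this is a bare-bones sketch, not a replacement for a vetted statistics library.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only P-values at or below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure, controlling the false discovery rate."""
    m = len(p_values)
    # Indices of the P-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose P-value is at most (k / m) * alpha.
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[index] = True
    return rejected


# Hypothetical P-values from ten tests reported in one imaginary paper.
example_p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
                    0.060, 0.074, 0.205, 0.212, 0.216]

uncorrected = [p <= 0.05 for p in example_p_values]
print("Uncorrected:       ", sum(uncorrected), "significant")
print("Bonferroni:        ", sum(bonferroni(example_p_values)), "significant")
print("Benjamini-Hochberg:", sum(benjamini_hochberg(example_p_values)), "significant")
```

With these made-up numbers, five results look significant before correction, two survive Benjamini-Hochberg, and only one survives Bonferroni, which is the more conservative of the two.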
ScienceNews: Odds Are, It’s Wrong