How did journals like Science and Nature become so strongly associated with weak, overly hyped research? Andrew Gelman speculates on the replicability crisis in the social sciences, where conclusions drawn from small-sample studies presented as definitive not only provide tabloid fodder but also have a high probability of being wrong. There is nothing wrong with writing a paper with an inconclusive observational study coupled with speculative conclusions, but perhaps there are more responsible outlets than high-profile “quickie” journals for encouraging criticism, debate, and tested replicability.
I posted on the Monkey Cage blog some further thoughts on those “Psychological Science” papers on menstrual cycles, biceps size, and political attitudes, tied to a horrible press release from the journal Psychological Science hyping the biceps and politics study. Then I was pointed to these suggestions that Richard Lucas and M. Brent Donnellan have on improving the replicability and reproducibility of research published in the Journal of Research in Personality:
It goes without saying that editors of scientific journals strive to publish research that is not only theoretically interesting but also methodologically rigorous. The goal is to select papers that advance the field. Accordingly, editors want to publish findings that can be reproduced and replicated by other scientists. Unfortunately, there has been a recent “crisis in confidence” among psychologists about the quality of psychological research (Pashler & Wagenmakers, 2012). High-profile cases of repeated failures to replicate widely accepted findings, documented examples of questionable research practices, and a few cases of outright fraud have led some to question whether there are systemic problems in the way that research is conducted and evaluated by the scholarly community. . . .
In an ideal world—one with limitless resources—the path forward would be clear. . . . In reality, time and money are limited . . . Once the reality of limited resources is acknowledged, then agreement about the precise steps that should be taken is harder to attain. Our view is that at least in the initial stages of methodological reform, we should target those changes that bring the most bang for the buck. . . .
These are good points with which I largely agree but maybe it’s not so simple. Even in a world of unlimited resources, I don’t think there’d be complete agreement on what to do about the replicability crisis. Consider all the cases where journals have flat-out refused to run correction letters, non-replications, and the like. A commenter recently pointed to an example from Richard Sproat. Stan Liebowitz has a story. And there are many others, along with various bystanders who reflexively defend fraudulent research and analogize retractions of flawed papers to “medieval torture.” Between defensiveness, publicity seeking, and happy talk, there’s a lot of individual and institutional opposition to reform.
Lucas and Donnellan continue:
First, a major problem in the field has been small sample sizes and a general lack of power and precision (Cohen, 1962). This not only leads to problems detecting effects that actually exist, it also results in lower precision in parameter estimates and systematically inflated effect size estimates. . . . Furthermore, running large numbers of weakly powered studies increases the chance of obtaining artifactual results.
This is all fine, but in addition, low-powered studies have high Type S error rates; that is, any statistically significant claims have a high probability of being in the wrong direction. Thus, the problem with low-powered studies is not just that they have trouble detecting effects that actually exist, but also that they apparently “detect” effects in the wrong direction. And, contrary to what might be implied by the last sentence above, it is not necessary to run large numbers of weakly powered studies to get artifactual results (i.e., Type S errors). Running just one study is enough, because with enough effort you can get statistical significance out of just about any dataset!
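To make the Type S point concrete, here is a minimal simulation sketch in Python (the true effect of 0.1 and standard error of 0.5 are arbitrary illustrative numbers, not taken from any of the studies discussed, chosen only so that power is low):

```python
# Minimal sketch: Type S (sign) and Type M (magnitude) errors in a low-powered design.
# A small true effect is measured with a large standard error; all numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)

true_effect = 0.1      # small but real effect (arbitrary units)
se = 0.5               # standard error of a small, noisy study -> low power
n_sims = 100_000       # number of simulated studies

# Each simulated study yields one estimate: the true effect plus sampling noise.
estimates = rng.normal(true_effect, se, size=n_sims)

# Call a result "statistically significant" if it is more than 1.96 standard errors from zero.
significant = np.abs(estimates) > 1.96 * se

power = significant.mean()
type_s = (estimates[significant] < 0).mean()                  # wrong sign, given significance
type_m = np.abs(estimates[significant]).mean() / true_effect  # exaggeration, given significance

print(f"power: {power:.2f}")
print(f"Type S rate (wrong sign among significant results): {type_s:.2f}")
print(f"Type M exaggeration (avg |significant estimate| / true effect): {type_m:.1f}x")
```

With these particular numbers, only about 5% of the simulated studies come out statistically significant; of those, roughly a quarter point in the wrong direction, and on average the significant estimates overstate the true effect by about a factor of ten.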
I’m not making these comments out of a desire to be picky, just trying to clarify a couple of issues that have arisen lately, as psychometrics as a field has moved beyond a narrow view of file-drawer effects into an awareness of the larger problems of p-hacking. I think it’s important to realize that the problem isn’t just that researchers are “cheating” with their p-values (which might imply that all could be solved via an appropriate multiple comparisons correction) but rather that the old paradigm of a single definitive study (the paradigm which, I think, remains dominant in psychology and in statistics, even in my own articles and books!) should be abandoned.
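The mechanical part of the problem, the part that a multiple comparisons correction does speak to, is also easy to see in a toy simulation (again with arbitrary illustrative numbers: 20 subjects per group, five candidate outcomes, and no true effects at all):

```python
# Toy sketch: analytic flexibility applied to pure noise. With several outcomes to choose
# from, the chance of at least one "significant" comparison far exceeds the nominal 5%.
# Group size and number of outcomes are arbitrary illustrative choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n_per_group = 20   # small study
n_outcomes = 5     # outcomes the analyst could choose to report
n_sims = 5_000     # number of simulated studies

any_significant = 0
for _ in range(n_sims):
    found = False
    for _ in range(n_outcomes):
        a = rng.normal(size=n_per_group)   # "treatment" group: pure noise
        b = rng.normal(size=n_per_group)   # "control" group: pure noise
        _, p = stats.ttest_ind(a, b)
        if p < 0.05:
            found = True
            break
    any_significant += found

print(f"chance of at least one 'significant' finding: {any_significant / n_sims:.2f}")
```

With five independent looks at pure noise, the chance of at least one p < 0.05 is roughly 1 − 0.95^5 ≈ 0.23 rather than the nominal 0.05, and real analyses involve far more than five forking paths, which is part of why a simple correction factor does not resolve the deeper problem described above.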
A theory of the tabloids
By the way, how did Science and Nature become so strongly associated with weak, overly hyped social science research? Has this always been the case? I don’t know, but here’s a (completely speculative) theory about how this could have happened.
The story goes like this. Papers in Science and Nature are short. The paradigmatic paper might be: We constructed a compound that cures cancer in mice. The underlying experiment is a randomized controlled study of a bunch of mice, there’s also a picture of slides showing the live and dead cancer cells, and the entire experiment was replicated in another lab (hence the 50 coauthors on the paper). It’s a short crisp paper, but underlying it are three years of research and a definitive experiment. Or, if it’s a physics paper, there might be a log-log plot of some sort. More recently we’ve been seeing papers on imaging. These are often on shakier ground (Vul and all that), but if done carefully they can result in valid population inference given the people in the study.
In social science, though, we usually can’t do definitive experiments. The relevant data are typically observational, and it’s difficult to design an experiment that plausibly generalizes to the real world. Effects typically vary a lot across people, which means that you can’t necessarily trust inferences from a convenience sample, and you also have to worry about generalizing from results obtained under particular conditions on a particular date.
But . . . people can still write short crisp papers that look like Science and Nature papers. And I think this might be the problem with Science, Nature, Psychological Science, and any other “tabloid” journals that might be out there. People submit social science papers that have the look of legitimate scientific papers. But, instead of the crisp tabloid paper being a concise summary of years of careful research, it’s a quickie job, a false front.
A place for little studies
I’d also like to repeat the point that there’s nothing wrong with writing a paper with an inconclusive observational study coupled with speculative conclusions. This sort of thing can go on to Arxiv or Plos-One or any number of specialized journals. Researchers A and B publish a speculative paper based on data from a convenience sample, researchers C and D publish their attempted replications, E and F publish criticisms, and so forth. The problem is that Science, Nature, Psychological Science, etc. publish quickie papers, and so there’s a motivation to send stuff there, and this in turn devalues the papers that don’t make it into the top journals.
Currently, journals hold criticisms and replications to such a high standard for publication that papers with errors often just stand by themselves in the scientific literature. Publishing a criticism can require a ridiculous amount of effort. Perhaps blogs are a problem here in that they provide an alternative outlet for the pressure of criticism. If I were not able to reach many thousands of people each day with my blog, I’d probably be putting more effort into getting correction notices published in scientific journals, maybe my colleagues and I would already have created a Journal of Scientific Criticism, and so forth.
I hope that occasional high-profile criticisms of flawed papers (for example, here) will serve as some incentive for researchers to get things right the first time, and to avoid labeling speculation as certainty.
This was originally published on Andrew Gelman’s personal blog and is reposted with permission.
Note: This article gives the views of the author, and not the position of the Impact of Social Science blog, nor of the London School of Economics.
Andrew Gelman is a professor of statistics and political science and director of the Applied Statistics Center at Columbia University. He blogs here and his institutional website can be found here.