Failure Is Moving Science Forward | FiveThirtyEight

IBEX Bionomics, April 17, 2016

While I paced around the green room at a recent TEDx event in Colorado, one of the other speakers offered the rest of us some advice on how to ease our nerves. “Raise your arms up in the air and make yourself big — it will help you feel powerful!” It was scientifically proven, she told us (she’d seen it in a TED talk), that adopting a so-called power pose — shoulders wide, arms strong — could raise your testosterone levels, lower your stress hormones, and make you feel more confident and commanding.
Like everyone else, I was nervous. This wasn’t my usual kind of speech; it was a performance — a scripted story that wasn’t supposed to sound scripted, told with no notes and no cues. I knew my lines by heart, but I also knew that one moment of doubt was all it would take for me to draw a blank up on stage. So just before I walked through the curtains, I took a deep breath and raised my arms overhead as if signaling victory. I don’t know if the power pose helped me, but it didn’t seem to hurt.
What I didn’t say back in the green room was that although one highly touted study had shown how adopting a power pose could alter your hormone levels and make you more bold, another group of researchers had tried to repeat the study and found no such effect. It’s possible that the power pose phenomenon was nothing more than a spurious result.
Power poses aren’t the only well-publicized finding called into question by further research. Psychology, biomedicine and numerous other fields of science have fallen into a crisis of confidence recently, after seminal findings could not be replicated in subsequent studies. These widespread problems with reproducibility underscore a problem that I discussed here last year — namely, that science is really, really hard. Even relatively straightforward questions cannot be definitively answered in a single study, and the scientific literature is riddled with results that won’t stand up. This is the way science works — it’s a process of becoming less wrong over time.

The roots of the reproducibility crisis

As science grapples with what some have called a reproducibility crisis, replication studies, which aim to reproduce the results of previous studies, have been held up as a way to make science more reliable. It seems like common sense: Take a study and do it again — if you get the same result, that’s evidence that the findings are true, and if the result doesn’t turn up again, they’re false. Yet in practice, it’s nowhere near this simple.
“Scientific claims don’t gain credibility by someone saying, ‘I found it.’ They gain credibility by others being able to reproduce it,” said Brian Nosek, a psychologist at the University of Virginia, co-founder of the Center for Open Science and leader of the Reproducibility Project: Psychology. RP:P, initiated in 2011, attempted to replicate 100 studies published in three high-profile psychology journals in 2008. By this logic, a replication study’s purpose is to confirm a previously reported finding.
Yet there are good reasons why real effects may fail to reproduce, and in many cases, we should expect replications to fail, even if the original finding is real. It may seem counterintuitive, but initial studies have a known bias toward overestimating the magnitude of an effect for a simple reason: They were selected for publication because of their unusually small p-values, said Veronica Vieland, a biostatistician at the Battelle Center for Mathematical Medicine in Columbus, Ohio.
Imagine that you were looking at the relationship between height and college majors. You collected data from math majors in a small class that had a couple of unusually tall students and compared it with a similar sized philosophy class that happened to have one unusually short person in it. Comparing the two averages, the differences seem large — math majors are taller than philosophy majors (and perhaps the unusual difference between these two particular classes is what caught your attention in the first place). But most of those differences were flukes, and when you repeat the study you’re unlikely to see such an extreme difference between the two majors, especially if the second study has a larger sample. If you’re trying to figure out the true height differences, this “regression to the mean” is a good thing, because it gets you closer to the true averages.
But the regression to the mean issue also means that even if the initial results are correct, they may not be replicated in subsequent studies. The RP:P project attempted to replicate 100 studies, 97 of which had produced results with a “significant” p-value of 0.05 or less. By selecting so many positive studies, the group set itself up for a regression to the mean phenomenon, and that’s what it found, said Steven Goodman, co-director of the Meta-Research Innovation Center at Stanford (he was not involved in the RP:P project).
Indeed, less than half of the replication studies in RP:P reproduced the original results. That reduction in positive findings could mean that the original studies were wrong, or it could represent a simple regression to the mean. It’s also possible that some of the replication studies produced false negatives, failing to find effects that were real. The paper in the journal Science that described the RP:P results concluded, “how many of the effects have we established are true? Zero. And how many of the effects have we established are false? Zero.” Still, the message that made media headlines was that all these studies were disproven, and that simply wasn’t true, Goodman said.
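To make the regression-to-the-mean argument concrete, here is a minimal simulation sketch. Every number in it is an assumption chosen for illustration (a real standardized effect of 0.3, original studies with 20 people per group, replications with 80 per group); none of it is fitted to the RP:P data. It draws each study's estimated group difference from its sampling distribution, "publishes" only the originals that clear the usual significance threshold, and then replicates each published study once.

```python
import random
import statistics


def simulate(true_effect=0.3, n_original=20, n_replication=80,
             n_studies=20000, seed=1):
    """Publish only 'significant' originals, then replicate each one once.

    All parameters are hypothetical. Each estimate is drawn directly from the
    sampling distribution of a difference in means between two groups with
    unit variance (a known-variance z-test, used here for simplicity).
    """
    rng = random.Random(seed)
    se_orig = (2 / n_original) ** 0.5      # standard error of the original estimate
    se_rep = (2 / n_replication) ** 0.5    # standard error of the replication estimate
    z_crit = 1.96                          # two-sided 5 percent threshold

    published, replications = [], []
    for _ in range(n_studies):
        estimate = rng.gauss(true_effect, se_orig)
        if abs(estimate) / se_orig > z_crit:   # "significant", so it gets published
            published.append(estimate)
            replications.append(rng.gauss(true_effect, se_rep))

    significant_reps = sum(abs(r) / se_rep > z_crit for r in replications)
    return {
        "share_published": len(published) / n_studies,
        "mean_published_effect": statistics.mean(published),
        "mean_replication_effect": statistics.mean(replications),
        "replication_success_rate": significant_reps / len(replications),
    }


for name, value in simulate().items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the effect is real in every single study, yet the published originals overstate it on average, the replications drift back toward the true value, and a sizable share of replications still miss the significance threshold.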

How should we think of replications?

Researchers in RP:P did everything possible to duplicate the original studies’ methods and materials — they even contacted original authors and asked for advice and feedback on their replication plans. Even so, there could have been differences between the studies that explained why their results weren’t similar.
For instance, Elizabeth Gilbert, a graduate student at the University of Virginia, attempted to replicate a study originally done in Israel looking at reconciliation between people who feel like they’ve been wronged. The study presented participants with vignettes, and she had to translate these and also make a few alterations. One scenario involved someone doing mandatory military service, and that story didn’t work in the U.S., she said. Is this why Gilbert’s study failed to reproduce the original?
For some researchers, the answer is yes — even seemingly small differences in methods can cause a replication study to fail. In a commentary published March 4 in Science, Daniel Gilbert (no relation to Elizabeth), Gary King, Stephen Pettigrew and Timothy Wilson argue that methodological differences between the original studies and RP:P’s replications led the RP:P authors to underestimate how many replication studies would fail by chance. They also took issue with some of the sampling and statistical methods used in the RP:P analysis and conclude that “the reproducibility of psychological science is quite high.”
“Individually, each of these problems would be enough to cast doubt on the conclusion that most people have drawn from this study, but taken together, they completely repudiate it,” according to a statement attributed to Gilbert and King in a Harvard University news release. “Psychology might have a replication problem, but as far as I can see, nothing in [the RP:P] article provides evidence for this conclusion,” Gilbert told me. “I learned nothing.”
The RP:P team responded to Gilbert and his team in the same issue of Science, writing that their “very optimistic assessment is limited by statistical misconceptions” and by the selective interpretation of correlational data to infer causation. The RP:P data can be used to arrive at both “optimistic and pessimistic conclusions about reproducibility” but “neither are yet warranted,” they said. Other researchers have also disputed the statistical argument made by Gilbert’s group, and psychologists continue to debate and discuss the two commentaries. In a post at Retraction Watch, Nosek and Elizabeth Gilbert take issue with Gilbert et al.’s characterization of the differences between the original and replication studies, pointing out that some of the studies they called flawed were endorsed by the original authors and one of them successfully replicated.
The debate between these two groups is highly technical and difficult to parse without a deep grasp of statistics and research methodology. It’s also critically important. This is more than just a dispute about these particular research projects; it’s a fundamental argument about how scientific studies should be conducted and assessed.
If the push for replication has taught us anything, it’s that seemingly esoteric decisions about how to conduct a study can lead to different results, as demonstrated in this chart from a previous article of mine, which shows what happened when dozens of researchers used the same data to explore a single research question.

When 29 research teams working with the best intentions (and fully aware that their work will be scrutinized and compared with that of the other teams) can come up with such a wide range of answers, it’s easy to imagine that similarly earnest efforts to replicate existing studies might also produce different results, whether or not the original finding is correct. The takeaway is clear — methods matter.
Years ago, someone asked John Maddox how much of what his prestigious science journal Nature printed was wrong. “All of it,” the renowned editor quickly replied. “That’s what science is about — new knowledge constantly arriving to correct the old.” Maddox wasn’t implying that science was bunk; he was saying that it’s only as good as the current available evidence, and as more data pours in, it’s inevitable that our answers change.

When studies conflict, which is right?

When considering the results of replication studies, what we really want to know is whether the evidence for a hypothesis has grown weaker or stronger, and we don’t currently have an accurate metric for measuring that, Vieland said. P-values, which are commonly (and, statisticians say, erroneously) used to assess how likely it is that a finding happened by chance, don’t measure the strength of the evidence, even if they’re often treated as if they do. You also can’t take a study showing that a drug reduced blood pressure by 30 percent, add it to a study that suggested that the treatment increased blood pressure by 10 percent, and then conclude that the actual effect is a 20 percent reduction in blood pressure. Instead, you have to look at the evidence in total and carefully consider the methods used to produce it.
Consider the power pose concept, which began with a 2010 study that became a sensation via a TED talk and subsequent media storm. The results were exciting. The study suggested that briefly standing in a power pose, such as the “Wonder Woman stance,” could “configure your brain” to boost your testosterone levels, reduce the output of the stress hormone cortisol, and make you act less cautious and more confident, as Harvard psychologist Amy Cuddy explained in a speech that’s become the second-most-viewed TED talk of all time.
But in a study published last spring, Eva Ranehill at the University of Zurich and her colleagues set out to confirm and extend the effects of power poses found in Cuddy’s 2010 experiment. Ranehill told me that her team was so sure that the original finding would replicate that they’d created a whole research plan around it, hoping to see if power posing could help close gender gaps. Using a design similar to Cuddy’s, the researchers found that power posing had no effect on testosterone, cortisol or financial risk-taking. People still felt good, though — the study reproduced the self-reported feelings of power among participants who did the poses.
It’s hard to know for sure why the results of the second study didn’t mirror the first one. In a response to the Ranehill paper, Cuddy’s team spelled out 12 differences between the two studies, including their gender ratios (the original study had a higher proportion of women), the length of time participants spent in the power pose (it was three times longer in the replication), and what participants were told before the experiment began (people in the original study were given a cover story to obscure the study’s purpose, and those in the replication weren’t deceived about the study’s focus).

But the most striking difference between the two studies was the disparity in their sample sizes. The original study involved just 42 people, and the replication had 200 participants. Studies with small sample sizes generally have low statistical power, meaning that they’re unlikely to distinguish an effect, even if it’s present, said Michèle Nuijten, who studies statistics and data manipulation at Tilburg University in the Netherlands. Small studies also tend to contain a lot of noise, and as a result, she said, the effect size estimates they produce can vary wildly — ranging from severe underestimations to severe overestimations. Whether they’re first findings or replications, large studies are generally more trustworthy than small ones, Nuijten said.
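A rough back-of-the-envelope calculation shows how much the sample sizes matter. The sketch below assumes a hypothetical true standardized effect of 0.3 and an even split of participants into two groups, and it uses a normal approximation rather than the analyses either research team actually ran. Only the totals of 42 and 200 come from the studies themselves.

```python
from statistics import NormalDist


def power_two_group(total_n, true_effect, alpha=0.05):
    """Approximate power of a two-sided, two-group z-test with unit variance.

    Assumes participants are split evenly between the groups. The effect size
    passed in is a hypothetical value, not an estimate from either study.
    """
    n_per_group = total_n / 2
    se = (2 / n_per_group) ** 0.5            # standard error of the group difference
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = true_effect / se                 # how many standard errors the effect spans
    power = z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)
    return power, se


ASSUMED_EFFECT = 0.3  # hypothetical true effect, in standard deviation units

for total_n in (42, 200):
    power, se = power_two_group(total_n, ASSUMED_EFFECT)
    low, high = ASSUMED_EFFECT - 1.96 * se, ASSUMED_EFFECT + 1.96 * se
    print(f"n={total_n}: power={power:.2f}, "
          f"typical estimates range from {low:.2f} to {high:.2f}")
```

With these assumed numbers, the 42-person design would detect the effect roughly 16 percent of the time and the 200-person design roughly 56 percent of the time. The small study's estimates also scatter more than twice as widely around the true value.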
So we have one small study suggesting that power poses can alter your hormones and also your behavior and another, much larger one suggesting that they don’t. What now?
Cuddy declined to be interviewed for this article, but she did email a statement. A replication, she wrote, “should be treated as another single study that adds to a body of evidence.” She likened a replication to a “sibling to the original study” — “nothing more, nothing less.” Her group’s published response to the Ranehill study also noted that 33 studies have shown some effect from “expansive posture,” but Ranehill notes that “no published study of which we are aware has replicated their results on hormones.” Psychologists Joe Simmons and Uri Simonsohn analyzed Cuddy’s original study and Ranehill’s replication and calculated that “even if the effect existed, the replication suggests the original experiment could not have meaningfully studied it.”
In her statement to me, Cuddy suggested that her work had been targeted because other researchers disagreed with her results. “It’s important for replicators to not ‘target’ findings because they don’t pass the replicator’s own personal gut check,” she wrote, adding that she would like to see replication efforts begin with a “collaborative conversation” with “no resentment or envy or nastiness.” Ranehill told me her group hadn’t contacted Cuddy’s team but said they had no ill will toward the researchers, and she seemed puzzled that their well-intended effort to reproduce the original study would be greeted as a threat.
Cuddy is hardly the only scientist who’s reacted defensively toward someone else’s failure to replicate. “Findings become like possessions,” said Nosek, who told me that he was “taken aback” by the tone of the Harvard news release about Gilbert’s commentary on his work. “I want to adopt a stance of humility and assume that there are errors and that’s why I need to be cautious in my conclusions,” he said. The ultimate goal should be reducing uncertainty, not being able to say nah, nah, nah, I’m right.
Goodman argues that the replication framework is the wrong criterion by which to judge studies, because it implies that the first study is privileged. Focusing specifically on replication implies that the first experiment has a special claim on truth, Goodman said. Instead, “We should just be looking at an accumulating evidence paradigm, where we’re getting closer and closer to truth.”
Take, for instance, that study asking whether soccer referees were more likely to give red cards to dark-skinned players. The 29 teams who investigated the question got 29 different results, but taken together, they pointed to a similar answer (yes).
It’s easy to imagine that some of the pushback Cuddy has experienced stems from envy at the way her power pose study exploded into the spotlight and brought her fame and a nice book deal. But at the moment, the evidence supporting her contention that power poses can provoke hormonal changes seems shaky at best, and her public messaging about her results glosses over the uncertainty that remains. She’s in a tight spot, of course — who wants to walk back the speech that made you famous by saying, I still believe in the result, but the science is less settled than I originally thought?
The thing to keep in mind is that no single study provides definitive evidence. The more that science can bake this idea into the way that findings are presented and discussed, the better. Indeed, problems replicating studies have led some researchers to look for intentionally nonconfrontational approaches to weeding out false results.
One such method is something called “pre-publication independent replication” (PPIR) in which results are replicated before first publication. A team led by psychologist Eric Luis Uhlmann at the Insead business school in Singapore recently published one such project. A total of 25 research groups conducted replications of 10 studies that Uhlmann and his collaborators had “in the pipeline” as of August 2014. As they explain in their paper, six of the findings replicated according to all predetermined criteria, and the others failed at least one major replication criterion. “One strength of PPIR is that it makes sure that what comes out in the journal is quite robust,” Uhlmann said. “If something fails the PPIR, then you could avoid the media fanfare.”
Columbia University statistician Andrew Gelman, meanwhile, has proposed what he calls the “time-reversal heuristic”: “Pretend that the unsuccessful replication came first, then ask what you would think if a new study happened to find [that] a statistically significant interaction happened somewhere.”
Elsewhere, Nosek and his colleagues at the Center for Open Science are pushing for a whole new paradigm — transparency, data sharing and a move toward “registered reports,” in which researchers commit to a design and analysis plan in advance of the study. This strategy is meant to prevent researchers from exploiting “researcher degrees of freedom” — decisions about how to collect and analyze data — in a way that leads to p-hacking and other questionable research practices.
In a 2011 paper, psychologists Simonsohn, Simmons and Leif Nelson demonstrated how easy it is to fiddle with researcher degrees of freedom to produce almost any result, and a new paper by Richard Kunert at the Max Planck Institute for Psycholinguistics concludes that questionable research practices (like p-hacking, sampling until a “significant” result is obtained and hypothesizing after a result is known) are at the heart of many replication failures. The implication is clear: If scientists want to make their results more reproducible, they need to make their methods more rigorous.
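One of those practices, sampling until a “significant” result is obtained, is easy to illustrate with a small simulation. The sketch below is not a reconstruction of either paper's analysis. It assumes a made-up experiment in which there is no real effect at all, and compares a researcher who tests once at a fixed sample size with one who checks the data after every few participants and stops as soon as the result crosses the significance threshold. The checkpoint schedule (start at 10 per group, add 5 at a time, stop at 50) is hypothetical.

```python
import random


def false_positive_rate(peek, n_sims=5000, start_n=10, max_n=50, step=5, seed=7):
    """Share of null experiments declared 'significant'.

    If peek is True, the researcher tests at every checkpoint and stops at the
    first significant result. If False, the test runs once, at max_n per group.
    Both groups are drawn from the same distribution, so any hit is a false positive.
    """
    rng = random.Random(seed)
    z_crit = 1.96  # two-sided 5 percent threshold for a known-variance z-test
    hits = 0
    for _sim in range(n_sims):
        sum_a = sum_b = 0.0
        n = 0
        significant = False
        while n < max_n:
            # Collect participants up to the next checkpoint.
            target = start_n if n == 0 else min(n + step, max_n)
            for _ in range(target - n):
                sum_a += rng.gauss(0, 1)
                sum_b += rng.gauss(0, 1)
            n = target
            if peek or n == max_n:
                z = (sum_a / n - sum_b / n) / (2 / n) ** 0.5
                if abs(z) > z_crit:
                    significant = True
                    break
        hits += significant
    return hits / n_sims


print("test once at the end:", false_positive_rate(peek=False))
print("peek and stop early: ", false_positive_rate(peek=True))
```

In this toy setup the fixed-sample researcher is wrong about 5 percent of the time, as the threshold promises, while the researcher who peeks is wrong noticeably more often, even though neither did anything that looks like fraud.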
In her statement to me, Cuddy wrote, “we are trying to do something quite difficult here — predict human behavior and understand subjective experiences.” Psychology may not be a hard science, she wrote, but it is certainly a difficult one. It shouldn’t be so surprising that psychology studies don’t always replicate — the field faces an inherent challenge. Rather than measuring molecules or mass, it examines human motives and behavior, which are frustratingly hard to isolate.
What some have interpreted as a “terrifying” unraveling of psychology, others see as a sign of gathering strength. “I trust the public to recognize that this conversation about our methods is healthy and normal for science,” Simine Vazire wrote on her blog Sometimes I’m Wrong. “Science progresses slowly,” wrote the University of California, Davis, psychologist. “I don’t think we’re going to look that bad if psychologists, as a field, ask the public for some patience while we improve our methods.”
I’m skeptical that the power pose I did before my TEDx talk gave me a hormone boost and not just affirmation, but I’m also open to new data. We’ll learn more about power poses this fall when the journal Comprehensive Results in Social Psychology publishes its special issue on the topic. The journal is among a growing number that use a registered reports format in which researchers submit their hypothesis, methods and intended analyses in advance of the study and then the journal sends it out for peer review and accepts articles based on the experiment’s methods and rigor, rather than the results. None of the individual studies will provide the final word, but taken together, they should get us closer to the truth.