Reproducing results: how big is the problem?

Paul Jump examines the many reasons for irreproducibility in science and efforts to tackle it

September 3, 2015

“Modern scientists are doing too much trusting and not enough verifying – to the detriment of the whole of science, and of humanity. Too many of the findings that fill the academic ether are the result of shoddy experiments or poor analysis.” This was the conclusion of The Economist’s leader writers in 2013, after the magazine published a story on what is often referred to as science’s “reproducibility crisis”.

Worries about irreproducibility – the inability to reproduce the results of an experiment when it is rerun under the same conditions – came to the fore again last week when a landmark effort to reproduce the findings of 100 recent papers in psychology failed in more than half the cases (“More than half of psychology papers are not reproducible”, 27 August). But the concerns are not new. Dorothy Bishop, professor of developmental neuropsychology at the University of Oxford, who chaired an Academy of Medical Sciences conference on the issue in April, recently pointed out on her blog that reproducibility was a significant worry for the 17th-century scientist Robert Boyle. He lamented that “you will find…many of the Experiments publish’d by Authors, or related to you by the persons you converse with, false or unsuccessful”.

According to Brian Nosek, a professor of psychology at the University of Virginia and co-founder and executive director of the Center for Open Science, which ran the psychology reproducibility project, methodology texts in the 1960s mention many of the same problems and discuss some of the same solutions that have been highlighted recently. Two decades ago, an editorial in the British Medical Journal decried “the scandal of poor medical research”, carried out by “researchers who use the wrong techniques (either wilfully or in ignorance), use the right techniques wrongly, misinterpret their results, report their results selectively, cite the literature selectively, and draw unjustified conclusions”. And John Ioannidis’ landmark 2005 paper on “Why most published research findings are false” has been viewed nearly 1.4 million times.

But the issue of reproducibility really began to reach mainstream scientific and public consciousness after the 2011 publication of a paper in Nature by researchers from Bayer HealthCare, a German pharmaceutical company. The paper, “Believe it or not: how much can we rely on published data on potential drug targets?”, reported that the company had been able to replicate only between 20 and 25 per cent of 67 published preclinical studies, mostly in cancer.

The alarm was reinforced in 2012 by another Nature paper, “Drug development: raise standards for preclinical cancer research”, which reported that the Californian pharmaceutical company Amgen had been able to reproduce just six of 53 “landmark” cancer studies it tested. It described that 11 per cent success rate as “shocking”: “Clearly there are fundamental problems in both academia and industry in the way such research is conducted and reported,” the paper concluded.

According to Nosek, the lack of detail in the Bayer and Amgen papers about what they actually did prompted some academics to dismiss them entirely, on the grounds that “we have no idea if they did anything competently”. And he concedes that although there is much circumstantial and theoretical evidence of problems, such as that published by Ioannidis, “direct evidence” is still lacking.

But, for Mark Winey, a professor of molecular, cellular and developmental biology at the University of Colorado Boulder, who recently chaired a “task force” on irreproducibility for the American Society for Cell Biology, the Bayer and Amgen papers were “a real wake-up call”. “There were concerns about cell line contamination going back to the 1960s…but those papers raised broader issues about other types of reagents and the lack of detail in published protocols,” he says.

Chris Chambers, head of brain stimulation at Cardiff University, says that another part of the reason for irreproducibility’s rise to prominence is the attention generated in recent years by a string of major research fraud cases, perhaps most famously that of Diederik Stapel, the eminent Dutch social psychologist who turned out to be a serial fabricator of data. Chambers shares the common view that even if fraud is more common than is typically acknowledged, it is unlikely to be the major reason for such high levels of irreproducibility. However, “in the process of trying to understand how fraud cases could have happened, you identify all these other problems that aren’t fraud but are on the spectrum”, he explains.

But why do people find themselves adopting practices that are on the fraud spectrum in the first place? One reason frequently cited is the overvaluation by funders and institutions of publications in high-impact journals. The claim is that while researchers are busy cutting corners and torturing data in order to secure that career-defining publication, top journals’ concern with maximising their impact factors and their prominence in the mainstream press leads them to, in Winey’s words, “push papers through with insufficient review or addressing of concerns”.

“The incentives that motivate individual scientists are completely out of step with what is best for science as a whole,” Chambers says. “If we built aircraft the way we do basic biomedical research, nobody would fly because it wouldn’t be safe. But in biomedicine risk-taking is rewarded.”

Nosek agrees: “It is not necessarily in my interest to learn a new statistics technique or show you all the false starts we had. You would probably get different answers from scientists to the questions: ‘Do you want your paper to be reproducible?’; ‘Do you hope that it is?’; and ‘Do you think that it actually is?’”

He says a colleague once warned junior colleagues never to try to carry out direct replication of their own work lest they be “confronted with the effect going away. That is crazy in terms of how science is supposed to operate.”

Journals’ supposed reluctance to publish negative findings is also blamed for the fact that any number of labs may waste time attempting to pursue research avenues or build on results that others have already found to be flawed.

Furthermore, journals’ desire for neat stories is also part of the reason for the widespread perception that the methods sections of papers do not supply enough information about what was done to permit replication.

According to Elizabeth Iorns, founder and chief executive of contract research company Science Exchange, the reality of science is “messy”, so “people exclude things that don’t fit perfectly with the story, which means you aren’t seeing the whole picture”.

Another bugbear is the length restrictions print journals typically impose on methods sections. However, even in online journals with unlimited space, methodological detail is often lacking. As Nosek says: “I don’t want to have to show all these things as an author, and I don’t care to ask for them as a reviewer. We are our own worst enemies.”

In a 2014 Nature article setting out the concerns of the US National Institutes of Health about irreproducibility, Francis Collins, the agency’s director, and Lawrence Tabak, its principal deputy director, added that “some scientists reputedly use a ‘secret sauce’ to make their experiments work – and withhold details from publication or describe them only vaguely to retain a competitive edge”.

According to Iorns, other bars to the reproduction of published findings include the difficulty of contacting the original experimenters, who have often moved on and left their lab books behind, and the difficulty of obtaining the materials that they used, such as genetically modified animals.

Concerns also abound about the purity of commercially produced reagents and cell lines. “You have to test that what you have got is what you think it is,” according to Chambers.

The use of statistics is another major worry. According to Chambers, the pressure on researchers to “crank out” papers means that they are more likely to carry out a succession of small studies rather than one larger one. But this runs the risk that the studies are “statistically underpowered”, lacking enough data points to draw reliable conclusions. This means that the experimenter is “more likely to miss a true discovery, but also more likely to find something that isn’t real”. A particular concern that animals are essentially wasted in statistically underpowered experiments led the UK research councils earlier this year to begin requiring grant applicants to demonstrate that their experiments will give “robust results”.
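Chambers’ point about underpowered studies can be illustrated with a rough calculation. The sketch below is not drawn from any of the studies discussed; it assumes a standardised effect size of 0.5, a 5 per cent significance threshold and a prior in which only one in 10 tested hypotheses is true, and the function names are hypothetical. It uses a normal approximation to estimate the power of a two-group comparison and then the share of “significant” findings that reflect a real effect.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def power_two_sample(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group z-test.

    effect_size is the standardised difference between group means
    (an assumed value here, not taken from the article).
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n_per_group / 2)
    # Probability that the test statistic lands in either rejection region
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


def positive_predictive_value(power, prior, alpha=0.05):
    """Share of significant results that reflect a real effect.

    prior is the assumed fraction of tested hypotheses that are true.
    """
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)


for n in (20, 64):
    pw = power_two_sample(0.5, n)
    ppv = positive_predictive_value(pw, prior=0.1)
    print(f"n per group = {n}: power = {pw:.2f}, PPV = {ppv:.2f}")
```

Under these assumptions, the smaller study both misses the real effect more often and yields a lower proportion of significant results that are genuine, which is the double penalty Chambers describes.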

Statistics are often crucial to the claim that there is a causal link between two observed phenomena. Typically, the hypothesised link is deemed to be effectively proved when the probability of obtaining a result at least as striking by chance alone, if no real link existed, is less than 5 per cent – or, in technical terms, where the “p-value” is less than 0.05. Critics assert that the concept of proof in this probabilistic context is misguided and, worse, that many unscrupulous or statistically illiterate scientists routinely engage in “p-hacking”. This involves measuring multiple variables and trawling through the results until a relationship with a p-value of less than 0.05 is uncovered. The culprits then write their paper as if that were the result they had hypothesised all along.

“It is a bit like the Texas sharpshooter fallacy, where you spray the wall with a machine gun and then draw the target around where you happened to hit,” as Chambers puts it. “In psychology,” he adds, “a very high proportion of people admit to having done this.”

The problem with p-hacking is that, according to Bishop, the statistics have a “different meaning” depending on whether the observation was genuinely hypothesised or not, because, when multiple relationships are examined, the odds of finding one that is statistically significant are relatively high.
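Bishop’s point can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes that the relationships examined are independent of one another and that none of them is real, and the function names are hypothetical. Under those assumptions each p-value is uniformly distributed between 0 and 1, so the chance that at least one of k tests falls below 0.05 is 1 − 0.95^k. The simulation simply checks that formula.

```python
import random


def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent tests
    is 'significant' when no real effect exists in any of them."""
    return 1 - (1 - alpha) ** num_tests


def simulate(num_tests, alpha=0.05, trials=20000, seed=1):
    """Estimate the same probability by drawing null p-values at random.

    With no real effect, a p-value is uniform on [0, 1], so
    rng.random() stands in for running an actual test.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        smallest_p = min(rng.random() for _ in range(num_tests))
        if smallest_p < alpha:
            hits += 1
    return hits / trials


for k in (1, 5, 20):
    print(f"{k:2d} tests: formula = {chance_of_false_positive(k):.3f}, "
          f"simulated = {simulate(k):.3f}")
```

With 20 unrelated variables, the formula puts the odds of at least one spurious “discovery” at roughly 64 per cent, against 5 per cent for a single, genuinely pre-specified test.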

“We have enormous statistics [programs] that do very complex things at the touch of a button, and a lot of people don’t understand quite what they are doing,” she says.

A 2012 paper in the journal Psychological Science, “Measuring the prevalence of questionable research practices with incentives for truth telling”, based on a survey of 2,000 psychologists, found that various “questionable practices may constitute the prevailing research norm”, and the journal Basic and Applied Social Psychology recently banned all mention of p-values.

Chambers says that the problem looks bigger in psychology than in other disciplines only because psychologists are paying more attention to it. But, according to Ottoline Leyser, professor of plant development at the University of Cambridge, the abuse of p-values is much more problematic in the relatively small number of disciplines that rely on “one line of evidence” to prove causation.

“Huge swathes of biology are answering questions not about whether x causes y, but about how x causes y,” she says. “Those kinds of studies require multiple different sorts of evidence brought together”, making the impact of statistical skulduggery less of an issue.

Moreover, she questions the extent to which irreproducibility should be seen as a problem, since, in many instances, it is a “normal part of biology”. One reason is that a lot of experiments are technically demanding; another is that biological systems are affected by many variables that are unknown and therefore cannot be controlled for.

Leyser is concerned that “perfectly reasonable and important” concerns about irreproducibility will lead to a “compliance-based response that is one-size-fits-all but makes no sense across a vast swathe of biology. That sort of approach never works in research because, by definition, people are always doing things you’d never have thought of.”

She is particularly concerned about what she calls the “AllTrials notion”. This is a reference to the movement in medical research for all clinical trials to be registered in order to counter “publication bias”: the fact that only trials with positive results tend to be published. In basic science, a similar thought process has led to the development of “registered reports”, pioneered by a Center for Open Science committee chaired by Chambers. The hope (see 'Registered reports: preventing p-hacking and publication bias' box, below) is that pre-registration of research questions and proposed methodologies will also prevent p-hacking by compelling researchers to do what they say they will – or to be upfront when they deviate. Since, on this model, journals accept papers on the basis of the research proposal, before the results are even known, the temptation to p-hack is even lower.

But the idea is highly controversial. A few years ago, Sophie Scott, a Wellcome Trust senior fellow at University College London, wrote in Times Higher Education that, in cognitive neuroscience and psychology, “a significant proportion of studies would simply be impossible to run on a pre-registration model because many are not designed simply to test hypotheses” (“Pre-registration would put science in chains”, Opinion, 25 July 2013). Leyser agrees that pre-registration would be “ridiculous” in exploratory investigations “where you change the next experiment based on the results of previous ones. If you had to list them all at the start and weren’t allowed to deviate, it wouldn’t be science.” Chambers counters that not all studies would be expected to be registered reports, but the fear remains that non-registered reports might come to be seen erroneously as second-class science.

Another controversial suggestion is that there should be greater efforts to directly replicate previously published or about-to-be-published findings, as the psychology project did. But while that project strove to be as robust as possible, concerns about such efforts more generally include that statistically underpowered or badly conducted replication studies could merely muddy the picture further, and, most crucially, that it simply is not feasible, either financially or in terms of lab capacity, to carry out replication on a large scale.

The financial issues are borne out by the hugely underwhelming response from researchers to Iorns’ 2012 launch of the Reproducibility Initiative, which offers to replicate a lab’s results before they are submitted for publication (see 'The reproducibility initiative: carrying out replication studies' box, below). Nosek, who is on the initiative’s advisory board, agrees that “it would not be reasonable” to expect every study to be replicated. But he believes that testing landmark papers – as the initiative is currently doing in cancer biology – can cast more light on the causes of irreproducibility and guide further efforts to address it.

Chambers advocates building replication into research proposals themselves by encouraging scientists to say that “in order to make a novel discovery I am going to need to replicate a study from a previous paper”. Meanwhile, Bishop suggests requiring every graduate student to try to reproduce a published finding as their first project. This, she believes, would be “both useful training and valuable in itself”, although she adds that such a measure would not be needed if science were “done properly” in the first place.

Bishop’s committee will publish a series of recommendations later this year based on the Academy of Medical Sciences meeting. These include a suggestion that studies be carried out by larger consortia of research groups than is the current norm since this will increase the size and statistical power of studies and encourage pooling of expertise and mutual vigilance. But, for her, the chief solution to irreproducibility is better training in methods and statistics – for senior as well as junior researchers.

For their part, Collins and Tabak chiefly favour “the expanded development and adoption of standards and best practices”, and the American Society for Cell Biology report – “How can scientists enhance rigor in conducting basic research and reporting research results” – singles out the success of efforts by Daniel Klionsky, Alexander G. Ruthven professor of life sciences at the University of Michigan and editor of the journal Autophagy, to establish common standards of proof in his field.

Meanwhile, the Center for Open Science recently set out its Transparency and Openness Promotion (TOP) guidelines, which suggest eight standards of transparency and reproducibility that journals can adopt. The more than 500 that have already signed up commit themselves to reviewing which standard they want to adopt, and the centre hopes that this will establish “community standards” more widely. The NIH has also launched Principles and Guidelines for Reporting Preclinical Research, which nearly 100 journals have endorsed.

Many journals have begun unilaterally to address irreproducibility. Last year, the Plos journals began stipulating that authors must make their raw data available except in exceptional circumstances. And the pioneering acceptance criteria of the now widely imitated megajournal Plos One, which stress the scientific rigour of papers rather than their novelty, were adopted in part to overcome publication bias – although Damian Pattinson, the journal’s editorial director, admits that it has been an uphill struggle to get authors and reviewers to really take them to heart.

“Cover letters for Plos One read on occasion like cover letters for Nature. People really try to convince you [that] they have a huge breakthrough discovery when we just want it to be right,” he says.

He also admits that despite being open to submission of negative results, the journal has not received many. He suspects that this is partly because it is harder to prove a negative, but also because scientists are not motivated to write up such results.

It is a similar story at Scientific Reports, Nature Publishing Group’s version of Plos One. Alison Mitchell, NPG’s editorial director, adds that Nature itself “is now ensuring that authors report whether they have followed certain standards in designing their experiments, such as blinding and randomisation when possible”. The journal has also adopted a checklist “intended to prompt authors to disclose technical and statistical information in their submissions and to encourage referees to consider aspects important for research reproducibility”. Other initiatives include “facilitating” access to raw data, removing word limits on the online methods section and, in some cases, consulting statisticians when reviewing papers.

Many such measures have previously been adopted by medical journals in light of the concerns expressed in the BMJ editorial – for which reason Collins and Tabak say the problem of irreproducibility is less serious in the clinical arena.

Bishop admits that when she first encountered this “great checklist of stuff you have to comply with”, she found it “awful”. She now accepts that “it is quite useful to ensure there is standardised information”. But she is aware of the concerns that such checklists could slow the pace of discovery, stifle creativity and even drive people out of science entirely.

For Leyser, while checklists are “useful for flagging issues”, they are too narrowly conceived for the whole scope of research.

“In the end, it is about the kind of ethos with which scientific research is conducted: people taking responsibility for their data, understanding it fully and presenting it in a completely open way. That needs to be embedded in the way science is done and the way people are trained and rewarded,” she says. The key, in her view, is to develop ways to assess the research process itself, and not just its inputs and outputs.

People’s attitudes towards solutions to irreproducibility evidently bear a close relationship to their perception of the scope and seriousness of the problem. There is, for instance, much less concern about it in the physical sciences. Chambers attributes this to a greater culture of reproducing significant results, while Philip Moriarty, professor of physics at the University of Nottingham, notes that his subject is “very reductive, with fewer variables to control (or, at least, fewer uncontrolled variables)”.

For Collins and Tabak, while the idea that basic biomedical research is self-correcting ultimately remains true, “in the shorter term…the checks and balances that once ensured scientific fidelity have been hobbled”. And Bishop thinks p-hacking is such a big problem that it could undermine science and the public’s trust in it; she notes that climate change deniers have already seized on her blog about Boyle as evidence that “you can’t trust science”. For her, this implies that scientists have to adopt criteria “by which you know if somebody has done x, y and z, their results should be trustworthy”.

The American Society for Cell Biology report also highlights The Economist article and the attention that irreproducibility has recently received from US politicians (such as in the most recent bill to fund the National Science Foundation) as reasons why the issue must be addressed.

Leyser accepts that certain practices could be improved, but she remains doubtful that irreproducibility is a bigger problem now than it was in the past, and fears that the anxiety it is generating is “overblown”. In particular, she laments the suggestion that authoring a study that cannot be reproduced implies incompetence “or, worse, that you are unethical” when there are “many more straightforward reasons”.

She is wary of attempts to head off science’s critics by “stressing the robustness and certainty achieved through the scientific method because I think that’s misleading. Your interpretation of data is always going to change over time as you accumulate more data and build a more holistic understanding of systems.”

But most observers agree that the fact that funders, in particular, have become very concerned about irreproducibility means that the issue is not going to go away any time soon. Collins and Tabak accept that improving training and implementing “quality management systems” could increase the cost of research by 25 per cent. “However, the societal benefits garnered from an increase in reproducible life science research far outweigh the cost,” they write. The recent estimate in a paper in Plos Biology that the US alone spends about $28 billion (£18 billion) a year on research that can’t be reproduced – 50 per cent of its total spend in the life sciences – only bolsters their argument.

“The tipping point has already passed,” Nosek agrees. “There have been crises in many subfields in the past and people then just went back to business as usual, but I don’t think that is possible now.”

He is also optimistic that the initiatives that he has helped to launch will bear fruit because scientists already believe in the importance of transparency: “We just need to figure out how to shift incentives so scientists can live closer to the values they already have. It is hard but once it gets rolling – and it is already starting – those changes will accelerate and we are going to be in a better place.”

Chambers is similarly heartened by the success of the recent grant application he made to the European Research Council, Europe’s premier funder of “frontier” research, which stressed that every result he obtained would be reproduced in a different sample before it was deemed to be reliable. More than half of his “eight or nine” reviewers praised this aspect of his proposal.

“Despite all the cultural pressures, there is still a core desire to see reproducibility pushed to the fore,” he says. “We are still scientists and still want to know what the truth is. The flame is still burning.”

Lab labours: survey reveals struggle to reproduce rivals’ results

The American Society for Cell Biology recently surveyed its members about reproducibility. Nearly 72 per cent of the 869 respondents have had trouble replicating another lab’s published results, and 23 per cent say that other labs have reported problems replicating theirs.

Sixteen per cent resolved problems through “amicable” communication with the original lab, and another 16 per cent resolved them unilaterally with additional experiments.

Nearly 40 per cent of issues were not resolved, although more than half of those cases were because “the issue was deemed not important enough to pursue”.

Seventeen per cent were unresolved amid “contentious consultation with the other lab”.

For those issues that were resolved, the main reason for the original problems was incomplete specification of the protocol followed. More than half of respondents say that resolving the issue took a “huge” (12 per cent) or “significant” amount of time.

Asked what factors they believe contribute to poor reproducibility, respondents most often cite “pressure to publish in a high-profile journal”.

The society’s related policy paper on reproducibility says that this is “leading to a culture of poor standards and ‘cherry-picking’ results to make a great story”. The next most popular responses are “poor methodological training” and “poor lab record keeping”.

Paul Jump

Registered reports: preventing p-hacking and publication bias

Registered reports are papers that are accepted on the basis that their research question and proposed methodology are deemed to be sufficiently interesting and rigorous, respectively.

The fact that editors and reviewers decide on this before any results are known is intended to avoid “publication bias” – the preference of journals for positive results – and “p-hacking” – the recasting of the aims of studies on the basis of which of their results turn out to be statistically significant.

The idea has been pioneered at the cognition journal Cortex. Another 15 journals from a range of disciplines have declared their willingness to publish registered reports, with “many more in the pipeline”, according to Chris Chambers, registered reports editor at Cortex and chair of the registered reports committee at the Center for Open Science.

Cortex has 20 reports going through the review process and has published two completed reports. The journal Social Psychology has published a special issue of 15 registered reports. Brian Nosek, co-founder and executive director of the Center for Open Science, was guest editor.

“It was a fantastic experience of how science could be different by getting people on different sides of a particular debate to wrestle together with the best methodology to test it,” he says.

Finalised registered reports look much like standard papers, but Chambers admits that such an unfamiliar format will still take some time to catch on: “We hope when people see the published reports they will be impressed and say: ‘This is research I can really believe and there are benefits for me to be publishing in this format.’”

He thinks that registered reports should be an option at “every journal that publishes hypothesis-driven research and that uses statistics”. But he concedes that it is not a suitable format for all kinds of science. “When you are looking for dinosaur skeletons, you either find them or you don’t. You don’t need to do a statistical test,” he says.

Paul Jump

The reproducibility initiative: carrying out replication studies

The Reproducibility Initiative was launched in 2012 by a number of groups coordinated by Science Exchange, a network of about 1,000 US laboratories carrying out contract research.

According to Elizabeth Iorns, founder and chief executive of Science Exchange, the idea was sparked by the failure of Amgen and Bayer HealthCare to replicate the majority of published findings they tested, as well as the failure of other replication attempts that her network had carried out for pharmaceutical companies.

The pity, she says, was that those results were never made public, so “no one else was benefiting”. Hence, she offered to carry out replication studies for university labs (for a fee), and the journal Plos One agreed to publish them.

Of 20,000 scientists surveyed before the launch, 2,000 expressed interest, and Iorns was initially looking for 40 to 50 studies to carry out. She estimates that the cost of reproducing a study is less than 10 per cent of the original outlay, but, in the event, so few labs had available funding that just one signed up (replication was successful in that case).

However, in 2013, the initiative was awarded $1.3 million (£829,000) by the Laura and John Arnold Foundation to replicate the 50 cancer studies with the highest citation impact between 2010 and 2012: studies that, in Iorns’ estimation, originally cost in excess of $100 million to carry out. The protocols have been published as registered reports in the journal eLife and the replication work is ongoing.

Iorns is also coordinating an effort by the Prostate Cancer Foundation to replicate three studies in areas that it is interested in funding. And the Reproducibility Initiative has aligned itself with the Center for Open Science’s Reproducibility Project in psychology, which reported its results last week.

Iorns rejects concerns that contract research labs lack the expertise to carry out cutting-edge experiments, pointing out that of the 50 cancer studies, she rejected replication of only one for that reason.

“The vast majority of results in top journals use standard techniques,” she says. “And invented methods are soon validated because others want to use them in their own labs.”

But Science Exchange has ceased applying for grants to carry out any more replication studies. “We feel like we have pushed it as far as we can and now it is up to the community to decide if it is something they want to allocate funding towards,” says Iorns.

Paul Jump