A Manifesto for Reproducible Science: Notes
The Problem of False Research Findings
- Several factors contribute to the high probability of incorrect published findings:
- Small sample sizes
- Small effect sizes
- Data dredging (P-hacking)
- Conflicts of interest
- Competition among scientists without collaboration
- Metascience, the scientific study of science, has revealed threats to efficient knowledge accumulation.
- Reproducibility is lower than desired across many fields.
- One analysis estimates that 85% of biomedical research efforts are wasted.
- 90% of respondents in a Nature survey agreed there is a 'reproducibility crisis'.
- There is substantial room for improvement in research practices to maximize the efficiency of public investment in research.
- The authors propose measures to improve research efficiency and robustness by targeting specific threats to reproducible science.
- These measures fall into categories of methods, reporting and dissemination, reproducibility, evaluation, and incentives.
- The measures are intended to be a practical, evidence-based set of actions for researchers, institutions, journals, and funders.
- Improving reliability and efficiency of research will increase the credibility of scientific literature and accelerate discovery.
Apophenia, Confirmation Bias, and Hindsight Bias
- Scientific creativity involves identifying novel patterns in data, but scientists must avoid being misled by randomness.
- Apophenia: The tendency to see patterns in random data.
- Confirmation bias: The tendency to focus on evidence that aligns with expectations.
- Hindsight bias: The tendency to see events as predictable after they have occurred.
- These biases can lead to false conclusions.
- Example: Astronomers who believed they saw the fictitious planet Vulcan because their theories predicted it.
- Experimenter effects are another example of bias.
- Rapid, flexible, and automated data analysis can facilitate over-interpretation of noise.
- High-dimensional datasets may have numerous reasonable approaches to analysis.
- A systematic review of fMRI studies showed almost as many unique analytical pipelines as studies.
- Applying thousands of analytical pipelines to an fMRI dataset can result in a high likelihood of false-positive findings.
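To make the cost of analytical flexibility concrete, the following sketch (an illustration under the simplifying assumption of independent pipelines, not a result from the fMRI literature) computes the chance that at least one of k pipelines yields P < 0.05 on pure noise.

```python
# Sketch: how analytic flexibility inflates the false-positive rate.
# Assumes k independent "pipelines", each with a 5% false-positive rate
# on null data; real pipelines are correlated, so treat this as intuition.
alpha = 0.05
for k in (1, 5, 20, 100, 1000):
    p_any = 1 - (1 - alpha) ** k  # P(at least one pipeline "significant")
    print(f"{k:>5} pipelines -> {p_any:.1%} chance of a false positive")
```

Under these assumptions, with only 20 pipelines the chance of at least one spurious "hit" already exceeds 60%.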
P-Hacking and Data Dredging
- During data analysis, confirmation and hindsight biases can lead researchers to accept outcomes fitting expectations and reject others.
- Hypotheses may be reported without acknowledging their post hoc origin.
- This is self-deception, not scientific discovery, and increases the false discovery rate.
- Measures are needed to counter the tendency to see patterns in noise.
Methods to Improve Research
- This section describes measures for research design, methods, statistics, and collaboration.
Protecting Against Cognitive Biases
- Blinding is an effective solution to mitigate self-deception and unwanted biases.
- Participants and data collectors can be blinded to experimental conditions and research hypotheses.
- Data analysts can be blinded to key parts of the data by masking experimental conditions or variable labels during data preparation.
- Deliberate perturbations or masking of data allow data preparation to proceed without the analyst seeing the corresponding results (see the sketch after this list).
- Pre-registration of the study design, primary outcome(s), and analysis plan is a highly effective form of blinding, because the data do not yet exist and outcomes are still unknown, so analytical decisions cannot be shaped by them.
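Below is a minimal sketch of the label-masking idea described above; the file names, column name, and coding scheme are hypothetical illustrations, not anything specified in the source.

```python
# Sketch: mask experimental-condition labels so the analyst can prepare
# the data without knowing which group is which. File and column names
# are hypothetical.
import random

import pandas as pd

df = pd.read_csv("raw_trial_data.csv")             # hypothetical raw data

conditions = sorted(df["condition"].unique())      # e.g. ["control", "treatment"]
codes = [f"group_{i}" for i in range(len(conditions))]
random.shuffle(codes)
key = dict(zip(conditions, codes))                 # mapping kept by a third party

df["condition"] = df["condition"].map(key)
df.to_csv("masked_trial_data.csv", index=False)    # analyst works on this file

# The unmasking key is stored separately and revealed only after the
# planned analyses have been locked in.
pd.Series(key).to_csv("unmasking_key.csv")
```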
Improving Methodological Training
- Research design and statistical analysis are mutually dependent.
- Common misperceptions concern:
- Interpretation of P values
- Limitations of null-hypothesis significance testing
- Meaning and importance of statistical power
- Accuracy of reported effect sizes
- Likelihood of replication with a statistically significant finding
- Basic design principles, such as blinding, randomization, counterbalancing, and within-subjects designs, are crucial.
- Integrative training in research practices is important to protect against cognitive biases and distorted incentives.
- Statistical and methodological best practices are constantly evolving, requiring continuing education for both senior and junior researchers.
- Failure to adopt advances may be due to a lack of formal continuing education requirements.
Accessible Educational Resources
- Effective solutions include developing accessible, easy-to-digest, and immediately applicable educational resources.
- Examples include brief, web-based modules customized for specific research applications.
- A modular approach simplifies iterative updating.
- Demonstration software and hands-on examples make lessons tangible.
- The Experimental Design Assistant (https://eda.nc3rs.org.uk) supports research design for animal experiments.
- P-hacker (http://shinyapps.org/apps/p-hacker/) demonstrates how easy it is to generate statistically significant findings through analytic flexibility.
Implementing Independent Methodological Support
- Independent methodological support is well-established in clinical trials, through multidisciplinary trial steering committees.
- These committees address financial conflicts of interest in clinical trials, where the sponsor may be the manufacturer of the treatment being tested.
- Non-financial conflicts of interest also exist, such as scientists' beliefs and the need to publish for career progression.
- Independent researchers can mitigate these influences by participating in design, monitoring, analysis, or interpretation.
- This can occur at the project level or through funding agency facilitation.
Encouraging Collaboration and Team Science
- Average statistical power has remained persistently below 50% across many disciplines (see the sketch after this list).
- Low statistical power increases the likelihood of false-positive and false-negative results.
- Team science can address this by distributing collaboration across multiple study sites.
- This facilitates high-powered designs and tests generalizability across settings and populations.
- Team science also brings diverse theoretical perspectives, disciplinary approaches, research cultures, and experiences.
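As a rough illustration of both points, the following Monte Carlo sketch (with an assumed true effect of d = 0.4 and illustrative sample sizes) estimates power for a single small site versus the same design pooled across five sites.

```python
# Sketch: statistical power of a two-group comparison at one small site
# versus the same design pooled across several sites. The effect size,
# per-site sample size, and number of sites are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
d, alpha, n_sims = 0.4, 0.05, 5000

def power(n_per_group):
    hits = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, 1.0, n_per_group)
        treatment = rng.normal(d, 1.0, n_per_group)
        if stats.ttest_ind(treatment, control).pvalue < alpha:
            hits += 1
    return hits / n_sims

print(f"Single site, n=20 per group        : power ~ {power(20):.0%}")
print(f"Five sites pooled, n=100 per group : power ~ {power(100):.0%}")
```

Under these assumptions, power at a single site is roughly 20-25%, rising to about 80% when the pooled sample reaches 100 per group.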
Multi-Center and Collaborative Efforts
- Multi-center and collaborative efforts have a long tradition in clinical medicine (randomized controlled trials) and genetic association analyses.
- These efforts improve the robustness of research.
- Multi-site collaborative projects have been advocated for animal studies to maximize power, enhance standardization, and optimize transparency.
- The Many Labs projects illustrate this potential in the social and behavioral sciences.
Reporting and Dissemination
- This section describes measures for communicating research, including reporting standards, study pre-registration, and disclosing conflicts of interest.
- Pre-registration of study protocols for randomized controlled trials in clinical medicine has become standard practice.
- It involves registering the basic study design or pre-specifying study procedures, outcomes, and statistical analysis plan.
- Pre-registration addresses publication bias and analytical flexibility (outcome switching).
Publication Bias
- Also known as the file drawer problem, it refers to the fact that more studies are conducted than published.
- Studies with positive and novel results are more likely to be published than those with negative results or replications.
- The published literature indicates stronger evidence than exists in reality.
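A small simulation can illustrate why a significance-filtered literature overstates effects; the true effect size, sample size, and number of simulated studies below are assumptions chosen purely for illustration.

```python
# Sketch: if only statistically significant studies are published, the
# published effect-size estimates overestimate the true effect.
# All numbers are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_d, n_per_group, n_studies = 0.2, 30, 20000

published = []
for _ in range(n_studies):
    control = rng.normal(0.0, 1.0, n_per_group)
    treatment = rng.normal(true_d, 1.0, n_per_group)
    if stats.ttest_ind(treatment, control).pvalue < 0.05:
        pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
        published.append((treatment.mean() - control.mean()) / pooled_sd)

print(f"True effect size        : d = {true_d}")
print(f"Share reaching p < .05  : {len(published) / n_studies:.0%}")
print(f"Mean published estimate : d ~ {np.mean(published):.2f}")
```

Under these assumptions only a small fraction of studies reach significance, and their average published estimate is inflated to roughly d = 0.6, about three times the true effect.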
Outcome Switching
- Refers to changing the outcomes of interest based on observed results.
- Researchers may select statistically significant outcomes as the outcomes of interest after the results are known.
- This increases the likelihood of spurious results while ignoring negative evidence.
- These data-contingent analysis decisions constitute P-hacking.
- Pre-registration can protect against all of these.
Benefits of Pre-Registration
- The strongest form involves registering the study and pre-specifying the design, primary outcome, and analysis plan before data collection.
- This addresses publication bias by making all research discoverable.
- It also addresses outcome switching and P-hacking by requiring analytical decisions before observing the data.
- It distinguishes data-independent confirmatory research from data-contingent exploratory research.
Adoption of Pre-Registration
- Pre-registration is common in clinical medicine but rare in the social and behavioral sciences.
- Support is increasing, with websites like the Open Science Framework (http://osf.io/) and AsPredicted (http://AsPredicted.org/).
- The Preregistration Challenge offers education and incentives, and journals are adopting the Registered Reports publishing format.
Improving the Quality of Reporting
- Pre-registration improves discoverability, but not necessarily usability.
- Poor usability reflects difficulty in evaluating what was done and reusing the methodology.
- Improving quality and transparency in reporting is necessary.
- The Transparency and Openness Promotion (TOP) guidelines offer standards for journals and funders.
- The Consolidated Standards of Reporting Trials (CONSORT) statement provides guidance for randomized controlled trials.
- The Equator Network (http://www.equator-network.org/) aggregates reporting guidelines.
- The Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA) statement exists for reporting systematic reviews and meta-analyses.
Challenges in Adoption and Use
- The success of reporting guidelines depends on adoption and effective use.
- The social and behavioral sciences lag behind the biomedical sciences in adoption.
- Improved reporting alone may be insufficient, as guidelines can be seen as bureaucratic exercises.
- Even with pre-registration, trials may not report pre-registered outcomes, adding new outcomes instead.
- Franco and colleagues observed that roughly 40% of published reports failed to report all experimental conditions, and approximately 70% failed to report all outcome measures.
- Non-included outcome measures were more likely to be negative results with smaller effect sizes.
- In medicine there is strong evidence for the effectiveness of the CONSORT guidelines: journals that do not endorse the CONSORT statement show poorer reporting quality compared with endorsing journals79.
- For the ARRIVE (Animal Research: Reporting of In Vivo Experiments) guidelines80, studies comparing the reporting of ARRIVE items in specific fields of research before and after the guidelines were published report mixed results81-83.
- A randomized controlled trial is in progress to assess the impact of mandating a completed ARRIVE checklist with manuscript submissions on the quality of reporting in published articles (https://ecrf1.clinicaltrials.ed.ac.uk/iicarus).
- The success of these efforts will require journals and funders to adopt guidelines and support the community's iterative evaluation and improvement cycle.
Independent Oversight: The Case of CHDI Foundation
- CHDI Foundation established the Independent Statistical Standing Committee (ISSC) to provide unbiased evaluation and expert advice.
- The ISSC comprises experts in research design and statistics who are not engaged in Huntington’s disease research.
- The committee offers assistance in developing protocols and statistical analysis plans.
- Their oversight mitigates low statistical power, inadequate study design, and flexibility in data analysis.
Distributed Student Projects
- Student assessment requirements and limited access to populations may hinder collaboration within institutions.
- A distributed student project can be achieved across multiple institutions.
- Academics and students form a consortium to develop a research question, protocol, and analysis plan.
- The protocol is implemented by each student, and the data are pooled for analysis.
- Consortium meetings integrate training in research design and offer creative input.
- The Collaborative Replications and Education Project (CREP; https://osf.io/wfc6u/) is an example, replicating published research in undergraduate courses.
- The Pipeline and Many Labs projects also offer opportunities for large-scale replication efforts.
Data Sharing
- Sharing data in public repositories offers field-wide advantages in accountability, data longevity, efficiency, and quality.
- Many scientific disciplines, including those studying human behavior, lack a culture that values open data.
- Recent initiatives aim to change the normative culture.
- In 2015, Nosek and colleagues proposed author guidelines to help journals and funders adopt transparency and reproducibility policies.
- As of November 2016, there were 757 journal and 64 organization signatories to the TOP guidelines.
- For example, the journal Science decided to publish papers only if the data used in the analysis are available.
Badges to Acknowledge Open-Science Practices
- The Center for Open Science suggests that journals assign badges to articles with open data, pre-registration, and open materials.
- The journal Psychological Science has adopted these badges, and there is evidence that the open data badge has had a positive effect.
The Peer Reviewers’ Openness Initiative
- Researchers who sign this initiative pledge not to provide a comprehensive review of any manuscript that does not make its data publicly available without a clear reason.
Requirements from Funding Agencies
- Prominent funding agencies have increased pressure on researchers to share data.
- For instance, NIH intends to make public access to digital scientific data the standard for all NIH-funded research.
- Since 2010, the NSF has required submission of a data-management plan.
Full Disclosure
- Full disclosure refers to describing in full the study design and data collected, rather than a curated version.
- If readers know how a study was designed and which data were collected, they can adjust their interpretation accordingly; if authors do not disclose whether they collected one variable or fifteen, readers cannot properly evaluate the research.
- The simplest form of disclosure is for authors to assure readers, via an explicit statement in their article, that they are disclosing the data fully. This can be seen as a simple item of reporting guidance where extra emphasis is placed on some aspects that are considered most essential to disclose. For example, including the following 21-word statement: “We report how we determined our sample size, all data exclusions (if any), all manipulations, and all measures in the study”86.
- Alternatively, a more complex, but also more enforceable and accountable, process is for journals to require explicit and specific disclosure statements. The journal Psychological Science, for example, now requires authors to “Confirm that (a) the total number of excluded observations and (b) the reasons for making these exclusions have been reported in the Method section(s)”87.
Registered Reports
- The Registered Reports (RR) initiative seeks to eliminate various forms of bias in hypothesis-driven research, in particular the bias that comes from evaluating a study on the basis of its results.
- Unlike conventional journal articles, RRs split the peer review process into two stages, before and after results are known.
- At the first stage, reviewers and editors assess a detailed protocol that includes the study rationale, procedure and a detailed analysis plan. Following favourable reviews (and probably revision to meet strict methodological standards), the journal offers in-principle acceptance: publication of study outcomes is guaranteed provided the authors adhere to the approved protocol, the study meets pre-specified quality checks, and conclusions are appropriately evidence-bound.
- Once the study is completed, the authors resubmit a complete manuscript that includes the results and discussion. The article is published at the end of this two-stage process.
- By accepting articles before results are known, RRs prevent publication bias. By reviewing the hypotheses and analysis plans in advance, RRs should also help neutralize P-hacking and HARKing (hypothesizing after the results are known) by authors, and CARKing (critiquing after the results are known) by reviewers with their own investments in the research outcomes, although empirical evidence will be required to confirm that this is the case.
- Perhaps the most commonly voiced objection to RRs is that the format somehow limits exploration or creativity by requiring authors to adhere to a pre-specified methodology. However, RRs place no restrictions on creative analysis practices or serendipity. Authors are free to report the outcomes of any unregistered exploratory analyses, provided such tests are clearly labelled as post hoc. Thus, the sole requirement is that exploratory outcomes are identified transparently as exploratory (for a list of frequently asked questions see https://cos.io/rr/#faq). Of course, RRs are not intended for research that is solely exploratory.
- As of November 2016, RRs have been adopted by over 40 journals, including Nature Human Behaviour, covering a wide range of life, social and physical sciences (for a curated list see https://cos.io/rr/#journals).
- The concept also opens the door to alternative forms of research funding that place a premium on transparency and reproducibility. For example, authors could submit a detailed proposal before they have funding for their research. Following simultaneous review by both the funder and the journal, the strongest proposals would be offered financial support by the funder and in-principle acceptance for publication by the journal (https://cos.io/rr/#funders).
Diversifying Peer Review
- For most of the history of scientific publishing, two functions have been confounded: evaluation and dissemination. Journals have provided dissemination via sorting and delivering content to the research community, and gatekeeping via peer review to determine what is worth disseminating. However, with the advent of the internet, individual researchers are no longer dependent on publishers to bind, print and mail their research to subscribers. Dissemination is now easy and can be controlled by researchers themselves. For example, preprint services (arXiv for some physical sciences, bioRxiv and PeerJ for the life sciences, engrXiv for engineering, PsyArXiv for psychology, and SocArXiv and the Social Science Research Network (SSRN) for the social sciences) facilitate easy sharing, sorting and discovery of research prior to publication. This dramatically accelerates the dissemination of information to the research community.
- With increasing ease of dissemination, the role of publishers as a gatekeeper is declining. Nevertheless, the other role of publishing, evaluation, remains a vital part of the research enterprise. Conventionally, a journal editor will select a limited number of reviewers to assess the suitability of a submission for a particular journal. However, more diverse evaluation processes are now emerging, allowing the collective wisdom of the scientific community to be harnessed66. For example, some preprint services support public comments on manuscripts, a form of pre-publication review that can be used to improve the manuscript. Other services, such as PubMed Commons and PubPeer, offer public platforms to comment on published works, facilitating post-publication peer review. At the same time, some journals are trialling 'results-free' review, where editorial decisions to accept are based solely on review of the rationale and study methods alone (that is, results-blind)67.
Transparency and Open Science
- Open science refers to the process of making the content and process of producing evidence and claims transparent and accessible to others.
- Very little of the research process (for example, study protocols, analysis workflows, peer review) is accessible because, historically, there have been few opportunities to make it accessible even if one wanted to do so. This has motivated calls for open access, open data and open workflows (including analysis pipelines), but there are substantial barriers to meeting these ideals, including vested financial interests (particularly in scholarly publishing) and few incentives for researchers to pursue open practices. For example, current incentive structures promote the publication of 'clean' narratives, which may require the incomplete reporting of study procedures or results. Nevertheless, change is occurring. The TOP guidelines54,65 promote open practices, while an increasing number of journals and funders require open practices (for example, open data), with some offering their researchers free, immediate open-access publication with transparent post-publication peer review (for example, the Wellcome Trust, with the launch of Wellcome Open Research).
Conclusion
- The challenges to reproducible science are systemic and cultural, but that does not mean they cannot be met. The measures we have described constitute practical and achievable steps toward improving rigor and reproducibility. All of them have shown some effectiveness, and are well suited to wider adoption, evaluation and improvement. Equally, these proposals are not an exhaustive list; there are many other nascent and maturing ideas for making research practices more efficient and reliable73. Offering a solution to a problem does not guarantee its effectiveness, and making changes to cultural norms and incentives can spur additional behavioural changes that are difficult to anticipate. Some solutions may be ineffective or even harmful to the efficiency and reliability of science, even if conceptually they appear sensible.
- The field of metascience (or metaresearch) is growing rapidly, with over 2,000 relevant publications accruing annually16. Much of that literature constitutes the evaluation of existing practices and the identification of alternative approaches. What was previously taken for granted may be questioned, such as widely used statistical methods; for example, the most popular methods and software for spatial extent analysis in fMRI imaging were recently shown to produce unacceptably high false-positive rates74. Proposed solutions may also give rise to other challenges; for example, while replication is a hallmark for reinforcing trust in scientific results, there is uncertainty about which studies deserve to be replicated and what would be the most efficient replication strategies. Moreover, a recent simulation suggests that replication alone may not suffice to rid us of false results71.
- These cautions are not a rationale for inaction. Reproducible research practices are at the heart of sound research and integral to the scientific method. How best to achieve rigorous and efficient knowledge accumulation is a scientific question; the most effective solutions will be identified by a combination of brilliant hypothesizing and blind luck, by iterative examination of the effectiveness of each change, and by a winnowing of many possibilities to the broadly enacted few. True understanding of how best to structure and incentivize science will emerge slowly and will never be finished. That is how science works. The key to fostering a robust metascience that evaluates and improves practices is that the stakeholders of science must not embrace the status quo, but instead pursue self-examination continuously for improvement and self-correction of the scientific process itself.