A Hidden Universe of Uncertainty: Sharing Analysis Pipelines Reveals Previously Hidden Decision Points
Reproducibility means that others can produce the same results when provided with the data, code, and documentation.1 Computational reproducibility is increasingly viewed as a minimum standard for scientific research.1-2 Despite this, fewer than 0.5% of publications share their code.3 Below, we explain why Aligning Science Across Parkinson’s (ASAP) requires code used in a manuscript to be shared in a publicly accessible repository, such as GitHub.
Same Data, Different Conclusions
Several studies have shown that independent researchers can reach strikingly different results when analyzing the same dataset.4-7 In one study, 70 research teams evaluated nine hypotheses using the same neuroimaging data.4 Each team chose a unique analysis pipeline, leading to diverse outcomes across the hypotheses. In another study, 71 researchers were given the same social science dataset, and no two arrived at the same conclusions.5
“Researchers can take literally millions of different paths in wrangling, analyzing, presenting, and interpreting their data…The conclusions of this study were derived from myriad seemingly minor analytical decisions.”
Breznau, et al. Observing many researchers using the same data and hypothesis reveals a hidden universe of uncertainty. (2022)
The lack of consensus appears to stem from differences in analytical approach, such as data preprocessing and analysis techniques. In both studies, no two researchers used the same approach during data analysis.4-5 Even after coding all identifiable decisions in each researcher’s workflow, 95.2% of the total variance remained unexplained, suggesting that, even when researchers are asked to submit a detailed report of their analysis methods, knowledge and information about the analysis process is lost.5
- 0% of researchers used the same analytical approach
- 95.2% of total variance is unexplained by all identifiable decisions in workflows
“Researchers must make analytical decisions so minute that they often do not even register as decisions. Instead, they go unnoticed as non-deliberate actions following ostensibly standard operating procedures. Our study shows that, when taken as a whole, these hundreds of decisions combine to be far from trivial… [Our study] revealed the hidden universe of consequential decisions and contextual factors that vary across researchers.”
Breznau, et al. Observing many researchers using the same data and hypothesis reveals a hidden universe of uncertainty. (2022)
Code Should Not Be Withheld From View
Sharing analytical code is also critical because it provides an unambiguous record of the methods used. This is not only crucial for reproducibility and replicability but also essential for detecting programming errors. A notable example is an article retracted in 2019 after it was discovered that the control and intervention group variables had been switched during analysis, which led to the findings being “almost completely reversed.”8-9
“It is baffling that we are expected to rely on brief narrative text descriptions for complex technical data analysis…Research cannot progress at pace with its most foundational text – the code that analyzes the data – withheld from view.”
Goldacre, et al. Why researchers should share their analytic code: Retraction of a trial shows the importance of transparency. (2019)
ASAP’s Approach to the Hidden Universe of Uncertainty
When code for data processing, cleaning, and analysis is not shared publicly with a readme file and the details needed for reuse and evaluation, the analytical method falls into a “hidden universe of uncertainty.”5 To reduce this uncertainty, ASAP requires grantees to share code in a publicly accessible repository with a permanent identifier. This includes scripts, software, packages, macros, pipelines, and any other code used for data manipulation. An assessment of code sharing in ASAP-funded publications shows that the policy has increased code sharing at publication: in the first year of the policy’s implementation, 13.3% of newly generated code was shared in the final publication, whereas so far in 2024, 43.4% of newly generated code in ASAP-funded publications has been shared in a publicly accessible repository, a 225.6% relative increase.
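For readers who want to check the arithmetic behind that figure, the short Python sketch below (written for this post, not taken from the ASAP assessment) shows how a relative increase is computed from two percentages. With the rounded values above it returns roughly 226%; the small difference from the reported 225.6% likely reflects the unrounded underlying counts.

```python
def relative_increase(baseline_pct: float, later_pct: float) -> float:
    """Return the percentage change from baseline_pct to later_pct."""
    return (later_pct - baseline_pct) / baseline_pct * 100

# Rounded code-sharing rates reported above: 13.3% in year one, 43.4% so far in 2024.
print(f"{relative_increase(13.3, 43.4):.1f}% relative increase")  # ~226.3% with these rounded inputs
```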
Learn More About Code Sharing and the ASAP Open Science Policy:
- ASAP Open Science Policy
- Assigning a DOI to GitHub Code Guide: a guidance document with step-by-step instructions on how to assign a DOI to GitHub code using Zenodo.
- Readme for Code Guide: a guidance document that describes how to write a readme file for shared code and its associated metadata to improve reusability.
References
[1] National Academies of Sciences, Engineering, and Medicine. Reproducibility and replicability in science. The National Academies Press (2019). https://nap.nationalacademies.org/catalog/25303/reproducibility-and-replicability-in-science
[2] Alston & Rick. A beginner’s guide to conducting reproducible research. Bulletin of the Ecological Society of America (2021). https://esajournals.onlinelibrary.wiley.com/doi/full/10.1002/bes2.1801
[3] Hamilton, et al. Prevalence and predictors of data and code sharing in the medical and health sciences: Systematic review with meta-analysis of individual participant data. BMJ 382: e075767 (2023). https://pubmed.ncbi.nlm.nih.gov/37433624/
[4] Botvinik-Nezer, et al. Variability in the analysis of a single neuroimaging dataset by many teams. Nature 582(7810): 84-88 (2020). https://pubmed.ncbi.nlm.nih.gov/32483374/
[5] Breznau, et al. Observing many researchers using the same data and hypothesis reveals a hidden universe of uncertainty. Proc Natl Acad Sci USA 119(44): e2203150119 (2022). https://pubmed.ncbi.nlm.nih.gov/36306328/
[6] Schweinsberg, et al. Same data, different conclusions: Radical dispersion in empirical results when independent analysts operationalize and test the same hypothesis. Organizational Behavior and Human Decision Processes 165: 228-249 (2021). https://www.sciencedirect.com/science/article/pii/S0749597821000200
[7] Gould, et al. Same data, different analysts: Variation in effect sizes due to analytical decisions in ecology and evolutionary biology. EcoEvoRxiv (2023). https://ecoevorxiv.org/repository/view/6000/
[8] Aboumatar and Wise. Notice of retraction, Aboumatar et al: Effect of a program combining transitional care and long-term self-management support on the outcomes of hospitalized patients with chronic obstructive pulmonary disease: A randomized clinical trial. JAMA 320(22):2335-2343 (2018). https://jamanetwork.com/journals/jama/fullarticle/2752474
[9] Goldacre, et al. Why researchers should share their analytic code: Retraction of a trial shows the importance of transparency. BMJ 367: l6365 (2019). https://pubmed.ncbi.nlm.nih.gov/31753846/