Data graveyards: A holding place for poorly curated, inaccessible datasets

Author(s)
  • ASAP
    Program Officer

    Dana Lewis, PhD

    Aligning Science Across Parkinson’s (ASAP) | USA

    Dr. Dana Lewis is a Program Officer at the Coalition for Aligning Science (CAS) and Aligning Science Across Parkinson’s (ASAP), a basic science initiative aimed at unraveling the etiology of Parkinson’s disease. Dr. Lewis earned her PhD in Neuroscience from George Washington University in the laboratory of Dr. Zayd Khaliq at the National Institute of Neurological Disorders and Stroke. She completed her postdoctoral work at the Johns Hopkins University School of Medicine in the laboratory of Dr. Maya Opendak at the Kennedy Krieger Institute. As a graduate student and postdoctoral fellow, Dr. Lewis focused her research on connecting neurophysiological measurements of mesolimbic circuits with behavior and biomarkers of disease. In addition to her expertise in neurophysiology and systems neuroscience, Dr. Lewis is passionate about facilitating science communication to scientists and nonscientists alike. She has served as an editor of an undergraduate research journal and a scientific community newsletter, an education consultant for a patient-focused nonprofit, a lecturer, and a member of numerous committees focused on the communication and dissemination of science.

Sharing data in publicly accessible repositories is a key component of the FAIR Guiding Principles for Scientific Data Management and Stewardship. Because curating data is time-consuming, many researchers prefer instead to note that data is “available upon request” or to store it in a supplemental file linked to the publication. Below, we make the case for why ASAP requires that all data generated for a publication be deposited in a publicly accessible repository, so that the data is not lost to the research community.

Data not available upon request

Many journals now require authors to include a Data Availability statement as part of their publication. Authors commonly use this space to state that the data is “available upon request”; however, several studies across scientific disciplines have found that authors often do not or cannot share their data when asked. In one large-scale study examining nearly 900 papers published in Nature and Science between 2000 and 2019, authors of 60.7% of papers with “data available upon request” statements did not share their data when researchers contacted them with this request. The most common reasons for not sharing data were no response (41.3%), lack of time to search for data (29.2%), loss of data (27.7%), and privacy or legal concerns (23.1%).

  • 54.2% of papers published in Nature and Science between 2000-2019 shared their data
  • 39.4% of authors shared data upon request
  • 41.3% of authors did not respond to requests for data sharing
  • 56.9% of authors reported lost data or not knowing where the data was stored

“As our study exemplifies, the ‘data available upon request’ model is insufficient to ensure access to datasets and other critical materials…While the majority of data are eventually available, it is alarming that less than half of the data stated to be available upon request could be effectively obtained from the authors. These figures are in the top end of studies conducted thus far and indicate the relatively superior overall data availability in Science and Nature compared with other journals.”

Tedersoo, et al. Data sharing practices and data availability upon request differ across scientific disciplines. Sci Data 8, 192 (2021).

Supplemental data is lost through link rot and content drift

Sharing data through supplemental materials is another common practice in the research community. However, supplemental materials are not persistent and discoverable in the same way as data in a publicly accessible repository that is linked in the main manuscript. Supplemental files are not hosted as part of the journal article itself. As a result, they are not assigned a persistent identifier (DOI or accession number), and the URL associated with the supplemental data is subject to link rot (the link stops resolving) and content drift (the link still resolves, but the content behind it has changed). One study of 655 supplementary data links from PubMed abstracts found that, on average, only 74% of manuscripts had links that were still accessible.

ASAP’s approach to data graveyards

Data that is not shared in a publicly accessible repository risks entering a “data graveyard,” where it becomes inaccessible and unused. To avoid these data graveyard scenarios and to improve the persistence, discoverability, and reusability of datasets, ASAP requires grantees to share datasets and associated metadata in a publicly accessible repository by the time of publication. An assessment of ASAP-funded publications revealed that implementation of this policy resulted in an increase in data sharing at the time of publication when compared to the first ASAP-funded publications. In the first year of ASAP policy implementation, 47.37% of datasets were shared in publicly accessible repositories. By contrast, so far in 2024, 73.33% of datasets in ASAP-funded publications were shared in publicly accessible repositories. That is a relative increase of 54.8% in data sharing (from 47.37% to 73.33%, or about 26 percentage points).
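The 54.8% figure is a relative increase, which is easy to confuse with the change in percentage points. The short Python sketch below reproduces both numbers from the two sharing rates reported above; the function and variable names are illustrative and not part of any ASAP tooling.

```python
def percentage_point_change(before: float, after: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return after - before


def relative_increase(before: float, after: float) -> float:
    """Change from `before` to `after`, expressed as a percent of `before`."""
    return (after - before) / before * 100.0


first_year_rate = 47.37  # % of datasets shared in the first policy year
rate_2024 = 73.33        # % of datasets shared so far in 2024

points = percentage_point_change(first_year_rate, rate_2024)
relative = relative_increase(first_year_rate, rate_2024)

print(f"Change: {points:.2f} percentage points")  # Change: 25.96 percentage points
print(f"Relative increase: {relative:.1f}%")      # Relative increase: 54.8%
```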

To learn more about data sharing and the ASAP Open Science Policy, visit: