Data integration enables global biodiversity synthesis

Study of more than 4,000 papers relying on GBIF-mediated data shows that sharing biodiversity data freely and openly enables global research at multiple scales spanning disciplinary boundaries

Virginia spring beauty (Claytonia virginica) observed in Plum, PA, USA by J. Mason Heberling. Photo via iNaturalist (CC BY 4.0)

J. Mason Heberling of the Carnegie Museum of Natural History and Scott B. Weingart of Carnegie Mellon University, in collaboration with the GBIF Secretariat, have today published in the Proceedings of the National Academy of Sciences (PNAS) a comprehensive analysis and review of more than 4,000 peer-reviewed studies published between 2003 and 2019 that make use of GBIF-mediated data. This study provides an evidence base for the ongoing development of the next generation of biodiversity-related research.

In addition to drawing on article metadata assembled through the GBIF literature tracking programme, Heberling and colleagues employed quantitative text analysis and bibliographic synthesis of the 4,000+ papers to explore the network of interdisciplinary knowledge facilitated through GBIF-mediated data.

Key research findings

  • Both data availability and data use have increased
  • Data integration facilitates global research and access
  • Uses of GBIF-mediated data span disciplinary boundaries
  • The scientific areas using GBIF-mediated data are conceptually diverse and change in prevalence over time
  • Globally integrated datasets enable researchers to ask both basic and applied questions at taxonomic, temporal and spatial scales that would otherwise be impossible
  • The synergistic roles of observation- and specimen-based biodiversity data highlight the value of, and need for, deeper integration with phylogenetic, environmental, phenotypic, ecological and genetic sources of data

"This study will enable the GBIF community not only to understand the patterns and trends of the scientific data use in the recent research literature, but also to identify opportunities for diversifying the data use portfolio of GBIF in research," said Tanya Abrahamse, chair of the GBIF Governing Board. "We're eager to pursue the promising directions highlighted by Dr Heberling and his co-authors to the overall benefit of the GBIF network, its users, and science and society at large."

Structural topic model results from 4,035 studies that used GBIF-mediated data published between 2003 and 2019.

"This exciting new analysis demonstrates that increasing data availability incentivizes novel uses of primary biodiversity data within and beyond the life sciences," said Enrique Martinez Meyer of Universidad Nacional Autónoma de México and 2nd vice chair of the GBIF Science Committee. "Given the critical need of facing the challenges imposed by rapid environmental change and biodiversity loss, these results will help direct the GBIF community's efforts to improve the quality and availability of biodiversity data for different use profiles."

In its independent 20-year review released in 2020, the Committee on Data of the International Science Council (CODATA) concluded that "GBIF is the most comprehensive, openly available, application-agnostic (most unbiased), easiest-to-use, and modern access point to known digital occurrence data" on biodiversity. But despite the technological, analytic and cultural advances made since GBIF's formation in 2001, the scientific impact and patterns of use of the data aggregated by the world's largest biodiversity data network have not previously been quantified systematically.

While the study was commissioned and funded by GBIF, the Secretariat had no influence on analysis, results or conclusions.

Heberling JM, Miller JT, Noesgaard D, Weingart SB and Schigel D (2021) Data integration enables global biodiversity synthesis. Proceedings of the National Academy of Sciences 118(6): e2018093118. Available at: https://doi.org/10.1073/pnas.2018093118.