Biodiversity infrastructures to crosslink metagenomics and species occurrence data

Deepening collaboration between GBIF and EMBL-EBI to address taxonomic bias and reduce barriers between big, FAIR and open data for biodiversity and metagenomics

Marine diatoms living between crystals of annual sea ice in McMurdo Sound, Antarctica. Photo via Wikimedia Commons.

GBIF and the European Bioinformatics Institute (EMBL-EBI) will extend their collaboration by sharing evidence (species occurrence records) of living creatures and communities known only from their genetic material. This collaboration around metagenomic data adds a significant new data stream to GBIF.org and marks an important step in bridging the gap between biodiversity studies based on molecular data and those that rely on morphological data.

Metagenomics research studies genetic material collected at a single defined time and place, using DNA fragments as signatures to identify organisms from the molecular traces present in the sample. Since these samples are routinely collected using consistent methods, EMBL-EBI can share them on GBIF.org as sampling-event datasets.

EMBL-EBI and GBIF hope that integrating metagenomic data into GBIF.org will increase its usefulness to new audiences, introduce new ways of analysing biodiversity evidence from various environments, and offer fresh perspectives on patterns of biodiversity, in addition to those already available through the GBIF network.

Publication of sequence-based occurrence records from EMBL-EBI will increase the representation of microscopic diversity data available through GBIF.org, reducing a well-known taxonomic bias against microbial lifeforms (see Troudet et al. 2017, among others). These datasets will benefit from enhancements already planned for the GBIF user interface and data services related to sampling-event data.

By exposing relevant sequences within a broader biodiversity-related context, EMBL-EBI hopes to extend its user base to the wider community of biodiversity researchers. GBIF’s DOI-based citation tracking system will alert EMBL-EBI to additional research uses, documenting the wider impact of investments by its member nations.

The joint plan of action evolved quickly during a meeting at the Wellcome Genome Campus in Cambridgeshire, UK, from 19 to 21 November 2018. The meeting was hosted by ELIXIR, an intergovernmental organization of 22 members that includes EMBL-EBI as a node and seeks to coordinate European life-science data resources into a single infrastructure.

“ELIXIR is pleased to help establish closer connections between these two leading research infrastructures, as well as the research communities they serve,” says Jerry Lanfear, Chief Technology Officer of ELIXIR (one of the keynote speakers for the GBIC2 event in July 2018). “We’re eager to see how breaking down barriers and encouraging wider use and reuse of FAIR and open data can demonstrate cost savings and efficiencies while advancing science.”

EMBL-EBI first published data through GBIF in 2014, with a species occurrence dataset based on georeferenced nucleotide sequences from the European Nucleotide Archive that now contains 7.7 million records. Under the new plan, GBIF will stream data from MGnify, EMBL-EBI’s metagenomics resource, as standardized Darwin Core sampling-event datasets. This collaboration will contribute many millions of records through GBIF.org, making eDNA-based data a significant factor in future modelling of biodiversity patterns.
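For readers unfamiliar with the sampling-event model, the sketch below illustrates the general idea: one event records when, where and how a sample was taken, and each organism detected in that sample becomes an occurrence record linked back to the event. The field names are Darwin Core terms, but the class names, the `flatten` helper and every value shown are invented placeholders for illustration. They are not taken from MGnify or GBIF data and do not describe the actual pipeline the teams are building.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Occurrence:
    """One taxon detected in a sample (field names are Darwin Core terms)."""
    occurrenceID: str
    scientificName: str
    basisOfRecord: str = "MaterialSample"
    organismQuantity: int = 0
    organismQuantityType: str = "DNA sequence reads"


@dataclass
class SamplingEvent:
    """One sample taken at a single time and place with a stated protocol."""
    eventID: str
    eventDate: str
    decimalLatitude: float
    decimalLongitude: float
    samplingProtocol: str
    occurrences: List[Occurrence] = field(default_factory=list)


def flatten(event: SamplingEvent) -> List[Dict[str, object]]:
    """Return one row per occurrence, each carrying the shared event context."""
    rows = []
    for occ in event.occurrences:
        rows.append({
            "eventID": event.eventID,
            "eventDate": event.eventDate,
            "decimalLatitude": event.decimalLatitude,
            "decimalLongitude": event.decimalLongitude,
            "samplingProtocol": event.samplingProtocol,
            "occurrenceID": occ.occurrenceID,
            "scientificName": occ.scientificName,
            "basisOfRecord": occ.basisOfRecord,
            "organismQuantity": occ.organismQuantity,
            "organismQuantityType": occ.organismQuantityType,
        })
    return rows


def build_example_event() -> SamplingEvent:
    """Build a hypothetical event; all identifiers and values are made up."""
    return SamplingEvent(
        eventID="example-event-001",
        eventDate="2018-11-19",
        decimalLatitude=-77.5,
        decimalLongitude=165.0,
        samplingProtocol="hypothetical sea-ice core, shotgun metagenomics",
        occurrences=[
            Occurrence("example-occ-001", "Bacillariophyta", organismQuantity=120),
            Occurrence("example-occ-002", "Proteobacteria", organismQuantity=450),
        ],
    )


if __name__ == "__main__":
    for row in flatten(build_example_event()):
        print(row["eventID"], row["scientificName"], row["organismQuantity"])
```

Because every occurrence carries the same `eventID`, records from one sample can be regrouped and compared as a community, which is what distinguishes sampling-event data from isolated occurrence records.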

“By connecting these distinct but complementary types of evidence, we can further develop GBIF.org as the evidence base for variation in all aspects of biodiversity,” says Donald Hobern, GBIF Executive Secretary. “We look forward to supporting further advances in our understanding of how microbial diversity shapes and responds to diversity in other taxa.”

The collaboration between MGnify and GBIF builds on the recent addition of non-Linnaean ‘operational taxonomic units’ and eDNA-based occurrences for fungi from the UNITE and BIOWIDE projects. Meeting attendees also included representatives from GLOMICON and SILVA, both of which will contribute molecular reference libraries for organizing data on prokaryotes, protists, and other microbial and cryptic biodiversity.

The informatics teams at GBIF and MGnify are already developing an initial prototype, to be refined during a hands-on workshop in spring. The teams hope to present outcomes at biodiversity_next in October 2019. As the work progresses, GBIF, ELIXIR and EMBL-EBI will invite participation from additional holders and aggregators of molecular data, further widening the use and uptake of molecular-based occurrences by a range of biodiversity data indexes across wilderness biomes.

Meeting attendees, Wellcome Genome Campus, Cambridgeshire, United Kingdom, 19-21 Nov 2018. Left to right: Pier Luigi Buttigieg (AWI/GLOMICON/SILVA), Rob Finn (EBI MGnify), Guy Cochrane (EBI ENA), Donald Hobern (GBIF), Rachel Drysdale, Dmitry Schigel (GBIF), Jerry Lanfear (ELIXIR Hub), Christian Quast (SILVA), Corinne Martin (ELIXIR Hub), Thomas Jeppesen (GBIF). Not pictured: Alex Mitchell, Ola Tarkowska (EBI MGnify).