New guide published on sharing DNA-derived occurrence data

Digital documentation offers practical how-to aimed at extending the utility of genomic and metagenomic data

Velvet shank (Flammulina velutipes), Kursk, Russian Federation. Photo 2020 Oleg Ryzhkov via iNaturalist research-grade observations, licensed under CC BY-NC 4.0.

The GBIF Secretariat has released a new guide, Publishing DNA-derived data through biodiversity data platforms, aimed at providing holders of genomic and metagenomic information with practical considerations for resurfacing DNA-derived occurrences in biodiversity data platforms like GBIF.org.

An expert team of co-authors from Australia, Estonia, Norway, Sweden and Denmark has described principles and practices for holders of DNA-based data interested in increasing its usability beyond its initial "omics" contexts.

The use of genetic data to detect, describe, classify and quantify taxa has become widespread in molecular ecology, phylogenetics and other areas of biodiversity research. The new guide helps holders of such data realize the potential of making it accessible to a wider range of biodiversity research and policy.

The authors set aside platform-specific details documented elsewhere and focus instead on common terms and schemas, typical pitfalls and best practices. As a result, their general approach applies to sharing DNA-derived data through any biodiversity data platform, including the many national systems used around the world.

"There is only one biodiversity, no matter how we document it. So resurfacing the DNA-derived data across relevant infrastructures is timely and important," said Andrew Young, Director of National Research Collections Australia and a member of the GBIF executive committee, who first suggested the idea for a guide two years ago at biodiversity_next. "Connecting the evidence in use across the genomic- and biodiversity-related fields provides a holistic digital picture of nature, and having clear guidelines on how to do it will enable the GBIF community to take action on the world's rapidly accumulating store of DNA-derived biodiversity data."

The guide's publication will also allow GBIF network members including the Ocean Biodiversity Information System (OBIS) and the Atlas of Living Australia to align the efforts of their own networks to combine DNA-detected records with existing data from museum specimens, field surveys and monitoring, citizen-science projects and other sources.

"Because of the lack of clear guidelines, we were missing out on newly created evidence of biodiversity based on molecular methods including the increasing use of environmental DNA for species detection," said Ward Appeltans, coordinator of the OBIS at UNESCO. "While that domain is growing rapidly, we feel this guide should now encourage the many molecular biodiversity observing programmes to also publish their georeferenced species occurrence data via our systems."

The guide benefitted from discussions with members of several different DNA-focused communities, including the Biodiversity Information Standards (TDWG) Genomic Biodiversity Working Group and the TDWG task group on sustainable Darwin Core-MIxS interoperability. These communities' collective expertise helped clarify how best to apply terms from genomic data standards and guidelines including MIxS, GGBN and MIQE. These additions are supported through a new Darwin Core extension for DNA-derived data now in production in both the GBIF Integrated Publishing Toolkit (IPT) and GBIF.org.

While the practices detail how to share DNA-derived data in biodiversity platforms, the guide reinforces the community expectation that primary genomic data is first shared through the International Nucleotide Sequence Database Collaboration (INSDC). By resurfacing DNA-derived data through other biodiversity platforms, data publishers contribute to a more complete view of the world's biodiversity, with specimens, observations and sequences contributing to a single searchable resource. Cross-platform indexing of data helps produce richer datasets with better metadata, while GBIF's data-clustering algorithm identifies potential links between the different sources of evidence used in natural history and molecular biology.

The guide represents the next stage in GBIF's efforts to connect its data infrastructure and tools with relevant sources of genomic and metagenomic information, building on collaborations with the UNITE Community, EMBL's European Bioinformatics Institute (through both its European Nucleotide Archive and MGnify platform), and the International Barcode of Life Consortium (IBOL).

Authors