Natural history collections, environmental monitoring programmes, recording societies, citizen science projects and others all hold valuable data on the world’s biodiversity. They collect and manage their information in many different systems and environments, and their datasets vary widely in the kinds of details captured and stored for each record.
So how can we integrate these diverse datasets most simply and efficiently so scientists, analysts and policymakers can use them in research and policy?
The Darwin Core Standard (DwC) offers a stable, straightforward and flexible framework for compiling biodiversity data from varied and variable sources. Originally developed by the Biodiversity Information Standards (TDWG) community, Darwin Core is 'an evolving community-developed biodiversity data standard'. It plays a fundamental role in the sharing, use and reuse of open-access biodiversity data and today accounts for the vast majority of the hundreds of millions of species occurrence records available through GBIF.org.
In practice, using Darwin Core revolves around a standard file format, the Darwin Core Archive (DwC-A). This compact package (a ZIP file) contains interconnected text files and enables data publishers to share their data using a common terminology. This standardization not only simplifies the process of publishing biodiversity datasets, it also makes it easy for users to discover, search, evaluate and compare datasets as they seek answers to today’s data-intensive research and policy questions.
What’s in an archive?
When preparing a Darwin Core Archive from their source data, publishers restructure and streamline information into a small but structured collection of text files. One of these files is the ‘core’ file and holds a separate record for each of the items included in the archive. Other ‘extension’ files may also be included. These contain additional information linked to the records in the core file. Because many extension records can point to the same core record, extension files allow the archive to model many-to-one relationships.
Depending on how much information the source data contains—and how much they wish to share—publishers can create a Darwin Core Archive with one of three cores:
- a Taxon core, which lists a set of species, typically coming from the same region or sharing common characteristics
- an Occurrence core, which lists a set of times and locations at which particular species have been recorded
- an Event core, which lists field studies (including the protocols used, the sample size, and the location for each).
In the case of an Event core, one extension file usually contains the elements displayed in an Occurrence core, which enables the inclusion of many observation records as part of a single planned field study.
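The link between an Event core and its Occurrence extension can be sketched in a few lines of Python. This is a minimal illustration, not part of the standard: the file contents, identifiers such as `ev-1` and the helper names `read_tsv` and `occurrences_by_event` are made up for the example. It assumes only that every occurrence row carries the `eventID` of the event it belongs to.

```python
import csv
import io
from collections import defaultdict

# Hypothetical Event core: one row per sampling event (sample data, not real records).
EVENT_TXT = (
    "eventID\tsamplingProtocol\tsampleSizeValue\tsampleSizeUnit\n"
    "ev-1\tpitfall trap\t24\thour\n"
    "ev-2\tpitfall trap\t24\thour\n"
)

# Hypothetical Occurrence extension: many rows may share one eventID.
OCCURRENCE_TXT = (
    "eventID\toccurrenceID\tscientificName\n"
    "ev-1\tocc-1\tCarabus nemoralis\n"
    "ev-1\tocc-2\tPterostichus niger\n"
    "ev-2\tocc-3\tCarabus nemoralis\n"
)


def read_tsv(text):
    """Parse tab-delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def occurrences_by_event(events, occurrences):
    """Map each core eventID to the extension rows that point to it."""
    grouped = defaultdict(list)
    for occ in occurrences:
        grouped[occ["eventID"]].append(occ)
    # Events without any occurrence rows still appear, with an empty list.
    return {ev["eventID"]: grouped.get(ev["eventID"], []) for ev in events}


events = read_tsv(EVENT_TXT)
occurrences = read_tsv(OCCURRENCE_TXT)
linked = occurrences_by_event(events, occurrences)

for event_id, occs in linked.items():
    print(event_id, [occ["scientificName"] for occ in occs])
```

Each event appears once in the core, while any number of occurrence rows can refer back to it, which is what lets a single planned field study carry many observation records.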
Finally, each archive contains two more pieces that help both machines and humans interpret the data. The first, a descriptor file (meta.xml), defines the precise structure and relationships between the core and any extensions. The second, a complementary metadata file, describes the dataset contained in the archive, typically in Ecological Metadata Language (eml.xml). Publishers who use GBIF’s Integrated Publishing Toolkit do not need to write these files by hand, as the toolkit produces them automatically.
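To make the role of the descriptor file concrete, the sketch below builds a tiny archive in memory and then reads its meta.xml to find out which file is the core and which are extensions. It is a simplified illustration under stated assumptions: the file names `event.txt` and `occurrence.txt`, the sample rows and the helper names `build_archive` and `describe_archive` are chosen for the example, and the `eml.xml` entry is an empty placeholder rather than valid EML. A real archive would typically declare many more fields.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by Darwin Core Archive descriptor files.
NS = {"dwca": "http://rs.tdwg.org/dwc/text/"}

# A deliberately small descriptor: one Event core and one Occurrence extension.
META_XML = """<?xml version="1.0" encoding="UTF-8"?>
<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
  <core encoding="UTF-8" fieldsTerminatedBy="\\t" linesTerminatedBy="\\n"
        ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Event">
    <files><location>event.txt</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/samplingProtocol"/>
  </core>
  <extension encoding="UTF-8" fieldsTerminatedBy="\\t" linesTerminatedBy="\\n"
             ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.txt</location></files>
    <coreid index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
  </extension>
</archive>
"""


def build_archive():
    """Return the bytes of a minimal, illustrative archive (a ZIP file)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("meta.xml", META_XML)
        zf.writestr("eml.xml", "<eml/>")  # placeholder only, not valid EML
        zf.writestr("event.txt", "eventID\tsamplingProtocol\nev-1\tpitfall trap\n")
        zf.writestr("occurrence.txt", "eventID\tscientificName\nev-1\tCarabus nemoralis\n")
    return buf.getvalue()


def describe_archive(data):
    """Read meta.xml and report the core, the extensions and the metadata file."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        root = ET.fromstring(zf.read("meta.xml"))
    core = root.find("dwca:core", NS)
    return {
        "metadata": root.get("metadata"),
        "core": (core.get("rowType"), core.find("dwca:files/dwca:location", NS).text),
        "extensions": [
            (ext.get("rowType"), ext.find("dwca:files/dwca:location", NS).text)
            for ext in root.findall("dwca:extension", NS)
        ],
    }


summary = describe_archive(build_archive())
print(summary)
```

The `id` element in the core and the `coreid` element in the extension name the columns that tie extension rows back to core records.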
Sharing species monitoring and sampling data with the Event Core
Efforts to track shifts in biodiversity patterns over space and time have increased the amount of species information available through sampling and monitoring programmes. In addition to having more precisely described methods than ‘presence-only’ data, these sample-based datasets capture richer, more complex details about species quantities and frequency.
With their frequent inclusion of repeated measurements from the same places, sampling-event data from ecological and environmental investigations are better at detecting shifts and trends in species populations—and critical to understanding the scope and speed of global change.
But to help make the most of these diverse data and ensure their efficient contribution to more precise scientific analyses and policy outcomes, researchers need easy access to them in a consistent, compatible format.
The Darwin Core Standard has become the most widely used open-access standard for biodiversity data. Developed to provide a simple way to document and share information about species occurrences, whether in the field or in a museum collection, the standard has made it possible to integrate hundreds of millions of records through GBIF.org.
Recent additions to Darwin Core, detailed below, support the aggregation of sampling-event datasets. The newly introduced ‘Event core’ places the sampling event at the center of the simplified dataset and links its protocol, effort and measurements to the species occurrences derived from the sampling events, which are appended as a separate extension in the standard’s one-to-many star schema.
As a result, researchers can now tap into more complex, quantitatively richer records for analyses and combine them with others focused on single organisms or individual taxa. These changes could even lead to improvements in the quality and usefulness of datasets already published on GBIF.org that derive from more complex surveys and censuses.
The hope is that mingling these varied sources of data will, rather than limiting or prescribing their uses, encourage their discovery and reuse—and perhaps even reveal higher-level relationships and insights that would not be apparent from examining individual records.
How to get started
The most efficient way to prepare and publish Darwin Core-based datasets is through GBIF’s Integrated Publishing Toolkit. EU BON and other partners provided vital contributions toward the changes needed to support this new class of datasets. Data holders with ongoing monitoring programmes and sampling projects can also configure automatically scheduled publishing cycles on the multilingual-friendly IPT.
What’s new in the DwC-A event core
The addition of the ‘event core’ to the Darwin Core Standard includes several new terms highly applicable to sample-based and monitoring data.
- eventID: an identifier specific to the event in a dataset
- parentEventID: an identifier that groups events
- samplingProtocol: the name of, reference to, or description of the method or protocol used during a sampling event
- sampleSizeValue: numeric value for the size (duration, length, area or volume) of a sample in a sampling event. Must have a corresponding sampleSizeUnit
- sampleSizeUnit: the unit of measure of the size (sampleSizeValue)
- organismQuantity: a number for the quantity of organisms. Must have a corresponding organismQuantityType
- organismQuantityType: the type of quantification system used for the quantity of organisms
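The two pairing rules in the list above (sampleSizeValue needs sampleSizeUnit, and organismQuantity needs organismQuantityType) are easy to check before publishing. The following is a minimal sketch of such a check, assuming each record is available as a Python dictionary keyed by term name. The function names `is_filled` and `check_paired_terms` and the sample records are hypothetical and are not part of Darwin Core or of any GBIF tool.

```python
# Terms that, per the list above, must be accompanied by a partner term.
PAIRED_TERMS = [
    ("sampleSizeValue", "sampleSizeUnit"),
    ("organismQuantity", "organismQuantityType"),
]


def is_filled(record, term):
    """True if the term is present and not blank."""
    value = record.get(term)
    return value is not None and str(value).strip() != ""


def check_paired_terms(record):
    """Return a list of human-readable problems for one record."""
    problems = []
    for value_term, partner_term in PAIRED_TERMS:
        if is_filled(record, value_term) and not is_filled(record, partner_term):
            problems.append(f"{value_term} is set but {partner_term} is missing")
    return problems


# Illustrative records (made-up values).
complete = {"eventID": "ev-1", "sampleSizeValue": "24", "sampleSizeUnit": "hour"}
incomplete = {"eventID": "ev-2", "organismQuantity": "12", "organismQuantityType": " "}

print(check_paired_terms(complete))
print(check_paired_terms(incomplete))
```

The check is one-directional on purpose: it flags a value without its partner, which is the requirement stated above, and makes no claim about a unit or type that appears on its own.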