gbif.org

The Darwin Core standard has been used to mobilise the vast majority of specimen occurrence and observational records within the GBIF network. It was originally conceived to facilitate the discovery, retrieval, and integration of information about modern biological specimens, their spatio-temporal occurrence, and their supporting evidence housed in collections (physical or digital). It achieved this by defining an ordered list of items, published in an XML document.

The Darwin Core today is broader in scope. It aims to provide a stable, standard reference for sharing information on biological diversity. As a glossary of terms, the Darwin Core provides stable semantic definitions with the goal of being maximally reusable in a variety of contexts. This means that Darwin Core may still be used in the same way it has historically been used, but may also serve as the basis for building more complex exchange formats, while still ensuring interoperability through a common set of terms. One such exchange format is defined in the GBIF Integrated Publishing Toolkit, which allows multiple extensions to be defined around a core ‘taxon occurrence’ or species entity. These extensions provide a means of attaching multiple identifications to a specimen, multiple images to a specimen, or multiple common names to a taxon concept. This became possible once the scope of the Darwin Core was broadened and its structure redefined as a reusable glossary of terms.

Darwin Core Archives

The preferred format for publishing data to the GBIF network is the Darwin Core Archive (DwC-A), which is essentially a set of text files (e.g., tab- or comma-delimited) with a simple descriptor to inform others how your files are organized. The format is defined in the Darwin Core text guidelines.
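To make the role of the descriptor concrete, here is a minimal sketch in Python of how a consumer might read one. It assumes the descriptor is an XML file in the style of the Darwin Core text guidelines (conventionally named meta.xml). The sample descriptor, its column layout and the helper name `describe_core` are illustrative assumptions, not taken from a real archive.

```python
import xml.etree.ElementTree as ET

# Namespace used by Darwin Core text descriptors.
NS = {"dwc": "http://rs.tdwg.org/dwc/text/"}

# Illustrative descriptor (assumed content): one core file, three columns.
META_XML = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core encoding="UTF-8" fieldsTerminatedBy="\\t" linesTerminatedBy="\\n"
        ignoreHeaderLines="1" rowType="http://rs.tdwg.org/dwc/terms/Taxon">
    <files><location>whales.txt</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/family"/>
  </core>
</archive>"""


def describe_core(meta_xml: str) -> dict:
    """Summarise the core file: its name, row type, id column and term columns."""
    root = ET.fromstring(meta_xml)
    core = root.find("dwc:core", NS)
    return {
        "file": core.find("dwc:files/dwc:location", NS).text,
        "row_type": core.get("rowType"),
        "id_column": int(core.find("dwc:id", NS).get("index")),
        # Map column index -> short term name (last segment of the term URI).
        "columns": {
            int(field.get("index")): field.get("term").rsplit("/", 1)[-1]
            for field in core.findall("dwc:field", NS)
        },
    }


summary = describe_core(META_XML)
```

With this summary in hand, a reader knows which file to open and which Darwin Core term each column carries, without guessing from header names.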

The updated Darwin Core is no longer strictly bound to occurrence data. Together with Dublin Core (on which its ideas are based), it is used by GBIF to encode data about organism names, taxonomies, species information and distributions; GBIF also uses it to list publications.

The central idea of this archive is that its data files are logically arranged in a star-like manner, with one core data file surrounded by any number of ‘extensions’. Each extension record (or ‘extension file row’) points to a record in the core file; in this way, many extension records can exist for each single core record.
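The star arrangement can be sketched as a one-to-many join. The Python example below is a hedged illustration, not the way any particular GBIF tool does it: the column names `id` and `coreid`, the sample rows and the helper names `read_rows` and `attach_extension` are all assumptions made for the example.

```python
import csv
import io
from collections import defaultdict

# Made-up sample data: a core file and one extension file, tab-delimited.
CORE = "id\tscientificName\n1\tBalaenoptera musculus\n2\tOrcinus orca\n"
DISTRIBUTION = (
    "coreid\tlocality\n"
    "1\tNorth Atlantic\n"
    "1\tSouthern Ocean\n"
    "2\tNorth Pacific\n"
)


def read_rows(text: str) -> list:
    """Parse tab-delimited text with a header line into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def attach_extension(core_rows: list, ext_rows: list, name: str) -> list:
    """Attach to each core record the extension rows that point at it."""
    by_core = defaultdict(list)
    for row in ext_rows:
        by_core[row["coreid"]].append(row)
    # Core records with no extension rows get an empty list.
    return [{**core, name: by_core.get(core["id"], [])} for core in core_rows]


joined = attach_extension(read_rows(CORE), read_rows(DISTRIBUTION), "distribution")
```

Here the first core record ends up with two distribution rows and the second with one, which is exactly the many-extension-rows-per-core-record relationship described above.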

For example, in a whales archive with two extensions, one lists the geographic distribution and the other the type of specimens for whale species; the species themselves are listed in the core file ‘whales.txt’. But even a single text file that simply lists classic Darwin Core occurrence records is of great value to the GBIF network.

Records in the core data file support one of two types (or classes) of biodiversity data.

Details about recommended extensions can be found in their respective subsections and are extensively documented in the GBIF registry, which catalogues all available extensions.  

Sharing entire datasets instead of using pageable web services like DiGIR and TAPIR allows much simpler and more efficient data transfer. For example, retrieving 260,000 records via TAPIR takes about nine hours and issues 1,300 HTTP requests (200 records per request) to transfer 500 MB of XML-formatted data. The exact same dataset, encoded as DwC-A and zipped, becomes a 3 MB file. An archive comprising more than one file must be packaged with ZIP or GZIP, generating the DwC-A from a folder or set of files so that the archive can be identified by a single URL. When using GBIF's Integrated Publishing Toolkit, this is done automatically.
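For those packaging files by hand, the zipping step might look like the following Python sketch. It uses only the standard library; the function name `zip_archive` and the flat folder layout (all data files and the descriptor directly inside one folder) are assumptions for illustration.

```python
import zipfile
from pathlib import Path


def zip_archive(folder, target):
    """Bundle every file directly inside `folder` into one ZIP file at `target`."""
    folder = Path(folder)
    target = Path(target)
    with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(folder.iterdir()):
            # Skip sub-folders, and skip the output file if it lives in `folder`.
            if path.is_file() and path.resolve() != target.resolve():
                # Store files at the top level of the archive, without the folder name.
                zf.write(path, arcname=path.name)
    return target
```

Calling `zip_archive("whales", "whales.zip")` on a hypothetical folder named whales would produce a single file that can be published at one URL.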

An archive requires stable identifiers for core records, but not for extension records. Any shared dataset therefore needs some kind of local record identifier. It is good practice to maintain, alongside the original data, identifiers that remain stable over time and are not reused after a record is deleted. If you can, please provide globally unique identifiers instead of local ones.
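As a rough illustration of these identifier guidelines, the Python sketch below checks a list of core identifiers for duplicates and derives globally unique identifiers from local ones. Deriving name-based UUIDs is only one possible approach and is not prescribed by GBIF; the function names and the dataset URL are hypothetical.

```python
import uuid
from collections import Counter


def duplicate_ids(ids):
    """Return the identifiers that occur more than once, sorted."""
    counts = Counter(ids)
    return sorted(i for i, n in counts.items() if n > 1)


def assign_global_ids(local_ids, dataset_url):
    """Map each local identifier to a name-based (version 5) UUID.

    The same dataset URL and local identifier always give the same UUID,
    so the result stays stable for as long as the local identifier does.
    """
    namespace = uuid.uuid5(uuid.NAMESPACE_URL, dataset_url)
    return {local: str(uuid.uuid5(namespace, str(local))) for local in local_ids}


# Hypothetical dataset URL and local identifiers, for illustration only.
global_ids = assign_global_ids(["1", "2"], "https://example.org/datasets/whales")
```

Because the mapping is deterministic, it only helps if the local identifiers themselves are never reused after a record is deleted.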