GBIF data processing

Getting biodiversity data from publisher to user in near-real time

Crabtree Nature Preserve by Justin Kern licensed under CC BY-NC-ND 2.0.

GBIF data processing – from “Register” to downloading

In this description, imagine a researcher, Professor Smith, who has an occurrence dataset of insects with about 50,000 records. He has compiled the data into an Excel spreadsheet and uploaded it to his institution’s instance of the Integrated Publishing Toolkit (IPT). The IPT combines the occurrence data from the spreadsheet with metadata entered by the professor, and packages everything in a Darwin Core Archive. Once the dataset is ready to be included in GBIF, he clicks “Register”.

The GBIF Registry

The IPT immediately starts talking to the GBIF registry via the Registry API. The registry responds by creating a new dataset based on a minimum set of metadata provided by the IPT.
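As a rough illustration of how a client might address a dataset once the registry has created it, the sketch below builds the URL for a dataset lookup. It assumes the public GBIF API base `https://api.gbif.org/v1/` and a `dataset/{key}` route; the helper name `dataset_url` and the example key are hypothetical. No request is sent, and this is not the call the IPT itself makes when registering.

```python
from urllib.parse import quote, urljoin

# Assumption: public GBIF API base URL; check the API documentation before relying on it.
GBIF_API_BASE = "https://api.gbif.org/v1/"


def dataset_url(dataset_key: str) -> str:
    """Return the registry URL for one dataset (hypothetical helper)."""
    # quote() keeps an unexpected character in the key from altering the path.
    return urljoin(GBIF_API_BASE, "dataset/" + quote(dataset_key, safe=""))


# Placeholder key for illustration only, not a real dataset.
print(dataset_url("00000000-0000-0000-0000-000000000000"))
```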

DOIs

A message then goes out from the registry saying that there’s a new dataset. This message is picked up by the DOI updater, which in turn talks to DataCite to create a new DOI for the dataset. Once the DOI has been created, the DOI updater updates the dataset to include it. If the publishing organization has its own agreement with DataCite, the IPT is able to handle DOIs itself, in which case the DOI would already have been assigned. Generally speaking, DOIs assigned by GBIF (via the DOI updater) will resolve to the dataset page on gbif.org, whereas IPT-assigned DOIs will resolve to the dataset page on the institution’s instance of the IPT.

Crawling

The news of a new dataset will also be picked up by the GBIF crawling infrastructure. Crawling is the process by which the content of the dataset makes its way to GBIF. The crawling infrastructure is a distributed system that handles many datasets simultaneously. The crawler will contact the IPT and transfer the Darwin Core Archive to GBIF servers. The crawler is also able to retrieve data from other sources using different protocols (e.g. BioCASe).

Fragmenting, persisting, normalizing and interpreting

At this stage, the dataset is split into individual records, a process called fragmenting. The fragmented records, referred to as “raw”, are then individually identified to determine whether to create a new record or update an existing one. The content of each fragment is now normalized to Darwin Core terms, at which time the records are referred to as “verbatim”. Finally, the record goes through interpretation, where quality control is also applied. For instance, this is where the taxonomic names are checked against the GBIF backbone. If there are gaps, say a record only has a genus and species name, the higher taxonomic levels are added. If a mistake is noticed or an assumption is made during interpretation, a flag is raised. On gbif.org you will be able to see the interpreted version of the record with issues (flags raised during interpretation), if any, but you can also view the verbatim version and compare the two. The record is finally stored in a massive database.
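The stages above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not GBIF's actual pipeline: the backbone is a one-entry dictionary, the flag name `NAME_NOT_MATCHED` is made up for illustration, the function names are hypothetical, and the step that decides between creating and updating a record is left out.

```python
from dataclasses import dataclass, field

# Toy stand-in for the GBIF backbone: (genus, species epithet) -> higher taxonomy.
TOY_BACKBONE = {
    ("Apis", "mellifera"): {"family": "Apidae", "order": "Hymenoptera", "class": "Insecta"},
}


@dataclass
class Interpreted:
    verbatim: dict                               # record as normalized to Darwin Core terms
    fields: dict                                 # interpreted values
    issues: list = field(default_factory=list)   # flags raised during interpretation


def fragment(rows):
    """Split a dataset (a list of rows) into individual raw records."""
    for row in rows:
        yield dict(row)


def normalize(raw, column_map):
    """Rename publisher columns to Darwin Core terms, giving the verbatim record."""
    return {column_map.get(key, key): value.strip() for key, value in raw.items()}


def interpret(verbatim):
    """Fill in missing higher taxonomy from the toy backbone; flag unmatched names."""
    record = Interpreted(verbatim=verbatim, fields=dict(verbatim))
    key = (verbatim.get("genus", ""), verbatim.get("specificEpithet", ""))
    higher = TOY_BACKBONE.get(key)
    if higher is None:
        record.issues.append("NAME_NOT_MATCHED")  # illustrative flag name
    else:
        for rank, name in higher.items():
            if not record.fields.get(rank):  # only fill gaps, keep supplied values
                record.fields[rank] = name
    return record


rows = [
    {"Genus": "Apis", "Species": "mellifera "},
    {"Genus": "Nonexistus", "Species": "fictus"},
]
column_map = {"Genus": "genus", "Species": "specificEpithet"}
results = [interpret(normalize(raw, column_map)) for raw in fragment(rows)]
for result in results:
    print(result.fields, result.issues)
```

Keeping `verbatim` and `fields` as separate dictionaries mirrors the point made above: the interpreted view can be compared with the verbatim one because interpretation never overwrites what the publisher supplied.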

Searching and downloading

Once the record has been stored, maps, counters and search indexes are updated. At this time the record will be visible on gbif.org and available for download. The entire process, from when the professor clicked “Register” until his data is available for the world to see, usually takes no more than 5 minutes, although this depends on the number of records in the dataset.

Data is available for download in two formats:

  • Tab-delimited CSV: This simple format provides a tabular view of the data with the most commonly used columns. The table includes only the data after it has gone through interpretation and quality control. Tools such as Microsoft Excel can be used to read this format.
  • Darwin Core Archive: This format is a TDWG standard and contains rich information. It is a zip file containing the original data as shared by the publisher, and the interpreted view after the data has gone through quality control procedures. Additional files provide supplementary information such as images. This format is more complex than simple CSV but provides the most complete view of the data.
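
Besides Excel, the tab-delimited format can be read with a few lines of Python. The sketch below parses a small in-memory sample instead of a real download; the four column names are chosen for illustration, and real files carry many more columns. The helper `read_occurrences` is hypothetical.

```python
import csv
import io

# Tiny stand-in for a downloaded file; the second record has no coordinates.
sample = (
    "gbifID\tscientificName\tdecimalLatitude\tdecimalLongitude\n"
    "1\tApis mellifera\t55.68\t12.57\n"
    "2\tBombus terrestris\t\t\n"
)


def read_occurrences(handle):
    """Parse tab-delimited text into a list of dicts, one per record."""
    # QUOTE_NONE treats quote characters as ordinary text rather than delimiters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


records = read_occurrences(io.StringIO(sample))
# Keep only records that have both coordinates filled in.
with_coords = [r for r in records if r["decimalLatitude"] and r["decimalLongitude"]]
print(len(records), len(with_coords))
```

For a file on disk, the same function can be passed a handle opened with `open(path, newline="", encoding="utf-8")`.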
