New processing routine improves stability of GBIF occurrence IDs

Analysis of records in updated datasets detects potential errors, enabling data managers to make timely decisions about preserving existing identifiers

Pupae of a three-spot grass yellow butterfly (Eurema blanda subsp. arsakia), observed on the island of Taiwan. Photo 2022 羅美玲 via Taiwan Moth Occurrence Data Collected From Social Network, licensed under CC BY 4.0.

Refinements to GBIF's data ingestion processes will enable data managers and publishers to make better, more informed choices when updates to their data alter the unique identifiers assigned to each occurrence record in

GBIF occurrence IDs (gbifIDs) are not designed as persistent identifiers; in fact, they are simply the numeric string used to form the URL of any individual occurrence (for example, 2284341217). However, these identifiers often provide a convenient, best-available means of creating consistent, unambiguous references to these individual records.

As a result, improving their stability can help the data holders and researchers who cite and link to specimen or observation records, even as they await the results of initiatives across the GBIF community aimed at developing a robust global system of unique persistent identifiers. (This issue can provide an introduction to the technical details for those interested in learning more about the wider topic.)

With the introduction of these back-end improvements, when publishers update datasets, the data-processing pipeline analyses each update and alerts the GBIF help desk if it detects an unusually high number of altered identifiers. Staff can then review the data and work with the data publishers to confirm whether the changes are intentional or not. Secretariat staff will monitor and refine the threshold for triggering this process over time.
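
The check described above can be pictured with a minimal Python sketch. It is an illustration only, not GBIF's pipeline code: the function names, the use of a simple fraction, and the default threshold of 0.5 are all assumptions, since the actual metric and threshold are not given here.

```python
def altered_fraction(previous_ids, updated_ids):
    """Fraction of previously published identifiers that are absent from the update."""
    previous = set(previous_ids)
    if not previous:
        # Nothing was published before, so nothing can have been altered.
        return 0.0
    missing = previous - set(updated_ids)
    return len(missing) / len(previous)


def needs_review(previous_ids, updated_ids, threshold=0.5):
    """Flag an update for manual review when too many identifiers have changed.

    The threshold value is hypothetical; the article says only that staff
    will monitor and refine it over time.
    """
    return altered_fraction(previous_ids, updated_ids) > threshold


# Usage: two of the four earlier identifiers no longer appear in the update.
before = ["cat-001", "cat-002", "cat-003", "cat-004"]
after = ["cat-001", "cat-002", "CAT:003", "CAT:004"]
flagged = needs_review(before, after, threshold=0.25)
```

In this sketch an update is compared only against the identifiers it replaces, so a dataset that simply adds new records would not be flagged.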

Data publishers typically change GBIF occurrence IDs for one of three reasons:

  1. Administrative updates reflecting policy decisions, such as the adoption of new identifiers (e.g. CETAF Stable Identifiers) or patterns (e.g. institutionCode:collectionCode:catalogNumber), the introduction of website encryption (e.g. http > https), or changes to the data-holding institution's name or packaging of datasets
  2. Unintentional mistakes inadvertently caused behind the scenes by software updates, faulty scripts or other technical glitches
  3. Lack of awareness that introducing widespread or frequent changes to occurrence IDs may have downstream consequences for other users
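
To make the first reason concrete, the sketch below shows how two of the administrative changes mentioned in the list could be recognized or produced. Both helper functions are hypothetical and written for illustration; the example URL and code values are made up, and nothing here reflects how GBIF itself matches records.

```python
def differs_only_by_scheme(old_id, new_id):
    """True when two identifiers are identical apart from an http -> https switch."""
    old_prefix, new_prefix = "http://", "https://"
    return (
        old_id.startswith(old_prefix)
        and new_id.startswith(new_prefix)
        and old_id[len(old_prefix):] == new_id[len(new_prefix):]
    )


def triplet_identifier(institution_code, collection_code, catalog_number):
    """Build an identifier following the institutionCode:collectionCode:catalogNumber pattern."""
    return ":".join([institution_code, collection_code, catalog_number])


# Usage with made-up values.
same_record = differs_only_by_scheme(
    "http://example.org/specimen/123",
    "https://example.org/specimen/123",
)
triplet = triplet_identifier("INST", "COLL", "123")
```

A change of this kind alters every identifier in a dataset at once, which is why it benefits from coordination with the help desk beforehand.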

In the case of the first example, the GBIF help desk now has a workflow for engaging data publishers to coordinate administrative changes, detect accidental mistakes and preserve the provenance and consistency of individual records without creating new occurrence IDs.

"Maintaining the stability of gbifIDs is critical for building trust in both research and the systems on which it depends," said David Shorthouse, developer of Bionomia, an open curatorial environment for linking and crediting natural history specimen records to the experts who collected and identified them. "Any improvements in the durability and persistence of gbifIDs brings more stability to Bionomia, more trust in links our volunteers establish, and greater potential for completing round-trip workflows to reincorporate digital enhancements back into collections management systems and other local data stores."

"Stable occurrence IDs are a precondition for matching material citations in publications to occurrences and thus to extend access to and knowledge about an occurrence," said Donat Agosti, president of GBIF participant and data publisher Plazi. "These matches provide the first step toward revealing occurrences hidden in publications and linking the specimens in natural history collections to taxonomic treatments in their libraries—a hot topic currently supported by the EU-funded BiCIKL project, Swiss universities and the Arcadia Fund."

With these changes in place, GBIF will now start work on developing tools that monitor the stability of occurrence IDs by institution and country and help data users assess whether the IDs fulfil their needs.