Infrastructure

How GBIF works as a global informatics infrastructure

The occurrence publishing network spans more than 500 publishing institutions globally. Data holders manage content in spreadsheets or databases and then use publishing tools to expose those data for querying and access over the internet. The existence of the dataset, and the technical protocols required to access the data, are recorded in the GBIF registry.

Aggregators such as the GBIF global portal and national GBIF data portals crawl these datasets and build sophisticated indexes that allow users to efficiently search and access content across datasets.

This page briefly describes the architecture and operations performed in the global GBIF portal when crawling and indexing occurrence data for user search and download.

Component architecture

The architecture is designed as an efficient, distributed crawling and processing system with decoupled components. The primary goals of the design were to ensure rapid processing of data, reducing the latency between publisher changes and global discovery, and to remain flexible enough that components can be swapped out or updated as needed. The architecture is best explained with two example use cases:

    1. The sequential set of steps that occur when a new dataset is registered in the registry. This use case was chosen because a newly registered dataset triggers action on all components of the architecture.
    2. The process that occurs when a user searches and chooses to download content.

The following illustrates the core components of the occurrence architecture, which are explained as we progress through the examples. Core to all of this is the distributed Apache Hadoop cluster, which provides redundant copies of content and parallelized processing capabilities.

Use case: A newly registered dataset

  • The dataset is registered in the registry. The technical access point information is associated with the dataset, including the URL and protocol required to access it.
  • The registry emits a message onto the shared messaging bus (RabbitMQ) declaring that the dataset is ready to be crawled. Note that the crawl scheduling component may also emit this message when a dataset is due to be re-crawled, which happens periodically. (A minimal sketch of publishing such a message appears after this list.)
  • Many crawlers can run concurrently, controlled by the crawler coordinator. The coordinator receives the message from the registry and manages crawl resources to initiate the crawl. At this point, locks (Apache ZooKeeper) are taken out to ensure that concurrent crawlers do not over-eagerly crawl an endpoint. Additionally, shared counters are configured, which are updated at every subsequent stage of processing to allow overall monitoring and to determine when a crawl is complete across all components. (See the ZooKeeper coordination sketch after this list.)
  • Some of the data protocols GBIF supports require paging (e.g. TAPIR, BioCASe), while others support a single HTTP request (e.g. Darwin Core Archives). In both cases, for each artifact retrieved (a page or the full dataset) the crawler harvests the content, persists it to disk and emits a message declaring it is ready for processing. Shared counters (Apache ZooKeeper) are maintained to track progress. For Darwin Core Archives, validation occurs to ensure the archive is suitable for further processing.
  • For each artifact, a processor extracts individual occurrence records, emitting a message for each record harvested. A processor receives this and inspects the record for its identity (e.g. dwc:occurrenceID, or the combination of dwc:institutionCode, dwc:collectionCode and dwc:catalogNumber). In this case the records are all new, so the fragment for each record is inserted into Apache HBase. This is the raw content as harvested, so it might be a fragment of XML for some protocols. A message is emitted and counters are updated. (See the HBase fragment sketch after this list.)
  • A processor receives the message and normalizes the raw view of the record into the separate fields of Darwin Core with as little interpretation as possible. A message is emitted declaring the record is normalized, and a processor starts interpreting the content and applying quality control. This includes:
    1. Ensuring all basis of record values align to a vocabulary
    2. Ensuring all coordinates look sensible compared to the other fields present
    3. Inferring fields such as country where coordinates exist but no country is stated
    4. Aligning the record to the GBIF Backbone taxonomy using the name lookup web service (sketched after this list)
    Once processing is complete, a message is emitted to declare the record finished, and counters are updated. The crawler coordinator monitors the counters and, when it determines that all messages have been processed, writes the result of the crawl to the registry for future auditing and reporting on crawl history.
  • In order to support real-time search and the various portal views such as maps and statistics, specific data structures are maintained. The newly created record messages are observed by:
    1. The Apache SOLR index updaters, which update the search index (see the SOLR sketch after this list)
    2. The map builders, which compute maps for all supported views (e.g. dataset, publisher, taxon). See the map web services
    3. The data cube supporting the various metrics, which is updated with new counts. See the metrics web services
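
To make the messaging step concrete, the following is a minimal sketch of publishing a crawl message to RabbitMQ with the Java client. The broker host, exchange name, routing key and JSON payload are illustrative assumptions, not GBIF's actual message schema.

    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;

    import java.nio.charset.StandardCharsets;

    public class CrawlMessagePublisher {
      public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("mq.example.org"); // hypothetical broker host

        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {
          // Hypothetical exchange, routing key and payload; the real GBIF message
          // schema is not reproduced here.
          channel.exchangeDeclare("crawler", "topic", true);
          String body = "{\"datasetKey\":\"the-dataset-uuid\",\"attempt\":1}";
          channel.basicPublish("crawler", "crawl.start", null,
              body.getBytes(StandardCharsets.UTF_8));
        }
      }
    }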
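
The locking and shared counters can be sketched with Apache Curator, a common client library for ZooKeeper. The connection string, znode paths and counter name below are assumptions for illustration only; GBIF's actual coordination layout is not shown.

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.atomic.DistributedAtomicLong;
    import org.apache.curator.framework.recipes.locks.InterProcessMutex;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class CrawlCoordination {
      public static void main(String[] args) throws Exception {
        // Hypothetical ZooKeeper connection string and znode layout
        CuratorFramework client = CuratorFrameworkFactory.newClient(
            "zk1.example.org:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        String datasetKey = "the-dataset-uuid"; // hypothetical dataset identifier

        // One lock per endpoint stops concurrent crawlers over-eagerly hitting it
        InterProcessMutex lock = new InterProcessMutex(client, "/crawls/locks/" + datasetKey);
        lock.acquire();
        try {
          // A shared counter, incremented as each page is crawled, so that the
          // coordinator can monitor progress across components
          DistributedAtomicLong pagesCrawled = new DistributedAtomicLong(
              client, "/crawls/" + datasetKey + "/pagesCrawled",
              new ExponentialBackoffRetry(1000, 3));
          pagesCrawled.increment();
        } finally {
          lock.release();
        }
        client.close();
      }
    }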
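
Writing a raw fragment to Apache HBase might look like the following sketch. The table name, column family and key assignment are assumptions; the real occurrence storage schema differs.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FragmentWriter {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             // Illustrative table and column names, not GBIF's actual schema
             Table table = connection.getTable(TableName.valueOf("occurrence_fragment"))) {
          long occurrenceKey = 123456789L; // key assigned by the identity lookup
          Put put = new Put(Bytes.toBytes(occurrenceKey));
          put.addColumn(Bytes.toBytes("o"), Bytes.toBytes("fragment"),
              Bytes.toBytes("<record>raw XML or JSON as harvested</record>"));
          table.put(put);
        }
      }
    }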
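
Aligning a record to the GBIF Backbone can be illustrated with the public species match web service. Production processing uses internal services, so treat the call below as an illustration only; parsing of the JSON response is omitted.

    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;

    public class BackboneLookup {
      public static void main(String[] args) throws Exception {
        String name = URLEncoder.encode("Puma concolor (Linnaeus, 1771)", StandardCharsets.UTF_8);
        // The public name lookup (species match) service; the response contains the
        // matched usageKey and classification used to align the record to the Backbone
        URI uri = URI.create("https://api.gbif.org/v1/species/match?name=" + name);

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
      }
    }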
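
Updating the search index can be sketched with SolrJ. The SOLR core URL and field names are assumptions; the real occurrence index schema is considerably richer.

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class OccurrenceIndexUpdater {
      public static void main(String[] args) throws Exception {
        // Hypothetical SOLR core URL and field names
        try (SolrClient solr =
            new HttpSolrClient.Builder("http://solr.example.org:8983/solr/occurrence").build()) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("key", 123456789L);
          doc.addField("dataset_key", "the-dataset-uuid");
          doc.addField("country", "CR");
          solr.add(doc);
          solr.commit();
        }
      }
    }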

 

Use case: User search and download

  • The user performs a search in the portal and chooses the download button.
  • A download workflow is initiated, orchestrated by the Hadoop workflow manager provided by Apache Oozie.
  • The workflow follows this process:
    1. A query against Apache SOLR is used to get the count of records to download.
    2. If the result size is below a threshold, a paging process iterates over results from SOLR, retrieving the occurrence IDs. For each ID, a "get by key" operation against HBase provides the record details, which are written into the output file. (A sketch of this path follows the list.)
    3. If the result size is above the threshold, a Hadoop MapReduce-based process is initiated. Apache Hive is used to provide a "SQL like" query language, which iterates over the HBase content to produce the output file. (A sketch of this path also follows the list.)
    4. Once complete, subsequent workflow stages run to produce the necessary citation and metadata files using the Registry API.
    5. Finally, all pieces are zipped together into a Darwin Core Archive. For large downloads this is the most time-consuming stage, since the Zip format cannot be produced in parallel, but it offers the widest compatibility.
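
The small-download path can be sketched as a SOLR paging loop combined with HBase "get by key" operations. The SOLR URL, field names and page size are assumptions, and writing the output file is omitted.

    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class SmallDownload {
      // The HBase Table is passed in; connection setup is as in the fragment sketch above
      public static void run(Table occurrenceTable, String solrQuery) throws Exception {
        try (SolrClient solr =
            new HttpSolrClient.Builder("http://solr.example.org:8983/solr/occurrence").build()) {
          SolrQuery query = new SolrQuery(solrQuery).setFields("key").setRows(300);
          int start = 0;
          long found;
          do {
            query.setStart(start);
            QueryResponse response = solr.query(query);
            found = response.getResults().getNumFound();
            for (SolrDocument doc : response.getResults()) {
              long key = (Long) doc.getFieldValue("key");
              // "Get by key" against HBase for the full record details
              Result row = occurrenceTable.get(new Get(Bytes.toBytes(key)));
              // ... write the row's fields into the output file (omitted)
            }
            start += 300;
          } while (start < found);
        }
      }
    }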
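
For larger result sets, the Hive-based path can be sketched via the Hive JDBC driver, with the query writing its output to a directory that later workflow stages pick up. The HiveServer2 URL, table and column names are assumptions; occurrence_hbase is assumed to be a Hive table mapped onto the HBase occurrence store.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class BigDownload {
      public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 URL; requires the Hive JDBC driver on the classpath
        try (Connection conn =
                 DriverManager.getConnection("jdbc:hive2://hive.example.org:10000/default");
             Statement stmt = conn.createStatement()) {
          // "SQL like" bulk export over the HBase-backed occurrence table; Hive turns
          // this into MapReduce jobs and writes the result files to HDFS
          stmt.execute(
              "INSERT OVERWRITE DIRECTORY '/tmp/occurrence-download' "
            + "SELECT key, datasetkey, scientificname, countrycode, "
            + "       decimallatitude, decimallongitude "
            + "FROM occurrence_hbase WHERE countrycode = 'CR'");
        }
      }
    }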