Occurrence processing

Every occurrence record in GBIF goes through a series of processing steps before it becomes available in the GBIF portal. Internally, the processing is glued together by a messaging system that keeps the individual processing components independent of each other. The process can be divided into three main parts: crawling datasets into fragments, parsing fragments into verbatim occurrences, and interpreting verbatim values.

The outcome of each of these steps is available through our API. Every occurrence record therefore has a raw fragment view, a verbatim view and an interpreted view. The corresponding timestamps lastCrawled, lastParsed and lastInterpreted indicate when each step last ran.

Raw fragments

The very first step is to harvest data from the service endpoint registered in the GBIF registry. If multiple services are registered, we prefer Darwin Core Archives (DwC-A). On every dataset details page you can see all registered services in the external data section of the summary block. Similarly, they are also included in the dataset detail from our JSON API.

In addition to Darwin Core Archives, GBIF also supports crawling of the XML-based BioCASe, TAPIR and DiGIR protocols. The outcome of any crawl, regardless of its protocol, is a set of fragments, each representing a single occurrence record in its raw form. In the case of DwC-A this is a JSON representation of an entire star record, i.e. a single core record with all the related extension records attached. In the case of the XML protocols, a fragment is the exact piece of XML that we have extracted. Each protocol and content schema (ABCD1.2, ABCD2.06, DwC1.0, DwC1.4, ...) therefore still exposes its entire content and nature. For example, here are fragments of ABCD2.06 and Darwin Core.

An important part of fragmenting is to assign a stable GBIF identifier to each fragment. This is a delicate process that uses the occurrenceID, catalogNumber, collectionCode and institutionCode in combination with the dataset registry key to either mint a new identifier or reuse an existing one if the dataset has been processed before.
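
The mint-or-reuse decision described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the IdentifierRegistry class, the in-memory dictionary and the exact layout of the lookup keys are hypothetical and are not taken from the GBIF code base. The source only says that these four terms are combined with the dataset key.

```python
from itertools import count


class IdentifierRegistry:
    """Toy in-memory stand-in for a persistent lookup table (hypothetical)."""

    def __init__(self):
        self._ids = {}            # lookup key -> previously assigned identifier
        self._next_id = count(1)  # source of newly minted identifiers

    @staticmethod
    def lookup_keys(dataset_key, fragment):
        """Build candidate lookup keys for a fragment (assumed key layout)."""
        keys = []
        occurrence_id = fragment.get("occurrenceID")
        if occurrence_id:
            keys.append((dataset_key, "occurrenceID", occurrence_id))
        triplet = tuple(
            fragment.get(term)
            for term in ("institutionCode", "collectionCode", "catalogNumber")
        )
        if all(triplet):
            keys.append((dataset_key, "triplet") + triplet)
        return keys

    def assign(self, dataset_key, fragment):
        """Reuse the identifier behind any known key, otherwise mint a new one."""
        keys = self.lookup_keys(dataset_key, fragment)
        if not keys:
            raise ValueError("fragment has no usable identifying terms")
        for key in keys:
            if key in self._ids:
                identifier = self._ids[key]
                break
        else:
            identifier = next(self._next_id)
        # Remember every key so a later crawl can find the same identifier.
        for key in keys:
            self._ids.setdefault(key, identifier)
        return identifier
```

In this sketch a record that is crawled again with only its institutionCode, collectionCode and catalogNumber still resolves to the identifier minted on the first crawl, while the same terms in a different dataset produce a new identifier.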

Verbatim records

Each fragment is then processed into a standard, Darwin Core-based form which we call the verbatim representation of an occurrence. This form is very similar to a Darwin Core Archive star record, but it is a bit more structured and we limit the stored extensions to the 12 that we actually understand. At this stage the value of any individual term of a record is still untyped and has the exact verbatim value found during crawling.

Parsing has the biggest impact on ABCD fragments, as these need to be translated to Darwin Core terms. We are still improving the ABCD transformation, which is why you will currently not find all ABCD content in the verbatim version of a record.

    Interpreted record

    Once all records are available in the standard verbatim form, they go through a set of interpretations. These do basic string cleanups, but for many important properties we also use strong data typing. For example, latitude and longitude values are represented by Java doubles, and the country, basis of record and many other terms that are based on a controlled vocabulary are represented by fixed enumerations in our Java API.

    ISSUES

    There are many things that can go wrong and we continuously encounter unexpected data. To help us and publishers improve the data, we flag records with the various issues that we have encountered. This is also very useful for data consumers, as you can include these issues as filters in occurrence searches. Not all issues indicate bad data. Some merely flag the fact that GBIF has altered values during processing. On the details page of any occurrence record you will see the list of issues in the notice at the very bottom.

    DARWIN CORE VS GBIF TERMS

    For the interpreted records we use Darwin Core terms as much as possible, but there are some cases where we needed to mint new terms in the GBIF namespace. Often these are GBIF-specific things, but in some cases we opted against existing terms in favor of consistency in our API. This is primarily the case for anything related to accuracy. Darwin Core sometimes represents accuracy by providing a minimum and a maximum term, and sometimes there is an explicit precision or accuracy term. We decided to be consistent and always use a single term, e.g. depth, accompanied by a matching accuracy term, in this case depthAccuracy.

    Location interpretation

    COORDINATE

    If geolocated, the interpreted occurrence contains latitude and longitude as decimals for the WGS84 geodetic datum. A coordinateAccuracy in decimal degrees is optionally given if known. We decided not to use dwc:coordinatePrecision as we mean accuracy, not precision. We try to parse and verify the following verbatim terms in the given order to derive a valid WGS84 coordinate:

    1. dwc:decimalLatitude & dwc:decimalLongitude
    2. dwc:verbatimLatitude & dwc:verbatimLongitude
    3. dwc:verbatimCoordinates

    If a geodetic datum is given, we then try to interpret the datum and, if it differs from WGS84, reproject the coordinate into WGS84. In addition, if a literal country was indicated, we verify that the coordinate falls within the given country. Frequently lat/lon values are swapped or have negated values, which we can often detect by looking at the expected country.
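
As a rough illustration of how swapped or negated values can be detected from the expected country, the sketch below tries a few simple transformations of the coordinate and keeps the first one that falls inside the country. It makes several assumptions: the country is approximated by a made-up bounding box for a fictional country code "XX" instead of a real boundary, and the function names, the order of the candidates and the issue labels are all illustrative, not GBIF's implementation.

```python
# Toy bounding boxes (min_lat, max_lat, min_lon, max_lon); illustrative only.
COUNTRY_BOXES = {"XX": (50.0, 60.0, 5.0, 15.0)}


def in_country(lat, lon, country):
    """Return True if the point lies inside the country's toy bounding box."""
    min_lat, max_lat, min_lon, max_lon = COUNTRY_BOXES[country]
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def verify_coordinate(lat, lon, country):
    """Return (lat, lon, issues), repairing common mistakes where possible."""
    candidates = [
        (lat, lon, None),                      # as published
        (-lat, lon, "NEGATED_LATITUDE"),
        (lat, -lon, "NEGATED_LONGITUDE"),
        (-lat, -lon, "NEGATED_COORDINATE"),
        (lon, lat, "SWAPPED_COORDINATE"),
    ]
    for cand_lat, cand_lon, issue in candidates:
        valid = -90 <= cand_lat <= 90 and -180 <= cand_lon <= 180
        if valid and in_country(cand_lat, cand_lon, country):
            return cand_lat, cand_lon, ([issue] if issue else [])
    # Nothing fits: keep the published values and flag the mismatch.
    return lat, lon, ["COUNTRY_MISMATCH"]
```

With the toy box above, verify_coordinate(10.0, 55.0, "XX") returns the swapped pair (55.0, 10.0) together with the SWAPPED_COORDINATE label, while a coordinate that cannot be repaired is returned unchanged with a mismatch label.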

    VERTICAL POSITION

    For the vertical position of the occurrence, Darwin Core provides a wealth of terms. Sadly, it is often not clear how to use (min/max)elevationInMeters, (min/max)depthInMeters and (min/max)distanceAboveSurfaceInMeters in more complex cases. We decided to keep it simple and only use elevation and depth together with their accuracy terms to represent the vertical position. The absolute elevation is given as a decimal in meters and should point at the exact location of the occurrence. It is the coordinate's vertical position in a 3-dimensional coordinate system. Depth is a relative value indicating the distance to the surface of the earth, whether that is terrestrial or water. We preferred the term depth over distanceAboveSurface as it is very common for sea observations, whereas above-ground distances are rarely used.

    GEOGRAPHY

    All geographical area terms in Darwin Core are processed, but only country is interpreted as a fixed enumeration matching the current ISO countries. When no country but a coordinate was published, we derive a country from the coordinate using our reverse geocoding API.

    Taxonomy interpretation

    For a hierarchical, taxonomic search and consistent metrics to work, all records need to be tied to a single taxonomy. As no single taxonomy yet covers all known names, GBIF builds its own GBIF backbone on top of the Catalogue of Life. The higher classification above family level comes exclusively from the Catalogue of Life, while lower taxa can be added in an automated way from other taxonomic datasets available through the GBIF Checklist Bank.

    BACKBONE MATCHING

    Every occurrence is assigned a taxonKey which points to the matching taxon in the GBIF backbone. This key is retrieved by querying our taxon match service, submitting the scientificName, taxonRank, genus, family and all other higher verbatim classification. If the scientificName is not present, it is assembled from the individual name parts that are present: genus, specificEpithet and infraspecificEpithet. Having a higher classification qualifying the scientificName helps improve the accuracy of the taxonomic match in two ways, even if it is just the family or even the kingdom:

    1. In case of homonyms or similarly spelled names the service has a way to verify the potential matches.
    2. In case the given scientific name is not (yet) part of the GBIF backbone we can at least match the record to some higher taxon, e.g. the genus.

    Fuzzy name matches, matches to a higher taxon only, and failures to match any taxon are recorded as issue flags on the record.
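
The fallback for a missing scientificName can be illustrated with a small helper. This is a sketch only: the function name is hypothetical, the record is assumed to be a plain dictionary of verbatim terms, and the rule of stopping at the first missing name part is an assumption made here to avoid producing names such as a lone epithet.

```python
def assemble_scientific_name(record):
    """Return the verbatim scientificName, or build one from the name parts."""
    name = (record.get("scientificName") or "").strip()
    if name:
        return name
    parts = []
    for term in ("genus", "specificEpithet", "infraspecificEpithet"):
        value = (record.get(term) or "").strip()
        if not value:
            break  # assumed rule: a lower part is useless without the part above it
        parts.append(value)
    return " ".join(parts) or None
```

For a record with genus "Puma" and specificEpithet "concolor" but no scientificName, the helper yields "Puma concolor". The result would then be submitted to the match service together with the higher classification.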

    TYPIFICATION

    The type status of a specimen is interpreted from dwc:typeStatus using the TypeStatusParser according to our type status vocabulary.

    Temporal interpretation

    Dates and times can come in various formats, locales and terms in Darwin Core. The majority of dates come as simple strings, but the recording date might be a complex one defined by multiple terms. In general we process verbatim values with our date parser, which prefers the ISO 8601 date format.

    SIMPLE DATE PARSING

    GBIF processes the following date terms as simple dates:

    • dc:modified: the date the record has last changed in the source
    • dateIdentified: the date when the taxonomic identification happened

    RECORDING DATE

    Far more important and complex is the task of interpreting the recording date. It can be supplied in any of the following forms:

    • year, month, day
    • eventDate
    • verbatimEventDate

    We always try to parse the first two and, if both exist, compare the results and flag mismatches.
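
The comparison of the two sources can be sketched as below. The function names and the issue label are illustrative, only full ISO 8601 calendar dates (YYYY-MM-DD, optionally followed by a time) are handled for eventDate, and preferring the year/month/day value when both exist is an assumption of this sketch, not a documented GBIF rule.

```python
from datetime import date


def parse_ymd(year, month, day):
    """Build a date from the separate year, month and day terms, if possible."""
    try:
        return date(int(year), int(month), int(day))
    except (TypeError, ValueError):
        return None


def parse_event_date(value):
    """Parse the leading YYYY-MM-DD part of an ISO 8601 eventDate, if possible."""
    try:
        return date.fromisoformat(value.strip()[:10])
    except (AttributeError, ValueError):
        return None


def interpret_recording_date(record):
    """Return (date or None, issues) for a dictionary of verbatim terms."""
    from_parts = parse_ymd(record.get("year"), record.get("month"), record.get("day"))
    from_event = parse_event_date(record.get("eventDate"))
    issues = []
    if from_parts and from_event and from_parts != from_event:
        issues.append("RECORDED_DATE_MISMATCH")
    # Assumption: prefer the year/month/day terms when both sources parse.
    return from_parts or from_event, issues
```

A record with year 2014, month 5, day 1 and an eventDate of 2014-05-02T10:00:00 would come back with the date 2014-05-01 and the mismatch label attached.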

    Other interpretation

    To provide a consistent search experience GBIF interprets a few terms by mapping values to a controlled enumeration:

    This is done by case-insensitive parsers based on a manually maintained dictionary that maps the verbatim values we encounter to their respective enumeration value. Basic string cleaning and whitespace normalisation are applied in every case.
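
A dictionary-based parser of this kind might look roughly like the following. The dictionary entries and enumeration names are made-up examples for basis of record, and the function names are hypothetical. The real dictionaries are maintained by hand and are far larger.

```python
import re

# Illustrative dictionary: normalised verbatim value -> enumeration name.
BASIS_OF_RECORD_DICTIONARY = {
    "preserved specimen": "PRESERVED_SPECIMEN",
    "specimen": "PRESERVED_SPECIMEN",
    "observation": "OBSERVATION",
    "human observation": "HUMAN_OBSERVATION",
}


def normalise(value):
    """Collapse runs of whitespace, trim the ends and ignore case."""
    return re.sub(r"\s+", " ", value).strip().casefold()


def parse_enum(value, dictionary):
    """Map a verbatim value to its enumeration name, or None if unknown."""
    if value is None:
        return None
    return dictionary.get(normalise(value))
```

Because the lookup happens on the normalised string, "  Preserved   SPECIMEN " and "preserved specimen" resolve to the same enumeration value, and unknown values simply return None so they can be reviewed and added to the dictionary later.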

    MULTIMEDIA

    Please see http://gbif.blogspot.com/2014/05/multimedia-in-gbif.html