Every occurrence record in GBIF goes through a series of processing steps before it becomes available in the GBIF portal. Internally the processing is glued together by a messaging system that keeps the individual processing components independent of each other. The process can be divided into 3 main parts: crawling datasets into fragments, parsing fragments into verbatim occurrences and interpreting verbatim values.
The outcome of each of these steps is available through our API. Every occurrence record therefore has a raw fragment, a verbatim and an interpreted view. The corresponding timestamps lastCrawled, lastParsed and lastInterpreted indicate the exact time each step last ran.
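The three views can be retrieved per record through the public API. A minimal Python sketch of the corresponding URLs (the base URL layout shown here reflects the v1 API, and the occurrence key is a placeholder for illustration):

```python
# Build the API URLs for the three processing views of one occurrence.
# The occurrence key 123456 is a made-up example.
BASE = "https://api.gbif.org/v1"

def occurrence_views(key: int) -> dict:
    """Return the API URLs for the fragment, verbatim and interpreted views."""
    return {
        "fragment": f"{BASE}/occurrence/{key}/fragment",      # raw crawled form
        "verbatim": f"{BASE}/occurrence/{key}/verbatim",      # standard verbatim form
        "interpreted": f"{BASE}/occurrence/{key}",            # interpreted record
    }

urls = occurrence_views(123456)
```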
The very first step is to harvest data from the service endpoint registered in the GBIF registry. If multiple services are registered we prefer Darwin Core Archives (DwC-A). On every dataset details page you can see all registered services in the external data section of the summary block. Similarly, they are also included in a dataset detail from our JSON API.
In addition to Darwin Core Archives GBIF also supports crawling of the XML-based BioCASe, TAPIR and DiGIR protocols. The outcome of any crawling, regardless of its protocol, is a set of fragments, each representing a single occurrence record in its raw form. In the case of DwC-A this is a JSON representation of an entire star record, i.e. a single core record with all the related extension records attached. In the case of the XML protocols, a fragment is the exact piece of XML that we've extracted. Each protocol and content schema (ABCD1.2, ABCD2.06, DwC1.0, DwC1.4, ...) therefore still exposes its entire content and nature. For example, here are fragments of ABCD2.06 and DarwinCore.
An important part of fragmenting is to assign a stable GBIF identifier to each fragment. This is a delicate process that uses the occurrenceID, catalogNumber, collectionCode and institutionCode in combination with the dataset registry key to either mint a new identifier or reuse an existing one if the dataset has already been processed before.
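The identifier logic can be sketched as a keyed lookup: derive a natural key for the record within its dataset and reuse any identifier already minted for that key. This is a toy sketch; the occurrenceID-first preference and the in-memory store are assumptions standing in for the real lookup service:

```python
# Sketch of stable identifier assignment: look up an existing GBIF id for
# the record's natural key, or mint a new one. Field names follow the
# terms named in the text; the dict stands in for the real id store.
from itertools import count

_ids = {}               # natural key -> GBIF identifier
_next_id = count(1)

def gbif_id(dataset_key, occurrence_id, institution_code, collection_code, catalog_number):
    # Prefer occurrenceID when present, else fall back to the classic
    # institution/collection/catalogNumber triplet (an assumed ordering).
    if occurrence_id:
        key = (dataset_key, "occurrenceID", occurrence_id)
    else:
        key = (dataset_key, "triplet", institution_code, collection_code, catalog_number)
    if key not in _ids:
        _ids[key] = next(_next_id)   # mint a new identifier
    return _ids[key]                  # reuse on re-crawl
```

Re-crawling the same record therefore yields the same identifier, which is what keeps occurrence keys stable across dataset updates.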
Each fragment is then processed into a standard, Darwin Core based form which we call the verbatim representation of an occurrence. This form is very similar to a Darwin Core Archive star record, but it is a bit more structured and we limit the stored extensions to the 12 we actually understand. At this stage the value of any individual term of a record is still untyped and holds the exact verbatim value found during crawling.
Parsing has the biggest impact on ABCD fragments, as these need to be translated into Darwin Core terms. We are still in the middle of improving the ABCD transformation, which is why you will currently not find all ABCD content in the verbatim version of a record.
Once all records are available in the standard verbatim form they go through a set of interpretations. These do basic string cleanups, but for many important properties we also use strong data typing. For example, latitude and longitude values are represented by Java doubles, and country, basis of record and many other terms based on a controlled vocabulary are represented by fixed enumerations in our Java API.
There are many things that can go wrong, and we continuously encounter unexpected data. In order to help us and publishers improve the data, we flag records with the various issues that we have encountered. This is also very useful for data consumers, as you can use these issues as filters in occurrence searches. Not all issues indicate bad data; some merely flag the fact that GBIF has altered values during processing. On the details page of any occurrence record you will see the list of issues in the notice at the very bottom.
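The combination of vocabulary-based typing and issue flagging can be sketched like this. The enum members and the issue names here are illustrative (BASIS_OF_RECORD_INVALID mirrors a real GBIF flag, the others are simplified), not the exact constants from the Java API:

```python
# Sketch of a vocabulary-based interpretation step: map a verbatim
# basisOfRecord string onto a fixed enumeration and flag issues when the
# value had to be altered or could not be matched at all.
from enum import Enum

class BasisOfRecord(Enum):
    HUMAN_OBSERVATION = "HUMAN_OBSERVATION"
    PRESERVED_SPECIMEN = "PRESERVED_SPECIMEN"
    FOSSIL_SPECIMEN = "FOSSIL_SPECIMEN"

def interpret_basis_of_record(verbatim: str):
    issues = []
    normalized = verbatim.strip().upper().replace(" ", "_")
    if normalized != verbatim:
        issues.append("VALUE_ALTERED")            # GBIF changed the value
    try:
        return BasisOfRecord(normalized), issues  # strongly typed result
    except ValueError:
        issues.append("BASIS_OF_RECORD_INVALID")  # no vocabulary match
        return None, issues
```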
DARWIN CORE VS GBIF TERMS
For the interpreted records we use Darwin Core terms as much as possible, but there are some cases where we needed to mint new terms in the GBIF namespace. Often these are very GBIF-specific things, but in some cases we opted against existing terms in favor of consistency in our API. This is primarily the case for anything related to accuracy. Darwin Core sometimes represents accuracy by providing a minimum and a maximum term, sometimes there is an explicit precision or accuracy term. We decided to be consistent and always use a single term, e.g. depth, accompanied by a matching accuracy term, in this case depthAccuracy.
If geolocated, the interpreted occurrence contains latitude and longitude as decimals for the WGS84 geodetic datum. A coordinateAccuracy in decimal degrees is optionally given if known. We decided not to use dwc:coordinatePrecision as we mean accuracy, not precision. We try to parse and verify the following verbatim terms in the given order to derive a valid WGS84 coordinate:
- dwc:decimalLatitude & dwc:decimalLongitude
- dwc:verbatimLatitude & dwc:verbatimLongitude
If a geodetic datum is given we then try to interpret the datum and, if it differs from WGS84, reproject the coordinate into WGS84. In addition, if a country was indicated, we verify that the coordinate falls within that country. Frequently latitude and longitude values are swapped or negated, which we can often detect by checking against the expected country.
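The parsing order and the country sanity check described above can be sketched as follows. The toy bounding box stands in for GBIF's real reverse geocoding, and the issue names are simplified illustrations:

```python
# Sketch of coordinate interpretation: try the decimal terms first, fall
# back to the verbatim ones, then check the result against the stated
# country and detect a common lat/lon swap.
def parse_decimal(value):
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

# Toy bounding boxes: (min_lat, max_lat, min_lon, max_lon). Real country
# checks use polygon lookups, not boxes.
COUNTRY_BOXES = {"DK": (54.5, 57.8, 8.0, 15.2)}

def interpret_coordinate(record, country=None):
    issues = []
    lat = parse_decimal(record.get("decimalLatitude"))
    lon = parse_decimal(record.get("decimalLongitude"))
    if lat is None or lon is None:
        # Fall back to the verbatim coordinate terms.
        lat = parse_decimal(record.get("verbatimLatitude"))
        lon = parse_decimal(record.get("verbatimLongitude"))
    if lat is None or lon is None:
        return None, ["COORDINATE_INVALID"]
    if country in COUNTRY_BOXES:
        lo_lat, hi_lat, lo_lon, hi_lon = COUNTRY_BOXES[country]
        if not (lo_lat <= lat <= hi_lat and lo_lon <= lon <= hi_lon):
            # A swapped pair that fits the country is a common error.
            if lo_lat <= lon <= hi_lat and lo_lon <= lat <= hi_lon:
                lat, lon = lon, lat
                issues.append("PRESUMED_SWAPPED_COORDINATE")
            else:
                issues.append("COUNTRY_COORDINATE_MISMATCH")
    return (lat, lon), issues
```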
For the vertical position of an occurrence Darwin Core provides a wealth of terms. Sadly it is often not clear how to use (min/max)elevationInMeters, (min/max)depthInMeters and (min/max)distanceAboveSurfaceInMeters in more complex cases. We decided to keep it simple and only use elevation and depth together with their accuracy terms to represent the vertical position. The absolute elevation is given as a decimal in meters and should point at the exact location of the occurrence; it is the coordinate's vertical position in a 3-dimensional coordinate system. Depth is a relative value indicating the distance to the surface of the earth, whether terrestrial or water. We preferred the term depth over distanceAboveSurface as it is very common for sea observations and rarely used for above-ground distances.
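One plausible way to collapse Darwin Core's min/max pair into the single value-plus-accuracy pair described above is to take the midpoint as the value and half the range as the accuracy. This is a sketch of the idea, not necessarily GBIF's exact rule:

```python
# Collapse (minElevationInMeters, maxElevationInMeters) into the single
# elevation + elevationAccuracy pair: midpoint and half-range.
def elevation_with_accuracy(min_m, max_m):
    if min_m is None and max_m is None:
        return None, None
    if min_m is None or max_m is None:
        # Only one bound known: use it, with no accuracy statement.
        return (min_m if min_m is not None else max_m), None
    lo, hi = sorted((min_m, max_m))
    return (lo + hi) / 2.0, (hi - lo) / 2.0
```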
All geographical area terms in Darwin Core are processed, but only country is interpreted as a fixed enumeration matching the current ISO countries. When no country but a coordinate was published, we derive a country from the coordinate using our reverse geocoding API.
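A reverse geocoding lookup like the one mentioned above could be called as follows. Treat the `/v1/geocode/reverse` path and the `lat`/`lng` parameter names as assumptions to verify against the current API documentation:

```python
# Sketch of building a reverse geocoding request URL to derive a country
# from a coordinate. Endpoint path and parameter names are assumptions.
from urllib.parse import urlencode

def reverse_geocode_url(lat: float, lng: float) -> str:
    query = urlencode({"lat": lat, "lng": lng})
    return f"https://api.gbif.org/v1/geocode/reverse?{query}"
```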
For a hierarchical, taxonomic search and consistent metrics to work, all records need to be tied to a single taxonomy. As no single existing taxonomy covers all known names, GBIF builds its own GBIF backbone on top of the Catalogue of Life. The higher classification above family level comes exclusively from the Catalogue of Life, while lower taxa can be added in an automated way from other taxonomic datasets available through the GBIF Checklist Bank.
Every occurrence is assigned a taxonKey which points to the matching taxon in the GBIF backbone. This key is retrieved by querying our taxon match service, submitting the scientificName, taxonRank, genus, family and all other verbatim higher classification terms. If the scientificName is not present, it is assembled from the individual name parts if present: genus, specificEpithet and infraspecificEpithet. Having a higher classification qualifying the scientificName, even if it is just the family or kingdom, helps improve the accuracy of the taxonomic match in two ways:
- In case of homonyms or similarly spelled names the service has a way to verify the potential matches.
- In case the given scientific name is not (yet) part of the GBIF backbone we can at least match the record to some higher taxon, e.g. the genus.
Fuzzy name matching and matches to a higher taxon or to no taxon at all are flagged as issues on the record.
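A taxon match request as described above can be built like this. The `/v1/species/match` endpoint and its parameter names reflect the public API, but verify the details against the API documentation:

```python
# Sketch of building a taxon match query: the scientific name plus
# whatever higher classification is available (kingdom, family, ...).
from urllib.parse import urlencode

def species_match_url(name, rank=None, **classification):
    params = {"name": name}
    if rank:
        params["rank"] = rank
    # Add any known higher classification, e.g. kingdom=..., family=...
    params.update({k: v for k, v in classification.items() if v})
    return "https://api.gbif.org/v1/species/match?" + urlencode(params)
```

Even a partial classification (just the kingdom) lets the service disambiguate homonyms or at least match the record to a higher taxon.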
Dates and times can come in various formats, locales and terms in Darwin Core. The majority of dates come as simple strings, but the recording date might be a complex one defined by multiple terms. In general we use our date parser to process verbatim values, preferring the ISO 8601 date format.
SIMPLE DATE PARSING
GBIF processes the following date terms as simple dates:
- dc:modified: the date the record was last changed in the source
- dateIdentified: the date when the taxonomic identification happened
Far more important and complex is the task of interpreting the recording date. It can come in either of two forms:
- eventDate as a single string
- year, month, day as individual terms
We try to parse the first two in any case and compare the results if they both exist, flagging mismatches.
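The compare-and-flag step can be sketched like this. `date.fromisoformat` stands in for GBIF's more forgiving date parser, and the issue names are simplified illustrations:

```python
# Sketch of recording-date interpretation: parse eventDate and the
# year/month/day triple independently, compare, and flag mismatches.
from datetime import date

def interpret_event_date(event_date=None, year=None, month=None, day=None):
    issues = []
    from_string = None
    if event_date:
        try:
            from_string = date.fromisoformat(event_date)  # prefers ISO 8601
        except ValueError:
            issues.append("RECORDED_DATE_UNPARSABLE")
    from_parts = None
    if year and month and day:
        try:
            from_parts = date(int(year), int(month), int(day))
        except ValueError:
            issues.append("RECORDED_DATE_INVALID")
    if from_string and from_parts and from_string != from_parts:
        issues.append("RECORDED_DATE_MISMATCH")
    return from_string or from_parts, issues
```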