New GBIF data validator provides pre-publication review of datasets

‘Early release’ service is first to assess dataset content

Death's head hawkmoth (Acherontia atropos) with honey bees (Apis mellifera). Photo by blingbeek via iNaturalist Research-grade observations, licensed under CC BY-NC 4.0.

Data publishers can improve the quality of their datasets by identifying and addressing potential issues prior to publication with the help of the new GBIF data validator.

The GBIF Secretariat’s informatics team developed this ‘early release’ version of the service and expects it to improve, thanks in no small part to user feedback that is already flowing in. The service runs the same checks that GBIF.org applies to published datasets, but it does so before publication rather than simply spotting and flagging errors once the data are public. It is also the first tool that interprets and validates a dataset’s content as well as its structure.

Users who upload or drag and drop a dataset in an accepted format quickly receive a report that interprets the data and highlights potential issues with its content, syntax and structure. Supported file types include Darwin Core Archives (DwC-A) and standard GBIF dataset templates as well as simple CSV files that contain Darwin Core terms in the first row. Those wishing to validate large datasets can also submit dataset URLs.
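Before uploading a simple CSV file, publishers may want to confirm locally that its first row really does contain Darwin Core terms. The following Python sketch is illustrative only and is not part of the validator: the function name check_header is hypothetical, and KNOWN_DWC_TERMS is a small hand-picked subset of Darwin Core term names rather than the full standard.

```python
import csv
import io

# Assumption: a small illustrative subset of Darwin Core term names.
# The full standard defines many more terms than are listed here.
KNOWN_DWC_TERMS = {
    "occurrenceID", "basisOfRecord", "scientificName", "eventDate",
    "decimalLatitude", "decimalLongitude", "countryCode",
}


def check_header(csv_text, known_terms=KNOWN_DWC_TERMS):
    """Split the first-row column names into recognised and unrecognised terms.

    Hypothetical helper for a local pre-check; it does not reproduce
    the checks run by the GBIF data validator.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # empty list if the file has no rows
    names = [name.strip() for name in header]
    recognised = [n for n in names if n in known_terms]
    unrecognised = [n for n in names if n not in known_terms]
    return recognised, unrecognised


sample = (
    "occurrenceID,scientificName,eventDate,colour\n"
    "1,Acherontia atropos,2015-04-01,brown\n"
)
recognised, unrecognised = check_header(sample)
print("recognised:", recognised)
print("unrecognised:", unrecognised)
```

A column reported as unrecognised here is not necessarily an error; it may simply be a term missing from the illustrative subset above.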

Processing time varies with the size of the dataset. However, since each new validation process generates a unique job ID, users who have large datasets or are pressed for time can bookmark their report URLs and return to them later.

Each validation report contains:

  • A quick summary of the dataset that indicates whether GBIF.org can successfully index the file or not
  • An overview of any GBIF interpretation issues with the dataset
  • A detailed rundown of any issues with the metadata, dataset core and extensions
  • The number of records successfully interpreted
  • The frequency of terms used in the dataset
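To give a rough sense of what the last two items in this list might look like for a simple CSV file, the sketch below counts the records in a file and, for each column, how many records carry a non-empty value. This is an assumption about what "frequency of terms" means, not a description of how the validator computes its report, and the function name term_frequency is hypothetical.

```python
import csv
import io


def term_frequency(csv_text):
    """Return (record count, {term: number of records with a non-empty value}).

    Assumption: "frequency" is taken to mean how many records populate
    each column. The validator's own definition may differ.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in (reader.fieldnames or [])}
    total = 0
    for row in reader:
        total += 1
        for name, value in row.items():
            # Skip surplus cells (key None) and missing cells (value None).
            if name is not None and value is not None and value.strip():
                counts[name] += 1
    return total, counts


sample = (
    "occurrenceID,scientificName,eventDate\n"
    "1,Acherontia atropos,2015-04-01\n"
    "2,Apis mellifera,\n"
)
total, counts = term_frequency(sample)
print(total, counts)
```

A term with a count well below the record total, such as eventDate in this sample, points to records that may need attention before publication.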

Users of the data validator can also preview how their metadata will appear once it’s published on GBIF.org.

Users whose validation reports identify blocking issues with indexing their datasets can turn their attention to addressing them prior to publication. At the same time, users whose datasets get the green light can carefully review other less severe problems or conversion errors and further improve the quality of their data. All users are encouraged to resubmit datasets regardless of whether the errors they correct are large and systemic or single typos.

Like all GBIF tools, the data validator is open-source software, with its source code and documentation available in the project’s GitHub repository.

Learn more about the data validator or, better yet, put the tool to use. User feedback will be both welcome and vital to refining this service and helping data publishers resolve potential issues with their datasets quickly and effectively.
