With nearly two scientific articles relying on species occurrence data from GBIF.org published every day, ensuring the data quality of this indispensable resource is critical. While small datasets may be cleaned manually by visual inspection, this is impractical, if not impossible, for studies using thousands or millions of records.
Faced with this challenge, a team of researchers including the 2nd prize winner of the 2016 Ebbe Nielsen Challenge developed CoordinateCleaner, a new open-source R package for standardized cleaning of species occurrence records. Based on a reference database of coordinates indicating typical geolocation problems, e.g. country centroids and capitals, as well as the coordinates of biodiversity institutions (e.g. museums), CoordinateCleaner quickly flags records with potential issues. The software also employs novel algorithms to spot datasets affected by systematic coordinate conversion errors and rasterized sampling bias.
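The idea of reference-based flagging can be illustrated with a minimal sketch. It is written in Python for illustration only and is not CoordinateCleaner's actual implementation or API. The function names, the 10 km radius and the two reference points are assumptions made for this example; the coordinates are approximate and do not come from the package's reference database.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical reference points (kind, latitude, longitude).
# Approximate, illustrative values only.
REFERENCE_POINTS = [
    ("capital", 55.676, 12.568),      # roughly Copenhagen
    ("institution", 51.479, -0.296),  # roughly a botanical garden in London
]


def flag_records(records, reference_points=REFERENCE_POINTS, radius_km=10.0):
    """Attach a list of flags to each record lying within radius_km
    of any reference point. Records are flagged, not removed."""
    flagged = []
    for rec in records:
        reasons = [
            kind
            for kind, ref_lat, ref_lon in reference_points
            if haversine_km(rec["lat"], rec["lon"], ref_lat, ref_lon) <= radius_km
        ]
        flagged.append({**rec, "flags": reasons})
    return flagged


# Made-up occurrence records for demonstration.
records = [
    {"id": 1, "lat": 55.68, "lon": 12.57},   # next to the capital point
    {"id": 2, "lat": 48.00, "lon": 2.00},    # far from both points
    {"id": 3, "lat": 51.48, "lon": -0.30},   # next to the institution point
]
result = flag_records(records)
```

In this sketch a flag marks a record for closer inspection rather than deleting it, which leaves the final decision to the researcher.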
In an example using all GBIF-mediated records of flowering plants (~91 million records) and the Paleobiology Database (PBDB; 19,000 records), the authors demonstrate how CoordinateCleaner flags 3.6 per cent of GBIF records and 6.3 per cent of PBDB records as having geolocation issues. At the contributing dataset level, the example flagged four per cent of datasets for potential coordinate conversion bias and 18.5 per cent of datasets for potential rounding or rasterization. While the output shouldn’t be taken as immediate grounds for exclusion, it grants researchers a more objective means of flagging datasets for manual case-by-case validation.
CoordinateCleaner provides a comprehensive set of cleaning routines that can help researchers perform fast, systematic and reproducible cleaning of occurrence data for use in ecological, biogeographic and paleontological studies.