With more than two scientific articles relying on species occurrence data from GBIF.org published every day, ensuring the data quality of this indispensable resource is critical. While small datasets can be inspected visually, this is impractical, if not impossible, for studies using thousands or millions of records.
Faced with this challenge, a team of researchers including the 2nd prize winner of the 2016 Ebbe Nielsen Challenge developed CoordinateCleaner, a new open source R package for standardized cleaning of species occurrence records. Drawing on a reference database of coordinates associated with typical geolocation problems, such as country centroids and capitals, as well as the coordinates of biodiversity institutions (e.g. museums), CoordinateCleaner quickly flags records with potential issues. The software also employs novel algorithms to spot datasets whose records show signs of systematic coordinate conversion bias or rasterized sampling.
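The idea of flagging records that fall suspiciously close to known reference coordinates can be illustrated with a minimal sketch. This is not CoordinateCleaner's API or its actual algorithm: it is written in Python rather than R, and the function names, the 1 km radius and the two reference points (including their approximate coordinates) are assumptions made up for illustration.

```python
import math

# Hypothetical reference points for illustration only; coordinates are
# approximate and do not come from CoordinateCleaner's reference database.
REFERENCE_POINTS = [
    ("capital: Copenhagen", 55.676, 12.568),
    ("institution: example museum", 51.496, -0.176),
]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def flag_records(records, reference=REFERENCE_POINTS, radius_km=1.0):
    """Return, for each (lat, lon) record, the names of the reference
    points lying within radius_km. An empty list means no flag."""
    flags = []
    for lat, lon in records:
        hits = [name for name, ref_lat, ref_lon in reference
                if haversine_km(lat, lon, ref_lat, ref_lon) <= radius_km]
        flags.append(hits)
    return flags


if __name__ == "__main__":
    records = [(55.6761, 12.5683), (48.0, 2.0)]
    for record, hits in zip(records, flag_records(records)):
        status = "flagged: " + ", ".join(hits) if hits else "ok"
        print(record, status)
```

As in the article, a flag here marks a record for closer inspection rather than automatic removal: a specimen may genuinely have been collected near a capital or a museum.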
In an example using all GBIF-mediated records of flowering plants (~91 million records) and the Paleobiology Database (PBDB; 19,000 records), the authors demonstrate how CoordinateCleaner flags 3.6 per cent of GBIF-mediated records and 6.3 per cent of PBDB records as having potential geolocation issues. At the level of contributing datasets, the example flagged four per cent of datasets for potential coordinate conversion bias and 18.5 per cent for potential rounding or rasterization. While a flag should not be taken as immediate grounds for exclusion, it gives researchers a more objective means of selecting datasets for manual, case-by-case validation.
CoordinateCleaner provides a comprehensive set of routines that can help researchers perform fast, systematic and reproducible cleaning of occurrence data for use in ecological, biogeographic and paleontological studies.