Cleaning up big data

This study discusses methods of data cleaning and assesses the impact of strict cleaning schemes on the performance of species distribution models.

Data resources used via GBIF: 1,041,941 occurrence records
Numbat (Myrmecobius fasciatus) via iNaturalist. Photo by r_o_b27 licensed under CC BY-NC 4.0.

Researchers frequently combine downloads from the more than 700 million occurrence records in the GBIF network with environmental data to model the distributions of species or higher-level taxa. An initial step in the process often involves data cleaning to ensure the quality of the models. But what effect do such cleaning procedures have on model performance? In this paper, researchers used one million records of Australian mammals to compare model performance before and after strict data cleaning, during which they removed, among other things, entries whose coordinates had fewer than three decimal digits and entries recorded before 1990, effectively halving the number of records. Despite this massive reduction, model performance improved significantly across all spatial scales and measurements, as both gain measures and AUC increased, most prominently for small mammals. The paper thus offers a worked example of quality control and data cleaning in a world of big data.
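Two of the strict filters described above, coordinate precision and recording year, are straightforward to express in code. The sketch below is a minimal illustration, not the authors' actual pipeline: it assumes occurrence records arrive as dictionaries with Darwin Core-style fields (`decimalLatitude`, `decimalLongitude`, `year`) holding string values, as they would from a raw GBIF CSV download.

```python
def decimal_digits(value):
    """Count the decimal digits in a coordinate given as a string."""
    if "." not in value:
        return 0
    return len(value.split(".", 1)[1])

def keep_record(record, min_digits=3, min_year=1990):
    """Illustrative strict filter: require coordinate precision of at
    least `min_digits` decimal digits and a recording year of at least
    `min_year` (thresholds taken from the paper's cleaning scheme)."""
    return (
        decimal_digits(record["decimalLatitude"]) >= min_digits
        and decimal_digits(record["decimalLongitude"]) >= min_digits
        and int(record["year"]) >= min_year
    )

# Hypothetical records: one precise and recent, one too coarse, one too old.
records = [
    {"decimalLatitude": "-31.952", "decimalLongitude": "115.861", "year": "2005"},
    {"decimalLatitude": "-31.9", "decimalLongitude": "115.8", "year": "2005"},
    {"decimalLatitude": "-31.952", "decimalLongitude": "115.861", "year": "1985"},
]
cleaned = [r for r in records if keep_record(r)]
print(len(cleaned))  # → 1
```

In practice such filters are one part of a larger cleaning workflow; the paper's full scheme includes additional user-level checks beyond precision and date.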


Gueta T and Carmel Y (2016) Quantifying the value of user-level data cleaning for big data: A case study using mammal distribution models. Ecological Informatics 34: 139–145. Available at doi:10.1016/j.ecoinf.2016.06.001.