GridDER and bdc share top honors in 2022 GBIF Ebbe Nielsen Challenge

Two international teams led from Brazil and the United States earn selection for reusable software packages addressing data fitness-for-use; second-prize winner deploys machine learning to extract species interactions data from GBIF-enabled research papers

Bruno Ribeiro (left) and Xiao Feng (right) led the teams sharing top honors for GBIF's annual incentive prize named in honour of Ebbe Schmidt Nielsen (center).

bdc, which integrates available data-cleaning tools into a single toolkit, and GridDER, which identifies potential sources of inaccuracy arising from the use of gridded surveys in occurrences, have earned selection as joint first-prize winners of the 2022 GBIF Ebbe Nielsen Challenge.

Both first-prize winners are products of seven-person international teams that have developed open-source software packages containing instructions and reusable functions for R, the most prevalent software environment used to access and work with GBIF-mediated data.

The team responsible for GridDER: Grid Detection and Evaluation in R is led by Xiao Feng, a quantitative ecologist and biogeographer who heads the Quantitative Biogeography Lab in the Department of Geography at Florida State University (FSU). Colleagues represent the Instituto de Pesquisas Jardim Botânico do Rio de Janeiro (JBRJ) and the Department of Biological Sciences and Center for Plant Biology at Purdue University, along with FSU (complete details). GridDER automates the process of identifying occurrence records geolocated within gridded survey systems. Data curators can apply the package to simulate the original grids where details have gone missing, while users can better account for uncertainties associated with gridded coordinates and adjust their analyses accordingly.

"Gridded coordinates, often from gridded surveys, are broadly used in biological studies, but their usage in aggregated format can be problematic because the associated spatial and environmental uncertainties can get lost," said Feng. "Our team developed the GridDER package to enhance the usage of gridded coordinates by matching coordinates to the original grid definitions, which fills in the missing coordinate uncertainty."

The team that developed bdc: A toolkit for standardizing, integrating and cleaning biodiversity data is directed by Bruno R. Ribeiro, a former PhD candidate in ecology and evolution at the Universidade Federal de Goiás (UFG) who currently works as a private environmental consultant. Together the collaborators represent the Instituto de Biología Subtropical at Argentina's Universidad Nacional de Misiones-CONICET, Royal Botanic Gardens, Kew, in the United Kingdom, and several more institutions and departments elsewhere in Brazil and at UFG (complete details). bdc provides a single R package that runs a series of tests on the taxonomic, spatial and temporal dimensions of biodiversity data to address common data-quality issues and improve the overall fitness-for-use of a dataset. Its records, figures and reports allow users to interpret and visualize the results and take steps to compile and harmonize data across taxonomic groups, different sources, or geographic scales.

"Handling biodiversity data from several different sources is not an easy task, and the main novelty of our is that it brings together several aspects of biodiversity data cleaning in one package," said Ribeiro. "We believe that bdc can facilitate and scale the data-cleaning process and catalyze improvements to allow the wise and efficient use of primary biodiversity data."

Ángel Luis Robles Fernández (left) and Nate Upham (right) of Arizona State University are second-prize winners in the 2022 Ebbe Nielsen Challenge.

GBIF LACS: GBIF Literature Abstract Classification System, developed by Ángel Luis Robles Fernández and Nate Upham of the School of Life Sciences at Arizona State University, stands alone as the second-prize winner of the 2022 Challenge. This entry taps into the corpus of nearly 8,000 peer-reviewed research uses of GBIF-mediated data to establish a high-throughput workflow for identifying and classifying publications that contain hidden sources of ecological data on host-parasite interactions between two or more species. By applying a type of machine learning called "positive-unlabeled" (PU) learning, the GBIF LACS project can rapidly score abstracts for their probability of containing host-parasite interactions (complete details). Such interaction-based biodiversity information is essential for modelling and preparing for public-health risks from outbreak diseases like COVID-19, Ebola, Zika and numerous other threats to human communities.

"This project owes a lot to Ángel's ingenuity with novel approaches to biodiversity data science," said Upham. "He and I actually met after his 2020 GBIF Young Researchers Award when I reached out and invited him to join my lab. Our work together on ecological interactions started and continues thanks to GBIF's support."

The jury, led by Jurate de Prins of the Royal Belgian Institute of Natural Sciences (RBINS) and the GBIF science committee, reviewed a pool of 11 qualified submissions to select this year's Challenge winners. This annual incentive prize honours the memory of Dr Ebbe Schmidt Nielsen, a Danish-Australian entomologist who was one of the principal founders of GBIF and an inspired leader in the fields of biosystematics and biodiversity informatics.

For their selection as the top-prize winners, the two first-place teams will each receive €8,000 from an annual prize pool of €20,000, while the second-place team will receive €4,000.

2022 GBIF Ebbe Nielsen Challenge prize winners

First Prizes

bdc: A toolkit for standardizing, integrating and cleaning biodiversity data
This flexible open-source R package integrates several available tools with a series of new ones, enabling users to harmonize and integrate standardized data from different sources while implementing various tests and operations to flag, validate, document, clean and correct taxonomic, spatial and temporal data on biodiversity. Its thematic structure and accompanying tutorials permit users to address different dimensions of the data, for example, by flagging and prefiltering invalid or non-interpretable records for removal or amendment, harmonizing scientific names against different taxonomic references, and flagging suspicious geographic coordinates and inconsistent collection dates.
video | GitHub | CRAN tutorials | publication

For further reading: Ribeiro BR, Velazco SJ, Guidoni-Martins K, Tessarolo G, Jardim L, Bachman SP & Loyola R (2022) bdc: A toolkit for standardizing, integrating and cleaning biodiversity data. Methods in Ecology and Evolution 13: 1421-1428. doi:10.1111/2041-210X.13868
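To make the idea of flagging records concrete, the following is a minimal illustrative sketch in Python, not bdc's own R code or API. It assumes each record is a dictionary using the Darwin Core field names decimalLatitude, decimalLongitude and eventDate, with eventDate already parsed into a date object. The function name and flag labels are hypothetical.

```python
from datetime import date


def flag_record(record, today=None):
    """Return a list of hypothetical quality flags for one occurrence record.

    Assumes Darwin Core-style keys and an eventDate stored as datetime.date.
    Records are flagged rather than dropped, so users decide what to remove.
    """
    today = today or date.today()
    flags = []

    lat = record.get("decimalLatitude")
    lon = record.get("decimalLongitude")
    if lat is None or lon is None:
        flags.append("missing_coordinates")
    elif not (-90 <= lat <= 90 and -180 <= lon <= 180):
        flags.append("coordinates_out_of_range")
    elif lat == 0 and lon == 0:
        # (0, 0) is a common placeholder for unknown locations
        flags.append("zero_coordinates")

    event = record.get("eventDate")
    if event is None:
        flags.append("missing_date")
    elif event > today:
        flags.append("date_in_future")

    return flags


suspicious = {
    "decimalLatitude": 95.0,
    "decimalLongitude": 10.0,
    "eventDate": date(2030, 1, 1),
}
print(flag_record(suspicious, today=date(2022, 12, 1)))
```

Returning flags instead of deleting records mirrors the flag-then-filter approach described above, though the actual tests bdc applies are documented in its tutorials and publication.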

GridDER: Grid Detection and Evaluation in R
This open-source R package identifies occurrence records whose locations are plotted using gridded coordinates. Variation in the size and orientation of different grids masks the variability and accuracy of the centroid-based coordinates and introduces errors that can limit the efficacy of such records in research. By estimating the degree of environmental heterogeneity associated with the underlying grid system, GridDER enables users to make informed decisions about how to use and account for these records in their analyses. Both data curators and data users can benefit from the tool: the former, by reconstructing the original grid system where its details are no longer known; the latter, by flagging records for further curation or quantifying spatial and environmental uncertainties to better accommodate their application in studies.
video | GitHub
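As a rough illustration of what detecting gridded coordinates can mean, the sketch below checks how many coordinate values fall on a regular lattice for a few candidate cell sizes. It is a simplified heuristic written in Python for this article, not GridDER's algorithm or API. The function names, the candidate spacings and the known grid origin are all assumptions.

```python
def fraction_on_grid(values, spacing, origin=0.0, tol=1e-6):
    """Share of coordinate values that lie on the lattice origin + k * spacing."""
    if not values:
        return 0.0
    on_grid = 0
    for v in values:
        steps = (v - origin) / spacing
        # distance to the nearest lattice point, in coordinate units
        if abs(steps - round(steps)) * spacing <= tol:
            on_grid += 1
    return on_grid / len(values)


def best_spacing(values, candidates, origin=0.0, tol=1e-6):
    """Return (spacing, fraction) for the candidate that fits the most values.

    Finer spacings that evenly divide the true cell size fit equally well,
    so ties are resolved in favour of the coarser spacing.
    """
    scored = [(fraction_on_grid(values, s, origin, tol), s) for s in candidates]
    fraction, spacing = max(scored)
    return spacing, fraction


# Four centroids of a 0.5-degree grid plus one precisely geolocated record
longitudes = [10.25, 10.75, 11.25, 12.75, 10.3127]
print(best_spacing(longitudes, [0.1, 0.25, 0.5, 1.0], origin=0.25))
```

In this toy case the 0.5-degree spacing explains four of the five values, which suggests those four records carry an uncertainty of roughly one grid cell rather than the precision implied by their decimal places.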

Second Prize

GBIF LACS: GBIF Literature Abstract Classification System
This set of workflows uses a positive-unlabeled (PU) machine-learning model to explore the corpus of research uses of GBIF-mediated data to classify publications that contain ecological data on biological interactions between host and parasite species. The PU model carries out a topic analysis of GBIF abstracts already linked with GBIF datasets through the GBIF Literature API, grouping and filtering them based on the presence of information about host-parasite interactions. These papers receive classification scores and annotations that are then served in a web application enabling search, topic modelling, download and reuse of the newly revealed links based on users' topics of interest.
video | workflow diagramme
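For readers unfamiliar with PU learning, the sketch below shows one well-known variant, the score correction proposed by Elkan and Noto (2008). The announcement does not say which PU method GBIF LACS uses, so this is only an illustration of the general idea. It assumes some classifier has already been trained to separate known positive abstracts from unlabeled ones. The function names and example scores are hypothetical.

```python
def estimate_label_frequency(heldout_positive_scores):
    """Estimate c = P(labeled | positive) as the mean classifier score
    on known positives that were held out from training."""
    if not heldout_positive_scores:
        raise ValueError("need at least one held-out positive score")
    return sum(heldout_positive_scores) / len(heldout_positive_scores)


def adjust_scores(raw_scores, c):
    """Convert 'probability of being labeled' into an estimated
    'probability of being positive' by dividing by c, capped at 1."""
    return [min(1.0, s / c) for s in raw_scores]


# Scores a labeled-vs-unlabeled classifier gave to held-out known positives
heldout = [0.5, 0.25, 0.75]
c = estimate_label_frequency(heldout)

# Scores for unlabeled abstracts, rescaled to estimated positive probabilities
unlabeled = [0.1, 0.4, 0.6]
print(c, adjust_scores(unlabeled, c))
```

The correction matters because unlabeled examples are not guaranteed negatives: an abstract with a modest raw score may still be quite likely to describe a host-parasite interaction once the share of positives that were never labeled is taken into account.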

Jury for 2022 Ebbe Nielsen Challenge