Processes for validating and improving data quality prior to publication often require either separate tools or manual intervention. Besides consuming extra time and resources, these approaches can be difficult, if not impossible, when working with large datasets or publishing data in languages other than English.
In this project, SiB Colombia (in Spanish, the Colombian Biodiversity Information System) has worked with the U.S.-based collections collaboration VertNet to translate the interface and documentation for the Darwin Core Data Migrator Toolkit. The collaboration between these two GBIF Participants has resulted in a version of the tool that fills an important technical gap for the numerous Spanish-speaking staff across the GBIF community.
The Darwin Core Data Migrator Toolkit automatically generates data quality check and improvement reports on datasets, reflecting VertNet's long-standing experience in developing and automating routines to monitor and improve data quality. The team also hopes that the project can act as a pilot for future cooperation between stakeholders elsewhere in the world who are interested in procedures for improving biodiversity data quality early and often.
The project was carried out in two phases. During the first phase, a five-day workshop trained partners in the use of VertNet's Data Migrator Toolkit within the SiB Colombia data sharing workflow. Alongside this, the toolkit was enhanced and its documentation was translated into Spanish for broader use by the Spanish-speaking community. SiB Colombia staff also tested the tool on a dataset from the fish collection of the Humboldt Institute, and their feedback is expected to guide its further development.
The second phase of the project included test use of the toolkit, adjustments to the SiB Colombia workflow, and implementation of the toolkit on a series of datasets. Implementation activities included selecting and training an intern to assist with the work, republishing datasets, and creating vocabularies. As a result of the project, changes to the tool have made it more powerful for data cleaning across a broader range of datasets. During the migration process, the team identified seven datasets that had been published but not registered in GBIF, including data held by the Herbarium of the Universidad Tecnológica del Chocó, and republished them following data cleaning and quality checks. Resources and documents produced as part of the project are available below in both Spanish and English versions.
Now that the project is complete, the SiB Colombia team will continue its collaboration with VertNet to improve the tool, its implementation and its future use. Joint efforts to build Spanish versions of the available controlled vocabularies will continue with other Spanish-speaking countries in the region that have already shown interest in participating. The team will also explore expanding the use of the Migrator Toolkit, or its components, within the region, following discussions and potential future collaborations with several countries.