India harbors incredible biodiversity across its diverse biomes, yet charismatic vertebrate species receive much of the attention, research and funding. By contrast, invertebrates, despite their manifold diversity, remain understudied.
The current state of knowledge on spiders in India exemplifies this issue. More than 1,800 species across 63 families have been recorded in the country so far, yet the true number of species is many times higher and many recorded species require taxonomic revision. Even for common species, information on spatial distribution, seasonality and natural history is inaccurate or poorly studied, a knowledge gap that has been criticized in several works on the systematics and biogeography of spiders.
The India Biodiversity Portal (IBP) has been documenting the country's species diversity, and its data on spiders from citizen science programs is growing. At the same time, sightings of species reported on popular social media sites outnumber those on citizen science websites. Groups such as SpiderIndia have an established presence on sites such as Facebook and have aggregated more than 20,000 observations from about 8,500 spider enthusiasts. However, this data remains fragmented, unstructured and inaccessible to researchers and the general public alike.
To address these issues, this project will update the current status of spider taxonomy on IBP and establish a systematic workflow to enrich its occurrence data on spiders from popular social media groups like SpiderIndia. The semi-automated, replicable workflow will create occurrence records on IBP, enable their validation by taxonomic experts and make validated records available for publication to GBIF. The project team will organize workshops to demonstrate the workflow and recruit other social media communities to adopt it, mobilizing and liberating more biodiversity data to address spatial, temporal and ecological data gaps on India's arachnids.
The project began by establishing a project team that would collaborate across and within the respective project institutions. Project goals and criteria were then set, and a developer was assigned to undertake a background study, test multiple implementations, and choose appropriate algorithms and software. By midterm reporting, the team had researched Facebook APIs and other approaches for accessing the required data. Several methods for extracting scientific names, place names, and dates, as well as geocoding software solutions, had also been tested.
By final reporting, the project had implemented its objectives and achieved its goals. It designed and developed a reusable pipeline for data extraction, curation, validation, and publication as Darwin Core records, and made it available via a user interface on the India Biodiversity Portal. Listings of curated datasets are available via the portal, with each dataset clickable to view.
The curation interface allows the uploading of tabular data containing columns with text extracted from social media posts. This text is put through a pipeline that recognizes scientific names of spiders, place names within India, and the date of observation if present. These entities are then presented in a user-friendly interface for a team of curators to examine and curate. The curation effort leads to the creation of primary biodiversity data occurrence records. The interface is replicable and reusable for similar data extraction exercises.
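The entity recognition step described above can be illustrated with a minimal Python sketch. This is not the project's actual pipeline, whose algorithms are not described here. It assumes small hypothetical lookup tables (`KNOWN_SPECIES`, `KNOWN_PLACES`) in place of a full taxonomic checklist and gazetteer, and it only recognizes dates written in ISO format. The function name `extract_entities` and the output keys are likewise illustrative.

```python
import re
from datetime import date

# Hypothetical lookup tables for illustration only; a real pipeline would
# draw on a full spider checklist and a gazetteer of Indian place names.
KNOWN_SPECIES = {"Plexippus paykulli", "Hersilia savignyi", "Argiope pulchella"}
KNOWN_PLACES = {"Bengaluru", "Kolkata", "Pune"}

# Candidate binomials: a capitalized word followed by a lowercase word.
BINOMIAL = re.compile(r"\b[A-Z][a-z]+ [a-z]+\b")
# Assumption: only ISO-style dates (YYYY-MM-DD) are recognized in this sketch.
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_entities(text: str) -> dict:
    """Return the first known species, known place, and valid date found in text."""
    # Keep only candidate binomials that appear in the checklist.
    name = next((m for m in BINOMIAL.findall(text) if m in KNOWN_SPECIES), None)

    # Match place names as whole words so substrings of longer words are ignored.
    place = next(
        (p for p in sorted(KNOWN_PLACES)
         if re.search(rf"\b{re.escape(p)}\b", text)),
        None,
    )

    observed = None
    match = ISO_DATE.search(text)
    if match:
        try:
            # fromisoformat rejects impossible dates such as 2021-13-40.
            observed = date.fromisoformat(match.group(0)).isoformat()
        except ValueError:
            observed = None

    return {"scientificName": name, "locality": place, "eventDate": observed}


if __name__ == "__main__":
    post = "Found Plexippus paykulli on a wall in Pune, 2021-12-05"
    print(extract_entities(post))
```

Fields that cannot be recognized are returned as `None`, which mirrors the idea that curators examine and complete the extracted entities before a record is created.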
At the end of the project, its curation team had successfully curated over 20,000 such records, which were also examined and validated by a subject expert. Of these, the project generated and published to GBIF 15,055 records (approx. 70% of the curated number) in the dataset “Occurrence records of spiders mobilized through SpiderIndia citizen science initiative on Facebook”.
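To make the idea of a Darwin Core occurrence record concrete, the following Python sketch maps one curated observation to a handful of Darwin Core terms and serializes it as CSV. The selection of terms, the helper names `to_darwin_core` and `write_csv`, and the identifier `demo-1` are assumptions made for illustration; the fields in the published dataset may differ.

```python
import csv
import io

# A small subset of Darwin Core terms, chosen for illustration.
DWC_FIELDS = [
    "occurrenceID",
    "basisOfRecord",
    "scientificName",
    "eventDate",
    "countryCode",
    "locality",
]


def to_darwin_core(curated: dict, record_id: str) -> dict:
    """Map a curated observation (hypothetical structure) to Darwin Core terms."""
    return {
        "occurrenceID": record_id,
        # Sightings reported by observers are human observations.
        "basisOfRecord": "HumanObservation",
        "scientificName": curated["scientificName"],
        "eventDate": curated.get("eventDate") or "",
        # Assumption: all records in this sketch come from India.
        "countryCode": "IN",
        "locality": curated.get("locality") or "",
    }


def write_csv(rows: list) -> str:
    """Serialize Darwin Core rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=DWC_FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    curated = {
        "scientificName": "Plexippus paykulli",
        "eventDate": "2021-12-05",
        "locality": "Pune",
    }
    print(write_csv([to_darwin_core(curated, "demo-1")]))
```

A tabular file of this kind is one common way to package occurrence data for publication, although the sketch leaves out coordinates, licensing and dataset metadata.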
In addition, the project held two workshops: one in the early stages of the project and one at a later stage. The initial workshop, held in December 2021, was for members of the SpiderIndia community, who were sensitized to the objectives of data aggregation and encouraged to contribute. The later workshop, "First DiversityIndia Meet (2022)", was held in April 2022, after the workflow had been developed, and a dataset of flora and fauna observed during the workshop was published to GBIF. During both workshops, the workflow and pipeline for curating textual content from Facebook, along with its practicality, utility, and benefits, were explained to stakeholders. The workshops also enhanced the capacity of members to contribute data on Facebook in more standard formats, improved their understanding of informatics, and sensitized them towards open data. Participants were also provided with an overview of GBIF and its role in aggregating and serving open primary biodiversity data.
During implementation, the project team has been in discussion with other groups that are interested in extracting such data. As part of its post-project activities, the team envisages collaborating with these groups and enhancing the curation user interface to make it universal for all data sources.