Learning from—and with—the machines: taxon and trait recognition from herbarium scans

Study employs novel machine-learning methodology for taxon and morphological trait recognition from herbarium specimen images

GBIF-mediated data resources used: 830,408 specimen images
Citrullus lanatus by A.J.B. Chevalier. Via the Herbarium of the National Museum of Natural History, Paris (CC BY 4.0)

This feature is also published in the GBIF Science Review 2019, which highlights important and noteworthy examples of the use and reuse of GBIF-mediated data in research and policy.

Advances in artificial intelligence (AI) are rapidly enabling new, innovative uses across the biodiversity informatics community. The machine-learning technology used in the biodiversity observation network iNaturalist.org and in other tools offers now-familiar examples of how computers can use image recognition to improve real-time species identification across wide-ranging taxonomic groups.

The application of deep learning on natural history collections represents an even more recent development, but a German-Saudi team led by Sohaib Younis of Senckenberg Research Institute has tied together numerous strands of investigation that highlight the potential for machine learning to increase our understanding of life on Earth.

The research taps one of the largest online collections of labelled species images: the GBIF species occurrence index, which contains more than 45 million records associated with at least one image. While a growing number of these photos come from citizen scientists, often aided by automated species-recognition suggestions, about three quarters of them (more than 30 million) come from the world’s natural history collections.

Younis and his co-authors focused on these herbarium scans as a first step in crafting this thoughtfully designed study. Recognizing that existing taxon recognition systems currently work best for the taxa of North America and Europe, they chose to concentrate on the plant taxa of Africa, downloading 830,408 images for the 1,000 most-scanned species. This element of the approach could bring the added benefit of improving taxon recognition for a region that needs additional taxonomic resources and expertise.

"To our knowledge, this is the first study to deal with several traits in a large number of taxa, implying more abstraction in the concept of a trait and variability within a trait to be recognized."

By capitalizing on rapid improvements in pattern-recognition algorithms, the authors sought to expand their deep-learning analysis to go beyond taxonomic recognition and explore a deep-learning system’s ability to recognize morphological traits from herbarium scans. Extracting a subset of more than 150,000 images of 170 species for which trait data was available, this portion of the machine-based analysis settled on examining a limited set of 19 leaf traits (related to leaf arrangement, structure, form, margin and venation) believed to be identifiable in herbarium scans.

As in other machine-learning analyses, systematic pre-processing plays a critical role in preparations. Cropping and reducing the downloaded images to a standard size—here, just 292 by 196 pixels—prepares them for image analysis and eliminates elements like colour bars, labels and handwritten annotations that only serve as background noise to the machines.
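The pre-processing step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the fixed 5 percent margin crop is a hypothetical stand-in for whatever cropping the study applied, the function names are invented for illustration, and reading "292 by 196" as height by width (a portrait sheet) is an assumption. A real workflow would use an image library rather than nested lists.

```python
def crop_margins(pixels, fraction):
    """Trim the same fraction of rows and columns from every edge of a 2D grid.

    Hypothetical stand-in for removing colour bars and labels near the borders.
    """
    height, width = len(pixels), len(pixels[0])
    dy, dx = int(height * fraction), int(width * fraction)
    return [row[dx:width - dx] for row in pixels[dy:height - dy]]


def resize_nearest(pixels, out_height, out_width):
    """Nearest-neighbour resize of a 2D pixel grid to a fixed size."""
    in_height, in_width = len(pixels), len(pixels[0])
    return [
        [pixels[y * in_height // out_height][x * in_width // out_width]
         for x in range(out_width)]
        for y in range(out_height)
    ]


def preprocess(pixels, out_height=292, out_width=196, margin=0.05):
    """Crop the borders, then shrink to the standard size (assumed height x width)."""
    return resize_nearest(crop_margins(pixels, margin), out_height, out_width)


# Usage: a synthetic 600 x 400 greyscale "scan" stands in for a downloaded image.
scan = [[(r + c) % 256 for c in range(400)] for r in range(600)]
small = preprocess(scan)
print(len(small), len(small[0]))
```

Every image leaves this step with identical dimensions, which is what lets a neural network accept the whole collection as uniform input.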

In results the authors deem ‘promising’, taxon recognition from herbarium specimens proved ‘very efficient’, with 96.3 percent accuracy based on the analysis’s top five predictions. While the approach ‘on average also performed well for traits,’ there’s room for further study. For example, sample size fails to explain why the machines find it more difficult to identify generalized traits than taxon-specific patterns. Humans show the opposite pattern: they can recognize an individual trait much more easily than they can correctly identify a species.
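For readers unfamiliar with the metric, "top five" accuracy counts a prediction as correct when the true taxon appears anywhere among the model's five highest-scoring candidates. The sketch below shows one common way to compute it. The function name, the toy scores and the labels are illustrative assumptions, not data or code from the study.

```python
def top_k_accuracy(scores, true_labels, k=5):
    """Share of specimens whose true class is among the k highest-scoring classes.

    scores: one list of per-class scores per specimen.
    true_labels: the index of the correct class for each specimen.
    """
    hits = 0
    for row, truth in zip(scores, true_labels):
        # Rank class indices from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_labels)


# Toy example: three specimens, four candidate taxa (values are made up).
toy_scores = [
    [0.10, 0.60, 0.20, 0.10],
    [0.50, 0.30, 0.10, 0.10],
    [0.05, 0.05, 0.20, 0.70],
]
toy_labels = [2, 0, 0]
print(top_k_accuracy(toy_scores, toy_labels, k=2))
```

The metric is more forgiving than strict top-one accuracy, which suits a tool that offers a shortlist of candidates for a botanist to confirm.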

This last finding highlights the fact that unanticipated gaps remain between human and machine forms of understanding. Automating species and trait recognition from diverse collections offers an auspicious method for supporting and enriching the ongoing work of collections digitization, but cultural norms and practices tend to trail behind the capabilities of the latest technological advances. How can we best integrate the two, with their respective shortcomings?

The 2018 Montreal Declaration for a Responsible Development of Artificial Intelligence notes that “numbers cannot determine what has moral value, nor what is socially desirable.” Like this study’s research team, the biodiversity informatics community can expect to face choices about how best to design and engage deep-learning tools while striving toward ethically responsible and socially desirable outcomes.

Younis S, Weiland C, Hoehndorf R, Dressler S, Hickler T, Seeger B and Schmidt M (2018) Taxon and trait recognition from digitized herbarium specimens using deep convolutional neural networks. Botany Letters. Informa UK Limited 165(3–4): 377–383. Available at: https://doi.org/10.1080/23818107.2018.1446357
Author country/area: Germany, Saudi Arabia