2002 the First Ebbe Nielsen Prize Acceptance Lecture

Managing Species Names

Nozomi `James' Ytow

University of Tsukuba

Biodiversity, what and why

The goal of biodiversity study is the understanding of the variety of organisms in a region, which can be integrated to give global biodiversity. It encompasses the description of each species, measurement of its abundance and understanding of the mechanisms that made the organisms so diverse.

Biodiversity study originated from curiosity-driven scientific interest. Although it is unlikely to result in direct commercial benefits, it provides an infrastructure for resource monitoring, environmental management and impact studies, pollution control and other applied research.

Sustaining biodiversity is, however, important for the quality of human life. Four billion years have been spent creating organisms on the planet. The extant species on the earth reflect the history of these creations, and hence we cannot recover a species once it is extinct. To sustain biodiversity, we first need to know about it.

The number of known species (species that have been named) is estimated at 1.75 million, which is thought to be about 1% of all species, i.e. there are 100 million or more species without names. We need to start from a list of the named organisms in order to understand biodiversity.

`Catalogue of life'

Life is too diverse to be handled by a single person or single institution.

Knowledge about organisms has been accumulated on paper, including books, reports and scientific journals. A modern description of a new species requires, say, five pages. That amounts to an estimated 500 million pages for all species. Since these descriptions require good quality paper to reproduce photographs appropriately, every 500 pages will take 5 cm (2 inches) in thickness, resulting in a shelf-width of 50 km for all volumes. Few libraries have sufficient space to hold them. Printing a restricted number of copies means that each copy will be very expensive. Therefore, a paper-based system is cumbersome and we need an electronic `Catalogue of Life' instead. Using databases to store the information held in the electronic catalogue is sensible because we need a method of finding the organism of interest that is faster than browsing through 50 km of shelving. Because life is too diverse to be handled by a single person or single institution, the database would be better implemented as a federation of databases rather than a single, monolithic structure. This is precisely the proposed GBIF architecture.
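The shelf-width estimate can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the text (100 million species, five pages per description, 5 cm per 500 pages); the variable names are illustrative.

```python
# Back-of-envelope check of the shelf-width estimate.
# All figures are the rough estimates quoted in the text.
SPECIES = 100_000_000          # estimated number of species
PAGES_PER_SPECIES = 5          # pages per species description
PAGES_PER_VOLUME = 500         # pages bound into one volume
VOLUME_THICKNESS_CM = 5        # thickness of one 500-page volume
CM_PER_KM = 100_000

total_pages = SPECIES * PAGES_PER_SPECIES
volumes = total_pages // PAGES_PER_VOLUME
shelf_km = volumes * VOLUME_THICKNESS_CM / CM_PER_KM

print(f"{total_pages:,} pages in {volumes:,} volumes: {shelf_km:g} km of shelving")
```

Under these assumptions the estimate works out to one million 500-page volumes, which is where the 50 km figure comes from.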


Names play a key role in the proposed GBIF architecture, allowing access to information on each organism. The number of names to be handled is of the order of the number of known species, and it would rise to the order of the number of undescribed species. This means that the number of entries is comparable to the number of people in a moderately sized city, or in a country, respectively. We already have personal registration systems for such cities and countries, and hence one might erroneously assume that name registration of organisms is a trivial task. Personal registration is simpler and more straightforward than species (and higher taxa) name registration because people are individuals and thus countable. When one moves to a new city, or has a new member in the family, one goes to a local office to register. Registration is done by a member of the family (people living in the same place) to which the registrant clearly belongs. A mistake in a personal registration system can be detected by each registered person, by receiving a wrong invoice for example. Such self-correction is unlikely for species names and, even worse, taxonomists may disagree with each other on what constitutes the species. Taxonomists have to decide on the groups to be registered before registration can begin. This does not happen in a personal registration system, because it registers individuals whose assignment to a family is obvious. Although a name is just a string of characters, the name of a species is a more complicated concept than the name of an individual. We need to examine thoroughly the nature of a name in order to design a species name database that is more than a personal registration system. What is a name, then?

What is a name? [1]

A name can be represented by a string of characters that designates an object, either an individual or a concept composed of objects. A personal name is the former case, while species or higher taxa name is the latter case. Species names represent membership of a group and it is important that individuals belong to one and only one group, so it is important that species names are unique. Names are used to reduce the effort required to designate an object; without a name, we need to describe what we want to designate every time.

Homonyms

A person has her/his own name on a personal registration system, but the name is not necessarily unique. There can be other people with the same name. This situation, multiple objects having the same name, is known as homonymy in taxonomy. Homonyms are not a problem in individual registration systems, but they are in a register of unique names. Homonyms can be detected by comparison of species grouped into higher taxa if such taxa are known.
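As a minimal sketch of the comparison described above, a homonym candidate can be flagged when the same name string turns up under more than one higher taxon. The record shape, the function name `find_homonyms`, and the placeholder names (`Aus`, `Bus`, `Family X`, `Family Y`) are assumptions made for illustration, not part of any existing database.

```python
from collections import defaultdict

def find_homonyms(records):
    """Return names that appear under more than one higher taxon.

    `records` is an iterable of (name, higher_taxon) pairs. This shape
    is hypothetical; a real catalogue would carry far more detail.
    """
    parents = defaultdict(set)
    for name, higher_taxon in records:
        parents[name].add(higher_taxon)
    # A name placed in several higher taxa is a homonym candidate.
    return {name: sorted(taxa) for name, taxa in parents.items() if len(taxa) > 1}

# Placeholder names, not real taxa.
records = [
    ("Aus", "Family X"),
    ("Aus", "Family Y"),
    ("Bus", "Family X"),
]
homonyms = find_homonyms(records)
print(homonyms)
```

The check only works when the higher taxa are known, as the text notes; a name whose placement is unrecorded cannot be compared this way.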

Synonyms

The inverse situation, a single object with multiple names, is known as synonymy in taxonomy. It is comparable to an alias. A person can choose their name, which might be an alias such as a stage name or a pen name: it is their choice and generally only affects them. Species names, however, must be unique. We classify things into groups in order to reduce the cost of remembering them. The ways of grouping depend on both the purpose of the classification and the understanding of the groups. Hierarchy, composed by recursive grouping, was proposed for living things by Carl Linnaeus, a Swedish curator, and has been used for centuries. The actual hierarchy proposed by Linnaeus has been modified to reflect improvement in the understanding of each taxon. Improved understanding of taxa results in different groupings and hence changes of names. Tracking names is a challenging task for databases, but taxonomists have little difficulty in tracing the history of improvement if the relevant publications are accessible. Names and publications are used as communication vehicles once taxonomists have reconstructed the history in their minds. Taxon name databases need to mimic taxonomists' communication. Nomencurator, which will be described in the following sections, is a data model based on taxonomists' communication.

Taxonomists' Communication

When a taxonomist finds a new taxon, she or he publishes a paper describing the taxon with its name. A reader of the publication, who may be reading it long after the death of the author, finds the printed name. The description in the publication is an abstract form of the original taxon concept that the author had in mind when the manuscript was written. The description is not a taxon concept itself. There is no direct link between the taxon concepts of the author and the reader; hence there is no way to verify that these two concepts are identical, even if they are considering the same physical specimens.

Three layer structure in Taxonomy

Taxonomic naming deconvolves into three encapsulating components: the instances (specimens or lower taxa) are encapsulated within a taxon concept (circumscription) which unites them and which is, in turn, encapsulated by the name itself, which provides a name tag. This encapsulation can be nested to construct hierarchies, or taxonomic views. The same taxon name can designate different taxon concepts embedded into different hierarchies, and the same taxon concept can be designated by different names. Separating the name from the taxon concept makes the database robust to the name changes that accompany improved taxonomic understanding, provided that another structure records the relationships between different usages of names.
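The three layers can be sketched as a small data structure. This is an illustration only, not the actual Nomencurator schema: the class names `TaxonConcept` and `NameUsage`, the helper functions, and the specimen and publication labels are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaxonConcept:
    """A circumscription: the set of instances (specimens or lower taxa) it unites."""
    instances: frozenset

@dataclass(frozen=True)
class NameUsage:
    """A name tag attached to a taxon concept in one publication."""
    name: str
    concept: TaxonConcept
    publication: str

def concepts_for(name, usages):
    """All taxon concepts that a given name has been used to designate."""
    return {u.concept for u in usages if u.name == name}

def names_for(concept, usages):
    """All names that have been used to designate a given taxon concept."""
    return {u.name for u in usages if u.concept == concept}

# Hypothetical specimens s1..s3 and publications.
broad = TaxonConcept(frozenset({"s1", "s2", "s3"}))
narrow = TaxonConcept(frozenset({"s1", "s2"}))

usages = [
    NameUsage("A", broad, "first publication"),
    NameUsage("A", narrow, "second publication"),
    NameUsage("B", broad, "third publication"),
]

print(len(concepts_for("A", usages)))     # one name, two concepts
print(sorted(names_for(broad, usages)))   # one concept, two names
```

Keeping the name as a separate record from the concept means a renamed taxon adds a new `NameUsage` rather than overwriting the concept it points to.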

Three dimensional representation of taxonomic view development

The three layers are drawn out into a temporal span to indicate the relationship between the names A, B and C. The information link from a specific publication is shown by a broken line. Existence is shown by solid horizontal lines. Instances are shown by a bundle of horizontal lines. It is important to note that there are no links between the taxon, name and instance except through publications. Taxon A was described first. Taxon B was a later description made in ignorance of taxon A, so is erroneous (a synonym caused by clerical error). Taxon C was described later still with new instances and proposed removing some of the instances from A, so re-defining the taxon concept of A. This represents multiple taxonomic opinions. A later reviser took the view that the division of A and C was unjustified and declared C to be a junior synonym of A, but with an emended definition of the taxon concept A. After this only a single taxon thread (A) continues, whereas in the name layer, the names B and C continue to exist as junior synonyms of A. The view of the later reviser is captured as an annotation statement on previous views. Note that the figure shows only one opinion. Other opinions, with different arrangements of taxon lines, are also possible.
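One way to picture the reviser's annotation is as a statement, tied to a publication, that relates one name to another. The sketch below is a simplification under stated assumptions: the `Annotation` class and `accepted_name` function are hypothetical, and for brevity both synonymy statements (B and C under A) are recorded in a single view.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    """A statement made in one publication about earlier name usages."""
    publication: str
    relation: str   # e.g. "junior synonym of"
    subject: str
    target: str

def accepted_name(name, annotations):
    """Follow 'junior synonym of' statements within one taxonomic view."""
    synonym_of = {a.subject: a.target for a in annotations
                  if a.relation == "junior synonym of"}
    seen = set()
    # Stop when no further statement applies, or if statements loop.
    while name in synonym_of and name not in seen:
        seen.add(name)
        name = synonym_of[name]
    return name

# The later reviser's view (hypothetical publication label).
reviser_view = [
    Annotation("later revision", "junior synonym of", "B", "A"),
    Annotation("later revision", "junior synonym of", "C", "A"),
]

print(accepted_name("C", reviser_view))  # resolved under the reviser's view
print(accepted_name("C", []))            # without that view, C stands alone
```

Because the synonymy lives in a list of annotations rather than in the names themselves, a different opinion is simply a different list, and the names B and C remain in the database either way.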

A proposed GBIF implementation

Taxa can be divided into two categories: first, taxa under a stable, well studied hierarchy and second, taxa under hierarchies that remain to be examined. The Species 2000 architecture (Bisby & Smith, 1996) is a feasible way to integrate existing databases covering taxa of the former group. There would be, however, serious difficulties in this approach for the latter group because reassignment of these taxa even at phylum level is not uncommon. This situation arises from a shortage of taxonomists compared to the amount of biodiversity to be investigated. To cover these `handicapped' taxa, we need to provide a method for compiling the fragmented data available from the research literature. Hierarchies used in these works are not necessarily agreed within the taxonomic community, therefore the compiler must allow multiple taxonomic views. The compiler can be used as a name broker between databases. It also provides a tool that gives taxonomists a motivation to contribute to biodiversity databases. The roles of Nomencurator in GBIF include, therefore,

The database sounds useful but...

Since no database works without data, we need to consider who will provide it. It is obvious that taxonomists can provide input for taxonomic databases including catalogues of names, which will be a core part of GBIF. Those who have data might say `I know that, so why do I need a database?' or `It will not contribute to my publication output'. Such reactions are commonplace, especially in taxa where there are many undescribed species compared to the number of taxonomists; much work has to be done before database contribution. To overcome this impediment, the database must be a useful tool for experts at this early stage, in order to attract them. It should not require too much effort from contributors, must yield immediate, tangible benefit to the contributor, and must work in the contributor's favourite computer environment. We are developing such a user interface that works not only on Windows, but also on Mac OS and Unix.

[Figure: the user interface running on Windows, MacOS and Unix with X]

What is biodiversity informatics?

Biodiversity informatics is generally regarded as a combination of taxonomy and computer science. We need contributions from taxonomists, but they need to benefit from the collaboration. This implies that we also need to consider mechanisms that encourage taxonomists to join, not as unilateral contributors feeding a data sink, but in a mutual collaboration. This is an area covered by sociology.

Acknowledgement

This work was done in collaboration with David McL. Roberts at the Natural History Museum, UK, and David R. Morse at the Open University, UK, with the following financial support:

I'd like to express my deepest thanks to my wife Natsuko, who advised me always to keep curiosity in my mind. Indeed, curiosity is the driving force of taxonomy, and hence biodiversity informatics.


[1] Refer to philosophers, e.g. Wittgenstein, Quine or Kripke, for details on naming.