How to cite GBIF data
White paper - Draft
1. Introduction
GBIF integrates millions of data records from hundreds of
heterogeneous sources (resources) and providers. Users of that data
are required by the GBIF Data Use Agreement [Annex 1] to recognise
the efforts of those who make the data available. Making data
available involves a value chain of intellectual property rights (IPR)
in which each party has contributed something and should be
acknowledged as appropriate. Those who make data available should get
scientific credit for doing so.
In general, citations are meant to be used in publications as
references to other information resources. They should facilitate
accessing these resources, checking the facts, and reproducing
materials and experiments. Citations of GBIF data are no
different. However, technical challenges arise because datasets
served on the Internet often grow, records can change, and data
providers can withdraw their data at any time.
For many reasons, including citation, it should be possible to
identify individual data records and arbitrary datasets. GBIF is
interested in globally unique identifiers (GUIDs), comparable to
GenBank's accession numbers, that would be used in the biodiversity
literature in a similar fashion. However, a solution for the GUIDs is
still being worked on and is not available at this writing. A
mechanism is discussed here that would give citations a locally
unique reference within the GBIF data portal.
This document defines the formats and structures for citing GBIF
data. It is supported by two short guidelines for occurrence and names
data, respectively. The paper also discusses some technical issues that
must be considered.
Goals in this design include the following:
- Consistency in the form of citation regardless of the
circumstances behind the selection of the data.
- Avoidance of reformatting what the providers have made available
(which would involve too much guesswork).
- Independence of whatever unique identifier scheme may be
adopted later by GBIF.
- Compatibility with the machine interfaces of the Open Archives
Initiative Protocol for Metadata Harvesting.
- Compatibility with the existing data sharing agreements of GBIF.
2. Format of the citations
Scientific citations normally take the form of Author(s), Year, Title,
Reference, Publisher. This might work for data as well, but
reconstructing such a citation from very heterogeneous data sources is
likely to fail. Therefore a simplified form is sought below.
The classic form of citation maps to the components of the GBIF
information model as follows:
- GBIF Data Portal www.gbif.net is not semantically an "author",
but an "editor" or "compiler". Such entities can be in first position
in traditional references.
- The sourced Data Provider is clearly the publisher.
- Names of the Resources correspond roughly to titles, but not
exactly. The title is rather a phrase like "Data records 0000, 0001,
..., from <resource name>".
2.1. Individual record
For occurrence data the form is as follows:
<GBIF citation>. <Datetime>. <Provider citation>,
<Resource citation>, <Record citation>.
Where several individual records are cited, the last element can be
repeated:
<GBIF citation>. <Datetime>. <Provider citation>,
<Resource citation> (Records: <Record citation>,
<Record citation>, …).
For names data, the form is similar but includes the name. The
taxonomist who made the revision is also recognised when known:
<GBIF citation>. <Datetime>. <Name>. <Provider citation>,
<Resource citation>, <Taxonomist name>.
2.2. Data from a single resource
A citation of all data from a resource follows the form above, but
without specifying any records:
<GBIF citation>. <Datetime>. <Provider citation>,
<Resource citation>.
2.3. Data from an entire provider
A citation of all data from a provider follows the form above, but
without specifying any resources or records:
<GBIF citation>. <Datetime>. <Provider citation>.
2.4. Set of records from many resources and many providers (dataset)
Unlike the forms above, which could also be retrieved independently
from a provider, this is the result of an integrative query to the
GBIF data portal. The result, when stored, would look like this (in
HTML or XML):
<GBIF citation>. <Datetime>.
<Provider citation>, <Resource citation>
(, Records: <Record citation>, <Record citation>, …);
<Provider citation>, <Resource citation>
(, Records: <Record citation>, <Record citation>, …);
…
Such a citation can get quite long and may not be publishable. In
those cases, and where the exact dataset must remain available over a
longer period, it would be desirable to store the query and the result
and reference them as one entity. Such a reference would simply be
<GBIF citation>. <Datetime>. Archived dataset <GBIF identifier>.
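The forms in sections 2.1 to 2.4 differ only in how many trailing
elements are present, so a single routine could produce all of them.
The following Python is a minimal sketch under that assumption, not a
definitive implementation. The names SourceCitation and
format_citation are hypothetical and not part of any GBIF software,
the date is passed in as ready-made text, and the optional
"(, Records: …)" element of section 2.4 is rendered in the
"(Records: …)" style of section 2.1.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Static text, as proposed for the <GBIF citation> element.
GBIF_CITATION = "GBIF Data Portal, www.gbif.net"


@dataclass
class SourceCitation:
    """One provider, optionally narrowed to a resource and to records."""
    provider: str
    resource: Optional[str] = None
    records: List[str] = field(default_factory=list)

    def render(self) -> str:
        parts = [self.provider]
        if self.resource:
            parts.append(self.resource)
        text = ", ".join(parts)
        if len(self.records) == 1:
            # Single record: appended as the last comma-separated element.
            text += ", " + self.records[0]
        elif self.records:
            # Several records: grouped in a parenthesised list.
            text += " (Records: " + ", ".join(self.records) + ")"
        return text


def format_citation(datetime_text: str, sources: List[SourceCitation]) -> str:
    """Render '<GBIF citation>. <Datetime>. <source>; <source>; ...'."""
    body = "; ".join(source.render() for source in sources)
    return f"{GBIF_CITATION}. {datetime_text}. {body}."


if __name__ == "__main__":
    print(format_citation(
        "2005-03-31",
        [SourceCitation("Provider A", "Resource B", ["123", "456"]),
         SourceCitation("Provider C")],
    ))
```

Keeping provider, resource and records as separate fields means the
same data can later be rendered in a different style without parsing
citation strings.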
3. Individual elements
The question then is what needs to be included in <GBIF citation>,
<Datetime>, <Provider citation>, <Resource citation>,
<Record citation>, and <GBIF identifier>. The elements <Name> and
<Taxonomist name> are used as written and are not elaborated further
below.
3.1. GBIF citation
This is a simple static description of the fact that these data were
accessed through GBIF, like "GBIF Data Portal, www.gbif.net".
3.2. Datetime
This is simply a timestamp in ISO 8601 format, with or without time
of day, recording when the query was issued. For individual records
it is instead the current value of the DateLastModified field. For
example:
2005-03-31T21:57:00Z
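As an illustration, a timestamp in this form could be produced with
Python's standard datetime module. This is a sketch only: the function
name citation_datetime is hypothetical, and it assumes the caller
passes a timezone-aware value (either the time the query was issued
or a record's DateLastModified).

```python
from datetime import datetime, timezone


def citation_datetime(moment: datetime, with_time: bool = True) -> str:
    """Format a timezone-aware moment as ISO 8601 in UTC.

    With with_time=False only the calendar date is returned.
    """
    utc = moment.astimezone(timezone.utc)
    if with_time:
        return utc.strftime("%Y-%m-%dT%H:%M:%SZ")
    return utc.strftime("%Y-%m-%d")


if __name__ == "__main__":
    issued = datetime(2005, 3, 31, 21, 57, 0, tzinfo=timezone.utc)
    print(citation_datetime(issued))                   # 2005-03-31T21:57:00Z
    print(citation_datetime(issued, with_time=False))  # 2005-03-31
```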
3.3. Provider citation
This is the name of the provider as retrieved by DiGIR or BioCASe:
3.4. Resource citation
The resource name can in most cases be used as the main Title.
This is typically the name of a collection or database.
We should note here that data provider and resource metadata do
contain the names of custodians who could possibly be used as
authors. If an Author identity were attainable, it would be the
resource “administrative” contacts where these are specified. If
there are no “administrative” contacts, “other” contacts would be
used, with “technical” contacts as the final default. The names here
can have either surname first or last, and could probably be
formatted correctly in most cases. However, these contacts do not
semantically correspond to Authors but rather to Editors or similar
compilers. Therefore, they are not included in the citation. Authors
are included only in citations of names data.
3.5. Record citation
There are elements in the data standards which are intended to
guarantee uniqueness. For Darwin Core this means that we construct
the citation from the InstitutionCode, CollectionCode and
CatalogNumber elements, for example: (Records: Institution A,
Collection B, Catalogue numbers ABC, DEF, GHI, JKL; Institution C,
Collection D, Catalogue numbers MNO, PQR).
As the institution and collection are normally identified as the data
provider and resource, respectively, only the catalogue number needs
to be given as the record citation. For names data, the name itself
would be given as the title.
If the number of records is too large for the targeted publication,
or individual identification of the records is not necessary, only
the number of records may be mentioned. If all records from a
provider or resource are included, the record citation can be omitted.
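The grouping described above can be sketched in Python as follows.
This is an illustration under stated assumptions rather than a
prescribed algorithm: the function name record_citation and the
max_listed threshold (the point at which catalogue numbers are
replaced by a record count) are hypothetical, and each record is
assumed to arrive as an (InstitutionCode, CollectionCode,
CatalogNumber) tuple.

```python
from typing import Dict, Iterable, List, Tuple

# Each record is (InstitutionCode, CollectionCode, CatalogNumber).
Record = Tuple[str, str, str]


def record_citation(records: Iterable[Record], max_listed: int = 10) -> str:
    """Group catalogue numbers by institution and collection.

    Groups with more than max_listed records are summarised by count.
    """
    groups: Dict[Tuple[str, str], List[str]] = {}
    for institution, collection, catalog_number in records:
        groups.setdefault((institution, collection), []).append(catalog_number)

    parts = []
    for (institution, collection), numbers in groups.items():
        if len(numbers) > max_listed:
            noun = "record" if len(numbers) == 1 else "records"
            detail = f"{len(numbers)} {noun}"
        else:
            detail = "Catalogue numbers " + ", ".join(numbers)
        parts.append(f"{institution}, {collection}, {detail}")
    return "(Records: " + "; ".join(parts) + ")"


if __name__ == "__main__":
    sample = [
        ("Institution A", "Collection B", "ABC"),
        ("Institution A", "Collection B", "DEF"),
        ("Institution C", "Collection D", "MNO"),
    ]
    print(record_citation(sample))
```

Groups appear in the order in which they are first encountered, so
the caller controls the ordering of institutions in the citation.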
3.6. GBIF identifier
In this section we discuss the issues related to storing and citing
archived datasets. Archiving GBIF data is a controversial issue as it
potentially removes from data providers their ability to withdraw
data. This would be very problematic in cases where sensitive data on
endangered species were accidentally shared. Therefore we must note
that no decision on building such an archiving mechanism has been
made.
However, large arbitrarily constructed datasets are being used for
analysis and there is a need to store, document, and cite them. Such
citations can become very large and unpublishable using the other
mechanisms discussed above. Even if the original query parameters are
reused, the resulting dataset is not likely to be identical to what
was obtained when the query was first issued. Archiving the dataset
can be done in many ways, but storing the incoming and resulting XML
streams may be the simplest solution.
Archived datasets would be referenced using the <GBIF identifier>.
In order to be compatible with the Open Archives Initiative, the
format of the <GBIF identifier> must conform to the URI (Uniform
Resource Identifier) syntax. We call it GBIF_URI. It must be made
clear that this is not a globally/universally unique identifier but a
local one.
GBIF_URI must be simple and short, but should make it possible to
perform the same request again (even if it is a query which may
return different results in the future). It must also be future-proof
because users will be publishing these URIs in printed publications,
so it must always return something sensible when users request it
(even if “sensible” means a clear error message). These URIs probably
cannot reasonably be created for each page view of the GBIF data
portal, but it should be possible to generate them through a specific
request, i.e., a button that the user can push or an XML request.
That event would then create a persistent URI, store it in a
database, and return it to the user or requester.
A persistent GBIF_URI based on that model might look like these
examples:
http://www.gbif.net/record/1234567890
http://www.gbif.net/resource/123456
http://www.gbif.net/dataset/12345
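The minting behaviour described in this section (a URI is created
only on explicit request, stored, and always answers with something
sensible) could be prototyped roughly as below. Everything here is an
assumption for illustration: the class UriRegistry, its methods, the
sequential numbering, and the in-memory dictionary standing in for
the database are hypothetical, since no such service has been
decided on.

```python
import itertools
from typing import Dict, Tuple

BASE_URI = "http://www.gbif.net"  # host as used in the example URIs
KINDS = ("record", "resource", "dataset")


class UriRegistry:
    """In-memory stand-in for the database of persistent URIs."""

    def __init__(self) -> None:
        self._counter = itertools.count(1)
        self._stored: Dict[str, str] = {}

    def mint(self, kind: str, request: str) -> str:
        """Create and store a URI for an explicit user or XML request."""
        if kind not in KINDS:
            raise ValueError(f"unknown kind: {kind}")
        uri = f"{BASE_URI}/{kind}/{next(self._counter)}"
        self._stored[uri] = request
        return uri

    def resolve(self, uri: str) -> Tuple[bool, str]:
        """Return (found, payload); payload is a clear message if unknown."""
        if uri in self._stored:
            return True, self._stored[uri]
        return False, f"No archived request is registered under {uri}"


if __name__ == "__main__":
    registry = UriRegistry()
    # "genus=Puma" is a made-up query string for demonstration only.
    uri = registry.mint("dataset", "genus=Puma")
    print(uri, registry.resolve(uri))
```

Storing the original request alongside the URI is what allows the
same query to be performed again later, even if the result differs.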
4. Full examples
Now we can give some combined examples of static citations:
GBIF Data Portal, www.gbif.net. 2005-03-31. Museum of Vertebrate
Zoology, Terrestrial Vertebrate Specimens, Record numbers 20045,
25678, 31098; University of Washington Burke Museum, 120 records.
GBIF Data Portal, www.gbif.net. 2005-03-31.
Catalogue of Life Partnership, Integrated Taxonomic Information System.
GBIF Data Portal, www.gbif.net. 2005-03-31. Field Museum of Natural
History, 10 records; Museum of Vertebrate Zoology, 204 records;
Royal Ontario Museum, 1 record; University of Washington Burke Museum,
36 records; University of Turku, WWF Peru, 10 records.
The static citations given in the examples can be tedious to construct
manually, and could therefore best be generated by appropriate tools.
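As a rough idea of what such a tool might do, the Python sketch below
turns a list of provider names (one entry per retrieved record) into
a count-based citation like the last example. The function name
summary_citation and the alphabetical ordering of providers are
assumptions made for illustration only.

```python
from collections import Counter
from typing import Iterable

GBIF_CITATION = "GBIF Data Portal, www.gbif.net"


def summary_citation(date_text: str, providers: Iterable[str]) -> str:
    """Cite each provider once, with the number of records it supplied."""
    counts = Counter(providers)
    parts = []
    for provider in sorted(counts):  # assumed ordering: alphabetical
        n = counts[provider]
        noun = "record" if n == 1 else "records"
        parts.append(f"{provider}, {n} {noun}")
    return f"{GBIF_CITATION}. {date_text}. " + "; ".join(parts) + "."


if __name__ == "__main__":
    rows = ["Museum of Vertebrate Zoology"] * 2 + ["Royal Ontario Museum"]
    print(summary_citation("2005-03-31", rows))
```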
5. Recommendation for DiGIR Citation metadata field
DiGIR metadata includes a Citation-field for the resources. Using
the Citation-field would be a good alternative way of handling
citations, but it will take some standardisation work. At this
writing the use of that field is very inconsistent across providers
and resources. To alleviate that problem, we discuss here what would
be the best use of that field.
First, the above formats cannot be applied to the Citation-field, as
a user can access the data provider directly without going via the
GBIF Data Portal. Second, the provider and resource owners know
exactly what roles various people have had in producing these data.
Therefore an appropriate form for this citation is much closer to a
traditional citation with authors, year, title and publisher, e.g.
“Smith, A., Turner, B., 2003-2005. Institution X, Collection Y,
Taxon Y Database”. If this is available, with a flag denoting
well-formedness, the GBIF data portal could forward it and the
constructed citations could be dropped.
GBIF data validation services could include a review of this text as
part of the process by which each new provider is helped to connect,
and existing providers should be advised on this as part of the
regular process of giving them feedback.
6. Discussion
There are not many examples on other portals of how citations of
primary data can be handled. Most other portals just cite the portal
itself. We think such a citation model would not fulfil the
requirements of the GBIF Data Sharing and Data Use Agreements. The
agreements are quite clear on the need to recognise the efforts of
the data providers. Data providers are often just technical bodies
who publish the data, but do not own the data in the same way that
the resource (= database, collection) custodians may. This point may
have to be revisited in the agreements.
The purpose of citations in general is to enable the reader/consumer to
retrieve the source of information in question. In the situation of
live databases, this poses challenges. GBIF data providers are not
expected to keep the data available indefinitely. Quite the contrary,
they can withdraw it at any time. Static references as presented
above therefore do not always enable retrieval.
A dynamic reference like the <GBIF identifier> can potentially
enable access to the original material stored under that reference.
However, guaranteeing the persistence of such references and the
underlying material has to be planned carefully. Issues of sensitive
data and data provider authority have to be clarified and agreed on.
These are clearly conflicting requirements, which may require
revising the GBIF Data Sharing Agreement. This paper does not assume
that such a service is yet in place, or will be built by GBIF. Such a
service could perhaps be offered by external archiving services.
In other communities there are examples of direct references to
information sources. In particular, electronic publishing of
scientific articles has addressed the issue of how to identify
electronic content. One result is the Open Archives Initiative
Protocol for Metadata Harvesting (OAI-PMH), a standard for retrieving
metadata from digital document repositories (Lagoze & al. 2004).
Adding an XML interface to the GBIF Data Portal that implements an
OAI-PMH repository of citations is attractive, as it could enable
handling datasets in the same way as publications, and hence pave the
way for getting scientific merit for publishing data.
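To give an impression of what such an interface would look like from
the client side, the Python sketch below builds an OAI-PMH GetRecord
request for one archived citation. The GetRecord verb, its identifier
and metadataPrefix arguments, and the oai_dc metadata format come
from the OAI-PMH specification. The endpoint address and the function
name get_record_url are hypothetical, as no GBIF OAI-PMH repository
is defined in this paper.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; this paper does not define one.
OAI_BASE_URL = "http://www.gbif.net/oai"


def get_record_url(identifier: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH GetRecord request URL for one identifier."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{OAI_BASE_URL}?{query}"


if __name__ == "__main__":
    print(get_record_url("http://www.gbif.net/dataset/12345"))
```

Using the GBIF_URI itself as the OAI-PMH identifier is one possible
design; it would keep a single identifier for both human-readable
citations and machine harvesting.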
References
Lagoze, C., Sompel, H. van de, Nelson, M. & Warner, S. 2004.
The Open Archives Initiative Protocol for Metadata Harvesting.
Protocol Version 2.0 of 2002-06-14. Document Version
2004/10/12T15:31:00Z.
http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm
3. In order to make attribution of use for owners of the data
possible, the identifier of ownership of data must be retained with
every data record.
4. Users must publicly acknowledge, in conjunction with the use
of the data, the data providers whose biodiversity data they have
used. Data providers may require additional attribution of
specific collections within their institution.
Version 0.2, Draft, Hannu Saarenmaa 2005-02-10
Version 0.3, Draft, Hannu Saarenmaa 2005-03-08, based on input by
Donald Hobern
Version 0.4, Draft, Hannu Saarenmaa 2005-03-29, based on Open Archives
Initiative materials
Version 0.5, Draft, Hannu Saarenmaa 2005-03-31, based on comments by
Donald Hobern, Jim Edwards, Per de Bjørn, Meredith Lane
Version 0.7, Draft, Hannu Saarenmaa 2005-04-01, based on comments from
the staff
Version 0.8, Draft, Hannu Saarenmaa 2005-04-08, based on comments from
the staff
Version 0.9, Draft, Hannu Saarenmaa 2005-04-08, grammatical corrections
by Meredith Lane
Version 0.11, Draft, Hannu Saarenmaa 2005-04-13, comments by Jim Edwards