
How to cite GBIF data

White paper - Draft

1. Introduction

GBIF integrates millions of data records from hundreds of heterogeneous sources (resources) and providers.  Users of these data are required by the GBIF Data Use Agreement [Annex 1] to recognise the efforts of those who make the data available.  Making data available involves a value chain of intellectual property rights (IPR) in which each party has contributed something and should be acknowledged as appropriate.  Those who make data available should get scientific credit for doing so.

In general, citations are meant to be used in publications as references to other information resources.  They should facilitate accessing these resources, checking the facts, and reproducing materials and experiments.  Citations of GBIF data are no different.  However, technical challenges arise from the fact that datasets served on the Internet often grow, records can change, and data providers can withdraw their data at any time.

For many reasons, including citation, it should be possible to identify individual data records and arbitrary datasets. GBIF is interested in globally unique identifiers (GUIDs), comparable to GenBank's accession numbers, that would be used in the biodiversity literature in a similar fashion. However, a solution for GUIDs is still being worked on and is not available at this writing. A mechanism is discussed here that would give citations a locally unique reference within the GBIF data portal.

This document defines the formats and structures for citing GBIF data. It is supported by two short guidelines for occurrence and names data, respectively. The paper also discusses some technical issues that must be considered.

Goals in this design include the following:
  1. Consistency in the form of citation regardless of the circumstances behind the selection of the data.
  2. Avoidance of reformatting what the providers have made available (too much guesswork).
  3. Independence of what particular unique identifier scheme may be adopted later by GBIF.
  4. Compatibility with the machine interfaces of the Open Archives Initiative Protocol for Metadata Harvesting.
  5. Compatibility with the existing data sharing agreements of GBIF.

2. Format of the citations

Scientific citations normally take the form of Author(s), Year, Title, Reference, Publisher. This might work for data as well, but reconstructing such a citation from very heterogeneous data sources is likely to fail. Therefore a simplified form is proposed below.

Mapping between the above classic form of citations and the components of the GBIF information model is as follows:
  • The GBIF Data Portal (www.gbif.net) is not semantically an "author", but an "editor" or "compiler".  Such entities can appear in first position in traditional references.
  • The sourced Data Provider is clearly the publisher. 
  • Names of the Resources might match titles, but not quite.  Title is rather a phrase like "Data records 0000, 0001, ..., from <resource name>".

2.1. Individual record

For occurrence data the form is like this.

<GBIF citation>. <Datetime>. <Provider citation>, <Resource citation>, <Record citation>.

In cases where several records are concerned individually, the last element can be repeated.

<GBIF citation>. <Datetime>. <Provider citation>, <Resource citation> (Records: <Record citation>, <Record citation>, …).

For names data, the form is similar but includes the name.  Also the taxonomist who made the revision is recognised when that is known.

<GBIF citation>. <Datetime>. <Name>. <Provider citation>, <Resource citation>, <Taxonomist name>.

2.2. Data from a single resource

All data from a resource are cited as above, but without specifying any records.

<GBIF citation>. <Datetime>. <Provider citation>, <Resource citation>.

2.3. Data from an entire provider

All data from a provider are cited as above, but without specifying any resources or records.

<GBIF citation>. <Datetime>. <Provider citation>.
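
As an illustration, the forms in sections 2.1 to 2.3 can be assembled by one small routine.  The following Python sketch is not part of any GBIF software; the function name `format_citation` and its parameters are assumptions made for this example only.

```python
GBIF_CITATION = "GBIF Data Portal, www.gbif.net"


def format_citation(datetime_str, provider, resource=None, records=()):
    """Build a citation for a whole provider, a resource, or individual records.

    Hypothetical helper: the argument names are illustrative, not a GBIF API.
    """
    body = provider
    if resource is not None:
        body += ", " + resource
    if records:
        if resource is None:
            raise ValueError("records can only be cited within a resource")
        if len(records) == 1:
            # Section 2.1, single record: the record follows the resource.
            body += ", " + records[0]
        else:
            # Section 2.1, several records: the last element is repeated.
            body += " (Records: " + ", ".join(records) + ")"
    return f"{GBIF_CITATION}. {datetime_str}. {body}."


print(format_citation("2005-03-31", "Museum of Vertebrate Zoology",
                      "Terrestrial Vertebrate Specimens", ["20045", "25678"]))
```

Leaving out `records` gives the resource-level form of section 2.2, and leaving out `resource` as well gives the provider-level form of section 2.3.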

2.4. Set of records from many resources and many providers (dataset)

Unlike the cases above, which could also be retrieved directly from a provider, this is the result of an integrative query to the GBIF data portal.  The stored result would look like this (in HTML or XML):

<GBIF citation>. <Datetime>.
<Provider citation>, <Resource citation> (, Records: <Record citation>, <Record citation>, …);
<Provider citation>, <Resource citation> (, Records: <Record citation>, <Record citation>, …);


Such a citation can get quite long and may not necessarily be publishable. In those cases, and where the exact dataset must be available over a longer period, it would be desirable that the query and the result be stored and referenced as one entity.  Such a reference would simply be

<GBIF citation>. <Datetime>.  Archived dataset <GBIF identifier>.

3.  Individual elements

The question then is what needs to be included in <GBIF citation>, <Datetime>, <Provider citation>, <Resource citation>, <Record citation>, and <GBIF identifier>.  The elements <Name> and <Taxonomist name> are as written and not elaborated further below.

3.1. GBIF citation

This is a simple static description of the fact that these data were accessed through GBIF,  like "GBIF Data Portal, www.gbif.net".

3.2.  Datetime

This is simply a timestamp in ISO 8601 format of when the query was issued, with or without the time of day.  For individual records, the current value of the DateLastModified field is used instead.  For example:

2005-03-31T21:57:00Z
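
A timestamp of this shape can be produced with standard library tools.  The sketch below is a minimal example in Python; the function name `query_timestamp` and its `with_time` switch are assumptions introduced here, not part of the GBIF portal.

```python
from datetime import datetime, timezone


def query_timestamp(moment=None, with_time=True):
    """Return an ISO 8601 timestamp in UTC, with or without the time of day.

    `moment` should be a timezone-aware datetime; the current time is used
    when it is omitted.
    """
    if moment is None:
        moment = datetime.now(timezone.utc)
    moment = moment.astimezone(timezone.utc)
    if with_time:
        return moment.strftime("%Y-%m-%dT%H:%M:%SZ")
    return moment.strftime("%Y-%m-%d")


issued = datetime(2005, 3, 31, 21, 57, 0, tzinfo=timezone.utc)
print(query_timestamp(issued))                   # full timestamp
print(query_timestamp(issued, with_time=False))  # date only
```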

3.3. Provider citation

This is the name of the provider as retrieved by DiGIR or BioCASe.

3.4. Resource citation

The resource name can in most cases be used as the main Title.  This is typically the name of a collection or database.


We should note here that data provider and resource metadata do contain the names of their custodians, which could possibly be used as authors.  If an Author identity were to be derived, it would come from the resource “administrative” contacts where these are specified.  If there are no “administrative” contacts, “other” contacts would be used, defaulting to “technical” contacts otherwise.  The names here can have either surname first or last, and probably could be formatted correctly in most cases.  However, these contacts semantically do not correspond to Authors but rather to Editors or similar compilers.  Therefore, these are not included in the citation. Authors are included only in citations of names data.

3.5. Record citation

There are elements in the data standards which are intended to guarantee uniqueness.  For Darwin Core this will mean that we construct the citation from the InstitutionCode, CollectionCode and CatalogNumber elements, like (Records: Institution A, Collection B, Catalogue numbers ABC, DEF, GHI, JKL; Institution C, Collection D, Catalogue numbers MNO, PQR)


As the institution and collection are normally identified as the data provider and resource, respectively, only the catalog number needs to be given as the record citation.  For names data, the name itself would be given as title.

If the number of records is too large for the targeted publication and/or individual identification of the records is not necessary, only the number of records may be mentioned.  If all records from a provider or resource are included, the record citation can be omitted.
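
The rules above for the record citation can be sketched as follows.  This is a hedged Python example: the function name `record_citation` and the `max_listed` threshold are assumptions made for illustration, since the paper leaves the choice of what counts as "too large" to the author of the publication.

```python
def record_citation(catalog_numbers, all_records=False, max_listed=10):
    """Return the record part of a citation for one resource.

    - all records of the resource included: nothing needs to be cited;
    - more than `max_listed` records (hypothetical cut-off): only the count;
    - otherwise: the catalog numbers themselves.
    """
    count = len(catalog_numbers)
    if all_records or count == 0:
        return ""
    if count > max_listed:
        return f"{count} records"
    label = "Record number" if count == 1 else "Record numbers"
    return f"{label} " + ", ".join(catalog_numbers)


print(record_citation(["20045", "25678", "31098"]))
print(record_citation([str(n) for n in range(120)]))
```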

3.6. GBIF identifier

In this section we discuss the issues related to storing and citing archived datasets.  Archiving GBIF data is a controversial issue, as it potentially takes away the data providers' ability to withdraw data.  This would be very problematic in cases where sensitive data on endangered species had been shared accidentally.  Therefore we must note that no decision on building such an archiving mechanism has been made.

However, large arbitrarily constructed datasets are being used for analysis and there is a need to store, document, and cite them. Such citations can become very large and unpublishable using the other mechanisms discussed above. Even using the original query parameters, the resulting dataset is not likely to be identical to what was obtained at the time when the query was issued.  Archiving the dataset can be done in many ways, but storing the incoming and resulting XML stream may be the simplest solution.

Archived datasets would be referenced using the <GBIF identifier>. In order to be compatible with the Open Archives Initiative, the format of the <GBIF identifier> must conform to URI (Uniform Resource Identifier) syntax.  We call it GBIF_URI.  It must be made clear that this is not a globally/universally unique identifier but a local one.

GBIF_URI must be simple and short, but should make it possible to perform the same request again (even if it is a query which may return different results in the future).  It must also be future-proof, because users will be publishing these URIs in printed publications: a URI must always return something sensible when requested, even if “sensible” means a clear error message.  These URIs probably cannot reasonably be created for each page view of the GBIF data portal, but it should be possible to generate them with a specific request, i.e., a button that the user can push or an XML request.  That event would then create and store a persistent URI in a database and return it to the user or requester.

A persistent GBIF_URI based on that model might look like these examples:

http://www.gbif.net/record/1234567890
http://www.gbif.net/resource/123456
http://www.gbif.net/dataset/12345
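
The mint-and-resolve behaviour described in this section can be sketched in a few lines.  The Python example below is only an illustration under stated assumptions: the class `UriRegistry`, its methods, the in-memory dictionary standing in for a database, and the sequential numbering are all hypothetical, and no such service is assumed to exist.

```python
import itertools


class UriRegistry:
    """Toy registry that creates persistent URIs on explicit request."""

    BASE = "http://www.gbif.net"
    KINDS = ("record", "resource", "dataset")

    def __init__(self):
        self._store = {}                    # stands in for a database table
        self._counter = itertools.count(1)  # hypothetical numbering scheme

    def mint(self, kind, request):
        """Store the request and return a new URI for it."""
        if kind not in self.KINDS:
            raise ValueError(f"unknown kind: {kind}")
        uri = f"{self.BASE}/{kind}/{next(self._counter)}"
        self._store[uri] = request
        return uri

    def resolve(self, uri):
        """Always answer sensibly, even if only with a clear error message."""
        request = self._store.get(uri)
        if request is None:
            return {"status": "error", "message": f"No stored request for {uri}"}
        return {"status": "ok", "request": request}


registry = UriRegistry()
uri = registry.mint("dataset", "original query parameters")
print(uri, registry.resolve(uri))
```

The point of the design is that a URI is created only when the user asks for one, and that resolving an unknown or withdrawn URI returns an explicit error rather than nothing.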

4.  Full examples

Now we can give some combined examples of static citations:

GBIF Data Portal, www.gbif.net. 2005-03-31.  Museum of Vertebrate Zoology, Terrestrial Vertebrate Specimens, Record numbers 20045, 25678, 31098; University of Washington Burke Museum, 120 records.

GBIF Data Portal, www.gbif.net. 2005-03-31. Catalogue of Life Partnership, Integrated Taxonomic Information System.

GBIF Data Portal, www.gbif.net. 2005-03-31. Field Museum of Natural History, 10 records; Museum of Vertebrate Zoology, 204 records;  Royal Ontario Museum, 1 record; University of Washington Burke Museum, 36 records; University of Turku, WWF Peru, 10 records.

The static citations given in the examples can be tedious to construct manually, and would therefore best be generated by appropriate tools.
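
A tool of that kind could be quite small.  The following Python sketch builds a count-per-provider citation in the style of the third example above; the function name `dataset_citation` and the input shape (a list of provider and catalog number pairs) are assumptions chosen for the illustration.

```python
from collections import Counter


def dataset_citation(date_str, records):
    """Summarise a mixed dataset as '<provider>, <n> records; ...'.

    `records` is a list of (provider_name, catalog_number) pairs; this input
    format is hypothetical and chosen only to keep the example short.
    """
    counts = Counter(provider for provider, _ in records)
    parts = []
    for provider in sorted(counts):
        n = counts[provider]
        noun = "record" if n == 1 else "records"
        parts.append(f"{provider}, {n} {noun}")
    return f"GBIF Data Portal, www.gbif.net. {date_str}. " + "; ".join(parts) + "."


sample = [
    ("Royal Ontario Museum", "A1"),
    ("Field Museum of Natural History", "B1"),
    ("Field Museum of Natural History", "B2"),
]
print(dataset_citation("2005-03-31", sample))
```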

5.  Recommendation for DiGIR Citation metadata field

DiGIR metadata includes a Citation-field for the resources.  Using the Citation-field would be a good alternative way of handling citations, but it will take some standardisation work. At this writing the use of that field is very inconsistent across providers and resources. To alleviate that problem, we discuss here what would be the best use of that field.

First, the above formats cannot be applied to the Citation-field, as a user can access the data provider directly without going via the GBIF Data Portal.  Second, the provider and resource owners know exactly what roles various people have played in producing these data.

Therefore an appropriate form for this citation is much closer to a traditional citation with authors, year, title and publisher.  E.g. “Smith, A., Turner, B., 2003-2005.  Institution X, Collection Y, Taxon Y Database”.  If this is available, with a flag denoting well-formedness, the GBIF data portal could forward it and the constructed citations could be dropped.

GBIF data validation services could include a review of this text as part of the process by which each new provider is helped to connect, and existing providers should be advised on this as part of the regular process of giving them feedback.

6. Discussion

There are not many examples on other portals of how citations of primary data can be handled.  Most other portals just cite the portal itself.  We think such a citation model would not fulfil the requirements of the GBIF Data Sharing and Data Use Agreements.  The agreements are quite clear on the need to recognise the efforts of the data providers.  Data providers are often just technical bodies who publish the data, but do not own the data the same way that the resource (=database, collection) custodians may.  This point may have to be revisited in the agreements.

The purpose of citations in general is to enable the reader/consumer to retrieve the source of information in question. In the situation of live databases, this poses challenges.  It is not expected that GBIF data providers keep the data available indefinitely.  Quite the contrary, they can withdraw it at any time.  Static references as presented above therefore do not always enable retrieval.

A dynamic reference like the <GBIF identifier> can potentially enable access to original material stored under such a reference.  However, guaranteeing the persistence of such references and the underlying material has to be planned carefully.  Issues of sensitive data and data provider authority have to be clarified and agreed on.  These are clearly conflicting requirements which may require revising the GBIF Data Sharing Agreement.  This paper does not assume that such a service is yet in place, or will be built by GBIF.  Such a service could perhaps be offered by external archiving services.

In other communities there are examples of direct references to information sources.  In particular, electronic publishing of scientific articles has touched on the issue of how to identify electronic content.  The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a standard for retrieving metadata from digital document repositories (Lagoze & al. 2004).  Adding an XML interface to the GBIF Data Portal that implements an OAI-PMH repository of citations is attractive, as it could enable handling datasets the same way as publications, and hence pave the way for getting scientific merit for publishing data.
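
To make the idea concrete, the sketch below shows how a client might form an OAI-PMH GetRecord request for one archived citation.  The `verb`, `identifier` and `metadataPrefix` arguments and the `oai_dc` prefix come from the OAI-PMH specification; the base URL `http://www.gbif.net/oai` and the use of a GBIF_URI as the item identifier are assumptions for illustration, as no such interface is described as existing.

```python
from urllib.parse import urlencode


def oai_get_record_url(base_url, identifier, metadata_prefix="oai_dc"):
    """Build the URL of an OAI-PMH GetRecord request.

    `base_url` is a hypothetical repository endpoint; `identifier` is the
    identifier of the item whose metadata is requested.
    """
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


print(oai_get_record_url("http://www.gbif.net/oai",
                         "http://www.gbif.net/dataset/12345"))
```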

References

Lagoze, C., Sompel, H. van de, Nelson, M. & Warner, S. 2004.  The Open Archives Initiative Protocol for Metadata Harvesting.  Protocol Version 2.0 of 2002-06-14.  Document Version 2004/10/12T15:31:00Z  http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm



Annex 1.  Excerpt from GBIF Data Use Agreement.

3.  In order to make attribution of use for owners of the data possible, the identifier of ownership of data must be retained with every data record.

4.  Users must publicly acknowledge, in conjunction with the use of the data, the data providers whose biodiversity data they have used.  Data providers may require additional attribution of specific collections within their institution.



Version 0.2, Draft, Hannu Saarenmaa 2005-02-10
Version 0.3, Draft, Hannu Saarenmaa 2005-03-08, based on input by Donald Hobern
Version 0.4, Draft, Hannu Saarenmaa 2005-03-29, based on Open Archives Initiative materials
Version 0.5, Draft, Hannu Saarenmaa 2005-03-31, based on comments by Donald Hobern, Jim Edwards, Per de Bjørn, Meredith Lane
Version 0.7, Draft, Hannu Saarenmaa 2005-04-01, based on comments from the staff
Version 0.8, Draft, Hannu Saarenmaa 2005-04-08, based on comments from the staff
Version 0.9, Draft, Hannu Saarenmaa 2005-04-08, grammatical corrections by Meredith Lane
Version 0.11, Draft, Hannu Saarenmaa 2005-04-13, comments by Jim Edwards