Auto-generated dataset citations
Can you provide an example?
Where does this text come from? I published this dataset, and that is not the citation text that I provided!
The data standards that GBIF supports, and that institutions use to publish their data through GBIF, include a number of metadata elements, that is, descriptive information about the dataset as a whole. Data-publishing institutions can provide unstructured text here recommending how the dataset should be cited, and some do choose to provide citation information in free-text format.
In practice, however, free-text citations create problems that make it difficult to support a consistent user experience and to encourage consistent citation of data use: genuine citations in uncommon or nonstandard formats, placeholder or other meaningless text, or empty fields.
For several years, GBIF.org displayed two alternative options: the publisher-provided text, and an auto-generated text that provided a usable alternative in the event that the other was not. Although the aim was to encourage good data-citation practices, user feedback made it clear that this approach caused more confusion than guidance.
In 2015, GBIF’s Integrated Publishing Toolkit (IPT) began encouraging publishers to document specific author and contributor roles in the EML metadata that could provide a more consistent format for displaying dataset citations. This more consistent, standardized approach has now enabled us to auto-generate all text citations for datasets.
For publishers who use other tools, protocols and data formats that still capture a free-text string, the auto-generated citations may now override your specifically composed citations. Please contact firstname.lastname@example.org to explore possible solutions if this raises concerns or causes problems.
How is this text auto-generated?
By default, the auto-generated citation contains the following information:
- Name(s) of the dataset’s originating author(s), formatted to show surname and initial(s), e.g. ‘Andersen AA’ for Anders Asger Andersen
- Name(s) of the dataset’s metadata author(s), if any are registered, but only if an originating author is also named
- Publication year of the dataset
- Dataset title, as registered and shown on the dataset page
- Dataset version, if supplied with the metadata
- Name of the publishing organization, as shown on the dataset page
- Type of the dataset, as registered and shown on the dataset page
- Dataset DOI, as registered or assigned during registration with GBIF
- A reference to GBIF.org as the source
- Date of accession
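The name format in the first item (surname followed by initials) can be illustrated with a short Python sketch. The function name `format_author` and the handling of hyphenated given names are assumptions made for illustration, not GBIF's actual implementation.

```python
def format_author(given_names: str, surname: str) -> str:
    """Return the surname followed by the initials of the given names.

    Hypothetical helper: it only illustrates the 'Andersen AA' pattern
    described in the list above.
    """
    # Assumption: hyphenated given names contribute one initial per part.
    parts = given_names.replace("-", " ").split()
    initials = "".join(part[0].upper() for part in parts)
    # strip() drops the trailing space when no given names are supplied.
    return f"{surname} {initials}".strip()


print(format_author("Anders Asger", "Andersen"))  # Andersen AA
```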
What happens if I don’t name any authors, or list only the metadata author without any originating authors?
In cases where no authors are named, or where only metadata authors are named without any originating authors, the citation text will start with the name of the publishing institution, followed by the publication year and the other elements.
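The fallback rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the class `DatasetMetadata`, the functions `citation_lead` and `build_citation`, the punctuation, the ordering of elements and the de-duplication of names are all hypothetical, and the DOI in the usage note is a placeholder. The formal description in the GBIF GitHub repository remains the authoritative source.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class DatasetMetadata:
    # Hypothetical container for the elements listed in this FAQ.
    title: str
    publisher: str
    publication_year: int
    dataset_type: str
    doi: str
    originating_authors: List[str] = field(default_factory=list)
    metadata_authors: List[str] = field(default_factory=list)
    version: Optional[str] = None


def citation_lead(meta: DatasetMetadata) -> str:
    """Pick what the citation starts with: authors or the publisher."""
    if meta.originating_authors:
        names = list(meta.originating_authors)
        # Metadata authors are only added when an originating author exists.
        # Assumption: a person listed in both roles is named once.
        names += [n for n in meta.metadata_authors if n not in names]
        return ", ".join(names)
    # No originating authors: fall back to the publishing institution.
    return meta.publisher


def build_citation(meta: DatasetMetadata, accessed: date) -> str:
    """Join the elements in an assumed order with assumed punctuation."""
    parts = [f"{citation_lead(meta)} ({meta.publication_year}).", f"{meta.title}."]
    if meta.version:
        parts.append(f"Version {meta.version}.")
    parts.append(f"{meta.publisher}.")
    parts.append(f"{meta.dataset_type} dataset")
    parts.append(f"https://doi.org/{meta.doi}")
    parts.append(f"accessed via GBIF.org on {accessed.isoformat()}.")
    return " ".join(parts)
```

With only a metadata author registered, `citation_lead` returns the publisher name; once an originating author is added, the citation starts with the author names instead.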
Where can I find additional information about how GBIF automatically generates text citations for datasets?
You’ll find a slightly more formal description of the logic behind automatic citation generation in this GBIF GitHub repository.
How long does it take GBIF to start (re)indexing my dataset?
The answer depends on how long GBIF’s indexing queue is, how big your dataset is and whether GBIF’s indexing service is turned on.
Normally it will take between 5 and 60 minutes for GBIF to start indexing your dataset. Once started, indexing a large dataset (e.g. one with several million records) can take several hours, so please be patient. If you believe GBIF failed to index your dataset successfully, please submit feedback directly via GBIF.org or email the GBIF Helpdesk to investigate what happened. Please see below if you are interested in finding out why GBIF may not have (re)indexed your dataset.
Why hasn’t GBIF (re)indexed my dataset yet?
Occasionally, GBIF turns off its indexing service for maintenance. This is the most common reason why datasets aren’t indexed as quickly as expected.
If your dataset has been successfully reindexed, but the records weren’t actually updated, you may be affected by this bug in the crawling service.
How often does GBIF reindex my dataset?
GBIF automatically attempts to reindex a registered dataset each time its registration is updated. This happens each time the dataset gets republished via the IPT. Note, however, GBIF doesn’t reindex the same dataset more than once every five days.
To cater to datasets not published using the IPT, GBIF automatically attempts to reindex all registered datasets every 7 days. Note, however, GBIF will only reindex the dataset if its last published date has changed since the last time it was indexed.
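The two rules above can be expressed as a small Python sketch. The function names and arguments are hypothetical and only model the behaviour described in this answer; they do not reflect the actual code of GBIF's crawling service.

```python
from datetime import datetime, timedelta
from typing import Optional

# From the text: the same dataset is not reindexed more than once every five days.
MIN_INTERVAL = timedelta(days=5)


def reindex_on_registration_update(now: datetime,
                                   last_indexed: Optional[datetime]) -> bool:
    """Rule 1: a registration update (e.g. republishing via the IPT)
    triggers reindexing unless the dataset was indexed too recently."""
    if last_indexed is None:
        return True  # never indexed before
    return now - last_indexed >= MIN_INTERVAL


def reindex_on_periodic_check(last_published: datetime,
                              published_when_last_indexed: Optional[datetime]) -> bool:
    """Rule 2: the periodic check (every 7 days) reindexes only if the
    last published date changed since the dataset was last indexed."""
    if published_when_last_indexed is None:
        return True  # never indexed before
    return last_published != published_when_last_indexed
```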
What type of datasets does GBIF index/support?
GBIF currently supports four classes of datasets. However, it currently indexes only species occurrence records, which can be provided as either core records or extension records. In sampling-event datasets, species occurrences in extension records are augmented with information from their core event record wherever possible.
Who is producing these charts and why?
The GBIF Secretariat is producing information on data mobilization trends observed on the GBIF network. Showing trends on the data mobilized by the GBIF network can help with planning data mobilization efforts, showing the results of previous investments in digitization or data mobilization, or in highlighting issues to be targeted to improve the fitness-for-use of the data.
Are these reports available as a download?
Not currently, but we plan to add this feature in 2015. If you are interested in being able to download these reports, please use the feedback button on the side of this page to explain how you would wish to see this feature implemented.
Can I reproduce these charts in my national reports?
Yes; however, we encourage you to review them before doing so.
How often will the charts be updated?
The charts show data trends from the end of 2007 until recent weeks and will be recalculated periodically, approximately every quarter.
How can I contribute to this work?
This project is being developed openly on the GitHub project site. While some data preparation stages require access to the GBIF index and Hadoop infrastructure, other stages run using R and can be developed remotely. Please contact us if you would like to contribute to the work.
Understanding the trends and improving the charts
How have these trends been produced?
The project is documented on the GitHub project site. Approximately four historical views of the GBIF index per year were restored (totalling approximately 8 billion records in May 2014), and the raw data were reprocessed using the latest quality control routines and taxonomic backbone. Various scripts were then used to digest the records into smaller views, which are processed in R to produce the charts.
How do they take into account changes in the GBIF taxonomic backbone over time?
All data are processed to the latest GBIF backbone taxonomy, to ensure that species counts are comparable over time.
In some charts I can see that the amount of mobilized data sometimes goes down before going up again. Why might that be?
This is due to the removal of datasets from GBIF. This might occur if a publisher wishes to remove their data, but is often due to the removal of datasets that were inadvertently published twice (duplicate datasets).
I can see strange peaks in the charts showing trends in the temporality of the data. What might be the cause of this?
The charts may reveal patterns that represent biases in data collection (seasonality, public holidays) or potential issues in data management (disproportionate numbers of records shown for the first or last days in the year or each month or week). Such issues may arise at various stages in data processing and require further investigation.
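One simple way to look for the first-day-of-month pattern mentioned above is to measure what share of records fall on day 1. The Python sketch below is illustrative only; the function name `first_day_share` is hypothetical and this is not the method used to produce the charts.

```python
from datetime import date
from typing import Iterable


def first_day_share(dates: Iterable[date]) -> float:
    """Return the fraction of records dated on the first day of a month."""
    dates = list(dates)
    if not dates:
        return 0.0
    return sum(1 for d in dates if d.day == 1) / len(dates)
```

If record dates were spread evenly over a 365-day year, about 12/365 (roughly 3.3%) would fall on the first day of a month. A much larger share suggests that incomplete dates may have been defaulted to day 1 somewhere in data processing, which is worth investigating with the data publisher.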
I have suggestions to improve the clarity of the charts included here - what should I do?
Please use the feedback button on the side of the page to log any suggestions.
Why are these charts presented as static images and not something more dynamic?
This is a first iteration of work. Future versions could be more interactive, although one has to consider if a PDF view or simple images for (e.g.) annual reports are required. As an open project, anyone with interest in improving the data visualization is welcome to get involved. Please contact us.
How did you select the colours used in these charts and can we improve them?
The colour palettes come from colorbrewer2.org, and an attempt was made to select colours that would be colour-blind safe. It is difficult to find suitable colour palettes that work on all charts (e.g. global and country specific) and input would be greatly appreciated to help improve these.
Which technologies were involved in this work?
The original unprocessed data resides in Hadoop. Hive is used for the SQL processing on the Hadoop data using custom UDFs wrapping the GBIF core processing libraries (Java). Hive is used to digest the data into CSV tables. All other processing is in R.
How can I get involved
What can I do to improve the completeness of records available through GBIF?
A complete record is here defined as having species identification, valid coordinates and the full date of collection or observation. The charts show that some records published to GBIF are incomplete. There can be different reasons for this, which include deliberately excluding coordinates for sensitive data, or the full date of collection not being available for some historic collections. However, for many datasets, the completeness of records could be improved by working with the data publisher concerned. All GBIF Nodes are encouraged to consider how they can work with the data publishers in their networks to improve the completeness of the records, which will contribute to making these data fit for a broader range of uses.
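The definition of a complete record can be sketched as a check in Python. The field names (`species`, `decimalLatitude`, `decimalLongitude`, `year`, `month`, `day`) and the interpretation of "valid coordinates" as numeric values within the usual latitude and longitude ranges are assumptions for this illustration; the checks GBIF actually applies may differ.

```python
from datetime import date
from typing import Any, Mapping, Optional


def _to_float(value: Any) -> Optional[float]:
    """Convert a value to float, or return None if that is not possible."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def has_valid_coordinates(record: Mapping[str, Any]) -> bool:
    lat = _to_float(record.get("decimalLatitude"))
    lon = _to_float(record.get("decimalLongitude"))
    # Assumption: 'valid' means both values are present and within range.
    return (lat is not None and lon is not None
            and -90 <= lat <= 90 and -180 <= lon <= 180)


def has_full_date(record: Mapping[str, Any]) -> bool:
    # A full date needs year, month and day that form a real calendar date.
    try:
        date(int(record["year"]), int(record["month"]), int(record["day"]))
        return True
    except (KeyError, TypeError, ValueError):
        return False


def is_complete(record: Mapping[str, Any]) -> bool:
    has_species = bool(str(record.get("species") or "").strip())
    return has_species and has_valid_coordinates(record) and has_full_date(record)
```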
I have suggestions for other interesting charts that I would like to see on GBIF.org. Can I request more charts?
In future GBIF work programmes, it may be possible to extend this work further to include other interesting trends around data mobilization in GBIF. Please use the feedback button to provide any additional ideas or comments on the current charts, or consider contributing to the project.
What would it take for me to produce these charts myself in a different style or language?
The scripts used for this work are maintained in the GitHub project site. GBIF can provide the underlying digested data in the form of a collection of CSV files which can be used in various applications to produce the charts. For those wishing to do far more detailed analysis than GBIF is able to do globally, the processed source records can be provided for subsets of the data (e.g. all records for Spain). Please note that the Secretariat has limited resources but will do all it can to support others wishing to further the analysis. Please also note that the volumes of data can be very large: the data cover approximately 8 billion records (May 2014).
How do I provide feedback?
Please use the feedback button at the top right of each page or contact us by email.
What is inside a download zip file?
When you request a download in the GBIF data portal, you will receive a Darwin Core Archive file (DwC-A). This is the most widely used data exchange file format in the GBIF network. To open it, you will need a zip programme installed on your computer (practically all modern operating systems include support for this kind of file). Just double-click on it to see its contents. Inside the zip file, you will find the following components:
- An occurrence data file, ‘occurrence.txt’: A tab-separated data file that contains all the species occurrences included in your download.
- A citations file, ‘citation.txt’: A tab-separated data file that includes all the citation strings for the sources of the data you downloaded.
- A use rights file, ‘rights.txt’: A tab-separated data file that includes any additional use conditions or rights defined by the data publishers responsible for the data you downloaded.
- A metadata file, ‘metadata.xml’: This xml file stores all the information describing the contents of the downloaded dataset.
- A descriptor metadata file, ‘meta.xml’: This xml file describes the structure of the Darwin Core Archive so the whole archive can be processed automatically by software.
To open the different files, please follow these instructions:
- For tab-separated data files ‘*.txt’: These can be opened by any spreadsheet processor (e.g. MS Excel, OpenOffice Calc) or desktop database software (e.g. MS Access). Just open one of the suggested programmes and drag & drop the file into it, or import data by choosing ‘tab delimited’, CSV, ‘text file’ or any similar option. If you are asked to select an ‘encoding standard’ or ‘character set’ manually, please choose ‘Unicode, UTF-8’. NOTE: do not try to double-click on the files, as .txt is a very generic extension and will probably have a generic text viewer associated with it.
- For xml files ‘*.xml’: These files are usually designed to be machine processed. If you are curious about their content, they can normally be interpreted by web browsers: just drag & drop the file into a web browser window. You will require special software if for any reason you want to edit these files manually.
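For those who prefer a script to a spreadsheet, the occurrence file can also be read directly from the archive. The Python sketch below uses only the standard library. It assumes that the first row of ‘occurrence.txt’ holds the column names and that the file is UTF-8 encoded and tab-separated as described above; the function name `read_occurrences` is hypothetical.

```python
import csv
import io
import zipfile
from itertools import islice


def read_occurrences(archive_path, limit=5):
    """Return the first `limit` rows of occurrence.txt as dictionaries."""
    with zipfile.ZipFile(archive_path) as archive:
        with archive.open("occurrence.txt") as raw:
            # Decode as UTF-8; newline="" lets the csv module handle line ends.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            # Tab-separated values, with quote characters treated as plain text.
            reader = csv.DictReader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
            # Assumption: the first row contains the column names.
            return list(islice(reader, limit))
```

Calling `read_occurrences("my-download.zip")` (the file name is a placeholder) returns a list of dictionaries keyed by column name, without unpacking the archive to disk.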
Why can’t I open the zip file I downloaded?
Downloads bigger than four gigabytes (4 GB) need to be compressed using an extension of the original zip format called ZIP64. Not all operating systems support this extension natively; MS Windows XP and Mac OS X are among those that do not. Please make sure that the software you are using to decompress the file is compatible with the ZIP64 extension.
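If no ZIP64-compatible tool is at hand, a scripting language can help: Python's standard `zipfile` module can read ZIP64 archives. The sketch below is one possible approach; the function name `extract_download` is hypothetical.

```python
import zipfile


def extract_download(archive_path, target_dir):
    """Extract a downloaded archive and return the names of its members."""
    if not zipfile.is_zipfile(archive_path):
        raise ValueError(f"{archive_path} does not look like a zip archive")
    # zipfile reads archives that use the ZIP64 extension transparently.
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()
```

Make sure the target directory has enough free disk space, since the uncompressed files can be considerably larger than the download itself.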