Data trends - frequently asked questions

Please contact us with any questions about the data trends.


Who is producing these charts and why?
The GBIF Secretariat is producing information on data mobilization trends observed on the GBIF network. Showing trends on the data mobilized by the GBIF network can help with planning data mobilization efforts, showing the results of previous investments in digitization or data mobilization, or in highlighting issues to be targeted to improve the fitness-for-use of the data.
Can I reproduce these charts in my national reports?
Yes. However, we encourage you to review them before doing so.
How often will the charts be updated?
The charts show data trends from the end of 2007 until recent weeks and will be recalculated periodically, approximately once per quarter.
How can I contribute to this work?
This project is being developed openly on the GitHub project site. While some data preparation stages require access to the GBIF index and Hadoop infrastructure, other stages run using R and can be developed remotely. Please contact us if you would like to contribute to the work.

Understanding the trends and improving the charts

How have these trends been produced?
The project is documented on the GitHub project site. Approximately 4 historical views per year of the GBIF index were restored (totalling approximately 8 billion records in May 2014), and the raw data were reprocessed using the latest quality control and taxonomic backbone. Various scripts were then used to digest the records into smaller views, which were then processed in R to produce the charts.
How do they take into account changes in the GBIF taxonomic backbone over time?
All data are processed to the latest GBIF backbone taxonomy, to ensure that species counts are comparable over time.
In some charts I can see that the amount of mobilized data sometimes goes down before going up again. Why might that be?
This is due to the removal of datasets from GBIF. This might occur if a publisher wishes to remove their data, but is often due to the removal of datasets that were inadvertently published twice (duplicate datasets).
I can see strange peaks in the charts showing trends in the temporality of the data. What might be the cause of this?
The charts may reveal patterns that represent biases in data collection (seasonality, public holidays) or potential issues in data management (disproportionate numbers of records shown for the first or last days in the year or each month or week). Such issues may arise at various stages in data processing and require further investigation.
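As a rough illustration of how one such data management issue might be spotted, the sketch below compares the share of records dated on the first day of a month with the share expected if dates were spread evenly over the year. The function names and the threshold factor of 3 are assumptions made for this example and are not part of the GBIF scripts.

```python
from datetime import date, timedelta

# Under an even spread, 12 of the 365 days in a non-leap year are a "1st".
EXPECTED_FIRST_OF_MONTH_SHARE = 12 / 365


def first_of_month_share(dates):
    """Return the fraction of the given dates that fall on day 1 of a month."""
    dates = list(dates)
    if not dates:
        return 0.0
    return sum(1 for d in dates if d.day == 1) / len(dates)


def looks_suspicious(dates, factor=3.0):
    """Flag a set of dates whose first-of-month share is far above an even spread.

    `factor` is an arbitrary illustrative threshold, not a GBIF rule.
    """
    return first_of_month_share(dates) > factor * EXPECTED_FIRST_OF_MONTH_SHARE


# One record for every day of 2013: an even spread, so nothing is flagged.
even_spread = [date(2013, 1, 1) + timedelta(days=i) for i in range(365)]

# Half of these records sit on the 1st of a month, as can happen when an
# unknown day is silently stored as "01".
skewed = [date(2013, m, 1) for m in range(1, 6)] + [date(2013, m, 17) for m in range(1, 6)]
```

A flag from a check like this only indicates that a dataset deserves a closer look; genuine collecting patterns can also produce uneven dates.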
I have suggestions to improve the clarity of the charts included here - what should I do?
Please use the feedback button on the side of the page to log any suggestions.
Why are these charts presented as static images and not something more dynamic?
This is a first iteration of the work. Future versions could be more interactive, although one has to consider whether a PDF view or simple images (e.g. for annual reports) are required. As an open project, anyone with an interest in improving the data visualization is welcome to get involved. Please contact us.
How did you select the colours used in these charts and can we improve them?
The colour palettes come from and an attempt was made to select colours that would be colour-blind safe. It is difficult to find suitable colour palettes that work on all charts (e.g. global and country specific) and input would be greatly appreciated to help improve these.
Which technologies were involved in this work?
The original unprocessed data resides in Hadoop. Hive is used for the SQL processing on the Hadoop data using custom UDFs wrapping the GBIF core processing libraries (Java). Hive is used to digest the data into CSV tables. All other processing is in R.

How to get involved

What can I do to improve the completeness of records available through GBIF?
A complete record is here defined as having species identification, valid coordinates and the full date of collection or observation. The charts show that some records published to GBIF are incomplete. There can be different reasons for this, which include deliberately excluding coordinates for sensitive data, or the full date of collection not being available for some historic collections. However, for many datasets, the completeness of records could be improved by working with the data publisher concerned. All GBIF Nodes are encouraged to consider how they can work with the data publishers in their networks to improve the completeness of the records, which will contribute to making these data fit for a broader range of uses.
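The definition of a complete record given above can be sketched as a simple check. This is a minimal illustration only: the field names (species, latitude, longitude, year, month, day) are hypothetical, and treating "valid coordinates" as "present and within the usual latitude and longitude ranges" is an assumption that may be looser than the checks GBIF actually applies.

```python
from datetime import date


def has_valid_coordinates(lat, lon):
    """True when both values are present and within the usual degree ranges."""
    if lat is None or lon is None:
        return False
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


def has_full_date(year, month, day):
    """True when year, month and day are all present and form a real calendar date."""
    if year is None or month is None or day is None:
        return False
    try:
        date(year, month, day)
    except ValueError:
        return False
    return True


def is_complete(record):
    """Apply the three criteria (species, coordinates, full date) to one record dict."""
    return (
        bool(record.get("species"))
        and has_valid_coordinates(record.get("latitude"), record.get("longitude"))
        and has_full_date(record.get("year"), record.get("month"), record.get("day"))
    )


def completeness_ratio(records):
    """Return the fraction of records that satisfy all three criteria."""
    records = list(records)
    if not records:
        return 0.0
    return sum(is_complete(r) for r in records) / len(records)
```

A publisher or Node could run a check of this kind on a dataset before publication to see which of the three elements is most often missing.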
I have suggestions for other interesting charts that I would like to see. Can I request more charts?
In future GBIF work programmes, it may be possible to extend this work further to include other interesting trends around data mobilization in GBIF. Please use the feedback button to provide any additional ideas or comments on the current charts, or consider contributing to the project.
What would it take for me to produce these charts myself in a different style or language?
The scripts used for this work are maintained on the GitHub project site. GBIF can provide the underlying digested data in the form of a collection of CSV files, which can be used in various applications to produce the charts. For those wishing to do far more detailed analysis than GBIF is able to do globally, the processed source records can be provided for subsets of the data (e.g. all records for Spain). Please note that the Secretariat has limited resources but will do all it can to support others wishing to further the analysis. Please also note that the volumes of data can be very large - the data covers approximately 8 billion records (May 2014).
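As an indication of how the digested CSV files might be used outside R, the sketch below sums record counts per snapshot using only the Python standard library. The column names (snapshot, kingdom, record_count) and the sample values are invented for this example; the actual layout of the files may differ and should be checked against the project documentation.

```python
import csv
import io
from collections import defaultdict


def totals_by_snapshot(csv_file):
    """Sum the record_count column for each snapshot in an open CSV file."""
    totals = defaultdict(int)
    for row in csv.DictReader(csv_file):
        totals[row["snapshot"]] += int(row["record_count"])
    # Return a plain dict ordered by snapshot label for easy plotting.
    return dict(sorted(totals.items()))


# Made-up sample data standing in for one digested CSV file.
SAMPLE = """snapshot,kingdom,record_count
2013-12-01,Animalia,100
2013-12-01,Plantae,50
2014-03-01,Animalia,120
2014-03-01,Plantae,70
"""

totals = totals_by_snapshot(io.StringIO(SAMPLE))
```

The resulting mapping of snapshot to total can be passed to any charting tool. For a real file, open it with open(path, newline="") and pass the file object in place of the in-memory sample.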
How do I provide feedback?
Please use the feedback button on the side of each page or contact us by mail.