Exploring Related Records in the Flowering Plant Genus Senegalia in Brazil

In 2020 GBIF released a news item “New data-clustering feature aims to improve data quality and reveal cross-dataset connections.” Basically, we run an algorithm across datasets shared with GBIF to search for similarities in occurrence data fields such as location, identifiers and dates. Please read this blog by Marie Grosjean and Tim Robertson for more details on how it works. In general, we can identify linkages between specimens, DNA sequences and literature citations.

Checklist publishing on GBIF - some explanations on taxonID, scientificNameID, taxonConceptID, acceptedNameUsageID, nameAccordingTo

When data publishers publish checklists, they will use a Darwin Core Archive Taxon Core. And although the taxon core terms are already described here, what exactly to put in which field can sometimes be confusing. And there is a lot to read, like here: https://github.com/tdwg/tnc/issues/1. Here I am sharing a summary of an email conversation we had with some data publishers on Helpdesk concerning some of the Taxon Core fields.

Which data can be shared through GBIF and what cannot

Preparing a dataset to be shared on GBIF.org can be a daunting task, and many publishers realize that not all their data fits in the Darwin Core standard (DwC) and the extensions GBIF uses to structure, standardize and display biodiversity data. This blog post will cover what data fits in GBIF, give examples of data that does not fit in the current format of GBIF, and provide guidance on how you can share relevant data in a metadata-only dataset or through a third party.

Identifying potentially related records - How does the GBIF data-clustering feature work?

Many data users may suspect they’ve discovered duplicated records in the GBIF index. You download data from GBIF, analyze them and realize that some records have the same date, scientific name, catalogue number and location but come from two different publishers or have slightly different attributes. There are many valid reasons why these duplicates appear on GBIF. Sometimes an observation was recorded in two different systems, sometimes several records correspond to herbaria duplicates (you can check the work of Nicky Nicolson on the topic), sometimes a specimen was digitized twice, sometimes a record has been enriched with genetic information and republished via a different platform…
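To make the idea of "same date, scientific name, catalogue number and location but different publishers" concrete, here is a minimal sketch of how you might look for such candidates in your own download. It is not the GBIF clustering algorithm, only a naive exact-match grouping. It assumes each record is a Python dict keyed by Darwin Core terms plus a `publishingOrgKey` field; the function name `candidate_duplicates` and the sample records are hypothetical.

```python
from collections import defaultdict

# Fields compared for exact equality. This choice is an assumption made for
# illustration; the real clustering feature uses more flexible matching.
KEY_FIELDS = (
    "eventDate",
    "scientificName",
    "catalogNumber",
    "decimalLatitude",
    "decimalLongitude",
)


def candidate_duplicates(records):
    """Return groups of records that share all KEY_FIELDS but come from
    more than one publisher."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec.get(field) for field in KEY_FIELDS)
        if None in key:
            # Skip records missing any of the compared fields.
            continue
        groups[key].append(rec)
    return [
        group
        for group in groups.values()
        if len({rec.get("publishingOrgKey") for rec in group}) > 1
    ]


# Hypothetical records, invented for this example only.
sample_records = [
    {"eventDate": "1998-05-02", "scientificName": "Senegalia polyphylla",
     "catalogNumber": "12345", "decimalLatitude": -15.8,
     "decimalLongitude": -47.9, "publishingOrgKey": "org-a"},
    {"eventDate": "1998-05-02", "scientificName": "Senegalia polyphylla",
     "catalogNumber": "12345", "decimalLatitude": -15.8,
     "decimalLongitude": -47.9, "publishingOrgKey": "org-b"},
    {"eventDate": "2001-11-20", "scientificName": "Senegalia tenuifolia",
     "catalogNumber": "777", "decimalLatitude": -3.1,
     "decimalLongitude": -60.0, "publishingOrgKey": "org-a"},
]

duplicate_groups = candidate_duplicates(sample_records)
print(len(duplicate_groups), "candidate group(s) found")
```

Exact matching like this will miss records with slightly different attributes, which is exactly the case a fuzzier approach is meant to handle.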

What are the flags "Collection match fuzzy", "Collection match none", "Institution match fuzzy", "Institution match none" and how to remove them?

You are a data publisher of occurrence data through GBIF.org, care about your data quality, and wonder what to do about the issue flags that show up on your occurrences. You might have noticed a new type of flag this year relating to collection and institution codes and identifiers. They are the result of our attempt at linking specimen records to our Registry of Scientific Collections (GRSciColl).

GBIF API beginners guide

This is a GBIF API beginner's guide.

The GBIF API technical documentation might be a bit confusing if you have never used an API before. The goal of this guide is to introduce the GBIF API to a semi-technical user who may have never used an API before.

The purpose of the GBIF API is to give users access to GBIF databases in a safe way. The GBIF API also allows GBIF.org and rgbif to function.
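If you have never called an API before, it may help to see that a request is just a URL with some parameters attached. The sketch below builds an occurrence search URL against the public base address `https://api.gbif.org/v1`. The helper names `occurrence_search_url` and `fetch_json` are made up for this example, and the query values are only illustrative.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GBIF_API = "https://api.gbif.org/v1"


def occurrence_search_url(**params):
    """Build an occurrence search URL from keyword parameters."""
    return f"{GBIF_API}/occurrence/search?{urlencode(params)}"


def fetch_json(url):
    """Download a URL and decode the JSON response (needs network access)."""
    with urlopen(url) as response:
        return json.load(response)


# Illustrative query: a few Senegalia occurrences from Brazil.
url = occurrence_search_url(scientificName="Senegalia", country="BR", limit=5)
print(url)

# Uncomment to actually call the API:
# data = fetch_json(url)
# print(data["count"], "matching occurrences")
```

You can also paste the printed URL straight into a browser to see the raw JSON response.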

GBIF and Apache-Spark on AWS tutorial

GBIF now has a snapshot of 1.3 billion occurrence records on Amazon Web Services (AWS). This guide will take you through running Spark notebooks on AWS. The GBIF snapshot is documented here.

You can read previous discussions about GBIF and cloud computing here. The main reason you would want to use cloud computing is to run big data queries that are slow or impractical on a local machine.

Derived datasets

You’ve finished an analysis using GBIF-mediated data, you’re writing up your manuscript and checking all the references, but you’re unsure of how to cite GBIF. If you Google it, you’ll probably end up reading our citation guidelines and quickly realize that GBIF is all about DOIs. Datasets have their own DOIs and downloads of aggregated data also have their own DOIs.

But maybe you didn’t download data through the GBIF.org portal. Maybe you relied on an R package like rgbif or dismo that retrieved data synchronously from the GBIF API? Maybe a grad student downloaded it for you? Maybe you accessed and analyzed the data using a cloud computing service, like Microsoft Azure or Amazon Web Services? In any case, which DOI do you cite if you don’t have one?

GBIF and Apache-Spark on Microsoft Azure tutorial

GBIF now has a snapshot of 1.3 billion occurrence records on Microsoft Azure.

It is hosted by the Microsoft AI for Earth program, which hosts geospatial datasets that are important to environmental sustainability and Earth science. Hosting is convenient because you can now use occurrences in combination with other environmental layers without needing to upload any of it to Azure. You can read previous discussions about GBIF and cloud computing here. The main reason you would want to use cloud computing is to run big data queries that are slow or impractical on a local machine.

The GBIF Registry of Scientific Collections (GRSciColl) in 2021

The GBIF Registry of Scientific Collections, also known as GRSciColl, has been available on GBIF.org since 2019, but it recently got some more attention when we connected it to GBIF occurrences. Now is the perfect time to share a bit of GRSciColl history and what we plan for its future.

A brief history of GRSciColl

First of all, here are a few facts about GRSciColl today, at the start of 2021:

Common things to look out for when post-processing GBIF downloads

Post was updated on April 20 2022 to accommodate changes to dwc:establishmentMeans vocabulary handling.

Here I present a checklist for filtering GBIF downloads.

In this guide, I will assume you are familiar with R. This guide is also somewhat general, so your solution might differ. This guide is intended to give you a checklist of common things to look out for when post-processing GBIF downloads.

(Almost) everything you want to know about the GBIF Species API

Today, we are talking about the GBIF Species API. Although you might not use it directly, you probably encountered it while using the GBIF web portal:

This API is what allows us to navigate through the names available on GBIF. I will try to avoid repeating what you can already find in its documentation. Instead, I will attempt to give an overview and answer some questions that we received in the past.