Looking Up Data
Last updated on 2023-05-09 | Edit this page
Estimated time 15 minutes
Overview
Questions
- How do I fetch data from an Application Programming Interface (API) to be used in OpenRefine?
- How do I reconcile my data by comparing it to authoritative datasets
Objectives
- Use URLs to fetch data from the web based on columns in an OpenRefine project
- Add columns to parse JSON data returned by web services
Looking up data from a URL
OpenRefine can retrieve data from URLs. This can be used in various ways, including looking up additional information from a remote service, based on information in your OpenRefine data. As an example, you can look up the scientific names in a dataset against the taxonomy of the Global Biodiversity Information Facility (GBIF), and retrieve additional information such as higher taxonomy and identifiers.
Typically this is a two step process, firstly a step to retrieve data from a remote service, and secondly to extract the relevant information from the data you have retrieved.
To retrieve data from an external source, use the drop down menu at
any column heading and select Edit column
>
Add column by fetching URLs...
.
This will prompt you for a GREL expression to create a URL. This URL will use existing values in your data to build a query. When the query runs OpenRefine will request each URL (for each line) and retrieve whatever data is returned (this may often be structured data, but could be HTML). The data retrieved will be stored in a cell in the new column that has been added to the project. You can then use OpenRefine transformations to extract relevant information from the data that has been retrieved. Two specific OpenRefine functions used for this are:
parseHtml()
parseJson()
The parseHtml()
function can also be used to extract
data from XML.
Retrieving higher taxonomy from GBIF
In this case we are going to use the GBIF API. Note that
API providers may impose rate limits or have other requirements for
using their data, so it’s important to check the site’s documentation.
To reduce the impact on the service, use a value of 500
in
the Throttle Delay
setting to specify the number of
milliseconds between requests.
The syntax for requesting species information from GBIF is
http://api.gbif.org/v1/species/match?name={name}
where
{name} is replaced with the scientific name in the dataset.
Using the dropdown menu of
scientificName
, selectEdit column
>Add column by fetching URLs...
In the
New column name
field, enter “gbif_JSON”In the field for
Throttle delay
enter 500-
In the expression box type the GREL
"http://api.gbif.org/v1/species/match?name="+escape(value,'url')
At this point, your screen should be similar to this:
Click ‘OK’
You should see a message at the top on the OpenRefine screen indicating it is fetching some data, with progress showing the percentage of the proportion of rows of data successfully being fetched. Wait for this to complete.
At this point you should have a new column containing a long text string in a format called ‘JSON’ (this stands for JavaScript Object Notation, although very rarely spelt out in full). The results should look like this figure:
OpenRefine has a function for extracting data from JSON (sometimes
referred to as ‘parsing’ the JSON). The parseJson()
function is explained in more detail at the Format-based
functions page.
In the new column you’ve just added use the dropdown menu to access
Edit column
>Add column based on this column...
Add the new column name: “gbif_family”
In the Expression box type the GREL
value.parseJson().get("family")
-
You should see in the Preview the taxonomic family of the scientific names displays, similar to this screen:
The reason for using Add column based on this column
is
that this allows you to retain the full JSON and extract further data
from it if you need to. If you only wanted the taxonomic family and did
not need any other information from the JSON you could use
Edit cells
> Transform...
with the same
GREL expression.