Summary and Schedule
Before you can analyze data you need to clean it. Data cleaning identifies errors and corrects formatting to create consistent data. This step must be taken with extreme care and attention because without clean data the results of analysis may be false and non-reproducible.
OpenRefine is a powerful free and open source tool for working with messy data: cleaning it and transforming it from one format into another.
This lesson will teach you to use OpenRefine to clean and format data effectively while automatically tracking any changes that you make. Many users report that this tool saves them months of work they would otherwise spend making these edits by hand.
Getting Started
Data Carpentry’s teaching is hands-on, so participants are encouraged to use their own computers to ensure the proper setup of tools for an efficient workflow. These lessons assume no prior knowledge of the skills or tools.
To get started, follow the directions in the Setup page to download data to your computer and follow any installation instructions.
To most effectively use these materials, please make sure to install everything before working through this lesson.
For Instructors
If you are teaching this lesson in a workshop, please see the Instructor notes.
Setup Instructions | Download files required for the lesson | |
Duration: 00h 00m | 1. Introduction | How is OpenRefine useful? |
Duration: 00h 10m | 2. Importing Data to OpenRefine | How can we import our data into OpenRefine? |
Duration: 00h 20m | 3. Exploring Data with OpenRefine | How can we summarise our data? How can we find errors in our data? How can we edit data to fix errors? How can we convert column data from one data type to another? |
Duration: 00h 55m | 4. Transforming Data | How can we transform our data to correct errors? |
Duration: 01h 30m | 5. Filtering and Sorting with OpenRefine | How can we select only a subset of our data to work with? How can we sort our data? |
Duration: 02h 05m | 6. Exporting Data Cleaning Steps | How can we document the data-cleaning steps we’ve applied to our data? How can we apply these steps to additional data sets? |
Duration: 02h 20m | 7. Exporting and Saving Data from OpenRefine | How can we save and export our cleaned data from OpenRefine? |
Duration: 02h 30m | 8. Other Resources in OpenRefine | What other resources are available for working with OpenRefine? |
Duration: 02h 35m | Finish |
The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.
Data
Download the data file Portal_rodents_19772002_simplified.csv, which is a CSV file that will open in a new browser tab. Be sure to right-click or Control-click the link in order to save the file (NOTE: in Safari, right-click and select "Download linked file"; in Chrome and Firefox, right-click and select "Save link as..."). Make a note of the location (that is, the folder, such as your Desktop) to which you save the file.
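To confirm that the download worked before opening OpenRefine, you can peek at the first few rows of the file. The following is a minimal sketch using only the Python standard library; the function name `preview_csv` is made up for illustration, and the Desktop path is an assumption, so adjust it to wherever you saved the file.

```python
import csv
from itertools import islice
from pathlib import Path


def preview_csv(path, n_rows=5):
    """Return (header, first n_rows data rows) from a CSV file."""
    # errors="replace" keeps the preview from failing on unexpected bytes
    with Path(path).open(newline="", encoding="utf-8", errors="replace") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])          # first line is treated as the header
        rows = list(islice(reader, n_rows))
    return header, rows


if __name__ == "__main__":
    # Assumed location (hypothetical): change this to the folder you noted above.
    data_file = Path.home() / "Desktop" / "Portal_rodents_19772002_simplified.csv"
    if data_file.exists():
        header, rows = preview_csv(data_file)
        print("Columns:", header)
        for row in rows:
            print(row)
    else:
        print("File not found at", data_file)
```

If the script prints a list of column names followed by a few rows, the file saved correctly and is ready to import.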
About the data
The data for this lesson is a part of the Data Carpentry Ecology workshop. It is a teaching version of the Portal Database. The data in this lesson is a subset of the teaching version that has been intentionally ‘messed up’ for this lesson.
The data for this lesson and the workshop are in the Portal Project Teaching Database, available on FigShare under a CC-BY license for reuse.
Software
For this lesson you will need OpenRefine version 3.7.2 and a web browser.
Note: OpenRefine is a Java program that runs on your machine (not in the cloud). It runs inside your browser, but no web connection is needed.
Download OpenRefine version 3.7.2 from https://openrefine.org/download.
- Do not download beta versions or the release candidates. These are only for development and testing of the software.
- If you are on Windows and do not have Java installed, download the version "Windows (including Java)".
- Unzip the downloaded file into a directory and name that directory something like OpenRefine.
- Check below for further instructions depending on your operating system.
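If you are unsure whether Java is already installed, one rough check is to see whether a `java` executable is on your PATH. The sketch below assumes Python is available and uses hypothetical helper names (`find_java`, `java_version_text`); a Java installation that is not on the PATH will not be detected this way, so treat a negative result as a hint rather than proof.

```python
import shutil
import subprocess


def find_java():
    """Return the full path of a `java` executable on PATH, or None."""
    return shutil.which("java")


def java_version_text(java_path):
    """Return whatever `java -version` reports for the given executable."""
    result = subprocess.run(
        [java_path, "-version"], capture_output=True, text=True, check=False
    )
    # Most JVMs write the version banner to stderr rather than stdout.
    return (result.stderr or result.stdout).strip()


if __name__ == "__main__":
    java_path = find_java()
    if java_path is None:
        print("No java executable found on PATH.")
    else:
        print("Found java at", java_path)
        print(java_version_text(java_path))
```

If nothing is found on Windows, the download that bundles Java is the simpler choice.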
Windows
- Go to your newly created OpenRefine directory.
- Launch OpenRefine by double-clicking on openrefine.exe (this will launch a black command prompt window first; ignore this window, and wait for OpenRefine to launch in the web browser, which is where you will interact with the program).
- If Windows displays a blue notification titled "Microsoft Defender SmartScreen prevented an unrecognized app from starting", click on "More info" and then click on "Run anyway".
- If you are using a different browser, or OpenRefine does not automatically open for you, point your browser at http://127.0.0.1:3333/ or http://localhost:3333 to launch the program.
Mac
- Go to your newly created OpenRefine directory.
- Drag the OpenRefine icon into the Applications folder, then Ctrl-click it and select "Open…".
- If Mac shows a notification when you try to run the program saying that it cannot verify the developer, click "Cancel". Then right-click or Ctrl-click the icon and select "Open". The notification will now have an "Open" button. If it still does not allow you to open the program, repeat the process and there will be an "Open" button the second time. For additional details, consult the OpenRefine installation guide.
- If you are using a different browser, or OpenRefine does not automatically open for you, point your browser at http://127.0.0.1:3333/ or http://localhost:3333 to launch the program.
Linux
- Navigate to your newly created OpenRefine directory using the command line.
- Type ./refine into the terminal within the OpenRefine directory.
- If you are using a different browser, or OpenRefine does not automatically open for you, point your browser at http://127.0.0.1:3333/ or http://localhost:3333 to launch the program.
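On any operating system, if the browser tab does not open, it can help to check whether OpenRefine is actually listening on port 3333 before trying the addresses by hand. The following is a minimal sketch using Python's standard library; the function names are hypothetical, and it only tests whether something accepts connections on that port, not that the program answering is OpenRefine.

```python
import socket

OPENREFINE_HOST = "127.0.0.1"
OPENREFINE_PORT = 3333  # port taken from the addresses given in this lesson


def openrefine_urls(port=OPENREFINE_PORT):
    """Return the two local addresses the lesson suggests trying."""
    return [f"http://127.0.0.1:{port}/", f"http://localhost:{port}"]


def is_listening(host=OPENREFINE_HOST, port=OPENREFINE_PORT, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    if is_listening():
        print("Something is listening. Try one of:", openrefine_urls())
    else:
        print("Nothing is listening on port", OPENREFINE_PORT,
              "- OpenRefine may not have finished starting.")
```

Because OpenRefine runs locally, this check needs no web connection; if it reports nothing listening, wait a few seconds after launching and try again.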
Web Browser
OpenRefine requires one of these web browsers to be installed on your computer:
- Google Chrome
- Chromium
- Safari
- Opera
- Microsoft Edge
OpenRefine has some issues with Firefox. Internet Explorer is not supported.
Note: Other versions of OpenRefine should work, but the results might be different due to changes in the software or default settings.