Ever had to manually edit dusty, messy, years-old information from some obsolete software?
I once worked for a company that stored paperwork offsite for 60 years. Materials were indexed in a document table. Most records had a box number, storage date, storage vendor receipt number, and a rough idea of the contents. Most, mind you.
Over 60 years the list got … messy. Storage contracts changed several times—so the box codes and vendor receipts varied over time. Add in the random mistakes that added up over time, and you had quite a mess.
My job was transferring everything to yet another contractor—which meant cleaning up thousands of records to play nice with the new vendor’s fancy online inventory. It was quite the chore—a chore many of us face when trying to organize data.
The good news is, if you can get your messy data into a spreadsheet, you can clean up and reformat it. My favourite tool for this is called OpenRefine, and its specialty is "reconciling" or "normalizing"—making it easy to find typos, variations on phrases, formatting errors, extra spaces, and other things that are hard to spot in rows upon rows of information.
What is OpenRefine?
OpenRefine bills itself, simply, as "a powerful tool for working with messy data." Originally released 2010 as "Freebase Gridworks," it was later called "Google Refine" after being acquired by the search giant. Today it's a community-run, open-source project to, well, refine your data.
To you, this could mean a number of things. You sales team could want to export old store data, reorganize it, and import it into a new eCommerce app. Your accounting staff might have legacy data floating around from years ago. Your PR staff could have multiple email lists from campaigns past you want to merge, modify, or de-duplicate.
Maybe your survey results are messy, your app exports are confusing, or your analytics data needs combined from multiple sources.
OpenRefine was built especially with those types of bulk operations in mind. It may just be what you need to finally finish that data project you’ve been putting off.
Getting Started With OpenRefine
Getting started is easy. Just download OpenRefine—it works on Windows, Mac, and Linux—and start the program. It’ll open up a browser tab that looks much like other Google Apps, and will ask you to create a project, or open a project you’ve already started.
You'll need some data for OpenRefine to work with—and it open any data in a spreadsheet format: CSV, XLS, or even a Google Sheets spreadsheet online. It can also take XML and JSON files, if that’s your jam.
Let's start a new project. This exercise is going to use a set of publicly available data from the Government of Ontario—which, like much public data, is a bit messy. Let’s go with a subject near and dear to my heart: Beer. Copy the link to the
XLSX file, which includes details about Ontario microbrewers and brands. Switch to your OpenRefine tab, start a new project, select the Web Address option, and paste in your spreadsheet link.
As soon as you input a dataset, OpenRefine generates a preview to ensure it’s displayed properly. You can do some preliminary cleanup—remove empty rows, set the first row as a header with column names, or convert columns into specific data types (dates, integers, and so on).
Click "Create Project" when you’ve made sure the data is displaying correctly, and you’ll be brought to the screen where all the magic happens.
The first thing you’ll notice is that OpenRefine doesn’t display your data like a spreadsheet with a long list of rows. Instead, it shows a maximum of 50 rows at a time, essentially just enough of a preview for you to think about what you’re working with. You can page through your data if you need to, but I think you’ll soon get comfortable with being less overwhelmed.
Clean Up Data with OpenRefine Facets
The first step is to learn about facets. These show precisely which values are used in a column, so you can find typos or variations in things that are supposed to be identical. Let’s start with the manufacturer’s name. Click the dropdown button next to the header, select Facet, then Text Facet. You’ll be presented with a column like this, showing a count of the times each item appears in the dataset:
We can see, for example, that Big Rig Brewery has 13 different beers; Big Rock Brewery, 6 different beers. We can already see some messy data here—"Black Swan Brewing Company" and "BLACK SWAN BREWING COMPANY INC." are the same company, but with slightly different names in this spreadsheet.
To fix this, hover your mouse over the name you want to change, click "edit," and type in the new name. Click Apply and it automatically edits all the matching entries in the dataset.
Let’s speed up the process by automatically identifying all of the facets that are similar and merging them—without any typing—by clustering the data. Click the Cluster button at the top of the facet display, and you’ll see all of the similar entries identified by OpenRefine:
For some of these, it’s just an extra space (as at the end of "Square Timber Brewing Company") or an extra comma (as in Blood Brothers Brewing), or liberal use of capslock. As you can see in the "Bevin Palmateer" entry, OpenRefine also identifies words that are out of order.
Check the Merge boxes for anything you want to fix. If you don’t like the suggested new value—for example, the capitalized name suggested for NITA BEER—you can just click the lowercase option and it will change that field. If you don’t like any of the options, just type in your preferred name.
Click Merge Selected & Re-Cluster to do another check. When the check finds no results, try another clustering method to look for more (you should find "Walkervile" and "Walkerville").
It's data-mining, but you don’t have to learn advanced data-mining theory to get results: Just click through all the options. You’ll start to see false positives (for example, "Bell City" isn’t "River City"), which you can just ignore.
There are also some common transform tools you can use to clean stuff up, like eliminating all the spaces before and after text. Let’s also get rid of all the uppercase brewery names by transforming the whole column to Titlecase. Click again on the dropdown menu for the column, go to Edit cells, and read through all the possibilities.
Categorize Data Automatically in OpenRefine
The next step is to do clever things with all this data. Let’s pretend these beers are our product data, and we want to add categories of beer to our catalogue. We don’t want to manually label each entry, so let’s save some time by identifying beer types from the beers’ names.
We can do a quick check for one type of beer using a Custom Text Facet. We’ll look for all cell values that contain "Porter" (this is also case-sensitive, but now that we’ve put everything in titlecase, the capital
P should catch everything). A Custom Text Facet on the Manufacturer’s Brand column brings up this window, into which we enter a filter:
This function returns
true here means 25 beers are porters in the list. (There are also 79 breweries without any actual beers available—the
(blank) category—but let’s ignore that for now.)
These filters are great when you want to manipulate a subset of your spreadsheet without having to delete the rest, or keep your focus rows selected. You can apply a filter, do a bunch of operations, and then remove it later. OpenRefine even includes some common recipes to format data, such as standarizing date formats or transforming "Firstname Lastname" into "Lastname, Firstname."
Let's use that to transform our data into something useful. We'll add a new column based on the "Manufacturer’s Brand" column, using text analysis to guess what type of beer it is. It won’t work on all entries, but for beers that have "IPA", "lager", "stout", "lime", "red", "wheat", and so on right in their name, we’ll have some success.
As with all bulk data work, sometimes mistakes happen. For example, there’s one beer in this list named "More Portly Than Stout Porter." If we search for "Stout," we’ll get a false positive. Keep that in mind, and always set aside time for quality-control!
Start by clicking on "Manufacturer’s Brand." Select Edit Column then choose Create column based on this column. To look for "lager" and replace the entirety of the Beer types value with "lager" where applicable, we use an if statement:
If statements here are straightforward: If the first part is true, transform the whole value to "lager;" otherwise, replace the cell value with itself (or, do nothing).
If we want to categorize a big set of beer types at once, we nest a series of if statements inside each other. It looks a bit silly, but gets the job done:
Essentially, if "Lager" wasn’t found, then try "IPA," then try "Wheat," then try "Pilsner," etc., etc. It's not standard programming syntax, but gets the job done.
Apply that transformation, then check the facets of the column to see our progress.
While we’re at it, let’s clean the results up. Reconcile "I.P.A." and "India Pale Ale" to "IPA" with the steps you learned above. Also keep in mind that the operations work in order: You’ll want to convert "India Pale Ale" before you reformat "Pale Ale." Because these transformations are also case-sensitive, transforming to lowercase "India pale ale" would also protect your work when you search for "Pale Ale" later on.
With a bit of categorizing, we can start to see the spread of beer types in Ontario. (Try them all today!) This is definitely faster than labelling them all by hand, and it should give you an idea how to make OpenRefine filters work for you.
If this was a product list for our online store, we’d want to export our cleaned-up and value-added spreadsheet from OpenRefine and import it into our eCommerce store. The Export button's your friend. You can export your data as a spreadsheet with a range of options and data forms. You can also upload the data directly to a new Google Sheets spreadsheet or Google Fusion table.
Do More with OpenRefine
There are a few other useful OpenRefine tools. The Undo/Redo option gives you detailed information about all your activities instead of just undoing your mistakes—which is super helpful in learning how to get more out of OpenRefine. Also remember: OpenRefine is designed around databases so you can use its records and rows seperately to organize your data.
Now it's your turn to try it out. Have messy data from an app export, or an old spreadsheet full of confusing data?
One great way is to use OpenRefine to organize your contacts: Find typos and formatting errors in email addresses, phone numbers, or company names before importing the data into a new app. I’ve used it to reformat old Mailchimp data when we changed the designs of our signup forms—super handy.
Don't spend hours formatting your data again. OpenRefine can do it for you in minutes.
We'd love to hear how you use OpenRefine to clean up data in the comments below!
Need to import data into a spreadsheet in the first place? Here's how to import any web table into a Google Sheets spreadsheet, where you can use your newfound OpenRefine skills to clean up the data.
New to spreadsheets? Get a crash course in Zapier's free Ultimate Guide to Google Sheets eBook.
Header photo via Pexels.