Cleaning Data with Google Refine

27 Oct 2012

Using Google Refine to clean messy data

Download Installer

Refine is a powerful tool for working with messy data. If you’re not familiar with a scripting language (or even if you are), it’s is great for formatting and cleaning data without suffering through nested LEFT() and RIGHT() functions in Excel.

You can perform basic or sophisticated operations including splitting strings and pattern matching.

Here’s a simple split operation

# split on `(` character and take the first element of the result
"The Shining (film)".split('(')[0] 
# >>> yields `"The Shining"`

…and more complex pattern matching

# get area code from inside parens `()`
"my number is (415) 555-5555. what's yours?".match(/(.+)\((.*)\)(.+)/)[1] 
# >>> yields `415`

…and you can split on a regex pattern

# split the domain from an email address
"foobar@gmail.com".split(/(\w+?)(@)/) 
# >>> yields list with the gmail domain [ "", "gmail.com" ]

Refine supports regex in Google’s regular expression language as well as Clojure and Jython.

Cleaning a messy dataset

This Civil War dataset has some pretty interesting information about battles and casualties, and you could make some interesting plots / visualizations with it. The dataset gives us a date column, but there’s a problem.

Dates
May 29-June 1, 1861
July 30, 1862
January 9-11, 1863

Battles which lasted only one day have a single date while battles that lasted more than a day have a range. This would be really tough to plot, so we’ll have to fix that.

Create a Google Refine Project

Browse to your data and click next.

Google Refine Import Excel Sheet Screenshot

Refine will show you a preview of the data you’re about to import. In this case, we want to import the data in the sheet named “Battle Listing”.

Handling the messy dates column

Let’s separate the single date column into two columns for start and end dates.

Click the column’s dropdown menu and choose

Column >> Edit column >> Add column based on this column

A dialogue box will open which will allow you to operate on the cells in the date column. The result of your operations will be stored in a new start column.

Google Refine Add Column Based on this Column

If you’re not familiar with regular expressions, you can probably do what you need using functions about as straight forward as the ones in Excel.

Get the start date like so

# create column `start` from `dates`
if(value.contains('-')
, value.split('-')[0]
    + ',' 
    + value.split('-')[1].split(',')[1] 
, value)

Gives us

Dates start
May 17, 1864 May 17, 1864
January 9-11, 1863 January 9, 1863

Now we can get the end date using the same strategy. For battles that lasted just one day, just make the end date the same as the start date.

if(value.contains('-'), 
    if(isNumeric(value.split('-')[1].split(',')[0])
        , value.split('-')[0].split(' ')[0]
        + ' ' + value.split('-')[1].split(',')[0]
        + ',' + value.split(',')[1]
    , value.split('-')[1].split(',')[0] 
        + ',' + value.split(',')[1])

, value)

This should take care of cells where the date range is on the cusp of two months (Jan-Feb vs. Jan 10-13).

Dates end
May 29-June 1, 1861 June 1, 1861
January 9-11, 1863 January 11, 1863

Now we have the start and end columns, we don’t need the original dates column. So let’s delete that.

Column >> Edit column >> Remove this column

Now let’s convert our new start and end columns into proper date format.

Column >> Edit cells >> Common transforms >> To date

At this point, I deleted the columns that I wasn’t interested in such as References, Status, Photo Gallery, etc.

Exporting Cleaned Data

Now let’s dump the data to a flat file and load it into R for plotting.

Click the export button in the top right corner.

Export >> Custom Tabular Export

In this dialog box, you can choose which columns you want to export in the Content tab. Click the Download tab to specify download options such as output format.

Google Refine Custom Export

I’m going to output to pipe | delimited file, but you can dump it back to xls, csv, json or whatever you prefer.