Cleaning Data with Google Refine27 Oct 2012
Using Google Refine to clean messy data
Refine is a powerful tool for working with messy data. If you’re not familiar with a scripting language (or even if you are), it’s is great for formatting and cleaning data without suffering through nested
RIGHT() functions in Excel.
You can perform basic or sophisticated operations including splitting strings and pattern matching.
Here’s a simple split operation
# split on `(` character and take the first element of the result "The Shining (film)".split('(') # >>> yields `"The Shining"`
…and more complex pattern matching
# get area code from inside parens `()` "my number is (415) 555-5555. what's yours?".match(/(.+)\((.*)\)(.+)/) # >>> yields `415`
…and you can split on a regex pattern
# split the domain from an email address "email@example.com".split(/(\w+?)(@)/) # >>> yields list with the gmail domain [ "", "gmail.com" ]
Refine supports regex in Google’s regular expression language as well as Clojure and Jython.
Cleaning a messy dataset
This Civil War dataset has some pretty interesting information about battles and casualties, and you could make some interesting plots / visualizations with it. The dataset gives us a date column, but there’s a problem.
|May 29-June 1, 1861|
|July 30, 1862|
|January 9-11, 1863|
Battles which lasted only one day have a single date while battles that lasted more than a day have a range. This would be really tough to plot, so we’ll have to fix that.
Create a Google Refine Project
Browse to your data and click next.
Refine will show you a preview of the data you’re about to import. In this case, we want to import the data in the sheet named “Battle Listing”.
Handling the messy
Let’s separate the single date column into two columns for
Click the column’s dropdown menu and choose
Column >> Edit column >> Add column based on this column
A dialogue box will open which will allow you to operate on the cells in the date column. The result of your operations will be stored in a new
If you’re not familiar with regular expressions, you can probably do what you need using functions about as straight forward as the ones in Excel.
start date like so
# create column `start` from `dates` if(value.contains('-') , value.split('-') + ',' + value.split('-').split(',') , value)
|May 17, 1864||May 17, 1864|
|January 9-11, 1863||January 9, 1863|
Now we can get the
end date using the same strategy. For battles that lasted just one day, just make the
end date the same as the
if(value.contains('-'), if(isNumeric(value.split('-').split(',')) , value.split('-').split(' ') + ' ' + value.split('-').split(',') + ',' + value.split(',') , value.split('-').split(',') + ',' + value.split(',')) , value)
This should take care of cells where the date range is on the cusp of two months (Jan-Feb vs. Jan 10-13).
|May 29-June 1, 1861||June 1, 1861|
|January 9-11, 1863||January 11, 1863|
Now we have the
end columns, we don’t need the original dates column. So let’s delete that.
Column >> Edit column >> Remove this column
Now let’s convert our new
end columns into proper date format.
Column >> Edit cells >> Common transforms >> To date
At this point, I deleted the columns that I wasn’t interested in such as
Photo Gallery, etc.
Exporting Cleaned Data
Now let’s dump the data to a flat file and load it into R for plotting.
export button in the top right corner.
Export >> Custom Tabular Export
In this dialog box, you can choose which columns you want to export in the
Content tab. Click the
Download tab to specify download options such as output format.
I’m going to output to pipe
| delimited file, but you can dump it back to xls, csv, json or whatever you prefer.