Saptarshi Sarkar

Posted on • Originally published at saptarshisarkar.hashnode.dev

Mastering Data Cleanup: Unleash the Power of OpenRefine

Real-world data is messy. It has inconsistent spellings, stray characters, and mixed number formats, which can cause problems even with simple analysis. In this post, I will guide you through a practical, beginner-friendly OpenRefine workflow (based on an example from the Hands-On Data Visualization book) to clean a small but intentionally messy sample dataset. By the end, you'll be able to convert noisy numeric fields into proper numbers, cluster and standardize text values, and export a tidy CSV ready for visualization.

TL;DR

This article provides a beginner-friendly guide to using OpenRefine for data cleanup. It covers importing and parsing data, cleaning numeric and text columns, and exporting a tidy dataset. The process includes using facets, transformations, and clustering to standardize and correct data inconsistencies, making it ready for analysis and visualization. The guide is inspired by the "Hands-On Data Visualization" book and includes practical steps and screenshots for clarity.

Prerequisites

You'll need the following ready before you start the walkthrough:

  • OpenRefine installed and running on your computer

  • The sample CSV dataset downloaded to your local machine

Dataset preview

Before we start cleaning, let’s look at what the raw data actually looks like. The screenshot below shows the first 10 rows of the sample dataset. Notice the inconsistent spellings in the Country column and the mix of symbols and commas in the FundingAmount column — we’ll fix all of these with OpenRefine.

Screenshot of the first 10 rows of the sample dataset

Data Cleaning

Now that we’ve explored the dataset, let’s start cleaning it using OpenRefine. In this section, we’ll walk through a few essential steps to turn messy, inconsistent data into a tidy and analysis-ready table. We’ll begin by importing the CSV into OpenRefine, then clean up numeric fields, standardize text values, and finally export a clean version of the dataset.

OpenRefine lets you make these transformations interactively, with instant previews and the ability to undo or redo any step — perfect for exploring and cleaning data safely before analysis.

Step 1: Importing and parsing (Creating a new project)

  1. Start the OpenRefine application. You should see a webpage open like this:

    Screenshot of the OpenRefine application interface

  2. Since we have already downloaded the CSV file of the dataset, we will upload it from our local computer. Click on This Computer → Choose Files to upload our dataset. You can also drag and drop the file there.

    Screenshot showing that the data file has been chosen

  3. Click on Next. This will show the progress of the data upload.

    Screenshot showing the data upload in progress

  4. Finally, our dataset is parsed and previewed. Make sure the data is displayed correctly here. If everything looks good, proceed with creating the project by clicking the Create project button.

    Screenshot of the OpenRefine interface when the dataset is parsed properly

  5. The project will open and look like this. Since there are 127 rows and only 10 are shown, we'll switch to display all of them by clicking the "500 rows" button. This gives us a complete view of the data for proper analysis and cleaning.

    Screenshot of OpenRefine interface showing the complete dataset uploaded

Step 2: Clean numeric column (FundingAmount)

Generally, we can use facets to highlight differences in large datasets because manually analyzing each cell for deviations from the expected result is time-consuming. Since the FundingAmount column should contain positive numbers, we can apply a Custom text facet to identify non-numeric cells.

Screenshot of how to navigate to the Custom text facet option

We need to narrow the full dataset down to just the rows whose values cannot be converted to a positive number. To do that, we’ll use the GREL expression below to classify the records into two groups — Valid and Invalid — based on whether each value can be converted to a positive number.

if(
    isError(value.toNumber()),
    "Invalid",
    if(
        value.toNumber() < 0,
        "Invalid",
        "Valid"
    )
)

Screenshot of the custom facet we're setting

Here is what each part of the expression means:

  • toNumber() function - converts the value to a number (first, the value is converted to a string if it's not already in string format, and then to a number)

  • isError() function - Takes a GREL expression as input and returns true if the expression results in an error, and false otherwise.

  • if() function - Takes three arguments in this order: the expression to evaluate, what to return if true, and what to return if false.

We will categorize the entire dataset into two groups:

  • Valid: if the value can be successfully converted to a positive number

  • Invalid: if the value cannot be converted to a number or if the converted number is negative
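If you prefer to reason about the logic outside OpenRefine, the same Valid/Invalid check can be sketched in plain Python. This is only an illustrative equivalent of the GREL expression, not how OpenRefine actually evaluates it:

```python
def classify(value):
    """Mirror the GREL facet: "Invalid" if the value doesn't parse as a
    number or parses as a negative number, "Valid" otherwise."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return "Invalid"  # corresponds to isError(value.toNumber())
    return "Invalid" if number < 0 else "Valid"

print(classify("1500"))    # Valid
print(classify("$1,500"))  # Invalid: float() can't parse the symbols
print(classify("-300"))    # Invalid: negative
```

Note that, just like GREL's toNumber(), this treats values containing commas or currency symbols as invalid, which is exactly what lets the facet surface them for cleanup.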

Click on OK to apply the facet, and we’ll see two choices: Valid and Invalid. We’ll include only the data records labeled as Invalid. Now, the number of rows decreases from 127 to 121 — not a huge reduction, but it helps us focus on the problematic entries.

Screenshot showing the applied facet

Screenshot showing some of the matching rows on applying the facet

Now, we’ll proceed with transforming the invalid values by removing redundant commas, spaces, dollar signs, and other stray characters.

  1. Click on Edit cells and then on the Transform option in the FundingAmount column.

    Screenshot of how to navigate to the Transform option

  2. We will now replace the commas, spaces, dollar signs, and any unnecessary characters in the number. The following GREL expression will be used:

    value.replace(/[a-z$, ]/, "")
    

    The regex [a-z$, ] matches any lowercase letter, dollar sign, comma, or space. The replace() function replaces each match with an empty string.

    Screenshot of OpenRefine showing the GREL expression applied

  3. After applying the transformation, we’ll see that a large part of our dataset has been cleaned. Since we have selected only the Invalid group, only the records containing negative numbers are shown.

    Screenshot of OpenRefine showing groups of Valid and Invalid labels

  4. We’ll now finally convert the negative numbers to positive ones using the following GREL expression for the transformation:

    value.toNumber() * -1
    

    Screenshot of OpenRefine showing values are converted to positive numbers

    As soon as we apply the transformation, all the values in the FundingAmount column are converted to positive numbers. Now, we’ll remove the facet.

    Screenshot of OpenRefine showing 0 invalid cells

    Finally, we’ll convert each of the values to Number format using the Edit cells → Common transforms → To number option.

    Screenshot of OpenRefine showing value to number conversion
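The three transforms above — stripping stray characters, flipping negatives, and converting to a number — can be summarized in a single Python function. This is a rough sketch of the pipeline, not OpenRefine's internals:

```python
import re

def clean_funding_amount(value):
    """Apply the same steps as the three OpenRefine transforms:
    1. strip lowercase letters, dollar signs, commas, and spaces,
    2. convert the remainder to a number,
    3. flip negatives to positive."""
    cleaned = re.sub(r"[a-z$, ]", "", str(value))  # same pattern as the GREL replace
    number = float(cleaned)
    return -number if number < 0 else number

print([clean_funding_amount(v) for v in ["$1,200", "500 usd", "-300"]])
# [1200.0, 500.0, 300.0]
```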

Step 3: Standardizing text column (Country & FundingAgency)

The Country column contains inconsistent spellings and variations of “North Korea” and “South Korea”. Similar cases are also observed in the FundingAgency column.

One of the most powerful features of OpenRefine is clustering, which finds groups of cell values that are alternative representations of the same thing.
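To build intuition for what the default key collision / fingerprint method does, here is a rough Python sketch. The real implementation also normalizes accents and unusual whitespace, and the function names here are my own:

```python
import re
from collections import defaultdict

def fingerprint(value):
    """Rough sketch of OpenRefine's default fingerprint key: trim,
    lowercase, strip punctuation, then sort and deduplicate the tokens."""
    stripped = re.sub(r"[^\w\s]", "", value.strip().lower())
    return " ".join(sorted(set(stripped.split())))

def clusters(values):
    """Group values that collide on the same fingerprint key."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return [g for g in groups.values() if len(g) > 1]

print(clusters(["North Korea", "north korea", "Korea, North", "South Korea"]))
# [['North Korea', 'north korea', 'Korea, North']]
```

Because the key sorts tokens and drops case and punctuation, "Korea, North" collides with "North Korea", while "South Korea" keeps its own key and stays out of the cluster.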

  1. Click on the arrow-down button of the Country column → Edit cells → Cluster and edit option.

    Screenshot of OpenRefine interface showing how to navigate to

    Screenshot of OpenRefine's Cluster and edit window

    We're using the default clustering method. You can learn more about each of the clustering methods in OpenRefine’s documentation.

  2. We’ll set the new cell value to North Korea and South Korea accordingly. Finally, we’ll click on Merge selected & Close option.

  3. Let’s check if any cells have a value other than North Korea or South Korea. We’ll use the following GREL expression for our custom facet to categorize the cell values into Valid and Invalid groups.

    if ((value == "North Korea").or(value == "South Korea"),
        "Valid",
        "Invalid"
    )
    

  4. We’ll select the Invalid choice in the facet. We can see that 3 cell values need cleaning.

  5. We'll first replace Nor and xNorth with North so we can use clustering, which is a better method than editing each cell directly, especially for a large number of rows.

    The following GREL expression will be used:

    value.replace(", Nor", " North")
         .replace("xNorth", "North")
    

  6. We’ll first remove the facet because we need to apply clustering to the entire dataset to find similarities. Then, we’ll select the Cluster and edit option and choose the n-Gram fingerprint method with an n-Gram size of 1. We’re choosing the n-Gram fingerprint method because the only difference between South Korea and SouthKorea is a space, and the n-Gram fingerprint method removes all spaces.

    Now, we’ll merge those two clusters.

We have now successfully cleaned the Country column.
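The reason the n-Gram fingerprint (size 1) merges South Korea with SouthKorea becomes obvious in a quick sketch: with n = 1 the key is just the sorted, deduplicated set of characters, with spaces and punctuation removed. Again, this is an illustrative approximation, not OpenRefine's code:

```python
import re

def ngram_fingerprint(value, n=1):
    """Sketch of the n-Gram fingerprint key: lowercase, remove anything
    that isn't a letter or digit, then sort and dedupe the n-grams."""
    s = re.sub(r"[^a-z0-9]", "", value.lower())
    grams = sorted({s[i:i + n] for i in range(len(s) - n + 1)})
    return "".join(grams)

print(ngram_fingerprint("South Korea"))  # aehkorstu
print(ngram_fingerprint("SouthKorea"))   # aehkorstu -> same key, same cluster
```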

Similarly, for the FundingAgency column, we’ll apply the n-Gram fingerprint clustering method with an n-Gram size of 1.

This clustering method can sometimes produce false positives, so be sure to check the cluster values before merging. Notice that in the second cluster, Department of Defense is included in the cluster of Department of State, which is incorrect. Therefore, we’ll leave the Department of Defense value unchecked while merging.

Using Text facet, we can view the different groups of values in that column. We notice that some values are still messy, like Department - Agriculture and U.S. Agency: International Development. So, we’ll use clustering again. This time, we’ll use the Nearest neighbor clustering method (with the PPM function) because the cell values share some common words, making it helpful to find the nearest neighbor in this case.

We’ll now merge these three clusters. Next, try increasing the radius value to find more nearest neighbors to catch messy data values with slight similarities (like Department of Argic and Department of Agriculture).

I tried setting the radius value to 3.0 and found some correct clusters that I considered merging.
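The idea of a distance radius can also be sketched in Python. OpenRefine's nearest-neighbor method uses PPM or Levenshtein distance; the sketch below substitutes difflib's similarity ratio purely to show how a radius separates close values from distant ones (the helper name and the 0.2 radius are my own choices, not OpenRefine defaults):

```python
from difflib import SequenceMatcher

def neighbors(values, radius=0.2):
    """Pair up values whose normalized distance falls within the radius."""
    pairs = []
    for i, a in enumerate(values):
        for b in values[i + 1:]:
            distance = 1 - SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if distance <= radius:
                pairs.append((a, b))
    return pairs

agencies = ["Department of Agriculture", "Department - Agriculture",
            "Department of State"]
print(neighbors(agencies))
# [('Department of Agriculture', 'Department - Agriculture')]
```

Widening the radius admits pairs with larger distances — which is exactly why a bigger radius in OpenRefine catches more variants, but also why you must review clusters before merging.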

Step 4: Quick sanity checks & facets

To verify whether the complete dataset has been cleaned, we can use the following facets:

  • Numeric facet on the FundingAmount column

  • Text facet on the Country and FundingAgency columns

Upon examining the facets, we see that they are all valid. Therefore, we have successfully cleaned the dataset.

Congratulations on cleaning your first dataset!

Export & reproducibility

You can now export the cleaned dataset into a file (like CSV) or to a database (such as SQL).

To export, click on the Export button in the top right corner. I am choosing CSV as the output file type.

Conclusion

Data cleaning is often the most time-consuming part of data analysis. With OpenRefine, you can make this step both reproducible and efficient.

Acknowledgments & References

This walkthrough was inspired by the Hands-On Data Visualization book by Jack Dougherty and Ilya Ilyankou, which provides the sample dataset and a concise introduction to cleaning data with OpenRefine.

For detailed explanations of OpenRefine’s features, transformations, and architecture, refer to the official OpenRefine Documentation.

Both the book and the documentation are excellent resources if you’d like to explore more advanced data-cleaning techniques and reproducible workflows.

Resources

I learned a lot by following the Hands-On Data Visualization example and exploring OpenRefine’s documentation. If you’re just starting out with data cleaning, these two resources are the perfect next step.

You can also view my cleaned dataset here: Cleaned dataset on Google Sheets
