Saptarshi Sarkar

Posted on • Originally published at saptarshisarkar.hashnode.dev

Mastering Data Cleanup: Unleash the Power of OpenRefine

Real-world data is messy. It has inconsistent spellings, stray characters, and mixed number formats, which can cause problems even with simple analysis. In this post, I will guide you through a practical, beginner-friendly OpenRefine workflow (based on an example from the Hands-On Data Visualization book) to clean a small but intentionally messy sample dataset. By the end, you'll be able to convert noisy numeric fields into proper numbers, cluster and standardize text values, and export a tidy CSV ready for visualization.

TL;DR

This article provides a beginner-friendly guide to using OpenRefine for data cleanup. It covers importing and parsing data, cleaning numeric and text columns, and exporting a tidy dataset. The process includes using facets, transformations, and clustering to standardize and correct data inconsistencies, making it ready for analysis and visualization. The guide is inspired by the "Hands-On Data Visualization" book and includes practical steps and screenshots for clarity.

Prerequisites

You'll need the following ready before you start the walkthrough:

  • OpenRefine installed and running on your computer

  • The sample CSV dataset downloaded to your local machine

Dataset preview

Before we start cleaning, let’s look at what the raw data actually looks like. The screenshot below shows the first 10 rows of the sample dataset. Notice the inconsistent spellings in the Country column and the mix of symbols and commas in the FundingAmount column — we’ll fix all of these with OpenRefine.

Screenshot of the first 10 rows of the sample dataset

Data Cleaning

Now that we’ve explored the dataset, let’s start cleaning it using OpenRefine. In this section, we’ll walk through a few essential steps to turn messy, inconsistent data into a tidy and analysis-ready table. We’ll begin by importing the CSV into OpenRefine, then clean up numeric fields, standardize text values, and finally export a clean version of the dataset.

OpenRefine lets you make these transformations interactively, with instant previews and the ability to undo or redo any step — perfect for exploring and cleaning data safely before analysis.

Step 1: Importing and parsing (Creating a new project)

  1. Start the OpenRefine application. You should see a webpage open like this:

    Screenshot of the OpenRefine application interface

  2. Since we have already downloaded the CSV file of the dataset, we will upload it from our local computer. Click on This Computer → Choose Files to upload our dataset. You can also drag and drop the file there.

    Screenshot showing that the data file has been chosen

  3. Click on Next. This will show the progress of the data upload.

    Screenshot showing the data upload in progress

  4. Finally, our dataset is parsed and previewed. Make sure the data is displayed correctly here. If everything looks good, proceed with creating the project by clicking the Create project button.

    Screenshot of the OpenRefine interface when the dataset is parsed properly

  5. The project will open and look like this. Since there are 127 rows and only 10 are shown, we'll switch to display all of them by clicking the "500 rows" button. This gives us a complete view of the data for proper analysis and cleaning.

    Screenshot of OpenRefine interface showing the complete dataset uploaded

Step 2: Clean numeric column (FundingAmount)

Generally, we can use facets to highlight differences in large datasets because manually analyzing each cell for deviations from the expected result is time-consuming. Since the FundingAmount column should contain positive numbers, we can apply a Custom text facet to identify non-numeric cells.

Screenshot of how to navigate to the Custom text facet option

We need to narrow the full dataset down to just the rows whose values cannot be converted to a positive number. To do that, we’ll use the GREL expression below to classify the records into two groups — Valid and Invalid — based on whether each value can be converted to a positive number.

if(
    isError(value.toNumber()),
    "Invalid",
    if(
        value.toNumber() < 0,
        "Invalid",
        "Valid"
    )
)

Screenshot of the custom facet we're setting

Here is what each part of the expression means:

  • toNumber() function - converts the value to a number (first, the value is converted to a string if it's not already in string format, and then to a number)

  • isError() function - Takes a GREL expression as input and returns true if the expression results in an error, and false otherwise.

  • if() function - Takes three arguments in this order: the expression to evaluate, what to return if true, and what to return if false.

We will categorize the entire dataset into two groups:

  • Valid: if the value can be successfully converted to a positive number

  • Invalid: if the value cannot be converted to a number or if the converted number is negative
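If you prefer to reason about the logic outside OpenRefine, the same Valid/Invalid check can be sketched in plain Python. This is only an illustrative equivalent of the GREL expression, not how OpenRefine actually evaluates it:

```python
def classify(value):
    """Mirror the GREL facet: "Invalid" if the value doesn't parse as a
    number or parses as a negative number, "Valid" otherwise."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return "Invalid"  # corresponds to isError(value.toNumber())
    return "Invalid" if number < 0 else "Valid"

print(classify("1500"))    # Valid
print(classify("$1,500"))  # Invalid: float() can't parse the symbols
print(classify("-300"))    # Invalid: negative
```

Note that, just like GREL's toNumber(), this treats values containing commas or currency symbols as invalid, which is exactly what lets the facet surface them for cleanup.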

Click on OK to apply the facet, and we’ll see two choices: Valid and Invalid. We’ll include only the data records labeled as Invalid. Now, the number of rows decreases from 127 to 121 — not a huge reduction, but it helps us focus on the problematic entries.

Screenshot showing the applied facet

Screenshot showing some of the matching rows on applying the facet

Now, we’ll proceed with transforming the invalid values by removing redundant commas, spaces, dollar signs, and other stray characters.

  1. Click on Edit cells and then on the Transform option in the FundingAmount column.

    Screenshot of how to navigate to the Transform option

  2. We will now replace the commas, spaces, dollar signs, and any unnecessary characters in the number. The following GREL expression will be used:

    value.replace(/[a-z$, ]/, "")
    

    The regex [a-z$, ] matches any lowercase letter, dollar sign, comma, or space. The replace() function replaces each match with an empty string.

    Screenshot of OpenRefine showing the GREL expression applied

  3. After applying the transformation, we’ll see that a large part of our dataset has been cleaned. Since we have selected only the Invalid group, only the records containing negative numbers are shown.

    Screenshot of OpenRefine showing groups of Valid and Invalid labels

  4. We’ll now finally convert the negative numbers to positive ones using the following GREL expression for the transformation:

    value.toNumber() * -1
    

    Screenshot of OpenRefine showing values are converted to positive numbers

    As soon as we apply the transformation, all the values in the FundingAmount column are converted to positive numbers. Now, we’ll remove the facet.

    Screenshot of OpenRefine showing 0 invalid cells

    Finally, we’ll convert each of the values to Number format using the Edit cells → Common transforms → To number option.

    Screenshot of OpenRefine showing value to number conversion
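The three transforms above — stripping stray characters, flipping negatives, and converting to a number — can be summarized in a single Python function. This is a rough sketch of the pipeline, not OpenRefine's internals:

```python
import re

def clean_funding_amount(value):
    """Apply the same steps as the three OpenRefine transforms:
    1. strip lowercase letters, dollar signs, commas, and spaces,
    2. convert the remainder to a number,
    3. flip negatives to positive."""
    cleaned = re.sub(r"[a-z$, ]", "", str(value))  # same pattern as the GREL replace
    number = float(cleaned)
    return -number if number < 0 else number

print([clean_funding_amount(v) for v in ["$1,200", "500 usd", "-300"]])
# [1200.0, 500.0, 300.0]
```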

Step 3: Standardizing text column (Country & FundingAgency)

The Country column contains inconsistent spellings and variations of “North Korea” and “South Korea”. Similar cases are also observed in the FundingAgency column.

One of the most powerful features of OpenRefine is clustering, which finds groups of cell values that are alternative representations of the same thing.
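To build intuition for what the default key collision / fingerprint method does, here is a rough Python sketch. The real implementation also normalizes accents and unusual whitespace, and the function names here are my own:

```python
import re
from collections import defaultdict

def fingerprint(value):
    """Rough sketch of OpenRefine's default fingerprint key: trim,
    lowercase, strip punctuation, then sort and deduplicate the tokens."""
    stripped = re.sub(r"[^\w\s]", "", value.strip().lower())
    return " ".join(sorted(set(stripped.split())))

def clusters(values):
    """Group values that collide on the same fingerprint key."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return [g for g in groups.values() if len(g) > 1]

print(clusters(["North Korea", "north korea", "Korea, North", "South Korea"]))
# [['North Korea', 'north korea', 'Korea, North']]
```

Because the key sorts tokens and drops case and punctuation, "Korea, North" collides with "North Korea", while "South Korea" keeps its own key and stays out of the cluster.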

  1. Click on the arrow-down button of the Country column → Edit cells → Cluster and edit option.

    Screenshot of OpenRefine interface showing how to navigate to

    Screenshot of OpenRefine's Cluster and edit window

    We're using the default clustering method. You can learn more about each of the clustering methods in OpenRefine’s documentation.

  2. We’ll set the new cell value to North Korea and South Korea accordingly. Finally, we’ll click on Merge selected & Close option.

  3. Let’s check if any cells have a value other than North Korea or South Korea. We’ll use the following GREL expression for our custom facet to categorize the cell values into Valid and Invalid groups.

    if ((value == "North Korea").or(value == "South Korea"),
        "Valid",
        "Invalid"
    )
    

  4. We’ll select the Invalid choice in the facet. We can see that 3 cell values need cleaning.

  5. We'll first replace Nor and xNorth with North so we can use clustering, which is a better method than editing each cell directly, especially for a large number of rows.

    The following GREL expression will be used:

    value.replace(", Nor", " North")
         .replace("xNorth", "North")
    

  6. We’ll first remove the facet because we need to apply clustering to the entire dataset to find similarities. Then, we’ll select the Cluster and edit option and choose the n-Gram fingerprint method with an n-Gram size of 1. We’re choosing the n-Gram fingerprint method because the only difference between South Korea and SouthKorea is a space, and the n-Gram fingerprint method removes all spaces.

    Now, we’ll merge those two clusters.

We have now successfully cleaned the Country column.
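The reason the n-Gram fingerprint (size 1) merges South Korea with SouthKorea becomes obvious in a quick sketch: with n = 1 the key is just the sorted, deduplicated set of characters, with spaces and punctuation removed. Again, this is an illustrative approximation, not OpenRefine's code:

```python
import re

def ngram_fingerprint(value, n=1):
    """Sketch of the n-Gram fingerprint key: lowercase, remove anything
    that isn't a letter or digit, then sort and dedupe the n-grams."""
    s = re.sub(r"[^a-z0-9]", "", value.lower())
    grams = sorted({s[i:i + n] for i in range(len(s) - n + 1)})
    return "".join(grams)

print(ngram_fingerprint("South Korea"))  # aehkorstu
print(ngram_fingerprint("SouthKorea"))   # aehkorstu -> same key, same cluster
```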

Similarly, for the FundingAgency column, we’ll apply the n-Gram fingerprint clustering method with an n-Gram size of 1.

This clustering method can sometimes produce false positives, so be sure to check the cluster values before merging. Notice that in the second cluster, Department of Defense is included in the cluster of Department of State, which is incorrect. Therefore, we’ll leave the Department of Defense value unchecked while merging.

Using Text facet, we can view the different groups of values in that column. We notice that some values are still messy, like Department - Agriculture and U.S. Agency: International Development. So, we’ll use clustering again. This time, we’ll use the Nearest neighbor clustering method (with the PPM function) because the cell values share some common words, making it helpful to find the nearest neighbor in this case.

We’ll now merge these three clusters. Next, try increasing the radius value to find more nearest neighbors to catch messy data values with slight similarities (like Department of Argic and Department of Agriculture).

I tried setting the radius value to 3.0 and found some correct clusters that I considered merging.
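The idea of a distance radius can also be sketched in Python. OpenRefine's nearest-neighbor method uses PPM or Levenshtein distance; the sketch below substitutes difflib's similarity ratio purely to show how a radius separates close values from distant ones (the helper name and the 0.2 radius are my own choices, not OpenRefine defaults):

```python
from difflib import SequenceMatcher

def neighbors(values, radius=0.2):
    """Pair up values whose normalized distance falls within the radius."""
    pairs = []
    for i, a in enumerate(values):
        for b in values[i + 1:]:
            distance = 1 - SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if distance <= radius:
                pairs.append((a, b))
    return pairs

agencies = ["Department of Agriculture", "Department - Agriculture",
            "Department of State"]
print(neighbors(agencies))
# [('Department of Agriculture', 'Department - Agriculture')]
```

Widening the radius admits pairs with larger distances — which is exactly why a bigger radius in OpenRefine catches more variants, but also why you must review clusters before merging.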

Step 4: Quick sanity checks & facets

To verify whether the complete dataset has been cleaned, we can use the following facets:

  • Numeric facet on the FundingAmount column

  • Text facet on the Country and FundingAgency columns

Upon examining the facets, we see that they are all valid. Therefore, we have successfully cleaned the dataset.

Congratulations on cleaning your first dataset!

Export & reproducibility

You can now export the cleaned dataset into a file (like CSV) or to a database (such as SQL).

To export, click on the Export button in the top right corner. I am choosing CSV as the output file type.

Conclusion

Data cleaning is often the most time-consuming part of data analysis. With OpenRefine, you can make this step both reproducible and efficient.

Acknowledgments & References

This walkthrough was inspired by the Hands-On Data Visualization book by Jack Dougherty and Ilya Ilyankou, which provides the sample dataset and a concise introduction to cleaning data with OpenRefine.

For detailed explanations of OpenRefine’s features, transformations, and architecture, refer to the official OpenRefine Documentation.

Both the book and the documentation are excellent resources if you’d like to explore more advanced data-cleaning techniques and reproducible workflows.

Resources

I learned a lot by following the Hands-On Data Visualization example and exploring OpenRefine’s documentation. If you’re just starting out with data cleaning, these two resources are the perfect next step.

You can also view my cleaned dataset here: Cleaned dataset on Google Sheets
