Erin Schaffer for Educative

Posted on Apr 29, 2021 • Originally published at educative.io

Data Science in 5 minutes: What is Data Cleaning?

#datascience #machinelearning #datacleansing

When working with data, your analysis and insights are only as good as the data you use. If you're performing data analysis with dirty data, your organization can't make efficient and effective decisions with that data. Data cleaning is a critical part of data management that helps you verify that your data is of high quality.

Data cleaning includes more than just fixing spelling or syntax errors. It's a fundamental aspect of data science analytics and an important machine learning technique. Today, we'll learn more about data cleaning, its benefits, issues that can arise with your data, and next steps for your learning.

We’ll cover:

What is data science cleaning?
Benefits and steps of data cleaning
Next steps for your learning

What is data science cleaning?

Data cleaning, or data cleansing, is the important process of correcting or removing incorrect, incomplete, or duplicate data within a dataset. Data cleaning should be the first step in your workflow. When working with large datasets and combining various data sources, there’s a strong possibility you may duplicate or mislabel data. If you have inaccurate or incorrect data, it will lose its quality, and your algorithms and outcomes become unreliable.

Data cleaning differs from data transformation because you’re actually removing data that doesn’t belong in your dataset. With data transformation, you’re changing your data to a different format or structure. Data transformation processes are sometimes referred to as data wrangling or data munging. The data cleaning process is what we'll focus on today.

So, how do I know if my data is clean?

To determine data quality, you can study its features and weigh them according to what's important to your organization and your project.
There are five main features to look for when evaluating your data:

Consistency: Is your data consistent across your datasets?
Accuracy: Is your data close to the true values?
Completeness: Does your data include all required information?
Validity: Does your data correspond with business rules and/or restrictions?
Uniformity: Is your data specified using consistent units of measurement?

Now that we know how to recognize high-quality data, let's dive deeper into the process of data science cleaning, why it’s important, and how to do it effectively.

Benefits and steps of data cleaning

Let's discuss some cleaning steps you can take to ensure you're working with high-quality data. Data scientists spend a lot of time cleaning data because once their data is clean, it's much easier to perform data analysis and build models.

First, we'll discuss some issues you could experience with your data and what to do about them.

Handling missing data

It's common for large datasets to have some missing values. Maybe the person recording the data forgot to input them, or maybe those variables were only added late in the data collection process. Either way, missing data should be managed before you work with the dataset.
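As a rough illustration of managing missing values, here is a minimal pandas sketch. The DataFrame, its column names, and the choice of median and "unknown" as fill values are all assumptions made for this example; the right strategy depends on your data.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; the column names are illustrative only
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Oslo", "Lima", None, "Pune"],
})

# Count the missing values in each column
missing_counts = df.isna().sum()

# Option 1: drop every row that has at least one missing value
dropped = df.dropna()

# Option 2: fill numeric gaps with the column median and text gaps with a placeholder
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())
filled["city"] = filled["city"].fillna("unknown")
```

Dropping rows is the simplest option but discards information, while filling keeps every row at the cost of introducing estimated values.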

Filtering unwanted outliers

Outliers hold essential information about your data, but at the same time take your focus away from the main group. It's a good idea to examine your data with and without outliers. If you discover you want to use them, be sure to choose a robust method that can handle your outliers. If you decide against using them, you can just drop them.

You can also limit the effect of unwanted outliers by capping them at chosen percentiles. This snippet assumes a DataFrame named train_df with a numeric column named target:

import numpy as np

# Get the 98th and 2nd percentile as the limits of our outliers
upper_limit = np.percentile(train_df['target'].values, 98)
lower_limit = np.percentile(train_df['target'].values, 2)

# Cap values outside the limits (the rows are kept, not dropped)
train_df['target'] = train_df['target'].clip(lower=lower_limit, upper=upper_limit)
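If you would rather drop the outliers than keep them at a limit value, a boolean mask is one way to do it. This is a sketch under assumptions: the train_df data below is made up for the example, and the 2nd and 98th percentiles are simply reused as cutoffs.

```python
import numpy as np
import pandas as pd

# Hypothetical data: values 1 to 98 plus two extreme values
values = [-50.0] + [float(v) for v in range(1, 99)] + [500.0]
train_df = pd.DataFrame({"target": values})

# Use the 2nd and 98th percentiles as the outlier limits
lower_limit = np.percentile(train_df["target"], 2)
upper_limit = np.percentile(train_df["target"], 98)

# Keep only the rows whose value lies within the limits
within_limits = train_df["target"].between(lower_limit, upper_limit)
trimmed_df = train_df[within_limits]
```

Percentile cutoffs always remove some rows from each tail, even when the data has no real outliers, so it is worth comparing results with and without this step.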

Standardizing your data

The data in your feature variables should be standardized. It makes examining and modeling your data a lot easier. For example, let's look at two values we'll call "dog" and "cat" that are in the "animal" variable. If you collected data, you may receive different data values that you didn't anticipate, such as:

DOG, CAT (entered in all caps)
Dog, Cat (entered with first letters capitalized)
dof, cart (entered as typos)

If we converted the feature variable into numeric category codes, we wouldn't get the 0 and 1 values that we want. We'd get something more like this:

{
   'dog': 0,
   'cat': 1,
   'DOG': 2,
   'CAT': 3,
   'Dog': 4,
   'Cat': 5,
   'dof': 6,
   'cart': 7
}

To effectively deal with the capitalization issues and help standardize your data, you can do something like this:

# Make the string lowercase
s.lower()

# Make the first letter capitalized
s.capitalize()

If there's an issue with typos, you can use a mapping function:

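value_map = {'dof': 'dog', 'cart': 'cat'}

# replace() leaves values that are not in value_map unchanged,
# whereas map() would turn them into NaN
pd_dataframe['animals'] = pd_dataframe['animals'].replace(value_map)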

Note: Another way to deal with typos is to run a spelling and grammar check in Microsoft Excel.
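Putting the capitalization and typo fixes together on a whole column might look like the sketch below. The Series contents are made up for illustration, and the example assumes pandas string methods are acceptable for your data.

```python
import pandas as pd

# Hypothetical raw values for the "animal" variable
animals = pd.Series(["dog", "CAT", "Dog", "cat", "dof", "cart", "DOG"])

# Lowercase every entry, then correct the known typos
value_map = {"dof": "dog", "cart": "cat"}
cleaned = animals.str.lower().replace(value_map)

# Convert to category codes: only two categories remain after cleaning
codes = cleaned.astype("category").cat.codes
```

After cleaning, the eight distinct raw spellings collapse into two categories, so the encoded column holds only two codes.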

Removing unwanted observations

Sometimes you may have some irrelevant data that should be removed. Let's say you want to predict the sales of a magazine. You're examining a dataset of magazines ordered from Amazon over the past year, and you notice a feature variable called "font-type" that notes which font was used in the magazine.

This is a pretty irrelevant feature, and it probably wouldn't help you predict the sales of a magazine. This is a feature that could be dropped like this:

# drop() returns a new DataFrame, so assign the result back
df = df.drop('feature_variable_name', axis=1)

Removing those unwanted observations not only makes data exploration easier but also helps train your machine learning model.

Dirty data includes any data points that are wrong or just shouldn't be there. Duplicates occur when data points are repeated in your dataset. If you have a lot of duplicates, it can throw off the training of your machine learning model.

To handle dirty data, you can either drop the affected data points or replace them (for example, by converting incorrect data points into the correct ones).
To handle duplication issues, you can just drop the duplicates from your data.
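Dropping duplicates can be sketched with pandas as follows. The orders table and its column names are hypothetical, and the example treats only rows that match in every column as duplicates.

```python
import pandas as pd

# Hypothetical orders table with one repeated row
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "magazine": ["Wired", "Time", "Time", "Vogue"],
})

# Count the rows that are exact copies of an earlier row
n_duplicates = orders.duplicated().sum()

# Keep the first occurrence of each row and drop the repeats
deduped = orders.drop_duplicates().reset_index(drop=True)
```

If only some columns identify a record, the subset parameter of drop_duplicates can restrict the comparison to those columns.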

Removing blank data

You obviously can't use blank data for data analysis. Blank data is a major issue for analysts because it weakens the quality of the data. You should ideally remove blank data in the data collection phase, but you can also write a program to do this for you.
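One possible version of such a program is sketched below with pandas. It assumes blanks arrive as empty or space-only strings, and the survey table and its column names are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data where blanks arrive as empty or space-only strings
survey = pd.DataFrame({
    "name": ["Ana", "", "Raj", "   "],
    "score": [90, 85, 70, 60],
})

# Treat empty and whitespace-only strings as missing values
survey["name"] = survey["name"].replace(r"^\s*$", np.nan, regex=True)

# Drop the rows whose name is missing
non_blank = survey.dropna(subset=["name"])
```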

Eliminating white space

White space is a small but common issue within many data structures. A TRIM function will help you eliminate white space.

Note: The TRIM function is categorized under Excel text functions. It helps remove extra spaces in data. You can use the =TRIM(text) formula.
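If your data lives in pandas rather than Excel, a rough equivalent of TRIM can be sketched with string methods. The names below are made up, and note one difference: the regular expression here also collapses tabs and other whitespace characters, not only spaces.

```python
import pandas as pd

# Hypothetical column with stray spaces
names = pd.Series(["  New   York ", "Paris  ", " San  Francisco"])

# Collapse runs of whitespace into one space, then strip both ends
trimmed = names.str.replace(r"\s+", " ", regex=True).str.strip()
```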

Fixing conversion errors

Sometimes, when exporting data, numeric values get converted into text. In Excel, the VALUE function helps with this issue by converting a text string that represents a number back into a number.
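In pandas, a similar repair can be sketched with pd.to_numeric. The prices column is a made-up example, and turning unparseable entries into NaN (rather than raising an error) is a choice made for this sketch.

```python
import pandas as pd

# Hypothetical column where numbers were exported as text
prices = pd.Series(["10", "3.5", " 42 ", "n/a"])

# Strip stray spaces, then convert; unparseable entries become NaN
numeric = pd.to_numeric(prices.str.strip(), errors="coerce")
```

The resulting NaN values can then be handled like any other missing data.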

The data cleansing process sounds time-consuming, but it makes your data easier to work with and allows you to get the most out of your data. Having clean data increases your efficiency and ensures you're working with high-quality data.

Some benefits of data cleaning include:

There are data cleaning tools, such as DemandTools or Oracle Enterprise Data Quality, that help increase your efficiency and speed up the decision-making process.
You can better monitor your errors to help you eliminate incorrect, corrupt, or inconsistent data.
You will make fewer errors overall.
You can map out the different functions your data serves and what it is supposed to do.
It's easy to remove errors across multiple data sources.
Etc.

Next steps for your learning

Data cleaning is an important part of your organization's data management workflow. Now that you've learned more about this process, you're ready to learn more advanced concepts within machine learning. Here are some recommended things to learn:

Image recognition
Natural language processing
Applied machine learning
Etc.

To get up to speed with the modern techniques in machine learning, check out Educative's Learning Path, Become a Machine Learning Engineer. In this learning path, you'll explore essential machine learning techniques to help you stand out from the competition. By the end, you'll have job-ready skills in data pipeline creation, model deployment, and inference.

Happy learning!
