DEV Community

loading...
Cover image for Data Cleaning 101

Data Cleaning 101

sahil_ profile image Sahil Originally published at Medium ・6 min read

What is data cleaning?

Data cleaning is a process to remove, add or modify data for analyzing and other machine learning tasks. If data cleaning is necessary, it is always done before any kind of analysis or machine learning task.

Clive Humby said, “Data is the new oil.” But we know data still needs to be refined.

Why data cleaning is necessary?

Data is considered one of the major assets of a company. Misleading or inaccurate data is risky and can be a reason for the fall of a company.

It is not necessary that data available to us is useful every-time, we must perform many operations to make it useful. So, it is a good idea to remove unnecessary data and, format and modify important data so that we can use it. In some scenarios, it is also required to add information externally by processing the available data. For example, adding a language column based on some data already exist or to generate a column with average value based on some other columns’ data.


Introduction

There are many steps involved in data cleaning process. These all steps are not necessary for everyone to follow or use. To perform the data cleaning, we will use python programming language with pandas library.

I have used python because of its expressiveness and, it is easy to learn and understand. More importantly, python is choice of many experts for machine learning tasks because person without computer science background can easily learn it. Apart from python’s benefits; pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool and it is one of the most popular data analysis and processing tools out there.

To know your data is very important before one start data cleaning process, because what cleaning process to perform, is all depends on what kind of data one has and what is the nature of that data.

Step by step process for cleansing your data

Before cleaning the data, it is important to load data properly. In this tutorial I will show basic methods to load data from a csv file. Find more options to read csv here.

import pandas as pd

"""
Data Loading
"""

# 1. Read data from csv default way
# df = pd.read_csv('my_file.csv')

# 2. Read data from csv using comma as delimiter
# df = pd.read_csv('my_file.csv', delimiter=',')

# 3. Read data from csv using comma as delimiter and no headers
# df = pd.read_csv('my_file.csv', delimiter=',', header=None)

# 4. Read data from csv using comma as delimiter and custom headers
my_headers = ['Id','Name', 'Type', 'Price']
df = pd.read_csv('my_file.csv', delimiter=',', header=0, names=my_headers)
Enter fullscreen mode Exit fullscreen mode

Remove duplicate data

There are certain steps which needs to be followed in every data cleaning process. One of those steps is removal of duplicate data. Regardless of textual or numeric data, removal of duplicate data is very important because if dataset contains too many duplicates then the time to process that data also increases.

"""
Removing Duplicates
"""
# 1. Removes duplicate and returns a copy of dataframe 
df = df.drop_duplicates()

# 2. Removes duplicates in place
df = df.drop_duplicates(inplace=True)

# 3. Drops duplicates and keep first/last occurance
df = df.drop_duplicates(inplace=True, keep='last')

# 4. Consider only certain columns for identigying duplicates
df = df.drop_duplicates(subset=['Id', 'Price'], inplace=True, keep='last')
Enter fullscreen mode Exit fullscreen mode

Remove emojis 😁

There are many cases where we do not want the emojis in our textual dataset. We can remove emojis by using a single line of code. Code snippet shown below will remove emojis from pandas dataframe column by column. Code snippet can be found on Stackoverflow.

"""
Remove emojis
"""
df = df.astype(str).apply(lambda x: x.str.encode('ascii', 'ignore').str.decode('ascii'))
Enter fullscreen mode Exit fullscreen mode

This code snippet encodes all the data into ASCII (American Standard Code for Information Interchange) values and ignore if the data can not be encoded. After encoding it tries to decode them all again because all the emojis were ignored in the encoding process. So now we have all the data without emojis.

Change data into lowercase

Possibilities of changing case of data is very likely. Here, I have attached code snippet to change data to the lowercase. More examples can be found here.

"""
Convert to lowercase
"""
df['Type'] = df['Type'].str.lower()
df['Name'] = df['Name'].str.lower()
Enter fullscreen mode Exit fullscreen mode

Remove multiple white-spaces, tabs and new-lines

Every dataset contains unnecessary whitespaces, tabs and newlines. Problem is that we can see tabs and new lines clearly but not whitespaces which in turn affects when we train our models.

"""
Remove multiple whitespaces, tabs and newlines
"""
df['Type'] = df['Type'].str.replace('\n', '')
df['Type'] = df['Type'].str.replace('\t', ' ')
df['Type'] = df['Type'].str.replace(' {2,}', ' ', regex=True)
df['Type'] = df['Type'].str.strip()
Enter fullscreen mode Exit fullscreen mode

First two lines of code will replace tabs and newlines with empty character, respectively. Third line will find two or more spaces with the help of regular expression(regex), and it will replace it with a single space. Finally, last line will strip (removes whitespaces) data from both the sides.

Remove URLs (Uniform Resource Locators)

Many people use surveys to get data. People tend to fill random details and sometimes this data has URLs in them. I have used regex pattern shown in code snippet to remove URLs, though one can use any regex pattern to match URLs. Here, I have replaced matching URL patterns with empty string character.

"""
Remove URLs
"""
df['Type'] = df['Type'].replace(r'http\S+', '', regex=True).replace(r'www\S+', '', regex=True)
Enter fullscreen mode Exit fullscreen mode

Drop rows with empty data

After all the above cleaning process it left some empty data in columns. We have to get rid of those empty rows otherwise it creates uncertainty in trained model. To make sure we drop all rows with empty data, we use two methods which are as shown below.

"""
Drop Empty Rows
"""
df.dropna()

df['Type'].astype(bool)
df = df[df['Type'].astype(bool)]
Enter fullscreen mode Exit fullscreen mode

First line removes all rows which contain np.nan, pd.NaT and None whereas other lines remove rows which has empty string characters. Second method is fast but if column has even a whitspace it will not work. This is another reason why we strip our data earlier.

More data processing

Sometimes you need to drop some columns, create a new column from existing data or remove rows which does not contain specific data in them.

"""
More Data Processing
"""
import numpy as np

df = df.drop(['Id', 'Name'], axis=1)
df = df[df['Type'].str.contains('frozen') | df['Type'].str.contains('green')]

def detect_price(row):
    if row['Price'] > 15.50:
        return 'High'
    elif row['Price'] > 5.50 and row['Price'] <= 15.50:
        return 'Medium'
    elif row['Price'] > 0.0 and row['Price'] <= 5.50:
        return 'Low'
    else:
        return np.NaN

df['Range'] = df.apply (lambda row: detect_price(row), axis=1)
Enter fullscreen mode Exit fullscreen mode

Here, line number three drops tow columns named Id and Name; and returns a copy of new dataframe. Line number four checks if ‘Type’ column contains string frozen or green then returns true and keep that row. Line number 7 to 17 creates a new column named ‘Range’ based on the value of column ‘Price’. By using lambda function, I passed each row to detect_price function and return value based on the price. This returned value then assigned to new column on row passed to the function. Here, reason to use np.NaN is we can remove those rows afterwards using df.dropna().


Conclusion

Data cleaning process is one of many processes involved in data science. To have a clean and accurate data is blessings in disguise. We need to clean and process data differently in every project. I have mentioned some of the cleaning methods which are used frequently. You can create your own set of methods as well or use any one of the existing methods. I have attached whole code here for reference.


Cover Image Credit: Luke Chesser

Discussion (0)

pic
Editor guide