Data cleaning refers to the process of "cleaning" data by identifying errors in the data and then rectifying them.
The main aim of data cleaning is to identify and remove errors and duplicate data in order to create a reliable dataset.
We will use the fish dataset as the basis for this tutorial.
Fish Dataset
The “Fish Dataset” is a machine learning dataset.
The task involves predicting the weight of a fish.
You can access the dataset here:
https://www.kaggle.com/aungpyaeap/fish-market
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset into a DataFrame
fish = pd.read_csv("Fish.csv")
What does the data look like?
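A quick peek at the first few rows, the shape, and the column types gives a feel for the data (the exact output will depend on your copy of Fish.csv):

# Inspect the first few rows, the dimensions, and the column types
print(fish.head())
print(fish.shape)
print(fish.dtypes)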
Fill Out Missing Values
One of the first steps of fixing errors in your dataset is to find incomplete values and fill them out. Most of the data that you may have can be categorized.
In most cases, it is best to fill out your missing values based on different categories or create entirely new categories to include the missing values.
If your data are numerical, you can fill missing values with the mean or median of the column.
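As a sketch, assuming hypothetical missing values in the numeric Weight column and the categorical Species column, imputation with pandas could look like this (the lines below are purely illustrative):

# Fill numeric missing values with the column median
fish["Weight"] = fish["Weight"].fillna(fish["Weight"].median())

# Fill categorical missing values with a new "Unknown" category
fish["Species"] = fish["Species"].fillna("Unknown")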
Let's check our dataset for missing values:
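One common way to do this is isnull() combined with sum(), which counts missing values per column:

# Count missing values in each column
print(fish.isnull().sum())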
As you can see, in this case, we do not have missing values.
Removing rows with missing values
One of the simplest things to do in data cleansing is to remove or delete rows with missing values. This may not be the ideal step if your training data contains a large number of missing values.
If the missing values are relatively few, then removing those rows can be the right approach. You have to be very sure that the rows you are deleting do not contain information that is missing from the rest of the training data.
Note: As you can see, in this case, we do not have missing values. However, this is not always the case.
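Just for illustration, dropping rows with missing values would look like this:

# Drop any rows that contain at least one missing value
fish_clean = fish.dropna(axis=0)
print(fish_clean.shape)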
Fixing errors in the Dataset
Ensure there are no typographical errors or inconsistencies in upper and lower case.
Go through your dataset, identify such errors, and fix them to make sure that your training set is as error-free as possible. This will help you get better results from your machine learning models.
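A minimal sketch of normalizing the text in the Species column (assuming possible stray whitespace or inconsistent capitalization, which this particular dataset may not have):

# Strip whitespace and standardize capitalization in the Species column
fish["Species"] = fish["Species"].str.strip().str.capitalize()

# Inspect the resulting categories
print(fish["Species"].unique())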
Identify Columns That Contain a Single Value
Columns that have a single observation or value are probably useless for modeling.
These columns or predictors are referred to as zero-variance predictors: if we measured the variance (the average squared deviation from the mean), it would be zero.
When a predictor contains a single value, we call this a zero-variance predictor because there truly is no variation displayed by the predictor.
You can detect columns that have this property using the nunique() Pandas function, which reports the number of unique values in each column.
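For example:

# Count the number of unique values in each column
counts = fish.nunique()
print(counts)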
Delete Columns That Contain a Single Value
Variables or columns that have a single value should probably be removed from your dataset.
From the output above, we can see that the Species column contains a single value.
Columns are relatively easy to remove from a NumPy array or Pandas DataFrame.
One approach is to record all columns that have a single unique value, then delete them from the Pandas DataFrame by calling the drop() function.
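A minimal sketch of this approach, building on the nunique() counts from earlier:

# Collect the names of columns that contain only one unique value
single_value_cols = [col for col in fish.columns if fish[col].nunique() == 1]
print(single_value_cols)

# Drop those columns from the DataFrame
fish = fish.drop(columns=single_value_cols)
print(fish.shape)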
Identify Rows That Contain Duplicate Data
Rows that have identical data are probably useless, if not dangerously misleading during model evaluation.
A duplicate row is a row whose value in every column is identical to the corresponding value in another row.
The pandas function duplicated() will report whether a given row is duplicated or not. Each row is marked either False, to indicate that it is not a duplicate, or True, to indicate that it is. If there are duplicates, the first occurrence of the row is marked False (by default), as we might expect.
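For example:

# Check whether any rows are duplicated
dups = fish.duplicated()
print(dups.any())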
First, the presence of any duplicate rows is reported, and in this case, we can see that there are no duplicates (False).
If there are duplicates, we can use the Pandas function drop_duplicates() to drop the duplicate rows.
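A minimal sketch:

# Remove duplicate rows, keeping the first occurrence
fish = fish.drop_duplicates()
print(fish.shape)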
Conclusion
Data cleaning is a critical step in any machine learning project. For most machine learning projects, about 80 percent of the effort is spent on data cleaning. We have discussed some of the key techniques: handling missing values, fixing inconsistencies, removing single-value columns, and dropping duplicate rows.