Oluseye Jeremiah

Posted on May 31, 2022 • Originally published at oluseyejeremiah.hashnode.dev

Dealing With Missing Values In A Dataset. A Comprehensive Guide To Handling Missing Values In Machine Learning.

#python #datascience #machinelearning #tutorial

INTRODUCTION

One of the most common issues that Developers in the Data industry have had to deal with over the years is the issue of missing data. Data scientists, analysts, data engineers, and machine learning engineers all face the same issue, and the primary cause of missing values is data collection (i.e. the fact that data is usually collected from many different sources).

There have been a lot of solutions provided to fix the problem of missing data but none of it has been a permanent solution as each solution has its flaws. So each solution depends on the type of dataset and what the data will be used for. For instance, when building a machine model for predicting the prices of houses there are certain statistics that are needed and cannot just be replaced by any value, and the removal of such important value can reduce the accuracy of our model, giving us a biased model.

So it's important to use the method that works best for the dataset. To determine the method you first have to note down the size of the dataset, the percentage of data that is missing, and what the dataset is to be used for, all these statistics help determine which of the methods to use to find the missing value.

In this article, I will briefly explain and list some methods that can be used to deal with missing data with some hands-on examples.

Basic Steps Involved

1)The use of central tendencies for imputing values

Mean
Median
Mode

2) Dropping the column with the missing data.

3) Filling the column with new values.

Example

We will start the hands-on example by first importing all the libraries that are needed in this tutorial.

import pandas as pd
import matplotlib as plt
import seaborn as sns

We then load the required dataset from Kaggle

dataset = pd. read_csv('/Highest Holywood Grossing Movies.csv')

Afterward, we start some exploratory data analysis

dataset.info()

dataset.head()

This is to actually familiarize ourselves with the dataset and also note the location of the missing data.

dataset.tail()

Then we use the dataset. shape function to know the number of rows and columns present in our data.

dataset.shape

The dataset. isnull() function returns false if the values in the dataset are not missing and returns true if it is missing.

dataset.isnull().head()

The dataset. isnull().head() function only gives us the details on the first 5 columns but to know the exact column and how many values are missing we use the dataset. isnull().sum() function that tells us which column has missing values and the number of missing values it has.


dataset.isnull().sum()

The use of central tendencies like the mean, median, and mode

Mean: This is the average of the total of the numbers
Median: This is the middle number when all numbers are arranged in an alphabetical or ascending order.
Mode: This is the number with the most occurring frequencies.

dataset['License'].fillna(dataset['License'].median(),inplace=True)

The above code shows how to fill our empty dataset with the median value. it also applies to the mean and mode just have to make a few changes to the code.


dataset['License'].fillna(dataset['License'].mean(),inplace=True)

How to fill the empty dataset using the mode value.

dataset['License'].fillna(dataset['License'].mode(),inplace=True)

it is important to note though that the use of central tendencies may not always apply to all kinds of datasets, mostly datasets involving numbers.

2) Dropping the column with the missing data.

To drop the column or row with the missing value is quite straightforward as it requires just a line of code but so as not to mess us the entire data there are features you need to include when dropping the column.

dataset.dropna(how='any').shape

The above line of code drops a row if any of its values are missing and as we can see we lost a lot of rows.

dataset.dropna(how='all').shape

The above code is quite similar to the previous one it drops the row if all of its values are missing and we can see that all our rows remain intact because it is not empty

dataset.dropna(subset=['License' , 'Release Date'], how='any').shape

The above method scans through the dataset and checks for the row where values are missing in the License and Release Date column.

Similarly, we can specify which row to remove using the "thresh hold " parameter and what this does is that it keeps any row that has just one missing value. You can always increase the thresh depending on how many rows you want to keep

dataset.dropna(thresh=1).shape

It is important to note that if the data missing in our dataset is above 60% it is advisable to discard such dataset.

3) Fill in the missing values.

The last method involves filling the missing values with either the previous value or any other value of our choice. Also, this is also not the best method as it can cause inaccuracy in our data, and also it is not always realistic.

The first step might involve filling all empty spaces with zero this might work for some datasets but not all. For instance, our dataset has the Release Date column empty filling the date column with zero is not helping our dataset in any way.

dataset.fillna(0)

Another method to fill the missing row is by creating a dictionary and what this does is that it specifies the exact column you want to fill with a new value.

dataset.fillna({
'Release Date':July 6, 2014,
'Liencse':PG-13,
})

The above code fills all the empty spaces in the 'Release Date' column with July 6, 2014 so we have about 118 rows filled with the same value and this can cause some serious alterations in the model. Conversely filling the 'License' column with 'PG-13' has some level of accuracy because every other row is filled with the same thing.

Also, we can use the forward fill('ffill') method that fills the missing value with the previous value and also the backward fill that fills the value with the next value or preceding value.

dataset.fillna(method=''ffill'')

dataset.fillna(method=''bfill'')

So also in a situation where you don't want to fill the entire column with the previous or preceding value you can set a limit.


dataset.fillna(method=''ffill'', limit=2)

The above code only fills the next two empty columns with the previous value.

Conclusion

Thank you for completing this article. I hope you learned something new or perhaps you found it helpful. You can reach out to me on Twitter

DEV Community

Dealing With Missing Values In A Dataset. A Comprehensive Guide To Handling Missing Values In Machine Learning.

INTRODUCTION

Basic Steps Involved

Example

Conclusion

data-science

data-analysis

machine-learning

codenewbies

python

Top comments (0)