DEV Community

Phylis Jepchumba
Phylis Jepchumba

Posted on • Updated on

How to handle missing Values in a dataset using Python.

What is missing data?

Missing data, or missing values, occur when no data value is stored for the variable in an observation.

Effects of having missing values in a dataset
  • Absence of data reduces statistical power, which refers to the probability that the test will reject the null hypothesis when it is false.
  • The lost data can cause bias in the estimation of parameters.
  • Reduces the representativeness of the samples.
  • Absence of data may complicate the analysis of the study.

The above effects may lead to invalid conclusions. To avoid this we will look at different ways and techniques of handling missing values using python programming language.

In this article we will work with Melbourne Housing dataset from kaggle.

Let's now import the dataset as follows:

import pandas as pd
import numpy as np
df=pd.read_csv('melb_data.csv')
df
Enter fullscreen mode Exit fullscreen mode

Screenshot (109)

Find out whether we have missing values in our data.

From pandas official documentation,isnull() function is used to detect missing values for an array-like object.

Now lets find out if there are missing values in our data;

#sum() returns the number of missing values in the data set.
df.isnull().sum()
Enter fullscreen mode Exit fullscreen mode

Screenshot (110)

From the above output, we find out that the following columns have missing values.

BuildingArea    
YearBuilt        
CouncilArea 
Enter fullscreen mode Exit fullscreen mode

We can also use df.info() function to find out the number of non-null values in each column.

Screenshot (111)

From the above output we can see that
BuildingArea, YearBuilt, CouncilArea columns are incomplete.

Different methods that you can use to deal with the missing data.

1.Deleting the columns/rows with missing data

From pandas official documentation,dropna() function is used to remove rows and columns with Null/NaN values.

In this case lets delete the columns with missing values as follows;

Screenshot (112)

Also if there is a certain row with missing data, then you can delete the entire row with all the features in that row.

  • axis=1 is used to drop the column with NaN values.

  • axis=0 is used to drop the row with NaN values.

2.Filling the missing data with a value- Imputation

Removing the rows and columns containing missing data may not be the best approach since they might contain valuable data.

Instead, we can fill the missing data using the following common ways;

  • Filling missing data with mean and median of the data if its a numerical variable.

In this case we will use the fillna() function to fill missing values in 'BuildingArea' and 'YearBuilt' columns using mean since they are numerical values.

df['BuildingArea'].fillna(int(df['BuildingArea'].mean()), inplace=True)
df['YearBuilt'].fillna(int(df['YearBuilt'].mean()), inplace=True)

df.info()
Enter fullscreen mode Exit fullscreen mode
  • Filling the missing data with mode if it’s a categorical value.
df['CouncilArea'].fillna(df['CouncilArea'].mode(), inplace=True)
Enter fullscreen mode Exit fullscreen mode
  • Fill Missing data with a Constant
df['BuildingArea'].fillna(0, inplace=True)
df['YearBuilt  '].fillna(0, inplace=True)
Enter fullscreen mode Exit fullscreen mode

3.Imputation with an additional column
Use the SimpleImputer() function from sklearn module to impute the values.

From scikit learn official documentation,SimpleImputer() is an imputation transformer for completing missing values with mean , median or mode.

To use SimpleImputer() pass the strategy as an argument to function. It can be either mean, mode or median.

The default is mean.

df['CouncilArea_Missing'] = df['CouncilArea'].isnull()
from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer(strategy = 'mean')
df['CouncilArea_Missing']
Enter fullscreen mode Exit fullscreen mode

The above code with add an additional column 'CouncilArea_Missing' with True as value, if it is a null value and False if it is not a null value.

Screenshot (113)

The above are just few ways of handling missing values. You can experiment through them and many others to find the best approach that can work for your data.

Thank you for reading the article 🥳🥳🥳.

Top comments (4)

Collapse
 
mattonw profile image
mattonw

Perfect explanation!

Collapse
 
phylis profile image
Phylis Jepchumba

Thank you. Am happy you found it helpful.

Collapse
 
billyndirangu profile image
billy_dev

awesome

Collapse
 
phylis profile image
Phylis Jepchumba

Thank you 🤓