Welcome to the fourth article in the "7 Days of Pandas" series where we cover the pandas
library in Python which is used for data manipulation.
In the first article of the series, we looked at how to read and write CSV files with Pandas.
In the second article, we looked at how to perform basic data manipulation.
In the third article, we looked at how to perform EDA (exploratory data analysis) with Pandas.
In this tutorial, we will look at how to handle missing values in data.
When working on some data, it's not uncommon to find missing values in the data. Missing values can occur in data for a variety of reasons, for example, error in data capture, encoding issues, etc. It's important to deal with missing values before you proceed on further analyzing data and it's a major step in data preprocessing.
In this tutorial, we will cover the following topics:
- Identifying missing values.
- Handling missing values.
- Filling missing values.
- Removing missing values.
Before we begin, let's first import pandas and create a sample dataframe that we will be using throughout this tutorial.
import pandas as pd
import numpy as np
# employee data
data = {
"Name": ["Tim", "Shaym", "Noor", "Esha", "Sam", "James", "Lily"],
"Age": [26, 28, 27, 32, 24, None, 33],
"Department": ["Marketing", "Product", "Product", "HR", "Product", np.nan, "Marketing"],
"Salary": [60000, np.nan, 82000, np.nan, 58000, 55000, 65000]
}
# create pandas dataframe
df = pd.DataFrame(data)
# display the dataframe
df
Name | Age | Department | Salary | |
---|---|---|---|---|
0 | Tim | 26.0 | Marketing | 60000.0 |
1 | Shaym | 28.0 | Product | NaN |
2 | Noor | 27.0 | Product | 82000.0 |
3 | Esha | 32.0 | HR | NaN |
4 | Sam | 24.0 | Product | 58000.0 |
5 | James | NaN | NaN | 55000.0 |
6 | Lily | 33.0 | Marketing | 65000.0 |
Identifying missing values
To identify missing values in a pandas DataFrame, you can use the pandas isna()
method, which returns a boolean mask indicating the presence of missing values. You can then use this mask to select the rows or columns with missing values.
For example, let's check which values in the above dataframe are missing using the isna()
function.
df.isnull()
Name | Age | Department | Salary | |
---|---|---|---|---|
0 | False | False | False | False |
1 | False | False | False | True |
2 | False | False | False | False |
3 | False | False | False | True |
4 | False | False | False | False |
5 | False | True | True | False |
6 | False | False | False | False |
You can see the resulting boolean mask.
To check which columns in the dataframe have missing value, apply the any()
function on the resulting boolean dataframe with axis=0
.
df.isna().any(axis=0)
Name False
Age True
Department True
Salary True
dtype: bool
We see that only the "Name" column doesn't have any missing values.
Handling missing values
Handling missing values is an important step in the data preparation pipeline. Generally, there are two approaches to handle missing values -
- Fill the missing value with some appropriate value (for example, a constant or mean, median, etc. for continuous variables, and mode for categorical fields).
- Remove the missing values (remove the records with missing values).
Let's now look at how to do both of them in pandas.
Filling missing values
To fill missing values in a pandas DataFrame, you can use the pandas fillna()
method. This method allows you to specify a value to fill the missing values with, or a method for imputing the missing values.
For example, let's see what we get if we fill the missing values with 0.
df.fillna(0)
Name | Age | Department | Salary | |
---|---|---|---|---|
0 | Tim | 26.0 | Marketing | 60000.0 |
1 | Shaym | 28.0 | Product | 0.0 |
2 | Noor | 27.0 | Product | 82000.0 |
3 | Esha | 32.0 | HR | 0.0 |
4 | Sam | 24.0 | Product | 58000.0 |
5 | James | 0.0 | 0 | 55000.0 |
6 | Lily | 33.0 | Marketing | 65000.0 |
It returns a dataframe with the missing values filled with the constant value. Note that the fillna()
function didn't modify the original dataframe in-place. It returned the resulting dataframe after filling the missing values.
You can also specify different values for different columns when filling missing values.
For exmaple, let's fill missing values in "Age" and "Salary" columns with their respective means and the missing value in the "Department" column with its mode (the most frequent value).
df.fillna({'Age': df['Age'].mean(), 'Salary': df['Salary'].mean(), 'Department': df['Department'].mode()[0]})
Name | Age | Department | Salary | |
---|---|---|---|---|
0 | Tim | 26.000000 | Marketing | 60000.0 |
1 | Shaym | 28.000000 | Product | 64000.0 |
2 | Noor | 27.000000 | Product | 82000.0 |
3 | Esha | 32.000000 | HR | 64000.0 |
4 | Sam | 24.000000 | Product | 58000.0 |
5 | James | 28.333333 | Product | 55000.0 |
6 | Lily | 33.000000 | Marketing | 65000.0 |
Dropping rows with missing values
Another strategy of handling missing values is to remove the rows that contain missing values. This is used when the proportion of missing values is comparitively less and we can afford to discard that data.
Use the pandas dropna()
function to remove rows with missing values.
df.dropna()
Name | Age | Department | Salary | |
---|---|---|---|---|
0 | Tim | 26.0 | Marketing | 60000.0 |
2 | Noor | 27.0 | Product | 82000.0 |
4 | Sam | 24.0 | Product | 58000.0 |
6 | Lily | 33.0 | Marketing | 65000.0 |
This is how the dataframe looks after removing rows with any missing values.
Top comments (0)