Visualizing the patterns of missing value occurrence with Python

#python #pandas #seaborn #missingno

(A Japanese translation is available here.)

During data analysis, we need to deal with missing values. Handling missing data is a topic deep enough to fill an entire book. However, before doing anything about missing values, we need to know the pattern in which they occur. This article describes easy techniques for visualizing missing value occurrence with Python. The techniques are useful in the early stages of exploratory data analysis.

I've uploaded a Jupyter notebook to my GitHub repo. You can run it using Binder by clicking the badge below.

Prerequisite

I'm using the Titanic train dataset from Kaggle as an example. The rest of this article assumes that the following code has been executed.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('train.csv')

# Confirm the number of missing values in each column.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
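
df.info() reports non-null counts, so the number of missing values has to be worked out by subtraction. As a minimal sketch, the counts and ratios can also be computed directly with isnull(). The small toy DataFrame below is hypothetical stand-in data (not the Kaggle file), used only so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train.csv (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    "Embarked": ["S", "C", "S", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

# isnull() gives a boolean mask; summing it counts missing values per column,
# and taking the mean gives the fraction of missing values per column.
missing_count = toy.isnull().sum()
missing_ratio = toy.isnull().mean()

summary = pd.DataFrame({"missing": missing_count, "ratio": missing_ratio})
print(summary.sort_values("missing", ascending=False))
```

Applied to the real dataset, the same two lines would be df.isnull().sum() and df.isnull().mean().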

Method 1: seaborn.heatmap

The first method uses seaborn.heatmap. The following single line of code visualizes the locations of missing values.

sns.heatmap(df.isnull(), cbar=False)

Plotted against the index, I can see that

the Age column has missing values scattered throughout,
the Cabin column consists mostly of missing values, again scattered throughout, and
the Embarked column has only a few missing values, in the beginning part.

This is not the case for this Titanic dataset, but especially in time series data, we need to know whether the missing values are sparsely scattered or concentrated in one big chunk. This heatmap visualization immediately reveals such a tendency. Also, if two or more columns are correlated in where their missing values occur, that correlation will be visible. (Again, this is not the case for this dataset, but it is important to know that no such correlation exists here.)
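
To complement the visual check, the "sparse or one big chunk" question can also be answered numerically. The following is a minimal sketch, not part of the original notebook: longest_missing_run is a hypothetical helper name, and the two series are made-up examples that contain the same number of missing values in different arrangements.

```python
import numpy as np
import pandas as pd

def longest_missing_run(series):
    """Return the length of the longest run of consecutive missing values."""
    is_missing = series.isnull()
    # A new run id starts whenever the missing/non-missing state changes.
    run_id = (is_missing != is_missing.shift()).cumsum()
    # Summing the boolean mask per run gives the run length for missing runs
    # and 0 for runs of valid values.
    run_lengths = is_missing.groupby(run_id).sum()
    return int(run_lengths.max()) if len(run_lengths) else 0

# Same number of missing values, different arrangement (made-up data).
sparse = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0, np.nan, 7.0, 8.0])
chunked = pd.Series([1.0, 2.0, np.nan, np.nan, np.nan, 6.0, 7.0, 8.0])

print(longest_missing_run(sparse))
print(longest_missing_run(chunked))
```

A count of missing values alone cannot tell these two series apart, which is why looking at the locations matters.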

This single line of code tells us a lot about how missing values occur.

Method 2: missingno module

If you want to go further, the missingno module is useful.
To begin with, install and import it.

pip install missingno

import missingno as msno

If you want a result similar to the seaborn.heatmap output described earlier, use missingno.matrix.

msno.matrix(df)

In addition to the heatmap, there is a sparkline on the right side of this diagram. It is a line plot of each row's data completeness. In this dataset, every row has 10 to 12 valid values, and hence 0 to 2 missing values.
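
The per-row completeness summarized in that plot can be cross-checked with plain pandas. This is a minimal sketch on a hypothetical toy DataFrame (made-up values, four columns only), not the library's own code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    "Embarked": ["S", "C", "S", "S", None],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

# Count the valid (non-missing) values in each row.
valid_per_row = toy.notnull().sum(axis=1)

# The minimum and maximum correspond to the range of row completeness.
print(valid_per_row.min(), valid_per_row.max())

# How many rows have each completeness level.
print(valid_per_row.value_counts().sort_index())
```

On the real dataset, df.notnull().sum(axis=1) gives the same per-row counts for all 12 columns.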

Also, missingno.heatmap visualizes the correlation matrix of missing value locations between columns.

msno.heatmap(df)
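
To see what kind of numbers such a plot is built from, the correlation of missing value locations can be approximated with pandas alone. This is a sketch under assumptions, not necessarily how missingno computes it internally. The toy DataFrame is made up: columns A and B are missing in the same rows, C is missing exactly where A is present, and D is complete.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data designed to show positive and negative correlation.
toy = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "B": ["x", None, "y", None, "z", "w"],
    "C": [np.nan, 1.0, np.nan, 2.0, np.nan, np.nan],
    "D": [10, 20, 30, 40, 50, 60],
})

nullity = toy.isnull()

# Columns that are never (or always) missing have no variance in nullity,
# so a correlation is undefined for them; drop them first.
varying = nullity.loc[:, nullity.nunique() > 1]

# Pearson correlation between the 0/1 missingness indicators.
nullity_corr = varying.astype(int).corr()
print(nullity_corr)
```

A value near 1 means two columns tend to be missing together, and a value near -1 means one tends to be present when the other is missing.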

The missingno module has more features, such as a bar chart of data completeness in each column and a dendrogram generated from the correlation of missing value locations. For more information, the project's README is a good primer.

Closing

This article described two easy visualization methods. seaborn.heatmap is the first choice because it requires only seaborn, but if you need more, the missingno module will help.