The first step in data cleaning for me is typically looking for missing data. Missing data can have different sources: maybe it isn't available, maybe it got lost, maybe it got damaged. Normally it's not an issue, because we can fill it with the average or something like that, and I will show you how to do that. But I think missing data is often very informative in itself.
For instance, if you run an online clothing store and a customer has never clicked on the baby category, it is likely that they do not have children. You can learn a lot simply from the information that is not there.
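One way to keep that signal is to record that a value was missing before you fill it. Here is a minimal sketch of the idea; the tiny dataframe and the column names (customer_id, age, age_missing) are made up for illustration and are not part of the housing data used later.

```python
import numpy as np
import pandas as pd

# Hypothetical customer data: NaN means the customer did not give their age.
shop = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age": [30.0, np.nan, 40.0, np.nan],
})

# First keep the fact that the value was missing as its own column ...
shop["age_missing"] = shop["age"].isna()

# ... and only then fill the gaps, here with the column mean.
shop["age"] = shop["age"].fillna(shop["age"].mean())

print(shop)
```

Filling with the mean is only one option. The point is that the indicator column survives the fill, so the "not there" information is still available afterwards.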
The missingno Library
Missingno is a great Python module that provides a set of visualisations to help you understand the presence and distribution of missing data within a pandas dataframe. This can take the shape of a dendrogram, heatmap, barplot, or matrix plot.
We can determine where missing values occur, the magnitude of the missingness, and whether any of the missing values are associated with each other using these graphs.
Using the pip command, you may install the missingno library:
pip install missingno
Importing Libraries and Loading the Data
import pandas as pd
import missingno as msno
df = pd.read_csv('housing.csv')
df.head()
Quick Analysis with Pandas
Before we utilise the missingno library, there are a few features in the pandas library that can provide us with an idea of how much missing data there is.
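The first method is .describe(). It returns a table of summary statistics for the numeric columns of the dataframe, such as the count, mean, minimum, and maximum values. The count row only includes non-null values, so a column with a lower count than the others has missing data.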
df.describe()
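The count row of describe() counts non-null values only, so subtracting it from the number of rows gives the number of missing values per numeric column. A rough sketch, using a small invented dataframe (toy) so that it runs on its own; with the housing data you would use df instead:

```python
import numpy as np
import pandas as pd

# Invented stand-in data, not the real housing.csv.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

# "count" is the number of non-null values in each numeric column.
non_null = toy.describe().loc["count"]

# Rows in the dataframe minus non-null values = missing values.
missing = len(toy) - non_null
print(missing)
```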
Using the .info() method, we can go one step further. It prints a summary of the dataframe that includes the data type and the number of non-null values for every column, not just the numeric ones.
df.info()
Yet another quick technique is to chain .isna() and .sum():
df.isna().sum()
This produces the number of missing values in each column of the dataframe. The isna() function finds missing values and returns a Boolean result for each element in the dataframe. The sum() function then adds up the True values column by column.
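Raw counts are easier to judge next to the share of rows they represent. Since the mean of a Boolean column is the fraction of True values, isna().mean() gives that share directly. A small sketch on an invented dataframe (toy and summary are illustrative names, not part of the original walkthrough):

```python
import numpy as np
import pandas as pd

# Invented stand-in data, not the real housing.csv.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

# Number of missing values per column.
missing_count = toy.isna().sum()

# Share of missing values per column, as a percentage.
missing_pct = toy.isna().mean() * 100

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary.sort_values("percent", ascending=False))
```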
Using missingno to Identify Missing Data
There are four types of plots in the missingno library for visualising data completeness: barplots, matrix plots, heatmaps, and dendrogram plots.
msno.matrix(df)
In the resulting matrix plot, the total_bedrooms column shows some missing data, visible as white gaps in an otherwise solid column.
msno.bar(df)
The barplot provides a simple plot where each bar represents a column within the dataframe. The height of the bar indicates how complete that column is, i.e., how many non-null values are present.
Notice that the bar for total_bedrooms is shorter than the others.
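The heatmap mentioned earlier answers a different question: do values tend to be missing together? In missingno that plot is msno.heatmap(df), which visualises the nullity correlation between columns. As a hedged sketch of what it is based on, the same numbers can be approximated with plain pandas. The dataframe and its column names (col_a, col_b, col_c) are invented so the example has more than one column with gaps.

```python
import numpy as np
import pandas as pd

# Invented data: col_a and col_b are missing in the same rows,
# col_c is mostly missing where the other two are present.
toy = pd.DataFrame({
    "col_a": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "col_b": [10.0, np.nan, 30.0, np.nan, 50.0, 60.0],
    "col_c": [np.nan, 0.2, np.nan, 0.4, 0.5, np.nan],
})

# 1 where a value is missing, 0 where it is present.
nullity = toy.isna().astype(int)

# Correlation of the missingness patterns between columns:
# close to 1 means "missing together", close to -1 means
# "one is missing when the other is present".
nullity_corr = nullity.corr()
print(nullity_corr.round(2))
```

Here col_a and col_b have a correlation of 1 because their gaps line up exactly, while col_a and col_c come out negative.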
Summary
Identifying missing data before using machine learning is a critical step in the data quality pipeline. This is possible with the missingno library and a sequence of visualisations.
Thank you for your time!