Creating an Exploratory Data Analysis Report with Pandas-Profiling

#python #datascience #analytics #pandasprofiling

Data scientists and analysts usually spend some time getting to know the data they are going to work on through exploratory analysis. It's one of the first steps in their journey, before any further analysis or predictions. As Pythonistas, when we do exploratory analysis with pandas, we rely on methods and attributes such as head, describe, info, columns, shape, isnull, value_counts, unique, duplicated, corr, and so on. Visualization libraries such as seaborn or matplotlib are also essential.
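To make that concrete, here is a minimal sketch of the manual route, using a few of the pandas calls listed above. The tiny DataFrame and its values are made up for illustration; they are not the Titanic data used later in this post.

```python
import pandas as pd

# Small made-up DataFrame (hypothetical values, for illustration only)
df = pd.DataFrame({
    "age": [22, 38, 26, 35, 35],
    "fare": [7.25, 71.28, 7.93, 53.10, 53.10],
    "sex": ["male", "female", "female", "female", "female"],
})

print(df.head())      # first rows
print(df.shape)       # (number of rows, number of columns)
df.info()             # column types and non-null counts
print(df.describe())  # summary statistics for the numeric columns

# Frequency of each category and the distinct values of a column
sex_counts = df["sex"].value_counts()
sex_values = df["sex"].unique()
print(sex_counts)
print(sex_values)

# Number of rows that exactly repeat an earlier row
n_duplicates = int(df.duplicated().sum())
print(n_duplicates)
```

Each call answers one question at a time, so a full exploration quickly turns into a long notebook.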

What if, with just a few lines of code, we could get the insights that would otherwise require all of the methods I mentioned above? What if they came as a report with built-in visualizations? Wow, that would save us a lot of time! And in fact, we can do that. Fortunately, pandas-profiling can provide us with a report full of exploratory insights.

pandas-profiling is an open-source Python library that allows us to quickly do exploratory analysis with just a few lines of code. As I mentioned before, it can generate an interactive report that shows the distribution of each variable alongside other insights we usually extract from DataFrames during exploratory analysis. This report can be saved in HTML format and easily shared with anyone. Awesome, right?!

Now, let's see in practice how it works.

Installing pandas-profiling

You can install it from the command line via pip.

pip install pandas-profiling[notebook]

Generating an Exploratory Data Analysis Report

After installing it, go to your Jupyter Notebook and load the data you want to explore as a DataFrame object. As an example, we can use the Titanic dataset, but feel free to use the data you want. See the code below.

import pandas as pd

url = 'https://raw.githubusercontent.com/gabrielatrindade/ml-playground/master/projects/titanic/dataset/train.csv'
titanic = pd.read_csv(url)

Then, let's import the ProfileReport class to create the report for the dataframe.

from pandas_profiling import ProfileReport

Now, we are able to create the report.

profile = ProfileReport(titanic, explorative=True,
                        title='Titanic Exploratory Analysis')

profile

In the code above, explorative=True asks for a deeper exploration, and title sets the title of the report.

We can see the report as output in the Jupyter Notebook. However, if you want to generate an HTML file to share the analysis with someone, it's also possible. Check the code below.

profile.to_file('output_titanic_report.html')

The report contains a lot of information. Below I list the main sections.

Overview: shows general statistics of the data, information about the report, and warnings. The warnings highlight issues that can strongly affect the analysis, such as a high number of null values in a variable, duplicated rows, and high correlation between variables.
Variables: shows descriptive and quantile statistics for each variable. For continuous variables, it also shows the histogram and the common and extreme values. For categorical variables, it shows a pie chart and the frequency of each value.
Interactions: shows the relationship between two variables as a scatter plot.
Correlations: shows heatmaps of the Pearson, Spearman, Kendall, and Phik correlation matrices.
Missing values: shows the missing values of each variable as a bar chart or a matrix visualization.
Sample: shows the first 10 and the last 10 rows.
Duplicate rows: shows the duplicated rows.
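
To relate two of these sections to plain pandas, the sketch below computes the kind of numbers that the Missing values and Correlations sections summarize. It is not how pandas-profiling computes them internally; the DataFrame is made up, and only Pearson and Spearman are shown.

```python
import pandas as pd

# Made-up numeric data with one missing age (hypothetical values)
df = pd.DataFrame({
    "age": [22.0, 38.0, None, 35.0, 54.0],
    "fare": [7.25, 71.28, 7.93, 53.10, 51.86],
    "pclass": [3, 1, 3, 1, 1],
})

# Missing values: absolute count and share per column
missing_per_column = df.isnull().sum()
missing_share = df.isnull().mean()
print(missing_per_column)
print(missing_share)

# Correlation matrices; rows with a missing value are skipped pair by pair
pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")
print(pearson)
print(spearman)
```

The report presents the same kind of information as charts and heatmaps, which are easier to scan than raw tables.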

The image below shows what the report looks like.

Pandas-profiling limitation

One limitation I noticed with pandas-profiling shows up on large datasets: as the dataset grows, the report generation time increases considerably.

One way to work around this problem is to generate the report from a sample of the dataset. If you select only some of the rows, make sure they are representative of all the data you have. You can also select only the variables you want to explore.
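
Both ideas can be sketched with plain pandas. The DataFrame below is a made-up stand-in for a large dataset, and the sample size of 1,000 rows is an arbitrary choice. A simple random sample is only one way to aim for representative rows.

```python
import pandas as pd

# Hypothetical stand-in for a large dataset
big = pd.DataFrame({
    "id": range(10_000),
    "value": [i % 7 for i in range(10_000)],
    "group": ["a" if i % 2 == 0 else "b" for i in range(10_000)],
})

# Random sample of 1,000 rows; random_state makes the sample reproducible
sample = big.sample(n=1_000, random_state=42)

# Alternatively, keep only the variables you want to explore
subset = big[["value", "group"]]

print(sample.shape)
print(subset.shape)
```

Either sample or subset can then be passed to ProfileReport in place of the full DataFrame.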

Another way to deal with this problem is to use minimal mode (introduced in pandas-profiling version 2.4). It generates a simplified report that takes less time than the full one.

profile = ProfileReport(titanic, minimal=True,
                        title='Titanic Exploratory Analysis')
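
Since minimal mode depends on the installed version, a quick check can save some confusion. The sketch below uses the standard library to read the installed version; the helper names are my own, and the parsing assumes a version string that starts with numeric major and minor parts.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def supports_minimal_mode(installed_version):
    """Return True if a version string such as '2.9.0' is at least 2.4."""
    match = re.match(r"(\d+)\.(\d+)", installed_version)
    if match is None:
        raise ValueError(f"Unexpected version format: {installed_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (2, 4)


def installed_profiling_version():
    """Return the installed pandas-profiling version, or None if it is missing."""
    try:
        return version("pandas-profiling")
    except PackageNotFoundError:
        return None


installed = installed_profiling_version()
if installed is None:
    print("pandas-profiling is not installed")
else:
    print(installed, "supports minimal mode:", supports_minimal_mode(installed))
```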

Conclusion

I have shown you how easily we can get an exploratory data analysis report using the pandas-profiling library. With a few lines of code, we can generate an interactive report and save it as an HTML file. The ProfileReport class can save a lot of work when you are getting to know the data and looking for first insights.

I hope it can be useful for you! See you next time!
