A package has recently come to my attention that makes performing a blanket exploratory data analysis (EDA) of a new dataset easier. This package is pandas_profiling.
To install the package, the command is:
pip install pandas-profiling[notebook]
So what does it do?
These points were taken from the documentation here:
- Type inference: detect the types of columns in a dataframe
- Essentials: type, unique values, missing values
- Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
- Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
- Most frequent values
- Histograms
- Correlations: highlighting of highly correlated variables, with Spearman, Pearson and Kendall matrices
- Missing values: matrix, count, heatmap and dendrogram of missing values
- Duplicate rows: lists the most frequently occurring duplicate rows
- Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data
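To make the list above concrete, here is a small, purely hypothetical DataFrame with mixed types, a missing value and a duplicate row, the sort of thing the type inference, missing-value and duplicate-row checks pick up on:

import numpy as np
import pandas as pd

# Hypothetical example data: mixed column types, one missing value, one duplicate row
df = pd.DataFrame(
    {
        "name": ["Ada", "Grace", "Alan", "Alan"],
        "age": [36, np.nan, 41, 41],
        "joined": pd.to_datetime(["2020-01-01", "2020-02-15", "2020-03-30", "2020-03-30"]),
        "active": [True, False, True, True],
    }
)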
All this is done with one command:
from pandas_profiling import ProfileReport

# In a notebook, the ProfileReport object renders the full report inline
ProfileReport(df, title="Pandas Profiling Report")
df in the code above is a pandas DataFrame object.
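If you want to keep the output around rather than only viewing it inline, the report object can also be written out or embedded. A minimal sketch (the filename report.html is just an example):

from pandas_profiling import ProfileReport

profile = ProfileReport(df, title="Pandas Profiling Report")
profile.to_file("report.html")   # write a standalone HTML report
profile.to_notebook_iframe()     # or render it inline inside Jupyter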
If you have a large dataset (say 1,000,000 rows by 20 columns), you may be better served by passing the minimal kwarg:
ProfileReport(large_dataset, minimal=True)
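Minimal mode skips the more expensive computations, such as correlations and duplicate-row detection, so the report stays fast at that size. As a rough usage sketch (the random large_dataset below is purely illustrative):

import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

# Illustrative data only: 1,000,000 rows x 20 numeric columns
large_dataset = pd.DataFrame(
    np.random.rand(1_000_000, 20),
    columns=[f"col_{i}" for i in range(20)],
)

# minimal=True disables the expensive parts of the report
ProfileReport(large_dataset, title="Minimal Report", minimal=True).to_file("minimal_report.html")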
So while good knowledge of your data is ideal, I think pandas_profiling is a good first step to building that knowledge.