DEV Community

Dylan Lisk

Making EDA easy

I recently came across a package that makes a blanket exploratory data analysis (EDA) of a new dataset much easier: pandas_profiling.
To install the package, run:

pip install pandas-profiling[notebook]

So what does it do?
These points are taken from the package documentation:

-Type inference: detect the types of columns in a dataframe.
-Essentials: type, unique values, missing values
-Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
-Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
-Most frequent values
-Histograms
-Correlations: highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices
-Missing values: matrix, count, heatmap and dendrogram of missing values
-Duplicate rows: lists the most occurring duplicate rows
-Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data
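
To get a feel for what the report automates, here is a rough sketch of a few of those checks done by hand with plain pandas. The toy data and the quick_overview helper are made up for illustration; this is not how pandas_profiling computes things internally, just the manual equivalent of a handful of the items above.

```python
import pandas as pd


def quick_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column essentials: inferred dtype, unique count, missing count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_unique": df.nunique(),      # NaN is not counted as a value
        "n_missing": df.isna().sum(),
    })


# Toy data, made up for illustration only
df = pd.DataFrame({
    "age": [23, 35, 35, None, 51],
    "income": [30.0, 52.0, 52.0, 41.0, 88.0],
    "city": ["Oslo", "Bergen", "Bergen", "Oslo", None],
})

overview = quick_overview(df)
numeric = df.select_dtypes(include="number")
quantiles = numeric.quantile([0.25, 0.5, 0.75])  # Q1, median, Q3
pearson = numeric.corr()                         # Pearson correlation matrix
n_duplicates = int(df.duplicated().sum())        # rows repeating an earlier row
top_cities = df["city"].value_counts().head(3)   # most frequent values

print(overview)
print(quantiles)
print(pearson)
print("duplicate rows:", n_duplicates)
print(top_cities)
```

Every one of these is a separate call you would have to remember and stitch together, which is the work the report takes off your hands.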

All of this is done with one command:

from pandas_profiling import ProfileReport
ProfileReport(df, title="Pandas Profiling Report")

Here, df is a pandas DataFrame.
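
Outside a notebook you will probably want the report as a file. A minimal sketch, assuming the installed version exposes ProfileReport.to_file; the save_profile helper and the report.html path are hypothetical names chosen for this example.

```python
import pandas as pd


def save_profile(df: pd.DataFrame, path: str = "report.html") -> str:
    """Build a profile report for df and write it to an HTML file."""
    # Imported inside the function so this module loads even when
    # pandas_profiling is not installed.
    from pandas_profiling import ProfileReport

    profile = ProfileReport(df, title="Pandas Profiling Report")
    profile.to_file(path)
    return path


# Usage, assuming df is a pandas DataFrame:
# save_profile(df, "report.html")
```

The resulting HTML file can be opened in a browser or shared with someone who does not have Python installed.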

If you have a large dataset (say, 1,000,000 rows by 20 columns), you may be better served by passing the kwarg minimal=True, which turns off the most expensive computations:

ProfileReport(large_dataset, minimal=True)
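
Another option, which is my own suggestion and not something shown above, is to profile a random sample of the rows instead of the whole table. A sketch with synthetic data standing in for large_dataset (kept smaller than the 1,000,000-row example so it runs quickly); the sample size of 10,000 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a large dataset: 200,000 rows by 20 columns
rng = np.random.default_rng(0)
large_dataset = pd.DataFrame(
    rng.normal(size=(200_000, 20)),
    columns=[f"col_{i}" for i in range(20)],
)

# Keep 10,000 rows; random_state makes the sample reproducible
sample = large_dataset.sample(n=10_000, random_state=0)

# The sample can then be profiled as usual, e.g.
# ProfileReport(sample, title="Profile of a 10,000-row sample")
print(sample.shape)
```

A sample will miss rare values and rare duplicates, so treat its report as a first look and not a full audit.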

So while good knowledge of your data is ideal, I think pandas_profiling is a good first step to building that knowledge.
