A package has recently come to my attention that makes performing a blanket exploratory data analysis (EDA) of a new dataset easier. This package is pandas_profiling.
To install the package, the command is:
pip install pandas-profiling[notebook]
So what does it do?
These points were taken from the documentation here:
- Type inference: detect the types of columns in a dataframe
- Essentials: type, unique values, missing values
- Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
- Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
- Most frequent values
- Histograms
- Correlations: highlighting of highly correlated variables, with Spearman, Pearson and Kendall matrices
- Missing values: matrix, count, heatmap and dendrogram of missing values
- Duplicate rows: lists the most frequently occurring duplicate rows
- Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data
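To make the list above concrete, here is a small, purely hypothetical DataFrame with mixed types, a missing value and a duplicate row, the sort of thing the type inference, missing-value and duplicate-row checks pick up on:

import numpy as np
import pandas as pd

# Hypothetical example data: mixed column types, one missing value, one duplicate row
df = pd.DataFrame(
    {
        "name": ["Ada", "Grace", "Alan", "Alan"],
        "age": [36, np.nan, 41, 41],
        "joined": pd.to_datetime(["2020-01-01", "2020-02-15", "2020-03-30", "2020-03-30"]),
        "active": [True, False, True, True],
    }
)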
All this is done with one command:
from pandas_profiling import ProfileReport

# In a notebook, the ProfileReport object renders the full report inline
ProfileReport(df, title="Pandas Profiling Report")
df in the code above is a pandas DataFrame object.
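If you want to keep the output around rather than only viewing it inline, the report object can also be written out or embedded. A minimal sketch (the filename report.html is just an example):

from pandas_profiling import ProfileReport

profile = ProfileReport(df, title="Pandas Profiling Report")
profile.to_file("report.html")   # write a standalone HTML report
profile.to_notebook_iframe()     # or render it inline inside Jupyter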
If you have a large dataset (say 1,000,000 rows by 20 columns), you may be better served by passing the minimal kwarg:
ProfileReport(large_dataset, minimal=True)
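Minimal mode skips the more expensive computations, such as correlations and duplicate-row detection, so the report stays fast at that size. As a rough usage sketch (the random large_dataset below is purely illustrative):

import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

# Illustrative data only: 1,000,000 rows x 20 numeric columns
large_dataset = pd.DataFrame(
    np.random.rand(1_000_000, 20),
    columns=[f"col_{i}" for i in range(20)],
)

# minimal=True disables the expensive parts of the report
ProfileReport(large_dataset, title="Minimal Report", minimal=True).to_file("minimal_report.html")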
So while good knowledge of your data is ideal, I think pandas_profiling is a good first step to building that knowledge.