Raman Butta

Posted on Jul 27 • Edited on Aug 6

Basics of EDA

Exploratory Data Analysis (EDA) is often the first step in getting a fair idea of a dataset. It lays the foundation for subsequent steps like feature engineering and model building, and it also gives you condensed tabular and visual summaries of large datasets.

Whenever you first get a dataset df, you should do a light EDA on it. These are the must-dos, in quick succession:

I. Textual Analyses

df.shape : gives the dimensions of the dataframe as (rows, columns)
df : shows a truncated view of the rows and columns. If you want it to show all columns, run: pd.set_option('display.max_columns', None)
df.head() : the first five rows, for a first impression
df.describe() : summary statistics (count, mean, std, min, quartiles, max) of numeric columns (ref: https://youtube.com/watch?v=g2OpfqWi2tM)
df.info() : lists each column's dtype and non-null count, so columns with missing values stand out. Missing values can be imputed with .fillna() or dropped with .dropna(). You can also count the nulls per column with df.isnull().sum()
separate the numeric and categorical columns into different dataframes
generate a pivot_table of a numeric column (values) against the target column (index). This is similar to groupby. For example,
tips.pivot_table(values='tip', index='sex') and tips.groupby('sex')['tip'].mean() give the same numbers; pivot_table returns a DataFrame, which displays more neatly.
look at the value_counts() of the categorical columns
break those value_counts down by the target's values by generating a pivot_table with the target column as index, the categorical column as columns, and aggfunc='count'
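
The last few steps of this list can be sketched as below. This is only an illustration: the rows are made up to mimic the column layout of seaborn's tips dataset, "sex" is assumed to be the target column, and names such as df_cat, tip_pivot and day_by_sex are chosen here for the example.

```python
import pandas as pd

# Made-up rows that only mimic the column layout of seaborn's "tips" dataset.
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71],
    "sex": ["Female", "Male", "Male", "Male", "Female", "Male"],
    "day": ["Sun", "Sun", "Sat", "Sat", "Fri", "Fri"],
})

# Separate numeric and categorical columns into different dataframes.
df_num = tips.select_dtypes(include="number")
df_cat = tips.select_dtypes(exclude="number")

# Mean tip per level of the assumed target "sex", computed two ways.
tip_pivot = tips.pivot_table(values="tip", index="sex")  # DataFrame
tip_group = tips.groupby("sex")["tip"].mean()            # Series

# Frequencies of one categorical column.
day_counts = tips["day"].value_counts()

# The same frequencies split by the target:
# one row per "sex", one column per "day", cells hold row counts.
day_by_sex = tips.pivot_table(values="tip", index="sex", columns="day",
                              aggfunc="count", fill_value=0)

print(df_num.columns.tolist(), df_cat.columns.tolist())
print(tip_pivot, tip_group, day_counts, day_by_sex, sep="\n\n")
```

Note that pivot_table with aggfunc='count' needs some column to count; 'tip' is used here only because it has no missing values.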

II. Graphical Analyses

Categorical variables

If you want to plot the frequencies (y) of the different categories (x) of a categorical variable, use countplot, eg:

sns.countplot(x="day", data=tips)
plt.show()

As you can see, you give x, and y is generated automatically. It is the graphical version of value_counts().

Or you can make a pie chart:

counts = tips['day'].value_counts()
plt.pie(counts, labels=counts.index)
plt.axis('equal')
plt.show()

Numerical variables

0. Histogram

If you want to graph the frequencies (y) of different bins (x) of a numeric variable in a column i, use a histogram, eg:

 plt.hist(df_num[i])
 plt.title(i)
 plt.show()

As you can see, you give i, and x, y are automatically generated.

If you use sns.histplot(df_num[i], kde=True) instead on that column, it creates a histogram with a KDE curve overlay. (The older sns.distplot did the same, but it is deprecated.)
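
To see what "automatically generated" means for a histogram, the binning can be reproduced with numpy.histogram. This is a small sketch on made-up values; the choice of 4 bins is an assumption for readability (plt.hist uses 10 equal-width bins by default).

```python
import numpy as np
import pandas as pd

# Made-up values standing in for one numeric column.
values = pd.Series([16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.88])

# edges are the bin boundaries (x), counts are the frequencies per bin (y).
counts, edges = np.histogram(values, bins=4)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"[{left:.2f}, {right:.2f}): {n}")
```

Four bins need five edges, and every value lands in exactly one bin, so the counts add up to the number of rows.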

1. General Catplot Syntax for num vars grouped across cat var

If you want to graph one numerical variable (y) grouped by a categorical variable (x), the general syntax is:

sns.catplot(x='day', y='total_bill', data=tips, kind='...', hue='...')

Within kind you can put the kind of plot. Allowed values for kind are: 'strip', 'swarm', 'box', 'boxen', 'violin', 'bar', 'count', and 'point'.

Within hue you can put another categorical variable to sub-group the numerical variable within each category.

Although you have this general syntax, it's helpful to know the specific plot functions below.

2. Barplot

If you want to graph the aggregated value of one numerical variable (y) grouped by a categorical variable (x), use a barplot. Eg

sns.barplot(x="day", y="total_bill", data=tips, estimator=sum, ci=None)
plt.show()

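If you drop the estimator argument, it defaults to the mean. For central tendencies like the mean or median, you may drop the ci argument so that it falls back to its default, a 95% confidence interval drawn as error bars. In seaborn 0.12 and later, ci is deprecated in favor of errorbar, so the equivalent of ci=None is errorbar=None.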
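
The bar heights are just a groupby aggregation, so you can check them in pandas. A minimal sketch, using made-up rows in place of the real tips data:

```python
import pandas as pd

# Made-up rows that only mimic two columns of seaborn's "tips" dataset.
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
    "day": ["Sun", "Sun", "Sat", "Sat", "Fri", "Fri"],
})

# Heights a barplot would draw with estimator=sum.
totals = tips.groupby("day")["total_bill"].sum()

# Heights a barplot would draw with the default estimator (mean).
means = tips.groupby("day")["total_bill"].mean()

print(totals)
print(means)
```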

3. Boxplot

Note that sns.barplot(x='day', y='total_bill', data=tips) only gives the mean of total_bill for each day. If you instead want to focus on quartiles, spread, and outliers, use a boxplot. Eg:

sns.boxplot(x="day", y="total_bill", data=tips)
plt.show()

This creates a boxplot showing the median, interquartile range (IQR), and potential outliers for total_bill across each day. The box represents the IQR (25th to 75th percentile), the line inside the box is the median, and the "whiskers" extend to the most extreme data points that lie within 1.5 * IQR of the box edges. Points beyond the whiskers are plotted as outliers. You can add hue to further split the boxes by another categorical variable, e.g., hue="sex", to compare distributions across subgroups.
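
The numbers behind one box can be worked out by hand. The sketch below uses made-up bill amounts and the usual 1.5 * IQR rule; the variable names are chosen here for illustration.

```python
import pandas as pd

# Made-up bill amounts for a single day, with one unusually large value.
bills = pd.Series([10.0, 12.0, 13.0, 14.0, 15.0, 16.0, 18.0, 40.0])

# Box edges and the median line.
q1, median, q3 = bills.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Fences: anything outside them is drawn as an outlier point.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = bills[(bills >= lower_fence) & (bills <= upper_fence)]
outliers = bills[(bills < lower_fence) | (bills > upper_fence)]

# Whiskers stop at the most extreme points that are still inside the fences.
whisker_low, whisker_high = inside.min(), inside.max()

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers.tolist())
```

Here the upper whisker ends at 18.0 rather than at the fence itself, and 40.0 is the only point flagged as an outlier.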

4. KDE Plot

If instead of the boxplot, you want a visual KDE curve of total_bill across each day, use

sns.kdeplot(data=tips, x='total_bill', hue='day', multiple='layer')  # or multiple='stack', 'fill'
plt.show()
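
If you are wondering what a KDE curve is, it is a smoothed histogram: one small Gaussian bump is placed on every data point and the bumps are averaged. The sketch below shows the idea with NumPy only. The sample values and the bandwidth of 2.0 are assumptions made for illustration; seaborn picks its own bandwidth automatically.

```python
import numpy as np

# Made-up bill amounts.
samples = np.array([10.0, 12.0, 15.0, 20.0, 22.0])

# Hypothetical smoothing width, fixed by hand for this illustration.
bandwidth = 2.0

# Points on the x-axis where the curve is evaluated.
grid = np.linspace(0.0, 35.0, 701)

# One Gaussian bump per sample, averaged over all samples.
z = (grid[:, None] - samples[None, :]) / bandwidth
density = np.exp(-0.5 * z**2).sum(axis=1) / (len(samples) * bandwidth * np.sqrt(2 * np.pi))

# A density curve should enclose an area of about 1 (trapezoid rule).
area = np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid))
print(round(area, 3))
```

A smaller bandwidth gives a spikier curve and a larger one gives a flatter curve.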

5. Scatterplots

If you want to see the relationship between two numerical variables, use a scatterplot. Eg:

sns.scatterplot(x="total_bill", y="tip", data=tips, hue="sex")
plt.title("Scatterplot of Total Bill vs Tip")
plt.show()

This creates a scatterplot where each point represents a pair of total_bill (x-axis) and tip (y-axis) values from the tips dataset. The hue="sex" argument colors points by the categorical variable sex.

6. Pairplot

If the KDE curves of total_bill for the different days overlap so much that total_bill alone cannot separate the days, look at pairs of numerical variables instead. Together they may form distinct clusters in the xy plane for different days (hues) and separate the days more sharply. A pairplot shows this:

sns.pairplot(tips, vars=['total_bill', 'tip', 'size'], hue='day')
plt.show()

You list the numerical variables in vars. If you omit vars, it plots every numerical column in the dataframe.

With hue set, the pairplot grid gives you KDE curves along the diagonal and scatterplots off the diagonal. Without hue, the diagonal shows histograms instead.

7. Violinplot

A violinplot is like a boxplot drawn inside a KDE curve.

sns.violinplot(x="day", y="total_bill", data=tips)
plt.show()

This creates a violinplot that combines a boxplot with a kernel density estimate (KDE) for total_bill across each day. The width of each "violin" represents the density of the data at different values, which shows the shape of the distribution more clearly than a boxplot does. The inner boxplot shows the median and IQR, while the KDE outline shows the full distribution. You can add hue to split the violins by another categorical variable, e.g., hue="sex". When hue has exactly two levels, split=True draws the two distributions as the two halves of a single violin within each category for easier comparison.

8. Correlation heatmaps

If you want to visualize the correlation between numerical variables in a dataset, use a heatmap of the correlation matrix. Eg:

sns.heatmap(tips.corr(numeric_only=True), annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.show()

This creates a heatmap where each cell represents the correlation coefficient between a pair of numerical variables in the tips dataset. The corr() method computes the Pearson correlation by default, ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation). Passing numeric_only=True makes it skip categorical columns such as sex and day; without it, recent pandas versions raise an error on non-numeric columns. The annot=True argument displays the correlation values in each cell, and cmap='coolwarm' uses a diverging color scheme to highlight positive (red) and negative (blue) correlations. The vmin and vmax parameters set the range for the color scale.
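
To inspect the matrix that the heatmap colors, you can compute it on its own. A minimal sketch on made-up rows; selecting the numeric columns first with select_dtypes is one way to leave out the categorical ones.

```python
import pandas as pd

# Made-up rows that only mimic the column layout of seaborn's "tips" dataset.
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71],
    "size": [2, 3, 3, 2, 4, 4],
    "day": ["Sun", "Sun", "Sat", "Sat", "Fri", "Fri"],
})

# Keep only the numeric columns, then compute pairwise Pearson correlations.
corr = tips.select_dtypes(include="number").corr()

print(corr.round(2))
```

The result is a square, symmetric table with 1.0 on the diagonal, because every variable correlates perfectly with itself.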

III. Applying Filters on a categorical column

First find all the categories of the chosen categorical variable. Eg.

print(df['day'].unique())

Then say 'Sunday' is one of those categories and you want to filter on it:

days_df = df[df['day'] == 'Sunday']

or if you want to keep only the rows whose day contains the word "sun" (case-insensitive), run

days_df = df[df['day'].str.contains('sun', case=False, na=False)]

And then

days_df
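
Here is a small end-to-end sketch of both filters. The dataframe is made up for illustration, including one missing value to show what na=False does.

```python
import pandas as pd

# Made-up rows; the None stands for a missing day.
df = pd.DataFrame({
    "day": ["Sunday", "Saturday", "sunday brunch", None, "Friday"],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
})

# Exact match: keeps only rows where day is exactly 'Sunday'.
exact_df = df[df["day"] == "Sunday"]

# Substring match, ignoring case; na=False treats missing days as "no match".
contains_df = df[df["day"].str.contains("sun", case=False, na=False)]

print(exact_df)
print(contains_df)
```

The exact match keeps one row, while the substring match also keeps "sunday brunch".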
