DEV Community

Raman Butta


Basics of EDA

Exploratory Data Analysis (EDA) is usually the first step toward understanding a dataset. It lays the foundation for later steps such as feature engineering and model building, and it is invaluable for producing condensed tabular and visual summaries of large datasets.

Whenever you first get a dataset df, do some light EDA on it. These are the must-dos, in quick succession:

I. Textual Analyses

  1. df.shape : gives the dimension of the dataframe

  2. df : shows the first few rows and first few columns. If you want it to show all columns, run: pd.set_option('display.max_columns', None)

  3. df.head() : the first impression

  4. df.describe() : summary statistics of the numeric columns: count, mean, standard deviation, min, quartiles, and max (ref: https://youtube.com/watch?v=g2OpfqWi2tM)

  5. df.info() : shows each column's dtype and non-null count, so columns with missing values stand out. Missing values can be imputed with .fillna() or dropped with .dropna(). You can also get the number of nulls per column from df.isnull().sum()

  6. separate the numerical and categorical columns into different dataframes, e.g. with df.select_dtypes()

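  7. generate a pivot_table with a numerical column as "values" and the target vector as "index". This is similar to groupby. For example,
    tips.pivot_table(values='tip', index='sex') and tips.groupby('sex')['tip'].mean() give the same numbers, because pivot_table aggregates with the mean by default. pivot_table returns a DataFrame, though, which displays more nicely than the Series returned by groupby.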

  8. look at the value_counts() of the categorical columns

  9. split those value_counts by the target's classes: generate a pivot_table with the target vector as "index", the categorical column as "columns", any fully populated column as "values", and aggfunc='count'
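The checklist above can be sketched end to end as follows. This is only a minimal sketch: it assumes a tiny hand-made stand-in for the tips dataset (the values are made up for illustration), treats sex as the target, and uses the hypothetical names num_cols and cat_cols for the split dataframes.

```python
import pandas as pd

# Tiny made-up stand-in for the tips dataset (illustrative values only)
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
    "tip": [1.01, 1.66, 3.50, None, 3.61, 4.71],
    "sex": ["Female", "Male", "Male", "Male", "Female", "Male"],
    "day": ["Sun", "Sun", "Sat", "Sat", "Fri", "Sun"],
})

print(df.shape)       # (rows, columns)
print(df.head())      # first rows
print(df.describe())  # summary statistics of the numeric columns
df.info()             # dtypes and non-null counts per column

# Number of missing values per column
null_counts = df.isnull().sum()
print(null_counts)

# Impute the missing tip with the column median (one option among several)
df["tip"] = df["tip"].fillna(df["tip"].median())

# Split numerical and categorical columns into separate dataframes
num_cols = df.select_dtypes(include="number")
cat_cols = df.select_dtypes(exclude="number")

# Mean of each numerical column per target class (sex is the assumed target)
mean_by_sex = df.pivot_table(values=["total_bill", "tip"], index="sex")
print(mean_by_sex)

# Frequencies of a categorical column
day_counts = df["day"].value_counts()
print(day_counts)

# The same frequencies, split by target class
counts_by_sex = df.pivot_table(
    index="sex", columns="day", values="tip", aggfunc="count", fill_value=0
)
print(counts_by_sex)
```

Imputing with the median is just one choice here; dropping the row with .dropna() would serve equally well for a first look.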

II. Graphical Analyses

Categorical variables

  • If you want to plot the frequencies (y) of the different categories (x) of a categorical variable, use countplot, eg:
sns.countplot(x="day", data=tips)
plt.show()

As you can see, you only give x; y (the count) is generated automatically. It is the graphical version of value_counts().

Or you can make a pie chart:

counts = tips['day'].value_counts()
plt.pie(counts, labels=counts.index)
plt.axis('equal')
plt.show()

Numerical variables

0. Histogram

  • If you want to graph the frequencies (y) of different bins (x) of a numeric variable in a column i, use a histogram, eg:
 plt.hist(df_num[i])
 plt.title(i)
 plt.show()

As you can see, you pass only the column i; the bins (x) and their frequencies (y) are generated automatically.
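To see what those automatically generated x and y values are, you can compute them without plotting. The sketch below assumes a small made-up array standing in for df_num[i] and relies on plt.hist defaulting to 10 equal-width bins.

```python
import numpy as np

# Made-up numeric column standing in for df_num[i]
values = np.array([3.1, 4.7, 5.0, 5.2, 6.8, 7.4, 7.5, 8.9, 9.3, 12.0])

# plt.hist uses 10 equal-width bins by default; np.histogram with bins=10
# returns the matching bar heights (counts) and bin edges
counts, edges = np.histogram(values, bins=10)

# Each bin includes its left edge; only the last bin also includes its right edge
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.2f} to {right:.2f}: {n}")
```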

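If you use sns.distplot on that column instead, it creates a histogram with a KDE curve overlaid. Note that distplot has been deprecated since seaborn 0.11; sns.histplot(df_num[i], kde=True) is the current equivalent.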

1. General Catplot Syntax for num vars grouped across cat var

If you want to plot a numerical variable (y) grouped by a categorical variable (x), the general syntax is:

sns.catplot(x='day', y='total_bill', data=tips, kind='...', hue='...')

Within kind you can put the kind of plot. Allowed values for kind are: 'strip', 'swarm', 'box', 'boxen', 'violin', 'bar', 'count', and 'point'.

Within hue you can put another categorical variable to sub-group the numerical variable within each category of x.

Although you have this general syntax, it's helpful to know the specific functions below.

2. Barplot

  • If you want to graph the aggregated value of one numerical variable (y) grouped by a categorical variable (x), use a barplot. Eg
sns.barplot(x="day", y="total_bill", data=tips, estimator=sum, ci=None)
plt.show()

If you drop the estimator argument, it defaults to the mean. For central tendencies like the mean or median, you may drop ci=None so that the bars show 95% confidence intervals, which is the default. In seaborn 0.12 and later, ci is deprecated in favor of errorbar, so the equivalent there is errorbar=None.
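The bar heights are simply a grouped aggregation, so you can check them with groupby. This is a minimal sketch on a made-up five-row stand-in for tips; the names bar_sum and bar_mean are hypothetical.

```python
import pandas as pd

# Made-up stand-in for the tips dataset (illustrative values only)
tips = pd.DataFrame({
    "day": ["Sat", "Sat", "Sun", "Sun", "Sun"],
    "total_bill": [10.0, 20.0, 15.0, 25.0, 35.0],
})

# Bar heights you would get with estimator=sum
bar_sum = tips.groupby("day")["total_bill"].sum()

# Bar heights you would get with the default estimator (mean)
bar_mean = tips.groupby("day")["total_bill"].mean()

print(bar_sum)
print(bar_mean)
```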

3. Boxplot

Note that sns.barplot(x='day', y='total_bill', data=tips) will only give the mean of total_bill in each day (bin). If you instead want to focus on quartiles, spread, and outliers, use a boxplot. Eg:

sns.boxplot(x="day", y="total_bill", data=tips)
plt.show()

This creates a boxplot showing the median, interquartile range (IQR), and potential outliers for total_bill across each day. The box represents the IQR (25th to 75th percentile), the line inside the box is the median, and the "whiskers" extend to the most extreme data points that lie within 1.5 * IQR of the box edges. Points beyond the whiskers are plotted as outliers. You can add hue to further split the boxes by another categorical variable, e.g., hue="sex", to compare distributions across subgroups.
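The quantities a boxplot draws can be computed by hand, which helps when you want to list the outliers rather than just see them. This is a sketch on a made-up series of bills for a single day, assuming the common 1.5 * IQR rule and linear-interpolation quantiles; the variable names are hypothetical.

```python
import pandas as pd

# Made-up bills for a single day (illustrative values only)
bills = pd.Series([10.0, 12.0, 13.0, 14.0, 15.0, 16.0, 18.0, 40.0])

q1 = bills.quantile(0.25)   # bottom edge of the box
median = bills.median()     # line inside the box
q3 = bills.quantile(0.75)   # top edge of the box
iqr = q3 - q1

# Fences: the furthest a whisker is allowed to reach
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points inside the fences
inside = bills[(bills >= lower_fence) & (bills <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the fences is drawn as an outlier
outliers = bills[(bills < lower_fence) | (bills > upper_fence)]

print(q1, median, q3, iqr)
print(whisker_low, whisker_high)
print(outliers.tolist())
```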

4. KDE Plot

If instead of the boxplot, you want a visual KDE curve of total_bill across each day, use

sns.kdeplot(data=tips, x='total_bill', hue='day', multiple='layer')  # or multiple='stack', 'fill'
plt.show()

5. Scatterplots

Eg:

sns.scatterplot(x="total_bill", y="tip", data=tips, hue="sex")
plt.title("Scatterplot of Total Bill vs Tip")
plt.show()

This creates a scatterplot where each point represents a pair of total_bill (x-axis) and tip (y-axis) values from the tips dataset. The hue="sex" argument colors points by the categorical variable sex.

6. Pairplot

If you find that the KDE curves of total_bill for the different days overlap so much that total_bill alone does not separate the days well, you may want to look at pairs of numerical variables instead. A pair may form distinct clusters in the xy plane across the different days (hues) and so serve as a sharper classifier. For that you need a pairplot like this:

sns.pairplot(tips, vars=['total_bill', 'tip', 'size'], hue='day')
plt.show()

You keep numerical variables in vars. If you omit vars, it plots for all numerical variables in the dataframe.

With hue set, the pairplot grid gives you KDE curves along the diagonal and scatterplots off the diagonal. Without hue, the diagonal shows histograms by default.

7. Violinplot

A violinplot is like a boxplot drawn inside a KDE curve.

sns.violinplot(x="day", y="total_bill", data=tips)
plt.show()

This creates a violinplot that combines a boxplot with a kernel density estimate (KDE) for total_bill across each day. The width of each "violin" represents the density of the data at different values, providing a clearer view of the distribution's shape compared to a boxplot. The inner boxplot shows the median and IQR, while the KDE extends to show the full distribution. You can add hue to split the violins by another categorical variable, e.g., hue="sex". When hue has two levels, split=True draws one half of the violin per level, so the two distributions sit back to back within each category for easier comparison.

8. Correlation heatmaps

If you want to visualize the correlation between numerical variables in a dataset, use a heatmap of the correlation matrix. Eg:

sns.heatmap(tips.corr(numeric_only=True), annot=True, cmap='coolwarm', vmin=-1, vmax=1)  # numeric_only skips categorical columns such as day and sex
plt.show()

This creates a heatmap where each cell represents the correlation coefficient between pairs of numerical variables in the tips dataset. The corr() method computes the Pearson correlation by default, ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation). The annot=True argument displays the correlation values in each cell, and cmap='coolwarm' uses a diverging color scheme to highlight positive (red) and negative (blue) correlations. The vmin and vmax parameters set the range for the color scale.
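Before drawing the heatmap it can help to look at the correlation matrix itself. The sketch below uses a made-up four-row stand-in for tips and selects the numerical columns explicitly, since corr() cannot handle string columns in recent pandas versions. Here, select_dtypes has the same effect as corr(numeric_only=True), which is available from pandas 1.5.

```python
import pandas as pd

# Made-up stand-in for the tips dataset (illustrative values only)
tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.0, 2.0, 3.0, 4.0],
    "size": [4, 3, 2, 1],
    "day": ["Sat", "Sat", "Sun", "Sun"],
})

# Keep only numerical columns, then compute Pearson correlations
corr = tips.select_dtypes(include="number").corr()
print(corr)
```

The resulting corr dataframe is what you would pass to sns.heatmap.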

III. Applying Filters on a categorical column

First find all the distinct values (categories) of the chosen categorical variable. Eg.

print(df['day'].unique())

Then, say 'Sunday' is one of those values and you want to keep only the rows that match it:

days_df = df[df['day'] == 'Sunday']

Or, if you want to keep the rows whose day contains the text "sun" (case-insensitive), run

days_df = df[df['day'].str.contains('sun', case=False, na=False)]

And then

days_df
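Put together, the filtering steps look like this on a small dataframe. This is a sketch with made-up rows, including a missing day to show why na=False matters; the names exact and partial are hypothetical.

```python
import pandas as pd

# Made-up data with mixed-case values and one missing day
df = pd.DataFrame({
    "day": ["Sunday", "Saturday", "sunday brunch", None, "Friday"],
    "total_bill": [16.99, 20.65, 9.50, 12.00, 8.77],
})

# Distinct values of the categorical column
print(df["day"].unique())

# Exact match: keeps only rows where day is exactly 'Sunday'
exact = df[df["day"] == "Sunday"]

# Substring match, case-insensitive; na=False treats missing days as non-matches
partial = df[df["day"].str.contains("sun", case=False, na=False)]

print(exact)
print(partial)
```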
