Brenda Mutai
The Importance of Skewness and Kurtosis in EDA

Once the data has been collected and carefully cleaned, the next step is to dive into exploring it. This process, called Exploratory Data Analysis (EDA), plays a vital role in any data project. The insights uncovered during EDA guide and influence the decisions made throughout the entire workflow.
A key activity in EDA is examining the distribution shapes of your variables. Understanding these shapes directly impacts later decisions, including:

  • Preprocessing steps
  • Feature selection strategies
  • Algorithm selection
  • Detecting outliers and deciding whether to remove them

While visualization is useful, it’s often necessary to have numerical measures for greater reliability. Two important metrics for this are skewness and kurtosis, which help evaluate how closely your data’s distribution aligns with the ideal normal distribution.

SKEWNESS

Skewness is a statistical measure that captures the asymmetry of a distribution around its mean. In a perfectly normal distribution, both tails are balanced, but if one side extends farther than the other, the data becomes skewed. Skewness quantifies the extent of this imbalance.
Accurately identifying and measuring skewness helps reveal how data values are distributed around the mean and guides the selection of appropriate statistical methods or transformations. For example, when a distribution is highly skewed, applying normalization or scaling can make it closer to a normal distribution, which in turn can improve model performance.
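As a quick illustration of that last point (the dataset here is synthetic, generated with NumPy rather than taken from a real source), a log transform can pull a heavily right-skewed sample much closer to symmetric:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed data, e.g. something like hospital charges
charges = pd.Series(rng.lognormal(mean=8, sigma=1, size=1000))

print("Skewness before log transform:", charges.skew())
print("Skewness after log transform: ", np.log1p(charges).skew())
```

The skewness drops from well above 1 to near 0 after the transform, which is exactly the kind of reshaping that can help models that assume roughly normal inputs.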

Types of Skewness
There are three types of skewness: positive, negative, and zero skewness.
1. Zero skewness
Zero skewness means the distribution is perfectly symmetrical around its mean. The mean, median, and mode are all at the center point.

2. Positive skewness
A positively skewed (right-skewed) distribution has a longer right tail, with the mean greater than the median and the mode being the smallest. Most values cluster on the left, while a few extreme values stretch the distribution to the right.

3. Negative skewness
A negatively skewed (left-skewed) distribution has a longer left tail, with the mean less than the median and the mode being the largest. Most values cluster on the right, while extreme values pull the distribution to the left.
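These mean–median relationships are easy to check on synthetic data (a sketch: the exponential distribution is used here as a stand-in for a right-skewed sample, and its negation for a left-skewed one):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Right-skewed: the exponential distribution has a long right tail
right = pd.Series(rng.exponential(scale=10, size=1000))
print("Right-skewed -> mean > median:", right.mean() > right.median())

# Left-skewed: negating the sample flips the tail to the left
left = -right
print("Left-skewed  -> mean < median:", left.mean() < left.median())
```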

How to calculate skewness
There are many ways to calculate skewness.
a. Pearson’s second skewness coefficient

  • This is also known as median skewness. It is defined as:

    Skewness = 3 × (mean − median) / standard deviation

  • Let’s implement the formula manually in Python:
import numpy as np
import pandas as pd

# health dataset
bmi = pd.Series([22, 24, 27, 30, 35, 40, 18, 25, 29, 32])

mean_bmi = bmi.mean()
median_bmi = bmi.median()
std_bmi = bmi.std()

skewness_bmi = (3 * (mean_bmi - median_bmi)) / std_bmi

print(
    f"The Pearson's second skewness score of BMI distribution is {skewness_bmi:.5f}"
)


b. Moment-Based Formula (used in statistics libraries)
The more general definition of skewness uses the third standardized moment. The adjusted sample version, implemented below, is:

    Skewness = [n / ((n − 1)(n − 2))] × Σ((x_i − x̄) / s)³

Where:

  • n represents the number of values in a distribution

  • x̄ is the sample mean and s the sample standard deviation

  • x_i denotes each data point

import numpy as np
import pandas as pd

def moment_based_skew(distribution):
    n = len(distribution)
    mean = np.mean(distribution)
    std = np.std(distribution, ddof=1)  # sample std (ddof=1), so the result matches pandas

    # Formula broken into two parts
    first_part = n / ((n - 1) * (n - 2))
    second_part = np.sum(((distribution - mean) / std) ** 3)

    skewness = first_part * second_part
    return skewness

# Example health dataset: BMI values
bmi = pd.Series([18, 21, 23, 25, 27, 30, 34, 38, 42, 45])

print("Moment-based skewness of BMI distribution:", moment_based_skew(bmi))


Built-in methods from pandas or scipy:

import pandas as pd
from scipy.stats import skew

# BMI values
bmi = pd.Series([18, 21, 23, 25, 27, 30, 34, 38, 42, 45])

# Pandas version
print("Pandas skewness:", bmi.skew())

# SciPy version (uses the biased/population estimator by default;
# pass bias=False to match the pandas value)
print("SciPy skewness:", skew(bmi))


KURTOSIS

While skewness describes the asymmetry of a distribution, kurtosis measures its tailedness: how heavy the tails are and how sharp the peak is. High kurtosis means a sharp peak, heavy tails, and a greater chance of extreme values.
Low kurtosis, on the other hand, indicates a flatter peak, lighter tails, and fewer extreme events. For reference, a normal distribution has a kurtosis of exactly 3 (an excess kurtosis of 0).
Types of Kurtosis
Based on kurtosis values, distributions are classified into three types:

  • Mesokurtic (kurtosis = 3, excess = 0): resembles a normal distribution.
  • Leptokurtic (kurtosis > 3, excess > 0): tall peak with heavy tails.
  • Platykurtic (kurtosis < 3, excess < 0): flatter peak with lighter tails.
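The three classes can be illustrated with samples from uniform, normal, and Laplace distributions (a sketch on synthetic data; the values printed are excess kurtosis, which is what SciPy reports by default):

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(1)
n = 100_000

# Excess kurtosis (normal = 0), theoretical values: -1.2, 0, 3
print("Platykurtic (uniform):", kurtosis(rng.uniform(size=n)))
print("Mesokurtic (normal):  ", kurtosis(rng.normal(size=n)))
print("Leptokurtic (Laplace):", kurtosis(rng.laplace(size=n)))
```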

How to calculate kurtosis
If you want a manual calculation of kurtosis, you can use the adjusted sample (excess) kurtosis formula:

    Kurtosis = [n(n + 1) / ((n − 1)(n − 2)(n − 3))] × Σ((x_i − x̄) / s)⁴ − 3(n − 1)² / ((n − 2)(n − 3))

Where:

  • n = number of observations
  • x̄ = sample mean
  • s = sample standard deviation
  • x_i = each data point
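A manual implementation of this formula (a sketch that mirrors the moment-based skewness function above; the function name is my own) might look like:

```python
import numpy as np
import pandas as pd

def moment_based_kurtosis(distribution):
    n = len(distribution)
    mean = np.mean(distribution)
    std = np.std(distribution, ddof=1)  # sample standard deviation

    # Adjusted sample excess kurtosis, broken into its parts
    first_part = (n * (n + 1)) / ((n - 1) * (n - 2) * (n - 3))
    second_part = np.sum(((distribution - mean) / std) ** 4)
    correction = (3 * (n - 1) ** 2) / ((n - 2) * (n - 3))

    return first_part * second_part - correction

bmi = pd.Series([18, 21, 23, 25, 27, 30, 34, 38, 42, 45])
print("Manual kurtosis:", moment_based_kurtosis(bmi))
print("Pandas kurtosis:", bmi.kurt())  # should match
```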

In Python, you can calculate kurtosis the same way as skewness, by using Pandas or SciPy.

import pandas as pd
from scipy.stats import kurtosis

# BMI values
bmi = pd.Series([18, 21, 23, 25, 27, 30, 34, 38, 42, 45])

# SciPy returns excess kurtosis (Fisher definition, normal = 0) by default
print("Kurtosis of BMI distribution:", kurtosis(bmi))


In Pandas, kurtosis can be calculated using either kurt or kurtosis. The two methods are aliases of each other, and both work on Series as well as DataFrames.

import pandas as pd

# health dataset
health = pd.DataFrame({
    "BMI": [18, 21, 23, 25, 27, 30, 34, 38, 42, 45],
    "BloodPressure": [110, 115, 120, 118, 125, 130, 135, 140, 145, 150],
    "Cholesterol": [160, 170, 175, 180, 185, 190, 200, 210, 220, 230]
})

# Kurtosis of a single column (Series)
print("BMI kurtosis:", health["BMI"].kurt())

# Kurtosis of all numeric columns (DataFrame)
print("\nKurtosis of all health metrics:\n", health.kurtosis())


Skewness and kurtosis are powerful metrics in exploratory data analysis. Skewness helps us understand the asymmetry of a distribution, while kurtosis highlights its peakedness and tail behavior. Together, they provide deeper insights beyond simple measures like mean and variance, guiding decisions on preprocessing, transformations, and model selection. By combining visual inspection with these statistical measures, analysts can better assess data quality and prepare it for reliable modeling.
