Once the data has been collected and carefully cleaned, the next step is to dive into exploring it. This process, called Exploratory Data Analysis (EDA), plays a vital role in any data project. The insights uncovered during EDA guide and influence the decisions made throughout the entire workflow.
A key activity in EDA is examining the distribution shapes of your variables. Understanding these shapes directly impacts later decisions, including:
- Preprocessing steps
- Feature selection strategies
- Algorithm selection
- Detecting outliers and deciding whether to remove them
While visualization is useful, it’s often necessary to have numerical measures for greater reliability. Two important metrics for this are skewness and kurtosis, which help evaluate how closely your data’s distribution aligns with the ideal normal distribution.
SKEWNESS
Skewness is a statistical measure that captures the asymmetry of a distribution around its mean. In a perfectly normal distribution, both tails are balanced, but if one side extends farther than the other, the data becomes skewed. Skewness quantifies the extent of this imbalance.
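Accurately identifying and measuring skewness helps reveal how data values are distributed around the mean and guides the selection of appropriate statistical methods or transformations. For example, when a distribution is highly skewed, a transformation such as a logarithm, square root, or Box-Cox can pull it closer to a normal distribution, which can improve the performance of models that assume roughly symmetric inputs. Plain scaling or normalization (for example, min-max or z-score scaling) only shifts and rescales the values, so it leaves the skewness unchanged.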
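As a rough illustration of how a transformation affects skewness, the sketch below applies a log transform to a small, made-up right-skewed sample (the income values are hypothetical and not taken from any dataset in this article) and compares the skewness before and after. For comparison, it also computes the skewness after z-score scaling.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample (illustrative values only)
income = pd.Series([20, 22, 25, 27, 30, 35, 40, 60, 120, 400])

# Log transform: compresses large values, shortening the right tail
log_income = np.log(income)

# Z-score scaling: shifts and rescales the values but keeps the shape
scaled_income = (income - income.mean()) / income.std()

skew_raw = income.skew()
skew_log = log_income.skew()
skew_scaled = scaled_income.skew()

print(f"Skewness of raw values:         {skew_raw:.3f}")
print(f"Skewness after log transform:   {skew_log:.3f}")
print(f"Skewness after z-score scaling: {skew_scaled:.3f}")
```

The log transform only works on strictly positive values; for data containing zeros, np.log1p is a common alternative.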
Types of Skewness
There are three types of skewness: positive, negative, and zero skewness.
1. Zero skewness
Zero skewness means the distribution is balanced around its mean, as in a perfectly symmetrical distribution. For a symmetrical distribution with a single peak, the mean, median, and mode all sit at the center point.
2. Positive skewness
A positively skewed (right-skewed) distribution has a longer right tail. Typically the mean is greater than the median, and the mode is the smallest of the three. Most values cluster on the left, while a few extreme values stretch the distribution to the right.
3. Negative skewness
A negatively skewed (left-skewed) distribution has a longer left tail. Typically the mean is less than the median, and the mode is the largest of the three. Most values cluster on the right, while a few extreme values pull the distribution to the left.
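The three cases can be checked on small samples. The sketch below uses hypothetical values chosen only to show each shape, and prints the mean, median, and skewness for each one.

```python
import pandas as pd

# Hypothetical samples chosen to show each case
symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 2, 2, 3, 3, 3, 4, 10, 25])
left_skewed = 26 - right_skewed  # mirror image of the right-skewed sample

cases = [
    ("symmetric", symmetric),
    ("right-skewed", right_skewed),
    ("left-skewed", left_skewed),
]

for name, sample in cases:
    print(
        f"{name:>12}: mean={sample.mean():.2f}, "
        f"median={sample.median():.2f}, skew={sample.skew():.3f}"
    )
```

In the right-skewed sample the mean sits above the median, and in the mirrored sample the relationship flips along with the sign of the skewness.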
How to calculate skewness
There are many ways to calculate skewness.
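a. Pearson's second skewness coefficient
This is also known as median skewness. It is three times the difference between the mean and the median, divided by the standard deviation:
$$Sk_2 = \frac{3(\bar{x} - \text{median})}{s}$$
Let's implement the formula manually in Python: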
import pandas as pd

# health dataset
bmi = pd.Series([22, 24, 27, 30, 35, 40, 18, 25, 29, 32])

mean_bmi = bmi.mean()
median_bmi = bmi.median()
std_bmi = bmi.std()  # pandas uses the sample standard deviation (ddof=1) by default

# Pearson's second coefficient: 3 * (mean - median) / standard deviation
skewness_bmi = (3 * (mean_bmi - median_bmi)) / std_bmi
print(
    f"The Pearson's second skewness score of BMI distribution is {skewness_bmi:.5f}"
)
b. Moment-Based Formula (used in statistics libraries)
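The more general definition of skewness is based on the third standardized moment. The sample-adjusted version (the adjusted Fisher-Pearson coefficient) is:
$$G_1 = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3$$
Where:
- n represents the number of values in a distribution
- x_i denotes each data point
- x̄ is the sample mean
- s is the sample standard deviation (computed with n - 1 in the denominator)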
import numpy as np
import pandas as pd

def moment_based_skew(distribution):
    n = len(distribution)
    mean = np.mean(distribution)
    # Sample standard deviation (ddof=1); the n / ((n - 1)(n - 2)) factor assumes it
    std = np.std(distribution, ddof=1)

    # Formula broken into two parts
    first_part = n / ((n - 1) * (n - 2))
    second_part = np.sum(((distribution - mean) / std) ** 3)

    skewness = first_part * second_part
    return skewness

# Example health dataset: BMI values
bmi = pd.Series([18, 21, 23, 25, 27, 30, 34, 38, 42, 45])
print("Moment-based skewness of BMI distribution:", moment_based_skew(bmi))
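In practice, you can use the built-in methods from pandas or SciPy. Note that pandas applies a sample-size (bias) correction by default, while SciPy's skew returns the uncorrected estimate unless you pass bias=False, so the two printed values differ on small samples: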
import pandas as pd
from scipy.stats import skew
# BMI values
bmi = pd.Series([18, 21, 23, 25, 27, 30, 34, 38, 42, 45])
# Pandas version
print("Pandas skewness:", bmi.skew())
# SciPy version
print("SciPy skewness:", skew(bmi))
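pandas and SciPy can print different numbers for the same data. A minimal sketch of why, relying on SciPy's documented bias parameter: with bias=False, SciPy applies the same sample-size correction that pandas uses by default, and the two results line up.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

bmi = pd.Series([18, 21, 23, 25, 27, 30, 34, 38, 42, 45])

pandas_skew = bmi.skew()                 # bias-corrected by default
scipy_biased = skew(bmi)                 # bias=True by default (no correction)
scipy_corrected = skew(bmi, bias=False)  # same correction as pandas

print("pandas:", pandas_skew)
print("SciPy (bias=True):", scipy_biased)
print("SciPy (bias=False):", scipy_corrected)
print("Match after correction:", np.isclose(pandas_skew, scipy_corrected))
```

The gap between the two estimates shrinks as the sample grows, so it matters most for small datasets like this one.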
KURTOSIS
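While skewness describes the asymmetry of a distribution, kurtosis describes how heavy its tails are relative to a normal distribution. It is often described as peakedness or flatness, but it is driven mainly by the tails: high kurtosis means heavy tails and a greater chance of extreme values, often alongside a sharp peak.
Low kurtosis, on the other hand, indicates lighter tails and fewer extreme values, often with a flatter peak. For reference, a normal distribution has a kurtosis of exactly 3. Subtracting 3 gives the excess kurtosis, which is 0 for a normal distribution.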
Types of Kurtosis
Based on kurtosis values, distributions are classified into three types:
- Mesokurtic (kurtosis = 3, excess = 0): resembles a normal distribution.
- Leptokurtic (kurtosis > 3, excess > 0): tall peak with heavy tails.
- Platykurtic (kurtosis < 3, excess < 0): flatter peak with lighter tails.
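To make the three categories concrete, the sketch below draws large random samples from a normal, a Laplace, and a uniform distribution and computes their excess kurtosis with SciPy. The choice of distributions, the seed, and the sample size are assumptions made for illustration. In theory the excess kurtosis is 0 for the normal, 3 for the Laplace, and -1.2 for the uniform distribution, and the sample estimates should land near those values.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable
n = 100_000

samples = {
    "normal (mesokurtic)": rng.normal(size=n),
    "Laplace (leptokurtic)": rng.laplace(size=n),
    "uniform (platykurtic)": rng.uniform(size=n),
}

# scipy.stats.kurtosis returns excess kurtosis by default (normal = 0)
excess = {name: kurtosis(values) for name, values in samples.items()}

for name, value in excess.items():
    print(f"{name:<22} excess kurtosis = {value:+.2f}")
```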
How to calculate kurtosis
If you want a manual calculation of kurtosis, you can use the following formula:
- n = number of observations
- x̄ = sample mean
- s = sample standard deviation
- x_i = each data point
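One common formula that uses exactly these symbols is the bias-corrected sample excess kurtosis, which is also the estimator pandas reports. The sketch below is a manual implementation under the assumption that this is the intended formula; the function name sample_excess_kurtosis is hypothetical, and the sample needs more than three observations.

```python
import numpy as np
import pandas as pd

def sample_excess_kurtosis(values):
    """Bias-corrected sample excess kurtosis (assumed formula, needs n > 3)."""
    x = np.asarray(values, dtype=float)
    n = len(x)
    mean = x.mean()
    s = x.std(ddof=1)  # sample standard deviation

    # Sum of fourth powers of the standardized deviations
    fourth_power_sum = np.sum(((x - mean) / s) ** 4)

    # n(n+1) / ((n-1)(n-2)(n-3)) * sum  -  3(n-1)^2 / ((n-2)(n-3))
    scale = (n * (n + 1)) / ((n - 1) * (n - 2) * (n - 3))
    correction = (3 * (n - 1) ** 2) / ((n - 2) * (n - 3))
    return scale * fourth_power_sum - correction

bmi = pd.Series([18, 21, 23, 25, 27, 30, 34, 38, 42, 45])
manual_kurt = sample_excess_kurtosis(bmi)

print("Manual excess kurtosis:", manual_kurt)
print("pandas kurt():         ", bmi.kurt())
```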
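In Python, you can calculate kurtosis the same way as skewness, by using pandas or SciPy. Both report excess kurtosis by default (Fisher's definition), so a normal distribution scores 0 rather than 3.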
import pandas as pd
from scipy.stats import kurtosis
# BMI values
bmi = pd.Series([18, 21, 23, 25, 27, 30, 34, 38, 42, 45])
print("Kurtosis of BMI distribution:", kurtosis(bmi))
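SciPy's kurtosis function has two documented switches that change the number it prints: fisher (excess kurtosis versus the raw value where a normal distribution scores 3) and bias (whether the sample-size correction is applied). A short sketch of how they relate, using the same BMI values:

```python
import numpy as np
import pandas as pd
from scipy.stats import kurtosis

bmi = pd.Series([18, 21, 23, 25, 27, 30, 34, 38, 42, 45])

excess_kurt = kurtosis(bmi)                 # Fisher definition: normal = 0
pearson_kurt = kurtosis(bmi, fisher=False)  # Pearson definition: normal = 3
corrected_kurt = kurtosis(bmi, bias=False)  # bias-corrected excess kurtosis

print("Excess kurtosis (default):", excess_kurt)
print("Pearson kurtosis:", pearson_kurt)
print("Bias-corrected excess kurtosis:", corrected_kurt)
print("pandas kurt():", bmi.kurt())
```

The Pearson value is the default value plus 3, and the bias-corrected value is the one that matches pandas.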
In pandas, kurtosis can be calculated using either kurt or kurtosis. The two names are aliases of the same method, and both work on Series objects as well as entire DataFrames.
import pandas as pd
# health dataset
health = pd.DataFrame({
    "BMI": [18, 21, 23, 25, 27, 30, 34, 38, 42, 45],
    "BloodPressure": [110, 115, 120, 118, 125, 130, 135, 140, 145, 150],
    "Cholesterol": [160, 170, 175, 180, 185, 190, 200, 210, 220, 230]
})
# Kurtosis of a single column (Series)
print("BMI kurtosis:", health["BMI"].kurt())
# Kurtosis of all numeric columns (DataFrame)
print("\nKurtosis of all health metrics:\n", health.kurtosis())
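A quick check, assuming a reasonably recent pandas version, that kurt and kurtosis return the same values on both a single column and a whole DataFrame:

```python
import pandas as pd

health = pd.DataFrame({
    "BMI": [18, 21, 23, 25, 27, 30, 34, 38, 42, 45],
    "BloodPressure": [110, 115, 120, 118, 125, 130, 135, 140, 145, 150],
    "Cholesterol": [160, 170, 175, 180, 185, 190, 200, 210, 220, 230],
})

# Both names are available on a Series...
series_same = health["BMI"].kurt() == health["BMI"].kurtosis()

# ...and on a DataFrame, where they return one value per column
frame_same = health.kurt().equals(health.kurtosis())

print("Series results identical:", series_same)
print("DataFrame results identical:", frame_same)
```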
Skewness and kurtosis are powerful metrics in exploratory data analysis. Skewness helps us understand the asymmetry of a distribution, while kurtosis highlights its tail behavior and how prone it is to extreme values. Together, they provide deeper insights beyond simple measures like mean and variance, guiding decisions on preprocessing, transformations, and model selection. By combining visual inspection with these statistical measures, analysts can better assess data quality and prepare it for reliable modeling.