Sajjad Rahman

Posted on Aug 27

Skewness in Data: Concepts and Math Examples

#resources #machinelearning #tutorial #beginners

When analyzing data, summary statistics like mean and median often help us understand central tendency. But they don’t tell us about the shape of the distribution. Sometimes data leans more to one side — this property is called skewness.

In this post, we will cover:

What skewness is
Types of skewness
A worked-out math example
Real dataset results
Why skewness matters in data analysis and ML

🔹 What is Skewness?

Skewness measures how asymmetric a data distribution is.

Skewness ≈ 0 → Symmetric (e.g., a normal distribution)
Skewness > 0 → Right (Positive) Skew
Skewness < 0 → Left (Negative) Skew

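Formula (the standard moment-based coefficient; sample-adjusted variants rescale it slightly):

$$g_1 = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2\right)^{3/2}}$$

Where:

n is the number of observations
x_i is the i-th observation
x̄ is the mean of the observations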

🔹 Types of Skewness

Positive Skew (Right Skew)

Tail longer on the right.
Mean > Median (typically).
Example: Salaries, house prices.

Negative Skew (Left Skew)

Tail longer on the left.
Mean < Median (typically).
Example: Retirement age, exam scores (few very low values).

No Skew (Symmetric)

Balanced distribution.
Mean ≈ Median ≈ Mode.
Example: Human height.
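The three cases can be checked numerically. Below is a minimal sketch using only the Python standard library; the sample values are hypothetical toy data chosen to illustrate each shape, and `skewness` is a helper written for this example (the moment-based coefficient), not a library function.

```python
from statistics import mean, median

def skewness(values):
    """Moment-based skewness: third central moment / (second central moment)^1.5."""
    n = len(values)
    x_bar = mean(values)
    m2 = sum((x - x_bar) ** 2 for x in values) / n
    m3 = sum((x - x_bar) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

# Hypothetical toy samples, for illustration only.
right = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]   # long tail on the right
left = [100 - x for x in right]             # mirror image: long tail on the left
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]     # balanced around 5

for name, data in [("right", right), ("left", left), ("symmetric", symmetric)]:
    print(name, "mean:", mean(data), "median:", median(data),
          "skewness:", round(skewness(data), 3))
```

In this toy data the mean sits on the tail side of the median for both skewed samples, and the two coincide for the symmetric one.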

🔹 Step-by-Step Math Example

Consider the dataset:

    `X = [40, 45, 50, 55, 60, 65, 70, 75, 80, 100]`
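The intermediate steps can be sketched in a few lines of Python. This assumes the moment-based (population) definition of skewness; a sample-adjusted formula would give a somewhat larger value for the same data.

```python
from statistics import mean, median

X = [40, 45, 50, 55, 60, 65, 70, 75, 80, 100]
n = len(X)

# Step 1: central tendency
x_bar = mean(X)      # 64
x_med = median(X)    # 62.5

# Step 2: deviations from the mean
deviations = [x - x_bar for x in X]

# Step 3: second and third central moments
m2 = sum(d ** 2 for d in deviations) / n   # 294.0
m3 = sum(d ** 3 for d in deviations) / n   # 2808.0

# Step 4: skewness = m3 / m2^(3/2)
skew = m3 / m2 ** 1.5

print(f"mean={x_bar}, median={x_med}, m2={m2}, m3={m3}, skewness={skew:.3f}")
```

The third moment is positive because the cube of the largest deviation (100 - 64 = 36) outweighs all the negative cubes combined.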

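✅ Result: Positive skew (about 0.56 with the moment-based formula; mean 64 > median 62.5). Reading the values as exam scores, most students scored near the average, but one high score (100) stretched the tail to the right.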

🔹 Real Dataset Example

We computed skewness for three numerical features:

no_of_employees    12.26
yr_of_estab        -2.03
prevailing_wage     0.76

📌 Interpretation:

no_of_employees → Highly right-skewed (few very large companies).
yr_of_estab → Highly left-skewed (a few very old companies pull the tail toward early years).
prevailing_wage → Moderately right-skewed (most salaries are average, few very high).
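The post does not say which tool produced these numbers. Data libraries often report a sample-adjusted coefficient rather than the plain moment ratio (pandas `Series.skew`, for example, is documented as returning an unbiased estimate). As a rough sketch of that adjustment, the hypothetical helper below applies the usual correction factor sqrt(n(n-1))/(n-2). Since the dataset itself is not reproduced here, it is applied to the `X` values from the worked example instead.

```python
from math import sqrt
from statistics import mean

def moment_skewness(values):
    """Plain moment-based skewness (no small-sample correction)."""
    n = len(values)
    x_bar = mean(values)
    m2 = sum((x - x_bar) ** 2 for x in values) / n
    m3 = sum((x - x_bar) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

def adjusted_skewness(values):
    """Sample-adjusted skewness: moment skewness times sqrt(n(n-1)) / (n-2)."""
    n = len(values)
    if n < 3:
        raise ValueError("adjusted skewness needs at least 3 observations")
    return moment_skewness(values) * sqrt(n * (n - 1)) / (n - 2)

X = [40, 45, 50, 55, 60, 65, 70, 75, 80, 100]
print(round(moment_skewness(X), 3), round(adjusted_skewness(X), 3))
```

The two values differ noticeably for ten observations and converge as n grows, so it is worth knowing which definition a reported number uses before comparing it with thresholds.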

🔹 Why is Skewness Important?

Outlier Detection

Extreme skew often signals outliers or a long tail, so treat it as a prompt to inspect the extreme values directly.
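One common way to follow up on a skewed feature is the 1.5 × IQR rule. The sketch below is an assumption about how such a check might look, not something the post prescribes: `iqr_outliers` is a hypothetical helper, and `X_extreme` is a made-up variant of the worked-example data with the largest value pushed to 150.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # default 'exclusive' quartile method
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

X = [40, 45, 50, 55, 60, 65, 70, 75, 80, 100]
X_extreme = X[:-1] + [150]  # hypothetical: same data, more extreme top value

print(iqr_outliers(X))          # nothing flagged
print(iqr_outliers(X_extreme))  # the 150 is flagged
```

With this quartile method the 100 in `X` still falls inside the fences, so a positive skewness value alone does not guarantee that a standard outlier rule will flag anything.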

Feature Engineering

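Some ML models (e.g., linear regression with respect to its residuals, Gaussian Naive Bayes) assume approximately normal distributions; tree-based models are much less sensitive to skew.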
Highly skewed data may require transformations (log, Box-Cox, Yeo-Johnson).
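As a minimal illustration of the simplest of these transformations, the sketch below applies `log1p` (log of 1 + x) to a hypothetical right-skewed sample and compares skewness before and after. The data and the `skewness` helper are assumptions made for this example. Note that a plain log and Box-Cox need strictly positive inputs, while Yeo-Johnson also handles zeros and negative values.

```python
from math import log1p
from statistics import mean

def skewness(values):
    """Moment-based skewness: m3 / m2^1.5."""
    n = len(values)
    x_bar = mean(values)
    m2 = sum((x - x_bar) ** 2 for x in values) / n
    m3 = sum((x - x_bar) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed, non-negative feature (illustration only).
raw = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]

# log1p compresses large values far more than small ones.
transformed = [log1p(x) for x in raw]

before = skewness(raw)
after = skewness(transformed)
print(f"before: {before:.2f}, after: {after:.2f}")
```

The transformed feature is still right-skewed here, only less so; whether that is enough depends on the model, and a stronger transformation may be needed for heavier tails.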

Business Insights

Identifies rare but impactful cases (e.g., very high salaries, very large companies).

🔹 Rule of Thumb

-0.5 to +0.5 → Approximately symmetric
-1 to -0.5 or +0.5 to +1 → Moderate skew
Below -1 or above +1 → Highly skewed
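These thresholds translate directly into a small helper. This is a sketch: `describe_skew` is a hypothetical name, and assigning the exact boundary values (±0.5 and ±1) to the milder category is an assumption, since the rule of thumb does not say which side they belong to.

```python
def describe_skew(value):
    """Label a skewness value using the rule-of-thumb thresholds."""
    magnitude = abs(value)
    if magnitude <= 0.5:          # boundary handling is an assumption
        return "approximately symmetric"
    direction = "right" if value > 0 else "left"
    if magnitude <= 1:
        return f"moderately {direction}-skewed"
    return f"highly {direction}-skewed"

# Skewness values reported in the real dataset example.
for feature, value in [("no_of_employees", 12.26),
                       ("yr_of_estab", -2.03),
                       ("prevailing_wage", 0.76)]:
    print(feature, "->", describe_skew(value))
```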

🔹 Conclusion

Skewness is not just a number — it’s a window into your data’s hidden structure. By analyzing skewness during Exploratory Data Analysis (EDA), you can:

Detect and handle outliers
Apply transformations to reduce skew and stabilize variance
Improve machine learning model performance

So next time you see a skewness value in your dataset, take a closer look — it might reveal something important about your data’s story.
