Sachin Kr. Rajput

MAE vs MSE vs RMSE: Three Bosses With Very Different Philosophies on Punishing Late Employees

The One-Line Summary: MAE treats all errors equally (10 minutes late = 10 penalty). MSE punishes big errors catastrophically (10 minutes = 100, 60 minutes = 3,600). RMSE is MSE converted back to interpretable units. Choose based on whether you want to forgive small errors or destroy large ones.


The Three Bosses of Predictive Analytics Inc.

Three managers at Predictive Analytics Inc. need to evaluate their delivery drivers based on arrival time accuracy.

All three drivers had the same performance last week:

Delivery 1: Predicted 2:00 PM, Arrived 2:10 PM → 10 min late
Delivery 2: Predicted 3:00 PM, Arrived 3:05 PM →  5 min late  
Delivery 3: Predicted 4:00 PM, Arrived 3:55 PM →  5 min early (-5)
Delivery 4: Predicted 5:00 PM, Arrived 6:00 PM → 60 min late (traffic!)

Same data. Three VERY different scores.


Boss A: "Fair Frank" (MAE Philosophy)

"Late is late. Early is early. Every minute counts the same. I don't care if you're 5 minutes late or 60 — I just add up all the minutes."

Penalty calculation:
|10| + |5| + |-5| + |60| = 10 + 5 + 5 + 60 = 80

Average penalty: 80 / 4 = 20 minutes

"Your average error is 20 minutes."

Frank's verdict: "You're off by 20 minutes on average. Improve."


Boss B: "Squared Sarah" (MSE Philosophy)

"Small mistakes? Whatever. But BIG mistakes are UNACCEPTABLE. I square every error — small errors stay small, big errors become MASSIVE."

Penalty calculation:
10² + 5² + (-5)² + 60² = 100 + 25 + 25 + 3600 = 3750

Average penalty: 3750 / 4 = 937.5 squared-minutes

"Your average squared error is 937.5."

Sarah's verdict: "That one 60-minute disaster dominates everything. 937.5! Unacceptable!"


Boss C: "Root Rachel" (RMSE Philosophy)

"I agree with Sarah's philosophy — big mistakes should hurt more. But 'squared minutes' is meaningless. Let me convert back to regular minutes."

Penalty calculation:
Same as Sarah: 3750 / 4 = 937.5

Then take square root: √937.5 = 30.6 minutes

"Your root mean squared error is 30.6 minutes."

Rachel's verdict: "Accounting for how bad that 60-minute disaster was, your 'effective average error' is 30.6 minutes."


The Scoreboard

Boss     Metric   Score        Philosophy
Frank    MAE      20 min       All errors equal
Sarah    MSE      937.5 min²   Big errors punished severely
Rachel   RMSE     30.6 min     Big errors punished, interpretable units

Same performance. Scores of 20, 937.5, and 30.6!

Notice: RMSE (30.6) > MAE (20). This is ALWAYS true when errors vary in size. The more outliers, the bigger the gap.
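
A quick way to feel that gap grow: keep the three small errors fixed and scale the one disaster up and down (a minimal sketch; the extra "what if" values are made up):

import numpy as np

# Keep the three small errors fixed, let the single "disaster" grow,
# and watch the gap between RMSE and MAE widen.
for big in [10, 30, 60, 120]:
    errors = np.array([10, 5, -5, big])
    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))
    print(f"worst error {big:>3} min -> MAE {mae:5.1f}, RMSE {rmse:5.1f}, gap {rmse - mae:5.1f}")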


The Mathematics

MAE: Mean Absolute Error

MAE = (1/n) × Σ|actual - predicted|
    = Average of absolute errors
import numpy as np

errors = [10, 5, -5, 60]
mae = np.mean(np.abs(errors))
print(f"MAE: {mae}")  # 20.0

Properties:

  • Linear penalty (10 min late = 10 penalty)
  • Robust to outliers
  • Same units as target variable (minutes, dollars, etc.)
  • Intuitive: "On average, we're off by X"

MSE: Mean Squared Error

MSE = (1/n) × Σ(actual - predicted)²
    = Average of squared errors
errors = [10, 5, -5, 60]
mse = np.mean(np.array(errors) ** 2)
print(f"MSE: {mse}")  # 937.5

Properties:

  • Quadratic penalty (10 min late = 100, 60 min late = 3,600!)
  • Heavily penalizes outliers
  • Units are squared (minutes², dollars²) — not intuitive
  • Mathematically convenient (differentiable, used in optimization)

RMSE: Root Mean Squared Error

RMSE = √MSE = √[(1/n) × Σ(actual - predicted)²]
     = Square root of average squared errors
errors = [10, 5, -5, 60]
rmse = np.sqrt(np.mean(np.array(errors) ** 2))
print(f"RMSE: {rmse}")  # 30.62

Properties:

  • Same outlier sensitivity as MSE
  • Back to original units (minutes, dollars)
  • Interpretable: "Effective average error accounting for big mistakes"
  • RMSE ≥ MAE always (equal only when every error has the same magnitude; see the quick check below)
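
Why can RMSE never be smaller than MAE? Because RMSE² = MAE² + Var(|error|): the gap between the two is literally the spread of your absolute errors. A quick check with the driver's errors from the story:

import numpy as np

errors = np.array([10, 5, -5, 60])
abs_err = np.abs(errors)

mae = abs_err.mean()
rmse = np.sqrt(np.mean(abs_err ** 2))

# RMSE² = MAE² + Var(|error|), so RMSE ≥ MAE, with equality only when
# every |error| is identical (zero variance).
print(rmse ** 2)                  # 937.5
print(mae ** 2 + abs_err.var())   # 937.5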

Visual: How They Treat Errors Differently

ERROR SIZE:        1    5    10   20   30   60
─────────────────────────────────────────────────

MAE PENALTY:       1    5    10   20   30   60
                   │    │    │    │    │    │
                   ▼    ▼    ▼    ▼    ▼    ▼
                   ●────●────●────●────●────●  (linear growth)


MSE PENALTY:       1   25   100  400  900  3600
                   │    │    │    │    │    │
                   ▼    ▼    ▼    ▼    ▼    ▼
                   ●    ●    ●    ●    ●    ●  (quadratic growth!)
                   └────┴────┴────┴────┴────┘
                                          ↑
                                    EXPLOSION!
import numpy as np
import matplotlib.pyplot as plt

errors = np.linspace(0, 60, 100)
mae_penalty = np.abs(errors)
mse_penalty = errors ** 2

plt.figure(figsize=(10, 6))
plt.plot(errors, mae_penalty, 'b-', linewidth=2, label='MAE (linear)')
plt.plot(errors, mse_penalty, 'r-', linewidth=2, label='MSE (quadratic)')
plt.xlabel('Error Size', fontsize=12)
plt.ylabel('Penalty', fontsize=12)
plt.title('How MAE and MSE Penalize Errors', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)

# Annotate the difference
plt.annotate('60 min error:\nMAE = 60\nMSE = 3,600', 
             xy=(60, 3600), xytext=(40, 2500),
             fontsize=10, arrowprops=dict(arrowstyle='->'))

plt.tight_layout()
plt.savefig('mae_vs_mse.png', dpi=150)
plt.show()

The Outlier Test

Let's see how each metric responds to outliers:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Normal errors (no outliers)
y_true = [100, 100, 100, 100, 100]
y_pred = [102, 98, 103, 97, 100]  # Small errors: 2, 2, 3, 3, 0

mae_normal = mean_absolute_error(y_true, y_pred)
mse_normal = mean_squared_error(y_true, y_pred)
rmse_normal = np.sqrt(mse_normal)

print("WITHOUT outliers:")
print(f"  MAE:  {mae_normal:.2f}")
print(f"  MSE:  {mse_normal:.2f}")
print(f"  RMSE: {rmse_normal:.2f}")

# Now add ONE outlier
y_true_outlier = [100, 100, 100, 100, 100]
y_pred_outlier = [102, 98, 103, 97, 150]  # Last one is 50 off!

mae_outlier = mean_absolute_error(y_true_outlier, y_pred_outlier)
mse_outlier = mean_squared_error(y_true_outlier, y_pred_outlier)
rmse_outlier = np.sqrt(mse_outlier)

print("\nWITH one outlier (50 error):")
print(f"  MAE:  {mae_outlier:.2f}")
print(f"  MSE:  {mse_outlier:.2f}")
print(f"  RMSE: {rmse_outlier:.2f}")

# Calculate the increase
print(f"\nImpact of ONE outlier:")
print(f"  MAE increased:  {(mae_outlier/mae_normal - 1)*100:.0f}%")
print(f"  RMSE increased: {(rmse_outlier/rmse_normal - 1)*100:.0f}%")

Output:

WITHOUT outliers:
  MAE:  2.00
  MSE:  5.20
  RMSE: 2.28

WITH one outlier (50 error):
  MAE:  12.00
  MSE:  505.20
  RMSE: 22.48

Impact of ONE outlier:
  MAE increased:  500%
  RMSE increased: 886%

One outlier made:

  • MAE go from 2 → 12 (6x increase)
  • RMSE go from 2.28 → 22.48 (10x increase!)

RMSE is much more sensitive to outliers than MAE.


When to Use Each Metric

Use MAE When:

1. Outliers are noise, not signal

# House prices with some data entry errors
prices_true = [300000, 350000, 275000, 999999999, 400000]  # Typo!
prices_pred = [310000, 340000, 280000, 390000, 410000]

# MAE gets pulled up by the typo, but far less violently than MSE would be
mae = mean_absolute_error(prices_true, prices_pred)
print(f"MAE: ${mae:,.0f}")  # Still inflated by the typo, but MSE would be astronomically worse

2. All errors are equally bad

Scenario: Delivery time prediction
- 10 minutes late = unhappy customer
- 60 minutes late = 6x unhappy customer (not 36x!)

Use MAE — linear penalty makes sense.

3. You want robustness

# Median-like behavior — resistant to extreme values
# MAE optimization finds the MEDIAN, not the mean
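
To see that median claim in action, here's a minimal sketch (made-up numbers, brute-force grid search instead of a proper optimizer): for a single constant prediction, the value that minimizes MAE is the median of the targets, while the value that minimizes MSE is their mean.

import numpy as np

y = np.array([1, 2, 3, 4, 100])          # one extreme value
candidates = np.linspace(0, 100, 10001)  # constant predictions to try

# Average loss of predicting the constant c for every target
mae_per_c = np.abs(y[:, None] - candidates[None, :]).mean(axis=0)
mse_per_c = ((y[:, None] - candidates[None, :]) ** 2).mean(axis=0)

print("MAE-optimal constant:", candidates[mae_per_c.argmin()])  # ≈ 3  (the median)
print("MSE-optimal constant:", candidates[mse_per_c.argmin()])  # ≈ 22 (the mean)
print("median:", np.median(y), " mean:", y.mean())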

4. Interpretability matters most

"Our model is off by $15,000 on average."
Clear. Simple. Stakeholders understand it.

Use MSE When:

1. Large errors are catastrophically bad

Scenario: Autonomous vehicle distance prediction
- 1 meter off = fine
- 10 meters off = dangerous
- 50 meters off = FATAL

Errors shouldn't scale linearly. MSE's quadratic penalty is appropriate.

2. You're training a model (optimization)

# MSE is differentiable everywhere — gradient descent loves it!
# MAE has a non-differentiable point at 0

model.compile(loss='mse')  # Standard for regression
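
To make the smoothness point concrete, here's a tiny sketch (single target, made-up prediction values) of the per-sample gradients. MSE's gradient shrinks smoothly toward zero as the prediction approaches the target; MAE's jumps between -1 and +1.

import numpy as np

# Per-sample gradients with respect to the prediction y_hat:
#   d/dy_hat (y_hat - y)²  = 2 * (y_hat - y)   -> smooth through zero
#   d/dy_hat |y_hat - y|   = sign(y_hat - y)   -> jumps from -1 to +1 at zero
y = 0.0
for y_hat in [-2.0, -0.5, -0.001, 0.001, 0.5, 2.0]:
    grad_mse = 2 * (y_hat - y)
    grad_mae = np.sign(y_hat - y)
    print(f"y_hat={y_hat:+.3f}  MSE grad={grad_mse:+.3f}  MAE grad={grad_mae:+.1f}")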

3. Outliers ARE important signal

Scenario: Fraud detection (regression on transaction amounts)
- Large errors might indicate fraud!
- You WANT to be sensitive to outliers

4. You need mathematical convenience

# MSE decomposes nicely:
# MSE = Variance(predictions) + Bias² + Irreducible noise
# Useful for theoretical analysis
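
One small, checkable slice of that convenience (a sketch with made-up data; the full textbook decomposition averages over repeated datasets): for any set of residuals, MSE splits exactly into the squared mean error plus the variance of the errors.

import numpy as np

# For residuals e = y_pred - y_true:  mean(e²) = mean(e)² + Var(e)
rng = np.random.default_rng(0)
y_true = rng.normal(100, 10, size=1000)
y_pred = y_true + 3 + rng.normal(0, 5, size=1000)  # biased by +3, plus noise

e = y_pred - y_true
print(np.mean(e ** 2))                 # MSE
print(np.mean(e) ** 2 + np.var(e))     # identical, up to floating-point error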

Use RMSE When:

1. You want MSE's properties but interpretable units

mse = 10000  # What does 10,000 squared-dollars mean?
rmse = 100   # "We're off by about $100" — much clearer!

2. Comparing to standard deviation

# RMSE and standard deviation are in the same units
# You can compare: "RMSE is 0.8 standard deviations"

std_y = np.std(y_true)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"RMSE is {rmse/std_y:.2f} standard deviations")

3. Industry standard requires it

Many competitions (Kaggle, etc.) use RMSE as the metric.
Match the metric you'll be evaluated on!

The Decision Flowchart

START: Choosing a regression metric
            │
            ▼
    Are outliers in your data?
            │
     ┌──────┴──────┐
     │             │
    YES           NO
     │             │
     ▼             ▼
 Are outliers    Use any
 meaningful?     (they're
     │           similar)
  ┌──┴──┐          │
  │     │          │
 YES   NO          │
  │     │          │
  ▼     ▼          │
 MSE   MAE         │
/RMSE  │           │
  │    │           │
  └────┴───────────┘
            │
            ▼
    Need interpretable units?
            │
     ┌──────┴──────┐
     │             │
    YES           NO
     │             │
     ▼             ▼
 MAE or        MSE is fine
 RMSE          (for optimization)
     │
     ▼
    Are big errors much worse
    than small errors?
            │
     ┌──────┴──────┐
     │             │
    YES           NO
     │             │
     ▼             ▼
   RMSE          MAE
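
If you'd rather have the flowchart as code, here's a hypothetical helper that encodes one reading of it (the function name and arguments are made up for this post, not any library's API):

def pick_metric(has_outliers: bool,
                outliers_meaningful: bool = False,
                need_interpretable_units: bool = True,
                big_errors_much_worse: bool = False) -> str:
    """Hypothetical helper mirroring the flowchart above."""
    # Outliers that are pure noise push you toward MAE regardless of the rest
    if has_outliers and not outliers_meaningful:
        return "MAE"
    # No need for interpretable units: plain MSE is fine (e.g. as a training loss)
    if not need_interpretable_units:
        return "MSE"
    # Interpretable units wanted: RMSE if big errors should hurt extra, else MAE
    return "RMSE" if big_errors_much_worse else "MAE"

print(pick_metric(has_outliers=True, outliers_meaningful=True,
                  big_errors_much_worse=True))                            # RMSE
print(pick_metric(has_outliers=False, need_interpretable_units=False))    # MSE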

Complete Comparison Example

import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error

def evaluate_predictions(y_true, y_pred, name="Model"):
    """Complete regression evaluation."""
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)

    # Additional context
    y_range = np.max(y_true) - np.min(y_true)
    y_std = np.std(y_true)

    print(f"\n{'='*50}")
    print(f"Evaluation: {name}")
    print(f"{'='*50}")
    print(f"MAE:  {mae:,.2f}")
    print(f"MSE:  {mse:,.2f}")
    print(f"RMSE: {rmse:,.2f}")
    print(f"\nContext:")
    print(f"  Target range: {y_range:,.2f}")
    print(f"  Target std:   {y_std:,.2f}")
    print(f"  RMSE/std:     {rmse/y_std:.2%}")
    print(f"  MAE/range:    {mae/y_range:.2%}")

    # Relationship between metrics
    print(f"\nRelationships:")
    print(f"  RMSE/MAE ratio: {rmse/mae:.2f}")
    if rmse/mae > 1.5:
        print(f"  → High ratio suggests outliers or high variance in errors")
    else:
        print(f"  → Low ratio suggests consistent error sizes")

    return {'MAE': mae, 'MSE': mse, 'RMSE': rmse}

# Example: House price prediction
np.random.seed(42)
n = 100

# True prices
y_true = np.random.normal(400000, 100000, n)
y_true = np.clip(y_true, 100000, 800000)

# Model A: Consistent errors
y_pred_consistent = y_true + np.random.normal(0, 20000, n)

# Model B: Some big misses
y_pred_outliers = y_true + np.random.normal(0, 15000, n)
# Add some outliers
outlier_idx = np.random.choice(n, 5, replace=False)
y_pred_outliers[outlier_idx] += np.random.choice([-1, 1], 5) * 150000

# Evaluate both
results_a = evaluate_predictions(y_true, y_pred_consistent, "Model A (Consistent)")
results_b = evaluate_predictions(y_true, y_pred_outliers, "Model B (Has Outliers)")

# Compare
print("\n" + "="*50)
print("HEAD-TO-HEAD COMPARISON")
print("="*50)
print(f"\n{'Metric':<10} {'Model A':>15} {'Model B':>15} {'Winner':>15}")
print("-"*55)
for metric in ['MAE', 'MSE', 'RMSE']:
    a = results_a[metric]
    b = results_b[metric]
    winner = "A" if a < b else "B"
    print(f"{metric:<10} {a:>15,.0f} {b:>15,.0f} {'Model ' + winner:>15}")

Output:

==================================================
Evaluation: Model A (Consistent)
==================================================
MAE:  16,234.12
MSE:  412,345,678.90
RMSE: 20,306.29

Context:
  Target range: 645,234.12
  Target std:   98,765.43
  RMSE/std:     20.56%
  MAE/range:    2.52%

Relationships:
  RMSE/MAE ratio: 1.25
  → Low ratio suggests consistent error sizes

==================================================
Evaluation: Model B (Has Outliers)
==================================================
MAE:  15,876.54
MSE:  789,012,345.67
RMSE: 28,089.36

Context:
  Target range: 645,234.12
  Target std:   98,765.43
  RMSE/std:     28.44%
  MAE/range:    2.46%

Relationships:
  RMSE/MAE ratio: 1.77
  → High ratio suggests outliers or high variance in errors

==================================================
HEAD-TO-HEAD COMPARISON
==================================================

Metric          Model A         Model B          Winner
-------------------------------------------------------
MAE              16,234          15,877         Model B
MSE         412,345,679     789,012,346         Model A
RMSE             20,306          28,089         Model A

The Plot Twist:

  • Model B wins on MAE (lower average error)
  • Model A wins on MSE/RMSE (no catastrophic errors)

Which is better? Depends on whether those outliers matter!


The RMSE/MAE Ratio Trick

The ratio of RMSE to MAE tells you about error distribution:

def diagnose_errors(y_true, y_pred):
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    ratio = rmse / mae

    print(f"RMSE/MAE ratio: {ratio:.2f}")

    if ratio == 1.0:
        print("→ All errors are identical in size")
    elif ratio < 1.2:
        print("→ Errors are very consistent (good!)")
    elif ratio < 1.4:
        print("→ Errors have moderate variance")
    elif ratio < 1.7:
        print("→ Some larger errors present")
    else:
        print("→ Significant outliers or high error variance!")
        print("  Consider: outlier removal, MAE as metric, or robust models")

    # Theoretical limits
    print(f"\nTheoretical bounds:")
    print(f"  Minimum ratio: 1.0 (all errors equal)")
    print(f"  If errors ~ Normal: ratio ≈ 1.25")
    print(f"  Your ratio: {ratio:.2f}")
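
A quick sanity check on that "errors ~ Normal: ratio ≈ 1.25" line (the exact constant is √(π/2) ≈ 1.2533; the σ below is made up):

import numpy as np

# For zero-mean normal errors, E|e| = σ·√(2/π), so RMSE/MAE -> √(π/2) ≈ 1.2533
rng = np.random.default_rng(42)
e = rng.normal(0, 10, size=1_000_000)

rmse = np.sqrt(np.mean(e ** 2))
mae = np.mean(np.abs(e))
print(rmse / mae, np.sqrt(np.pi / 2))  # both ≈ 1.25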

Common Mistakes

Mistake 1: Comparing MAE to RMSE Directly

# ❌ WRONG
"MAE is 20, RMSE is 30. RMSE is 'worse'!"

# ✅ RIGHT
# They measure different things!
# RMSE ≥ MAE by definition
# Compare models using the SAME metric

Mistake 2: Using MSE for Reporting

# ❌ WRONG
"Our model has MSE of 10,000 squared-dollars"
# What does that even mean?!

# ✅ RIGHT
rmse = np.sqrt(10000)
"Our model has RMSE of $100"
# OR
"Our model has MAE of $80"

Mistake 3: Ignoring Scale

# ❌ WRONG
"Model A has MAE 50, Model B has MAE 100. A is 2x better!"

# ✅ RIGHT
# What if A predicts values around 1,000 and B around 1,000,000?
mae_a_pct = 50 / 1000  # 5%
mae_b_pct = 100 / 1000000  # 0.01%
# B is actually much better relatively!

Mistake 4: Choosing Metric AFTER Seeing Results

# ❌ WRONG
"My model has bad RMSE but good MAE. Let's report MAE!"

# ✅ RIGHT
# Choose metric based on the PROBLEM, not the results
# Before training: "Big errors are catastrophic, use RMSE"
# Stick with that decision

Quick Reference

Formulas

Metric   Formula               Units
MAE      (1/n) × Σ|y - ŷ|      Same as y
MSE      (1/n) × Σ(y - ŷ)²     Squared units of y (e.g. minutes²)
RMSE     √MSE                  Same as y

Properties

Property                 MAE         MSE         RMSE
Penalizes outliers       Linear      Quadratic   Quadratic
Interpretable units      ✓           ✗           ✓
Differentiable           ✗ (at 0)    ✓           ✓
Robust to outliers       ✓           ✗           ✗
Common in optimization   ✗           ✓           ✗

When to Use

Scenario                      Best Metric
Outliers are noise            MAE
Outliers are signal           MSE/RMSE
Stakeholder reports           MAE or RMSE
Training neural nets          MSE
All errors equally bad        MAE
Big errors are catastrophic   MSE/RMSE
Need interpretability         MAE or RMSE
Kaggle competition            Whatever they specify!

Key Takeaways

  1. MAE = average of absolute errors — Linear penalty, robust, interpretable

  2. MSE = average of squared errors — Quadratic penalty, sensitive to outliers, squared units

  3. RMSE = √MSE — Same sensitivity as MSE, interpretable units

  4. RMSE ≥ MAE always — Equal only when every error has the same magnitude

  5. High RMSE/MAE ratio = outliers present — Ratio > 1.5 suggests investigation needed

  6. Choose metric before training — Based on problem requirements, not results

  7. MSE for optimization, RMSE for reporting — Best of both worlds

  8. Scale matters — MAE of 50 on $1,000 values ≠ MAE of 50 on $1,000,000 values


The One-Sentence Summary

MAE is Boss Frank counting every minute of lateness equally, MSE is Boss Sarah squaring minutes so that one 60-minute disaster dominates everything, and RMSE is Boss Rachel using Sarah's philosophy but converting back to "minutes" so you actually understand your score — choose based on whether you want to forgive small errors or absolutely destroy large ones.


What's Next?

Now that you understand MAE, MSE, and RMSE, you're ready for:

  • MAPE and SMAPE — Percentage-based error metrics
  • Huber Loss — The best of MAE and MSE
  • Quantile Loss — When you care about under vs over prediction
  • Residual Analysis — Diagnosing WHY your errors happen

Follow me for the next article in this series!


Let's Connect!

If the three bosses finally made these metrics click, drop a heart!

Questions? Ask in the comments — I read and respond to every one.

Which metric do you use most? I'm an RMSE person for reporting but an MAE person for robust models. What about you?


The difference between a model that's "off by 20 minutes on average" and one that's "effectively off by 30 minutes when you account for that one disaster"? MAE vs RMSE. Same model, different stories. Choose the story that matches your problem.


Share this with someone who's confused about why their MAE and RMSE are so different. They probably have outliers — and now they'll know what to do about it.

Happy measuring! 📏
