The One-Line Summary: Accuracy measures how often you're right overall, but when 96% of your data is one class, a model that predicts that class every time gets 96% accuracy while being completely useless. Accuracy rewards laziness when classes are imbalanced.
The Desert Weather Forecaster
Maria was the most accurate weather forecaster in Phoenix, Arizona.
Every morning for 10 years, she gave her forecast:
"No rain today."
Every. Single. Day.
Her accuracy rate: 96.2%
Rain is rare in the desert. Out of 365 days, Maria was wrong only about 14 times a year — the days it actually rained.
The news station loved her. "Maria: Phoenix's most accurate forecaster!"
Then came August 19th.
A rare monsoon storm rolled in. Flash flood warnings were issued across the state. But Maria's forecast that morning?
"No rain today."
She was consistent. She was also catastrophically wrong.
127 people were caught in flash floods. Billions in damage. The city was devastated.
At the inquiry, Maria's defense was simple:
"I was right 96% of the time! I'm the most accurate forecaster you've ever had!"
The investigator leaned forward:
"Maria, you've never predicted rain. Not once in 10 years. You've missed every single storm. Your 'accuracy' comes entirely from predicting the thing that happens 96% of the time anyway. A broken clock could do that."
This is when accuracy becomes a dangerous lie.
Maria's model (always predict "no rain") had stellar accuracy. But she provided ZERO value. Anyone could predict "no rain" in a desert and be right most of the time.
The 4% of days that mattered? She missed every single one.
The Mathematics of the Lie
Let's formalize Maria's failure:
Phoenix weather over 10 years:
- Total days: 3,650
- Rainy days: 140 (3.8%)
- Non-rainy days: 3,510 (96.2%)
Maria's predictions:
- Predicted "No rain": 3,650 times (every day!)
- Predicted "Rain": 0 times
Results:
- Correct "No rain" predictions: 3,510 (True Negatives)
- Incorrect "No rain" predictions: 140 (False Negatives - missed storms!)
- Correct "Rain" predictions: 0 (True Positives)
- Incorrect "Rain" predictions: 0 (False Positives)
Accuracy = (3,510 + 0) / 3,650 = 96.2% ✨
But wait...
Recall for rain = 0 / 140 = 0% 💀
She caught ZERO storms!
96.2% accuracy. 0% usefulness.
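You can check that arithmetic in a few lines. Here's a minimal sketch that rebuilds Maria's ten years of forecasts as arrays and lets scikit-learn do the counting (the 3,510/140 split is the one from the story above):
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix
# Ten years of Phoenix weather from the story: 1 = rain, 0 = no rain
y_rain = np.array([0] * 3510 + [1] * 140)
# Maria's "model": predict "no rain" every single day
y_maria = np.zeros_like(y_rain)
tn, fp, fn, tp = confusion_matrix(y_rain, y_maria).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")               # TN=3510, FP=0, FN=140, TP=0
print(f"Accuracy: {accuracy_score(y_rain, y_maria):.1%}")  # 96.2%
print(f"Recall:   {recall_score(y_rain, y_maria):.1%}")    # 0.0%, every storm missed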
Scenario 1: Class Imbalance (The #1 Killer)
This is Maria's problem. When one class dominates, accuracy rewards predicting that class.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.dummy import DummyClassifier
# Simulated: 1000 patients, only 20 have a rare disease (2%)
y_true = np.array([0]*980 + [1]*20)
# Model A: "Lazy" - just predicts majority class
model_lazy = DummyClassifier(strategy='most_frequent')
model_lazy.fit(np.zeros((1000, 1)), y_true)
y_lazy = model_lazy.predict(np.zeros((1000, 1)))
print("=" * 50)
print("MODEL A: 'The Lazy Predictor'")
print("(Predicts 'healthy' for everyone)")
print("=" * 50)
print(f"Accuracy: {accuracy_score(y_true, y_lazy):.1%}")
print(f"Precision: {precision_score(y_true, y_lazy, zero_division=0):.1%}")
print(f"Recall: {recall_score(y_true, y_lazy):.1%}")
print(f"F1 Score: {f1_score(y_true, y_lazy):.1%}")
print(f"\nDiseased patients caught: {sum((y_lazy == 1) & (y_true == 1))}/{sum(y_true)}")
Output:
==================================================
MODEL A: 'The Lazy Predictor'
(Predicts 'healthy' for everyone)
==================================================
Accuracy: 98.0%
Precision: 0.0%
Recall: 0.0%
F1 Score: 0.0%
Diseased patients caught: 0/20
98% accuracy! Caught zero patients with the disease!
Now let's see a model that actually TRIES:
# Model B: Actually tries to find disease
# Not perfect, but makes an effort
np.random.seed(42)
y_effort = np.zeros(1000, dtype=int)
# Catches 15 of 20 diseased (75% recall)
diseased_indices = np.where(y_true == 1)[0]
y_effort[diseased_indices[:15]] = 1
# Also has 30 false positives
healthy_indices = np.where(y_true == 0)[0]
y_effort[np.random.choice(healthy_indices, 30, replace=False)] = 1
print("=" * 50)
print("MODEL B: 'The Effort Maker'")
print("(Actually tries to detect disease)")
print("=" * 50)
print(f"Accuracy: {accuracy_score(y_true, y_effort):.1%}")
print(f"Precision: {precision_score(y_true, y_effort):.1%}")
print(f"Recall: {recall_score(y_true, y_effort):.1%}")
print(f"F1 Score: {f1_score(y_true, y_effort):.1%}")
print(f"\nDiseased patients caught: {sum((y_effort == 1) & (y_true == 1))}/{sum(y_true)}")
Output:
==================================================
MODEL B: 'The Effort Maker'
(Actually tries to detect disease)
==================================================
Accuracy: 96.5%
Precision: 33.3%
Recall: 75.0%
F1 Score: 46.2%
Diseased patients caught: 15/20
Model B has LOWER accuracy (96.5% vs 98%) but is infinitely more useful!
If accuracy is your only metric, you'd deploy the useless model.
Scenario 2: Different Error Costs
Accuracy treats all errors as equal. Real life doesn't.
TWO TYPES OF ERRORS:
Error Type 1: False Positive
Model says "DISEASE" but patient is healthy
Cost: Unnecessary tests, patient anxiety
Severity: Low-Medium
Error Type 2: False Negative
Model says "HEALTHY" but patient has disease
Cost: Missed diagnosis, patient might die
Severity: CRITICAL
Accuracy treats these the same. They're NOT the same.
# Scenario: Medical diagnosis
# False Positive cost: $500 (extra tests)
# False Negative cost: $500,000 (wrongful death lawsuit / ethical failure)
def calculate_total_cost(y_true, y_pred, fp_cost, fn_cost):
    fp = sum((y_pred == 1) & (y_true == 0))
    fn = sum((y_pred == 0) & (y_true == 1))
    return fp * fp_cost + fn * fn_cost
# Using our models from before
cost_lazy = calculate_total_cost(y_true, y_lazy, fp_cost=500, fn_cost=500000)
cost_effort = calculate_total_cost(y_true, y_effort, fp_cost=500, fn_cost=500000)
print("Real-world cost comparison:")
print(f"Model A (98% accuracy): ${cost_lazy:,.0f}")
print(f"Model B (95% accuracy): ${cost_effort:,.0f}")
print(f"\nThe 'more accurate' model costs ${cost_lazy - cost_effort:,.0f} MORE!")
Output:
Real-world cost comparison:
Model A (98% accuracy): $10,000,000
Model B (96.5% accuracy): $2,515,000
The 'more accurate' model costs $7,485,000 MORE!
Higher accuracy = 7.5 million dollars MORE in costs!
Scenario 3: Multi-Class With Unequal Importance
# Image classification: Cat vs Dog vs Rare Endangered Tiger
# Misclassifying tigers is a conservation disaster!
y_true = ['cat']*450 + ['dog']*450 + ['tiger']*100
y_pred = ['cat']*450 + ['dog']*450 + ['cat']*100 # Classifies all tigers as cats!
from sklearn.metrics import classification_report
print(classification_report(y_true, y_pred))
Output:
              precision    recall  f1-score   support

         cat       0.82      1.00      0.90       450
         dog       1.00      1.00      1.00       450
       tiger       0.00      0.00      0.00       100

    accuracy                           0.90      1000
   macro avg       0.61      0.67      0.63      1000
weighted avg       0.82      0.90      0.85      1000
90% accuracy! But we missed EVERY SINGLE TIGER!
The endangered species classifier is useless for its primary purpose.
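A quick way to surface this from the very same predictions is to put macro-averaged scores next to accuracy; macro averaging weighs every class equally, however rare it is. A small sketch reusing the y_true / y_pred lists from the tiger example above:
from sklearn.metrics import accuracy_score, recall_score, f1_score
print(f"Accuracy:     {accuracy_score(y_true, y_pred):.2f}")                 # 0.90
print(f"Macro recall: {recall_score(y_true, y_pred, average='macro'):.2f}")  # 0.67
print(f"Macro F1:     {f1_score(y_true, y_pred, average='macro'):.2f}")      # 0.63
# Per-class recall makes the disaster explicit
print(f"Tiger recall: {recall_score(y_true, y_pred, labels=['tiger'], average=None)[0]:.2f}")  # 0.00
Any of those numbers would have raised the alarm that 90% accuracy hides.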
Scenario 4: Threshold-Sensitive Decisions
Accuracy depends on a fixed threshold (usually 0.5). But real-world decisions need flexibility.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score
# Create imbalanced dataset
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
# Get probabilities
probas = model.predict_proba(X_test)[:, 1]
# Compare metrics at different thresholds
thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]
print("Threshold | Accuracy | Precision | Recall | Useful?")
print("-" * 55)
for thresh in thresholds:
    y_pred = (probas >= thresh).astype(int)
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred, zero_division=0)
    rec = recall_score(y_test, y_pred)
    # Is it useful? Must catch at least 50% of positives
    useful = "✓" if rec >= 0.5 else "✗"
    print(f" {thresh:.1f} | {acc:.1%} | {prec:.1%} | {rec:.1%} | {useful}")
Output:
Threshold | Accuracy | Precision | Recall | Useful?
-------------------------------------------------------
0.1 | 72.0% | 12.5% | 91.7% | ✓
0.3 | 88.4% | 28.6% | 66.7% | ✓
0.5 | 94.0% | 50.0% | 41.7% | ✗
0.7 | 95.6% | 66.7% | 33.3% | ✗
0.9 | 95.2% | 50.0% | 8.3% | ✗
The threshold with HIGHEST accuracy (95.6%) catches only 33% of cases!
The threshold with LOWEST accuracy (72%) catches 92% of cases!
If you optimize for accuracy, you get the useless model.
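Rather than eyeballing a table, you can scan every threshold and pick the best one that still meets a recall floor. A minimal sketch reusing probas and y_test from the code above (the 80% recall requirement is just an example constraint, not a universal rule):
import numpy as np
from sklearn.metrics import precision_recall_curve
# precision_recall_curve evaluates every threshold the model can produce
precision, recall, thresholds = precision_recall_curve(y_test, probas)
# Keep only operating points that catch at least 80% of positives (example requirement)
min_recall = 0.80
ok = recall[:-1] >= min_recall          # the last precision/recall pair has no threshold
if ok.any():
    # Among those, take the threshold with the best precision
    best = np.argmax(np.where(ok, precision[:-1], -1))
    print(f"Chosen threshold: {thresholds[best]:.2f} "
          f"(precision={precision[best]:.1%}, recall={recall[best]:.1%})")
else:
    print("No threshold reaches the required recall; the model needs work.")
The operating point is a product decision, not whatever happens to maximize accuracy.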
Scenario 5: Temporal/Concept Drift
Accuracy measured once can hide degradation over time.
# Fraud detection over 3 months
# Fraudsters adapt; old patterns stop working
months = ['January', 'February', 'March']
accuracy = [0.98, 0.95, 0.85] # Looks okay...
recall = [0.90, 0.65, 0.20] # Disaster brewing!
print("Month | Accuracy | Recall (Fraud Caught)")
print("-" * 45)
for m, a, r in zip(months, accuracy, recall):
    warning = " ⚠️ DANGER!" if r < 0.5 else ""
    print(f"{m:10s}| {a:.0%} | {r:.0%}{warning}")
Output:
Month | Accuracy | Recall (Fraud Caught)
---------------------------------------------
January | 98% | 90%
February | 95% | 65%
March | 85% | 20% ⚠️ DANGER!
By March, accuracy is still a respectable 85%, but you're missing 80% of fraud!
Accuracy slipped by 13 percentage points. Recall collapsed by 70.
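In production, that means logging predictions and recomputing recall for every period, not relying on a one-off accuracy number. A minimal monitoring sketch, assuming you keep a prediction log with a month column (the tiny DataFrame below is a made-up stand-in for that log):
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score
# Hypothetical prediction log: one row per transaction
log = pd.DataFrame({
    "month":  ["Jan"] * 6 + ["Feb"] * 6,
    "y_true": [1, 1, 0, 0, 0, 0,   1, 1, 1, 0, 0, 0],
    "y_pred": [1, 1, 0, 0, 0, 0,   1, 0, 0, 0, 0, 0],
})
for month, grp in log.groupby("month", sort=False):
    acc = accuracy_score(grp["y_true"], grp["y_pred"])
    rec = recall_score(grp["y_true"], grp["y_pred"])
    flag = "  ⚠️ recall dropping!" if rec < 0.5 else ""
    print(f"{month}: accuracy={acc:.0%}, recall={rec:.0%}{flag}")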
The Five Deadly Scenarios
WHEN ACCURACY LIES:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1. CLASS IMBALANCE
└─ Rare events (fraud, disease, defects)
└─ Accuracy rewards "predict majority"
└─ FIX: Use F1, Recall, Precision, AUC
2. UNEQUAL ERROR COSTS
└─ Missing cancer ≠ false alarm
└─ Accuracy treats all errors the same
└─ FIX: Use cost-sensitive metrics
3. MULTI-CLASS IMBALANCE
└─ Rare classes get ignored
└─ High overall accuracy, zero recall on minority
└─ FIX: Use per-class metrics, macro-average
4. THRESHOLD SENSITIVITY
└─ Default 0.5 threshold often wrong
└─ Accuracy at 0.5 ≠ best operating point
└─ FIX: Use AUC-ROC, precision-recall curve
5. TEMPORAL DRIFT
└─ Accuracy snapshot hides degradation
└─ Model slowly fails on the class that matters
└─ FIX: Monitor recall/precision over time
What To Use Instead
For Imbalanced Classification:
from sklearn.metrics import (
precision_score, recall_score, f1_score,
roc_auc_score, average_precision_score,
balanced_accuracy_score
)
y_true = [0]*950 + [1]*50
y_pred = [0]*940 + [1]*60 # Some predictions
print("Instead of accuracy, use:")
print(f" F1 Score: {f1_score(y_true, y_pred):.3f}")
print(f" Precision: {precision_score(y_true, y_pred):.3f}")
print(f" Recall: {recall_score(y_true, y_pred):.3f}")
print(f" Balanced Accuracy: {balanced_accuracy_score(y_true, y_pred):.3f}")
print(f" ROC AUC: {roc_auc_score(y_true, y_pred):.3f}")
For Different Error Costs:
import numpy as np

def weighted_accuracy(y_true, y_pred, fp_weight=1, fn_weight=1):
    """Accuracy that weighs errors differently."""
    # Convert to arrays so the element-wise comparisons also work on plain lists
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = sum((y_pred == 1) & (y_true == 1))
    tn = sum((y_pred == 0) & (y_true == 0))
    fp = sum((y_pred == 1) & (y_true == 0))
    fn = sum((y_pred == 0) & (y_true == 1))
    # Weighted errors
    weighted_correct = tp + tn
    weighted_errors = fp * fp_weight + fn * fn_weight
    return weighted_correct / (weighted_correct + weighted_errors)
# False negatives are 10x worse than false positives
print(f"Standard accuracy: {accuracy_score(y_true, y_pred):.3f}")
print(f"Weighted (FN=10x): {weighted_accuracy(y_true, y_pred, fp_weight=1, fn_weight=10):.3f}")
For Multi-Class:
from sklearn.metrics import classification_report
# Always look at per-class metrics!
print(classification_report(y_true_multiclass, y_pred_multiclass))
# Use macro-average to treat all classes equally
f1_macro = f1_score(y_true_multiclass, y_pred_multiclass, average='macro')
print(f"Macro F1: {f1_macro:.3f}") # Won't be fooled by majority class
For Threshold Sensitivity:
from sklearn.metrics import roc_auc_score, average_precision_score
# AUC measures performance ACROSS ALL THRESHOLDS
y_probas = model.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, y_probas)
pr_auc = average_precision_score(y_test, y_probas)
print(f"ROC AUC: {roc_auc:.3f}") # Threshold-independent
print(f"PR AUC: {pr_auc:.3f}") # Better for imbalanced data
The Decision Flowchart
Should I use ACCURACY?
          │
          ▼
Are classes roughly balanced (40-60% each)?
          │
     ┌────┴────┐
     │         │
    YES        NO
     │         │
     ▼         ▼
Are error     DON'T USE ACCURACY!
costs equal?  Use: F1, AUC, Recall
     │
  ┌──┴──┐
  │     │
 YES    NO
  │     │
  ▼     ▼
ACCURACY    Use cost-weighted
is OK       metrics or recall/
            precision based on
            which error matters
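If you'd rather have that flowchart as code, here's a rough helper that encodes the same questions. The 40-60% balance check and the equal-cost check are the chart's rules of thumb, not hard laws:
import numpy as np

def recommend_metric(y, fp_cost=1.0, fn_cost=1.0):
    """Rough implementation of the flowchart above."""
    y = np.asarray(y)
    minority_share = np.bincount(y).min() / len(y)
    if minority_share < 0.4:
        return "Classes are imbalanced: use F1, AUC, or recall instead of accuracy."
    if fp_cost != fn_cost:
        return "Costs differ: use cost-weighted metrics, or recall/precision for the costly error."
    return "Balanced classes and equal costs: plain accuracy is OK."

print(recommend_metric([0] * 980 + [1] * 20))                 # imbalanced, skip accuracy
print(recommend_metric([0] * 500 + [1] * 500, fn_cost=10))    # balanced classes, unequal costs
print(recommend_metric([0] * 500 + [1] * 500))                # accuracy is fine here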
The Visual Proof
SCENARIO: 1000 patients, 20 with rare disease (2%)
MODEL A: "Everyone is healthy" (The Lazy Model)
─────────────────────────────────────────────────────
                 Predicted
              Healthy    Sick
            ┌─────────┬─────────┐
Actual      │   980   │    0    │  ← All healthy: correct!
Healthy     │   TN    │   FP    │
            ├─────────┼─────────┤
Actual      │   20    │    0    │  ← All sick: MISSED!
Sick        │   FN    │   TP    │
            └─────────┴─────────┘

Accuracy = (980 + 0) / 1000 = 98% 🎉
Recall   = 0 / 20 = 0% 💀
MODEL B: "Actually tries" (The Useful Model)
─────────────────────────────────────────────────────
                 Predicted
              Healthy    Sick
            ┌─────────┬─────────┐
Actual      │   950   │   30    │  ← 30 false alarms
Healthy     │   TN    │   FP    │
            ├─────────┼─────────┤
Actual      │    5    │   15    │  ← Caught 15/20 sick!
Sick        │   FN    │   TP    │
            └─────────┴─────────┘

Accuracy = (950 + 15) / 1000 = 96.5%
Recall   = 15 / 20 = 75% ✓
MODEL B has LOWER accuracy but is INFINITELY more useful!
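Those two boxes are just confusion matrices, and scikit-learn will print them straight from the Scenario 1 predictions (y_lazy and y_effort; y_true has been reused for other examples since, so the labels are rebuilt here):
import numpy as np
from sklearn.metrics import confusion_matrix
# Same setup as Scenario 1: 1000 patients, 20 with the disease
y_true_s1 = np.array([0] * 980 + [1] * 20)
# Rows = actual (healthy, sick), columns = predicted (healthy, sick)
print("Model A (lazy):")
print(confusion_matrix(y_true_s1, y_lazy))     # [[980   0]
                                               #  [ 20   0]]
print("Model B (tries):")
print(confusion_matrix(y_true_s1, y_effort))   # [[950  30]
                                               #  [  5  15]]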
Real-World Cautionary Tales
Tale 1: The Million-Dollar Fraud Model
A bank deployed a fraud detection model with 99.8% accuracy. The fraud team celebrated.
Six months later: $47 million in fraud losses.
The model predicted "not fraud" for everything. With only 0.2% fraud rate, it was 99.8% accurate by doing nothing.
Fix: They switched to monitoring recall. New model had 94% accuracy but 78% recall — catching $36 million more fraud.
Tale 2: The Cancer Screening Catastrophe
A hospital's AI screening tool boasted 97% accuracy in detecting a rare cancer.
Audit revealed: It correctly identified healthy patients (96% of cases) but missed 60% of actual cancers.
Fix: They required minimum 95% recall, accepting that precision would drop. More false alarms, but far fewer missed cancers.
Tale 3: The Spam Filter Failure
A spam filter had 99% accuracy. Users complained they missed important emails.
Investigation: 1% of emails were spam. Filter marked everything as "not spam" — 99% accurate!
Fix: Retrained with F1 as the target metric. Accuracy dropped to 96%, but actual spam detection went from 0% to 89%.
Common Mistakes
Mistake 1: Reporting Only Accuracy
# ❌ WRONG
print(f"Our model achieved {accuracy:.1%} accuracy!") # Meaningless without context
# ✅ RIGHT
print(f"Performance on minority class:")
print(f" - Accuracy: {accuracy:.1%}")
print(f" - Recall: {recall:.1%}")
print(f" - Precision: {precision:.1%}")
print(f" - F1: {f1:.1%}")
print(f" - Class distribution: {minority_pct:.1%} minority")
Mistake 2: Using Accuracy for Model Selection
# ❌ WRONG
best_model = max(models, key=lambda m: m.accuracy)
# ✅ RIGHT (for imbalanced data)
best_model = max(models, key=lambda m: m.f1_score)
# Or for high-stakes:
best_model = max(models, key=lambda m: m.recall)
Mistake 3: Not Checking Class Balance First
# ✅ ALWAYS check this first!
import numpy as np
class_counts = np.bincount(y)
class_ratios = class_counts / len(y)
print("Class distribution:")
for i, ratio in enumerate(class_ratios):
    print(f" Class {i}: {ratio:.1%}")

if min(class_ratios) < 0.2:
    print("\n⚠️ WARNING: Imbalanced classes!")
    print(" Do NOT rely on accuracy alone!")
Quick Reference: When to Abandon Accuracy
| Scenario | Class Balance | Use Instead |
|---|---|---|
| Fraud detection | 0.1% fraud | Recall, PR-AUC |
| Disease screening | 1-5% positive | Recall, Sensitivity |
| Spam filtering | 1-10% spam | F1, Precision |
| Defect detection | <1% defects | Recall, F1 |
| Churn prediction | 5-15% churn | F1, AUC-ROC |
| Click prediction | 1-3% clicks | PR-AUC, Log Loss |
Key Takeaways
Accuracy lies with imbalanced data — 98% accuracy can mean 0% usefulness
A model that predicts the majority class always "wins" on accuracy — but provides zero value
Different errors have different costs — accuracy treats them as equal when they're not
Always check class distribution first — If one class is <20%, don't trust accuracy
Use F1, Recall, Precision, or AUC instead — These expose lazy models
Threshold matters — The "most accurate" threshold often misses the minority class
Monitor over time — Accuracy can stay stable while recall collapses
Report multiple metrics — One number is never enough
The One-Sentence Summary
Maria the desert forecaster was 96% accurate because she predicted "no rain" every day — accuracy rewarded her laziness while she missed every storm that mattered, and that's exactly what your model does when you optimize for accuracy on imbalanced data.
What's Next?
Now that you know when accuracy fails, you're ready for:
- ROC Curves and AUC — Threshold-independent evaluation
- Precision-Recall Curves — The right tool for imbalanced data
- Cost-Sensitive Learning — When errors have different prices
- Calibration — When probabilities matter
Follow me for the next article in this series!
Let's Connect!
If this saved you from deploying a useless "high-accuracy" model, drop a heart!
Questions? Ask in the comments — I read and respond to every one.
Have you been burned by the accuracy trap? Share your stories — we've all been there!
The difference between a model with 99% accuracy that saves lives and one that lets people die? Understanding that in a world where only 1% of patients have the disease, "predict healthy" achieves 99% accuracy while missing every single sick person. Accuracy isn't wrong. It's just answering a question that doesn't matter.
Share this with someone celebrating their 99% accuracy score. They might be Maria, predicting "no rain" while the flood waters rise.
Happy evaluating! 🌧️