Sachin Kr. Rajput

Accuracy, Precision, Recall, F1: The Four Judges Who Disagree on What Makes a Good Wolf Detector

The One-Line Summary: Accuracy asks "how often are you right overall?" Precision asks "when you say wolf, are you sure?" Recall asks "did you catch all the wolves?" F1 asks "can you balance confidence with coverage?" Different questions, different answers, different winners.


The Tale of Two Village Guards

Two villages sat at the edge of a dark forest, each protected by a guard who watched for wolves.


Village A: Guard Gary "The Careful"

Gary had a philosophy:

"I will ONLY sound the alarm when I am 100% CERTAIN it's a wolf. False alarms waste everyone's time and make them stop trusting me."

Year-end statistics:

Alarms raised:     5
Actual wolves:     5 out of 5 alarms (100% were real!)
Wolves in forest:  47 that year
Wolves detected:   5 out of 47 (10.6%)
Wolves missed:     42

Village outcome: Mostly destroyed. Wolves walked right in.

Gary's defense: "But every time I DID raise the alarm, I was RIGHT! I never cried wolf falsely!"


Village B: Guard Sally "The Safe"

Sally had a different philosophy:

"I will sound the alarm for ANYTHING that might POSSIBLY be a wolf. I'd rather have false alarms than miss a real wolf."

Year-end statistics:

Alarms raised:     847
Actual wolves:     45 out of 847 alarms (5.3% were real)
Wolves in forest:  47 that year
Wolves detected:   45 out of 47 (95.7%)
Wolves missed:     2

Village outcome: Exhausted. 802 false alarms. People stopped responding.

Sally's defense: "But I caught 95% of the wolves! Only 2 got through!"


The Village Council's Dilemma

Who is the better guard?

Metric                         Gary      Sally
When alarm raised, correct?    100%      5.3%
Wolves caught?                 10.6%     95.7%

Gary is trustworthy but useless. When he says "wolf," you can believe him — but he misses almost everything.

Sally catches everything but isn't trusted. She found 95% of wolves — but cried wolf 802 times for nothing.


This is the precision-recall tradeoff.

And this is why you need MULTIPLE metrics to evaluate a classifier.


The Four Judges

Let's formalize this with the four metrics every data scientist must know.

First: The Confusion Matrix

Every prediction falls into one of four boxes:

                         ACTUAL
                    Wolf      Not Wolf
                 ┌─────────┬───────────┐
          Wolf   │   TP    │    FP     │
PREDICTED        │ (Hit!)  │ (False    │
                 │         │  Alarm)   │
                 ├─────────┼───────────┤
        Not Wolf │   FN    │    TN     │
                 │ (Missed │ (Correct  │
                 │  Wolf!) │  Silence) │
                 └─────────┴───────────┘

TP = True Positive  (Said wolf, was wolf)     ✓
FP = False Positive (Said wolf, wasn't wolf)  ✗ "Cried wolf"
FN = False Negative (Said safe, was wolf)     ✗ "Missed wolf"
TN = True Negative  (Said safe, was safe)     ✓
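
If you want to see where these four counts come from, here's a minimal sketch in plain Python (no sklearn); the tiny y_true and y_pred lists are made up purely for illustration:

# Tally TP, FP, FN, TN by hand (1 = wolf, 0 = no wolf)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]  # what actually happened
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]  # what the guard called

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # hits
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false alarms
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # missed wolves
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # correct silence

print(tp, fp, fn, tn)  # 2 1 1 4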

Judge 1: Accuracy

"How often is the guard correct overall?"

Accuracy = (TP + TN) / (TP + TN + FP + FN)
         = Correct Predictions / Total Predictions

Gary's Accuracy:

Wolves in year: 47
Non-wolves (safe nights): 318 (assuming 365 days)
Gary detected: 5 wolves, 0 false alarms

TP = 5, FP = 0, FN = 42, TN = 318

Accuracy = (5 + 318) / (5 + 318 + 0 + 42) = 323/365 = 88.5%

Sally's Accuracy:

TP = 45, FP = 802, FN = 2 ... but TN = 365 - 45 - 802 - 2 = -484 ← Impossible!

847 alarms don't fit into 365 nights, so let's rescale to 1000 observations:
- 50 actual wolves
- 950 non-wolves
Sally: TP = 48, FP = 800, FN = 2, TN = 150

Accuracy = (48 + 150) / 1000 = 19.8%  ← Ouch!

The Accuracy Trap 🪤

What if wolves are RARE?

Scenario: 10,000 nights, only 10 wolves

A guard who NEVER raises an alarm:
TP = 0, FP = 0, FN = 10, TN = 9,990

Accuracy = (0 + 9,990) / 10,000 = 99.9%!

99.9% accuracy by doing NOTHING!

The guard is useless — missed every wolf — but accuracy says they're excellent.

Accuracy lies when classes are imbalanced.


Judge 2: Precision

"When the guard says 'wolf,' how often is there actually a wolf?"

Precision = TP / (TP + FP)
          = True Wolves / All Alarms Raised

Precision answers: "Can I trust the alarm?"

Gary's Precision:

Precision = 5 / (5 + 0) = 100%

When Gary says wolf, it's ALWAYS a wolf.
High trust. But he barely speaks.

Sally's Precision:

Precision = 45 / (45 + 802) = 5.3%

When Sally says wolf, it's usually nothing.
Low trust. Village ignores her.

Judge 3: Recall (Sensitivity)

"Of all the actual wolves, how many did the guard catch?"

Recall = TP / (TP + FN)
       = Caught Wolves / All Actual Wolves

Recall answers: "Did we miss any wolves?"

Gary's Recall:

Recall = 5 / (5 + 42) = 10.6%

Gary caught only 10% of wolves.
42 wolves walked right past him.

Sally's Recall:

Recall = 45 / (45 + 2) = 95.7%

Sally caught 95% of wolves.
Only 2 slipped through.

Judge 4: F1 Score

"Balance precision and recall into a single number"

F1 = 2 × (Precision × Recall) / (Precision + Recall)
   = Harmonic mean of Precision and Recall

F1 answers: "Are you good at BOTH trusting AND catching?"

Gary's F1:

F1 = 2 × (1.0 × 0.106) / (1.0 + 0.106) = 0.192 = 19.2%

High precision but terrible recall tanks his F1.

Sally's F1:

F1 = 2 × (0.053 × 0.957) / (0.053 + 0.957) = 0.100 = 10.0%

High recall but terrible precision tanks her F1.

Neither guard is good! F1 exposes them both.
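
To see why the harmonic mean is so unforgiving, here's a small sketch that plugs the story's counts straight into the formulas (TN is left out because precision, recall, and F1 never use it) and compares F1 with a plain arithmetic average:

# Plug the guards' counts into the formulas directly
def report(name, tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    naive = (precision + recall) / 2                     # arithmetic mean, for contrast
    print(f"{name}: precision={precision:.1%}, recall={recall:.1%}, "
          f"F1={f1:.1%}, simple average={naive:.1%}")

report("Gary", tp=5, fp=0, fn=42)      # F1 ≈ 19%, simple average ≈ 55%
report("Sally", tp=45, fp=802, fn=2)   # F1 ≈ 10%, simple average ≈ 51%

A simple average would hand both guards a pass (around 50%); the harmonic mean drags the score toward the weaker of the two numbers, which is exactly the point.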


The Complete Picture

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report

# Gary's predictions
y_true = [1]*47 + [0]*318  # 47 wolves, 318 safe nights
y_gary = [1]*5 + [0]*42 + [0]*318  # Detected 5 wolves, missed 42

# Sally's predictions (different scenario for valid numbers)
# 50 wolves, 950 safe nights
y_true_sally = [1]*50 + [0]*950
y_sally = [1]*48 + [0]*2 + [1]*800 + [0]*150  # 48 TP, 2 FN, 800 FP, 150 TN

print("=" * 50)
print("GARY 'THE CAREFUL'")
print("=" * 50)
print(f"Accuracy:  {accuracy_score(y_true, y_gary):.1%}")
print(f"Precision: {precision_score(y_true, y_gary):.1%}")
print(f"Recall:    {recall_score(y_true, y_gary):.1%}")
print(f"F1 Score:  {f1_score(y_true, y_gary):.1%}")
print(f"\nConfusion Matrix:")
print(confusion_matrix(y_true, y_gary))

print("\n" + "=" * 50)
print("SALLY 'THE SAFE'")
print("=" * 50)
print(f"Accuracy:  {accuracy_score(y_true_sally, y_sally):.1%}")
print(f"Precision: {precision_score(y_true_sally, y_sally):.1%}")
print(f"Recall:    {recall_score(y_true_sally, y_sally):.1%}")
print(f"F1 Score:  {f1_score(y_true_sally, y_sally):.1%}")
print(f"\nConfusion Matrix:")
print(confusion_matrix(y_true_sally, y_sally))

Output:

==================================================
GARY 'THE CAREFUL'
==================================================
Accuracy:  88.5%
Precision: 100.0%
Recall:    10.6%
F1 Score:  19.2%

Confusion Matrix:
[[318   0]
 [ 42   5]]

==================================================
SALLY 'THE SAFE'
==================================================
Accuracy:  19.8%
Precision: 5.7%
Recall:    96.0%
F1 Score:  10.7%

Confusion Matrix:
[[150 800]
 [  2  48]]

Visual: The Precision-Recall Tradeoff

                    THE TRADEOFF SLIDER

    GARY'S SIDE                           SALLY'S SIDE
    (Cautious)                            (Aggressive)
        │                                       │
        ▼                                       ▼

 Precision: HIGH ──────────────────────── Precision: LOW
 "Trust me"                               "False alarms everywhere"

 Recall: LOW ─────────────────────────── Recall: HIGH
 "Missed most"                           "Caught almost all"


        │◄─────── Threshold slider ──────►│

        Raise threshold    Lower threshold
        (More cautious)    (More aggressive)


        IDEAL: Find the sweet spot where BOTH are acceptable!
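
Here's a small sketch of the slider in code. The wolf probabilities are synthetic, chosen only to show the direction of the tradeoff, not to match the story's numbers:

import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)

# Synthetic nights: ~10% wolves, with noisy "wolf probability" scores
y_true = rng.binomial(1, 0.1, size=1000)
scores = np.clip(0.6 * y_true + rng.normal(0.2, 0.2, size=1000), 0, 1)

for threshold in [0.2, 0.4, 0.6, 0.8]:
    y_pred = (scores >= threshold).astype(int)
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold:.1f}  precision={p:.1%}  recall={r:.1%}")

As the threshold climbs, precision goes up and recall comes down; where you stop sliding depends entirely on which mistake hurts more.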

Which Metric When?

Use PRECISION When: False Alarms Are Costly

Examples:

  • Spam filter: False positive = important email goes to spam (missed opportunity, angry user)
  • Criminal conviction: False positive = innocent person jailed (devastating!)
  • Product recommendation: False positive = recommending something irrelevant (user loses trust)

# Spam detection: optimize for precision
# We'd rather miss some spam than lose important emails
# (assumes X_train, X_test, y_train, y_test already exist)

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score

model = LogisticRegression()
model.fit(X_train, y_train)

# Raise the threshold to increase precision
probabilities = model.predict_proba(X_test)[:, 1]
threshold = 0.8  # Higher threshold = more cautious = higher precision

y_pred = (probabilities >= threshold).astype(int)
print(f"Precision: {precision_score(y_test, y_pred):.1%}")

Use RECALL When: Misses Are Costly

Examples:

  • Cancer screening: False negative = missed cancer (patient dies)
  • Fraud detection: False negative = missed fraud (company loses money)
  • Airport security: False negative = missed weapon (catastrophe)

# Cancer screening: optimize for recall
# We'd rather have false alarms than miss actual cancer
# (reuses the model and probabilities from the spam example above)

from sklearn.metrics import recall_score

# Lower the threshold to increase recall
threshold = 0.2  # Lower threshold = more aggressive = higher recall

y_pred = (probabilities >= threshold).astype(int)
print(f"Recall: {recall_score(y_test, y_pred):.1%}")

Use F1 When: You Need Balance

Examples:

  • Information retrieval: Need to find relevant docs (recall) that are actually relevant (precision)
  • Named entity recognition: Need to find all entities (recall) without tagging non-entities (precision)
  • General classification: No strong preference, want overall performance

# Balanced scenario: optimize for F1
# Find the threshold that maximizes F1 (reuses probabilities and y_test from above)

import numpy as np
from sklearn.metrics import f1_score

best_f1 = 0
best_threshold = 0.5

for threshold in np.arange(0.1, 0.9, 0.05):
    y_pred = (probabilities >= threshold).astype(int)
    f1 = f1_score(y_test, y_pred)
    if f1 > best_f1:
        best_f1 = f1
        best_threshold = threshold

print(f"Best threshold: {best_threshold:.2f}")
print(f"Best F1: {best_f1:.1%}")

Use ACCURACY When: Classes Are Balanced AND Errors Are Equal

Examples:

  • Coin flip prediction: 50/50 split, both errors equally bad
  • Image classification (equal classes): Cat vs Dog with same number of each

# Check whether accuracy is even appropriate
# (assumes y is your full array of labels)
import numpy as np

class_distribution = np.bincount(y) / len(y)
print(f"Class distribution: {class_distribution}")

if min(class_distribution) > 0.4:  # roughly balanced
    print("Accuracy might be okay to use!")
else:
    print("Classes are imbalanced — DON'T trust accuracy alone!")

The Complete Decision Framework

START
  │
  ▼
Are classes balanced (each class ~40-60%)?
  │
  ├── YES ──► Are false positives and false negatives equally bad?
  │              │
  │              ├── YES ──► Use ACCURACY ✓
  │              │
  │              └── NO ───► Which is worse?
  │                             │
  │                             ├── False Positive worse ──► PRECISION
  │                             │
  │                             └── False Negative worse ──► RECALL
  │
  └── NO (Imbalanced) ──► What's your priority?
                            │
                            ├── "Don't miss any!" ──► RECALL
                            │   (Cancer, Fraud, Security)
                            │
                            ├── "Don't cry wolf!" ──► PRECISION
                            │   (Spam, Recommendations)
                            │
                            └── "Balance both" ──► F1 SCORE
                                (Search, NLP, General)
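
If it helps, here's the same decision tree as a toy Python function. This is purely illustrative; the function name and arguments are my own, not any library's API:

def pick_metric(balanced, fp_worse=False, fn_worse=False):
    """Toy encoding of the decision tree above (illustrative only)."""
    if balanced:
        if not fp_worse and not fn_worse:
            return "accuracy"
        return "precision" if fp_worse else "recall"
    # Imbalanced classes: accuracy is off the table
    if fn_worse:
        return "recall"      # "Don't miss any!" (cancer, fraud, security)
    if fp_worse:
        return "precision"   # "Don't cry wolf!" (spam, recommendations)
    return "f1"              # "Balance both" (search, NLP, general)

print(pick_metric(balanced=False, fn_worse=True))  # recall
print(pick_metric(balanced=True))                  # accuracy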

Beyond Binary: Macro, Micro, Weighted

When you have more than two classes:

from sklearn.metrics import classification_report

# Multi-class classification
y_true = [0, 1, 2, 0, 1, 2, 0, 1, 2, 2]
y_pred = [0, 1, 1, 0, 2, 2, 0, 1, 2, 2]

print(classification_report(y_true, y_pred, target_names=['Cat', 'Dog', 'Bird']))

Output:

              precision    recall  f1-score   support

         Cat       1.00      1.00      1.00         3
         Dog       0.67      0.67      0.67         3
        Bird       0.75      0.75      0.75         4

    accuracy                           0.80        10
   macro avg       0.81      0.81      0.81        10
weighted avg       0.80      0.80      0.80        10

Three averaging strategies:

Strategy    How It Works                   When to Use
Macro       Average each class equally     All classes equally important
Micro       Aggregate TP/FP/FN globally    Overall performance
Weighted    Weight by class frequency      Imbalanced, but frequency matters

from sklearn.metrics import precision_score, recall_score, f1_score

# Different averaging methods
print(f"Precision (macro):    {precision_score(y_true, y_pred, average='macro'):.2f}")
print(f"Precision (micro):    {precision_score(y_true, y_pred, average='micro'):.2f}")
print(f"Precision (weighted): {precision_score(y_true, y_pred, average='weighted'):.2f}")
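
To make the difference concrete, here's a short sketch that recomputes macro and micro precision by hand for the same toy labels, just by counting per-class TP and predictions:

# Recompute macro vs micro precision by hand for the toy labels above
y_true = [0, 1, 2, 0, 1, 2, 0, 1, 2, 2]
y_pred = [0, 1, 1, 0, 2, 2, 0, 1, 2, 2]

per_class = []
total_tp, total_predicted = 0, 0
for cls in [0, 1, 2]:
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == cls and t == cls)
    predicted = sum(1 for p in y_pred if p == cls)
    per_class.append(tp / predicted)
    total_tp += tp
    total_predicted += predicted

print(f"Macro precision: {sum(per_class) / len(per_class):.2f}")  # 0.81: every class counts equally
print(f"Micro precision: {total_tp / total_predicted:.2f}")       # 0.80: every prediction counts equally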

Visualizing the Tradeoff: Precision-Recall Curve

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Create imbalanced dataset
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Train model
model = LogisticRegression()
model.fit(X_train, y_train)

# Get probabilities
y_proba = model.predict_proba(X_test)[:, 1]

# Calculate precision-recall curve
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
ap = average_precision_score(y_test, y_proba)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(recall, precision, 'b-', linewidth=2, label=f'PR Curve (AP={ap:.2f})')
plt.fill_between(recall, precision, alpha=0.2)

# Mark specific thresholds
for thresh in [0.3, 0.5, 0.7]:
    idx = np.argmin(np.abs(thresholds - thresh))
    plt.scatter(recall[idx], precision[idx], s=100, zorder=5)
    plt.annotate(f'  threshold={thresh}', (recall[idx], precision[idx]))

plt.xlabel('Recall (Did we catch all wolves?)')
plt.ylabel('Precision (When we said wolf, were we right?)')
plt.title('Precision-Recall Tradeoff')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('pr_curve.png', dpi=150)
plt.show()

The Metrics at a Glance

                           PREDICTED
                        Pos         Neg
                    ┌──────────┬──────────┐
              Pos   │    TP    │    FN    │ ← Recall = TP/(TP+FN)
   ACTUAL           │          │          │   "Of actual positives,
                    ├──────────┼──────────┤    how many caught?"
              Neg   │    FP    │    TN    │
                    │          │          │
                    └──────────┴──────────┘
                          ↑
                          │
             Precision = TP/(TP+FP)
             "Of predicted positives,
              how many were right?"


    Accuracy = (TP + TN) / Total
    "Overall, how often correct?"

    F1 = 2 × (P × R) / (P + R)
    "Harmonic mean of precision and recall"

Common Mistakes

Mistake 1: Trusting Accuracy with Imbalanced Data

# ❌ WRONG: "99% accuracy, ship it!"
y_true = [0]*990 + [1]*10  # 1% positive class
y_pred = [0]*1000          # Predict all negative

print(f"Accuracy: {accuracy_score(y_true, y_pred):.1%}")  # 99%!
print(f"Recall: {recall_score(y_true, y_pred):.1%}")      # 0%!

# ✅ RIGHT: Check recall/precision for minority class
print(classification_report(y_true, y_pred))
Enter fullscreen mode Exit fullscreen mode

Mistake 2: Ignoring the Business Context

# ❌ WRONG: Optimizing F1 for everything
# Cancer detection with F1 = 0.85? Maybe not good enough!

# ✅ RIGHT: Think about costs
# False negative (missed cancer) → patient might die
# False positive (false alarm) → extra tests, anxiety

# For cancer: Optimize RECALL, accept lower precision
# For spam: Optimize PRECISION, accept lower recall
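
One way to make "think about costs" concrete is to price each cell of the confusion matrix. The dollar figures below are completely made up for illustration:

# Illustrative only: invented per-error costs turn a confusion matrix into money
from sklearn.metrics import confusion_matrix

y_true = [1]*50 + [0]*950
y_pred = [1]*48 + [0]*2 + [1]*800 + [0]*150   # Sally's numbers from earlier

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

cost_per_fp = 10      # hypothetical: a false alarm wastes a little time
cost_per_fn = 5000    # hypothetical: a miss is very expensive

total_cost = fp * cost_per_fp + fn * cost_per_fn
print(f"FP={fp}, FN={fn}, total cost = {total_cost}")  # 800*10 + 2*5000 = 18000

Optimizing a metric is really a proxy for optimizing a number like this; when you can write the costs down, write them down.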

Mistake 3: Not Looking at the Confusion Matrix

# ❌ WRONG: Just looking at aggregate metrics
print(f"F1: {f1_score(y_true, y_pred):.1%}")

# ✅ RIGHT: Look at WHERE errors happen
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred))

# Maybe F1 is okay overall but one class is terrible!

Mistake 4: Confusing Precision and Recall

Remember the questions they answer:

PRECISION: "When I said WOLF, was I right?"
           (Of my predictions, how many were correct?)

RECALL:    "Did I catch all the WOLVES?"
           (Of actual wolves, how many did I find?)

Mnemonic:
- Precision = "Prediction quality"
- Recall = "Recovery rate"

Quick Reference

Metric      Formula        Question Answered                       Optimize When
Accuracy    (TP+TN)/All    Overall correctness?                    Balanced classes, equal error costs
Precision   TP/(TP+FP)     When I predict positive, am I right?    False positives are costly
Recall      TP/(TP+FN)     Did I find all positives?               False negatives are costly
F1          2×P×R/(P+R)    Balance of P and R?                     Need both, no strong preference

Real-World Cheat Sheet

Use Case                      Optimize For         Why
Cancer screening              Recall               Missing cancer = death
Spam filter                   Precision            Losing real email = disaster
Fraud detection               Recall (usually)     Missing fraud = financial loss
Search results                F1 or Precision@K    Want relevant results
Criminal justice              Precision            Jailing innocent = terrible
Airport security              Recall               Missing threat = catastrophe
Product recommendations       Precision            Bad recs = lost trust
Disease outbreak detection    Recall               Missing outbreak = epidemic

Key Takeaways

  1. Accuracy lies with imbalanced data — 99% accuracy can mean 0% usefulness

  2. Precision = "Trust the alarm" — When you predict positive, are you right?

  3. Recall = "Catch them all" — Of all actual positives, how many did you find?

  4. F1 = "Balance" — Harmonic mean punishes extremes

  5. Context determines the metric — Cancer screening ≠ spam filtering

  6. Always check the confusion matrix — Aggregates hide important details

  7. Precision and recall tradeoff — Improving one often hurts the other

  8. Adjust your threshold — You can tune the precision-recall balance


The One-Sentence Summary

Gary the cautious guard has perfect precision but catches only 10% of wolves; Sally the aggressive guard catches 95% of wolves but has 5% precision; neither is good, and that's why you need precision, recall, AND F1 to judge a classifier — accuracy alone would have told you Gary is 88% correct while the village burns.


What's Next?

Now that you understand these metrics, you're ready for:

  • ROC Curves and AUC — Visualizing classifier performance
  • Confusion Matrix Deep Dive — Multi-class analysis
  • Threshold Tuning — Finding the optimal operating point
  • Business Metrics — Converting to dollars and impact

Follow me for the next article in this series!


Let's Connect!

If you finally understand the difference between precision and recall, drop a heart!

Questions? Ask in the comments — I read and respond to every one.

Which metric do you use most? Share your domain-specific experiences!


The difference between a cancer screening test that's "95% accurate" and one that actually saves lives? Understanding that accuracy might mean "correctly identified 95% of healthy people" while missing 80% of actual cancers. Ask precision. Ask recall. Ask F1. Don't just ask accuracy.


Share this with someone who thinks 99% accuracy means a good model. It might mean their wolf detector has never detected a wolf.

Happy evaluating! 🐺
