Sachin Kr. Rajput

Accuracy, Precision, Recall, F1: The Four Judges Who Disagree on What Makes a Good Wolf Detector

The One-Line Summary: Accuracy asks "how often are you right overall?" Precision asks "when you say wolf, are you sure?" Recall asks "did you catch all the wolves?" F1 asks "can you balance confidence with coverage?" Different questions, different answers, different winners.


The Tale of Two Village Guards

Two villages sat at the edge of a dark forest, each protected by a guard who watched for wolves.


Village A: Guard Gary "The Careful"

Gary had a philosophy:

"I will ONLY sound the alarm when I am 100% CERTAIN it's a wolf. False alarms waste everyone's time and make them stop trusting me."

Year-end statistics:

Alarms raised:     5
Actual wolves:     5 out of 5 alarms (100% were real!)
Wolves in forest:  47 that year
Wolves detected:   5 out of 47 (10.6%)
Wolves missed:     42

Village outcome: Mostly destroyed. Wolves walked right in.

Gary's defense: "But every time I DID raise the alarm, I was RIGHT! I never cried wolf falsely!"


Village B: Guard Sally "The Safe"

Sally had a different philosophy:

"I will sound the alarm for ANYTHING that might POSSIBLY be a wolf. I'd rather have false alarms than miss a real wolf."

Year-end statistics:

Alarms raised:     847
Actual wolves:     45 out of 847 alarms (5.3% were real)
Wolves in forest:  47 that year
Wolves detected:   45 out of 47 (95.7%)
Wolves missed:     2

Village outcome: Exhausted. 802 false alarms. People stopped responding.

Sally's defense: "But I caught 95% of the wolves! Only 2 got through!"


The Village Council's Dilemma

Who is the better guard?

Metric                         Gary      Sally
When alarm raised, correct?    100%      5.3%
Wolves caught?                 10.6%     95.7%

Gary is trustworthy but useless. When he says "wolf," you can believe him — but he misses almost everything.

Sally catches everything but isn't trusted. She found 95% of wolves — but cried wolf 802 times for nothing.


This is the precision-recall tradeoff.

And this is why you need MULTIPLE metrics to evaluate a classifier.


The Four Judges

Let's formalize this with the four metrics every data scientist must know.

First: The Confusion Matrix

Every prediction falls into one of four boxes:

                         ACTUAL
                    Wolf      Not Wolf
                 ┌─────────┬───────────┐
          Wolf   │   TP    │    FP     │
PREDICTED        │ (Hit!)  │ (False    │
                 │         │  Alarm)   │
                 ├─────────┼───────────┤
        Not Wolf │   FN    │    TN     │
                 │ (Missed │ (Correct  │
                 │  Wolf!) │  Silence) │
                 └─────────┴───────────┘

TP = True Positive  (Said wolf, was wolf)     ✓
FP = False Positive (Said wolf, wasn't wolf)  ✗ "Cried wolf"
FN = False Negative (Said safe, was wolf)     ✗ "Missed wolf"
TN = True Negative  (Said safe, was safe)     ✓
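
If you want to see where these four counts come from, here's a minimal sketch in plain Python (no sklearn); the tiny y_true and y_pred lists are made up purely for illustration:

# Tally TP, FP, FN, TN by hand (1 = wolf, 0 = no wolf)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]  # what actually happened
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]  # what the guard called

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # hits
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false alarms
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # missed wolves
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # correct silence

print(tp, fp, fn, tn)  # 2 1 1 4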

Judge 1: Accuracy

"How often is the guard correct overall?"

Accuracy = (TP + TN) / (TP + TN + FP + FN)
         = Correct Predictions / Total Predictions

Gary's Accuracy:

Wolves in year: 47
Non-wolves (safe nights): 318 (assuming 365 days)
Gary detected: 5 wolves, 0 false alarms

TP = 5, FP = 0, FN = 42, TN = 318

Accuracy = (5 + 318) / (5 + 318 + 0 + 42) = 323/365 = 88.5%

Sally's Accuracy:

TP = 45, FP = 802, FN = 2 ... but TN = 365 - 45 - 802 - 2 = -484 ← Impossible!

847 alarms don't fit into 365 nights, so let's rescale to 1000 observations:
- 50 actual wolves
- 950 non-wolves
Sally: TP = 48, FP = 800, FN = 2, TN = 150

Accuracy = (48 + 150) / 1000 = 19.8%  ← Ouch!

The Accuracy Trap 🪤

What if wolves are RARE?

Scenario: 10,000 nights, only 10 wolves

A guard who NEVER raises an alarm:
TP = 0, FP = 0, FN = 10, TN = 9,990

Accuracy = (0 + 9,990) / 10,000 = 99.9%!

99.9% accuracy by doing NOTHING!

The guard is useless — missed every wolf — but accuracy says they're excellent.

Accuracy lies when classes are imbalanced.


Judge 2: Precision

"When the guard says 'wolf,' how often is there actually a wolf?"

Precision = TP / (TP + FP)
          = True Wolves / All Alarms Raised

Precision answers: "Can I trust the alarm?"

Gary's Precision:

Precision = 5 / (5 + 0) = 100%

When Gary says wolf, it's ALWAYS a wolf.
High trust. But he barely speaks.

Sally's Precision:

Precision = 45 / (45 + 802) = 5.3%

When Sally says wolf, it's usually nothing.
Low trust. Village ignores her.

Judge 3: Recall (Sensitivity)

"Of all the actual wolves, how many did the guard catch?"

Recall = TP / (TP + FN)
       = Caught Wolves / All Actual Wolves

Recall answers: "Did we miss any wolves?"

Gary's Recall:

Recall = 5 / (5 + 42) = 10.6%

Gary caught only 10% of wolves.
42 wolves walked right past him.

Sally's Recall:

Recall = 45 / (45 + 2) = 95.7%

Sally caught 95% of wolves.
Only 2 slipped through.

Judge 4: F1 Score

"Balance precision and recall into a single number"

F1 = 2 × (Precision × Recall) / (Precision + Recall)
   = Harmonic mean of Precision and Recall

F1 answers: "Are you good at BOTH trusting AND catching?"

Gary's F1:

F1 = 2 × (1.0 × 0.106) / (1.0 + 0.106) = 0.192 = 19.2%

High precision but terrible recall tanks his F1.

Sally's F1:

F1 = 2 × (0.053 × 0.957) / (0.053 + 0.957) = 0.100 = 10.0%

High recall but terrible precision tanks her F1.

Neither guard is good! F1 exposes them both.
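
To see why the harmonic mean is so unforgiving, here's a small sketch that plugs the story's counts straight into the formulas (TN is left out because precision, recall, and F1 never use it) and compares F1 with a plain arithmetic average:

# Plug the guards' counts into the formulas directly
def report(name, tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    naive = (precision + recall) / 2                     # arithmetic mean, for contrast
    print(f"{name}: precision={precision:.1%}, recall={recall:.1%}, "
          f"F1={f1:.1%}, simple average={naive:.1%}")

report("Gary", tp=5, fp=0, fn=42)      # F1 ≈ 19%, simple average ≈ 55%
report("Sally", tp=45, fp=802, fn=2)   # F1 ≈ 10%, simple average ≈ 51%

A simple average would hand both guards a pass (around 50%); the harmonic mean drags the score toward the weaker of the two numbers, which is exactly the point.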


The Complete Picture

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report

# Gary's predictions
y_true = [1]*47 + [0]*318  # 47 wolves, 318 safe nights
y_gary = [1]*5 + [0]*42 + [0]*318  # Detected 5 wolves, missed 42

# Sally's predictions (different scenario for valid numbers)
# 50 wolves, 950 safe nights
y_true_sally = [1]*50 + [0]*950
y_sally = [1]*48 + [0]*2 + [1]*800 + [0]*150  # 48 TP, 2 FN, 800 FP, 150 TN

print("=" * 50)
print("GARY 'THE CAREFUL'")
print("=" * 50)
print(f"Accuracy:  {accuracy_score(y_true, y_gary):.1%}")
print(f"Precision: {precision_score(y_true, y_gary):.1%}")
print(f"Recall:    {recall_score(y_true, y_gary):.1%}")
print(f"F1 Score:  {f1_score(y_true, y_gary):.1%}")
print(f"\nConfusion Matrix:")
print(confusion_matrix(y_true, y_gary))

print("\n" + "=" * 50)
print("SALLY 'THE SAFE'")
print("=" * 50)
print(f"Accuracy:  {accuracy_score(y_true_sally, y_sally):.1%}")
print(f"Precision: {precision_score(y_true_sally, y_sally):.1%}")
print(f"Recall:    {recall_score(y_true_sally, y_sally):.1%}")
print(f"F1 Score:  {f1_score(y_true_sally, y_sally):.1%}")
print(f"\nConfusion Matrix:")
print(confusion_matrix(y_true_sally, y_sally))

Output:

==================================================
GARY 'THE CAREFUL'
==================================================
Accuracy:  88.5%
Precision: 100.0%
Recall:    10.6%
F1 Score:  19.2%

Confusion Matrix:
[[318   0]
 [ 42   5]]

==================================================
SALLY 'THE SAFE'
==================================================
Accuracy:  19.8%
Precision: 5.7%
Recall:    96.0%
F1 Score:  10.7%

Confusion Matrix:
[[150 800]
 [  2  48]]

Visual: The Precision-Recall Tradeoff

                    THE TRADEOFF SLIDER

    GARY'S SIDE                           SALLY'S SIDE
    (Cautious)                            (Aggressive)
        │                                       │
        ▼                                       ▼

 Precision: HIGH ──────────────────────── Precision: LOW
 "Trust me"                               "False alarms everywhere"

 Recall: LOW ─────────────────────────── Recall: HIGH
 "Missed most"                           "Caught almost all"


        │◄─────── Threshold slider ──────►│

        Raise threshold    Lower threshold
        (More cautious)    (More aggressive)


        IDEAL: Find the sweet spot where BOTH are acceptable!
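
Here's a small sketch of the slider in code. The wolf probabilities are synthetic, chosen only to show the direction of the tradeoff, not to match the story's numbers:

import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)

# Synthetic nights: ~10% wolves, with noisy "wolf probability" scores
y_true = rng.binomial(1, 0.1, size=1000)
scores = np.clip(0.6 * y_true + rng.normal(0.2, 0.2, size=1000), 0, 1)

for threshold in [0.2, 0.4, 0.6, 0.8]:
    y_pred = (scores >= threshold).astype(int)
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold:.1f}  precision={p:.1%}  recall={r:.1%}")

As the threshold climbs, precision goes up and recall comes down; where you stop sliding depends entirely on which mistake hurts more.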

Which Metric When?

Use PRECISION When: False Alarms Are Costly

Examples:

  • Spam filter: False positive = important email goes to spam (missed opportunity, angry user)
  • Criminal conviction: False positive = innocent person jailed (devastating!)
  • Product recommendation: False positive = recommending something irrelevant (user loses trust)

# Spam detection: optimize for precision
# We'd rather miss some spam than lose important emails
# (assumes X_train, X_test, y_train, y_test already exist)

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score

model = LogisticRegression()
model.fit(X_train, y_train)

# Raise the threshold to increase precision
probabilities = model.predict_proba(X_test)[:, 1]
threshold = 0.8  # Higher threshold = more cautious = higher precision

y_pred = (probabilities >= threshold).astype(int)
print(f"Precision: {precision_score(y_test, y_pred):.1%}")

Use RECALL When: Misses Are Costly

Examples:

  • Cancer screening: False negative = missed cancer (patient dies)
  • Fraud detection: False negative = missed fraud (company loses money)
  • Airport security: False negative = missed weapon (catastrophe)

# Cancer screening: optimize for recall
# We'd rather have false alarms than miss actual cancer
# (reuses the model and probabilities from the spam example above)

from sklearn.metrics import recall_score

# Lower the threshold to increase recall
threshold = 0.2  # Lower threshold = more aggressive = higher recall

y_pred = (probabilities >= threshold).astype(int)
print(f"Recall: {recall_score(y_test, y_pred):.1%}")

Use F1 When: You Need Balance

Examples:

  • Information retrieval: Need to find relevant docs (recall) that are actually relevant (precision)
  • Named entity recognition: Need to find all entities (recall) without tagging non-entities (precision)
  • General classification: No strong preference, want overall performance

# Balanced scenario: optimize for F1
# Find the threshold that maximizes F1 (reuses probabilities and y_test from above)

import numpy as np
from sklearn.metrics import f1_score

best_f1 = 0
best_threshold = 0.5

for threshold in np.arange(0.1, 0.9, 0.05):
    y_pred = (probabilities >= threshold).astype(int)
    f1 = f1_score(y_test, y_pred)
    if f1 > best_f1:
        best_f1 = f1
        best_threshold = threshold

print(f"Best threshold: {best_threshold:.2f}")
print(f"Best F1: {best_f1:.1%}")

Use ACCURACY When: Classes Are Balanced AND Errors Are Equal

Examples:

  • Coin flip prediction: 50/50 split, both errors equally bad
  • Image classification (equal classes): Cat vs Dog with same number of each

# Check whether accuracy is even appropriate
# (assumes y is your full array of labels)
import numpy as np

class_distribution = np.bincount(y) / len(y)
print(f"Class distribution: {class_distribution}")

if min(class_distribution) > 0.4:  # roughly balanced
    print("Accuracy might be okay to use!")
else:
    print("Classes are imbalanced — DON'T trust accuracy alone!")

The Complete Decision Framework

START
  │
  ▼
Are classes balanced (each class ~40-60%)?
  │
  ├── YES ──► Are false positives and false negatives equally bad?
  │              │
  │              ├── YES ──► Use ACCURACY ✓
  │              │
  │              └── NO ───► Which is worse?
  │                             │
  │                             ├── False Positive worse ──► PRECISION
  │                             │
  │                             └── False Negative worse ──► RECALL
  │
  └── NO (Imbalanced) ──► What's your priority?
                            │
                            ├── "Don't miss any!" ──► RECALL
                            │   (Cancer, Fraud, Security)
                            │
                            ├── "Don't cry wolf!" ──► PRECISION
                            │   (Spam, Recommendations)
                            │
                            └── "Balance both" ──► F1 SCORE
                                (Search, NLP, General)
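
If it helps, here's the same decision tree as a toy Python function. This is purely illustrative; the function name and arguments are my own, not any library's API:

def pick_metric(balanced, fp_worse=False, fn_worse=False):
    """Toy encoding of the decision tree above (illustrative only)."""
    if balanced:
        if not fp_worse and not fn_worse:
            return "accuracy"
        return "precision" if fp_worse else "recall"
    # Imbalanced classes: accuracy is off the table
    if fn_worse:
        return "recall"      # "Don't miss any!" (cancer, fraud, security)
    if fp_worse:
        return "precision"   # "Don't cry wolf!" (spam, recommendations)
    return "f1"              # "Balance both" (search, NLP, general)

print(pick_metric(balanced=False, fn_worse=True))  # recall
print(pick_metric(balanced=True))                  # accuracy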

Beyond Binary: Macro, Micro, Weighted

When you have more than two classes:

from sklearn.metrics import classification_report

# Multi-class classification
y_true = [0, 1, 2, 0, 1, 2, 0, 1, 2, 2]
y_pred = [0, 1, 1, 0, 2, 2, 0, 1, 2, 2]

print(classification_report(y_true, y_pred, target_names=['Cat', 'Dog', 'Bird']))

Output:

              precision    recall  f1-score   support

         Cat       1.00      1.00      1.00         3
         Dog       0.67      0.67      0.67         3
        Bird       0.75      0.75      0.75         4

    accuracy                           0.80        10
   macro avg       0.81      0.81      0.81        10
weighted avg       0.80      0.80      0.80        10

Three averaging strategies:

Strategy    How It Works                   When to Use
Macro       Average each class equally     All classes equally important
Micro       Aggregate TP/FP/FN globally    Overall performance
Weighted    Weight by class frequency      Imbalanced, but frequency matters

from sklearn.metrics import precision_score, recall_score, f1_score

# Different averaging methods
print(f"Precision (macro):    {precision_score(y_true, y_pred, average='macro'):.2f}")
print(f"Precision (micro):    {precision_score(y_true, y_pred, average='micro'):.2f}")
print(f"Precision (weighted): {precision_score(y_true, y_pred, average='weighted'):.2f}")
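
To make the difference concrete, here's a short sketch that recomputes macro and micro precision by hand for the same toy labels, just by counting per-class TP and predictions:

# Recompute macro vs micro precision by hand for the toy labels above
y_true = [0, 1, 2, 0, 1, 2, 0, 1, 2, 2]
y_pred = [0, 1, 1, 0, 2, 2, 0, 1, 2, 2]

per_class = []
total_tp, total_predicted = 0, 0
for cls in [0, 1, 2]:
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == cls and t == cls)
    predicted = sum(1 for p in y_pred if p == cls)
    per_class.append(tp / predicted)
    total_tp += tp
    total_predicted += predicted

print(f"Macro precision: {sum(per_class) / len(per_class):.2f}")  # 0.81: every class counts equally
print(f"Micro precision: {total_tp / total_predicted:.2f}")       # 0.80: every prediction counts equally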

Visualizing the Tradeoff: Precision-Recall Curve

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Create imbalanced dataset
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Train model
model = LogisticRegression()
model.fit(X_train, y_train)

# Get probabilities
y_proba = model.predict_proba(X_test)[:, 1]

# Calculate precision-recall curve
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
ap = average_precision_score(y_test, y_proba)

# Plot
plt.figure(figsize=(10, 6))
plt.plot(recall, precision, 'b-', linewidth=2, label=f'PR Curve (AP={ap:.2f})')
plt.fill_between(recall, precision, alpha=0.2)

# Mark specific thresholds
for thresh in [0.3, 0.5, 0.7]:
    idx = np.argmin(np.abs(thresholds - thresh))
    plt.scatter(recall[idx], precision[idx], s=100, zorder=5)
    plt.annotate(f'  threshold={thresh}', (recall[idx], precision[idx]))

plt.xlabel('Recall (Did we catch all wolves?)')
plt.ylabel('Precision (When we said wolf, were we right?)')
plt.title('Precision-Recall Tradeoff')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('pr_curve.png', dpi=150)
plt.show()

The Metrics at a Glance

                           PREDICTED
                        Pos         Neg
                    ┌──────────┬──────────┐
              Pos   │    TP    │    FN    │ ← Recall = TP/(TP+FN)
   ACTUAL           │          │          │   "Of actual positives,
                    ├──────────┼──────────┤    how many caught?"
              Neg   │    FP    │    TN    │
                    │          │          │
                    └──────────┴──────────┘
                          ↑
                          │
             Precision = TP/(TP+FP)
             "Of predicted positives,
              how many were right?"


    Accuracy = (TP + TN) / Total
    "Overall, how often correct?"

    F1 = 2 × (P × R) / (P + R)
    "Harmonic mean of precision and recall"

Common Mistakes

Mistake 1: Trusting Accuracy with Imbalanced Data

# ❌ WRONG: "99% accuracy, ship it!"
y_true = [0]*990 + [1]*10  # 1% positive class
y_pred = [0]*1000          # Predict all negative

print(f"Accuracy: {accuracy_score(y_true, y_pred):.1%}")  # 99%!
print(f"Recall: {recall_score(y_true, y_pred):.1%}")      # 0%!

# ✅ RIGHT: Check recall/precision for minority class
print(classification_report(y_true, y_pred))
Enter fullscreen mode Exit fullscreen mode

Mistake 2: Ignoring the Business Context

# ❌ WRONG: Optimizing F1 for everything
# Cancer detection with F1 = 0.85? Maybe not good enough!

# ✅ RIGHT: Think about costs
# False negative (missed cancer) → patient might die
# False positive (false alarm) → extra tests, anxiety

# For cancer: Optimize RECALL, accept lower precision
# For spam: Optimize PRECISION, accept lower recall
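
One way to make "think about costs" concrete is to price each cell of the confusion matrix. The dollar figures below are completely made up for illustration:

# Illustrative only: invented per-error costs turn a confusion matrix into money
from sklearn.metrics import confusion_matrix

y_true = [1]*50 + [0]*950
y_pred = [1]*48 + [0]*2 + [1]*800 + [0]*150   # Sally's numbers from earlier

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

cost_per_fp = 10      # hypothetical: a false alarm wastes a little time
cost_per_fn = 5000    # hypothetical: a miss is very expensive

total_cost = fp * cost_per_fp + fn * cost_per_fn
print(f"FP={fp}, FN={fn}, total cost = {total_cost}")  # 800*10 + 2*5000 = 18000

Optimizing a metric is really a proxy for optimizing a number like this; when you can write the costs down, write them down.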

Mistake 3: Not Looking at the Confusion Matrix

# ❌ WRONG: Just looking at aggregate metrics
print(f"F1: {f1_score(y_true, y_pred):.1%}")

# ✅ RIGHT: Look at WHERE errors happen
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred))

# Maybe F1 is okay overall but one class is terrible!

Mistake 4: Confusing Precision and Recall

Remember the questions they answer:

PRECISION: "When I said WOLF, was I right?"
           (Of my predictions, how many were correct?)

RECALL:    "Did I catch all the WOLVES?"
           (Of actual wolves, how many did I find?)

Mnemonic:
- Precision = "Prediction quality"
- Recall = "Recovery rate"

Quick Reference

Metric      Formula        Question Answered                       Optimize When
Accuracy    (TP+TN)/All    Overall correctness?                    Balanced classes, equal error costs
Precision   TP/(TP+FP)     When I predict positive, am I right?    False positives are costly
Recall      TP/(TP+FN)     Did I find all positives?               False negatives are costly
F1          2×P×R/(P+R)    Balance of P and R?                     Need both, no strong preference

Real-World Cheat Sheet

Use Case                      Optimize For         Why
Cancer screening              Recall               Missing cancer = death
Spam filter                   Precision            Losing real email = disaster
Fraud detection               Recall (usually)     Missing fraud = financial loss
Search results                F1 or Precision@K    Want relevant results
Criminal justice              Precision            Jailing innocent = terrible
Airport security              Recall               Missing threat = catastrophe
Product recommendations       Precision            Bad recs = lost trust
Disease outbreak detection    Recall               Missing outbreak = epidemic

Key Takeaways

  1. Accuracy lies with imbalanced data — 99% accuracy can mean 0% usefulness

  2. Precision = "Trust the alarm" — When you predict positive, are you right?

  3. Recall = "Catch them all" — Of all actual positives, how many did you find?

  4. F1 = "Balance" — Harmonic mean punishes extremes

  5. Context determines the metric — Cancer screening ≠ spam filtering

  6. Always check the confusion matrix — Aggregates hide important details

  7. Precision and recall tradeoff — Improving one often hurts the other

  8. Adjust your threshold — You can tune the precision-recall balance


The One-Sentence Summary

Gary the cautious guard has perfect precision but catches only 10% of wolves; Sally the aggressive guard catches 95% of wolves but has 5% precision; neither is good, and that's why you need precision, recall, AND F1 to judge a classifier — accuracy alone would have told you Gary is 88% correct while the village burns.


What's Next?

Now that you understand these metrics, you're ready for:

  • ROC Curves and AUC — Visualizing classifier performance
  • Confusion Matrix Deep Dive — Multi-class analysis
  • Threshold Tuning — Finding the optimal operating point
  • Business Metrics — Converting to dollars and impact

Follow me for the next article in this series!


Let's Connect!

If you finally understand the difference between precision and recall, drop a heart!

Questions? Ask in the comments — I read and respond to every one.

Which metric do you use most? Share your domain-specific experiences!


The difference between a cancer screening test that's "95% accurate" and one that actually saves lives? Understanding that accuracy might mean "correctly identified 95% of healthy people" while missing 80% of actual cancers. Ask precision. Ask recall. Ask F1. Don't just ask accuracy.


Share this with someone who thinks 99% accuracy means a good model. It might mean their wolf detector has never detected a wolf.

Happy evaluating! 🐺
