The One-Line Summary: Accuracy asks "how often are you right overall?" Precision asks "when you say wolf, are you sure?" Recall asks "did you catch all the wolves?" F1 asks "can you balance confidence with coverage?" Different questions, different answers, different winners.
The Tale of Two Village Guards
Two villages sat at the edge of a dark forest, each protected by a guard who watched for wolves.
Village A: Guard Gary "The Careful"
Gary had a philosophy:
"I will ONLY sound the alarm when I am 100% CERTAIN it's a wolf. False alarms waste everyone's time and make them stop trusting me."
Year-end statistics:
Alarms raised: 5
Actual wolves: 5 out of 5 alarms (100% were real!)
Wolves in forest: 47 that year
Wolves detected: 5 out of 47 (10.6%)
Wolves missed: 42
Village outcome: Mostly destroyed. Wolves walked right in.
Gary's defense: "But every time I DID raise the alarm, I was RIGHT! I never cried wolf falsely!"
Village B: Guard Sally "The Safe"
Sally had a different philosophy:
"I will sound the alarm for ANYTHING that might POSSIBLY be a wolf. I'd rather have false alarms than miss a real wolf."
Year-end statistics:
Alarms raised: 847
Actual wolves: 45 out of 847 alarms (5.3% were real)
Wolves in forest: 47 that year
Wolves detected: 45 out of 47 (95.7%)
Wolves missed: 2
Village outcome: Exhausted. 802 false alarms. People stopped responding.
Sally's defense: "But I caught 95% of the wolves! Only 2 got through!"
The Village Council's Dilemma
Who is the better guard?
| Metric | Gary | Sally |
|---|---|---|
| When alarm raised, correct? | 100% | 5.3% |
| Wolves caught? | 10.6% | 95.7% |
Gary is trustworthy but useless. When he says "wolf," you can believe him — but he misses almost everything.
Sally catches everything but isn't trusted. She found 95% of wolves — but cried wolf 802 times for nothing.
This is the precision-recall tradeoff.
And this is why you need MULTIPLE metrics to evaluate a classifier.
The Four Judges
Let's formalize this with the four metrics every data scientist must know.
First: The Confusion Matrix
Every prediction falls into one of four boxes:
                        ACTUAL
                   Wolf        Not Wolf
              ┌─────────┬───────────┐
         Wolf │   TP    │    FP     │
              │ (Hit!)  │  (False   │
PREDICTED     │         │   Alarm)  │
              ├─────────┼───────────┤
     Not Wolf │   FN    │    TN     │
              │ (Missed │ (Correct  │
              │  Wolf!) │  Silence) │
              └─────────┴───────────┘
TP = True Positive (Said wolf, was wolf) ✓
FP = False Positive (Said wolf, wasn't wolf) ✗ "Cried wolf"
FN = False Negative (Said safe, was wolf) ✗ "Missed wolf"
TN = True Negative (Said safe, was safe) ✓
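Want to see those boxes fall out of real predictions? Here's a tiny hand-counting sketch using a handful of made-up labels (1 = wolf, 0 = no wolf), not the guards' actual logs:
# Counting the four boxes by hand on a tiny made-up example
actual    = [1, 0, 1, 1, 0, 0, 1, 0]   # what really happened
predicted = [1, 0, 0, 1, 1, 0, 0, 0]   # what the guard called
tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)  # hits
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # cried wolf
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # missed wolf
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)  # correct silence
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")  # TP=2, FP=1, FN=2, TN=3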
Judge 1: Accuracy
"How often is the guard correct overall?"
Accuracy = (TP + TN) / (TP + TN + FP + FN)
= Correct Predictions / Total Predictions
Gary's Accuracy:
Wolves in year: 47
Non-wolves (safe nights): 318 (assuming 365 nights, one observation per night)
Gary detected: 5 wolves, 0 false alarms
TP = 5, FP = 0, FN = 42, TN = 318
Accuracy = (5 + 318) / (5 + 318 + 0 + 42) = 323/365 = 88.5%
Sally's Accuracy:
Plug her story numbers into the same 365 nights: TP = 45, FP = 802, FN = 2 ... which forces TN = 365 - 849 = -484. Impossible! Her 847 alarms don't fit into a one-observation-per-night year, so let's rerun her on a cleaner scenario of 1,000 observations:
- 50 actual wolves
- 950 non-wolves
Sally: TP = 48, FP = 800, FN = 2, TN = 150
Accuracy = (48 + 150) / 1000 = 19.8% ← Ouch!
The Accuracy Trap 🪤
What if wolves are RARE?
Scenario: 10,000 nights, only 10 wolves
A guard who NEVER raises an alarm:
TP = 0, FP = 0, FN = 10, TN = 9,990
Accuracy = (0 + 9,990) / 10,000 = 99.9%!
99.9% accuracy by doing NOTHING!
The guard is useless — missed every wolf — but accuracy says they're excellent.
Accuracy lies when classes are imbalanced.
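You can verify the trap in a couple of lines. This sketch just replays the 10,000-night scenario above with sklearn:
from sklearn.metrics import accuracy_score, recall_score
y_true = [1] * 10 + [0] * 9990        # 10 wolves, 9,990 quiet nights
y_lazy = [0] * 10000                  # a guard who never raises the alarm
print(f"Accuracy: {accuracy_score(y_true, y_lazy):.1%}")  # 99.9%
print(f"Recall:   {recall_score(y_true, y_lazy):.1%}")    # 0.0%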
Judge 2: Precision
"When the guard says 'wolf,' how often is there actually a wolf?"
Precision = TP / (TP + FP)
= True Wolves / All Alarms Raised
Precision answers: "Can I trust the alarm?"
Gary's Precision:
Precision = 5 / (5 + 0) = 100%
When Gary says wolf, it's ALWAYS a wolf.
High trust. But he barely speaks.
Sally's Precision:
Precision = 45 / (45 + 802) = 5.3%
When Sally says wolf, it's usually nothing.
Low trust. Village ignores her.
Judge 3: Recall (Sensitivity)
"Of all the actual wolves, how many did the guard catch?"
Recall = TP / (TP + FN)
= Caught Wolves / All Actual Wolves
Recall answers: "Did we miss any wolves?"
Gary's Recall:
Recall = 5 / (5 + 42) = 10.6%
Gary caught only 10% of wolves.
42 wolves walked right past him.
Sally's Recall:
Recall = 45 / (45 + 2) = 95.7%
Sally caught 95% of wolves.
Only 2 slipped through.
Judge 4: F1 Score
"Balance precision and recall into a single number"
F1 = 2 × (Precision × Recall) / (Precision + Recall)
= Harmonic mean of Precision and Recall
F1 answers: "Are you good at BOTH trusting AND catching?"
Gary's F1:
F1 = 2 × (1.0 × 0.106) / (1.0 + 0.106) = 0.192 = 19.2%
High precision but terrible recall tanks his F1.
Sally's F1:
F1 = 2 × (0.053 × 0.957) / (0.053 + 0.957) = 0.100 = 10.0%
High recall but terrible precision tanks her F1.
Neither guard is good! F1 exposes them both.
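Why a harmonic mean instead of a plain average? Because a plain average lets one great number hide one terrible number. A quick sketch with Gary's story numbers:
p, r = 1.0, 0.106                  # Gary: perfect precision, awful recall
arithmetic = (p + r) / 2           # 0.553 -> looks deceptively decent
f1 = 2 * p * r / (p + r)           # 0.192 -> dragged down by the weak recall
print(f"Arithmetic mean: {arithmetic:.1%}")  # 55.3%
print(f"F1 (harmonic):   {f1:.1%}")          # 19.2%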
The Complete Picture
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report
# Gary's predictions
y_true = [1]*47 + [0]*318 # 47 wolves, 318 safe nights
y_gary = [1]*5 + [0]*42 + [0]*318 # Detected 5 wolves, missed 42
# Sally's predictions (different scenario for valid numbers)
# 50 wolves, 950 safe nights
y_true_sally = [1]*50 + [0]*950
y_sally = [1]*48 + [0]*2 + [1]*800 + [0]*150 # 48 TP, 2 FN, 800 FP, 150 TN
print("=" * 50)
print("GARY 'THE CAREFUL'")
print("=" * 50)
print(f"Accuracy: {accuracy_score(y_true, y_gary):.1%}")
print(f"Precision: {precision_score(y_true, y_gary):.1%}")
print(f"Recall: {recall_score(y_true, y_gary):.1%}")
print(f"F1 Score: {f1_score(y_true, y_gary):.1%}")
print(f"\nConfusion Matrix:")
print(confusion_matrix(y_true, y_gary))
print("\n" + "=" * 50)
print("SALLY 'THE SAFE'")
print("=" * 50)
print(f"Accuracy: {accuracy_score(y_true_sally, y_sally):.1%}")
print(f"Precision: {precision_score(y_true_sally, y_sally):.1%}")
print(f"Recall: {recall_score(y_true_sally, y_sally):.1%}")
print(f"F1 Score: {f1_score(y_true_sally, y_sally):.1%}")
print(f"\nConfusion Matrix:")
print(confusion_matrix(y_true_sally, y_sally))
Output:
==================================================
GARY 'THE CAREFUL'
==================================================
Accuracy: 88.5%
Precision: 100.0%
Recall: 10.6%
F1 Score: 19.2%
Confusion Matrix:
[[318   0]
 [ 42   5]]
==================================================
SALLY 'THE SAFE'
==================================================
Accuracy: 19.8%
Precision: 5.7%
Recall: 96.0%
F1 Score: 10.7%
Confusion Matrix:
[[150 800]
 [  2  48]]
Visual: The Precision-Recall Tradeoff
                     THE TRADEOFF SLIDER

  GARY'S SIDE                              SALLY'S SIDE
  (Cautious)                               (Aggressive)
       │                                        │
       ▼                                        ▼
  Precision: HIGH ────────────────────────  Precision: LOW
  "Trust me"                                "False alarms everywhere"

  Recall: LOW ────────────────────────────  Recall: HIGH
  "Missed most"                             "Caught almost all"

       │◄──────────── Threshold slider ────────────►│
  Raise threshold                           Lower threshold
  (More cautious)                           (More aggressive)

  IDEAL: Find the sweet spot where BOTH are acceptable!
Which Metric When?
Use PRECISION When: False Alarms Are Costly
Examples:
- Spam filter: False positive = important email goes to spam (missed opportunity, angry user)
- Criminal conviction: False positive = innocent person jailed (devastating!)
- Product recommendation: False positive = recommending something irrelevant (user loses trust)
# Spam detection: Optimize for precision
# We'd rather miss some spam than lose important emails
# (Synthetic stand-in data so the snippet runs end to end; swap in your own spam dataset)
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
# Raise the decision threshold to increase precision
probabilities = model.predict_proba(X_test)[:, 1]
threshold = 0.8  # Higher threshold = more cautious = higher precision
y_pred = (probabilities >= threshold).astype(int)
print(f"Precision: {precision_score(y_test, y_pred):.1%}")
Use RECALL When: Misses Are Costly
Examples:
- Cancer screening: False negative = missed cancer (patient dies)
- Fraud detection: False negative = missed fraud (company loses money)
- Airport security: False negative = missed weapon (catastrophe)
# Cancer screening: Optimize for recall
# We'd rather have false alarms than miss actual cancer
# (Reuses `probabilities` and `y_test` from the precision snippet above, purely for illustration)
from sklearn.metrics import recall_score
# Lower the decision threshold to increase recall
threshold = 0.2  # Lower threshold = more aggressive = higher recall
y_pred = (probabilities >= threshold).astype(int)
print(f"Recall: {recall_score(y_test, y_pred):.1%}")
Use F1 When: You Need Balance
Examples:
- Information retrieval: Need to find relevant docs (recall) that are actually relevant (precision)
- Named entity recognition: Need to find all entities (recall) without tagging non-entities (precision)
- General classification: No strong preference, want overall performance
# Balanced scenario: Optimize for F1
# Sweep thresholds and keep the one that maximizes F1
# (Again reuses `probabilities` and `y_test` from above)
import numpy as np
from sklearn.metrics import f1_score
best_f1 = 0
best_threshold = 0.5
for threshold in np.arange(0.1, 0.9, 0.05):
    y_pred = (probabilities >= threshold).astype(int)
    f1 = f1_score(y_test, y_pred)
    if f1 > best_f1:
        best_f1 = f1
        best_threshold = threshold
print(f"Best threshold: {best_threshold:.2f}")
print(f"Best F1: {best_f1:.1%}")
Use ACCURACY When: Classes Are Balanced AND Errors Are Equal
Examples:
- Coin flip prediction: 50/50 split, both errors equally bad
- Image classification (equal classes): Cat vs Dog with same number of each
# Check if accuracy is appropriate for your label array `y`
class_distribution = np.bincount(y) / len(y)
print(f"Class distribution: {class_distribution}")
if min(class_distribution) > 0.4:  # roughly balanced
    print("Accuracy might be okay to use!")
else:
    print("Classes are imbalanced — DON'T trust accuracy alone!")
The Complete Decision Framework
START
  │
  ▼
Are classes balanced (each class ~40-60%)?
  │
  ├── YES ──► Are false positives and false negatives equally bad?
  │             │
  │             ├── YES ──► Use ACCURACY ✓
  │             │
  │             └── NO ───► Which is worse?
  │                           │
  │                           ├── False Positive worse ──► PRECISION
  │                           │
  │                           └── False Negative worse ──► RECALL
  │
  └── NO (Imbalanced) ──► What's your priority?
                            │
                            ├── "Don't miss any!" ──► RECALL
                            │     (Cancer, Fraud, Security)
                            │
                            ├── "Don't cry wolf!" ──► PRECISION
                            │     (Spam, Recommendations)
                            │
                            └── "Balance both" ──► F1 SCORE
                                  (Search, NLP, General)
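If you like having the flowchart in code form, here's a rough rule-of-thumb helper. The function and its cost inputs are made up for illustration; real metric choice still needs domain judgment.
def pick_metric(minority_share, fp_cost, fn_cost):
    """Rough sketch of the flowchart above (illustrative only)."""
    balanced = minority_share >= 0.4      # each class roughly 40-60%
    if balanced and fp_cost == fn_cost:
        return "accuracy"
    if fp_cost > fn_cost:
        return "precision"                # false alarms hurt more
    if fn_cost > fp_cost:
        return "recall"                   # misses hurt more
    return "f1"                           # no strong preference
print(pick_metric(minority_share=0.01, fp_cost=1, fn_cost=50))  # cancer-style -> recall
print(pick_metric(minority_share=0.45, fp_cost=1, fn_cost=1))   # balanced coin flip -> accuracy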
Beyond Binary: Macro, Micro, Weighted
When you have more than two classes:
from sklearn.metrics import classification_report
# Multi-class classification
y_true = [0, 1, 2, 0, 1, 2, 0, 1, 2, 2]
y_pred = [0, 1, 1, 0, 2, 2, 0, 1, 2, 2]
print(classification_report(y_true, y_pred, target_names=['Cat', 'Dog', 'Bird']))
Output:
              precision    recall  f1-score   support

         Cat       1.00      1.00      1.00         3
         Dog       0.67      0.67      0.67         3
        Bird       0.75      0.75      0.75         4

    accuracy                           0.80        10
   macro avg       0.81      0.81      0.81        10
weighted avg       0.80      0.80      0.80        10
Three averaging strategies:
| Strategy | How It Works | When to Use |
|---|---|---|
| Macro | Average each class equally | All classes equally important |
| Micro | Pool TP/FP/FN across all classes, then compute | Overall performance; frequent classes dominate |
| Weighted | Weight by class frequency | Imbalanced, but frequency matters |
from sklearn.metrics import precision_score, recall_score, f1_score
# Different averaging methods
print(f"Precision (macro): {precision_score(y_true, y_pred, average='macro'):.2f}")
print(f"Precision (micro): {precision_score(y_true, y_pred, average='micro'):.2f}")
print(f"Precision (weighted): {precision_score(y_true, y_pred, average='weighted'):.2f}")
Visualizing the Tradeoff: Precision-Recall Curve
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Create imbalanced dataset
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Train model
model = LogisticRegression()
model.fit(X_train, y_train)
# Get probabilities
y_proba = model.predict_proba(X_test)[:, 1]
# Calculate precision-recall curve
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
ap = average_precision_score(y_test, y_proba)
# Plot
plt.figure(figsize=(10, 6))
plt.plot(recall, precision, 'b-', linewidth=2, label=f'PR Curve (AP={ap:.2f})')
plt.fill_between(recall, precision, alpha=0.2)
# Mark specific thresholds
for thresh in [0.3, 0.5, 0.7]:
    idx = np.argmin(np.abs(thresholds - thresh))
    plt.scatter(recall[idx], precision[idx], s=100, zorder=5)
    plt.annotate(f' threshold={thresh}', (recall[idx], precision[idx]))
plt.xlabel('Recall (Did we catch all wolves?)')
plt.ylabel('Precision (When we said wolf, were we right?)')
plt.title('Precision-Recall Tradeoff')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('pr_curve.png', dpi=150)
plt.show()
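Once you have the curve arrays, you can also pick the best-F1 threshold directly instead of the manual sweep from earlier. This assumes the `precision`, `recall`, and `thresholds` variables from the plotting code above.
# precision and recall have one more entry than thresholds, so drop the last point
f1_scores = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_idx = np.argmax(f1_scores)
print(f"Best threshold: {thresholds[best_idx]:.2f} (F1 = {f1_scores[best_idx]:.1%})")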
The Metrics at a Glance
                     PREDICTED
                   Pos        Neg
              ┌──────────┬──────────┐
         Pos  │    TP    │    FN    │  ← Recall = TP/(TP+FN)
  ACTUAL      │          │          │    "Of actual positives,
              ├──────────┼──────────┤     how many caught?"
         Neg  │    FP    │    TN    │
              │          │          │
              └──────────┴──────────┘
                   ↑
                   │
           Precision = TP/(TP+FP)
           "Of predicted positives,
            how many were right?"
Accuracy = (TP + TN) / Total
"Overall, how often correct?"
F1 = 2 × (P × R) / (P + R)
"Harmonic mean of precision and recall"
Common Mistakes
Mistake 1: Trusting Accuracy with Imbalanced Data
# ❌ WRONG: "99% accuracy, ship it!"
y_true = [0]*990 + [1]*10 # 1% positive class
y_pred = [0]*1000 # Predict all negative
print(f"Accuracy: {accuracy_score(y_true, y_pred):.1%}") # 99%!
print(f"Recall: {recall_score(y_true, y_pred):.1%}") # 0%!
# ✅ RIGHT: Check recall/precision for minority class
print(classification_report(y_true, y_pred))
Mistake 2: Ignoring the Business Context
# ❌ WRONG: Optimizing F1 for everything
# Cancer detection with F1 = 0.85? Maybe not good enough!
# ✅ RIGHT: Think about costs
# False negative (missed cancer) → patient might die
# False positive (false alarm) → extra tests, anxiety
# For cancer: Optimize RECALL, accept lower precision
# For spam: Optimize PRECISION, accept lower recall
Mistake 3: Not Looking at the Confusion Matrix
# ❌ WRONG: Just looking at aggregate metrics
print(f"F1: {f1_score(y_true, y_pred):.1%}")
# ✅ RIGHT: Look at WHERE errors happen
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred))
# Maybe F1 is okay overall but one class is terrible!
Mistake 4: Confusing Precision and Recall
Remember the questions they answer:
PRECISION: "When I said WOLF, was I right?"
(Of my predictions, how many were correct?)
RECALL: "Did I catch all the WOLVES?"
(Of actual wolves, how many did I find?)
Mnemonic:
- Precision = "Prediction quality"
- Recall = "Recovery rate"
Quick Reference
| Metric | Formula | Question Answered | Optimize When |
|---|---|---|---|
| Accuracy | (TP+TN)/All | Overall correctness? | Balanced classes, equal error costs |
| Precision | TP/(TP+FP) | When I predict positive, am I right? | False positives are costly |
| Recall | TP/(TP+FN) | Did I find all positives? | False negatives are costly |
| F1 | 2×P×R/(P+R) | Balance of P and R? | Need both, no strong preference |
Real-World Cheat Sheet
| Use Case | Optimize For | Why |
|---|---|---|
| Cancer screening | Recall | Missing cancer = death |
| Spam filter | Precision | Losing real email = disaster |
| Fraud detection | Recall (usually) | Missing fraud = financial loss |
| Search results | F1 or Precision@K | Want relevant results |
| Criminal justice | Precision | Jailing innocent = terrible |
| Airport security | Recall | Missing threat = catastrophe |
| Product recommendations | Precision | Bad recs = lost trust |
| Disease outbreak detection | Recall | Missing outbreak = epidemic |
Key Takeaways
Accuracy lies with imbalanced data — 99% accuracy can mean 0% usefulness
Precision = "Trust the alarm" — When you predict positive, are you right?
Recall = "Catch them all" — Of all actual positives, how many did you find?
F1 = "Balance" — Harmonic mean punishes extremes
Context determines the metric — Cancer screening ≠ spam filtering
Always check the confusion matrix — Aggregates hide important details
Precision and recall tradeoff — Improving one often hurts the other
Adjust your threshold — You can tune the precision-recall balance
The One-Sentence Summary
Gary the cautious guard has perfect precision but catches only 10% of wolves; Sally the aggressive guard catches 95% of wolves but has 5% precision; neither is good, and that's why you need precision, recall, AND F1 to judge a classifier — accuracy alone would have told you Gary is 88% correct while the village burns.
What's Next?
Now that you understand these metrics, you're ready for:
- ROC Curves and AUC — Visualizing classifier performance
- Confusion Matrix Deep Dive — Multi-class analysis
- Threshold Tuning — Finding the optimal operating point
- Business Metrics — Converting to dollars and impact
Follow me for the next article in this series!
Let's Connect!
If you finally understand the difference between precision and recall, drop a heart!
Questions? Ask in the comments — I read and respond to every one.
Which metric do you use most? Share your domain-specific experiences!
The difference between a cancer screening test that's "95% accurate" and one that actually saves lives? Understanding that accuracy might mean "correctly identified 95% of healthy people" while missing 80% of actual cancers. Ask precision. Ask recall. Ask F1. Don't just ask accuracy.
Share this with someone who thinks 99% accuracy means a good model. It might mean their wolf detector has never detected a wolf.
Happy evaluating! 🐺