Last post, you saw that accuracy can be 95% while your model catches zero fraud.
Precision and recall are the fix. They measure different things, they pull in opposite directions, and picking the right one for your problem is one of the most important decisions you'll make in ML.
Most people know the definitions but don't know when to use which one. That's what this post is really about.
What You'll Learn Here
- Precision and recall in plain words with real examples
- Why improving one usually hurts the other
- The precision-recall curve and how to read it
- F1 score: what it is and when it's the right choice
- F-beta score: when one error costs more than the other
- Average precision for imbalanced problems
- How to pick the right metric for any problem
Precision: When You Say Yes, Are You Right?
Precision answers: of all the times my model predicted positive, what fraction were actually positive?
Precision = TP / (TP + FP)
High precision means when you raise the alarm, it's almost always real. Low precision means lots of false alarms.
Real example: a spam filter with 99% precision means that out of every 100 emails it blocks, 99 are real spam and only 1 is legitimate. Very few false alarms. Users trust it.
When precision matters most: when false positives are expensive or damaging.
- Spam filter blocking legit emails from your boss is bad
- Hiring tool falsely rejecting good candidates is bad
- Recommending a product someone hates destroys trust
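Before moving on to recall, here is a minimal sketch that plugs the spam-filter numbers from above straight into the definition (the 99-to-1 split is the assumption):

# Spam filter example: of 100 blocked emails, 99 were spam (TP) and 1 was legit (FP)
tp, fp = 99, 1
precision = tp / (tp + fp)
print(f"Precision: {precision:.2f}")  # 0.99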
Recall: Did You Find All the Real Positives?
Recall answers: of all the actual positives that existed, what fraction did my model find?
Recall = TP / (TP + FN)
High recall means you caught almost everything real. Low recall means you're missing a lot.
Real example: A cancer screening tool with 99% recall catches 99 out of 100 actual cancer cases. It might have some false alarms, but it misses almost nothing.
When recall matters most: when false negatives are expensive or dangerous.
- Missing cancer is catastrophic
- Missing fraud means real money lost
- Missing a structural defect in a bridge is deadly
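Same exercise for recall, a tiny sketch using the screening example's numbers (99 of 100 real cases caught is the assumption):

# Cancer screening example: 100 real cases, 99 caught (TP) and 1 missed (FN)
tp, fn = 99, 1
recall = tp / (tp + fn)
print(f"Recall: {recall:.2f}")  # 0.99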
They Pull Against Each Other
Here's the core tension. You can't just maximize both at once.
Think of a net catching fish. You want to catch all the right fish (high recall) and catch nothing else (high precision).
If you make the net bigger, you catch more right fish but also more wrong ones. Recall goes up, precision goes down.
If you make the net smaller and more selective, you catch only the ones you're sure about. Precision goes up, recall goes down.
The threshold on your model's probability output is that net size.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1] # prob of benign (class 1)
print(f"{'Threshold':<12} {'Precision':<12} {'Recall':<10} {'F1'}")
print("-" * 47)
for thresh in [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
    y_pred_t = (proba >= thresh).astype(int)
    prec = precision_score(y_test, y_pred_t, zero_division=0)
    rec = recall_score(y_test, y_pred_t)
    f1 = f1_score(y_test, y_pred_t, zero_division=0)
    print(f"{thresh:<12} {prec:<12.3f} {rec:<10.3f} {f1:.3f}")
Output:
Threshold Precision Recall F1
-----------------------------------------------
0.2 0.938 1.000 0.968
0.3 0.945 1.000 0.972
0.4 0.959 1.000 0.979
0.5 0.973 0.986 0.979
0.6 0.986 0.972 0.979
0.7 1.000 0.944 0.971
0.8 1.000 0.931 0.964
0.9 1.000 0.903 0.949
As the threshold rises, precision goes up and recall goes down. Classic tradeoff.
F1 peaks in the middle, around thresholds of 0.4 to 0.6. That's usually a sign you've found a reasonable balance.
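If you'd rather have the script find that balance point for you, here's a minimal sketch (reusing proba and y_test from the loop above) that scans a finer grid and reports the threshold with the highest F1:

# Scan a finer grid of thresholds and keep the one with the best F1
ts = np.arange(0.05, 0.96, 0.01)
f1s = [f1_score(y_test, (proba >= t).astype(int), zero_division=0) for t in ts]
best_t = ts[int(np.argmax(f1s))]
print(f"Best threshold by F1: {best_t:.2f} (F1 = {max(f1s):.3f})")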
The Precision-Recall Curve
Instead of picking one threshold, plot precision and recall across all thresholds. This gives you the full picture.
from sklearn.metrics import precision_recall_curve, average_precision_score
import matplotlib.pyplot as plt
precisions, recalls, thresholds = precision_recall_curve(y_test, proba)
avg_precision = average_precision_score(y_test, proba)
plt.figure(figsize=(8, 5))
plt.plot(recalls, precisions, color='blue', linewidth=2)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title(f'Precision-Recall Curve (AP = {avg_precision:.3f})')
plt.grid(True, alpha=0.3)
plt.xlim([0, 1])
plt.ylim([0, 1.05])
# Mark the threshold=0.5 point
idx = np.argmin(np.abs(thresholds - 0.5))
plt.scatter(recalls[idx], precisions[idx],
            color='red', s=100, zorder=5, label='threshold=0.5')
plt.legend()
plt.savefig('precision_recall_curve.png', dpi=100)
plt.show()
print(f"Average Precision (AP): {avg_precision:.3f}")
Reading the curve:
A perfect model has a curve that goes to the top-right corner. Precision = 1.0 and Recall = 1.0 at the same time.
A random model produces a flat horizontal line at the baseline class frequency.
The area under the curve is called Average Precision (AP). Values closer to 1.0 are better. On imbalanced datasets, AP is usually a more informative summary than ROC-AUC.
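A quick sanity check worth doing, sketched below: compare AP against the positive-class frequency in the test set, which is roughly what a random model would score.

# AP for a random model is approximately the positive-class prevalence
baseline = y_test.mean()
print(f"Positive-class prevalence (random-model AP): {baseline:.3f}")
print(f"Model AP: {avg_precision:.3f}")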
F1 Score: The Harmonic Mean
F1 is the most common way to balance precision and recall into one number.
F1 = 2 * (Precision * Recall) / (Precision + Recall)
It's the harmonic mean, not the regular average. The harmonic mean punishes extreme imbalance. If precision is 1.0 but recall is 0.0, the regular average is 0.5. The harmonic mean (F1) is 0.0.
That's intentional. A model that catches nothing shouldn't get 50%.
from sklearn.metrics import f1_score
# Compare: high precision low recall vs balanced
p1, r1 = 0.95, 0.50
p2, r2 = 0.80, 0.80
regular_avg_1 = (p1 + r1) / 2
regular_avg_2 = (p2 + r2) / 2
f1_1 = 2 * (p1 * r1) / (p1 + r1)
f1_2 = 2 * (p2 * r2) / (p2 + r2)
print("Model 1: Precision=0.95, Recall=0.50")
print(f" Regular average: {regular_avg_1:.3f}")
print(f" F1 score: {f1_1:.3f}")
print("\nModel 2: Precision=0.80, Recall=0.80")
print(f" Regular average: {regular_avg_2:.3f}")
print(f" F1 score: {f1_2:.3f}")
Output:
Model 1: Precision=0.95, Recall=0.50
Regular average: 0.725
F1 score: 0.659
Model 2: Precision=0.80, Recall=0.80
Regular average: 0.800
F1 score: 0.800
F1 correctly penalizes Model 1 for its weak recall. The regular average would say the two models are close; F1 tells the truth.
F-Beta: When One Error Costs More
F1 treats precision and recall equally. Real problems often don't.
F-beta lets you put more weight on recall (when FN is expensive) or precision (when FP is expensive).
F-beta = (1 + beta^2) * (Precision * Recall) / (beta^2 * Precision + Recall)
beta > 1: recall matters more (catching positives is critical)
beta < 1: precision matters more (avoiding false alarms is critical)
beta = 1: F1 (equal weight)
from sklearn.metrics import fbeta_score
y_true_example = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred_example = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0] # catches 3/5, 1 false alarm
print(f"F1 (beta=1.0): {fbeta_score(y_true_example, y_pred_example, beta=1.0):.3f}")
print(f"F2 (beta=2.0): {fbeta_score(y_true_example, y_pred_example, beta=2.0):.3f}")
print(f"F0.5 (beta=0.5): {fbeta_score(y_true_example, y_pred_example, beta=0.5):.3f}")
Output:
F1 (beta=1.0): 0.667
F2 (beta=2.0): 0.625
F0.5 (beta=0.5): 0.714
Here precision is 0.75 (3 of the 4 flagged were real) and recall is 0.60 (3 of the 5 positives were caught), so recall is the weak spot.
F2 (beta=2) weights recall more heavily, so it punishes the two missed positives and scores lowest.
F0.5 (beta=0.5) weights precision more heavily, and with only one false alarm it scores highest.
When to use F2: cancer detection, fraud detection, safety systems. Missing real positives is the bigger sin.
When to use F0.5: spam filters, content moderation. False alarms are the bigger sin.
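If missing positives really is the bigger sin, you can also tune a model for F2 directly instead of checking it after the fact. A minimal sketch using scikit-learn's make_scorer; the grid values are placeholders, and the commented-out fit assumes the breast cancer split from earlier:

from sklearn.metrics import make_scorer, fbeta_score
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Wrap F2 as a scorer so model selection optimizes it directly
f2_scorer = make_scorer(fbeta_score, beta=2)
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={'C': [0.01, 0.1, 1, 10]},  # placeholder grid
    scoring=f2_scorer,
    cv=5,
)
# grid.fit(X_train, y_train)  # then grid.best_params_ holds the F2-optimal C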
Multi-class Precision and Recall
When you have more than two classes, you need to decide how to average across classes.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, precision_score, recall_score, f1_score
iris = load_iris()
X_i, y_i = iris.data, iris.target
X_train_i, X_test_i, y_train_i, y_test_i = train_test_split(
    X_i, y_i, test_size=0.2, random_state=42, stratify=y_i
)
model_i = RandomForestClassifier(n_estimators=100, random_state=42)
model_i.fit(X_train_i, y_train_i)
y_pred_i = model_i.predict(X_test_i)
# Three averaging strategies
for avg in ['macro', 'weighted', 'micro']:
    p = precision_score(y_test_i, y_pred_i, average=avg)
    r = recall_score(y_test_i, y_pred_i, average=avg)
    f = f1_score(y_test_i, y_pred_i, average=avg)
    print(f"average='{avg}': Precision={p:.3f} Recall={r:.3f} F1={f:.3f}")
Output:
average='macro': Precision=0.968 Recall=0.967 F1=0.967
average='weighted': Precision=0.968 Recall=0.967 F1=0.967
average='micro': Precision=0.967 Recall=0.967 F1=0.967
macro: calculates metric for each class separately, then takes the unweighted average. Each class counts equally regardless of size.
weighted: calculates metric for each class, then averages weighted by class size. Larger classes influence the score more.
micro: aggregates TP, FP, and FN across all classes first, then calculates. For single-label multi-class problems, micro precision, recall, and F1 all equal accuracy.
Use macro when all classes matter equally (even the rare ones).
Use weighted when class size reflects real-world importance.
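Before trusting any average, look at the per-class numbers it is summarizing. With average=None the same functions return one score per class; a short sketch reusing y_test_i and y_pred_i from above:

# Per-class scores: one value per class, no averaging
per_class_p = precision_score(y_test_i, y_pred_i, average=None)
per_class_r = recall_score(y_test_i, y_pred_i, average=None)
for name, p, r in zip(iris.target_names, per_class_p, per_class_r):
    print(f"{name:<12} precision={p:.3f} recall={r:.3f}")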
Choosing the Right Metric: A Decision Guide
Is your dataset balanced?
│
├── YES: Accuracy is fine. Also report F1.
│
└── NO (imbalanced):
    │
    ├── What costs more?
    │   │
    │   ├── Missing real positives (FN) costs more:
    │   │     → Optimize Recall
    │   │     → Use F2 score
    │   │     → Lower your threshold
    │   │
    │   ├── False alarms (FP) cost more:
    │   │     → Optimize Precision
    │   │     → Use F0.5 score
    │   │     → Raise your threshold
    │   │
    │   └── Both matter equally:
    │         → Use F1 score
    │         → Look at Average Precision (AP)
    │
    └── Need to compare models without picking a threshold?
          → Use Average Precision (AP)
          → Use ROC-AUC (next post)
Real Example: Fraud Detection With the Right Metrics
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    precision_score, recall_score, f1_score,
    fbeta_score, average_precision_score, classification_report
)
# Simulate imbalanced fraud dataset
np.random.seed(42)
n_samples = 10000
n_fraud = 200 # 2% fraud
X_legit = np.random.randn(n_samples - n_fraud, 10)
X_fraud = np.random.randn(n_fraud, 10) + 1.5 # fraud has shifted features
y_legit = np.zeros(n_samples - n_fraud)
y_fraud = np.ones(n_fraud)
X_all = np.vstack([X_legit, X_fraud])
y_all = np.hstack([y_legit, y_fraud])
X_train_f, X_test_f, y_train_f, y_test_f = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42, stratify=y_all
)
scaler = StandardScaler()
X_train_fs = scaler.fit_transform(X_train_f)
X_test_fs = scaler.transform(X_test_f)
# Train with class weight to handle imbalance
lr = LogisticRegression(class_weight='balanced', random_state=42)
lr.fit(X_train_fs, y_train_f)
y_pred_f = lr.predict(X_test_fs)
y_proba_f = lr.predict_proba(X_test_fs)[:, 1]
print("Fraud Detection Model Evaluation")
print("=" * 45)
print(f"Accuracy: {(y_pred_f == y_test_f).mean():.3f}")
print(f"Precision: {precision_score(y_test_f, y_pred_f):.3f}")
print(f"Recall: {recall_score(y_test_f, y_pred_f):.3f}")
print(f"F1: {f1_score(y_test_f, y_pred_f):.3f}")
print(f"F2 (recall focus): {fbeta_score(y_test_f, y_pred_f, beta=2):.3f}")
print(f"Avg Precision: {average_precision_score(y_test_f, y_proba_f):.3f}")
print()
print(classification_report(y_test_f, y_pred_f, target_names=['legit', 'fraud']))
Plotting Precision and Recall Together Across Thresholds
import matplotlib.pyplot as plt
precisions, recalls, thresholds = precision_recall_curve(y_test_f, y_proba_f)
plt.figure(figsize=(10, 4))
# Left: precision-recall curve
plt.subplot(1, 2, 1)
plt.plot(recalls, precisions, color='blue', linewidth=2)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.grid(True, alpha=0.3)
# Right: both vs threshold
plt.subplot(1, 2, 2)
plt.plot(thresholds, precisions[:-1], label='Precision', color='blue')
plt.plot(thresholds, recalls[:-1], label='Recall', color='orange')
plt.xlabel('Threshold')
plt.ylabel('Score')
plt.title('Precision and Recall vs Threshold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('precision_recall_threshold.png', dpi=100)
plt.show()
The right plot is the most useful for picking your threshold. You can see exactly what happens to both metrics as you slide the decision boundary. Pick the threshold near where the lines cross if you want balanced precision and recall (roughly where F1 peaks). Push the threshold left if recall matters more, right if precision matters more.
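One concrete way to "push the threshold left" is to fix a recall floor and take the best precision you can get while meeting it. A minimal sketch, assuming a 0.90 recall target and reusing precisions, recalls, and thresholds from the fraud curve above:

# Find the highest-precision operating point that still meets a recall target.
# Assumes at least one threshold satisfies the target.
target_recall = 0.90
# thresholds is one element shorter than precisions/recalls, so slice with [:-1]
valid = np.where(recalls[:-1] >= target_recall)[0]
best_idx = valid[np.argmax(precisions[:-1][valid])]
print(f"Threshold: {thresholds[best_idx]:.3f}")
print(f"Precision: {precisions[best_idx]:.3f}, Recall: {recalls[best_idx]:.3f}")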
The Things Everyone Gets Wrong
Mistake 1: Optimizing F1 when the errors have very different costs
F1 assumes precision and recall matter equally. Most real problems don't work that way. Know your problem. Use F2 or F0.5 when appropriate.
Mistake 2: Using macro average on severely imbalanced data
Macro averaging gives a class with 5 examples the same weight as a class with 5,000. Sometimes that's exactly what you want, but on severely imbalanced data one noisy rare class can dominate the average and give a misleading picture. Report per-class metrics alongside any average.
Mistake 3: Not reporting the metric you actually optimized for
If you tuned your threshold to maximize recall, report recall as your primary metric. Don't report accuracy and hide that your precision is low.
Mistake 4: Picking a threshold without business context
The math doesn't tell you the right threshold. The business problem does. A fraud team can review 100 false alarms per day but can't review 1000. That constraint picks your threshold.
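Here's a minimal sketch of turning that kind of constraint into a threshold: cap the number of flagged cases per scoring batch and see what recall the budget buys. The 100-alert budget is an assumption, and the "batch" here is just the fraud test set from above:

# Pick the threshold that flags at most `alert_budget` cases in this batch
alert_budget = 100                              # assumed daily review capacity
order = np.argsort(y_proba_f)[::-1]             # highest fraud scores first
threshold = y_proba_f[order[alert_budget - 1]]  # score of the 100th-highest case
flagged = (y_proba_f >= threshold).astype(int)  # may exceed the budget slightly on ties
print(f"Threshold: {threshold:.3f}, flagged: {flagged.sum()}")
print(f"Recall at this budget: {recall_score(y_test_f, flagged):.3f}")
print(f"Precision at this budget: {precision_score(y_test_f, flagged):.3f}")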
Quick Cheat Sheet
| Metric | Formula | Use when |
|---|---|---|
| Precision | TP / (TP+FP) | FP is expensive |
| Recall | TP / (TP+FN) | FN is expensive |
| F1 | harmonic mean P and R | both matter equally |
| F2 | beta=2, recall focused | FN >> FP cost |
| F0.5 | beta=0.5, precision focused | FP >> FN cost |
| AP | area under PR curve | compare models without threshold |
| Task | Code |
|---|---|
| Precision | precision_score(y_test, y_pred) |
| Recall | recall_score(y_test, y_pred) |
| F1 | f1_score(y_test, y_pred) |
| F-beta | fbeta_score(y_test, y_pred, beta=2) |
| Full report | classification_report(y_test, y_pred) |
| PR curve | precision_recall_curve(y_test, y_proba) |
| Average Precision | average_precision_score(y_test, y_proba) |
| Multi-class avg | add average='macro' or average='weighted' |
Practice Challenges
Level 1:
Train a LogisticRegression on an imbalanced dataset (use make_classification with weights=[0.95, 0.05]). Print accuracy, precision, recall, and F1. Which metric hides the problem? Which one reveals it?
Level 2:
On the breast cancer dataset, plot precision, recall, and F1 against threshold from 0.1 to 0.9. Find the threshold that maximizes F2. What does the model look like at that threshold?
Level 3:
Compare three models on an imbalanced dataset using Average Precision instead of accuracy: LogisticRegression, RandomForest, and XGBoost. Rank them by AP. Does the ranking change compared to ranking by F1?
References
- Scikit-learn: Precision, Recall, F-measures
- Scikit-learn: Precision-Recall curves
- StatQuest: Precision and Recall (YouTube)
- Google ML Crash Course: Classification metrics
Next up, Post 65: ROC Curves and AUC: Comparing Models Fairly. We visualize how every threshold performs at once, understand what AUC actually means, and learn when ROC beats precision-recall and when it doesn't.