Akhilesh
64. Precision and Recall: Beyond Accuracy

In the last post you saw that accuracy can be 95% while your model catches zero fraud.

Precision and recall are the fix. They measure different things, they pull in opposite directions, and picking the right one for your problem is one of the most important decisions you'll make in ML.

Most people know the definitions but don't know when to use which one. That's what this post is really about.


What You'll Learn Here

  • Precision and recall in plain words with real examples
  • Why improving one usually hurts the other
  • The precision-recall curve and how to read it
  • F1 score: what it is and when it's the right choice
  • F-beta score: when one error costs more than the other
  • Average precision for imbalanced problems
  • How to pick the right metric for any problem

Precision: When You Say Yes, Are You Right?

Precision answers: of all the times my model predicted positive, what fraction were actually positive?

Precision = TP / (TP + FP)

High precision means when you raise the alarm, it's almost always real. Low precision means lots of false alarms.

Real example: A spam filter with 99% precision blocks 99 real spam emails for every 1 legit email it blocks. Very few false alarms. Users trust it.

When precision matters most: when false positives are expensive or damaging.

  • Spam filter blocking legit emails from your boss is bad
  • Hiring tool falsely rejecting good candidates is bad
  • Recommending a product someone hates destroys trust

Recall: Did You Find All the Real Positives?

Recall answers: of all the actual positives that existed, what fraction did my model find?

Recall = TP / (TP + FN)

High recall means you caught almost everything real. Low recall means you're missing a lot.

Real example: A cancer screening tool with 99% recall catches 99 out of 100 actual cancer cases. It might have some false alarms, but it misses almost nothing.

When recall matters most: when false negatives are expensive or dangerous.

  • Missing cancer is catastrophic
  • Missing fraud means real money lost
  • Missing a structural defect in a bridge is deadly
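Both formulas are easy to verify by hand. Here's a quick sketch on made-up labels (six samples, not from any real dataset), counting TP, FP, and FN directly and checking against scikit-learn:

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 3 real positives among 6 samples
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # 2
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # 1
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # 1

print(f"Precision: {tp / (tp + fp):.3f}")  # TP/(TP+FP) = 2/3
print(f"Recall:    {tp / (tp + fn):.3f}")  # TP/(TP+FN) = 2/3
print(f"sklearn:   {precision_score(y_true, y_pred):.3f}, "
      f"{recall_score(y_true, y_pred):.3f}")
```

Two of the three alarms were real (precision 0.667), and two of the three real positives were caught (recall 0.667).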

They Pull Against Each Other

Here's the core tension. You can't just maximize both at once.

Think of a net catching fish. You want to catch all the right fish (high recall) and catch nothing else (high precision).

If you make the net bigger, you catch more right fish but also more wrong ones. Recall goes up, precision goes down.

If you make the net smaller and more selective, you catch only the ones you're sure about. Precision goes up, recall goes down.

The threshold on your model's probability output is that net size.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score

data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]  # prob of benign (class 1)

print(f"{'Threshold':<12} {'Precision':<12} {'Recall':<10} {'F1'}")
print("-" * 47)

for thresh in [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
    y_pred_t = (proba >= thresh).astype(int)
    prec = precision_score(y_test, y_pred_t, zero_division=0)
    rec  = recall_score(y_test, y_pred_t)
    f1   = f1_score(y_test, y_pred_t, zero_division=0)
    print(f"{thresh:<12} {prec:<12.3f} {rec:<10.3f} {f1:.3f}")

Output:

Threshold    Precision    Recall     F1
-----------------------------------------------
0.2          0.938        1.000      0.968
0.3          0.945        1.000      0.972
0.4          0.959        1.000      0.979
0.5          0.973        0.986      0.979
0.6          0.986        0.972      0.979
0.7          1.000        0.944      0.971
0.8          1.000        0.931      0.964
0.9          1.000        0.903      0.949

As threshold rises: Precision goes up, Recall goes down. Classic tradeoff.

F1 peaks in the middle around 0.4 to 0.6. That's usually a sign you've found a reasonable balance.


The Precision-Recall Curve

Instead of picking one threshold, plot precision and recall across all thresholds. This gives you the full picture.

from sklearn.metrics import precision_recall_curve, average_precision_score
import matplotlib.pyplot as plt

precisions, recalls, thresholds = precision_recall_curve(y_test, proba)
avg_precision = average_precision_score(y_test, proba)

plt.figure(figsize=(8, 5))
plt.plot(recalls, precisions, color='blue', linewidth=2)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title(f'Precision-Recall Curve (AP = {avg_precision:.3f})')
plt.grid(True, alpha=0.3)
plt.xlim([0, 1])
plt.ylim([0, 1.05])

# Mark the threshold=0.5 point
idx = np.argmin(np.abs(thresholds - 0.5))
plt.scatter(recalls[idx], precisions[idx],
            color='red', s=100, zorder=5, label='threshold=0.5')
plt.legend()
plt.savefig('precision_recall_curve.png', dpi=100)
plt.show()

print(f"Average Precision (AP): {avg_precision:.3f}")

Reading the curve:

A perfect model has a curve that goes to the top-right corner. Precision = 1.0 and Recall = 1.0 at the same time.

A random model produces a flat horizontal line at the baseline class frequency.

The area under the curve is called Average Precision (AP). Values closer to 1.0 are better. On imbalanced datasets, AP is a better summary than AUC-ROC.
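The random-model baseline is easy to check empirically. This sketch (made-up data, 10% positives) scores every sample with uninformative random numbers and shows that AP lands near the positive class frequency:

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
y = np.zeros(10000, dtype=int)
y[:1000] = 1                   # 10% positives
scores = rng.random(10000)     # random scores carry no information

ap = average_precision_score(y, scores)
print(f"Prevalence: {y.mean():.3f}, AP of random scores: {ap:.3f}")
# AP stays near 0.10, far below what any useful model achieves
```

This is why an AP of, say, 0.5 can be excellent on a 2%-positive dataset: the baseline is 0.02, not 0.5.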


F1 Score: The Harmonic Mean

F1 is the most common way to balance precision and recall into one number.

F1 = 2 * (Precision * Recall) / (Precision + Recall)

It's the harmonic mean, not the regular average. The harmonic mean punishes extreme imbalance. If precision is 1.0 but recall is 0.0, the regular average is 0.5. The harmonic mean (F1) is 0.0.

That's intentional. A model that catches nothing shouldn't get 50%.

from sklearn.metrics import f1_score

# Compare: high precision low recall vs balanced
p1, r1 = 0.95, 0.50
p2, r2 = 0.80, 0.80

regular_avg_1 = (p1 + r1) / 2
regular_avg_2 = (p2 + r2) / 2

f1_1 = 2 * (p1 * r1) / (p1 + r1)
f1_2 = 2 * (p2 * r2) / (p2 + r2)

print("Model 1: Precision=0.95, Recall=0.50")
print(f"  Regular average: {regular_avg_1:.3f}")
print(f"  F1 score:        {f1_1:.3f}")

print("\nModel 2: Precision=0.80, Recall=0.80")
print(f"  Regular average: {regular_avg_2:.3f}")
print(f"  F1 score:        {f1_2:.3f}")

Output:

Model 1: Precision=0.95, Recall=0.50
  Regular average: 0.725
  F1 score:        0.659

Model 2: Precision=0.80, Recall=0.80
  Regular average: 0.800
  F1 score:        0.800

F1 correctly penalizes Model 1 for its terrible recall. The regular average would say they're close. F1 tells the truth.


F-Beta: When One Error Costs More

F1 treats precision and recall equally. Real problems often don't.

F-beta lets you put more weight on recall (when FN is expensive) or precision (when FP is expensive).

F-beta = (1 + beta^2) * (Precision * Recall)
         ─────────────────────────────────────
         (beta^2 * Precision) + Recall

beta > 1: recall matters more (catching positives is critical)
beta < 1: precision matters more (avoiding false alarms is critical)
beta = 1: F1 (equal weight)
from sklearn.metrics import fbeta_score

y_true_example = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred_example = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]  # catches 3/5, 1 false alarm

print(f"F1   (beta=1.0): {fbeta_score(y_true_example, y_pred_example, beta=1.0):.3f}")
print(f"F2   (beta=2.0): {fbeta_score(y_true_example, y_pred_example, beta=2.0):.3f}")
print(f"F0.5 (beta=0.5): {fbeta_score(y_true_example, y_pred_example, beta=0.5):.3f}")

Output:

F1   (beta=1.0): 0.667
F2   (beta=2.0): 0.625
F0.5 (beta=0.5): 0.714

Here precision is 0.75 (3 of 4 alarms were real) and recall is 0.60 (3 of 5 positives caught). F2 (beta=2) weights the weaker recall more heavily, so it scores below F1.

F0.5 (beta=0.5) weights the stronger precision more heavily, so it scores above F1.

When to use F2: cancer detection, fraud detection, safety systems. Missing real positives is the bigger sin.

When to use F0.5: spam filters, content moderation. False alarms are the bigger sin.
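To see exactly how beta blends the two, here's a small sketch that evaluates the F-beta formula at a fixed precision of 0.75 and recall of 0.60 across several betas; as beta grows, the score slides from the precision side toward the recall side:

```python
def f_beta(precision, recall, beta):
    """F-beta = (1 + beta^2) * P * R / (beta^2 * P + R)"""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.75, 0.60
for beta in [0.25, 0.5, 1.0, 2.0, 4.0]:
    print(f"beta={beta:<5} F-beta={f_beta(p, r, beta):.3f}")
# beta -> 0   pushes the score toward precision (0.75)
# beta -> inf pushes the score toward recall   (0.60)
```

At beta=1 this reduces to the ordinary F1 harmonic mean.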


Multi-class Precision and Recall

When you have more than two classes, you need to decide how to average across classes.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, precision_score, recall_score, f1_score

iris = load_iris()
X_i, y_i = iris.data, iris.target

X_train_i, X_test_i, y_train_i, y_test_i = train_test_split(
    X_i, y_i, test_size=0.2, random_state=42, stratify=y_i
)

model_i = RandomForestClassifier(n_estimators=100, random_state=42)
model_i.fit(X_train_i, y_train_i)
y_pred_i = model_i.predict(X_test_i)

# Three averaging strategies
for avg in ['macro', 'weighted', 'micro']:
    p = precision_score(y_test_i, y_pred_i, average=avg)
    r = recall_score(y_test_i, y_pred_i, average=avg)
    f = f1_score(y_test_i, y_pred_i, average=avg)
    print(f"average='{avg}':  Precision={p:.3f}  Recall={r:.3f}  F1={f:.3f}")

Output:

average='macro':    Precision=0.968  Recall=0.967  F1=0.967
average='weighted': Precision=0.968  Recall=0.967  F1=0.967
average='micro':    Precision=0.967  Recall=0.967  F1=0.967

macro: calculates metric for each class separately, then takes the unweighted average. Each class counts equally regardless of size.

weighted: calculates metric for each class, then averages weighted by class size. Larger classes influence the score more.

micro: aggregates TP, FP, FN across all classes first, then calculates. For single-label multi-class problems, micro precision, recall, and F1 all equal accuracy.

Use macro when all classes matter equally (even the rare ones).
Use weighted when class size reflects real-world importance.
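The averaging choice matters most when the classes are imbalanced. A made-up example (90 majority samples, 10 minority, with the model poor on the minority) shows how far the two averages can drift apart:

```python
from sklearn.metrics import f1_score

# Hypothetical imbalanced case: 90 samples of class 0, 10 of class 1.
# The model is near-perfect on the majority, poor on the minority.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 90 + [0] * 8 + [1] * 2   # catches only 2 of 10 minority cases

macro    = f1_score(y_true, y_pred, average='macro')
weighted = f1_score(y_true, y_pred, average='weighted')
print(f"macro F1:    {macro:.3f}")     # dragged down by the minority class
print(f"weighted F1: {weighted:.3f}")  # dominated by the majority class
```

The weighted average looks comfortable while the macro average exposes the weak minority class, which is exactly why you should report both (or per-class scores) on imbalanced data.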


Choosing the Right Metric: A Decision Guide

Is your dataset balanced?
│
├── YES: Accuracy is fine. Also report F1.
│
└── NO (imbalanced):
      │
      ├── What costs more?
      │     │
      │     ├── Missing real positives (FN) costs more:
      │     │     → Optimize Recall
      │     │     → Use F2 score
      │     │     → Lower your threshold
      │     │
      │     ├── False alarms (FP) cost more:
      │     │     → Optimize Precision
      │     │     → Use F0.5 score
      │     │     → Raise your threshold
      │     │
      │     └── Both matter equally:
      │           → Use F1 score
      │           → Look at Average Precision (AP)
      │
      └── Need to compare models without picking a threshold?
            → Use Average Precision (AP)
            → Use ROC-AUC (next post)

Real Example: Fraud Detection With the Right Metrics

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    precision_score, recall_score, f1_score,
    fbeta_score, average_precision_score, classification_report
)

# Simulate imbalanced fraud dataset
np.random.seed(42)
n_samples = 10000
n_fraud   = 200  # 2% fraud

X_legit = np.random.randn(n_samples - n_fraud, 10)
X_fraud = np.random.randn(n_fraud, 10) + 1.5  # fraud has shifted features
y_legit = np.zeros(n_samples - n_fraud)
y_fraud = np.ones(n_fraud)

X_all = np.vstack([X_legit, X_fraud])
y_all = np.hstack([y_legit, y_fraud])

X_train_f, X_test_f, y_train_f, y_test_f = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42, stratify=y_all
)

scaler = StandardScaler()
X_train_fs = scaler.fit_transform(X_train_f)
X_test_fs  = scaler.transform(X_test_f)

# Train with class weight to handle imbalance
lr = LogisticRegression(class_weight='balanced', random_state=42)
lr.fit(X_train_fs, y_train_f)

y_pred_f  = lr.predict(X_test_fs)
y_proba_f = lr.predict_proba(X_test_fs)[:, 1]

print("Fraud Detection Model Evaluation")
print("=" * 45)
print(f"Accuracy:          {(y_pred_f == y_test_f).mean():.3f}")
print(f"Precision:         {precision_score(y_test_f, y_pred_f):.3f}")
print(f"Recall:            {recall_score(y_test_f, y_pred_f):.3f}")
print(f"F1:                {f1_score(y_test_f, y_pred_f):.3f}")
print(f"F2 (recall focus): {fbeta_score(y_test_f, y_pred_f, beta=2):.3f}")
print(f"Avg Precision:     {average_precision_score(y_test_f, y_proba_f):.3f}")
print()
print(classification_report(y_test_f, y_pred_f, target_names=['legit', 'fraud']))

Plotting Precision and Recall Together Across Thresholds

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_test_f, y_proba_f)

plt.figure(figsize=(10, 4))

# Left: precision-recall curve
plt.subplot(1, 2, 1)
plt.plot(recalls, precisions, color='blue', linewidth=2)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.grid(True, alpha=0.3)

# Right: both vs threshold
plt.subplot(1, 2, 2)
plt.plot(thresholds, precisions[:-1], label='Precision', color='blue')
plt.plot(thresholds, recalls[:-1],    label='Recall',    color='orange')
plt.xlabel('Threshold')
plt.ylabel('Score')
plt.title('Precision and Recall vs Threshold')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('precision_recall_threshold.png', dpi=100)
plt.show()

The right plot is the most useful for picking your threshold. You can see exactly what happens to both metrics as you slide the decision boundary. If you want balance, pick a threshold near where the lines cross: precision and recall are equal there, which is usually close to where F1 peaks. Push the threshold left if recall matters more. Push right if precision matters more.
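You don't have to eyeball the plot: the F1-maximizing threshold can be read straight off the curve arrays. A minimal standalone sketch, using a synthetic make_classification dataset rather than the fraud data above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=42, stratify=y)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
prec, rec, thr = precision_recall_curve(y_te, proba)

# F1 at every candidate threshold (guard against 0/0 at degenerate points)
f1 = np.divide(2 * prec * rec, prec + rec,
               out=np.zeros_like(prec), where=(prec + rec) > 0)
best = np.argmax(f1[:-1])   # the final (P=1, R=0) point has no threshold
print(f"Best threshold: {thr[best]:.3f}, F1 there: {f1[best]:.3f}")
```

Swapping the F1 expression for an F-beta expression turns the same scan into a recall- or precision-weighted threshold search.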


The Things Everyone Gets Wrong

Mistake 1: Optimizing F1 when the errors have very different costs

F1 assumes precision and recall matter equally. Most real problems don't work that way. Know your problem. Use F2 or F0.5 when appropriate.

Mistake 2: Using macro average on severely imbalanced data

Macro average gives a class with 5 examples the same weight as one with 5000, so a single tiny, noisy class can swing the whole score. On severely imbalanced data, report per-class metrics alongside any average.

Mistake 3: Not reporting the metric you actually optimized for

If you tuned your threshold to maximize recall, report recall as your primary metric. Don't report accuracy and hide that your precision is low.

Mistake 4: Picking a threshold without business context

The math doesn't tell you the right threshold. The business problem does. A fraud team can review 100 false alarms per day but can't review 1000. That constraint picks your threshold.
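That review-budget constraint translates directly into a threshold: rank by score and alert only on the top k. A hypothetical sketch of that idea, assuming the scores are distinct (real systems would need a tie-break rule):

```python
import numpy as np

def threshold_for_budget(proba, max_alerts):
    """Highest threshold that still yields max_alerts alerts,
    assuming distinct scores (ties would need a tie-break rule)."""
    # The max_alerts-th largest score: everything at or above it is flagged
    return np.partition(proba, -max_alerts)[-max_alerts]

rng = np.random.default_rng(42)
proba = rng.random(1000)   # stand-in for model scores on a day's traffic
thr = threshold_for_budget(proba, max_alerts=100)
print(f"Threshold: {thr:.3f}, alerts: {(proba >= thr).sum()}")  # 100 alerts
```

The precision and recall you get at that threshold then tell you whether the budget is adequate, or whether the business needs a bigger review team.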


Quick Cheat Sheet

Metric     Formula                      Use when
Precision  TP / (TP + FP)               FP is expensive
Recall     TP / (TP + FN)               FN is expensive
F1         harmonic mean of P and R     P and R matter equally
F2         F-beta with beta=2           FN costs much more than FP
F0.5       F-beta with beta=0.5         FP costs much more than FN
AP         area under the PR curve      comparing models without a threshold

Task               Code
Precision          precision_score(y_test, y_pred)
Recall             recall_score(y_test, y_pred)
F1                 f1_score(y_test, y_pred)
F-beta             fbeta_score(y_test, y_pred, beta=2)
Full report        classification_report(y_test, y_pred)
PR curve           precision_recall_curve(y_test, y_proba)
Average Precision  average_precision_score(y_test, y_proba)
Multi-class avg    add average='macro' or average='weighted'

Practice Challenges

Level 1:
Train a LogisticRegression on an imbalanced dataset (use make_classification with weights=[0.95, 0.05]). Print accuracy, precision, recall, and F1. Which metric hides the problem? Which one reveals it?

Level 2:
On the breast cancer dataset, plot precision, recall, and F1 against threshold from 0.1 to 0.9. Find the threshold that maximizes F2. What does the model look like at that threshold?

Level 3:
Compare three models on an imbalanced dataset using Average Precision instead of accuracy: LogisticRegression, RandomForest, and XGBoost. Rank them by AP. Does the ranking change compared to ranking by F1?




Next up, Post 65: ROC Curves and AUC: Comparing Models Fairly. We visualize how every threshold performs at once, understand what AUC actually means, and learn when ROC beats precision-recall and when it doesn't.
