
hqqqqy

Posted on • Originally published at mathisimple.com

Confusion Matrix, Precision, Recall, and F1: A Practical Medical Screening Guide


In medical screening, being "right most of the time" isn't good enough.

A model that always predicts "no disease" might achieve 99% accuracy on a rare condition — but it would miss every single sick patient. That's why we need better metrics than accuracy.


🌐 This is a cross-post from my interactive tutorial site mathisimple.com, where every chart and diagram is fully interactive — drag sliders, adjust parameters, and see the metrics change in real time.


When doctors use machine learning for screening tests (cancer, diabetes, infectious diseases), they care much more about not missing sick patients (high recall) while keeping false alarms manageable (reasonable precision).

Let's walk through a practical example that shows exactly how these metrics work and why they matter.

The Medical Screening Scenario

Imagine we're building a screening tool for a serious but relatively rare disease. In our test population of 1,000 people:

  • 20 people actually have the disease (2%)
  • 980 people do not

Our model makes predictions, and we get the following results:

Confusion Matrix

|                 | Predicted Positive      | Predicted Negative      |
| --------------- | ----------------------- | ----------------------- |
| Actual Positive | True Positive (TP): 15  | False Negative (FN): 5  |
| Actual Negative | False Positive (FP): 50 | True Negative (TN): 930 |

This means:

  • The model correctly identified 15 out of 20 sick patients
  • It missed 5 sick patients (dangerous)
  • It incorrectly flagged 50 healthy people for further testing (inconvenient but better than missing cases)
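The four cells above can be reproduced directly from labels. Here is a minimal sketch, assuming hypothetical `actual`/`predicted` label lists constructed to match the counts in this example (1 = disease, 0 = healthy):

```python
# Hypothetical labels matching the scenario:
# 20 sick patients (15 caught, 5 missed), 980 healthy (50 false alarms).
actual    = [1] * 20 + [0] * 980
predicted = [1] * 15 + [0] * 5 + [1] * 50 + [0] * 930

# Count each cell of the confusion matrix by comparing pairs.
tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)

print(tp, fn, fp, tn)  # 15 5 50 930
```

In practice you would get the same numbers from `sklearn.metrics.confusion_matrix(actual, predicted)`; the manual version just makes the four definitions explicit.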

Breaking Down the Metrics

1. Accuracy

The metric everyone quotes first — but often the most misleading.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{15 + 930}{1000} = 0.945 = 94.5\%$$

Looks pretty good, right? But remember — if the model predicted "negative" for everyone, it would have 98% accuracy. Accuracy hides the truth when classes are imbalanced.
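A quick sanity check of both numbers, using the counts from the table:

```python
tp, tn, fp, fn = 15, 930, 50, 5

# Accuracy: all correct predictions over all predictions.
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"Model accuracy: {accuracy:.1%}")  # 94.5%

# A trivial "always predict negative" model has TP = 0, FP = 0,
# so it gets all 980 healthy people right and nothing else.
baseline = (0 + 980) / 1000
print(f"Always-negative baseline: {baseline:.1%}")  # 98.0%
```

The useless baseline beats the real model on accuracy, which is exactly why accuracy alone is misleading on imbalanced classes.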

2. Precision (Positive Predictive Value)

Of all the times the model said "positive", how many were actually positive?

$$\text{Precision} = \frac{TP}{TP + FP} = \frac{15}{15 + 50} = \frac{15}{65} \approx 0.231 = 23.1\%$$

This means only 23.1% of people flagged for further testing actually had the disease. The doctors would be doing a lot of unnecessary follow-up tests.
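The same calculation in code, with the counts from the confusion matrix:

```python
tp, fp = 15, 50

# Precision: of everyone flagged positive, what fraction is truly sick?
precision = tp / (tp + fp)
print(f"Precision: {precision:.1%}")  # 23.1%
```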

3. Recall (Sensitivity / True Positive Rate)

Of all the actual sick patients, how many did we catch?

$$\text{Recall} = \frac{TP}{TP + FN} = \frac{15}{15 + 5} = \frac{15}{20} = 0.75 = 75\%$$

We caught 75% of the sick patients. In medicine, this is often the most critical metric — missing a sick patient (false negative) can have severe consequences.
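And recall, again from the same counts:

```python
tp, fn = 15, 5

# Recall: of all truly sick patients, what fraction did we catch?
recall = tp / (tp + fn)
print(f"Recall: {recall:.0%}")  # 75%
```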

4. F1 Score

The harmonic mean of precision and recall. Useful when you need to balance both.

$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = 2 \times \frac{0.231 \times 0.75}{0.231 + 0.75} \approx 0.353$$
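Computing F1 both from precision and recall and directly from the confusion-matrix counts (the two forms are algebraically identical) confirms the value:

```python
precision = 15 / 65   # ~0.231
recall    = 15 / 20   # 0.75

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1: {f1:.3f}")  # ~0.353

# Equivalent count-based form: F1 = 2*TP / (2*TP + FP + FN)
f1_counts = 2 * 15 / (2 * 15 + 50 + 5)
```

The low F1 (0.353) reflects that the model's weak precision drags the score down even though recall is decent; the harmonic mean punishes whichever of the two is worse.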

Why These Metrics Tell Different Stories

This example perfectly illustrates the tension in medical ML:

  • High recall is crucial because missing a case can be life-threatening
  • Reasonable precision is important because too many false positives waste medical resources and cause patient anxiety

In real medical applications, the acceptable trade-off depends on the disease:

  • For aggressive cancers: prioritize recall even if it means more false positives
  • For less serious conditions: might accept lower recall to reduce unnecessary procedures

Interactive Exploration on mathisimple.com

The static table above doesn't tell the full story.

On the original article, you can:

  • Adjust the model's sensitivity using an interactive threshold slider
  • See how changing the decision threshold affects all four metrics simultaneously
  • Experiment with different disease prevalence rates
  • Watch the confusion matrix update live as you tune the model
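A rough offline sketch of what the threshold slider does. Everything here is an assumption for illustration: the model's risk scores are simulated with Beta distributions (sick patients skewed toward high scores, healthy toward low), not data from the article:

```python
import random

random.seed(0)
# Hypothetical risk scores in [0, 1]; sick patients tend to score higher.
scores_sick    = [random.betavariate(5, 2) for _ in range(20)]
scores_healthy = [random.betavariate(2, 5) for _ in range(980)]

results = {}
for threshold in (0.3, 0.5, 0.7):
    # Predict "positive" whenever the score clears the threshold.
    tp = sum(s >= threshold for s in scores_sick)
    fp = sum(s >= threshold for s in scores_healthy)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / 20
    results[threshold] = (precision, recall)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

Lowering the threshold can only keep or raise recall (more patients clear the bar) while typically lowering precision, which is the precision–recall trade-off the interactive slider lets you feel directly.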

👉 Try the interactive Confusion Matrix tutorial

You'll see firsthand why a "94.5% accurate" model might still be clinically unacceptable.


This is part of the Machine Learning Foundations series, where we focus on building intuition through concrete examples rather than abstract theory. The next article will cover data preprocessing pitfalls that break models before training even begins.

What medical or high-stakes application have you worked on where traditional accuracy was misleading? I'd love to hear your experiences in the comments.
