63. Confusion Matrix: What Your Model Got Wrong and Why

Your model has 95% accuracy. You ship it.

Three weeks later someone tells you it's missing nearly every actual fraud case.

You check. The dataset had 95% legit transactions and 5% fraud. Your model just learned to say "not fraud" every single time. 95% accuracy. Zero fraud caught.

That's what happens when you trust accuracy alone. The confusion matrix is the tool that would have caught this immediately.


What You'll Learn Here

  • What the four cells of a confusion matrix mean
  • TP, TN, FP, FN with real-world examples, not textbook ones
  • How to build and read a confusion matrix in Python
  • Why class imbalance makes accuracy useless
  • How to visualize it properly
  • Multi-class confusion matrices

The Four Outcomes of Every Prediction

Every prediction your model makes falls into one of four buckets. Let's use a disease test as the example because the stakes are obvious.

                     PREDICTED
                  Positive  Negative
ACTUAL  Positive |   TP    |   FN   |
        Negative |   FP    |   TN   |

True Positive (TP): Model said positive. Actually positive. Correct.
The test said "has disease." Person has the disease. Good catch.

True Negative (TN): Model said negative. Actually negative. Correct.
The test said "no disease." Person is healthy. Also good.

False Positive (FP): Model said positive. Actually negative. Wrong.
The test said "has disease." Person is actually healthy. A false alarm.
Also called a Type I error.

False Negative (FN): Model said negative. Actually positive. Wrong.
The test said "no disease." Person actually has the disease. Missed it.
Also called a Type II error.
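
To make the four buckets concrete, here's a tiny hand-tallied toy example (the labels are made up purely for illustration):

# Toy disease-test example: 1 = has disease, 0 = healthy (made-up labels)
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)  # caught the disease
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)  # correctly cleared
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # false alarm (Type I)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # missed case (Type II)

print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")  # TP=3  TN=3  FP=1  FN=1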

In most real problems, FP and FN have very different costs. Missing cancer (FN) is catastrophic. Flagging a legit transaction as fraud (FP) is annoying but fixable.

That difference is exactly why you need more than accuracy.


Building Your First Confusion Matrix

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

data = load_breast_cancer()
X, y = data.data, data.target

# 0 = malignant, 1 = benign
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Raw confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
print()

# Label what each cell is
tn, fp, fn, tp = cm.ravel()
print(f"True Positives  (TP): {tp}  <- predicted benign,   actually benign")
print(f"True Negatives  (TN): {tn}  <- predicted malignant, actually malignant")
print(f"False Positives (FP): {fp}   <- predicted benign,   actually malignant")
print(f"False Negatives (FN): {fn}   <- predicted malignant, actually benign")

Output:

Confusion Matrix:
[[40  2]
 [ 1 71]]

True Positives  (TP): 71  <- predicted benign,   actually benign
True Negatives  (TN): 40  <- predicted malignant, actually malignant
False Positives (FP): 2   <- predicted benign,   actually malignant
False Negatives (FN): 1   <- predicted malignant, actually benign

The model got 71 + 40 = 111 correct out of 114. 97.4% accuracy.

But look at the mistakes. It missed 2 malignant tumors (FP here means it called them benign). That's the dangerous type of error in cancer detection.


Visualizing It Properly

Raw numbers are fine. A heatmap is better.

# Clean visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Without normalization - raw counts
disp1 = ConfusionMatrixDisplay(
    confusion_matrix=cm,
    display_labels=data.target_names
)
disp1.plot(ax=axes[0], colorbar=False, cmap='Blues')
axes[0].set_title('Raw Counts')

# With normalization - proportions
cm_normalized = confusion_matrix(y_test, y_pred, normalize='true')
disp2 = ConfusionMatrixDisplay(
    confusion_matrix=cm_normalized,
    display_labels=data.target_names
)
disp2.plot(ax=axes[1], colorbar=False, cmap='Blues')
axes[1].set_title('Normalized (row %)')

plt.tight_layout()
plt.savefig('confusion_matrix.png', dpi=100)
plt.show()

The normalized version shows recall per class. Each row sums to 1.0. You can instantly see what percentage of each actual class was correctly identified.
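
If you want to check that yourself, the normalized matrix is just each row of the raw counts divided by its row total (cm here is the matrix from the first example):

# Row-normalize by hand; should match normalize='true'
cm_manual = cm / cm.sum(axis=1, keepdims=True)
print(cm_manual)               # each row sums to 1.0
print(cm_manual.sum(axis=1))   # [1. 1.]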


Why Accuracy Lies on Imbalanced Data

Let's prove the fraud example from the intro with real code.

import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Imbalanced dataset: 950 legit, 50 fraud
np.random.seed(42)
y_true = np.array([0]*950 + [1]*50)  # 0=legit, 1=fraud

# Model A: always predicts "not fraud"
y_pred_lazy = np.zeros(1000, dtype=int)

# Model B: actually tries to catch fraud
# Catches 35 out of 50 frauds, but has 20 false alarms
y_pred_smart = np.zeros(1000, dtype=int)
fraud_indices = np.where(y_true == 1)[0]
y_pred_smart[fraud_indices[:35]] = 1   # catches 35 real frauds
y_pred_smart[:20] = 1                  # 20 false alarms on legit transactions

print("=" * 50)
print("MODEL A: Always predicts Not Fraud")
print("=" * 50)
print(f"Accuracy: {accuracy_score(y_true, y_pred_lazy):.3f}")
cm_a = confusion_matrix(y_true, y_pred_lazy)
print(f"Confusion Matrix:\n{cm_a}")
tn, fp, fn, tp = cm_a.ravel()
print(f"Fraud caught: {tp} out of 50")

print()
print("=" * 50)
print("MODEL B: Actually tries to detect fraud")
print("=" * 50)
print(f"Accuracy: {accuracy_score(y_true, y_pred_smart):.3f}")
cm_b = confusion_matrix(y_true, y_pred_smart)
print(f"Confusion Matrix:\n{cm_b}")
tn, fp, fn, tp = cm_b.ravel()
print(f"Fraud caught: {tp} out of 50")

Output:

==================================================
MODEL A: Always predicts Not Fraud
==================================================
Accuracy: 0.950
Confusion Matrix:
[[950   0]
 [ 50   0]]
Fraud caught: 0 out of 50

==================================================
MODEL B: Actually tries to detect fraud
==================================================
Accuracy: 0.965
Confusion Matrix:
[[930  20]
 [ 15  35]]
Fraud caught: 35 out of 50

Model A: 95% accuracy. Catches zero fraud. Completely useless.
Model B: 96.5% accuracy. Catches 35 out of 50 frauds. Actually useful.

Accuracy said Model A was nearly as good. The confusion matrix told the truth.
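
One number makes that gap explicit: recall on the fraud class. A quick sanity check, reusing y_true, y_pred_lazy, and y_pred_smart from the code above:

from sklearn.metrics import recall_score

# Recall on the fraud class (label 1): what fraction of real fraud did each model catch?
print(f"Model A recall: {recall_score(y_true, y_pred_lazy):.3f}")   # 0.000
print(f"Model B recall: {recall_score(y_true, y_pred_smart):.3f}")  # 0.700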


Reading Every Number From a Confusion Matrix

Once you have the four numbers, you can calculate all the important metrics by hand.

# After getting tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

total = tn + fp + fn + tp

accuracy  = (tp + tn) / total
precision = tp / (tp + fp)        # of all predicted positive, how many were right
recall    = tp / (tp + fn)        # of all actual positive, how many did we catch
f1        = 2 * (precision * recall) / (precision + recall)
specificity = tn / (tn + fp)      # of all actual negative, how many did we get right

print(f"Total samples:  {total}")
print(f"TP: {tp}  TN: {tn}  FP: {fp}  FN: {fn}")
print()
print(f"Accuracy:    {accuracy:.3f}   <- overall correct %")
print(f"Precision:   {precision:.3f}   <- when I say positive, am I right?")
print(f"Recall:      {recall:.3f}   <- did I catch all actual positives?")
print(f"F1 Score:    {f1:.3f}   <- balance of precision and recall")
print(f"Specificity: {specificity:.3f}   <- did I correctly identify negatives?")

Output:

Total samples:  114
TP: 71  TN: 40  FP: 2  FN: 1

Accuracy:    0.974   <- overall correct %
Precision:   0.973   <- when I say positive, am I right?
Recall:      0.986   <- did I catch all actual positives?
F1 Score:    0.979   <- balance of precision and recall
Specificity: 0.952   <- did I correctly identify negatives?

All of these come from the same four numbers. Memorizing the formulas is less important than understanding what each one means in context.
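
If you'd rather not do the arithmetic yourself, sklearn has one helper per metric. These should match the hand-computed values above:

from sklearn.metrics import precision_score, recall_score, f1_score

# Each helper recomputes its metric straight from y_test and y_pred
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
print(f"Recall:    {recall_score(y_test, y_pred):.3f}")
print(f"F1 Score:  {f1_score(y_test, y_pred):.3f}")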


Choosing Which Error Is Worse

The right metric depends on your problem. You need to decide which error costs more.

Problem: Cancer detection
  FN (missed cancer) > FP (false alarm)
  → Optimize for Recall. Catch everything, even if some are false alarms.

Problem: Spam filter
  FP (blocking legit email) > FN (letting spam through)
  → Optimize for Precision. Only block what you're sure about.

Problem: Fraud detection
  FN (missed fraud) > FP (flagging legit transaction)
  → Optimize for Recall on fraud class.

Problem: Hiring tool
  FP (hiring wrong person) ≈ FN (missing good candidate)
  → Optimize for F1. Balance both.
Precision, recall, and the rest all move when you change the decision threshold. Here's how a threshold sweep plays out on the breast cancer model:

# See how changing the threshold shifts TP, FP, FN, TN
from sklearn.ensemble import RandomForestClassifier

model_prob = RandomForestClassifier(n_estimators=100, random_state=42)
model_prob.fit(X_train, y_train)
proba = model_prob.predict_proba(X_test)[:, 1]  # probability of benign

print(f"{'Threshold':<12} {'TP':<6} {'TN':<6} {'FP':<6} {'FN':<6} {'Recall':<10} {'Precision'}")
print("-" * 60)

for thresh in [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]:
    y_pred_t = (proba >= thresh).astype(int)
    cm_t = confusion_matrix(y_test, y_pred_t)
    tn_t, fp_t, fn_t, tp_t = cm_t.ravel()
    rec  = tp_t / (tp_t + fn_t) if (tp_t + fn_t) > 0 else 0
    prec = tp_t / (tp_t + fp_t) if (tp_t + fp_t) > 0 else 0
    print(f"{thresh:<12} {tp_t:<6} {tn_t:<6} {fp_t:<6} {fn_t:<6} {rec:<10.3f} {prec:.3f}")

Output:

Threshold    TP     TN     FP     FN     Recall     Precision
------------------------------------------------------------
0.3          72     38     4      0      1.000      0.947
0.4          72     39     3      0      1.000      0.960
0.5          71     40     2      1      0.986      0.973
0.6          70     41     1      2      0.972      0.986
0.7          68     42     0      4      0.944      1.000
0.8          65     42     0      7      0.903      1.000

One caution on reading this table: in this dataset the positive class is benign, so FP means a malignant tumor was called benign (a missed cancer) and FN means a benign case was flagged as malignant (a false alarm).

At threshold 0.3 every benign case is caught (Recall = 1.0), but 4 malignant tumors slip through labeled as benign.
At threshold 0.7 every "benign" call is correct (Precision = 1.0) and no malignant tumor is missed, at the cost of 4 benign patients flagged for follow-up.

Which is better? For cancer detection, the higher threshold: a few unnecessary follow-ups are far cheaper than a missed malignancy.


Multi-class Confusion Matrix

With more than two classes, the matrix grows but the same logic applies.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report

iris = load_iris()
X_i, y_i = iris.data, iris.target

X_train_i, X_test_i, y_train_i, y_test_i = train_test_split(
    X_i, y_i, test_size=0.2, random_state=42, stratify=y_i
)

model_i = RandomForestClassifier(n_estimators=100, random_state=42)
model_i.fit(X_train_i, y_train_i)
y_pred_i = model_i.predict(X_test_i)

cm_i = confusion_matrix(y_test_i, y_pred_i)

print("Multi-class Confusion Matrix:")
print(cm_i)
print()

# Visualize
fig, ax = plt.subplots(figsize=(7, 5))
disp = ConfusionMatrixDisplay(
    confusion_matrix=cm_i,
    display_labels=iris.target_names
)
disp.plot(ax=ax, colorbar=False, cmap='Blues')
ax.set_title('Iris - 3 Class Confusion Matrix')
plt.tight_layout()
plt.savefig('multiclass_cm.png', dpi=100)
plt.show()

Output:

Multi-class Confusion Matrix:
[[10  0  0]
 [ 0  9  1]
 [ 0  0 10]]

Reading this: rows are actual classes, columns are predicted.

Row "versicolor": 9 correctly identified as versicolor, 1 incorrectly called virginica. That 1 is a false negative for versicolor and a false positive for virginica.

The diagonal is always your correct predictions. Off-diagonal cells are errors.
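
You can also pull per-class counts straight out of the matrix: for any class, FN is its row total minus the diagonal entry, and FP is its column total minus the diagonal entry. A short sketch on the matrix above:

import numpy as np

# Per-class TP / FP / FN from the multi-class matrix
tp_per_class = np.diag(cm_i)
fn_per_class = cm_i.sum(axis=1) - tp_per_class  # actual class, predicted as something else
fp_per_class = cm_i.sum(axis=0) - tp_per_class  # predicted as this class, actually something else

for name, tp_k, fp_k, fn_k in zip(iris.target_names, tp_per_class, fp_per_class, fn_per_class):
    print(f"{name:<12} TP={tp_k}  FP={fp_k}  FN={fn_k}")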


Per-class Metrics From the Matrix

print(classification_report(
    y_test_i, y_pred_i,
    target_names=iris.target_names
))

Output:

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      0.90      0.95        10
   virginica       0.91      1.00      0.95        10

    accuracy                           0.97        30
   macro avg       0.97      0.97      0.97        30
weighted avg       0.97      0.97      0.97        30

Every class gets its own precision, recall, and F1. You can see exactly which classes are problematic. Versicolor has lower recall because one example got misclassified as virginica.


A Complete Diagnostic Workflow

from sklearn.metrics import (
    confusion_matrix, classification_report,
    accuracy_score, roc_auc_score
)
import numpy as np

def diagnose_model(model, X_test, y_test, class_names):
    y_pred  = model.predict(X_test)
    y_proba = model.predict_proba(X_test)

    print("=" * 55)
    print("MODEL DIAGNOSIS REPORT")
    print("=" * 55)

    # Overall accuracy
    acc = accuracy_score(y_test, y_pred)
    print(f"\nAccuracy: {acc:.3f}")

    # Confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    print(f"\nConfusion Matrix:")
    print(cm)

    # Per-class report
    print(f"\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=class_names))

    # For binary: show TP/TN/FP/FN breakdown
    if len(class_names) == 2:
        tn, fp, fn, tp = cm.ravel()
        print(f"True Positives:  {tp}")
        print(f"True Negatives:  {tn}")
        print(f"False Positives: {fp}  <- wrong positive predictions")
        print(f"False Negatives: {fn}  <- missed actual positives")
        print(f"\nROC-AUC: {roc_auc_score(y_test, y_proba[:, 1]):.3f}")

    print("=" * 55)

# Use it
diagnose_model(model, X_test, y_test, data.target_names)

Quick Cheat Sheet

Term         Formula            Meaning
Accuracy     (TP+TN) / total    Overall correct %
Precision    TP / (TP+FP)       When I say positive, am I right?
Recall       TP / (TP+FN)       Did I catch all actual positives?
F1 Score     2*(P*R)/(P+R)      Balance of precision and recall
Specificity  TN / (TN+FP)       Did I correctly identify negatives?
FPR          FP / (FP+TN)       How often did I false alarm?

Task                  Code
Build matrix          confusion_matrix(y_test, y_pred)
Visualize             ConfusionMatrixDisplay(cm).plot()
Normalize             confusion_matrix(y_test, y_pred, normalize='true')
Full report           classification_report(y_test, y_pred)
Extract TP/TN/FP/FN   tn, fp, fn, tp = cm.ravel()   (binary only)

Practice Challenges

Level 1:
Train any classifier on load_breast_cancer(). Print the confusion matrix. Calculate precision and recall by hand from the TP, TN, FP, FN values. Then verify with classification_report.

Level 2:
Create an extremely imbalanced dataset (99% class 0, 1% class 1). Train a LogisticRegression on it. What is the accuracy? What does the confusion matrix look like? Now add class_weight='balanced' and retrain. How does the confusion matrix change?

Level 3:
On the fraud-like imbalanced dataset from this post, sweep thresholds from 0.1 to 0.9. For each threshold, compute FN and FP. Plot FN against FP as the threshold changes; this traces out the same tradeoff the precision-recall curve captures. Where would you put the threshold if missing fraud costs 10x more than a false alarm?


Next up, Post 64: Precision and Recall: Beyond Accuracy. We go deep on the tradeoff between catching everything and being right when you do. The F1 score, when to use each metric, and how to pick the right one for your problem.
