My fraud detection model achieved 99% accuracy in testing. I deployed it to production, and it caught exactly zero fraudulent transactions. The model was predicting "not fraud" for every single transaction.
## The Accuracy Paradox
Here's the dataset that fooled me: 10,000 transactions, 100 fraudulent (1%), 9,900 legitimate (99%). A model that predicts "not fraud" for everything gets 99% accuracy without learning anything useful.
```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Simulated predictions: model predicts "not fraud" (0) for everything
y_true = np.array([0]*9900 + [1]*100)  # 100 frauds in 10,000 transactions
y_pred = np.array([0]*10000)           # model predicts all "not fraud"

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.3f}")                    # 0.990 - looks great!
print(f"Precision: {precision_score(y_true, y_pred, zero_division=0):.3f}")  # 0.000 - disaster
print(f"Recall:    {recall_score(y_true, y_pred):.3f}")                      # 0.000 - catches nothing
print(f"F1 Score:  {f1_score(y_true, y_pred):.3f}")                          # 0.000 - useless model
```
Accuracy is (TP + TN) / Total. When 99% of samples are negative, the model gets 9,900 true negatives for free. The 100 missed frauds barely dent the accuracy.
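To make that arithmetic concrete, here's the formula applied by hand to the counts from the example above (0 TP, 9,900 TN, 0 FP, 100 FN):

```python
# Confusion-matrix counts for the all-"not fraud" model
tp, tn, fp, fn = 0, 9900, 0, 100

# Accuracy = (TP + TN) / Total -- the 9,900 free true negatives dominate
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.99
```

The 100 false negatives move accuracy by just one percentage point, even though they are 100% of the class we care about.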
In my deep dive into precision, recall, and the confusion matrix, I found that what you measure determines what you optimize. If you measure accuracy on imbalanced data, you'll build a model that ignores the minority class.
## The Confusion Matrix: What's Actually Happening
Here's the confusion matrix for my "99% accurate" fraud detector:
```python
from sklearn.metrics import confusion_matrix
import pandas as pd

cm = confusion_matrix(y_true, y_pred)
cm_df = pd.DataFrame(cm,
                     index=['Actual: Legit', 'Actual: Fraud'],
                     columns=['Predicted: Legit', 'Predicted: Fraud'])
print(cm_df)
```
| | Predicted: Legit | Predicted: Fraud |
|---|---|---|
| Actual: Legit | 9,900 (TN) | 0 (FP) |
| Actual: Fraud | 100 (FN) | 0 (TP) |
- True Negatives (TN): 9,900 — correctly identified legitimate transactions
- False Positives (FP): 0 — legitimate transactions flagged as fraud
- False Negatives (FN): 100 — frauds that slipped through (disaster!)
- True Positives (TP): 0 — frauds correctly caught
The model raises no false alarms, but only because it never predicts fraud at all: precision is undefined (0/0, reported as 0 by sklearn with `zero_division=0`) and recall is zero. Accuracy hides all of this because TN dominates the calculation.
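You can pull those four counts straight out of the confusion matrix and recompute precision and recall by hand. A small self-contained sketch repeating the setup above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0]*9900 + [1]*100)  # 1% fraud
y_pred = np.array([0]*10000)           # degenerate all-"not fraud" model

# ravel() flattens the 2x2 matrix in (tn, fp, fn, tp) order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Guard the precision denominator: with zero positive predictions it's 0/0
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn)

print(tn, fp, fn, tp)     # 9900 0 100 0
print(precision, recall)  # 0.0 0.0
```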
## The Right Metrics for Imbalanced Data
| Metric | Formula | What It Measures | When to Use |
|---|---|---|---|
| Precision | TP / (TP + FP) | Of all fraud predictions, how many were correct? | When false alarms are expensive (e.g., blocking legitimate transactions) |
| Recall | TP / (TP + FN) | Of all actual frauds, how many did we catch? | When missing positives is expensive (e.g., letting fraud through) |
| F1 Score | 2 × (Precision × Recall) / (Precision + Recall) | Harmonic mean of precision and recall | When you need a single metric balancing both |
| F2 Score | 5 × (Precision × Recall) / (4 × Precision + Recall) | Weighted toward recall | When recall matters more than precision |
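A quick sanity check of the F1 and F2 formulas with plug-in values (precision 0.30, recall 0.70 — arbitrary numbers chosen for illustration):

```python
# Verify the table's formulas numerically
p, r = 0.30, 0.70

f1 = 2 * p * r / (p + r)          # harmonic mean
f2 = 5 * p * r / (4 * p + r)      # F-beta with beta=2, weighted toward recall

print(round(f1, 3))  # 0.42
print(round(f2, 3))  # 0.553
```

Note that F2 (0.553) sits closer to recall (0.70) than F1 does, which is exactly the point of weighting it: sklearn's `fbeta_score(y_true, y_pred, beta=2)` computes the same quantity directly from labels.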
Here's a fraud detector trained on the same imbalance with class weighting — the kind of model that trades a few accuracy points for actual fraud-catching ability (note that the features below are random noise, so treat the printed numbers as illustrative of the workflow, not of real performance):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Generate an imbalanced dataset (synthetic features, for illustration)
np.random.seed(42)
X = np.random.randn(10000, 5)
y = np.array([0]*9900 + [1]*100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Train with class weights to handle imbalance
model = RandomForestClassifier(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f"Accuracy:  {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred, zero_division=0):.3f}")
print(f"Recall:    {recall_score(y_test, y_pred):.3f}")
print(f"F1 Score:  {f1_score(y_test, y_pred):.3f}")
```
On real features, a model like this might give up accuracy (say, down to 85%) in exchange for catching 70% of frauds at 30% precision — far more useful than 99% accuracy with zero recall.
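If you want every per-class metric in one table instead of four separate calls, sklearn's `classification_report` does it in one shot. A sketch that rebuilds the same synthetic setup so it runs standalone:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

np.random.seed(42)
X = np.random.randn(10000, 5)
y = np.array([0]*9900 + [1]*100)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = RandomForestClassifier(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)

# Precision, recall, F1, and support for each class in one table
report = classification_report(y_test, model.predict(X_test),
                               target_names=['legit', 'fraud'], zero_division=0)
print(report)
```

The `support` column is especially useful on imbalanced data: it reminds you that the fraud row is computed from only a handful of test samples.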
## The Precision-Recall Tradeoff
You can't maximize both precision and recall simultaneously. Adjusting the classification threshold shifts the balance:
```python
from sklearn.metrics import precision_recall_curve

# Get probability predictions instead of hard classifications
y_proba = model.predict_proba(X_test)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)

# Find the highest threshold that still catches 90% of frauds.
# recalls is decreasing, so take the LAST index where recall >= 0.9;
# thresholds has one fewer entry than precisions/recalls, hence the clamp.
idx = min(np.where(recalls >= 0.9)[0][-1], len(thresholds) - 1)

print(f"To catch 90% of frauds, accept {precisions[idx]:.1%} precision")
print(f"Use threshold: {thresholds[idx]:.3f}")
```
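If you need a single threshold-free number to compare models on imbalanced data, average precision (the area under the PR curve) is a common choice — its baseline for a random classifier is the positive rate, not 0.5. A sketch on synthetic scores (class sizes and score ranges here are made up for illustration):

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)

# 1% positives, like the fraud setting above
y_test = np.array([0]*990 + [1]*10)

# Synthetic scores: frauds tend to score higher, with some overlap
y_proba = np.concatenate([rng.uniform(0.0, 0.6, 990),
                          rng.uniform(0.3, 1.0, 10)])

ap = average_precision_score(y_test, y_proba)
print(f"Average precision: {ap:.3f} (random baseline here: 0.010)")
```

Unlike accuracy, a degenerate model can't game this metric: ranking everything identically collapses average precision toward the 1% baseline.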
Production decision framework:
- High-stakes fraud (credit cards): Optimize for recall, accept more false alarms
- Low-stakes spam (email): Optimize for precision, let some spam through
- Medical diagnosis: Optimize for recall in screening, precision in confirmation
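The framework above can be made quantitative: assign a dollar cost to each false positive and false negative, sweep thresholds, and pick the cheapest one. A sketch with made-up costs and synthetic scores (every number here is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
y_true = np.array([0]*990 + [1]*10)
y_proba = np.concatenate([rng.uniform(0.0, 0.6, 990),
                          rng.uniform(0.3, 1.0, 10)])

COST_FP = 5    # assumed cost of blocking a legitimate transaction
COST_FN = 500  # assumed cost of letting one fraud through

best_threshold, best_cost = None, float('inf')
for t in np.linspace(0.05, 0.95, 19):
    pred = (y_proba >= t).astype(int)
    fp = int(((pred == 1) & (y_true == 0)).sum())
    fn = int(((pred == 0) & (y_true == 1)).sum())
    cost = fp * COST_FP + fn * COST_FN
    if cost < best_cost:
        best_threshold, best_cost = t, cost

print(f"Cheapest threshold: {best_threshold:.2f} (total cost {best_cost})")
```

With fraud 100x more expensive than a false alarm, the cheapest threshold lands well below 0.5 — which is the whole argument of the next section.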
## What Most Tutorials Miss
The biggest mistake I made was using `model.predict()` directly. This uses a fixed 0.5 threshold, which is wrong for imbalanced data. Instead, use `predict_proba()` and choose your own threshold:
```python
# WRONG: fixed 0.5 threshold
y_pred_wrong = model.predict(X_test)

# RIGHT: custom threshold based on business needs
y_proba = model.predict_proba(X_test)[:, 1]
threshold = 0.3  # lower threshold = higher recall, lower precision
y_pred_right = (y_proba >= threshold).astype(int)

print(f"Recall at 0.5 threshold: {recall_score(y_test, y_pred_wrong):.3f}")
print(f"Recall at 0.3 threshold: {recall_score(y_test, y_pred_right):.3f}")
```
Another gotcha: `train_test_split` without `stratify=y` can leave one split with far too few frauds. Always use stratified splitting on imbalanced data:
```python
# WRONG: a random split can skew the already-tiny fraud class
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# RIGHT: stratified split maintains the class distribution in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
```
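A quick way to confirm stratification worked: both splits should keep the original 1% fraud rate. A minimal check with dummy features:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.zeros((10000, 1))               # dummy features, only labels matter here
y = np.array([0]*9900 + [1]*100)       # 1% fraud

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Both splits keep the 1% positive rate exactly (80/8000 and 20/2000)
print(y_tr.mean(), y_te.mean())  # 0.01 0.01
```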
## Key Takeaways for Developers
- Accuracy is meaningless on imbalanced data — a model predicting the majority class gets high accuracy without learning anything
- Use precision when false alarms are expensive, recall when missing positives is expensive, F1 when you need balance
- Always check the confusion matrix to see what's actually happening (TP, TN, FP, FN)
- Adjust the classification threshold based on business needs — don't use the default 0.5
- Use `class_weight='balanced'` in sklearn models to handle imbalance automatically
The fraud detector that looked perfect in testing was useless in production because I measured the wrong thing. If you want to experiment with different metrics and thresholds interactively, check out the confusion matrix visualizer — it shows exactly how precision, recall, and F1 change as you adjust the threshold.
For more on evaluation metrics, see the scikit-learn classification metrics guide and this excellent paper on the precision-recall tradeoff.