My fraud detection model achieved 99% accuracy in testing. I deployed it to production, and it caught exactly zero fraudulent transactions. The model was predicting "not fraud" for every single transaction.
## The Accuracy Paradox
Here's the dataset that fooled me: 10,000 transactions, 100 fraudulent (1%), 9,900 legitimate (99%). A model that predicts "not fraud" for everything gets 99% accuracy without learning anything useful.
```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Simulated predictions: model predicts "not fraud" (0) for everything
y_true = np.array([0]*9900 + [1]*100)  # 100 frauds in 10,000 transactions
y_pred = np.array([0]*10000)           # model predicts all "not fraud"

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.3f}")                    # 0.990 - looks great!
print(f"Precision: {precision_score(y_true, y_pred, zero_division=0):.3f}")  # 0.000 - disaster
print(f"Recall:    {recall_score(y_true, y_pred):.3f}")                      # 0.000 - catches nothing
print(f"F1 Score:  {f1_score(y_true, y_pred):.3f}")                          # 0.000 - useless model
```
Accuracy is (TP + TN) / Total. When 99% of samples are negative, the model gets 9,900 true negatives for free. The 100 missed frauds barely dent the accuracy.
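To make that arithmetic concrete, here's the formula applied by hand to the counts from the example above (0 TP, 9,900 TN, 0 FP, 100 FN):

```python
# Confusion-matrix counts for the all-"not fraud" model
tp, tn, fp, fn = 0, 9900, 0, 100

# Accuracy = (TP + TN) / Total -- the 9,900 free true negatives dominate
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.99
```

The 100 false negatives move accuracy by just one percentage point, even though they are 100% of the class we care about.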
In my deep dive into precision, recall, and the confusion matrix, I found that what you measure determines what you optimize. If you measure accuracy on imbalanced data, you'll build a model that ignores the minority class.
## The Confusion Matrix: What's Actually Happening
Here's the confusion matrix for my "99% accurate" fraud detector:
```python
from sklearn.metrics import confusion_matrix
import pandas as pd

cm = confusion_matrix(y_true, y_pred)
cm_df = pd.DataFrame(cm,
                     index=['Actual: Legit', 'Actual: Fraud'],
                     columns=['Predicted: Legit', 'Predicted: Fraud'])
print(cm_df)
```
| | Predicted: Legit | Predicted: Fraud |
|---|---|---|
| Actual: Legit | 9,900 (TN) | 0 (FP) |
| Actual: Fraud | 100 (FN) | 0 (TP) |
- True Negatives (TN): 9,900 — correctly identified legitimate transactions
- False Positives (FP): 0 — legitimate transactions flagged as fraud
- False Negatives (FN): 100 — frauds that slipped through (disaster!)
- True Positives (TP): 0 — frauds correctly caught
The model raises no false alarms, but only because it never predicts fraud at all: precision is undefined (0/0, reported as 0 by sklearn with `zero_division=0`) and recall is zero. Accuracy hides all of this because TN dominates the calculation.
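You can pull those four counts straight out of the confusion matrix and recompute precision and recall by hand. A small self-contained sketch repeating the setup above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0]*9900 + [1]*100)  # 1% fraud
y_pred = np.array([0]*10000)           # degenerate all-"not fraud" model

# ravel() flattens the 2x2 matrix in (tn, fp, fn, tp) order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Guard the precision denominator: with zero positive predictions it's 0/0
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn)

print(tn, fp, fn, tp)     # 9900 0 100 0
print(precision, recall)  # 0.0 0.0
```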
## The Right Metrics for Imbalanced Data
| Metric | Formula | What It Measures | When to Use |
|---|---|---|---|
| Precision | TP / (TP + FP) | Of all fraud predictions, how many were correct? | When false alarms are expensive (e.g., blocking legitimate transactions) |
| Recall | TP / (TP + FN) | Of all actual frauds, how many did we catch? | When missing positives is expensive (e.g., letting fraud through) |
| F1 Score | 2 × (Precision × Recall) / (Precision + Recall) | Harmonic mean of precision and recall | When you need a single metric balancing both |
| F2 Score | 5 × (Precision × Recall) / (4 × Precision + Recall) | Weighted toward recall | When recall matters more than precision |
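A quick sanity check of the F1 and F2 formulas with plug-in values (precision 0.30, recall 0.70 — arbitrary numbers chosen for illustration):

```python
# Verify the table's formulas numerically
p, r = 0.30, 0.70

f1 = 2 * p * r / (p + r)          # harmonic mean
f2 = 5 * p * r / (4 * p + r)      # F-beta with beta=2, weighted toward recall

print(round(f1, 3))  # 0.42
print(round(f2, 3))  # 0.553
```

Note that F2 (0.553) sits closer to recall (0.70) than F1 does, which is exactly the point of weighting it: sklearn's `fbeta_score(y_true, y_pred, beta=2)` computes the same quantity directly from labels.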
Here's a fraud detector trained on the same imbalance with class weighting — the kind of model that trades a few accuracy points for actual fraud-catching ability (note that the features below are random noise, so treat the printed numbers as illustrative of the workflow, not of real performance):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Generate an imbalanced dataset (synthetic features, for illustration)
np.random.seed(42)
X = np.random.randn(10000, 5)
y = np.array([0]*9900 + [1]*100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Train with class weights to handle imbalance
model = RandomForestClassifier(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f"Accuracy:  {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred, zero_division=0):.3f}")
print(f"Recall:    {recall_score(y_test, y_pred):.3f}")
print(f"F1 Score:  {f1_score(y_test, y_pred):.3f}")
```
On real features, a model like this might give up accuracy (say, down to 85%) in exchange for catching 70% of frauds at 30% precision — far more useful than 99% accuracy with zero recall.
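If you want every per-class metric in one table instead of four separate calls, sklearn's `classification_report` does it in one shot. A sketch that rebuilds the same synthetic setup so it runs standalone:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

np.random.seed(42)
X = np.random.randn(10000, 5)
y = np.array([0]*9900 + [1]*100)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = RandomForestClassifier(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)

# Precision, recall, F1, and support for each class in one table
report = classification_report(y_test, model.predict(X_test),
                               target_names=['legit', 'fraud'], zero_division=0)
print(report)
```

The `support` column is especially useful on imbalanced data: it reminds you that the fraud row is computed from only a handful of test samples.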
## The Precision-Recall Tradeoff
You can't maximize both precision and recall simultaneously. Adjusting the classification threshold shifts the balance:
```python
from sklearn.metrics import precision_recall_curve

# Get probability predictions instead of hard classifications
y_proba = model.predict_proba(X_test)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)

# Find the highest threshold that still catches 90% of frauds.
# recalls is decreasing, so take the LAST index where recall >= 0.9;
# thresholds has one fewer entry than precisions/recalls, hence the clamp.
idx = min(np.where(recalls >= 0.9)[0][-1], len(thresholds) - 1)

print(f"To catch 90% of frauds, accept {precisions[idx]:.1%} precision")
print(f"Use threshold: {thresholds[idx]:.3f}")
```
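If you need a single threshold-free number to compare models on imbalanced data, average precision (the area under the PR curve) is a common choice — its baseline for a random classifier is the positive rate, not 0.5. A sketch on synthetic scores (class sizes and score ranges here are made up for illustration):

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)

# 1% positives, like the fraud setting above
y_test = np.array([0]*990 + [1]*10)

# Synthetic scores: frauds tend to score higher, with some overlap
y_proba = np.concatenate([rng.uniform(0.0, 0.6, 990),
                          rng.uniform(0.3, 1.0, 10)])

ap = average_precision_score(y_test, y_proba)
print(f"Average precision: {ap:.3f} (random baseline here: 0.010)")
```

Unlike accuracy, a degenerate model can't game this metric: ranking everything identically collapses average precision toward the 1% baseline.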
Production decision framework:
- High-stakes fraud (credit cards): Optimize for recall, accept more false alarms
- Low-stakes spam (email): Optimize for precision, let some spam through
- Medical diagnosis: Optimize for recall in screening, precision in confirmation
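The framework above can be made quantitative: assign a dollar cost to each false positive and false negative, sweep thresholds, and pick the cheapest one. A sketch with made-up costs and synthetic scores (every number here is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
y_true = np.array([0]*990 + [1]*10)
y_proba = np.concatenate([rng.uniform(0.0, 0.6, 990),
                          rng.uniform(0.3, 1.0, 10)])

COST_FP = 5    # assumed cost of blocking a legitimate transaction
COST_FN = 500  # assumed cost of letting one fraud through

best_threshold, best_cost = None, float('inf')
for t in np.linspace(0.05, 0.95, 19):
    pred = (y_proba >= t).astype(int)
    fp = int(((pred == 1) & (y_true == 0)).sum())
    fn = int(((pred == 0) & (y_true == 1)).sum())
    cost = fp * COST_FP + fn * COST_FN
    if cost < best_cost:
        best_threshold, best_cost = t, cost

print(f"Cheapest threshold: {best_threshold:.2f} (total cost {best_cost})")
```

With fraud 100x more expensive than a false alarm, the cheapest threshold lands well below 0.5 — which is the whole argument of the next section.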
## What Most Tutorials Miss
The biggest mistake I made was using `model.predict()` directly. This uses a fixed 0.5 threshold, which is wrong for imbalanced data. Instead, use `predict_proba()` and choose your own threshold:
```python
# WRONG: fixed 0.5 threshold
y_pred_wrong = model.predict(X_test)

# RIGHT: custom threshold based on business needs
y_proba = model.predict_proba(X_test)[:, 1]
threshold = 0.3  # lower threshold = higher recall, lower precision
y_pred_right = (y_proba >= threshold).astype(int)

print(f"Recall at 0.5 threshold: {recall_score(y_test, y_pred_wrong):.3f}")
print(f"Recall at 0.3 threshold: {recall_score(y_test, y_pred_right):.3f}")
```
Another gotcha: `train_test_split` without `stratify=y` can leave one split with far too few frauds. Always use stratified splitting on imbalanced data:
```python
# WRONG: a random split can skew the already-tiny fraud class
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# RIGHT: stratified split maintains the class distribution in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
```
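A quick way to confirm stratification worked: both splits should keep the original 1% fraud rate. A minimal check with dummy features:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.zeros((10000, 1))               # dummy features, only labels matter here
y = np.array([0]*9900 + [1]*100)       # 1% fraud

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Both splits keep the 1% positive rate exactly (80/8000 and 20/2000)
print(y_tr.mean(), y_te.mean())  # 0.01 0.01
```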
## Key Takeaways for Developers
- Accuracy is meaningless on imbalanced data — a model predicting the majority class gets high accuracy without learning anything
- Use precision when false alarms are expensive, recall when missing positives is expensive, F1 when you need balance
- Always check the confusion matrix to see what's actually happening (TP, TN, FP, FN)
- Adjust the classification threshold based on business needs — don't use the default 0.5
- Use `class_weight='balanced'` in sklearn models to handle imbalance automatically
The fraud detector that looked perfect in testing was useless in production because I measured the wrong thing. If you want to experiment with different metrics and thresholds interactively, check out the confusion matrix visualizer — it shows exactly how precision, recall, and F1 change as you adjust the threshold.
For more on evaluation metrics, see the scikit-learn classification metrics guide and this excellent paper on the precision-recall tradeoff.