If you've ever built a classification model, you probably started by measuring its accuracy. But what happens when your data is imbalanced?
Example: Spam Detector
Imagine you’re building a spam detector.
- Out of 100 emails in the dataset, 99 are not spam and only 1 is spam.
A naive model could just predict every email as "not spam" and still get 99% accuracy, but it fails to detect the single spam email, so it never learns to recognize spam.
Even if you feed it emails with strong spam-like features, it will still call them "not spam."
This isn't really overfitting; it's a class-imbalance issue (the model is biased toward the majority class), as the sketch below illustrates.
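Here is a minimal sketch of that accuracy paradox (the 99/1 split from the example above; no real model needed, since the naive model always outputs 0):

```python
# Accuracy paradox: 99 legitimate emails (0) and 1 spam email (1)
y_true = [0] * 99 + [1]
y_pred = [0] * 100  # naive model: always predict "not spam"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.99, yet the one spam email is missed
```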
Sample dataset for illustration:
| Words_Caps | Num_Links | Email_Length | Spam |
|---|---|---|---|
| 5 | 2 | 120 | 1 |
| 3 | 0 | 80 | 1 |
| 0 | 1 | 200 | 0 |
| 1 | 0 | 150 | 0 |
| 7 | 3 | 95 | 1 |
| 0 | 0 | 300 | 0 |
| 1 | 0 | 250 | 0 |
| 2 | 1 | 180 | 0 |
| 6 | 4 | 110 | 1 |
| 0 | 0 | 220 | 0 |
Why Thresholds Matter
Most classification models don’t directly say “Yes” or “No.”
After you train a classifier (logistic regression, random forest, XGBoost, etc.) and call predict_proba or an equivalent function, the model returns a probability for each class.
| Words_Caps | Num_Links | Email_Length | Spam | Probability_score |
|---|---|---|---|---|
| 5 | 2 | 120 | 1 | 0.6 |
| 3 | 0 | 80 | 1 | 0.3 |
| 0 | 1 | 200 | 0 | 0.6 |
| 1 | 0 | 150 | 0 | 0.3 |
| 7 | 3 | 95 | 1 | 0.8 |
| 0 | 0 | 300 | 0 | 0.5 |
| 1 | 0 | 250 | 0 | 0.3 |
| 2 | 1 | 180 | 0 | 0.6 |
| 6 | 4 | 110 | 1 | 0.9 |
| 0 | 0 | 220 | 0 | 0.2 |
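As a sketch of where such scores come from, here's a scikit-learn LogisticRegression trained on the features above (an assumption for illustration; the Probability_score values in the table are made up, not actual model output):

```python
from sklearn.linear_model import LogisticRegression

# Features and labels from the sample dataset
X = [[5, 2, 120], [3, 0, 80], [0, 1, 200], [1, 0, 150], [7, 3, 95],
     [0, 0, 300], [1, 0, 250], [2, 1, 180], [6, 4, 110], [0, 0, 220]]
y = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]

model = LogisticRegression(max_iter=1000).fit(X, y)

# predict_proba returns one row per email: [P(not spam), P(spam)]
spam_probability = model.predict_proba(X)[:, 1]
print(spam_probability.round(2))
```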
Since the model outputs probabilities rather than labels, we have to convert them to Yes/No by setting a threshold (default 0.5):

```python
# Probability_score column from the table above
probabilities = [0.6, 0.3, 0.6, 0.3, 0.8, 0.5, 0.3, 0.6, 0.9, 0.2]

threshold = 0.5
predicted = [1 if p >= threshold else 0 for p in probabilities]
print(predicted)  # [1, 0, 1, 0, 1, 1, 0, 1, 1, 0]
```
At threshold = 0.5, comparing these predictions with the true Spam labels gives the following confusion matrix:

| | Predicted Spam | Predicted Not Spam |
|---|---|---|
| Actually Spam | 3 | 1 |
| Actually Not Spam | 3 | 3 |

That means:
- True Positives: 3
- False Negatives: 1
- False Positives: 3
- True Negatives: 3
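You can verify these counts with scikit-learn's confusion_matrix (a sketch using the labels and predictions from above):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]    # Spam column
predicted = [1, 0, 1, 0, 1, 1, 0, 1, 1, 0]  # predictions at threshold = 0.5

# sklearn flattens the 2x2 matrix in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, predicted).ravel()
print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")  # TP=3, FN=1, FP=3, TN=3
```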
To improve results, you’d have to keep changing the threshold from 0.0 to 1.0 and checking the confusion matrix each time.
But that’s messy and time-consuming.
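Here's what that manual loop looks like (a sketch reusing the sample labels and scores; the candidate thresholds are arbitrary):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
probabilities = [0.6, 0.3, 0.6, 0.3, 0.8, 0.5, 0.3, 0.6, 0.9, 0.2]

# Re-check the confusion matrix at each candidate threshold by hand
for threshold in [0.1, 0.3, 0.5, 0.7, 0.9]:
    predicted = [1 if p >= threshold else 0 for p in probabilities]
    tn, fp, fn, tp = confusion_matrix(y_true, predicted).ravel()
    print(f"threshold={threshold}: TP={tp} FN={fn} FP={fp} TN={tn}")
```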
Receiver Operating Characteristic (ROC)
Instead of testing thresholds manually, the ROC curve does this for you.
For each threshold, we compute:
- TPR (True Positive Rate / Recall) = TP / (TP + FN)
- FPR (False Positive Rate) = FP / (FP + TN)
Then, we plot TPR vs FPR for all thresholds (from 0.0 → 1.0).
- At threshold = 0.0, everything is predicted Yes, so we start at point (1,1).
- At threshold = 1.0, everything is predicted No, so we end at point (0,0).
- In between, we get a curve that shows the trade-off between catching positives and avoiding false alarms.
The ROC curve helps identify thresholds that give you high TPR and low FPR.
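A quick sketch of those (FPR, TPR) points for the sample data, computed directly from the formulas above (each distinct score is a candidate threshold):

```python
y_true = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
probabilities = [0.6, 0.3, 0.6, 0.3, 0.8, 0.5, 0.3, 0.6, 0.9, 0.2]

for threshold in sorted(set(probabilities), reverse=True):
    predicted = [1 if p >= threshold else 0 for p in probabilities]
    tp = sum(p == 1 and t == 1 for p, t in zip(predicted, y_true))
    fn = sum(p == 0 and t == 1 for p, t in zip(predicted, y_true))
    fp = sum(p == 1 and t == 0 for p, t in zip(predicted, y_true))
    tn = sum(p == 0 and t == 0 for p, t in zip(predicted, y_true))
    tpr = tp / (tp + fn)  # fraction of spam we catch
    fpr = fp / (fp + tn)  # fraction of legitimate mail we flag
    print(f"threshold={threshold}: (FPR={fpr:.2f}, TPR={tpr:.2f})")
```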
Area Under the Curve (AUC)
Here’s the key part:
- ROC → Helps visualize trade-offs and choose a threshold
- AUC → A single number that measures how well the model separates classes, independent of any threshold
Interpretation:
- AUC = 1.0 → Perfect separation
- AUC = 0.5 → Random guessing
- AUC < 0.5 → Worse than random guessing (the model's ranking is inverted)
Think of AUC as a summary score of your model’s ranking ability.
It tells you how often your model ranks a real positive higher than a real negative.
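You can check that interpretation directly: compare every (positive, negative) pair and count how often the positive email gets the higher score, with ties counting as half (a sketch on the sample scores):

```python
from itertools import product
from sklearn.metrics import roc_auc_score

y_true = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
scores = [0.6, 0.3, 0.6, 0.3, 0.8, 0.5, 0.3, 0.6, 0.9, 0.2]

pos = [s for s, y in zip(scores, y_true) if y == 1]
neg = [s for s, y in zip(scores, y_true) if y == 0]

# Fraction of correctly ranked (positive, negative) pairs; ties count as half
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
           for p, n in product(pos, neg))
print(wins / (len(pos) * len(neg)))   # ~0.79 for this data
print(roc_auc_score(y_true, scores))  # identical: AUC is this ranking probability
```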
ROC-AUC plot in Python using scikit-learn
```python
import pandas as pd
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Sample labels and scores from the table above
df = pd.DataFrame({
    "Spam":              [1, 1, 0, 0, 1, 0, 0, 0, 1, 0],
    "Probability_score": [0.6, 0.3, 0.6, 0.3, 0.8, 0.5, 0.3, 0.6, 0.9, 0.2],
})

# Compute ROC curve: one (FPR, TPR) point per candidate threshold
fpr, tpr, thresholds = roc_curve(df["Spam"], df["Probability_score"])

# Compute AUC
auc = roc_auc_score(df["Spam"], df["Probability_score"])
print("AUC Score:", auc)  # ~0.79 for this data

# Plot ROC curve
plt.figure(figsize=(6, 6))
plt.plot(fpr, tpr, marker='o', label=f'ROC curve (AUC={auc:.2f})')
plt.plot([0, 1], [0, 1], linestyle='--', color='gray', label='Random Guessing')
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('ROC Curve for Spam Detector')
plt.legend()
plt.grid(True)
plt.show()
```
- ROC alone doesn't give a single threshold; it gives all possible thresholds.
- The "best" threshold depends on your problem.
- The ideal point on an ROC curve is the top-left corner (FPR = 0, TPR = 1), which represents a perfect classifier.
- The point at (FPR ≈ 0.33, TPR = 0.75) looks like a strong candidate that catches most positives.
- Imagine you absolutely cannot tolerate important emails going to spam. You want an FPR as close to 0 as possible. In this case, you'd choose the point at (FPR = 0, TPR = 0.5).
- A common method is Youden's J statistic: J = TPR - FPR. Pick the threshold that maximizes J, giving the best trade-off (see the sketch below).
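A sketch of Youden's J using roc_curve's output on the sample data:

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
scores = [0.6, 0.3, 0.6, 0.3, 0.8, 0.5, 0.3, 0.6, 0.9, 0.2]

fpr, tpr, thresholds = roc_curve(y_true, scores)
j = tpr - fpr                 # Youden's J at each threshold
best = np.argmax(j)
print(f"best threshold = {thresholds[best]}, J = {j[best]:.2f}")
# For this sample data it picks threshold 0.8 (FPR=0, TPR=0.5, J=0.50)
```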