Beyond Accuracy: How ROC-AUC Reveals the True Power of Your Model

Akshay Shetty — Tue, 30 Sep 2025 17:47:05 +0000

If you've ever built a classification model, you probably started by measuring its accuracy. But what happens when your data is imbalanced?

Example: Spam Detector

Imagine you’re building a spam detector.

Out of 100 emails in the dataset, 99 are not spam and only 1 is spam.
A naive model could just predict every email as “not spam” and still get 99% accuracy, yet it fails to detect the single spam email. A model trained this way can score well without ever learning to recognize spam.
Even if you feed it emails with strong spam-like features, it will still call them “not spam.”
This isn’t really overfitting; it’s an imbalance issue (the model is biased toward the majority class).
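To make the accuracy trap concrete, here is a minimal sketch of the 99-to-1 split described above. The label lists are constructed purely for illustration, and the "model" is just a list of zeros standing in for a classifier that always predicts the majority class.

```python
from sklearn.metrics import accuracy_score, recall_score

# Illustrative labels: 99 legitimate emails (0) and 1 spam email (1)
y_true = [0] * 99 + [1]

# A stand-in "model" that always predicts the majority class
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)  # share of actual spam that was caught

print(f"Accuracy: {accuracy:.2f}")  # 0.99
print(f"Recall:   {recall:.2f}")    # 0.00
```

Accuracy looks excellent, while recall on the spam class shows that nothing was caught.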

Sample dataset for illustration:

Words_Caps	Num_Links	Email_Length	Spam
5	2	120	1
3	0	80	1
0	1	200	0
1	0	150	0
7	3	95	1
0	0	300	0
1	0	250	0
2	1	180	0
6	4	110	1
0	0	220	0

Why Thresholds Matter

Most classification models don’t directly say “Yes” or “No.”
After you train a classifier (like logistic regression, random forest, XGBoost, etc.), when you call predict_proba or an equivalent function, the model gives probabilities for each class.

Words_Caps	Num_Links	Email_Length	Spam	Probability_score
5	2	120	1	0.6
3	0	80	1	0.3
0	1	200	0	0.6
1	0	150	0	0.3
7	3	95	1	0.8
0	0	300	0	0.5
1	0	250	0	0.3
2	1	180	0	0.6
6	4	110	1	0.9
0	0	220	0	0.2
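As a rough sketch of where such scores come from, the snippet below fits a scikit-learn `LogisticRegression` on the sample dataset and calls `predict_proba`. The choice of model and the `max_iter` value are assumptions made for illustration, and the resulting scores will not match the hand-picked Probability_score column above. Fitting and scoring on the same ten rows is only meant to show the shape of the output, not a sound evaluation.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "Words_Caps":   [5, 3, 0, 1, 7, 0, 1, 2, 6, 0],
    "Num_Links":    [2, 0, 1, 0, 3, 0, 0, 1, 4, 0],
    "Email_Length": [120, 80, 200, 150, 95, 300, 250, 180, 110, 220],
    "Spam":         [1, 1, 0, 0, 1, 0, 0, 0, 1, 0],
})

X = df[["Words_Caps", "Num_Links", "Email_Length"]]
y = df["Spam"]

# Illustrative model choice; any classifier with predict_proba would do
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# predict_proba returns one column per class, ordered as in model.classes_
proba = model.predict_proba(X)
spam_scores = proba[:, 1]  # column 1 is the estimated P(Spam = 1)
print(spam_scores.round(2))
```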

Because the model outputs a score rather than a Yes/No label, we have to set a threshold (0.5 is the usual default) to turn each probability into a class:

probabilities = [0.6, 0.3, 0.6, 0.3, 0.8, 0.5, 0.3, 0.6, 0.9, 0.2]  # Probability_score column
threshold = 0.5
predicted = [1 if p >= threshold else 0 for p in probabilities]

At threshold = 0.5, the confusion matrix for the ten emails above contains:

True Positives: 3
False Negatives: 1
False Positives: 3
True Negatives: 3
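These counts can be reproduced with scikit-learn's `confusion_matrix`. The sketch below assumes the Spam and Probability_score columns from the table are copied into two plain lists (the variable names are illustrative).

```python
from sklearn.metrics import confusion_matrix

# Spam labels and Probability_score values from the table
spam = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
probabilities = [0.6, 0.3, 0.6, 0.3, 0.8, 0.5, 0.3, 0.6, 0.9, 0.2]

threshold = 0.5
predicted = [1 if p >= threshold else 0 for p in probabilities]

# For binary 0/1 labels, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(spam, predicted).ravel()
print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")  # TP=3, FN=1, FP=3, TN=3
```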

To improve results, you’d have to keep changing the threshold from 0.0 to 1.0 and checking the confusion matrix each time.
But that’s messy and time-consuming.

Receiver Operating Characteristic (ROC)

Instead of testing thresholds one at a time, the ROC curve summarizes all of them in a single plot.

For each threshold, we compute:

TPR (True Positive Rate / Recall) = TP / (TP + FN)
FPR (False Positive Rate) = FP / (FP + TN)

Then, we plot TPR vs FPR for all thresholds (from 0.0 → 1.0).

At threshold = 0.0, everything is predicted Yes, so TPR = FPR = 1 and the curve sits at the point (1, 1).
At threshold = 1.0, no score in our table reaches the cutoff, so everything is predicted No and the curve sits at the point (0, 0).

In between, we get a curve that shows the trade-off between catching positives and avoiding false alarms.
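The sweep described above can be sketched by hand in a few lines. This is a plain-Python illustration using the table's labels and scores; the helper name `rates_at` is hypothetical, and only thresholds equal to an observed score are tried, since values in between cannot change any prediction.

```python
# Spam labels and Probability_score values from the table
spam = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
probabilities = [0.6, 0.3, 0.6, 0.3, 0.8, 0.5, 0.3, 0.6, 0.9, 0.2]


def rates_at(threshold):
    """Return (FPR, TPR) when scores >= threshold are labeled spam."""
    predicted = [1 if p >= threshold else 0 for p in probabilities]
    pairs = list(zip(spam, predicted))
    tp = sum(1 for y, y_hat in pairs if y == 1 and y_hat == 1)
    fn = sum(1 for y, y_hat in pairs if y == 1 and y_hat == 0)
    fp = sum(1 for y, y_hat in pairs if y == 0 and y_hat == 1)
    tn = sum(1 for y, y_hat in pairs if y == 0 and y_hat == 0)
    return fp / (fp + tn), tp / (tp + fn)


# 1.0 is added as a cutoff above every score so the curve reaches (0, 0)
thresholds = sorted(set(probabilities) | {1.0}, reverse=True)
roc_points = [(t, *rates_at(t)) for t in thresholds]

for t, fpr, tpr in roc_points:
    print(f"threshold={t:.1f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Each printed row is one point on the ROC curve, running from (0, 0) at the strictest cutoff to (1, 1) at the loosest.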

The ROC curve helps identify thresholds that give you high TPR and low FPR.

Area Under the Curve (AUC)

Here’s the key part:

ROC → Helps visualize trade-offs and choose a threshold
AUC → A single number that measures how well the model separates classes independent of threshold

Interpretation:

AUC = 1.0 → Perfect separation
AUC = 0.5 → Random guessing
AUC < 0.5 → Worse than random guessing (the model tends to rank negatives above positives)

Think of AUC as a summary score of your model’s ranking ability.

More precisely, it is the probability that your model gives a randomly chosen real positive a higher score than a randomly chosen real negative.
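The ranking interpretation can be checked directly by comparing every spam email with every non-spam email. The sketch below uses the table's labels and scores and counts ties as half a win, which is the usual convention and the one that agrees with `roc_auc_score`; the variable names are illustrative.

```python
from itertools import product

from sklearn.metrics import roc_auc_score

# Spam labels and Probability_score values from the table
spam = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
probabilities = [0.6, 0.3, 0.6, 0.3, 0.8, 0.5, 0.3, 0.6, 0.9, 0.2]

pos_scores = [p for p, y in zip(probabilities, spam) if y == 1]
neg_scores = [p for p, y in zip(probabilities, spam) if y == 0]

# Score each (positive, negative) pair: 1 if the positive ranks higher,
# 0.5 for a tie, 0 otherwise
wins = sum(
    1.0 if pos > neg else 0.5 if pos == neg else 0.0
    for pos, neg in product(pos_scores, neg_scores)
)
pairwise_auc = wins / (len(pos_scores) * len(neg_scores))

print(f"Pairwise AUC:  {pairwise_auc:.4f}")  # 19 / 24, about 0.7917
print(f"roc_auc_score: {roc_auc_score(spam, probabilities):.4f}")
```

Both numbers agree: out of the 24 positive-negative pairs, the positive is ranked higher in the equivalent of 19.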

ROC-AUC plot in Python using scikit-learn.


import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Labels and predicted probabilities from the table above
df = pd.DataFrame({
    "Spam": [1, 1, 0, 0, 1, 0, 0, 0, 1, 0],
    "Probability_score": [0.6, 0.3, 0.6, 0.3, 0.8, 0.5, 0.3, 0.6, 0.9, 0.2],
})

# Compute ROC curve: one (FPR, TPR) point per threshold
fpr, tpr, thresholds = roc_curve(df["Spam"], df["Probability_score"])

# Compute AUC
auc = roc_auc_score(df["Spam"], df["Probability_score"])

print("AUC Score:", auc)

# Plot ROC curve
plt.figure(figsize=(6, 6))
plt.plot(fpr, tpr, marker='o', label=f'ROC curve (AUC={auc:.2f})')
plt.plot([0, 1], [0, 1], linestyle='--', color='gray', label='Random Guessing')

plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('ROC Curve for Spam Detector')
plt.legend()
plt.grid(True)
plt.show()

The ROC curve alone doesn’t pick a threshold for you; it shows the trade-off at every possible threshold.
The “best” threshold depends on your problem.
The ideal point on an ROC curve is the top-left corner (FPR = 0, TPR = 1), which represents a perfect classifier.
On our data, the point at (FPR ≈ 0.33, TPR = 0.75), reached at threshold 0.6, looks like a strong candidate: it catches three of the four spam emails at the cost of two false alarms.
Now imagine you absolutely cannot tolerate important emails going to spam. You want an FPR as close to 0 as possible, so you'd choose the point at (FPR = 0, TPR = 0.5), reached at threshold 0.8.
A common method is Youden’s J statistic: J = TPR - FPR. Pick the threshold that maximizes J, which weights catching positives and avoiding false alarms equally.
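Here is a minimal sketch of threshold selection with Youden's J, built on the arrays that `roc_curve` returns. The labels and scores are the ones from the table, and taking the first maximum with `np.argmax` is an illustrative choice for breaking ties, not a rule.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Spam labels and Probability_score values from the table
spam = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
probabilities = [0.6, 0.3, 0.6, 0.3, 0.8, 0.5, 0.3, 0.6, 0.9, 0.2]

fpr, tpr, thresholds = roc_curve(spam, probabilities)

# Youden's J for every point on the curve
j_scores = tpr - fpr
best = int(np.argmax(j_scores))  # first index with the largest J

print(f"Best threshold: {thresholds[best]}")
print(f"FPR={fpr[best]:.2f}, TPR={tpr[best]:.2f}, J={j_scores[best]:.2f}")
```

On this data the maximum is J = 0.5 at threshold 0.8, the same zero-false-positive point discussed above. The point at threshold 0.6 scores lower (J ≈ 0.42) because its two false alarms cost more than the one extra spam email it catches.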

DEV Community: Akshay Shetty
