65. ROC Curves and AUC: Comparing Models Fairly

You have two models. Model A has F1 of 0.82. Model B has F1 of 0.79.

Model A wins, right?

Not necessarily. F1 is calculated at one specific threshold. Maybe Model B is much better at other thresholds. Maybe on your actual deployment threshold, B beats A.

ROC curves show you the full picture. They plot model performance across every possible threshold at once. AUC collapses that into one number you can compare.

It's the right way to compare classifiers when you haven't committed to a threshold yet.


What You'll Learn Here

  • What the ROC curve actually plots and how to read it
  • What AUC means in plain language
  • How to build and compare multiple ROC curves
  • When to use ROC-AUC vs precision-recall
  • Multi-class ROC with one-vs-rest
  • The things people get wrong about AUC

The Two Axes of a ROC Curve

ROC stands for Receiver Operating Characteristic. It comes from signal detection theory in the 1940s. The name is not helpful. The chart is.

A ROC curve plots two things as the threshold changes from 0 to 1:

Y-axis: True Positive Rate (TPR) = Recall

TPR = TP / (TP + FN)

Of all actual positives, what fraction did you catch? Higher is better.

X-axis: False Positive Rate (FPR)

FPR = FP / (FP + TN)

Of all actual negatives, what fraction did you wrongly flag? Lower is better.

As you lower the threshold, you catch more positives (TPR goes up) but you also flag more negatives as positive (FPR goes up). The ROC curve traces that tradeoff.

Perfect model: goes straight up then right. Hits top-left corner.
Random model:  diagonal line from (0,0) to (1,1).
Your model:    somewhere between those two.

The closer your curve hugs the top-left corner, the better your model.
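
To make that concrete, here's a tiny sketch with made-up scores and labels. It computes one (FPR, TPR) point per threshold by hand, which is exactly what roc_curve automates across every threshold:

# Toy example: one ROC point per threshold (scores invented for illustration)
import numpy as np

y_true   = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_scores = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

for threshold in [0.25, 0.50, 0.75]:
    y_pred = (y_scores >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    print(f"threshold={threshold:.2f}  TPR={tp/(tp+fn):.2f}  FPR={fp/(fp+tn):.2f}")

Lowering the threshold pushes both rates up together; plot every such point and you have the curve.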


Building Your First ROC Curve

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt
import numpy as np

data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s  = scaler.transform(X_test)

# Train two models to compare
rf = RandomForestClassifier(n_estimators=100, random_state=42)
lr = LogisticRegression(max_iter=1000, random_state=42)

rf.fit(X_train, y_train)
lr.fit(X_train_s, y_train)

# Get probability scores (not class labels)
rf_proba = rf.predict_proba(X_test)[:, 1]
lr_proba = lr.predict_proba(X_test_s)[:, 1]

# Calculate ROC curve points
rf_fpr, rf_tpr, rf_thresholds = roc_curve(y_test, rf_proba)
lr_fpr, lr_tpr, lr_thresholds = roc_curve(y_test, lr_proba)

# Calculate AUC
rf_auc = roc_auc_score(y_test, rf_proba)
lr_auc = roc_auc_score(y_test, lr_proba)

print(f"Random Forest AUC: {rf_auc:.3f}")
print(f"Logistic Reg  AUC: {lr_auc:.3f}")

# Plot
plt.figure(figsize=(8, 6))
plt.plot(rf_fpr, rf_tpr, color='blue',   linewidth=2, label=f'Random Forest (AUC={rf_auc:.3f})')
plt.plot(lr_fpr, lr_tpr, color='orange', linewidth=2, label=f'Logistic Reg  (AUC={lr_auc:.3f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--', linewidth=1, label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.title('ROC Curve Comparison')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('roc_curve.png', dpi=100)
plt.show()

Output:

Random Forest AUC: 0.997
Logistic Reg  AUC: 0.995

Both are excellent. The ROC curve shows Random Forest edges out Logistic Regression slightly, especially at low false positive rates.


What AUC Actually Means

AUC = Area Under the ROC Curve. It ranges from 0.5 to 1.0 for a sensible model.

The number has a really nice interpretation that most people don't know:

AUC = the probability that your model ranks a random positive example higher than a random negative example.

If AUC = 0.97, it means: pick one random positive example and one random negative example from your dataset. There's a 97% chance your model assigned the positive one a higher score. (One caveat for this dataset: in load_breast_cancer, the positive class, label 1, is benign, so "positive" here means benign rather than cancer.)

# Manually verify the AUC interpretation
def manual_auc(y_true, y_scores):
    positives = y_scores[y_true == 1]
    negatives = y_scores[y_true == 0]

    count = 0
    total = 0
    for pos in positives:
        for neg in negatives:
            total += 1
            if pos > neg:
                count += 1
            elif pos == neg:
                count += 0.5  # tie counts as half

    return count / total

manual_result = manual_auc(y_test, rf_proba)
sklearn_result = roc_auc_score(y_test, rf_proba)

print(f"Manual AUC calculation: {manual_result:.3f}")
print(f"Sklearn AUC:            {sklearn_result:.3f}")

Output:

Manual AUC calculation: 0.997
Sklearn AUC:            0.997

Same number. That nested loop compares every positive with every negative, which is far too slow for real datasets, but it proves what AUC actually computes.

AUC score interpretation:

  • 1.00: perfect model
  • 0.90 to 0.99: excellent
  • 0.80 to 0.90: good
  • 0.70 to 0.80: fair
  • 0.60 to 0.70: poor
  • 0.50: random guessing (no better than a coin flip)
  • below 0.50: worse than random (your labels might be flipped)

Finding the Best Threshold From the ROC Curve

The ROC curve gives you every possible threshold. How do you pick one?

Option 1: Youden's J statistic
Maximize TPR - FPR. Finds the point on the curve that's furthest from the diagonal.

# Find the optimal threshold using Youden's J
fpr, tpr, thresholds = roc_curve(y_test, rf_proba)

# Youden's J = TPR - FPR
j_scores = tpr - fpr
best_idx  = np.argmax(j_scores)
best_threshold = thresholds[best_idx]

print(f"Best threshold (Youden's J): {best_threshold:.3f}")
print(f"At this threshold:")
print(f"  TPR (Recall): {tpr[best_idx]:.3f}")
print(f"  FPR:          {fpr[best_idx]:.3f}")

# Apply this threshold
y_pred_best = (rf_proba >= best_threshold).astype(int)
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred_best, target_names=data.target_names))

Option 2: Closest to top-left corner
Minimize the distance from the point (0, 1).

# Distance from top-left corner (0, 1)
distances = np.sqrt(fpr**2 + (1 - tpr)**2)
best_idx_dist = np.argmin(distances)
print(f"Best threshold (closest to corner): {thresholds[best_idx_dist]:.3f}")

Option 3: Business-driven threshold
Use domain knowledge. If catching 90% of fraud cases is required, find the threshold that gives TPR >= 0.90 with the lowest FPR.

# Find threshold that achieves at least 90% recall
min_recall = 0.90
valid_idx = np.where(tpr >= min_recall)[0]
best_business_idx = valid_idx[np.argmin(fpr[valid_idx])]

print(f"Threshold for recall >= 90%: {thresholds[best_business_idx]:.3f}")
print(f"  Actual TPR: {tpr[best_business_idx]:.3f}")
print(f"  FPR:        {fpr[best_business_idx]:.3f}")

Comparing Many Models at Once

import xgboost as xgb
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

models = {
    'Random Forest':   (RandomForestClassifier(n_estimators=100, random_state=42), X_train,   X_test),
    'Logistic Reg':    (LogisticRegression(max_iter=1000, random_state=42),          X_train_s, X_test_s),
    'XGBoost':         (xgb.XGBClassifier(n_estimators=100, random_state=42,
                                            eval_metric='logloss', verbosity=0),      X_train,   X_test),
    'Gaussian NB':     (GaussianNB(),                                                 X_train,   X_test),
    'KNN':             (KNeighborsClassifier(n_neighbors=7),                          X_train_s, X_test_s),
    'SVM':             (SVC(kernel='rbf', probability=True, random_state=42),         X_train_s, X_test_s),
}

plt.figure(figsize=(9, 7))
plt.plot([0, 1], [0, 1], color='gray', linestyle='--', linewidth=1, label='Random')

for name, (model, X_tr, X_te) in models.items():
    model.fit(X_tr, y_train)
    proba = model.predict_proba(X_te)[:, 1]
    fpr_m, tpr_m, _ = roc_curve(y_test, proba)
    auc_m = roc_auc_score(y_test, proba)
    plt.plot(fpr_m, tpr_m, linewidth=2, label=f'{name} (AUC={auc_m:.3f})')

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves: All Models Compared')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.savefig('roc_all_models.png', dpi=100)
plt.show()

# Print AUC table
print(f"\n{'Model':<18} {'AUC'}")
print("-" * 28)
for name, (model, X_tr, X_te) in models.items():
    proba = model.predict_proba(X_te)[:, 1]
    auc_m = roc_auc_score(y_test, proba)
    print(f"{name:<18} {auc_m:.3f}")

Output:

Model              AUC
----------------------------
Random Forest      0.997
Logistic Reg       0.995
XGBoost            0.997
Gaussian NB        0.993
KNN                0.989
SVM                0.996

All strong models here; this dataset is easy. On messier datasets, the gaps between models are usually much wider.


ROC vs Precision-Recall: When to Use Which

This is something a lot of people get wrong.

Use ROC-AUC when:

  • Your dataset is roughly balanced
  • You want a single number to compare models independently of threshold
  • You care about overall ranking ability of the model

Use Precision-Recall when:

  • Your dataset is heavily imbalanced (fraud, rare disease, anomaly detection)
  • You care more about performance on the positive class
  • The negative class is not interesting (it's just background)

The reason: on imbalanced data, ROC-AUC can look great even when the model is bad at finding the rare class. With a huge negative class, the FPR denominator (FP + TN) is enormous, so even thousands of false positives barely move the FPR. The ROC curve can hug the top-left while precision on the rare class collapses.

from sklearn.metrics import average_precision_score, roc_auc_score
import numpy as np

# Demonstrate on imbalanced data
np.random.seed(42)
n = 10000
y_imbal = np.array([0]*9800 + [1]*200)  # 2% positive

# One useless model (pure noise) and one modestly better model
scores_bad  = np.random.rand(n)
scores_good = np.random.rand(n)
scores_good[y_imbal == 1] += 0.3  # good model scores fraud higher

print("Imbalanced dataset (2% positive):")
print(f"\nBad model:")
print(f"  ROC-AUC:         {roc_auc_score(y_imbal, scores_bad):.3f}")
print(f"  Avg Precision:   {average_precision_score(y_imbal, scores_bad):.3f}")

print(f"\nBetter model:")
print(f"  ROC-AUC:         {roc_auc_score(y_imbal, scores_good):.3f}")
print(f"  Avg Precision:   {average_precision_score(y_imbal, scores_good):.3f}")

Output:

Imbalanced dataset (2% positive):

Bad model:
  ROC-AUC:         0.501
  Avg Precision:   0.021

Better model:
  ROC-AUC:         0.753
  Avg Precision:   0.143

Both metrics show the better model is better. But look at the bad model's ROC-AUC: 0.501. That's almost random. Average Precision: 0.021. Also terrible. They agree here.

The problem shows up when models look decent on ROC but terrible on PR. That happens when there are so many true negatives that FPR stays low even for a bad model on the minority class.

Rule of thumb: if the positive class is less than 10% of your data, trust precision-recall more than ROC.
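
To see this failure mode directly, here's a minimal synthetic sketch (the class counts and score distributions are invented for illustration). The positives are rare and do score higher on average, yet the two metrics tell very different stories:

# Rare positives (0.1%) whose scores overlap heavily with the negatives
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n_neg, n_pos = 99_900, 100
y = np.concatenate([np.zeros(n_neg), np.ones(n_pos)])
scores = np.concatenate([
    rng.normal(0.50, 0.10, n_neg),   # negatives
    rng.normal(0.70, 0.10, n_pos),   # positives score higher on average
])

print(f"ROC-AUC:           {roc_auc_score(y, scores):.3f}")
print(f"Average precision: {average_precision_score(y, scores):.3f}")

With this setup, ROC-AUC comes out around 0.92, a solid-looking model, while average precision lands far below what the ROC number suggests: every positive is buried under hundreds of higher-scoring negatives.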


Multi-class ROC: One vs Rest

ROC is defined for binary problems. For multi-class, you use one-vs-rest: build a separate ROC curve for each class treating it as positive and all others as negative.

from sklearn.datasets import load_iris
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

iris = load_iris()
X_i, y_i = iris.data, iris.target
classes = iris.target_names

# Binarize labels for one-vs-rest
y_bin = label_binarize(y_i, classes=[0, 1, 2])

X_train_i, X_test_i, y_train_b, y_test_b = train_test_split(
    X_i, y_bin, test_size=0.2, random_state=42
)

# Train OvR classifier
clf = OneVsRestClassifier(RandomForestClassifier(n_estimators=100, random_state=42))
clf.fit(X_train_i, y_train_b)
y_score = clf.predict_proba(X_test_i)

plt.figure(figsize=(8, 6))

for i, class_name in enumerate(classes):
    fpr_i, tpr_i, _ = roc_curve(y_test_b[:, i], y_score[:, i])
    auc_i = auc(fpr_i, tpr_i)
    plt.plot(fpr_i, tpr_i, linewidth=2, label=f'{class_name} (AUC={auc_i:.3f})')

plt.plot([0, 1], [0, 1], 'k--', linewidth=1)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Multi-class ROC (One vs Rest) - Iris')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('multiclass_roc.png', dpi=100)
plt.show()

Each class gets its own curve and AUC. The class that's hardest to separate from the others will have the lowest AUC.
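
If you want a single summary number instead of three, sklearn can macro-average the per-class AUCs directly. A quick sketch using the binarized labels from above:

from sklearn.metrics import roc_auc_score

# Unweighted mean of the three one-vs-rest AUCs
macro_auc = roc_auc_score(y_test_b, y_score, average='macro')
print(f"Macro-average AUC: {macro_auc:.3f}")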


Cross-validated AUC: More Reliable Than a Single Split

from sklearn.model_selection import cross_val_score

models_to_compare = {
    'Random Forest':  RandomForestClassifier(n_estimators=100, random_state=42),
    'Logistic Reg':   LogisticRegression(max_iter=1000, random_state=42),
    'XGBoost':        xgb.XGBClassifier(n_estimators=100, random_state=42,
                                         eval_metric='logloss', verbosity=0),
}

print(f"{'Model':<18} {'CV AUC Mean':<14} {'CV AUC Std'}")
print("-" * 45)

for name, m in models_to_compare.items():
    # scoring='roc_auc' uses predict_proba internally
    scores = cross_val_score(m, X, y, cv=5, scoring='roc_auc')
    print(f"{name:<18} {scores.mean():.3f}          {scores.std():.4f}")

Output:

Model              CV AUC Mean    CV AUC Std
---------------------------------------------
Random Forest      0.996          0.0037
Logistic Reg       0.994          0.0051
XGBoost            0.996          0.0037

Cross-validated AUC is more trustworthy than AUC on a single test set. The std tells you how consistent the model is across different data subsets.
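
One caveat: comparing mean AUCs from separate cross-validation runs adds noise. If you score both models on identical folds, the per-fold differences become directly comparable. A minimal sketch, reusing the models defined above:

from sklearn.model_selection import StratifiedKFold, cross_val_score

# Fixed folds so both models see exactly the same train/test splits
cv_folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

rf_scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=42),
                            X, y, cv=cv_folds, scoring='roc_auc')
lr_scores = cross_val_score(LogisticRegression(max_iter=1000, random_state=42),
                            X, y, cv=cv_folds, scoring='roc_auc')

print(f"Per-fold AUC difference (RF - LR): {np.round(rf_scores - lr_scores, 4)}")
print(f"Mean difference: {(rf_scores - lr_scores).mean():+.4f}")

If the sign of the difference flips from fold to fold, the two models are effectively tied on this data.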


Plotting the ROC Curve With Confidence

from sklearn.model_selection import StratifiedKFold

# Plot ROC with variance across folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rf_cv = RandomForestClassifier(n_estimators=100, random_state=42)

tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)

plt.figure(figsize=(8, 6))

for fold, (train_idx, test_idx) in enumerate(cv.split(X, y)):
    rf_cv.fit(X[train_idx], y[train_idx])
    proba_cv = rf_cv.predict_proba(X[test_idx])[:, 1]

    fpr_cv, tpr_cv, _ = roc_curve(y[test_idx], proba_cv)
    auc_cv = roc_auc_score(y[test_idx], proba_cv)
    aucs.append(auc_cv)

    interp_tpr = np.interp(mean_fpr, fpr_cv, tpr_cv)
    interp_tpr[0] = 0.0
    tprs.append(interp_tpr)

    plt.plot(fpr_cv, tpr_cv, alpha=0.2, color='blue', linewidth=1)

mean_tpr = np.mean(tprs, axis=0)
mean_tpr[-1] = 1.0
mean_auc = np.mean(aucs)
std_auc  = np.std(aucs)

plt.plot(mean_fpr, mean_tpr, color='blue', linewidth=2,
         label=f'Mean ROC (AUC = {mean_auc:.3f} +/- {std_auc:.3f})')

std_tpr = np.std(tprs, axis=0)
plt.fill_between(mean_fpr, mean_tpr - std_tpr, mean_tpr + std_tpr,
                 alpha=0.15, color='blue', label='Standard deviation')

plt.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve With Cross-Validation Variance')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('roc_with_variance.png', dpi=100)
plt.show()

The shaded area shows how much the ROC curve varies across folds. A narrow band means your model is consistent. A wide band means it's sensitive to which data it trains on.


The Things Everyone Gets Wrong

Mistake 1: Using AUC on heavily imbalanced data and calling it done

A model that outputs the same score for every example gets AUC = 0.5. And on heavily imbalanced data, a model can post AUC = 0.85 while its average precision sits at 0.10: the overall ranking looks fine, but most of the rare positives are still buried under higher-scoring negatives. AUC looks good. The model catches almost nothing. Use the PR curve on imbalanced problems.
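
A quick sanity check of that first claim: a constant score ties every positive-negative pair, and ties count as half, which lands exactly at 0.5.

import numpy as np
from sklearn.metrics import roc_auc_score

# Every positive-negative pair is a tie -> AUC is exactly 0.5
constant_scores = np.full(len(y_test), 0.5)
print(roc_auc_score(y_test, constant_scores))  # 0.5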

Mistake 2: Thinking high AUC means the model is ready to deploy

AUC is threshold-independent. Deployment requires a threshold. Pick a threshold based on your actual business requirements, not just the math.

Mistake 3: Comparing AUC across different datasets

AUC = 0.90 on one problem doesn't mean the same thing as AUC = 0.90 on another. Easy problems have high AUC even for weak models. Hard problems have lower AUC even for strong ones. Compare models on the same data only.

Mistake 4: Using predict instead of predict_proba for ROC

ROC curves need probability scores, not hard class labels. Always use predict_proba(X)[:, 1] or decision_function(X) as input to roc_curve.

# WRONG
roc_curve(y_test, model.predict(X_test))     # binary 0/1 output, useless for ROC

# RIGHT
roc_curve(y_test, model.predict_proba(X_test)[:, 1])  # probability scores

Quick Cheat Sheet

Task                       Code
-------------------------  ------------------------------------------------------
ROC curve                  roc_curve(y_test, y_proba) -> fpr, tpr, thresholds
AUC score                  roc_auc_score(y_test, y_proba)
Cross-val AUC              cross_val_score(model, X, y, cv=5, scoring='roc_auc')
Best threshold (Youden)    thresholds[np.argmax(tpr - fpr)]
Best threshold (corner)    thresholds[np.argmin(fpr**2 + (1 - tpr)**2)]
Plot ROC                   plt.plot(fpr, tpr) with output from roc_curve
Multi-class ROC            OneVsRestClassifier + label_binarize

Practice Challenges

Level 1:
Train three different models on load_breast_cancer(). Plot all three ROC curves on the same graph. Which model has the highest AUC? At FPR=0.05, which model has the highest TPR?

Level 2:
Create a heavily imbalanced dataset (1% positive). Train a RandomForest. Plot both the ROC curve and the Precision-Recall curve side by side. Which one better reveals that the model struggles?

Level 3:
On the iris dataset, build a full one-vs-rest ROC analysis. Compute the macro-average AUC (average of per-class AUC). Then compute it using roc_auc_score(y, proba, multi_class='ovr', average='macro'). Verify both give the same number.


Next up, Post 66: K-Means Clustering: Find Groups Without Labels. We move into unsupervised learning. No correct answers. The algorithm groups similar data by itself and you decide if the groups make sense.
