Sachin Kr. Rajput

AUC-ROC Explained: The Smoke Detector With a Sensitivity Dial and the One Number That Tells You If It's Any Good

The One-Line Summary: AUC-ROC measures how well your model can distinguish between classes ACROSS ALL possible thresholds. An AUC of 1.0 means perfect separation. An AUC of 0.5 means your model is no better than random coin flips. It's the ONE number that captures overall discriminative power.


The Tale of Two Smoke Detectors

You're buying a smoke detector for your kitchen.

The salesman shows you two models, each with a sensitivity dial you can adjust from 1 (ignore everything) to 10 (maximum paranoia).


Detector A: "The Neurotic"

You test it at different settings:

Setting 1:  Ignores everything. Missed the actual fire. Also ignored toast. 
Setting 3:  Caught the fire! But also screamed at toast.
Setting 5:  Caught the fire! But also screamed at toast, steam, and dust.
Setting 7:  Caught the fire! But also screamed at toast, steam, dust, and humidity.
Setting 10: Caught the fire! But screams constantly at EVERYTHING.

Summary: At EVERY setting, if it catches fires, it also has tons of false alarms. There's no sweet spot.


Detector B: "The Smart One"

Setting 1:  Ignores everything. Missed the fire.
Setting 3:  Caught the fire! Ignored toast and steam.
Setting 5:  Caught the fire! Ignored toast, steam, and dust.
Setting 7:  Caught the fire! Only false alarmed on burnt toast.
Setting 10: Caught the fire! False alarmed on burnt toast and heavy smoke from cooking.

Summary: At medium settings, it catches fires WITHOUT false alarming on toast. There's a sweet spot!


The Key Question

Both detectors can be tuned to catch 100% of fires (just crank the dial to 10). Both can be tuned to have 0% false alarms (just set the dial to 1, though then they miss fires).

So how do you compare them?

You need to see performance across ALL settings, not just one.

That's exactly what the ROC curve does.


What Is an ROC Curve?

ROC = Receiver Operating Characteristic

(The name comes from WWII radar operators trying to distinguish enemy aircraft from noise. Same problem!)

An ROC curve plots:

Y-axis: True Positive Rate (TPR) = "What % of actual fires did we catch?"
        Also called: Recall, Sensitivity
        Formula: TP / (TP + FN)

X-axis: False Positive Rate (FPR) = "What % of non-fires did we falsely alarm on?"
        Formula: FP / (FP + TN)

Each point on the curve represents a different threshold (sensitivity dial setting).
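
To make those two rates concrete, here's a tiny sketch with made-up counts (purely illustrative) showing how TPR and FPR fall out of the four confusion-matrix cells at a single threshold:

# Hypothetical counts at ONE dial setting (one threshold), purely illustrative:
# 10 real fires and 100 non-fire events (toast, steam, dust...)
tp, fn = 9, 1      # 9 fires caught, 1 missed
fp, tn = 20, 80    # 20 false alarms, 80 non-fires correctly ignored

tpr = tp / (tp + fn)   # fraction of real fires caught      -> 0.90
fpr = fp / (fp + tn)   # fraction of non-fires alarmed on   -> 0.20

print(f"TPR = {tpr:.0%}, FPR = {fpr:.0%}")   # TPR = 90%, FPR = 20%

Turn the dial and the four counts shift, which moves (FPR, TPR) to a new point; sweep every setting and you trace out the ROC curve.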


Drawing the ROC Curve

Let's trace through with our smoke detectors:

Detector A (The Neurotic)

Setting | Fires Caught (TPR) | False Alarms (FPR)
--------|--------------------|-------------------
1       | 0%                 | 0%
3       | 50%                | 40%
5       | 75%                | 65%
7       | 90%                | 85%
10      | 100%               | 100%

Detector B (The Smart One)

Setting | Fires Caught (TPR) | False Alarms (FPR)
--------|--------------------|-------------------
1       | 0%                 | 0%
3       | 70%                | 10%
5       | 90%                | 20%
7       | 95%                | 35%
10      | 100%               | 60%

Plotting Them

True Positive Rate (Fires Caught)
     ↑
 100%│                    ●───● Detector B
     │                ●───┘
     │            ●───┘
  75%│        ●───┘          ●─── Detector A
     │       ╱            ●──┘
     │      ╱          ●──┘
  50%│     ╱       ●───┘
     │    ╱    ●───┘
     │   ╱ ●───┘
  25%│  ╱──┘
     │ ╱
     │╱.......................... Random (diagonal)
   0%└─────────────────────────────────→
     0%   25%   50%   75%  100%
          False Positive Rate (False Alarms)

What do you see?

  • Detector B curves toward the top-left — High fire detection with low false alarms
  • Detector A hugs the diagonal — To catch more fires, it must accept proportionally more false alarms
  • The diagonal line — A random detector (flip a coin) would sit here

What Is AUC?

AUC = Area Under the Curve

Literally the area under the ROC curve.

     ↑
 100%│██████████████████████●
     │██████████████████●───┘
     │██████████████●───┘
     │██████████●───┘         ← Detector B: AUC ≈ 0.92
     │██████●───┘                (Lots of area!)
     │██●───┘
     │●─┘
   0%└─────────────────────────→
     0%                    100%


     ↑
 100%│                    ●───●
     │                ●───┘
     │████████████●───┘
     │████████●───┘           ← Detector A: AUC ≈ 0.58
     │████●───┘                  (Not much more than diagonal)
     │█●──┘
     │●─┘
   0%└─────────────────────────→
     0%                    100%
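
You can sanity-check these areas straight from the tabulated (FPR, TPR) points. Here's a minimal sketch using sklearn's trapezoidal auc helper; because it draws straight lines between only a handful of points, the numbers land a touch below the shaded areas above, but in the same ballpark:

from sklearn.metrics import auc  # trapezoidal area under a curve

# (FPR, TPR) pairs from the detector tables, as fractions,
# plus the implicit (1.0, 1.0) endpoint for Detector B
fpr_a = [0.00, 0.40, 0.65, 0.85, 1.00]
tpr_a = [0.00, 0.50, 0.75, 0.90, 1.00]

fpr_b = [0.00, 0.10, 0.20, 0.35, 0.60, 1.00]
tpr_b = [0.00, 0.70, 0.90, 0.95, 1.00, 1.00]

print(f"Detector A: AUC ≈ {auc(fpr_a, tpr_a):.2f}")   # around 0.56
print(f"Detector B: AUC ≈ {auc(fpr_b, tpr_b):.2f}")   # around 0.90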

Interpretation:

AUC     | Meaning
--------|------------------------------------------------------------
1.0     | Perfect — catches all positives before any false positives
0.9+    | Excellent — strong discrimination
0.8-0.9 | Good
0.7-0.8 | Fair
0.6-0.7 | Poor
0.5     | Useless — no better than random
<0.5    | Worse than random — the model is inverted!

The Intuitive Interpretation

Here's the most intuitive way to understand AUC:

AUC = The probability that a randomly chosen positive example ranks higher than a randomly chosen negative example.

In smoke detector terms:

AUC = If I show the detector one real fire and one non-fire, what's the probability it gives a higher "danger score" to the fire?

  • AUC = 1.0: The detector ALWAYS scores fires higher than non-fires
  • AUC = 0.5: The detector is guessing randomly
  • AUC = 0.8: 80% of the time, fires score higher than non-fires
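
This interpretation is easy to verify numerically. Here's a small, self-contained sketch (synthetic "danger scores", not the models from this article) that estimates AUC by sampling random positive/negative pairs and compares the result with roc_auc_score:

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

# Synthetic danger scores: non-fires tend to score lower than real fires
scores_nonfire = rng.normal(loc=0.3, scale=0.15, size=500)
scores_fire = rng.normal(loc=0.6, scale=0.15, size=500)

# Monte Carlo estimate of P(random fire scores higher than random non-fire)
fires = rng.choice(scores_fire, size=100_000)
nonfires = rng.choice(scores_nonfire, size=100_000)
pairwise_prob = np.mean(fires > nonfires)

# Compare with the usual AUC computation on the same scores
y_true = np.r_[np.zeros(500), np.ones(500)]
y_score = np.r_[scores_nonfire, scores_fire]

print(f"Pairwise estimate: {pairwise_prob:.3f}")
print(f"roc_auc_score:     {roc_auc_score(y_true, y_score):.3f}")
# The two numbers agree up to sampling noise; they measure the same thing.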

Code: Computing ROC and AUC

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc, roc_auc_score

# Create dataset
X, y = make_classification(n_samples=1000, n_features=20, 
                           n_informative=10, n_redundant=5,
                           random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train two models
model_good = RandomForestClassifier(n_estimators=100, random_state=42)
model_weak = LogisticRegression(C=0.001, max_iter=1000)  # Intentionally weak

model_good.fit(X_train, y_train)
model_weak.fit(X_train, y_train)

# Get probability scores (not binary predictions!)
proba_good = model_good.predict_proba(X_test)[:, 1]
proba_weak = model_weak.predict_proba(X_test)[:, 1]

# Calculate ROC curves
fpr_good, tpr_good, thresholds_good = roc_curve(y_test, proba_good)
fpr_weak, tpr_weak, thresholds_weak = roc_curve(y_test, proba_weak)

# Calculate AUC
auc_good = roc_auc_score(y_test, proba_good)
auc_weak = roc_auc_score(y_test, proba_weak)

print(f"Good Model AUC: {auc_good:.3f}")
print(f"Weak Model AUC: {auc_weak:.3f}")

# Plot
plt.figure(figsize=(10, 8))

plt.plot(fpr_good, tpr_good, 'b-', linewidth=2, 
         label=f'Random Forest (AUC = {auc_good:.3f})')
plt.plot(fpr_weak, tpr_weak, 'r-', linewidth=2,
         label=f'Weak Logistic (AUC = {auc_weak:.3f})')
plt.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random (AUC = 0.500)')

plt.fill_between(fpr_good, tpr_good, alpha=0.3)

plt.xlabel('False Positive Rate (False Alarms)', fontsize=12)
plt.ylabel('True Positive Rate (Fires Caught)', fontsize=12)
plt.title('ROC Curve Comparison', fontsize=14)
plt.legend(loc='lower right', fontsize=11)
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('roc_curve.png', dpi=150)
plt.show()

Output:

Good Model AUC: 0.967
Weak Model AUC: 0.723

Understanding the Threshold Connection

Each point on the ROC curve corresponds to a threshold:

# Show what happens at different thresholds
print("Threshold | FPR    | TPR    | What it means")
print("-" * 55)

for thresh in [0.1, 0.3, 0.5, 0.7, 0.9]:
    # Find closest threshold in our curve
    idx = np.argmin(np.abs(thresholds_good - thresh))

    print(f"  {thresh:.1f}     | {fpr_good[idx]:.1%}  | {tpr_good[idx]:.1%}  |", end=" ")

    if thresh < 0.3:
        print("Aggressive: catch everything, many false alarms")
    elif thresh < 0.6:
        print("Balanced: good tradeoff")
    else:
        print("Conservative: few false alarms, might miss some")

Output:

Threshold | FPR    | TPR    | What it means
-------------------------------------------------------
  0.1     | 12.3%  | 98.5%  | Aggressive: catch everything, many false alarms
  0.3     | 4.2%   | 95.2%  | Balanced: good tradeoff
  0.5     | 2.1%   | 91.8%  | Balanced: good tradeoff
  0.7     | 0.7%   | 85.3%  | Conservative: few false alarms, might miss some
  0.9     | 0.0%   | 72.1%  | Conservative: few false alarms, might miss some

The ROC curve shows ALL these operating points at once!


Why AUC Is Threshold-Independent

This is AUC's superpower.

Problem with accuracy/precision/recall: They depend on your chosen threshold (usually 0.5).

from sklearn.metrics import accuracy_score, precision_score, recall_score

# Same model, different thresholds
for thresh in [0.3, 0.5, 0.7]:
    y_pred = (proba_good >= thresh).astype(int)

    print(f"\nThreshold = {thresh}")
    print(f"  Accuracy:  {accuracy_score(y_test, y_pred):.1%}")
    print(f"  Precision: {precision_score(y_test, y_pred):.1%}")
    print(f"  Recall:    {recall_score(y_test, y_pred):.1%}")

Output:

Threshold = 0.3
  Accuracy:  93.0%
  Precision: 89.5%
  Recall:    97.2%

Threshold = 0.5
  Accuracy:  94.3%
  Precision: 93.1%
  Recall:    95.2%

Threshold = 0.7
  Accuracy:  93.7%
  Precision: 97.8%
  Recall:    89.1%

Same model, different metrics depending on threshold!

But AUC? Always 0.967. It doesn't care what threshold you pick later.

AUC measures how CAPABLE your model is of separating classes. Threshold selection comes after.


Visual: What Different AUCs Look Like

AUC = 1.0 (Perfect)              AUC = 0.9 (Excellent)
     ↑                                ↑
 100%│■■■■■■■■■■■■■■■■●          100%│          ●────●
     │■■■■■■■■■■■■■■■■│               │      ●───┘
     │■■■■■■■■■■■■■■■■│               │  ●───┘
     │■■■■■■■■■■■■■■■■│               │●─┘
   0%●────────────────┘            0%└───────────────→

Perfect separation!              Strong separation


AUC = 0.7 (Fair)                 AUC = 0.5 (Useless)
     ↑                                ↑
 100%│            ●────●         100%│              ●
     │        ●───┘                   │          ●──┘
     │    ●───┘                       │      ●───┘
     │●───┘                           │  ●───┘
   0%└───────────────→            0%●─┘──────────────→

Okay separation                  Random guessing (diagonal)
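
If you want to reproduce this spectrum yourself, here's a rough sketch: simulate scores for two classes with more or less overlap and watch the AUC slide from about 1.0 down to 0.5 (exact values will wobble with the random seed):

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000

# Bigger gap between the class score distributions -> less overlap -> higher AUC
for gap in [5.0, 1.8, 0.75, 0.0]:    # expect AUC near 1.0, 0.9, 0.7, 0.5
    neg = rng.normal(0.0, 1.0, n)    # negative-class scores
    pos = rng.normal(gap, 1.0, n)    # positive-class scores, shifted right by `gap`
    y_true = np.r_[np.zeros(n), np.ones(n)]
    y_score = np.r_[neg, pos]
    print(f"gap = {gap:4.2f}  ->  AUC ≈ {roc_auc_score(y_true, y_score):.2f}")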

When to Use AUC-ROC

✅ Use AUC-ROC When:

1. You need to compare models before choosing a threshold

from sklearn.svm import SVC  # needed for the SVC model below

# Compare multiple models
models = {
    'Logistic': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(),
    'SVM': SVC(probability=True)  # probability=True enables predict_proba
}

for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    auc_score = roc_auc_score(y_test, proba)
    print(f"{name}: AUC = {auc_score:.3f}")

# Pick the best model by AUC, THEN choose threshold for deployment

2. Classes are roughly balanced

AUC-ROC works best when positive and negative classes are similar in size.

3. You care about ranking quality

"Does the model rank positives higher than negatives?" — AUC directly measures this.

4. The operating threshold will be tuned later

If you'll choose the threshold based on business needs anyway, AUC tells you model quality independent of that choice.


❌ Don't Use AUC-ROC When:

1. Classes are highly imbalanced

# Imbalanced: 95% negative, 5% positive
# AUC can look great while precision is terrible!

# Use Precision-Recall AUC instead:
from sklearn.metrics import average_precision_score, precision_recall_curve

pr_auc = average_precision_score(y_test, proba)
print(f"PR-AUC: {pr_auc:.3f}")  # More informative for imbalanced data

2. You care about a specific operating point

If you KNOW you'll use threshold 0.5, just measure precision/recall there.

3. False positives and false negatives have very different costs

AUC treats all thresholds equally. But if a missed fire (FN) costs $1M and a false alarm (FP) costs $1, you need cost-sensitive analysis (sketched below).
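
One pragmatic alternative in that situation is to price out every candidate threshold and pick the cheapest. A rough sketch, reusing y_test and proba_good from the Random Forest example above; the dollar figures are made up for illustration:

import numpy as np
from sklearn.metrics import roc_curve

COST_FN = 1_000_000   # hypothetical cost of missing a real fire
COST_FP = 1           # hypothetical cost of one false alarm

fpr, tpr, thresholds = roc_curve(y_test, proba_good)

n_pos = (y_test == 1).sum()
n_neg = (y_test == 0).sum()

# Expected cost at each threshold = missed positives + false alarms, priced out
expected_cost = COST_FN * (1 - tpr) * n_pos + COST_FP * fpr * n_neg

best = np.argmin(expected_cost)
print(f"Cost-optimal threshold: {thresholds[best]:.3f}")
print(f"  TPR = {tpr[best]:.1%}, FPR = {fpr[best]:.1%}, cost = ${expected_cost[best]:,.0f}")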


AUC-ROC vs Precision-Recall AUC

For imbalanced datasets, use PR-AUC instead:

from sklearn.metrics import precision_recall_curve, average_precision_score
import matplotlib.pyplot as plt

# Imbalanced dataset
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

# ROC-AUC
roc_auc = roc_auc_score(y_test, proba)

# PR-AUC
pr_auc = average_precision_score(y_test, proba)

print(f"ROC-AUC: {roc_auc:.3f}")  # Often looks good even with imbalance
print(f"PR-AUC:  {pr_auc:.3f}")   # More honest for imbalanced data

# Plot both
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# ROC Curve
fpr, tpr, _ = roc_curve(y_test, proba)
axes[0].plot(fpr, tpr, 'b-', linewidth=2)
axes[0].plot([0, 1], [0, 1], 'k--')
axes[0].set_xlabel('False Positive Rate')
axes[0].set_ylabel('True Positive Rate')
axes[0].set_title(f'ROC Curve (AUC = {roc_auc:.3f})')
axes[0].fill_between(fpr, tpr, alpha=0.3)

# PR Curve
precision, recall, _ = precision_recall_curve(y_test, proba)
axes[1].plot(recall, precision, 'r-', linewidth=2)
baseline = sum(y_test) / len(y_test)
axes[1].axhline(y=baseline, color='k', linestyle='--', label=f'Baseline ({baseline:.2f})')
axes[1].set_xlabel('Recall')
axes[1].set_ylabel('Precision')
axes[1].set_title(f'Precision-Recall Curve (AUC = {pr_auc:.3f})')
axes[1].fill_between(recall, precision, alpha=0.3)
axes[1].legend()

plt.tight_layout()
plt.savefig('roc_vs_pr.png', dpi=150)
plt.show()

Output:

ROC-AUC: 0.943  ← Looks great!
PR-AUC:  0.612  ← Actually harder to find the 5% minority class

Rule of thumb:

  • Balanced data → ROC-AUC
  • Imbalanced data → PR-AUC

Finding the Optimal Threshold

The ROC curve shows all thresholds. But which one should you USE?

Method 1: Youden's J Statistic

Maximize (TPR - FPR) — the point farthest from the diagonal.

# Find optimal threshold using Youden's J
fpr, tpr, thresholds = roc_curve(y_test, proba)
j_scores = tpr - fpr
optimal_idx = np.argmax(j_scores)
optimal_threshold = thresholds[optimal_idx]

print(f"Optimal threshold: {optimal_threshold:.3f}")
print(f"At this threshold: TPR = {tpr[optimal_idx]:.1%}, FPR = {fpr[optimal_idx]:.1%}")

Method 2: Target a Specific FPR

"I can tolerate 5% false alarms. What's the best TPR I can get?"

# Find threshold that gives ~5% FPR
target_fpr = 0.05
idx = np.argmin(np.abs(fpr - target_fpr))
threshold_for_5pct_fpr = thresholds[idx]

print(f"For FPR ≈ 5%:")
print(f"  Threshold: {threshold_for_5pct_fpr:.3f}")
print(f"  Actual FPR: {fpr[idx]:.1%}")
print(f"  TPR achieved: {tpr[idx]:.1%}")

Method 3: Target a Specific TPR

"I must catch 95% of fires. What FPR do I have to accept?"

# Find threshold that gives ~95% TPR
target_tpr = 0.95
idx = np.argmin(np.abs(tpr - target_tpr))
threshold_for_95pct_tpr = thresholds[idx]

print(f"For TPR ≈ 95%:")
print(f"  Threshold: {threshold_for_95pct_tpr:.3f}")
print(f"  Actual TPR: {tpr[idx]:.1%}")
print(f"  FPR cost: {fpr[idx]:.1%}")

Multi-Class AUC

For more than 2 classes, compute AUC for each class vs rest:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Multi-class: build a 3-class dataset and model
X_multi, y_multi = make_classification(n_samples=1000, n_classes=3,
                                        n_informative=10, random_state=42)
X_train_m, X_test_m, y_train_m, y_test_m = train_test_split(X_multi, y_multi, random_state=42)

model_multi = RandomForestClassifier(random_state=42).fit(X_train_m, y_train_m)
y_proba_multi = model_multi.predict_proba(X_test_m)   # shape: (n_samples, 3)

# One-vs-Rest AUC for each class
for i in range(3):
    y_true_binary = (y_test_m == i).astype(int)
    auc_i = roc_auc_score(y_true_binary, y_proba_multi[:, i])
    print(f"Class {i} vs Rest: AUC = {auc_i:.3f}")

# Overall (macro average of the one-vs-rest AUCs)
auc_macro = roc_auc_score(y_test_m, y_proba_multi, average='macro', multi_class='ovr')
print(f"\nMacro AUC: {auc_macro:.3f}")

Common Mistakes

Mistake 1: Using Predictions Instead of Probabilities

# ❌ WRONG: Using hard predictions
y_pred = model.predict(X_test)  # 0s and 1s
auc = roc_auc_score(y_test, y_pred)  # This gives you just 1 point!

# ✅ RIGHT: Using probability scores
y_proba = model.predict_proba(X_test)[:, 1]  # Continuous 0-1
auc = roc_auc_score(y_test, y_proba)  # Full curve!

Mistake 2: Trusting AUC with Imbalanced Data

# ❌ DANGEROUS: High AUC with 99% negative class
# ROC-AUC can be 0.95 while precision is 0.10!

# ✅ RIGHT: Also check PR-AUC for imbalanced data
print(f"ROC-AUC: {roc_auc_score(y_test, proba):.3f}")
print(f"PR-AUC:  {average_precision_score(y_test, proba):.3f}")

Mistake 3: Thinking AUC = Good at Threshold 0.5

# ❌ WRONG assumption
"AUC is 0.95, so my model at threshold 0.5 must be great!"

# ✅ RIGHT understanding
# AUC measures ranking ability, not performance at any specific threshold
# Always check metrics AT your chosen threshold too
y_pred_50 = (proba >= 0.5).astype(int)
print(f"Performance at threshold 0.5:")
print(f"  Precision: {precision_score(y_test, y_pred_50):.1%}")
print(f"  Recall: {recall_score(y_test, y_pred_50):.1%}")

Mistake 4: Comparing AUC Across Different Datasets

# ❌ WRONG
"Model A on Dataset X has AUC 0.85"
"Model B on Dataset Y has AUC 0.80"
"Therefore Model A is better!"

# ✅ RIGHT
# AUC depends on the difficulty of the problem!
# A harder dataset might have lower AUC for all models
# Only compare AUC on the SAME dataset

Quick Reference

The ROC Curve

Y-axis: True Positive Rate = TP / (TP + FN) = Recall
X-axis: False Positive Rate = FP / (FP + TN)

Each point = one threshold setting
Curve = all thresholds from 0 to 1

AUC Interpretation

AUC     | Interpretation
--------|------------------------------------------
1.0     | Perfect classifier
0.9-1.0 | Excellent
0.8-0.9 | Good
0.7-0.8 | Fair
0.6-0.7 | Poor
0.5     | Random guessing
<0.5    | Worse than random (flip the predictions!)

When to Use What

Scenario                           | Metric
-----------------------------------|------------------------
Balanced classes, comparing models | ROC-AUC
Imbalanced classes                 | PR-AUC
Specific threshold already chosen  | Precision, Recall, F1
Cost-sensitive decisions           | Custom cost function

Key Takeaways

  1. ROC curve shows TPR vs FPR at ALL thresholds — Not just one operating point

  2. AUC summarizes the curve into one number — Area under the ROC curve

  3. AUC = probability that a random positive ranks higher than a random negative — Intuitive interpretation

  4. AUC is threshold-independent — Measures model capability, not performance at 0.5

  5. AUC = 0.5 means random, AUC = 1.0 means perfect — Easy to interpret scale

  6. Use probabilities, not predictions — You need continuous scores to draw the curve

  7. For imbalanced data, prefer PR-AUC — ROC-AUC can be misleading

  8. High AUC ≠ high precision at your threshold — Always check both


The One-Sentence Summary

AUC-ROC is like rating a smoke detector not at one sensitivity setting, but across ALL settings — telling you if it's fundamentally capable of distinguishing fires from toast, regardless of where you eventually set the dial.


What's Next?

Now that you understand AUC-ROC, you're ready for:

  • Precision-Recall Curves — The better choice for imbalanced data
  • Calibration — When you need reliable probability estimates
  • Cost-Sensitive Learning — When FP ≠ FN in dollars
  • Lift and Gain Charts — For marketing and targeting

Follow me for the next article in this series!


Let's Connect!

If AUC-ROC finally makes sense now, drop a heart!

Questions? Ask in the comments — I read and respond to every one.

What's the highest AUC you've achieved on a real problem? I once hit 0.99 and immediately suspected data leakage (I was right 😅).


The difference between a smoke detector that merely looks 99% accurate at sensitivity 5 and one that's fundamentally good at detecting fires? The ROC curve. One might only work at that exact setting. The other works well across ALL settings. AUC tells you which is which.


Share this with someone who keeps comparing models at threshold 0.5. There's a whole curve they're missing.

Happy evaluating! 🔥
