Gervais Yao Amoah

Posted on May 23

Stop Trusting Your Accuracy Score: A Practical Guide to Evaluating Logistic Regression Models

#machinelearning #ai #beginners #datascience

"Accuracy lied to you. Here's the complete toolkit—confusion matrix, precision, recall, F1, ROC/AUC, log loss, and cross-validation—that separates models that look good from models that actually work."

You trained your first classifier, ran .score(), and got 97% accuracy. You shipped it. Three weeks later, your fraud team tells you it's catching zero fraudulent transactions.

Sound familiar? You fell into the accuracy trap, and it's one of the most common mistakes developers make when moving into ML.

This guide will give you the mental model and the code to evaluate binary classifiers properly. By the end, you'll know which metrics to reach for, when accuracy actively lies to you, how to read a ROC curve, and the seven pitfalls that silently kill production models.

Why Linear Regression Breaks for Classification

Before we get to evaluation, one minute on why we use logistic regression at all—because understanding the limitation it solves makes the evaluation choices clearer.

When you apply linear regression to a yes/no problem, you get predictions like 1.3 or -0.2. These aren't probabilities, so any threshold you put on them is arbitrary. A single extreme point in your training set can also drag the fitted line, and with it the decision boundary. The snippet below shows the out-of-range predictions:

import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Binary labels: 0 or 1
X = np.array([1, 2, 3, 4, 5, 6, 7, 8]).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LinearRegression().fit(X, y)
print(model.predict([[0]]))   # Predicts -0.36 — not a valid probability
print(model.predict([[10]]))  # Predicts about 1.55, also not a valid probability

Logistic regression fixes this by wrapping the linear combination in a sigmoid function:

σ(z) = 1 / (1 + e^(-z))    where z = β₀ + β₁X₁ + ... + βₙXₙ

The sigmoid squashes any real number into the interval (0, 1), giving you an actual probability. It also models the log-odds of the positive class linearly, which is the statistician's way of saying "we get interpretable coefficients."
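To make the sigmoid and the log-odds relationship concrete, here is a minimal sketch using only the standard library. The coefficients `b0` and `b1` are made up for illustration; they are not taken from any fitted model.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real-valued score z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p: float) -> float:
    """Inverse of the sigmoid: the log of p / (1 - p)."""
    return math.log(p / (1.0 - p))

# Hypothetical one-feature model: z = b0 + b1 * x
b0, b1 = -4.0, 1.0
for x in [0, 4, 8]:
    p = sigmoid(b0 + b1 * x)
    # The log-odds recover the linear score z, which is why coefficients
    # can be read as changes in log-odds per unit of the feature.
    print(f"x={x}: p={p:.3f}, log-odds={log_odds(p):.1f}")
```

Each one-unit increase in `x` adds `b1` to the log-odds, even though the probability itself moves along an S-shaped curve.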

Under the hood, the model is fit by Maximum Likelihood Estimation, which is equivalent to minimizing cross-entropy loss (not squared error). The decision boundary is linear (a straight line in 2D feature space), but the output is a probability estimate rather than a hard label.

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000, n_features=10,
    n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Note: C is the inverse of regularization strength (C = 1/λ)
# Smaller C = stronger regularization = less overfitting
model = LogisticRegression(C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

# Hard class predictions
y_pred = model.predict(X_test)

# Predicted probabilities for the positive class; use these for most evaluation tasks
y_prob = model.predict_proba(X_test)[:, 1]

Quick note on regularization: LogisticRegression in scikit-learn uses L2 regularization by default (penalty='l2'). Use penalty='l1' with solver='liblinear' if you want automatic feature selection via sparsity.

The Problem: Accuracy Actively Misleads You on Imbalanced Data

Here's the scenario that trips up almost everyone.

Fraud detection dataset:

10,000 transactions
9,900 legitimate (99%)
100 fraudulent (1%)

Build a model that predicts "legitimate" for every single transaction:

import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0]*9900 + [1]*100)
y_dummy = np.zeros(10000)  # Predicts "not fraud" always

print(f"Dummy accuracy: {accuracy_score(y_true, y_dummy):.1%}")
# Output: Dummy accuracy: 99.0%

Your "model" achieves 99% accuracy and catches zero fraud cases.

This isn't a gotcha edge case—it's the normal situation in fraud detection, medical diagnosis, churn prediction, and anomaly detection. Whenever your classes are imbalanced, accuracy is nearly useless as a primary metric.

The root problem: accuracy treats all errors as equal. But missing a fraudulent transaction (false negative) is catastrophically different from flagging a legitimate one (false positive). You need metrics that distinguish between error types.
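Splitting the dummy model's errors by type shows what the accuracy number hides. This is a small sketch in plain Python that reuses the same 9,900 / 100 setup as above.

```python
# Same setup as the dummy example: 9,900 legitimate (0), 100 fraudulent (1)
y_true = [0] * 9900 + [1] * 100
y_dummy = [0] * 10000  # always predicts "not fraud"

pairs = list(zip(y_true, y_dummy))
false_positives = sum(1 for t, p in pairs if t == 0 and p == 1)
false_negatives = sum(1 for t, p in pairs if t == 1 and p == 0)
accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)

print(f"Accuracy:        {accuracy:.1%}")
print(f"False positives: {false_positives}")
print(f"False negatives: {false_negatives}")  # every fraud case is missed
```

All of the model's mistakes are false negatives, which is exactly the error type that matters most in this scenario.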

The Confusion Matrix: Your Evaluation Foundation

Everything useful in binary classification evaluation flows from the confusion matrix—a 2×2 breakdown of where your predictions agree and disagree with reality.

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap='Blues')
plt.title("Confusion Matrix")
plt.show()

scikit-learn convention (rows = actual, columns = predicted):

	Predicted: Negative (0)	Predicted: Positive (1)
Actual: Negative	True Negative (TN) ✅	False Positive (FP) ❌
Actual: Positive	False Negative (FN) ❌	True Positive (TP) ✅

Plain English:

True Positive (TP): You predicted fraud. It was fraud.
True Negative (TN): You predicted legit. It was legit.
False Positive (FP): You cried wolf. Customer was innocent. (Type I error)
False Negative (FN): You missed the fraudster. (Type II error)

⚠️ Heads up: Some textbooks and tools swap the axis convention. When reading someone else's confusion matrix, always check the axis labels before drawing conclusions.
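If the layout ever feels ambiguous, counting the four cells by hand is a quick sanity check. The sketch below uses toy labels invented for illustration and arranges the counts in the rows = actual, columns = predicted layout described above. The helper name `confusion_counts` is my own, not a library function.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:
            fn += 1
    return tn, fp, fn, tp

# Toy labels, invented for illustration
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
# Rows = actual, columns = predicted
matrix = [[tn, fp],
          [fn, tp]]
print(matrix)  # [[4, 1], [1, 2]]
```

For binary labels, flattening scikit-learn's matrix with `confusion_matrix(y_true, y_pred).ravel()` should give the same `tn, fp, fn, tp` order.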

Precision, Recall, F1, and Specificity

Once you have the confusion matrix, every classification metric is just arithmetic on those four numbers.

Precision: "When I fire, do I hit?"

Precision = TP / (TP + FP)

Of all the positives you predicted, what fraction were actually positive? High precision means you rarely raise false alarms.

Reach for precision when false positives are expensive: spam filtering (you don't want to delete legitimate emails), content moderation (you don't want to wrongly remove posts).

Recall (Sensitivity): "Do I catch everything?"

Recall = TP / (TP + FN)

Of all the positives that actually exist, what fraction did you catch? High recall means you miss very few real positives.

Reach for recall when false negatives are dangerous: cancer screening (missing a tumor is catastrophic), fraud detection (missing fraud costs money), churn (missing a leaving customer means lost revenue).
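Here is a quick worked example of both formulas. The counts are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of predicted positives that were actually positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    """Fraction of actual positives that were caught."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical fraud model: 80 frauds caught, 40 false alarms, 20 frauds missed
tp, fp, fn = 80, 40, 20

print(f"Precision: {precision(tp, fp):.3f}")  # 80 / 120
print(f"Recall:    {recall(tp, fn):.3f}")     # 80 / 100
```

The guard clauses return 0.0 when the denominator is zero, which is one common convention for a model that never predicts positive.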

The Unavoidable Trade-off

Lower your classification threshold → you predict positive more often → recall goes up (it can never go down), and precision usually drops. Raise it → fewer positive predictions → precision usually goes up, recall goes down. In practice they pull in opposite directions; there's no free lunch.

from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_test, y_prob)

plt.figure(figsize=(8, 5))
plt.plot(thresholds, precisions[:-1], label='Precision', color='blue')
plt.plot(thresholds, recalls[:-1], label='Recall', color='red')
plt.xlabel('Classification Threshold')
plt.ylabel('Score')
plt.title('Precision-Recall Trade-off vs Threshold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

F1 Score: Balancing Both

F1 = 2 × (Precision × Recall) / (Precision + Recall)

The F1 score is the harmonic mean of precision and recall. Unlike the arithmetic mean, it punishes imbalance: a model with precision=1.0 and recall=0.0 gets an F1 of 0.0, not 0.5. Both have to be high to score well.

Use F1 when you need a single headline number and care roughly equally about precision and recall. It's especially useful for comparing models on imbalanced datasets.
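To see how the harmonic mean punishes lopsided models, compare it with the arithmetic mean on a few precision/recall pairs. The pairs are illustrative values, not results from the model above.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

for p, r in [(1.0, 0.0), (0.9, 0.1), (0.5, 0.5), (0.8, 0.7)]:
    arithmetic = (p + r) / 2
    print(f"P={p:.1f} R={r:.1f}  arithmetic={arithmetic:.2f}  F1={f1(p, r):.2f}")
```

The arithmetic mean stays at 0.50 for the first three pairs, while F1 only reaches 0.50 when precision and recall are actually balanced.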

Specificity (True Negative Rate): The Clinical Counterpart

Specificity = TN / (TN + FP)

The flip side of recall, but for negatives. "Of all actual negatives, how many did I correctly rule out?" Common in medical contexts:

High recall (sensitivity): Use for initial screening—catch every possible case.
High specificity: Use for confirmatory testing—avoid false diagnoses.
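Sensitivity and specificity read off different rows of the confusion matrix, so they can be computed side by side. The screening counts below are hypothetical.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Recall for the positive class: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """Recall for the negative class: TN / (TN + FP)."""
    return tn / (tn + fp)

# Hypothetical screening test: 50 sick patients, 1,000 healthy patients
tp, fn = 45, 5
tn, fp = 900, 100

print(f"Sensitivity: {sensitivity(tp, fn):.2f}")
print(f"Specificity: {specificity(tn, fp):.2f}")
```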

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred, target_names=['Legit', 'Fraud']))

🔑 Read classification_report carefully. The accuracy row near the bottom tells you very little when classes are imbalanced. Look at the per-class precision, recall, and F1 for your minority class.

Choosing the Right Metric for Your Situation

Here's the decision framework I use before I even start training:

Question	Answer	→ Use
Is the dataset imbalanced?	Yes	Precision / Recall / F1 / PR-AUC
	No	Accuracy is acceptable as a secondary metric
FP costly, FN cheap?	Yes	Optimize Precision
FN costly, FP cheap?	Yes	Optimize Recall
Both costly?	Yes	F1 or cost-weighted metric
Need threshold-independent comparison?	Yes	AUC-ROC or AUC-PR

For fraud, churn, and disease: optimize recall first, then set a precision floor your business can tolerate. For spam filters and recommendation engines: optimize precision, accept some misses.
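When both error types are costly, one way to make "cost-weighted metric" concrete is to price each error and compare totals. This is a sketch: the dollar costs and the error counts for the two models are assumptions invented for illustration.

```python
def total_error_cost(fp: int, fn: int, fp_cost: float, fn_cost: float) -> float:
    """Total cost of a model's mistakes under fixed per-error costs."""
    return fp * fp_cost + fn * fn_cost

# Assumed costs: a missed fraud costs $500, a false alarm costs $5 to review
FP_COST, FN_COST = 5.0, 500.0

# Hypothetical error counts on the same validation set
model_a = {"fp": 50, "fn": 20}   # fewer errors overall
model_b = {"fp": 400, "fn": 5}   # more errors, but far fewer misses

cost_a = total_error_cost(model_a["fp"], model_a["fn"], FP_COST, FN_COST)
cost_b = total_error_cost(model_b["fp"], model_b["fn"], FP_COST, FN_COST)

print(f"Model A: {model_a['fp'] + model_a['fn']} errors, cost ${cost_a:,.0f}")
print(f"Model B: {model_b['fp'] + model_b['fn']} errors, cost ${cost_b:,.0f}")
```

Model B makes almost six times as many mistakes yet costs less than half as much, which is the kind of result accuracy alone would never surface.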

ROC Curve and AUC: Threshold-Independent Evaluation

All the metrics above assume a fixed decision threshold (typically 0.5). But the right threshold depends on your business context and changes as requirements evolve. How do you compare two models before you've even decided on a threshold?

Enter: the ROC (Receiver Operating Characteristic) curve.

ROC plots True Positive Rate (Recall) on the Y-axis against False Positive Rate on the X-axis, across every possible threshold. Each point on the curve is one threshold value.

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

fpr, tpr, thresholds = roc_curve(y_test, y_prob)
auc = roc_auc_score(y_test, y_prob)

plt.figure(figsize=(7, 6))
plt.plot(fpr, tpr, color='steelblue', lw=2, label=f'ROC Curve (AUC = {auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--', lw=1, label='Random Guessing (AUC = 0.5)')
plt.fill_between(fpr, tpr, alpha=0.15, color='steelblue')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Recall / Sensitivity)')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

AUC: Reading the Number

AUC (Area Under the ROC Curve) condenses the entire curve into one number that tells you how well your model ranks positives above negatives.

AUC Value	Interpretation
1.0	Perfect — model always ranks positives above negatives
0.9 – 1.0	Outstanding
0.8 – 0.9	Excellent
0.7 – 0.8	Acceptable
0.5 – 0.7	Poor
0.5	Random guessing — the model has no discriminative ability
< 0.5	Worse than random (flip predictions to get > 0.5)

The AUC has a beautiful probabilistic interpretation: it equals the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example by your model.
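That ranking interpretation can be checked directly by comparing every positive/negative pair. The sketch below uses four toy scores and counts a tie as half a win, which is the usual convention. The function name `pairwise_auc` is my own.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))

# Toy example, invented for illustration
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(pairwise_auc(y_true, scores))  # 0.75: three of four pairs ranked correctly
```

The loop is quadratic in the number of samples, so treat it as a teaching aid; `roc_auc_score` is the practical choice on real data.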

When to Ditch ROC for the Precision-Recall Curve

ROC curves can be overly optimistic on severely imbalanced datasets. Why? FPR (False Positive Rate = FP / (FP + TN)) has the large TN count in the denominator. When there are thousands of true negatives, even many false positives produce a tiny FPR—making your ROC curve look good while your precision is terrible.
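A few lines of arithmetic show the effect. The counts below are hypothetical: 100 positives and 9,900 negatives, with a model that catches 90% of positives at a 5% false positive rate.

```python
# Hypothetical confusion-matrix counts on a 1%-positive dataset
tp, fn = 90, 10        # 100 actual positives
fp, tn = 495, 9405     # 9,900 actual negatives

tpr = tp / (tp + fn)         # recall, the ROC y-axis
fpr = fp / (fp + tn)         # the ROC x-axis
precision = tp / (tp + fp)   # what the PR curve exposes

print(f"TPR:       {tpr:.2f}")
print(f"FPR:       {fpr:.2f}")
print(f"Precision: {precision:.3f}")
```

A point at (FPR 0.05, TPR 0.90) looks strong on a ROC plot, yet roughly five out of six alerts are false alarms.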

Rule of thumb:

Balanced classes → ROC/AUC is reliable.
Heavy class imbalance → Use the Precision-Recall curve and AUC-PR instead.

from sklearn.metrics import average_precision_score, PrecisionRecallDisplay

ap = average_precision_score(y_test, y_prob)
display = PrecisionRecallDisplay.from_predictions(
    y_test, y_prob, name=f"AP = {ap:.3f}"
)
display.ax_.set_title("Precision-Recall Curve")
plt.show()

Log Loss: The Probabilistic Metric You Should Be Using More

Accuracy, precision, and recall all evaluate hard predictions (the 0/1 decision). But your model produces probabilities, and evaluating only the binary output throws away information.

Log loss (cross-entropy) measures how closely your predicted probabilities match the actual outcomes:

Log Loss = -(1/n) × Σ [y_i × log(p_i) + (1 - y_i) × log(1 - p_i)]

In plain terms: predict a 0.99 probability of the positive class for an example that turns out to be negative, and you're penalized harshly. For a true positive, predicting 0.60 instead of 0.51 gives you a better log loss even though both produce the same hard prediction at a 0.5 threshold.
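The per-example penalties are easy to compute by hand. This sketch applies the formula above to single predictions, using only the standard library.

```python
import math

def sample_log_loss(y: int, p: float) -> float:
    """Log loss contribution of one example with true label y and predicted P(y=1) = p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

cases = [
    ("confident and wrong (y=0, p=0.99)", 0, 0.99),
    ("barely right        (y=1, p=0.51)", 1, 0.51),
    ("a bit more sure     (y=1, p=0.60)", 1, 0.60),
]
for name, y, p in cases:
    print(f"{name}: {sample_log_loss(y, p):.3f}")
```

The confident mistake costs several times more than either correct prediction, and the 0.60 prediction beats the 0.51 one despite identical hard labels.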

Log loss is preferred when:

Downstream systems consume probabilities, not labels (e.g., expected value calculations)
You're comparing two models that produce identical accuracy/F1 but different calibration
You're using the output to set a custom business threshold

from sklearn.metrics import log_loss

ll = log_loss(y_test, y_prob)
print(f"Log Loss: {ll:.4f}")
# Perfect model: 0.0
# Random guessing: ln(2) ≈ 0.693

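Lower log loss = probabilities closer to the true outcomes. A log loss above 0.693 means the model is doing worse than predicting 0.5 for every example. On imbalanced data the bar is stricter: compare against always predicting the base rate.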
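The 0.693 reference point corresponds to predicting 0.5 for everything. On imbalanced data, a constant prediction equal to the positive rate already does much better, so that is the fairer baseline. Here is a sketch with a hypothetical 1% positive rate.

```python
import math

def constant_log_loss(base_rate: float, p: float) -> float:
    """Average log loss when predicting probability p for every example,
    given that a fraction base_rate of examples are positive."""
    return -(base_rate * math.log(p) + (1 - base_rate) * math.log(1 - p))

base_rate = 0.01  # hypothetical: 1% of transactions are fraud

print(f"Always predict 0.50: {constant_log_loss(base_rate, 0.5):.4f}")
print(f"Always predict 0.01: {constant_log_loss(base_rate, base_rate):.4f}")
```

On data like this, a model with a log loss of 0.30 beats the 0.693 reference but is still far worse than the trivial base-rate predictor.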

Cross-Validation: Getting Evaluation You Can Trust

Single train/test splits are noisy. If you got lucky (or unlucky) with how the data was randomly partitioned, your metrics may not reflect performance on new data. Cross-validation gives you a more reliable estimate, along with a sense of its variance.

k-Fold Cross-Validation

Split data into k folds. Train on k-1 folds, test on the remaining fold. Repeat k times (once per fold as the test set). Average the k results.
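The fold mechanics can be sketched in a few lines of plain Python. This version does no shuffling or stratification; it only illustrates how each sample ends up in exactly one test fold. The helper name `k_fold_indices` is my own.

```python
def k_fold_indices(n_samples: int, k: int):
    """Yield (train_idx, test_idx) pairs; each sample lands in exactly one test fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 5))
for i, (train_idx, test_idx) in enumerate(folds):
    print(f"Fold {i}: train={train_idx} test={test_idx}")
```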

from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Always put preprocessing inside the pipeline!
# This prevents data leakage from the scaler.
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(C=1.0, max_iter=1000, random_state=42))
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Evaluate multiple metrics
for metric in ['accuracy', 'f1', 'roc_auc', 'neg_log_loss']:
    scores = cross_val_score(pipe, X, y, cv=cv, scoring=metric)
    label = metric.replace('neg_', '-')
    print(f"{label:>15}: {scores.mean():.4f} ± {scores.std():.4f}")

       accuracy: 0.8920 ± 0.0145
             f1: 0.8911 ± 0.0163
        roc_auc: 0.9587 ± 0.0098
      -log_loss: -0.2734 ± 0.0121

Why Stratified, Not Regular k-Fold?

With imbalanced classes, a random split might put almost all the minority class examples in one fold—making some folds impossible to evaluate meaningfully.

StratifiedKFold preserves the class ratio in each fold. Use it by default for classification, especially with imbalanced data. It's almost always the right choice.
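You can see the difference by counting minority-class examples in each test fold. The sketch below uses toy labels with a 10% positive rate; the `KFold` counts depend on the random seed, so treat them as one possible draw.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels: 90 negatives followed by 10 positives
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # features are irrelevant for splitting

def positives_per_fold(splitter):
    """Count positive examples in each test fold produced by the splitter."""
    return [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]

plain = positives_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))
stratified = positives_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
)

print("KFold positives per test fold:          ", plain)
print("StratifiedKFold positives per test fold:", stratified)
```

With stratification every test fold gets its share of the 10 positives, while plain `KFold` leaves the per-fold count to chance.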

7 Pitfalls That Will Silently Break Your Evaluation

1. Reporting Only Accuracy

Symptom: Your model scores 97% accuracy and gets shipped. It catches nothing useful.

Fix: Always report precision, recall, F1, and AUC alongside accuracy. If classes are imbalanced, accuracy is a secondary metric at best.

2. Data Leakage in Preprocessing

Symptom: Suspiciously high validation metrics that don't hold up in production.

Cause: Fitting your scaler, imputer, or feature selector on the full dataset before splitting, letting test-set information influence your transforms.

# ❌ WRONG — scaler sees test data, leaks information
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)           # fit on everything
X_train, X_test = train_test_split(X_scaled) # split afterward

# ✅ CORRECT — use a pipeline or manually split first
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on train only
X_test_scaled = scaler.transform(X_test)        # transform test with train params

3. Using Default 0.5 Threshold Without Questioning It

Symptom: Good AUC, terrible precision or recall in production.

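Fix: Choose the threshold on a validation set, not the test set, so that it reflects the relative cost of false positives and false negatives in your domain.

# Find the threshold that maximizes F1 on validation data (not the test set)
from sklearn.metrics import f1_score

# Carve a validation set out of the training data and refit on the remainder
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42, stratify=y_train
)
val_model = LogisticRegression(C=1.0, max_iter=1000, random_state=42)
val_model.fit(X_fit, y_fit)
val_prob = val_model.predict_proba(X_val)[:, 1]

best_f1, best_threshold = 0, 0.5
for threshold in np.arange(0.1, 0.9, 0.01):
    y_pred_t = (val_prob >= threshold).astype(int)
    f = f1_score(y_val, y_pred_t)
    if f > best_f1:
        best_f1 = f
        best_threshold = threshold

print(f"Best threshold: {best_threshold:.2f}, F1: {best_f1:.4f}")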

4. Ignoring Class Imbalance in CV

Symptom: Cross-validation folds have inconsistent class distributions; some folds fail or give wild metric swings.

Fix: Use StratifiedKFold (shown above). Also consider class_weight='balanced' in your model:

model = LogisticRegression(class_weight='balanced', max_iter=1000)
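As documented by scikit-learn, the 'balanced' mode sets each class weight to n_samples / (n_classes * count of that class). The sketch below reproduces that formula in plain Python on the fraud example's class counts; the helper name `balanced_class_weights` is my own.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Same class counts as the fraud example: 9,900 legitimate, 100 fraudulent
y = [0] * 9900 + [1] * 100
weights = balanced_class_weights(y)

print(f"Weight for class 0: {weights[0]:.3f}")
print(f"Weight for class 1: {weights[1]:.3f}")
```

Each fraud example ends up counting about 99 times as much as a legitimate one in the loss, which pushes the model toward higher recall, usually at some cost in precision.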

5. Evaluating on the Test Set More Than Once

Symptom: You iterate by checking test metrics, making changes, re-checking—and unknowingly over-fit to the test set.

Fix: Use a three-way split or cross-validation for development; touch the test set exactly once for final reporting.

# Three-way split
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)
# Train on X_train, tune on X_val, final eval on X_test (once!)

6. Over-Relying on AUC-ROC with Severe Imbalance

Symptom: ROC-AUC looks great; actual fraud/disease detection rate is awful.

Fix: Switch to AUC-PR (average_precision_score) for heavily imbalanced problems.

7. Skipping a Baseline Comparison

Symptom: You report 0.87 AUC with no context.

Fix: Always compare against a dummy baseline.

from sklearn.dummy import DummyClassifier

dummy = DummyClassifier(strategy='most_frequent')
baseline_auc = cross_val_score(dummy, X, y, cv=5, scoring='roc_auc').mean()
model_auc = cross_val_score(pipe, X, y, cv=5, scoring='roc_auc').mean()

print(f"Baseline AUC: {baseline_auc:.3f}")
print(f"Model AUC:    {model_auc:.3f}")
print(f"Improvement:  {model_auc - baseline_auc:+.3f}")

The Complete Evaluation Workflow

Here's what we should actually run before declaring a model ready:

import matplotlib.pyplot as plt
from sklearn.metrics import (
    classification_report, confusion_matrix, ConfusionMatrixDisplay,
    roc_auc_score, average_precision_score, log_loss
)

def evaluate_classifier(model, X_test, y_test, threshold=0.5):
    y_prob = model.predict_proba(X_test)[:, 1]
    y_pred = (y_prob >= threshold).astype(int)

    print("=" * 50)
    print(f"EVALUATION REPORT (threshold = {threshold})")
    print("=" * 50)

    # 1. Classification report (precision, recall, F1 per class)
    print("\n--- Per-Class Metrics ---")
    print(classification_report(y_test, y_pred))

    # 2. Probabilistic metrics
    print(f"ROC-AUC:       {roc_auc_score(y_test, y_prob):.4f}")
    print(f"PR-AUC:        {average_precision_score(y_test, y_prob):.4f}")
    print(f"Log Loss:      {log_loss(y_test, y_prob):.4f}")

    # 3. Confusion matrix
    print("\n--- Confusion Matrix ---")
    cm = confusion_matrix(y_test, y_pred)
    disp = ConfusionMatrixDisplay(cm)
    disp.plot(cmap='Blues')
    plt.title(f"Confusion Matrix (threshold={threshold})")
    plt.tight_layout()
    plt.show()

evaluate_classifier(model, X_test, y_test, threshold=0.5)

Practical Takeaways Checklist

Before you call any classifier production-ready:

[ ] Look at the confusion matrix first. Numbers before plots.
[ ] Report precision, recall, and F1 for the minority class, not just overall accuracy.
[ ] Use StratifiedKFold cross-validation to get reliable metric estimates.
[ ] Compare ROC-AUC and PR-AUC. If classes are imbalanced, PR-AUC is your primary signal.
[ ] Check log loss to verify your probabilities are well-calibrated, not just your hard predictions.
[ ] Question the 0.5 threshold. Tune it to match the real cost of FP vs. FN in your domain.
[ ] Use a Pipeline to prevent data leakage from preprocessing steps.
[ ] Run a DummyClassifier baseline before celebrating your AUC score.
[ ] Reserve your test set. If you've looked at it more than once during development, it's a validation set.
[ ] Tie your metric choice to a business outcome. "We want to catch 90% of churners while maintaining > 60% precision" beats "maximize F1."
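A business target like the one in the last item can be turned into a threshold search. This is a minimal sketch on toy validation scores invented for illustration; the helper names and the 80% / 60% targets are assumptions, not a prescribed method.

```python
def precision_recall_at(y_true, y_prob, threshold):
    """Precision and recall when predicting positive for scores >= threshold."""
    pairs = list(zip(y_true, y_prob))
    tp = sum(1 for t, p in pairs if t == 1 and p >= threshold)
    fp = sum(1 for t, p in pairs if t == 0 and p >= threshold)
    fn = sum(1 for t, p in pairs if t == 1 and p < threshold)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def pick_threshold(y_true, y_prob, min_recall, min_precision):
    """Return the highest candidate threshold meeting both targets, or None."""
    for threshold in sorted(set(y_prob), reverse=True):
        precision, recall = precision_recall_at(y_true, y_prob, threshold)
        if recall >= min_recall and precision >= min_precision:
            return threshold
    return None

# Toy validation labels and scores, invented for illustration
y_val = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
p_val = [0.95, 0.85, 0.70, 0.55, 0.30, 0.60, 0.40, 0.20, 0.10, 0.05]

print(pick_threshold(y_val, p_val, min_recall=0.8, min_precision=0.6))  # 0.55
```

A return value of `None` is useful information too: it means no threshold satisfies both targets, so either the model or the targets need to change.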

Final Thought

Model evaluation isn't about finding the best model in the abstract—it's about finding the right model for your specific problem. A 95% accuracy model can be completely useless. An 80% accuracy model can save lives or prevent fraud, depending on where it's wrong.

The metrics are just tools. The judgment—knowing which errors your system can tolerate and which it can't—is what makes you a useful engineer, not just a code runner.

Go measure wisely.

Found this useful? I'd love to hear which pitfall stung you hardest—drop it in the comments.

DEV Community