Qss Technosoft

Posted on Jun 11

Your Churn Model's 80% Accuracy Is Lying to You

#datascience #tutorial #python #machinelearning

We've built churn models for clients in logistics, SaaS, and healthcare. The pattern is always the same: someone trains a classifier, sees 80% accuracy, and declares victory. Then the model ships, the retention team acts on it, and nothing improves.

The model wasn't broken. The measurement was.

This post walks through the gap between a churn model that scores well in a notebook and one that actually changes a business outcome. We'll use real, messy data — not a synthetic dataset rigged to make the model look smart — and spend most of our time on the parts that decide whether a churn project succeeds or quietly fails. Spoiler: almost none of it is the algorithm.

Basic Python is enough to follow along. The mindset is the part worth taking away.

The 80% accuracy trap

Let's start with the number everyone reaches for first.

We'll use the Telco Customer Churn dataset — a real, public dataset with ~7,000 customers. It's a good stand-in for what a SaaS or telecom company actually has: a mix of contract types, payment methods, service add-ons, and tenure.

The first thing to know about it is the churn rate: about 26.5% of customers churned. Which means a model that predicts "nobody ever churns" is already 73.5% accurate and has done literally nothing.

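from sklearn.dummy import DummyClassifier

# X_train, X_test, y_train, y_test come from the stratified
# train/test split built in the pipeline section further down.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
print(f"Baseline accuracy: {baseline.score(X_test, y_test):.2%}")
# Baseline accuracy: 73.46%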

So when your real model hits 80%, you haven't gained 80 points of insight. You've gained about six and a half points (80% minus 73.46%) over a model that does nothing. That's the number you should be reporting to leadership, and it's the number most churn write-ups conveniently omit.

This is the first rule we apply on every engagement: establish the dumb baseline before you celebrate the smart model. If you can't beat "predict the majority class" by a meaningful margin on the metric that matters, you don't have a model; you have the majority-class guess with extra steps.
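To make the "points over baseline" framing concrete, here is a minimal sketch of the arithmetic using the two figures quoted above (80% and 73.46%). The helper name lift_over_baseline is ours, and the "share of baseline errors removed" figure is an extra framing we are adding for illustration, not a number from the original analysis.

```python
def lift_over_baseline(model_acc, baseline_acc):
    """Return (absolute gain in percentage points, share of baseline errors removed)."""
    gain_points = (model_acc - baseline_acc) * 100
    error_reduction = (model_acc - baseline_acc) / (1 - baseline_acc)
    return gain_points, error_reduction


# Figures from the text: 80% model accuracy vs. the 73.46% majority-class baseline.
gain, err_cut = lift_over_baseline(0.80, 0.7346)
print(f"Gain over baseline: {gain:.1f} points")       # Gain over baseline: 6.5 points
print(f"Baseline errors removed: {err_cut:.1%}")      # Baseline errors removed: 24.6%
```

Either number is a more honest headline than "80% accurate", because both are measured against what you get for free.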

Loading real data (and the gotcha that breaks it)

Real data fights back. The Telco dataset has a well-known trap: TotalCharges is stored as a string, and brand-new customers (tenure = 0) have a blank space instead of a number. Load it naively and your pipeline either crashes or silently treats a numeric column as categorical text.

import pandas as pd

df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")

# TotalCharges looks numeric but isn't — 11 rows are " " (blank).
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
print(f"Missing TotalCharges after coercion: {df['TotalCharges'].isna().sum()}")
# Missing TotalCharges after coercion: 11

# These are tenure-0 customers who haven't been billed yet.
# A defensible choice: their total charges are effectively 0.
df["TotalCharges"] = df["TotalCharges"].fillna(0)

# Build the target and drop the ID (an ID has zero predictive value
# and is a classic accidental-leakage vector if left in).
y = (df["Churn"] == "Yes").astype(int)
X = df.drop(columns=["customerID", "Churn"])

That blank-space gotcha is trivial once you know it's there. The point isn't this one dataset — it's that every real dataset has its own version of this, and finding it is the unglamorous work that separates a model that holds up from one that breaks the first week in production.
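One way to hunt for this class of problem in an unfamiliar dataset is to look for text columns that mostly parse as numbers. The sketch below is an assumption about how you might automate that check; the helper name, the 0.8 cutoff, and the toy frame are all made up for illustration, and only the blank TotalCharges quirk mirrors the real dataset.

```python
import pandas as pd


def numeric_like_text_columns(df, min_parse_rate=0.8):
    """Map each non-numeric column that mostly parses as numbers
    to the count of values that fail to parse."""
    suspects = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            continue  # already numeric, nothing to audit
        parsed = pd.to_numeric(df[col], errors="coerce")
        if parsed.notna().mean() >= min_parse_rate:
            suspects[col] = int(parsed.isna().sum())
    return suspects


# Toy frame mimicking the Telco quirk: one blank TotalCharges at tenure 0.
toy = pd.DataFrame({
    "tenure": [0, 1, 5, 12, 24, 36, 48, 60, 66, 72],
    "TotalCharges": [" ", "29.85", "150.5", "840", "1800.2", "2500",
                     "3300.75", "4100", "4600.1", "5200"],
    "Contract": ["Month-to-month"] * 5 + ["Two year"] * 5,
})
print(numeric_like_text_columns(toy))  # {'TotalCharges': 1}
```

A column that shows up here deserves a manual look: the unparseable values are usually where the interesting data-entry story lives.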

Build it right: pipeline, not a pile of scripts

Here's the most common silent bug we find in inherited churn code: the encoder or scaler is fit on the entire dataset before the train/test split. That leaks information from the test set into training, and your reported accuracy becomes a fantasy.

The fix is to do all preprocessing inside a Pipeline and ColumnTransformer, fit on the training data only. This also makes the model deployable as a single object instead of a fragile chain of manual steps.

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

num_cols = ["tenure", "MonthlyCharges", "TotalCharges"]
cat_cols = [c for c in X.columns if c not in num_cols]

# Stratify so the 26.5% churn rate is preserved in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

preprocess = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

model = Pipeline([
    ("pre", preprocess),
    # class_weight handles the imbalance so the model doesn't just
    # learn to predict "no churn" for everyone.
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])

model.fit(X_train, y_train)

Three deliberate choices here, each one a lesson paid for in production:

stratify=y — without it, an unlucky split can hand you train and test sets with different churn rates, and your evaluation lies to you.
handle_unknown="ignore" — in production you will see a category the model never trained on. This stops it from crashing at 2 a.m.
class_weight="balanced" — with 26.5% positives, a model optimizing raw accuracy is tempted to ignore churners entirely. This forces it to take them seriously.
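To see what handle_unknown="ignore" actually does, here is a small standalone sketch. The "Five year" category is invented for the example; it stands in for any value that shows up in production but never appeared in training.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Categories seen at training time.
train = np.array([["Month-to-month"], ["One year"], ["Two year"]])
# Live data: one known category and one made-up, never-seen category.
live = np.array([["Two year"], ["Five year"]])

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# transform() returns a sparse matrix by default; densify it for printing.
encoded = encoder.transform(live).toarray()
print(encoded)
# [[0. 0. 1.]
#  [0. 0. 0.]]
```

The unseen category becomes a row of zeros instead of an exception. That keeps the service up, but the model is now scoring that customer as if they had none of the known contract types, so it is worth logging unknown categories rather than ignoring them silently.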

Measure what the business actually cares about

Accuracy is the wrong headline metric for an imbalanced problem. The metrics that matter for churn are about ranking and catching the right people.

from sklearn.metrics import roc_auc_score, average_precision_score, classification_report

proba = model.predict_proba(X_test)[:, 1]

print(f"ROC AUC: {roc_auc_score(y_test, proba):.3f}")
print(f"PR AUC:  {average_precision_score(y_test, proba):.3f}")
print(classification_report(y_test, (proba >= 0.5).astype(int)))

ROC AUC asks: if you pick a random churner and a random non-churner, how often does the model score the churner higher? PR AUC is the metric to watch when positives are rare, because it focuses on how clean your "likely to churn" list actually is. Report these, not accuracy.

And in the classification report, recall on the churn class is usually the number that matters most — missing a churner means losing a customer, while a false alarm just means a cheap retention offer went to someone who'd have stayed. Which brings us to the part nobody teaches and everybody needs.
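The pairwise reading of ROC AUC described above can be checked by brute force. The sketch below uses six made-up labels and scores (not output from the Telco model) and counts how often a churner outranks a non-churner, with ties counted as half a win.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the share of (churner, non-churner) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # churner scored higher: correct ordering
            elif p == n:
                wins += 0.5   # tie: count as half
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = churned) and model scores.
y_toy = [1, 0, 1, 0, 0, 1]
s_toy = [0.9, 0.4, 0.35, 0.2, 0.6, 0.7]
print(f"Pairwise AUC: {pairwise_auc(y_toy, s_toy):.3f}")  # Pairwise AUC: 0.778
```

Seven of the nine churner/non-churner pairs are ordered correctly, hence 0.778. Notice that no threshold appears anywhere in the calculation: ROC AUC measures ranking only.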

The 0.5 threshold is an accident, not a decision

predict() uses a 0.5 probability cutoff by default. There is no business reason for 0.5. The right threshold depends entirely on the economics of the action you take.

Suppose a retention offer costs $50, a saved customer is worth $500 in retained lifetime value, and the offer succeeds 40% of the time. Now we can find the threshold that maximizes money, not accuracy:

import numpy as np

offer_cost   = 50    # cost of making a retention offer
save_value   = 500   # value of a customer we successfully retain
success_rate = 0.40  # offers that actually work

best_t, best_ev = 0.5, -np.inf
for t in np.linspace(0.05, 0.95, 19):
    flagged          = proba >= t
    n_offers         = flagged.sum()
    churners_flagged = (flagged & (y_test.values == 1)).sum()

    cost  = n_offers * offer_cost
    saved = churners_flagged * success_rate * save_value
    ev    = saved - cost

    if ev > best_ev:
        best_ev, best_t = ev, t

print(f"Best threshold: {best_t:.2f}  |  Expected value on test set: ${best_ev:,.0f}")

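Run this and the optimal threshold almost never lands on 0.5. Change the offer cost or the LTV and it moves again. One caution: the loop above picks the threshold on the same test set it reports on, so the printed expected value is optimistic. On a real project, choose the threshold on a separate validation split and report the value on data the threshold never saw. This is the deliverable. A churn model that hands the business a tuned, economics-aware threshold is worth real money. A churn model that hands them predict() output at 0.5 is a science-fair project.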
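The same economics also give a closed-form sanity check. Under the loop's assumptions, an offer to a customer with true churn probability p is worth p × success_rate × save_value − offer_cost, which is positive only when p exceeds offer_cost / (success_rate × save_value). The sketch below just does that arithmetic with the example figures; the function names are ours.

```python
def break_even_probability(offer_cost, save_value, success_rate):
    """Churn probability above which an offer has positive expected value."""
    return offer_cost / (success_rate * save_value)


def offer_expected_value(p_churn, offer_cost, save_value, success_rate):
    """Expected value of making one offer to a customer with churn probability p_churn."""
    return p_churn * success_rate * save_value - offer_cost


# Example figures from the text: $50 offer, $500 saved value, 40% success rate.
p_star = break_even_probability(50, 500, 0.40)
print(f"Break-even churn probability: {p_star:.2f}")  # Break-even churn probability: 0.25

for p in (0.10, 0.60):
    ev = offer_expected_value(p, 50, 500, 0.40)
    print(f"p={p:.2f} -> expected value per offer: ${ev:,.0f}")
```

This break-even point only lines up with a cutoff on model scores if those scores are calibrated probabilities. When they are not, the empirical sweep over thresholds is the safer guide, and a large gap between the two is itself a hint that calibration needs attention.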

If you act on probabilities, calibrate them

There's a subtle trap once you start using predict_proba for decisions: many models output scores that rank well but aren't true probabilities. A model can say "80% likely to churn" for a group that churns 50% of the time. If your retention budget is allocated by predicted probability, miscalibration burns money directly.

from sklearn.metrics import brier_score_loss
from sklearn.calibration import calibration_curve

print(f"Brier score: {brier_score_loss(y_test, proba):.3f}")  # lower is better
prob_true, prob_pred = calibration_curve(y_test, proba, n_bins=10)
# Plot prob_true vs prob_pred; the closer to the diagonal, the better calibrated.

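Logistic regression is reasonably calibrated out of the box, which is one reason it's still our default first model. One caveat: class_weight="balanced", as used in the pipeline above, reweights the classes and pushes predicted churn probabilities upward, so check the calibration curve (or refit without class weights) before treating those scores as true probabilities. Tree ensembles like gradient boosting often rank better but need CalibratedClassifierCV wrapped around them before you trust their probabilities. Decide based on probabilities, and you've signed up to check calibration.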
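To see how the Brier score punishes the miscalibration described earlier (a group scored at 80% that actually churns 50% of the time), here is a hand-rolled version on ten made-up customers. It is the same mean squared error that brier_score_loss computes, written out so the arithmetic is visible.

```python
def brier_score(y_true, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)


# Hypothetical group of ten customers, five of whom churn.
y_group = [1] * 5 + [0] * 5

overconfident = [0.8] * 10   # model claims 80% churn risk for everyone
calibrated = [0.5] * 10      # matches the group's real 50% churn rate

print(f"Overconfident: {brier_score(y_group, overconfident):.2f}")  # Overconfident: 0.34
print(f"Calibrated:    {brier_score(y_group, calibrated):.2f}")     # Calibrated:    0.25
```

Both sets of scores rank the group identically, so ROC AUC cannot tell them apart. The Brier score can, which is why it belongs next to the ranking metrics once probabilities drive spending.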

Interpretability that survives a stakeholder meeting

Logistic regression earns its keep here: you can show exactly which factors drive churn, in plain terms, to a non-technical room.

feature_names = model.named_steps["pre"].get_feature_names_out()
coefs = model.named_steps["clf"].coef_[0]

drivers = (
    pd.DataFrame({"feature": feature_names, "weight": coefs})
    .sort_values("weight", ascending=False)
)
print(drivers.head(8))   # strongest churn drivers
print(drivers.tail(8))   # strongest retention drivers

On this dataset, month-to-month contracts and fiber-optic service push hard toward churn; long tenure and two-year contracts pull strongly the other way. That's not a black-box prediction — it's a list of levers the business can actually pull. "Move month-to-month customers onto annual contracts" is a strategy. "The neural net said so" is not.
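For a non-technical room, it often helps to turn the raw weights into odds ratios by exponentiating them. The weights below are hypothetical placeholders, not the fitted values from this model; the feature names follow the prefix pattern that get_feature_names_out() produces for the ColumnTransformer above.

```python
import math

# Hypothetical log-odds weights, for illustration only.
hypothetical_weights = {
    "cat__Contract_Month-to-month": 0.70,
    "num__tenure": -1.20,
}

# exp(weight) is the multiplicative change in churn odds for a one-unit increase.
odds_ratios = {name: math.exp(w) for name, w in hypothetical_weights.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio {ratio:.2f}")
# cat__Contract_Month-to-month: odds ratio 2.01
# num__tenure: odds ratio 0.30
```

With these made-up numbers, a month-to-month contract would roughly double the odds of churn. Because the numeric columns pass through StandardScaler, "one unit" of tenure means one standard deviation, not one month, so say that out loud when presenting it.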

What actually breaks in production

Everything above gets you a defensible model. Keeping it useful is a different job, and it's where most of our client work actually lives:

Leakage hides in the schema. The single most common reason a churn model shows suspiciously high accuracy is a feature that's only populated after the customer churns — a cancellation reason code, a final-month billing adjustment, an account-status flag. We audit every feature against the question "would this value exist at prediction time?" before trusting a single metric.
Models drift. Customer behavior shifts with pricing changes, new competitors, and seasonality. A model trained last year quietly degrades. It needs monitoring on live performance, not just a one-time test-set score.
The handoff is the hard part. A churn score sitting in a database changes nothing. The value appears only when it's wired into a workflow — a CRM trigger, a retention queue, an automated offer — with a feedback loop that records whether the intervention worked. For a logistics client, getting that loop right is what turned a model into a measurable operating-cost reduction. The model was maybe 20% of the effort.
Compliance is non-negotiable in regulated industries. When we build predictive models in healthcare, the model is the easy part; doing it inside HIPAA constraints, with auditable decisions, is the engagement.
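Part of the leakage audit described above can be roughed out in code. The sketch below flags columns that are populated for churned customers but almost never for anyone else, which is the typical footprint of a field filled in after cancellation. The helper name, the 0.9 gap, and the toy accounts table (including the cancel_reason column) are all assumptions for illustration.

```python
import pandas as pd


def post_outcome_suspects(df, target, min_gap=0.9):
    """Flag columns filled in for churners but (almost) never for everyone else."""
    churned = df[target] == 1
    suspects = []
    for col in df.columns.drop(target):
        filled_churned = df.loc[churned, col].notna().mean()
        filled_active = df.loc[~churned, col].notna().mean()
        if filled_churned - filled_active >= min_gap:
            suspects.append(col)
    return suspects


# Toy accounts table; cancel_reason is a hypothetical post-churn field.
accounts = pd.DataFrame({
    "tenure": [2, 40, 5, 60, 1, 24],
    "cancel_reason": ["price", None, "support", None, "price", None],
    "churned": [1, 0, 1, 0, 1, 0],
})
print(post_outcome_suspects(accounts, "churned"))  # ['cancel_reason']
```

A fill-rate gap catches only one kind of leak. A field that is always populated but updated after churn (a status flag, a final bill) will pass this check, so it supplements the manual "would this value exist at prediction time?" review rather than replacing it.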

The framing skill underneath all of this is the same one that decides every ML project before a line of code is written: turning "we're losing customers" into "predict P(churn) per active account each month, and trigger a $50 offer above the break-even threshold." Get that framing right and a 30-line logistic regression delivers value. Get it wrong and a 200-million-parameter model delivers a dashboard nobody acts on.

Where to take this next

If you want to push the model itself further: try gradient boosting (XGBoost, LightGBM) and compare on PR-AUC, not accuracy; use cross-validation instead of a single split; and wrap probability calibration around any tree ensemble before acting on its scores. But be honest about where the marginal return is. On most real churn problems, a clean logistic-regression pipeline with a properly tuned threshold beats a fancier model with a careless one — and it's far easier to explain, deploy, and defend.

Written by the team at QSS Technosoft, where we build and ship production ML and AI systems — from churn and forecasting models to generative AI — for clients in healthcare, fintech, and logistics. If you've got a model that scores well but isn't moving a number that matters, that gap is exactly the work we do.
