Mansi Somayajula
Your ML Model Is Lying to You — Here's How to Catch It.

The car that pulls to the right

Have you ever driven a car that slowly starts pulling to the right?

At first you don't even notice. You unconsciously make tiny steering corrections. Days pass. Weeks pass. You adapt so naturally that it feels normal.

Then someone else gets in your car, drives for thirty seconds, and immediately says — "something's wrong with this car."

You hadn't noticed. Not because you weren't paying attention. But because the change was so gradual, so slow, that your brain adjusted to it without raising an alarm.

This is exactly what happens to ML models in production.

They don't suddenly break. They drift. Slowly, quietly, consistently — until one day a user complains, a metric tanks, or someone from outside your team says "this thing isn't working right."

By then? The drift has been happening for weeks.

In this post I want to show you what drift actually looks like, why it's so easy to miss, and — most importantly — how to catch it before it catches you.


What is drift, really?

Before we get into code, let me make this concrete with something you experience every day.

The coffee shop analogy

Imagine a coffee shop near a big office building. For three years, they've had a morning rush of office workers every weekday at 8 AM. They've trained their staff, stocked their inventory, and scheduled their deliveries perfectly around this pattern.

Then the company in that office building goes fully remote. Overnight, the morning rush disappears.

The coffee shop's "model" of the world — built on three years of data — is now wrong. They're still ordering the same supplies. Still scheduling the same staff. Still expecting the same rush.

Nothing broke. The espresso machine still works. The staff still show up. But the predictions are wrong because the world changed and the model didn't.

This is data drift. The inputs to your model — the patterns, the distributions, the behaviors — have shifted away from what your model was trained on.

The weather app analogy

Now imagine a weather app trained entirely on summer data from California. Sunshine, warm temperatures, low humidity. It gets really good at predicting California summers.

Then it gets deployed to predict weather in... Minnesota. In January.

The model isn't broken. It's confidently predicting sunshine and 75°F. It's just completely wrong — because the world it's being asked to predict looks nothing like the world it learned from.

This is concept drift. The relationship between inputs and outputs has changed. What used to be true no longer is.
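Concept drift is the hardest type to detect automatically, because you need ground-truth labels to see it. Here's a minimal sketch of the idea, assuming labels eventually arrive (e.g. a week later you learn whether a user actually churned): compare accuracy on a recent labeled window against a known-good baseline. The function name and threshold are illustrative, not from any particular library.

```python
import numpy as np

def concept_drift_check(baseline_accuracy: float,
                        recent_preds: np.ndarray,
                        recent_labels: np.ndarray,
                        tolerance: float = 0.05) -> bool:
    """
    Compare accuracy on a recent labeled window against a
    known-good baseline. A drop beyond `tolerance` suggests the
    input-to-output relationship itself has changed.
    """
    recent_accuracy = float(np.mean(recent_preds == recent_labels))
    drop = baseline_accuracy - recent_accuracy

    if drop > tolerance:
        print(f"⚠️  Possible concept drift: accuracy fell {drop:.1%} "
              f"({baseline_accuracy:.1%} → {recent_accuracy:.1%})")
        return True
    return False

# The model keeps confidently predicting "sunny" patterns,
# but the world it's predicting has changed
preds  = np.array([1, 1, 1, 1, 0, 1, 1, 1])
labels = np.array([0, 0, 1, 0, 0, 1, 0, 0])
concept_drift_check(baseline_accuracy=0.92,
                    recent_preds=preds,
                    recent_labels=labels)
```

The catch, of course, is label delay: until the labels show up, concept drift is invisible, which is exactly why the input and prediction checks below matter.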

The gym scale analogy

One more. Imagine a scale at your gym that slowly starts reading 5 pounds lighter than it should.

You weigh yourself every Monday. The numbers look good! You're making progress! You feel great!

But the numbers are quietly, consistently wrong. Not dramatically wrong. Just... off. By a small enough margin that nothing looks suspicious — until you step on a different scale and suddenly wonder what happened.

This is measurement drift. The data coming into your system is corrupted in a subtle way — and because it's subtle, it's the hardest kind to catch.


Why drift is the sneakiest problem in ML

Here's what makes drift so dangerous compared to regular bugs.

A regular bug is obvious. The system crashes. An error gets thrown. Something visibly breaks. You know immediately that something is wrong.

Drift gives you no such signal. Everything keeps running. Predictions keep coming out. The API returns 200. Your dashboards show green. And somewhere underneath all of that apparent health, your model is getting progressively worse at its job.

I think of it like carbon monoxide. Unlike smoke, you can't see it. You can't smell it. The first sign that something's wrong might be when it's already too late. That's exactly why carbon monoxide detectors exist — not to tell you there's a problem after you're sick, but to catch the invisible signal early.

ML monitoring is your carbon monoxide detector.


The three types of drift you need to watch

Not all drift is the same. Here are the three types I've found most important to monitor:

Type 1: Input drift (the coffee shop problem)

The distribution of your input data changes over time.

Everyday example: A recommendation system trained when most users were in their 20s suddenly has to serve an older demographic. The inputs look different — different browsing patterns, different purchase histories, different time-of-day behaviors.

What to watch: Average values, variance, and distribution of your input features over time. If they start shifting away from your training distribution, your model is operating in territory it wasn't trained for.

import numpy as np
from scipy import stats
import json
from datetime import datetime

class InputDriftDetector:
    def __init__(self, reference_data: np.ndarray, feature_name: str):
        """
        reference_data: your training data distribution (sample)
        feature_name: name of the feature you're monitoring
        """
        self.reference = reference_data
        self.feature_name = feature_name
        self.alerts = []

    def check(self, current_data: np.ndarray, threshold: float = 0.05) -> bool:
        """
        Uses KS test to compare current data distribution
        against training distribution.

        p_value < threshold = distributions are significantly different = drift!
        Think of it like: "are these two groups of numbers
        drawn from the same population?"
        """
        ks_stat, p_value = stats.ks_2samp(self.reference, current_data)

        drift_detected = p_value < threshold

        result = {
            "timestamp": datetime.utcnow().isoformat(),
            "feature": self.feature_name,
            "ks_statistic": round(ks_stat, 4),
            "p_value": round(p_value, 4),
            "drift_detected": drift_detected
        }

        if drift_detected:
            self.alerts.append(result)
            print(f"⚠️  Input drift detected in '{self.feature_name}'!")
            print(f"    KS stat: {ks_stat:.4f} | p-value: {p_value:.4f}")
            print(f"    Your input distribution has shifted from training.")

        return drift_detected

# Example usage
# Simulate training distribution: mostly young users (20-35)
training_ages = np.random.normal(loc=28, scale=5, size=1000)

# Simulate current distribution: older users (40-55)
current_ages = np.random.normal(loc=47, scale=6, size=200)

detector = InputDriftDetector(training_ages, feature_name="user_age")
detector.check(current_ages)
# ⚠️  Input drift detected in 'user_age'!

Type 2: Prediction drift (the weather app problem)

The distribution of your model's outputs changes over time.

Everyday example: A customer churn classifier that used to predict "will churn" for 15% of users now predicts it for 40%. Did behavior actually change? Or is the model going off the rails?

What to watch: The distribution of predicted classes or predicted probabilities. Significant shifts — without a corresponding shift in actual outcomes — are a red flag.

class PredictionDriftDetector:
    def __init__(self, baseline_predictions: list):
        """
        baseline_predictions: list of predictions from a known-good period
        """
        self.baseline = baseline_predictions
        self.baseline_pos_rate = sum(baseline_predictions) / len(baseline_predictions)

    def check(self, current_predictions: list, tolerance: float = 0.1) -> bool:
        """
        Checks if the positive prediction rate has shifted
        significantly from the baseline.

        Like checking: "are we predicting churn for WAY more
        people than we used to? If so — something changed."
        """
        current_pos_rate = sum(current_predictions) / len(current_predictions)
        shift = abs(current_pos_rate - self.baseline_pos_rate)

        drift_detected = shift > tolerance

        print(f"📊 Prediction distribution check:")
        print(f"   Baseline positive rate: {self.baseline_pos_rate:.2%}")
        print(f"   Current positive rate:  {current_pos_rate:.2%}")
        print(f"   Shift: {shift:.2%}")

        if drift_detected:
            print(f"⚠️  Prediction drift detected! Shift exceeds {tolerance:.0%} tolerance.")
            if current_pos_rate > self.baseline_pos_rate:
                print(f"   Model is predicting positive MORE often than before.")
            else:
                print(f"   Model is predicting positive LESS often than before.")
        else:
            print(f"✅ Prediction distribution looks stable.")

        return drift_detected

# Example
baseline = [1 if x > 0.85 else 0 for x in np.random.random(1000)]  # ~15% positive rate
current  = [1 if x > 0.60 else 0 for x in np.random.random(200)]   # ~40% positive rate

detector = PredictionDriftDetector(baseline)
detector.check(current)

Type 3: Data quality drift (the broken scale problem)

The data coming into your system starts degrading in quality — missing values increase, formats change, upstream systems start sending garbage.

Everyday example: An e-commerce recommendation engine that relies on product view data. Silently, a front-end change stops logging certain types of product views. The model still runs. But it's now working with incomplete information — like a chef trying to cook a recipe when someone quietly removed half the ingredients from the kitchen without telling them.

class DataQualityMonitor:
    def __init__(self, 
                 baseline_null_rate: float,
                 baseline_value_range: tuple):
        """
        baseline_null_rate: expected % of null values (from training data)
        baseline_value_range: expected (min, max) of values
        """
        self.baseline_null_rate = baseline_null_rate
        self.value_min, self.value_max = baseline_value_range
        self.issues = []

    def check(self, data: list, feature_name: str) -> dict:
        issues_found = []

        # Check null rate
        null_rate = sum(1 for x in data if x is None) / len(data)
        if null_rate > self.baseline_null_rate * 2:  # alert if 2x baseline
            issues_found.append(
                f"Null rate: {null_rate:.1%} (baseline: {self.baseline_null_rate:.1%})"
            )

        # Check value range
        valid_values = [x for x in data if x is not None]
        if valid_values:
            actual_min = min(valid_values)
            actual_max = max(valid_values)

            if actual_min < self.value_min or actual_max > self.value_max:
                issues_found.append(
                    f"Values out of range: [{actual_min}, {actual_max}] "
                    f"(expected: [{self.value_min}, {self.value_max}])"
                )

        if issues_found:
            print(f"⚠️  Data quality issues in '{feature_name}':")
            for issue in issues_found:
                print(f"   - {issue}")
        else:
            print(f"'{feature_name}' data quality looks good.")

        return {"feature": feature_name, "issues": issues_found}

# Example
monitor = DataQualityMonitor(
    baseline_null_rate=0.02,    # 2% nulls in training data
    baseline_value_range=(0, 100)  # values expected between 0 and 100
)

# Simulate degraded data: more nulls, some out-of-range values
bad_data = [None, None, None, None, None, 45, 23, 150, 67, 89, None, 34]
monitor.check(bad_data, feature_name="purchase_amount")

Putting it all together: a simple drift monitoring pipeline

Here's a minimal but real drift monitoring setup you can drop into any ML project:

from dataclasses import dataclass
from typing import List
from datetime import datetime
import numpy as np

@dataclass
class DriftAlert:
    timestamp: str
    drift_type: str
    feature: str
    severity: str  # "low", "medium", "high"
    message: str

class MLDriftMonitor:
    """
    A simple, practical drift monitor for production ML systems.

    Think of this as your smoke detector — runs quietly in the
    background, only makes noise when something actually needs
    your attention.
    """

    def __init__(self, model_name: str):
        self.model_name = model_name
        self.alerts: List[DriftAlert] = []
        self.check_count = 0

    def check_input_drift(self, 
                          feature_name: str,
                          reference: np.ndarray, 
                          current: np.ndarray,
                          threshold: float = 0.05):
        from scipy import stats
        _, p_value = stats.ks_2samp(reference, current)

        if p_value < threshold:
            severity = "high" if p_value < 0.01 else "medium"
            self.alerts.append(DriftAlert(
                timestamp=datetime.utcnow().isoformat(),
                drift_type="input_drift",
                feature=feature_name,
                severity=severity,
                message=f"Input distribution shifted (p={p_value:.4f})"
            ))

    def check_prediction_drift(self,
                               baseline_preds: list,
                               current_preds: list,
                               tolerance: float = 0.1):
        baseline_rate = sum(baseline_preds) / len(baseline_preds)
        current_rate = sum(current_preds) / len(current_preds)
        shift = abs(current_rate - baseline_rate)

        if shift > tolerance:
            severity = "high" if shift > 0.2 else "medium"
            self.alerts.append(DriftAlert(
                timestamp=datetime.utcnow().isoformat(),
                drift_type="prediction_drift",
                feature="model_output",
                severity=severity,
                message=f"Prediction rate shifted by {shift:.1%}"
            ))

    def check_data_quality(self,
                           feature_name: str,
                           data: list,
                           max_null_rate: float = 0.05):
        null_rate = sum(1 for x in data if x is None) / len(data)

        if null_rate > max_null_rate:
            severity = "high" if null_rate > 0.2 else "medium"
            self.alerts.append(DriftAlert(
                timestamp=datetime.utcnow().isoformat(),
                drift_type="data_quality",
                feature=feature_name,
                severity=severity,
                message=f"Null rate: {null_rate:.1%} (max allowed: {max_null_rate:.1%})"
            ))

    def report(self):
        print(f"\n{'='*50}")
        print(f"Drift Monitor Report — {self.model_name}")
        print(f"{'='*50}")

        if not self.alerts:
            print("✅ No drift detected. Model looks healthy.")
            return

        high   = [a for a in self.alerts if a.severity == "high"]
        medium = [a for a in self.alerts if a.severity == "medium"]
        low    = [a for a in self.alerts if a.severity == "low"]

        print(f"Total alerts: {len(self.alerts)}")
        print(f"  🔴 High:   {len(high)}")
        print(f"  🟡 Medium: {len(medium)}")
        print(f"  🟢 Low:    {len(low)}")
        print()

        for alert in self.alerts:
            icon = {"high": "🔴", "medium": "🟡", "low": "🟢"}[alert.severity]
            print(f"{icon} [{alert.drift_type}] {alert.feature}")
            print(f"   {alert.message}")
            print(f"   Detected at: {alert.timestamp}")
            print()

# Usage example
monitor = MLDriftMonitor("churn_classifier_v2")

# Simulate some drift
training_data = np.random.normal(28, 5, 1000)
current_data  = np.random.normal(47, 6, 200)
monitor.check_input_drift("user_age", training_data, current_data)

baseline_preds = [1 if x > 0.85 else 0 for x in np.random.random(1000)]
current_preds  = [1 if x > 0.60 else 0 for x in np.random.random(200)]
monitor.check_prediction_drift(baseline_preds, current_preds)

messy_data = [None, None, 45, None, 23, None, None, 67, None, 89]
monitor.check_data_quality("purchase_amount", messy_data)

monitor.report()

When to actually trigger alerts

Not every drift signal needs to wake someone up at 2 AM. Here's how I think about severity:

🔴 High — act immediately

  • Input distribution shifted significantly (p < 0.01)
  • Prediction positive rate changed by more than 20%
  • Null rate above 20%

These are "pull the car over" signals. Something is seriously wrong.

🟡 Medium — investigate soon

  • Input drift detected but mild (0.01 < p < 0.05)
  • Prediction rate shifted 10–20%
  • Null rate 5–20%

These are "the car is pulling to the right" signals. Not an emergency, but don't ignore them.

🟢 Low — log and watch

  • Small shifts within acceptable ranges
  • Null rate slightly elevated
  • Minor value range violations

These are "keep an eye on it" signals. Log them. Trend them. If they keep moving in the wrong direction, escalate.
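That "trend them" step can itself be automated. Here's a small, purely illustrative sketch, assuming you run your checks on a schedule: escalate a low-severity metric (like a null rate) once it has risen on several consecutive checks. The class name and window size are my own, not from any library.

```python
from collections import deque

class TrendWatcher:
    """
    Escalate a low-severity signal if it keeps moving the wrong way.
    Alerts when the watched metric has risen on each of the last
    `window` consecutive checks.
    """
    def __init__(self, window: int = 3):
        # keep window + 1 values so we can compare `window` steps
        self.history = deque(maxlen=window + 1)

    def observe(self, value: float) -> bool:
        self.history.append(value)
        if len(self.history) < self.history.maxlen:
            return False  # not enough history yet

        values = list(self.history)
        rising = all(b > a for a, b in zip(values, values[1:]))
        if rising:
            print(f"⚠️  Escalating: metric rose "
                  f"{self.history.maxlen - 1} checks in a row")
        return rising

# A null rate creeping up a little each day — each check alone
# looks "low", but the trend says escalate
watcher = TrendWatcher(window=3)
for null_rate in [0.02, 0.03, 0.04, 0.05]:
    watcher.observe(null_rate)
```

Any single one of those readings is shrug-worthy. Four in a row, all moving the same direction, is the car pulling to the right.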


The one thing I'd suggest you do today

If you have a model in production right now — even a simple one — add this:

import json
from datetime import datetime

def log_prediction(input_data: dict, prediction, confidence: float):
    """
    The simplest possible monitoring setup.
    Just log everything. You'll thank yourself later.
    """
    log_entry = {
        "timestamp": datetime.utcnow().isoformat(),
        "inputs": input_data,
        "prediction": prediction,
        "confidence": round(confidence, 4)
    }

    # Write to a log file (or your logging system of choice)
    with open("predictions.log", "a") as f:
        f.write(json.dumps(log_entry) + "\n")

That's it. Just start logging. Every prediction, every input, every confidence score.

You can build sophisticated drift detection on top of logs. You cannot build it on top of nothing.

The gym scale was wrong for weeks before anyone noticed — because nobody was keeping a record to compare against.

Start the record today.
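Once the log exists, drift detection becomes a replay problem. As a sketch (assuming the JSON-lines format from log_prediction above), you can read the log back and KS-test the oldest half of a feature's values against the newest half:

```python
import json
import numpy as np
from scipy import stats

def drift_from_log(log_path: str, feature: str, threshold: float = 0.05):
    """
    Replay a predictions log and compare the oldest half of a
    feature's values against the newest half with a KS test.
    Returns None until there's enough history to compare.
    """
    values = []
    with open(log_path) as f:
        for line in f:
            entry = json.loads(line)
            if feature in entry.get("inputs", {}):
                values.append(entry["inputs"][feature])

    if len(values) < 40:
        return None  # not enough history yet — keep logging

    mid = len(values) // 2
    _, p_value = stats.ks_2samp(values[:mid], values[mid:])
    return {
        "feature": feature,
        "p_value": round(float(p_value), 4),
        "drift_detected": bool(p_value < threshold),
    }

# Demo: write a log where user_age shifts halfway through
rng = np.random.default_rng(42)
with open("predictions.log", "w") as f:
    for age in rng.normal(28, 5, 50):
        f.write(json.dumps({"inputs": {"user_age": float(age)}}) + "\n")
    for age in rng.normal(47, 6, 50):
        f.write(json.dumps({"inputs": {"user_age": float(age)}}) + "\n")

print(drift_from_log("predictions.log", "user_age"))
```

Splitting oldest-half vs newest-half is the crudest possible windowing; in practice you'd compare a fixed training reference against a sliding recent window. But even this crude version is infinitely better than no record at all.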


Quick recap

Three types of drift to watch:

Input drift — your data looks different from training data. Like a coffee shop where the customers completely changed.

Prediction drift — your model's outputs are shifting. Like a weather app confidently predicting sunshine in a Minnesota winter.

Data quality drift — your incoming data is degrading silently. Like a scale that slowly reads wrong.

The common thread: none of these announce themselves. You have to build the detector. The model will not tell you it's struggling. It'll just quietly get worse — until someone notices.

Build the smoke detector before the fire. 🚨


What's next

Now that we know how to catch drift, the next question is: how do LLMs break differently?

Traditional ML models drift in predictable ways. LLMs have entirely new failure modes — hallucinations that increase over time, prompt brittleness, retrieval degradation in RAG systems.

Next post: "LLMs Break Differently Than Traditional ML — Here's What Nobody Warns You About"


Found this useful? Follow for Part 3 — the unique failure modes of LLMs in production that you won't find in any tutorial. 🚀

Tried implementing drift detection? I'd love to hear what you ran into — drop it in the comments.

