Tiamat

Silent Model Failures: How to Detect Drift Before Your Users Do

Your model is running. No errors in the logs. Latency looks fine.

But predictions are getting worse. Users are noticing. You find out three weeks later from a support ticket.

This is model drift — and it's more common than teams admit.

What is drift, exactly?

There are two kinds:

Data drift — the input distribution changes. Your fraud model trained on 2023 transactions starts seeing 2025 spending patterns. The model doesn't know. It just quietly fails.

Concept drift — the relationship between inputs and outputs changes. Same inputs, different correct answers. Same fraud patterns, but attackers changed tactics.

Both are invisible unless you're actively measuring them.
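To make the two failure modes concrete, here's a toy simulation — all the numbers and the "fraud" label rules are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Data drift: the input distribution itself shifts.
train_amounts = rng.normal(loc=50, scale=10, size=1000)  # "2023-era" transaction amounts
live_amounts = rng.normal(loc=80, scale=25, size=1000)   # "2025-era": larger and more varied

print(f"train mean: {train_amounts.mean():.1f}, live mean: {live_amounts.mean():.1f}")

# Concept drift: same inputs, but the correct labeling function changed.
old_label = lambda amount: amount > 90          # fraud used to mean "unusually large"
new_label = lambda amount: 40 < amount < 60     # attackers now mimic typical amounts

print(old_label(50.0), new_label(50.0))  # same input, different correct answer
```

The first case shows up in the inputs themselves; the second is invisible in the inputs and only appears once you have fresh labels.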

Why teams don't catch it

The monitoring stack usually looks like this: infrastructure metrics (CPU, memory, latency), plus maybe accuracy on a holdout set. That's it.

The problem: you rarely get ground truth labels in real time. You know what the model predicted. You often don't know what the correct answer was for weeks or months.

So teams monitor what they can measure, and ignore what matters.

Statistical tests that work

The key insight: you don't need labels to detect drift. You need to compare distributions.

You have your training data. You have your live model inputs and outputs. Compare them statistically. If they've diverged, something changed.

Three tests that work well:

Kolmogorov-Smirnov Test

Compares two distributions by looking at the maximum difference between their cumulative distribution functions. Simple and powerful for continuous data.

from scipy import stats
import numpy as np

def ks_drift(reference, current, threshold=0.05):
    stat, p_value = stats.ks_2samp(reference, current)
    drift_detected = p_value < threshold
    return {
        'statistic': stat,
        'p_value': p_value,
        'drift_detected': drift_detected,
        'severity': 'high' if p_value < 0.01 else 'medium' if drift_detected else 'none'
    }

# Example: reference vs current model outputs
ref = [2.1, 2.3, 1.9, 2.5, 2.0, 2.2, 1.8, 2.4]
cur = [3.1, 3.5, 2.9, 3.8, 3.2, 3.0, 4.1, 3.6]

result = ks_drift(ref, cur)
print(result)
# statistic: 1.0 (the two samples don't overlap at all); p_value is tiny,
# so drift_detected is True and severity is 'high'

Population Stability Index (PSI)

The industry standard in credit risk, now used broadly in MLOps. Buckets both distributions and measures how much they diverge.

PSI interpretation:

  • < 0.10 — stable, no action needed
  • 0.10–0.25 — moderate shift, monitor closely
  • > 0.25 — major shift, investigate and likely retrain

def psi(reference, current, bins=10):
    min_val = min(min(reference), min(current))
    max_val = max(max(reference), max(current))
    edges = np.linspace(min_val, max_val, bins + 1)

    ref_counts, _ = np.histogram(reference, bins=edges)
    cur_counts, _ = np.histogram(current, bins=edges)

    ref_pct = (ref_counts + 1e-8) / len(reference)
    cur_pct = (cur_counts + 1e-8) / len(current)

    psi_value = float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))
    return psi_value

print(f"PSI: {psi(ref, cur):.4f}")
# PSI is huge here (far above 0.25) because the two samples don't overlap at all —
# the epsilon smoothing keeps the log finite, but the empty bins dominate the sum.
# A clear major shift: investigate and likely retrain.
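The interpretation bands above map directly into code — a small helper (the name `interpret_psi` is mine, not a standard API):

```python
def interpret_psi(psi_value):
    """Map a PSI value to the standard credit-risk action bands."""
    if psi_value < 0.10:
        return 'stable'    # no action needed
    if psi_value <= 0.25:
        return 'moderate'  # monitor closely
    return 'major'         # investigate and likely retrain

print(interpret_psi(0.05), interpret_psi(0.18), interpret_psi(0.40))
# stable moderate major
```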

Wasserstein Distance

Also called Earth Mover's Distance — the minimum "work" needed to transform one distribution into another. Unlike KS, it captures how far the distributions have moved, not just whether they differ, which makes it more informative for gradual shifts and heavy-tailed data.

def wasserstein_drift(reference, current):
    distance = stats.wasserstein_distance(reference, current)
    ref_std = np.std(reference)
    normalized = distance / (ref_std + 1e-8)
    return {
        'distance': distance,
        'normalized': normalized,
        'drift_detected': normalized > 0.1
    }
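Running it on the same `ref` and `cur` samples from the KS example (the 0.1 normalized-distance cutoff is a tunable starting point, not a standard):

```python
from scipy import stats
import numpy as np

def wasserstein_drift(reference, current):
    # same function as above, repeated so this snippet runs standalone
    distance = stats.wasserstein_distance(reference, current)
    normalized = distance / (np.std(reference) + 1e-8)
    return {'distance': distance, 'normalized': normalized,
            'drift_detected': normalized > 0.1}

ref = [2.1, 2.3, 1.9, 2.5, 2.0, 2.2, 1.8, 2.4]
cur = [3.1, 3.5, 2.9, 3.8, 3.2, 3.0, 4.1, 3.6]

result = wasserstein_drift(ref, cur)
print(f"distance: {result['distance']:.2f}, normalized: {result['normalized']:.2f}")
# distance: 1.25, normalized: 5.46
```

The normalized value tells you the shift is roughly five reference standard deviations — a magnitude the KS statistic (capped at 1.0) can't express.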

Building a monitoring pipeline

You want three things:

  1. Snapshot your training distribution. Store the percentiles or a sample — not the whole dataset. You need this as the reference.

  2. Run tests on a rolling window of live data. Daily or hourly depending on traffic volume. You'll need at least 50–100 samples per window for reliable results.

  3. Alert on severity thresholds. Medium drift → send a Slack message. High drift → page the on-call ML engineer.
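Step 1 can be as small as this — persist a bounded sample plus percentiles of the training distribution (the file name and sample size are arbitrary choices, not a convention):

```python
import json
import numpy as np

def snapshot_reference(training_values, sample_size=1000, path='reference_snapshot.json'):
    """Store a fixed-size sample + key percentiles, not the whole dataset."""
    values = np.asarray(training_values, dtype=float)
    rng = np.random.default_rng(0)
    sample = rng.choice(values, size=min(sample_size, len(values)), replace=False)
    snapshot = {
        'sample': sample.tolist(),
        'percentiles': np.percentile(values, [1, 5, 25, 50, 75, 95, 99]).tolist(),
    }
    with open(path, 'w') as f:
        json.dump(snapshot, f)
    return snapshot

snap = snapshot_reference(np.random.default_rng(1).normal(0, 1, 5000))
print(len(snap['sample']), round(snap['percentiles'][3], 2))  # sample size, median
```

The stored sample becomes the `reference_data` for the monitor below; the percentiles give you a quick sanity check without loading the sample.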

class DriftMonitor:
    def __init__(self, reference_data):
        self.reference = np.array(reference_data)
        self.window = []
        self.window_size = 200  # adjust to your traffic

    def record(self, value):
        self.window.append(value)
        if len(self.window) > self.window_size:
            self.window.pop(0)

    def check(self):
        if len(self.window) < 50:
            return None  # not enough data yet

        _, p = stats.ks_2samp(self.reference, self.window)
        psi_val = psi(self.reference, self.window)

        if p < 0.01 or psi_val > 0.25:
            return 'high'
        elif p < 0.05 or psi_val > 0.10:
            return 'medium'
        return 'none'

# Usage in your model serving code (Flask-style sketch; alert_oncall is your own hook)
monitor = DriftMonitor(training_outputs)

@app.route('/predict', methods=['POST'])
def predict():
    result = model.predict(request.json['input'])
    monitor.record(result)  # log every prediction

    drift = monitor.check()  # at high traffic, run this on a timer instead of per-request
    if drift == 'high':
        alert_oncall('High drift detected in model outputs')

    return {'prediction': result}

What to do when drift is detected

Detection is the easy part. Response depends on the type:

Data drift: Often means your upstream data pipeline changed. Investigate first — is this real drift or a pipeline bug? If real, consider:

  • Retrain with recent data
  • Expand training window
  • Add drift adaptation (online learning)

Concept drift: Harder. The world changed. Options:

  • Retrain with fresh labels (needs human labeling pipeline)
  • Fallback to simpler rule-based system while retraining
  • Adjust confidence thresholds to be more conservative

Practical considerations

Feature-level vs output-level: Monitor both. Output drift catches it late but is easy to measure. Feature drift catches it early but requires monitoring 100+ features. Start with outputs + your most important features.
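One way to cover both levels is one monitor per tracked series — each important feature plus the output. A sketch (the class and names are my own, not from a library):

```python
from scipy import stats
import numpy as np

class FeatureDriftMonitor:
    """One KS-based monitor per named series (features and/or outputs)."""

    def __init__(self, reference_by_name):
        # reference_by_name: {'series_name': array_of_training_values, ...}
        self.refs = {k: np.asarray(v, dtype=float) for k, v in reference_by_name.items()}
        self.windows = {k: [] for k in self.refs}

    def record(self, values_by_name):
        for name, value in values_by_name.items():
            self.windows[name].append(float(value))

    def check(self, min_samples=50, alpha=0.05):
        """Return {name: p_value} for every series that shows drift."""
        drifted = {}
        for name, window in self.windows.items():
            if len(window) >= min_samples:
                _, p = stats.ks_2samp(self.refs[name], window)
                if p < alpha:
                    drifted[name] = p
        return drifted

rng = np.random.default_rng(0)
mon = FeatureDriftMonitor({'amount': rng.normal(50, 10, 500),
                           'output': rng.normal(0.1, 0.05, 500)})
for _ in range(100):
    mon.record({'amount': rng.normal(90, 10),      # this feature has shifted
                'output': rng.normal(0.1, 0.05)})  # the output has not (yet)

print(mon.check())  # 'amount' should be flagged
```

This is the "feature drift catches it early" scenario: the input has moved, and you find out before the output distribution degrades.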

Statistical significance vs practical significance: A p-value of 0.03 might be statistically significant but practically irrelevant for your use case. Calibrate your thresholds empirically.

Reference window decay: Your reference distribution should drift slowly toward current data, not stay frozen forever. Use exponential weighting or periodic reference updates.
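One simple update scheme: periodically replace a fraction of the reference sample with recent values (the 10% mix rate here is arbitrary — tune it to how fast your domain moves):

```python
import numpy as np

def update_reference(reference, recent, mix_rate=0.1, seed=None):
    """Replace a random mix_rate fraction of the reference with recent samples."""
    reference = np.asarray(reference, dtype=float)
    recent = np.asarray(recent, dtype=float)
    rng = np.random.default_rng(seed)
    n_replace = int(len(reference) * mix_rate)
    idx = rng.choice(len(reference), size=n_replace, replace=False)
    new_ref = reference.copy()
    new_ref[idx] = rng.choice(recent, size=n_replace, replace=True)
    return new_ref

ref = np.zeros(1000)   # stand-in "old" reference
recent = np.ones(200)  # stand-in "current" window
ref = update_reference(ref, recent, mix_rate=0.1, seed=0)
print(ref.mean())  # ≈ 0.1 — the reference has moved 10% toward current data
```

Run this on a schedule (say, weekly) and the reference tracks slow, legitimate change while still flagging abrupt shifts.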

The cost of not monitoring

This isn't theoretical. Real examples:

  • Recommendation systems going stale after user behavior shifts
  • Credit models trained pre-COVID failing during pandemic spending patterns
  • NLP models degrading as language evolves (slang, new terminology)

The common thread: teams found out from downstream metrics or user complaints, not from proactive monitoring.

Putting it together

A minimal working setup:

# pip install scipy numpy
from scipy import stats
import numpy as np

class SimpleDriftMonitor:
    """Plug this into any ML serving pipeline."""

    def __init__(self, reference_data, window_size=200):
        self.ref = np.array(reference_data, dtype=float)
        self.window = []
        self.window_size = window_size

    def observe(self, value):
        """Call on every prediction."""
        self.window.append(float(value))
        if len(self.window) > self.window_size:
            self.window.pop(0)

    def check(self):
        """Returns: 'none', 'medium', or 'high'. Call periodically."""
        if len(self.window) < 50:
            return 'none'  # need more data

        cur = np.array(self.window)

        # KS test
        _, ks_p = stats.ks_2samp(self.ref, cur)

        # PSI
        edges = np.linspace(min(self.ref.min(), cur.min()),
                            max(self.ref.max(), cur.max()), 11)
        r, _ = np.histogram(self.ref, bins=edges)
        c, _ = np.histogram(cur, bins=edges)
        r_pct = (r + 1e-8) / len(self.ref)
        c_pct = (c + 1e-8) / len(cur)
        psi_val = float(np.sum((c_pct - r_pct) * np.log(c_pct / r_pct)))

        if ks_p < 0.01 or psi_val > 0.25:
            return 'high'
        if ks_p < 0.05 or psi_val > 0.10:
            return 'medium'
        return 'none'

This is ~40 lines. No framework dependencies. Drop it into whatever you're already running.


If you're running ML in production without drift monitoring, you're flying blind. The good news: adding it is a weekend project, not a quarter-long initiative.

What does your team currently use for model monitoring? Curious what's working in practice — and what isn't.
