You trained a model. It's in production. Six weeks later, something feels off — accuracy is down, users are complaining, and your on-call rotation is stressed. What happened? Model drift: your production data distribution shifted away from what you trained on.

The frustrating part: it's not hard to detect. Most teams just never wire it up. Here's the math, the code, and a one-call API if you want to skip implementing it yourself.

## What drift detection actually measures

The industry-standard metric is the Population Stability Index (PSI). It measures how much a distribution has shifted between two samples: bucket both samples into the same bins, then sum `(q_i - p_i) * ln(q_i / p_i)` over the bins, where `p_i` and `q_i` are the baseline and current proportions in bin `i`. The thresholds that matter:
- PSI < 0.10 — stable, no action needed
- PSI 0.10–0.25 — investigate your data pipeline
- PSI > 0.25 — retrain now

Why PSI and not just a mean difference? Because the mean can stay the same while the distribution's shape completely changes. PSI catches that.

## Implement it yourself (50 lines)
```python
import math
import statistics

def compute_psi(baseline: list, current: list, bins: int = 10) -> float:
    """Population Stability Index between two score distributions."""
    all_vals = baseline + current
    min_val, max_val = min(all_vals), max(all_vals)
    if min_val == max_val:
        return 0.0
    bin_width = (max_val - min_val) / bins
    baseline_counts = [0] * bins
    current_counts = [0] * bins
    for v in baseline:
        b = min(int((v - min_val) / bin_width), bins - 1)
        baseline_counts[b] += 1
    for v in current:
        b = min(int((v - min_val) / bin_width), bins - 1)
        current_counts[b] += 1
    psi = 0.0
    epsilon = 1e-10  # avoids log(0) / division by zero for empty bins
    for i in range(bins):
        p = (baseline_counts[i] / len(baseline)) + epsilon
        q = (current_counts[i] / len(current)) + epsilon
        psi += (q - p) * math.log(q / p)
    return round(psi, 4)

def detect_drift(baseline_scores, current_scores):
    psi = compute_psi(baseline_scores, current_scores)
    if psi < 0.10:
        action = "stable — monitor normally"
    elif psi < 0.25:
        action = "warning — investigate data pipeline"
    else:
        action = "critical — retrain now"
    mean_shift = statistics.mean(current_scores) - statistics.mean(baseline_scores)
    return {
        "psi": psi,
        "drift_detected": psi >= 0.10,
        "action": action,
        "mean_shift": round(mean_shift, 4),
        "mean_shift_pct": round(mean_shift / statistics.mean(baseline_scores) * 100, 2),
    }

# Usage
baseline = [0.91, 0.87, 0.93, 0.89, 0.92, 0.88, 0.94, 0.90, 0.86, 0.93]
current = [0.72, 0.68, 0.75, 0.70, 0.66, 0.73, 0.69, 0.71, 0.67, 0.74]
result = detect_drift(baseline, current)
print(f"PSI: {result['psi']} → {result['action']}")
print(f"Mean shift: {result['mean_shift_pct']}%")
```

Output:
```
PSI: 43.649 → critical — retrain now
Mean shift: -21.93%
```

(These two distributions barely overlap, so with a tiny epsilon the PSI blows up far past the 0.25 threshold. Any value above 0.25 reads the same way: retrain.)
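The "mean can stay the same while the shape completely changes" claim is easy to demonstrate. Here's a minimal standalone sketch (it re-declares the same PSI logic as above so it runs on its own): one score distribution tightly clustered around 0.5, one bimodal at 0.1 and 0.9, both with the same mean. A mean check sees nothing; PSI fires immediately.

```python
import math
import statistics

def compute_psi(baseline, current, bins=10):
    # Same PSI as above, re-declared so this snippet runs standalone.
    all_vals = baseline + current
    lo, hi = min(all_vals), max(all_vals)
    if lo == hi:
        return 0.0
    width = (hi - lo) / bins

    def proportions(values):
        counts = [0] * bins
        for v in values:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        return [c / len(values) for c in counts]

    eps = 1e-10
    return sum((q - p) * math.log((q + eps) / (p + eps))
               for p, q in zip(proportions(baseline), proportions(current)))

# Tightly clustered around 0.5 vs. bimodal at 0.1 / 0.9: identical means.
baseline = [0.48, 0.49, 0.50, 0.51, 0.52] * 2
current = [0.1] * 5 + [0.9] * 5

print(statistics.mean(baseline), statistics.mean(current))  # both 0.5 (up to float noise)
print(compute_psi(baseline, current))  # far above the 0.25 "retrain" threshold
```

This is exactly the failure mode a mean-difference monitor misses: the model has gone from confident-and-calibrated to coin-flipping between extremes, and the average never moved.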
## When to run this check

A few patterns that work:

**Batch check** (simplest): run once per day against yesterday's scores vs. your training distribution. Put it in a cron job.

```python
# In your daily batch job
if result['drift_detected']:
    send_alert(f"Model drift: {result['psi']} PSI — {result['action']}")
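End to end, that daily job can be a single standalone script. This is a sketch: `load_scores` and `send_alert` are hypothetical stand-ins for your own score storage and alerting, the file paths are placeholders, and the PSI function is inlined so the file runs on its own.

```python
# daily_drift_check.py
# Schedule with cron, e.g.: 0 6 * * * python /path/to/daily_drift_check.py
import json
import math

def compute_psi(baseline, current, bins=10):
    # Same PSI as in the article, inlined so the script is standalone.
    lo, hi = min(baseline + current), max(baseline + current)
    if lo == hi:
        return 0.0
    width = (hi - lo) / bins

    def proportions(values):
        counts = [0] * bins
        for v in values:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        return [c / len(values) for c in counts]

    eps = 1e-10
    return sum((q - p) * math.log((q + eps) / (p + eps))
               for p, q in zip(proportions(baseline), proportions(current)))

def load_scores(path):
    # Hypothetical stand-in: swap in your DB query or log-store read.
    with open(path) as f:
        return json.load(f)

def send_alert(message):
    # Hypothetical stand-in: Slack webhook, PagerDuty, email, ...
    print(f"ALERT: {message}")

def run_check(baseline, current):
    psi = compute_psi(baseline, current)
    if psi >= 0.10:
        send_alert(f"Model drift: PSI={psi:.4f}")
    return psi

if __name__ == "__main__":
    run_check(load_scores("training_scores.json"),
              load_scores("yesterday_scores.json"))
```

The only contract that matters here is "two lists of floats in, alert out"; everything around it is plumbing you already have.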
**Rolling window**: check the last N predictions against the last M predictions from a week ago. Catches gradual drift before it compounds.

**On deploy**: run against your holdout set every time you push a new model version. PSI > 0.25 blocks the deploy.

## What to feed into it

PSI works on any numeric distribution. Good candidates:

- Confidence scores from classifiers (most common)
- Predicted probabilities
- Embedding distances if you're using semantic similarity
- Output token counts for LLM-based pipelines
- Feature values from your input data (feature drift ≠ model drift, but it often precedes it)

## Skip the implementation

If you'd rather not maintain this, I deployed a hosted version with the same math:
```python
import requests

result = requests.post(
    "https://the-service.live/drift/api/detect",
    json={
        "baseline": baseline_scores,
        "current": current_scores,
        "model_id": "my-classifier-v2",
        "metric": "confidence"
    }
).json()

if result["drift_detected"]:
    print(f"Drift: {result['classification']['level']}")
    print(f"PSI: {result['psi']} | Action: {result['classification']['action']}")
```

Returns PSI, KL divergence, mean shift %, and a classification. Free tier is 100 checks/day. Live demo at the-service.live/drift.

## The part people miss

Drift detection is not the same as accuracy monitoring. You often can't monitor accuracy in real time because you don't have ground-truth labels yet. Drift detection gives you an early warning signal before you have labels. If your score distribution is shifting, something changed — data pipeline issue, seasonal pattern, population shift, adversarial drift. You don't need labels to know something is wrong.

The teams that catch this earliest are the ones logging confidence scores from day one and running PSI weekly. Takes 10 minutes to wire up. Saves a lot of incident calls.

---

Built by EnergenAI. The drift detection API is live at the-service.live/drift. Feedback welcome — what metrics would make this useful for your stack?