The most common LLM production incident I see is not prompt injection or model hallucinations. It is silent quality degradation — the model outputs look fine, but they are subtly worse than they used to be.
This is LLM drift. Here is how to detect it before it breaks your users.
What Drift Looks Like
You shipped a classification endpoint in January. It was 94% accurate. In March, you check and it is 89% accurate. You did not change anything. The model provider changed something.
This happens. Providers update models, fine-tune weights, change inference infrastructure. The model name is the same. The model behavior is different.
The Simple Detection Method
- Run your prompt with 10 baseline inputs
- Store the outputs as your "golden" set
- Re-run weekly with the same inputs
- Compare new outputs to golden outputs using embedding similarity
If similarity drops below 0.85, investigate.
The Code
```python
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

BASELINE_OUTPUTS = [...]  # Your golden outputs
CURRENT_OUTPUTS = [...]   # Recent production outputs

def measure_drift(baseline, current, threshold=0.15):
    # embed() and alert() are your own helpers: embed() returns one
    # embedding vector per text, alert() routes to Slack/PagerDuty/etc.
    baseline_emb = np.array([embed(text) for text in baseline])
    current_emb = np.array([embed(text) for text in current])
    # Mean cosine similarity between each golden output and its counterpart
    similarity = np.mean(np.diag(cosine_similarity(baseline_emb, current_emb)))
    drift_score = 1 - similarity
    if drift_score > threshold:
        alert(f"Drift detected: {drift_score:.2f} (threshold: {threshold})")
    return drift_score
```
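Here is a self-contained usage sketch. The toy bag-of-words `embed()` and the `print`-based `alert()` are stand-ins I've invented for illustration; in production you would call a real embedding API and a real alerting channel.

```python
from collections import Counter
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

VOCAB = ["refund", "billing", "bug", "question", "complaint"]

def embed(text):
    # Toy word-count embedding; swap for your provider's embeddings endpoint.
    counts = Counter(text.lower().split())
    return [counts[w] for w in VOCAB]

def alert(msg):
    print(msg)  # swap for Slack/PagerDuty in production

def measure_drift(baseline, current, threshold=0.15):
    baseline_emb = np.array([embed(t) for t in baseline])
    current_emb = np.array([embed(t) for t in current])
    similarity = np.mean(np.diag(cosine_similarity(baseline_emb, current_emb)))
    drift_score = 1 - similarity
    if drift_score > threshold:
        alert(f"Drift detected: {drift_score:.2f} (threshold: {threshold})")
    return drift_score

golden = ["billing refund complaint", "bug question"]
recent = ["billing refund complaint", "bug question"]
score = measure_drift(golden, recent)  # identical outputs, so no alert fires
```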
Why Embeddings
String matching fails because the model might rephrase a perfectly correct answer. Embeddings capture semantic similarity: is the meaning the same, even if the words differ?
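A quick illustration of the difference. Word counts stand in for a real embedding model here just to show the mechanics; the sentences are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

original = "The customer is asking for a refund on their invoice"
rephrased = "The customer is asking for a refund on their bill"

# Exact string comparison treats the rephrase as a total mismatch.
print(original == rephrased)  # False

# Vector similarity recognizes the two outputs as nearly the same.
vectors = CountVectorizer().fit_transform([original, rephrased])
similarity = cosine_similarity(vectors[0], vectors[1])[0][0]
print(similarity > 0.85)  # True
```

With a real embedding model the effect is stronger still, since "invoice" and "bill" land close together in embedding space even though they share no characters.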
The Monitoring Stack
Drift detection without alerting is useless. You need:
- Baseline recording — done once, stored permanently
- Weekly comparison — automated, no manual work
- Alert routing — Slack, email, PagerDuty when drift detected
- Historical tracking — see drift over time
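The four pieces above can be wired together in a few dozen lines. A minimal sketch, assuming `run_prompt()` and `measure_drift()` are your LLM call and the drift function from earlier; the file names are illustrative:

```python
import json
import os
import datetime

def record_baseline(inputs, run_prompt, path="baseline.json"):
    # Done once: store inputs and golden outputs permanently.
    outputs = [run_prompt(x) for x in inputs]
    with open(path, "w") as f:
        json.dump({"inputs": inputs, "outputs": outputs}, f)

def weekly_check(run_prompt, measure_drift,
                 path="baseline.json", history="drift_history.json"):
    # Automated weekly run: re-run the golden inputs, score drift,
    # and append to a history file for tracking over time.
    with open(path) as f:
        baseline = json.load(f)
    current = [run_prompt(x) for x in baseline["inputs"]]
    score = measure_drift(baseline["outputs"], current)
    log = json.load(open(history)) if os.path.exists(history) else []
    log.append({"date": datetime.date.today().isoformat(), "score": score})
    with open(history, "w") as f:
        json.dump(log, f)
    return score
```

Schedule `weekly_check` from cron or your job runner, and route the alert inside `measure_drift` to whichever channel your team actually watches.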
Real Numbers
After monitoring 50 production prompts for 90 days:
- 23% showed measurable drift within 30 days
- 8% showed significant drift (drift score > 0.3)
- Classification tasks drift most frequently
What To Do When Drift Is Detected
1. Re-record the baseline (accept the new behavior as correct)
2. Add prompt constraints (tighten the instructions)
3. Switch to a more stable model
Option 1 is most common. Drift is not always bad — sometimes the model improves.
The key insight: if you are not measuring drift, you do not know if your LLM application is working as well today as it did when you shipped it.
Try DriftWatch — from £9.90/mo
Automated drift detection and alerting for production LLM applications.