The most common LLM production incident I see is not prompt injection or model hallucinations. It is silent quality degradation — the model outputs look fine, but they are subtly worse than they used to be.
This is LLM drift. Here is how to detect it before it breaks your users.
What Drift Looks Like
You shipped a classification endpoint in January. It was 94% accurate. In March, you check and it is 89% accurate. You did not change anything. The model provider changed something.
This happens. Providers update models, fine-tune weights, change inference infrastructure. The model name is the same. The model behavior is different.
The Simple Detection Method
- Run your prompt with 10 baseline inputs
- Store the outputs as your "golden" set
- Re-run weekly with the same inputs
- Compare new outputs to golden outputs using embedding similarity
If similarity drops below 0.85, investigate.
The Code
```python
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

BASELINE_OUTPUTS = [...]  # Your golden outputs
CURRENT_OUTPUTS = [...]   # Recent production outputs

def measure_drift(baseline, current, threshold=0.15):
    # embed() and alert() are your own helpers: embed() returns one
    # embedding vector per text, alert() routes to Slack/PagerDuty/etc.
    baseline_emb = np.array([embed(text) for text in baseline])
    current_emb = np.array([embed(text) for text in current])
    # Mean cosine similarity between each golden output and its counterpart
    similarity = np.mean(np.diag(cosine_similarity(baseline_emb, current_emb)))
    drift_score = 1 - similarity
    if drift_score > threshold:
        alert(f"Drift detected: {drift_score:.2f} (threshold: {threshold})")
    return drift_score
```
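Here is a self-contained usage sketch. The toy bag-of-words `embed()` and the `print`-based `alert()` are stand-ins I've invented for illustration; in production you would call a real embedding API and a real alerting channel.

```python
from collections import Counter
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

VOCAB = ["refund", "billing", "bug", "question", "complaint"]

def embed(text):
    # Toy word-count embedding; swap for your provider's embeddings endpoint.
    counts = Counter(text.lower().split())
    return [counts[w] for w in VOCAB]

def alert(msg):
    print(msg)  # swap for Slack/PagerDuty in production

def measure_drift(baseline, current, threshold=0.15):
    baseline_emb = np.array([embed(t) for t in baseline])
    current_emb = np.array([embed(t) for t in current])
    similarity = np.mean(np.diag(cosine_similarity(baseline_emb, current_emb)))
    drift_score = 1 - similarity
    if drift_score > threshold:
        alert(f"Drift detected: {drift_score:.2f} (threshold: {threshold})")
    return drift_score

golden = ["billing refund complaint", "bug question"]
recent = ["billing refund complaint", "bug question"]
score = measure_drift(golden, recent)  # identical outputs, so no alert fires
```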
Why Embeddings
String matching fails because the model might rephrase a perfectly correct answer. Embeddings capture semantic similarity: is the meaning the same, even if the words differ?
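A quick illustration of the difference. Word counts stand in for a real embedding model here just to show the mechanics; the sentences are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

original = "The customer is asking for a refund on their invoice"
rephrased = "The customer is asking for a refund on their bill"

# Exact string comparison treats the rephrase as a total mismatch.
print(original == rephrased)  # False

# Vector similarity recognizes the two outputs as nearly the same.
vectors = CountVectorizer().fit_transform([original, rephrased])
similarity = cosine_similarity(vectors[0], vectors[1])[0][0]
print(similarity > 0.85)  # True
```

With a real embedding model the effect is stronger still, since "invoice" and "bill" land close together in embedding space even though they share no characters.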
The Monitoring Stack
Drift detection without alerting is useless. You need:
- Baseline recording — done once, stored permanently
- Weekly comparison — automated, no manual work
- Alert routing — Slack, email, PagerDuty when drift detected
- Historical tracking — see drift over time
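The four pieces above can be wired together in a few dozen lines. A minimal sketch, assuming `run_prompt()` and `measure_drift()` are your LLM call and the drift function from earlier; the file names are illustrative:

```python
import json
import os
import datetime

def record_baseline(inputs, run_prompt, path="baseline.json"):
    # Done once: store inputs and golden outputs permanently.
    outputs = [run_prompt(x) for x in inputs]
    with open(path, "w") as f:
        json.dump({"inputs": inputs, "outputs": outputs}, f)

def weekly_check(run_prompt, measure_drift,
                 path="baseline.json", history="drift_history.json"):
    # Automated weekly run: re-run the golden inputs, score drift,
    # and append to a history file for tracking over time.
    with open(path) as f:
        baseline = json.load(f)
    current = [run_prompt(x) for x in baseline["inputs"]]
    score = measure_drift(baseline["outputs"], current)
    log = json.load(open(history)) if os.path.exists(history) else []
    log.append({"date": datetime.date.today().isoformat(), "score": score})
    with open(history, "w") as f:
        json.dump(log, f)
    return score
```

Schedule `weekly_check` from cron or your job runner, and route the alert inside `measure_drift` to whichever channel your team actually watches.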
Real Numbers
After monitoring 50 production prompts for 90 days:
- 23% showed measurable drift within 30 days
- 8% showed significant drift (drift score > 0.3)
- Classification tasks drift most frequently
What To Do When Drift Is Detected
1. Re-record the baseline (accept the new behavior as correct)
2. Add prompt constraints (tighten the instructions)
3. Switch to a more stable model
Option 1 is most common. Drift is not always bad — sometimes the model improves.
The key insight: if you are not measuring drift, you do not know if your LLM application is working as well today as it did when you shipped it.
Try DriftWatch — from £9.90/mo
Automated drift detection and alerting for production LLM applications.