After running 300 automated drift checks on production LLM deployments, I have enough data to say something statistically meaningful about where models fail.
## The Dataset
- 300 drift checks across 5 different LLM-powered production systems
- Checks run every 6 hours over 6 weeks
- Models tested: GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, GPT-4o-mini
- Metrics: Cosine similarity against baseline, exact string match rate, JSON parse success rate
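The three metrics can be sketched roughly like this. This is a toy illustration, not DriftWatch's actual code: the bag-of-words cosine stands in for whatever embedding model a real check would use, and `drift_metrics` is a hypothetical helper name.

```python
import json
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine. A production check would embed both
    texts with a model; this stdlib stand-in keeps the sketch runnable."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def drift_metrics(baseline: str, current: str) -> dict:
    """One check = compare the current response to the stored baseline."""
    try:
        json.loads(current)
        parses = True
    except json.JSONDecodeError:
        parses = False
    return {
        "cosine": cosine_similarity(baseline, current),
        "exact_match": baseline == current,
        "json_parses": parses,
    }
```

Each scheduled run produces one of these dicts per prompt, and the aggregate rates over six weeks are what the numbers below summarize.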
## What I Found

### Category 1: Format Drift (47% of failures)
JSON outputs break most often. The model decides to add a preamble, change indentation, or re-order fields.
Example of actual drift caught:

```
Baseline: {"status": "ok", "value": 42}
Drifted:  "Based on the analysis, the result is: {\"status\": \"ok\", \"value\": 42}"
```
This is the silent killer. The drifted output still parses as JSON, but as a string literal rather than the bare object, so downstream code that expects the clean format breaks anyway.
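A minimal guard against this failure mode might look like the sketch below. The function name and the regex fallback are my own illustration, not DriftWatch's API: the strict path demands a bare JSON object, and the lenient path tells you the payload is recoverable but drifted.

```python
import json
import re

def check_clean_json(output: str):
    """Return (is_clean, parsed). is_clean is True only when the
    whole output is a bare JSON object, which is what a strict
    downstream consumer expects."""
    try:
        parsed = json.loads(output)
        if isinstance(parsed, dict):
            return True, parsed  # clean structured output
    except json.JSONDecodeError:
        pass
    # Lenient fallback: pull the first {...} span out of surrounding prose.
    m = re.search(r"\{.*\}", output, re.DOTALL)
    if m:
        try:
            return False, json.loads(m.group(0))  # recoverable, but drifted
        except json.JSONDecodeError:
            pass
    return False, None
```

The point of returning both flags is that "parses after extraction" and "clean" are different health signals; alerting on the first hides exactly the drift described above.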
### Category 2: Verbosity Drift (31% of failures)
Responses getting longer for no reason. More hedging language, more caveats, longer explanations.
A 200-word response becomes 350 words saying the same thing with more uncertainty.
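A cheap detector for this is a word-count ratio against the baseline. The 1.5x cutoff below is an illustrative default I picked for the sketch, not a number from the study; it would flag the 200-to-350-word case above (ratio 1.75).

```python
def verbosity_drift(baseline: str, current: str, max_ratio: float = 1.5) -> bool:
    """Flag responses that grew past max_ratio times the baseline
    word count. Threshold is illustrative; tune it per prompt."""
    base_words = len(baseline.split())
    return base_words > 0 and len(current.split()) / base_words > max_ratio
```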
### Category 3: Instruction Drift (22% of failures)
Model stops following specific instructions. "Always include a confidence score" gets ignored. "Never use bullet points" gets overridden.
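Instruction drift is the easiest category to assert on, because each instruction becomes a yes/no check. A hypothetical rule set mirroring the two examples above, with patterns of my own invention:

```python
import re

def check_instructions(output: str) -> dict:
    """Per-instruction compliance flags for the two rules named above.
    The regexes are illustrative; real rules would be per-prompt."""
    return {
        # "Always include a confidence score"
        "has_confidence_score": bool(
            re.search(r"confidence[:\s]+\d", output, re.IGNORECASE)
        ),
        # "Never use bullet points"
        "no_bullet_points": not re.search(r"^\s*[-*•]\s", output, re.MULTILINE),
    }
```

Tracking each flag separately over time shows *which* instruction decayed, not just that something changed.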
## The Model Comparison
| Model | Avg Drift Score | Worst Category |
|---|---|---|
| GPT-4o | 0.08 | Format |
| Claude 3.5 Sonnet | 0.06 | Verbosity |
| GPT-4o-mini | 0.12 | Format |
| Gemini 1.5 Pro | 0.09 | Instructions |
GPT-4o-mini drifts the most, which tracks: it's the cheapest model and the least consistent.
## The Surprising Finding
Drift isn't monotonic. Models sometimes return to baseline after drifting. A GPT-4o format change on Day 12 was gone by Day 15. This means:
- You can't just re-baseline after every drift event
- Some drift is temporary — but you can't know which until it settles
- The model providers are doing live updates that sometimes revert
## What This Means for Production
If you're running LLMs in production:
- You need automated drift detection. Not manual testing.
- Format-checking alone isn't enough — you need semantic similarity.
- JSON output mode helps but doesn't prevent all format drift.
- Set your alert threshold based on your tolerance, not some arbitrary number.
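That last point is worth making concrete. A sketch of tolerance-driven alerting, with made-up default values and a `DriftThresholds` name that is mine, not DriftWatch's:

```python
from dataclasses import dataclass

@dataclass
class DriftThresholds:
    """Illustrative defaults only. A strict JSON pipeline might demand
    min_cosine=0.98; a chatty summarizer might tolerate 0.85."""
    min_cosine: float = 0.90
    require_json: bool = True

def should_alert(metrics: dict, t: DriftThresholds) -> bool:
    """Alert when the check's metrics fall outside your tolerance."""
    if t.require_json and not metrics.get("json_parses", True):
        return True
    return metrics.get("cosine", 1.0) < t.min_cosine
```

The key design choice is that the threshold object lives with the deployment config, per system, rather than as one global constant.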
I've open-sourced the detection tool if you want to run your own: DriftWatch on GitHub.
This is empirical data, not theory. 300 checks, 6 weeks, 4 models.