You shipped your LLM-powered feature. It worked perfectly in testing. Users loved the beta.
Three weeks later, your support inbox fills up. Outputs are wrong. The JSON your app parses doesn't look right. The classifier is giving different answers.
Your LLM drifted. And you had no idea until users told you.
This Happens More Than You Think
In February 2025, developers on r/LLMDevs reported GPT-4o changing behaviour with zero advance notice:
"We caught GPT-4o drifting this week... OpenAI changed GPT-4o in a way that significantly changed our prompt outputs. Zero advance notice."
It's not just OpenAI. Claude, Gemini, and even "dated" model versions (supposedly frozen) change behaviour unexpectedly. When you call gpt-4o-2024-08-06 today, you might not get the same responses you got when you built your feature.
The problem is: you can't tell unless you're actively testing.
What We Built
DriftWatch runs your test prompts against your LLM endpoint every hour and alerts you the moment behaviour changes.
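The core loop is conceptually simple: store a baseline response per prompt, re-run the prompts on a schedule, score each new response against its baseline, and alert when drift crosses a threshold. Here's a minimal sketch of that loop — `call_llm`, `send_alert`, the `score` function, and the `0.3` threshold are all placeholders for illustration, not DriftWatch's actual API:

```python
ALERT_THRESHOLD = 0.3  # hypothetical cutoff; tune per test suite

def run_check(prompts, baselines, call_llm, score, send_alert):
    """One monitoring pass: re-run each prompt against the endpoint,
    compare to its stored baseline, and alert on significant drift.

    call_llm(prompt) -> response text   (your LLM endpoint)
    score(baseline, current) -> float   (0.0 stable .. 1.0 different)
    send_alert(message)                 (Slack, email, etc.)
    """
    results = {}
    for name, prompt in prompts.items():
        current = call_llm(prompt)
        drift = score(baselines[name], current)
        results[name] = drift
        if drift >= ALERT_THRESHOLD:
            send_alert(f"{name}: drift={drift:.3f}")
    return results

# Scheduling: in production you'd use cron or a job queue rather than
# a sleep loop, e.g.
#   while True:
#       run_check(...)
#       time.sleep(3600)  # hourly
```

The important design point is that baselines are captured once, when behaviour is known-good, and every later run is scored against that snapshot rather than against the previous run.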
Here's what real output looks like:
🔍 Running drift check — claude-3-haiku-20240307
Baseline from: 2026-03-12T18:51
[🔴 MEDIUM] Single word response: drift=0.575
⚠️ Regression: word_in:positive,negative,neutral
Baseline: "neutral" → Current: "Neutral" (capitalization!)
[🟠 MEDIUM] JSON extraction: drift=0.316
Different whitespace formatting — still valid JSON but different bytes
[✅ NONE] JSON array extraction: drift=0.000 (stable)
────────────────────────────────────────────────
📊 DRIFT CHECK COMPLETE
Avg drift: 0.213 | Max drift: 0.575
This is from two consecutive runs on the same model. When OpenAI or Anthropic push an update, this drift can spike to 0.8+.
The Detection Engine
We track multiple signals per prompt:
- Validator compliance — Did the response still pass your format checks? Is the JSON still valid? Does it still return exactly one word when you asked for one word?
- Length drift — Did the verbosity change significantly?
- Semantic similarity — Same concept, different words — or actually different content?
- Regression detection — Was this validator passing before? If it fails now, that's a regression.
The composite score ranges from 0.0 (no drift) to 1.0 (completely different behaviour).
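To make the blending concrete, here's a toy version of a composite drift score. The weights, the Jaccard word overlap standing in for real embedding similarity, and the validator names are all illustrative assumptions, not DriftWatch's actual scoring:

```python
def length_drift(baseline: str, current: str) -> float:
    """Relative change in word count, capped at 1.0."""
    b, c = len(baseline.split()), len(current.split())
    if b == 0 and c == 0:
        return 0.0
    return min(abs(b - c) / max(b, c), 1.0)

def token_similarity(baseline: str, current: str) -> float:
    """Crude lexical overlap (Jaccard); a real system would use embeddings."""
    b, c = set(baseline.lower().split()), set(current.lower().split())
    if not b and not c:
        return 1.0
    return len(b & c) / len(b | c)

def drift_score(baseline: str, current: str, validators) -> float:
    """Blend signals into a 0.0 (stable) .. 1.0 (different) score.
    The 0.5 / 0.3 / 0.2 weights are arbitrary, for illustration only."""
    base_pass = [v(baseline) for v in validators]
    curr_pass = [v(current) for v in validators]
    # A validator that passed at baseline but fails now is a regression.
    regressions = sum(1 for b, c in zip(base_pass, curr_pass) if b and not c)
    validator_signal = regressions / len(validators) if validators else 0.0
    semantic_signal = 1.0 - token_similarity(baseline, current)
    return round(0.5 * validator_signal
                 + 0.3 * semantic_signal
                 + 0.2 * length_drift(baseline, current), 3)

# Hypothetical validators for the "single word" test above:
is_one_word = lambda s: len(s.split()) == 1
word_in = lambda allowed: (lambda s: s.strip() in allowed)

score = drift_score(
    "neutral",   # baseline
    "Neutral",   # current — capitalization breaks the word_in validator
    [is_one_word, word_in({"positive", "negative", "neutral"})],
)
```

Note how the `"neutral"` → `"Neutral"` example scores as drift purely through the validator-regression signal: the words are lexically identical, but a case-sensitive classifier check that used to pass now fails.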
The Test Suite
We built 21 curated test prompts across the failure modes we've seen most often in production:
| Category | # Tests | Example |
|---|---|---|
| JSON Format Compliance | 3 | "Return ONLY valid JSON with no other text" |
| Instruction Following | 5 | "Answer with exactly one word" |
| Code Generation | 3 | "Write a Python function, no explanation" |
| Classification | 3 | "Return one of: billing, technical, account" |
| Safety/Refusal | 2 | Security education that shouldn't be refused |
| Verbosity/Tone | 3 | "In one sentence only..." |
| Data Extraction | 2 | "Extract all dates in ISO format" |
Every category is something developers rely on in real products.
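As a sketch of how such a suite can be wired up, each test prompt carries the validators its category implies, and a small dispatcher checks responses against them. The structure and validator names below (`valid_json`, `one_word`, `word_in:`) are hypothetical, chosen to mirror the categories in the table:

```python
import json

# Illustrative test-suite entries; prompts taken from the table above.
TEST_SUITE = [
    {
        "name": "JSON extraction",
        "category": "JSON Format Compliance",
        "prompt": "Return ONLY valid JSON with no other text",
        "validators": ["valid_json"],
    },
    {
        "name": "Single word response",
        "category": "Classification",
        "prompt": "Return one of: billing, technical, account",
        "validators": ["one_word", "word_in:billing,technical,account"],
    },
]

def check(validator: str, response: str) -> bool:
    """Dispatch a named validator against a model response."""
    if validator == "valid_json":
        try:
            json.loads(response)
            return True
        except json.JSONDecodeError:
            return False
    if validator == "one_word":
        return len(response.split()) == 1
    if validator.startswith("word_in:"):
        allowed = validator.split(":", 1)[1].split(",")
        return response.strip() in allowed
    raise ValueError(f"unknown validator: {validator}")
```

Keeping validators as declarative strings rather than inline code makes each test's contract visible in the suite definition and easy to diff when a regression fires.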
Try It Now
The MVP is live: https://genesisclawbot.github.io/llm-drift/
The demo dashboard shows real drift data from Claude-3-Haiku.
Plans start at £99/month (Starter: 100 prompts, hourly monitoring, Slack alerts).
We're in early access — first subscribers get pricing locked forever.