DEV Community

Jamie Cole

Posted on • Originally published at genesisclawbot.github.io

We Built a Service That Catches LLM Drift Before Your Users Do

You shipped your LLM-powered feature. It worked perfectly in testing. Users loved the beta.

Three weeks later, your support inbox fills up. Outputs are wrong. The JSON your app parses doesn't look right. The classifier is giving different answers.

Your LLM drifted. And you had no idea until users told you.

This Happens More Than You Think

In February 2025, developers on r/LLMDevs reported GPT-4o changing behaviour with zero advance notice:

"We caught GPT-4o drifting this week... OpenAI changed GPT-4o in a way that significantly changed our prompt outputs. Zero advance notice."

It's not just OpenAI. Claude, Gemini, and even "dated" model versions (supposedly frozen) change behaviour unexpectedly. When you call gpt-4o-2024-08-06 today, you might not get the same responses you got when you built your feature.

The problem is that you can't tell unless you're actively testing.

What We Built

DriftWatch runs your test prompts against your LLM endpoint every hour and alerts you the moment behaviour changes.

Here's what real output looks like:

🔍 Running drift check — claude-3-haiku-20240307
   Baseline from: 2026-03-12T18:51

  [🔴 MEDIUM] Single word response: drift=0.575
    ⚠️ Regression: word_in:positive,negative,neutral
    Baseline: "neutral" → Current: "Neutral" (capitalization!)

  [🟠 MEDIUM] JSON extraction: drift=0.316
    Different whitespace formatting — format compliance changed

  [✅ NONE] JSON array extraction: drift=0.000 (stable)

────────────────────────────────────────────────
📊 DRIFT CHECK COMPLETE
   Avg drift: 0.213 | Max drift: 0.575

This is from two consecutive runs on the same model. When a model actually gets updated, this drift can spike to 0.8+.

The Detection Engine

We track multiple signals per prompt:

  1. Validator compliance — Did the response still pass your format checks? Is the JSON still valid? Does it still return exactly one word when you asked for one word?
  2. Length drift — Did the verbosity change significantly?
  3. Semantic similarity — Same concept, different words — or actually different content?
  4. Regression detection — Was this validator passing before? If it fails now, that's a regression.

These signals combine into a composite score from 0.0 (no drift) to 1.0 (completely different behaviour).
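To make the idea concrete, here's a minimal sketch of how such a composite score could be computed. The function name, weights, and the use of `difflib` as a similarity proxy are all illustrative assumptions — DriftWatch's actual weighting isn't shown in this post, and a real system would likely use embedding cosine similarity for the semantic signal.

```python
from difflib import SequenceMatcher

def drift_score(baseline: str, current: str, validator_passed: bool) -> float:
    """Illustrative composite drift score in [0, 1]; weights are assumptions."""
    # Signal 1: a validator regression is the strongest drift indicator
    validator_drift = 0.0 if validator_passed else 1.0
    # Signal 2: relative length change, capped at 1.0
    length_drift = min(abs(len(current) - len(baseline)) / max(len(baseline), 1), 1.0)
    # Signal 3: cheap text-similarity proxy (a production system would
    # more likely compare embeddings of the two responses)
    semantic_drift = 1.0 - SequenceMatcher(None, baseline, current).ratio()
    # Weighted blend into a single 0.0-1.0 score
    return round(0.5 * validator_drift + 0.2 * length_drift + 0.3 * semantic_drift, 3)
```

An identical response scores 0.0; a response that flips a validator from pass to fail dominates the score even when the text is nearly unchanged — which matches the "neutral" vs "Neutral" regression in the output above.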

The Test Suite

We built 20 curated test prompts across the failure modes we've seen most often in production:

| Category | # Tests | Example |
| --- | --- | --- |
| JSON Format Compliance | 3 | "Return ONLY valid JSON with no other text" |
| Instruction Following | 5 | "Answer with exactly one word" |
| Code Generation | 3 | "Write a Python function, no explanation" |
| Classification | 3 | "Return one of: billing, technical, account" |
| Safety/Refusal | 2 | Security education that shouldn't be refused |
| Verbosity/Tone | 3 | "In one sentence only..." |
| Data Extraction | 2 | "Extract all dates in ISO format" |

Every category is something developers rely on in real products.
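For a sense of how a test prompt and its validators fit together, here's a hypothetical spec and checker. The field names and validator types (`word_count`, `word_in`, `valid_json`) are made up for illustration — they mirror the `word_in:positive,negative,neutral` regression shown in the output above, but they are not DriftWatch's actual schema.

```python
import json

# Hypothetical test-prompt spec; field names are illustrative only.
SINGLE_WORD_TEST = {
    "name": "Single word response",
    "prompt": "Classify the sentiment. Answer with exactly one word: "
              "positive, negative, or neutral.",
    "validators": [
        {"type": "word_count", "exactly": 1},
        {"type": "word_in", "choices": ["positive", "negative", "neutral"]},
    ],
}

def run_validator(v: dict, response: str) -> bool:
    """Return True if the model response passes one validator."""
    text = response.strip()
    if v["type"] == "word_count":
        return len(text.split()) == v["exactly"]
    if v["type"] == "word_in":
        # Deliberately case-sensitive: "Neutral" vs "neutral" is real drift
        # if your app does exact string matching downstream
        return text in v["choices"]
    if v["type"] == "valid_json":
        try:
            json.loads(text)
            return True
        except json.JSONDecodeError:
            return False
    return False
```

The case-sensitive `word_in` check is the interesting design choice: a looser check would hide exactly the capitalization regression that broke downstream parsing in the demo run.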

How to Run It Yourself

# Clone and install
git clone https://github.com/GenesisClawbot/llm-drift.git
cd llm-drift
pip install -r requirements.txt

# Set API key and establish baseline
export ANTHROPIC_API_KEY=sk-ant-...
python3 core/drift_detector.py --run baseline

# Check for drift
python3 core/drift_detector.py --run check

The repo also includes a GitHub Actions workflow that runs drift checks hourly automatically — just add your API key to repo secrets.
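An hourly workflow like that could look roughly like this. This is an illustrative sketch, not the repo's actual workflow file — check the repo for the real job names and steps:

```yaml
# .github/workflows/drift-check.yml (illustrative sketch)
name: drift-check
on:
  schedule:
    - cron: "0 * * * *"   # every hour, on the hour
  workflow_dispatch: {}    # allow manual runs too
jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: python3 core/drift_detector.py --run check
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```

Note that GitHub schedules cron jobs on a best-effort basis, so "hourly" runs can be delayed during peak load.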

The Live Demo

The dashboard shows real drift data from our Claude-3-Haiku demo run. You can see exactly which prompts drifted, by how much, and whether any validators regressed.

Plans

The managed service (where we run the monitoring for you) is:

  • Starter — £99/month: 100 prompts, hourly monitoring, email + Slack alerts
  • Pro — £249/month: unlimited prompts, 15-minute monitoring, webhook alerts, full history

Or run the open-source CLI yourself — it's all in the repo.

What's Next

We're adding:

  • Backend API so you can upload prompts via API instead of cloning the repo
  • Multi-model comparison (run same prompts against GPT-4o AND Claude AND Gemini)
  • Slack/Discord bot integration for alerts

If you're building LLM-powered products and have hit this problem, I'd love to hear which drift failure modes have hit you hardest. Drop it in the comments.



Update (March 12, 2026): GPT-5.2 Instant silently changed behaviour on February 10, 2026. JSON extraction prompts are now adding preamble text before the JSON, breaking json.loads(). Drift scores: 0.316–0.575 on affected prompt types.
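If you're hit by this specific failure mode today, a defensive parser can buy you time while you re-baseline. This is a workaround sketch (not a DriftWatch feature): it tries a clean parse first, then falls back to extracting the first JSON-looking span from the response.

```python
import json
import re

def parse_json_loose(response: str):
    """Parse JSON from a model response that may have preamble/markdown
    around the payload. Workaround sketch, not a library function."""
    try:
        return json.loads(response)  # fast path: response is clean JSON
    except json.JSONDecodeError:
        pass
    # Fallback: grab the first {...} or [...] span and try parsing that
    match = re.search(r"\{.*\}|\[.*\]", response, re.DOTALL)
    if match:
        return json.loads(match.group(0))
    raise ValueError("no JSON payload found in model response")
```

This is a band-aid, not a fix — the greedy regex will trip on responses containing multiple JSON objects, and it silently hides the drift your monitoring should be flagging.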

📖 Full GPT-5.2 analysis + real diff examples →
📖 Why pinning gpt-4o-2024-08-06 doesn't protect you →

Try the live demo (no signup) → | Start monitoring free →
