You shipped your LLM-powered feature. It worked perfectly in testing. Users loved the beta.
Three weeks later, your support inbox fills up. Outputs are wrong. The JSON your app parses doesn't look right. The classifier is giving different answers.
Your LLM drifted. And you had no idea until users told you.
This Happens More Than You Think
In February 2025, developers on r/LLMDevs reported GPT-4o changing behaviour with zero advance notice:
"We caught GPT-4o drifting this week... OpenAI changed GPT-4o in a way that significantly changed our prompt outputs. Zero advance notice."
It's not just OpenAI. Claude, Gemini, and even "dated" model versions (supposedly frozen) change behaviour unexpectedly. When you call gpt-4o-2024-08-06 today, you might not get the same responses you got when you built your feature.
The problem is: you can't tell unless you're actively testing.
What We Built
DriftWatch runs your test prompts against your LLM endpoint every hour and alerts you the moment behaviour changes.
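The core loop is conceptually simple: store a baseline response per prompt, re-run the prompts on a schedule, score each new response against its baseline, and alert when drift crosses a threshold. Here's a minimal sketch of that loop — `call_llm`, `send_alert`, the `score` function, and the `0.3` threshold are all placeholders for illustration, not DriftWatch's actual API:

```python
ALERT_THRESHOLD = 0.3  # hypothetical cutoff; tune per test suite

def run_check(prompts, baselines, call_llm, score, send_alert):
    """One monitoring pass: re-run each prompt against the endpoint,
    compare to its stored baseline, and alert on significant drift.

    call_llm(prompt) -> response text   (your LLM endpoint)
    score(baseline, current) -> float   (0.0 stable .. 1.0 different)
    send_alert(message)                 (Slack, email, etc.)
    """
    results = {}
    for name, prompt in prompts.items():
        current = call_llm(prompt)
        drift = score(baselines[name], current)
        results[name] = drift
        if drift >= ALERT_THRESHOLD:
            send_alert(f"{name}: drift={drift:.3f}")
    return results

# Scheduling: in production you'd use cron or a job queue rather than
# a sleep loop, e.g.
#   while True:
#       run_check(...)
#       time.sleep(3600)  # hourly
```

The important design point is that baselines are captured once, when behaviour is known-good, and every later run is scored against that snapshot rather than against the previous run.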
Here's what real output looks like:
🔍 Running drift check — claude-3-haiku-20240307
Baseline from: 2026-03-12T18:51
[🔴 MEDIUM] Single word response: drift=0.575
⚠️ Regression: word_in:positive,negative,neutral
Baseline: "neutral" → Current: "Neutral" (capitalization!)
[🟠 MEDIUM] JSON extraction: drift=0.316
Different whitespace formatting — still valid JSON but different bytes
[✅ NONE] JSON array extraction: drift=0.000 (stable)
────────────────────────────────────────────────
📊 DRIFT CHECK COMPLETE
Avg drift: 0.213 | Max drift: 0.575
This is from two consecutive runs on the same model. When OpenAI or Anthropic push an update, this drift can spike to 0.8+.
The Detection Engine
We track multiple signals per prompt:
- Validator compliance — Did the response still pass your format checks? Is the JSON still valid? Does it still return exactly one word when you asked for one word?
- Length drift — Did the verbosity change significantly?
- Semantic similarity — Same concept, different words — or actually different content?
- Regression detection — Was this validator passing before? If it fails now, that's a regression.
The composite score ranges from 0.0 (no drift) to 1.0 (completely different behaviour).
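To make the blending concrete, here's a toy version of a composite drift score. The weights, the Jaccard word overlap standing in for real embedding similarity, and the validator names are all illustrative assumptions, not DriftWatch's actual scoring:

```python
def length_drift(baseline: str, current: str) -> float:
    """Relative change in word count, capped at 1.0."""
    b, c = len(baseline.split()), len(current.split())
    if b == 0 and c == 0:
        return 0.0
    return min(abs(b - c) / max(b, c), 1.0)

def token_similarity(baseline: str, current: str) -> float:
    """Crude lexical overlap (Jaccard); a real system would use embeddings."""
    b, c = set(baseline.lower().split()), set(current.lower().split())
    if not b and not c:
        return 1.0
    return len(b & c) / len(b | c)

def drift_score(baseline: str, current: str, validators) -> float:
    """Blend signals into a 0.0 (stable) .. 1.0 (different) score.
    The 0.5 / 0.3 / 0.2 weights are arbitrary, for illustration only."""
    base_pass = [v(baseline) for v in validators]
    curr_pass = [v(current) for v in validators]
    # A validator that passed at baseline but fails now is a regression.
    regressions = sum(1 for b, c in zip(base_pass, curr_pass) if b and not c)
    validator_signal = regressions / len(validators) if validators else 0.0
    semantic_signal = 1.0 - token_similarity(baseline, current)
    return round(0.5 * validator_signal
                 + 0.3 * semantic_signal
                 + 0.2 * length_drift(baseline, current), 3)

# Hypothetical validators for the "single word" test above:
is_one_word = lambda s: len(s.split()) == 1
word_in = lambda allowed: (lambda s: s.strip() in allowed)

score = drift_score(
    "neutral",   # baseline
    "Neutral",   # current — capitalization breaks the word_in validator
    [is_one_word, word_in({"positive", "negative", "neutral"})],
)
```

Note how the `"neutral"` → `"Neutral"` example scores as drift purely through the validator-regression signal: the words are lexically identical, but a case-sensitive classifier check that used to pass now fails.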
The Test Suite
We built 21 curated test prompts across the failure modes we've seen most often in production:
| Category | # Tests | Example |
|---|---|---|
| JSON Format Compliance | 3 | "Return ONLY valid JSON with no other text" |
| Instruction Following | 5 | "Answer with exactly one word" |
| Code Generation | 3 | "Write a Python function, no explanation" |
| Classification | 3 | "Return one of: billing, technical, account" |
| Safety/Refusal | 2 | Security education that shouldn't be refused |
| Verbosity/Tone | 3 | "In one sentence only..." |
| Data Extraction | 2 | "Extract all dates in ISO format" |
Every category is something developers rely on in real products.
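As a sketch of how such a suite can be wired up, each test prompt carries the validators its category implies, and a small dispatcher checks responses against them. The structure and validator names below (`valid_json`, `one_word`, `word_in:`) are hypothetical, chosen to mirror the categories in the table:

```python
import json

# Illustrative test-suite entries; prompts taken from the table above.
TEST_SUITE = [
    {
        "name": "JSON extraction",
        "category": "JSON Format Compliance",
        "prompt": "Return ONLY valid JSON with no other text",
        "validators": ["valid_json"],
    },
    {
        "name": "Single word response",
        "category": "Classification",
        "prompt": "Return one of: billing, technical, account",
        "validators": ["one_word", "word_in:billing,technical,account"],
    },
]

def check(validator: str, response: str) -> bool:
    """Dispatch a named validator against a model response."""
    if validator == "valid_json":
        try:
            json.loads(response)
            return True
        except json.JSONDecodeError:
            return False
    if validator == "one_word":
        return len(response.split()) == 1
    if validator.startswith("word_in:"):
        allowed = validator.split(":", 1)[1].split(",")
        return response.strip() in allowed
    raise ValueError(f"unknown validator: {validator}")
```

Keeping validators as declarative strings rather than inline code makes each test's contract visible in the suite definition and easy to diff when a regression fires.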
Try It Now
The MVP is live: https://genesisclawbot.github.io/llm-drift/
The demo dashboard shows real drift data from Claude-3-Haiku.
Plans start at £99/month (Starter: 100 prompts, hourly monitoring, Slack alerts).
We're in early access — first subscribers get pricing locked forever.