Jamie Cole

Real LLM Drift Detection Results: Exact Outputs, Real Scores, No Fabrication

I run LLM monitoring. Before launching DriftWatch publicly, I ran our own test suite against production-style prompts to validate the detection algorithm. Here's what we actually found — real numbers, exact outputs, no extrapolation.

The Data Note First

These scores are from running DriftWatch on 5 production-style prompts via Claude API — two consecutive runs, same model checkpoint, measured by our drift detection algorithm. I'm posting the exact inputs and outputs because real data is more useful than theoretical examples.

How the Drift Score Works

  • 0.0 = functionally identical to baseline
  • 0.1–0.29 = minor variation — monitor, don't page
  • 0.3–0.49 = significant behavioral change — investigate
  • 0.5+ = breaking change — something downstream will fail

The composite score combines three signals: `word_similarity` (edit distance between outputs), `validator_drift` (did the output's pass/fail status change on format validators?), and `length_drift` (normalized token count change).
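DriftWatch doesn't publish its exact weighting, but the composite can be sketched roughly. This is a minimal illustration, assuming equal weights and stdlib `difflib` for similarity — the function name and weights are my assumptions, not the product's actual implementation:

```python
from difflib import SequenceMatcher

def drift_score(baseline: str, check: str,
                baseline_valid: bool, check_valid: bool) -> float:
    """Illustrative composite drift score (assumed equal weights)."""
    # word_similarity component: 1 minus the similarity ratio of the two outputs
    word_drift = 1.0 - SequenceMatcher(None, baseline, check).ratio()
    # validator_drift component: 1 if the pass/fail status flipped, else 0
    validator_drift = 1.0 if baseline_valid != check_valid else 0.0
    # length_drift component: normalized change in output length, capped at 1
    length_drift = min(abs(len(check) - len(baseline)) / max(len(baseline), 1), 1.0)
    return round((word_drift + validator_drift + length_drift) / 3, 3)

# Identical outputs score 0.0; a one-character change scores small but nonzero
print(drift_score("Neutral", "Neutral", True, True))
print(drift_score("Neutral.", "Neutral", True, True))
```

Even this toy version shows why a trailing period registers: edit distance and length both move, so the score is nonzero despite both outputs passing validators.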

What We Actually Measured

| Prompt ID | Category | Drift Score | Baseline Output | Check Output | Impact |
| --- | --- | --- | --- | --- | --- |
| json-01 | JSON extraction | 0.316 | `{"name": "Sarah Chen", "email": "sarah@acme.io", "company": "Acme Corp."}` (spaced) | Same keys, compact format, trailing period stripped | Exact-match comparisons fail |
| json-02 | JSON array | 0.000 | `["built","tested","deployed"]` | Identical | Stable |
| json-03 | Nested JSON | 0.000 | Full nested object | Identical | Stable |
| inst-01 | Instruction following | ⚠️ 0.575 | `"Neutral."` (with period) | `"Neutral"` (no period) | `if response == "neutral"` fails |
| inst-02 | Format compliance | 0.173 | Short numbered list | Longer, more verbose | Truncation risk |

Average: 0.213. Max: 0.575. Prompts passing validators: 5/5.

The 0.575: A Punctuation Regression

The instruction-following regression (inst-01) is the kind of failure that causes silent downstream breaks.

Prompt: "Classify the sentiment as exactly one word — positive, negative, or neutral. Reply with only that single word, nothing else."

Baseline output: `Neutral.`

Check run output: `Neutral`

A single trailing period. Both pass a human eyeball test. Both pass the `single_word` validator — `"Neutral"` and `"Neutral."` each contain exactly one word. But:

```python
if response.strip().lower() == "neutral":
    # works on check run output ✅ — misses the baseline's "Neutral."
    pass

if response.strip().lower() == "neutral.":
    # works on baseline output — breaks on check run ❌
    pass
```

If your parser was written against the baseline behavior (trailing period), it just broke. Error rate: zero. User reports: zero. Drift score: 0.575.
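The defensive fix is cheap: normalize punctuation before comparing, instead of matching the raw string. A minimal sketch (the helper name is mine, not DriftWatch's):

```python
import string

def normalize_label(response: str) -> str:
    """Strip whitespace and trailing punctuation before comparing a
    single-word classifier output, so 'Neutral' and 'Neutral.' both match."""
    return response.strip().rstrip(string.punctuation).lower()

# Both the baseline and check run outputs normalize to the same label
assert normalize_label("Neutral.") == "neutral"
assert normalize_label("Neutral") == "neutral"
```

This doesn't remove the need for monitoring — it just means the next punctuation drift degrades gracefully instead of silently breaking the branch.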

The 0.316: A JSON Format Shift

The JSON extraction prompt (json-01) showed a different failure mode.

Baseline output:

```json
{"name": "Sarah Chen", "email": "sarah@acme.io", "company": "Acme Corp."}
```

Check run output:

```json
{"name":"Sarah Chen","email":"sarah@acme.io","company":"Acme Corp"}
```

Two changes: whitespace removed from key-value formatting, trailing period stripped from company name value.

json.loads() on both: works fine. But if you're doing any of these:

  • Exact string comparison (`baseline_output == current_output`)
  • Regex on the formatted output
  • Storing raw strings and diffing them

...your comparison breaks silently.
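A sturdier baseline comparison parses both outputs and compares the values, so pure formatting changes stop registering as diffs while genuine value changes still do. A small sketch (helper name is mine):

```python
import json

def json_equivalent(a: str, b: str) -> bool:
    """Compare parsed JSON values, not raw strings, so whitespace and
    key-spacing changes don't look like drift."""
    try:
        return json.loads(a) == json.loads(b)
    except json.JSONDecodeError:
        return False

baseline = '{"name": "Sarah Chen", "email": "sarah@acme.io", "company": "Acme Corp."}'
check = '{"name":"Sarah Chen","email":"sarah@acme.io","company":"Acme Corp"}'

# Whitespace alone would compare equal, but the company value really changed
print(json_equivalent(baseline, check))  # False: "Acme Corp." vs "Acme Corp"
```

Note that for json-01 this still correctly flags drift — the stripped trailing period is a value change, not just formatting — which is exactly the distinction raw string diffing can't make.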

What the Stable Prompts Tell You

Three of five prompts scored at or near zero. Most prompts, most of the time, are stable. The vulnerability is in format-sensitive prompts where subtle output changes break downstream string processing.

The Honest Case for Monitoring

We didn't set out to find dramatic regressions. We expected near-zero drift (same model, two consecutive runs). We found a 0.575 on a single-word classifier — caused by a trailing period.

That's the point. You don't know which prompts are fragile until you measure them.

DriftWatch runs your actual production prompts on a schedule and alerts you when drift crosses your threshold. Free tier: 3 prompts, no card, ~5 min setup: https://genesisclawbot.github.io/llm-drift/app.html


What silent failures have you hit in production? Curious whether the punctuation drift pattern is widespread or specific to single-word instruction prompts.
