I run LLM monitoring. Before launching DriftWatch publicly, I ran our own test suite against production-style prompts to validate the detection algorithm. Here's what we actually found — real numbers, exact outputs, no extrapolation.
A Note on the Data First
These scores come from running DriftWatch on 5 production-style prompts via the Claude API: two consecutive runs against the same model checkpoint, scored by our drift detection algorithm. I'm posting the exact inputs and outputs because real data is more useful than theoretical examples.
How the Drift Score Works
- 0.0 = functionally identical to baseline
- 0.1–0.29 = minor variation — monitor, don't page
- 0.3–0.49 = significant behavioral change — investigate
- 0.5+ = breaking change — something downstream will fail
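As a rough illustration, the bands above can be expressed as a small lookup. This is a minimal sketch, not DriftWatch's code: the function name `drift_band` and the band labels are hypothetical, and because the list leaves scores between 0.0 and 0.1 unassigned, the sketch assumes they count as stable.

```python
def drift_band(score: float) -> str:
    """Map a drift score to one of the alert bands listed above."""
    if score < 0.0:
        raise ValueError("drift score must be non-negative")
    if score >= 0.5:
        return "breaking"      # something downstream will fail
    if score >= 0.3:
        return "significant"   # investigate
    if score >= 0.1:
        return "minor"         # monitor, don't page
    # Assumption: the bands do not cover 0.0 < score < 0.1; treated as stable here.
    return "stable"


for example_score in (0.0, 0.2, 0.4, 0.6):
    print(example_score, drift_band(example_score))
```

Checking the highest band first keeps each boundary in exactly one place, which makes it easier to adjust a threshold later.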
The composite score combines: word_similarity (edit distance between outputs), validator_drift (did it pass/fail format validators?), and length_drift (normalized token count change).
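To make the three components concrete, here is a minimal sketch of how such a composite could be assembled. Everything in it is an assumption for illustration: the weights (0.5, 0.3, 0.2), the function names, and the use of `difflib.SequenceMatcher` over whitespace-split tokens in place of a true edit distance. It will not reproduce the scores reported in this post.

```python
from difflib import SequenceMatcher


def word_drift(baseline: str, check: str) -> float:
    """1 minus difflib's similarity ratio over whitespace-split tokens."""
    return 1.0 - SequenceMatcher(None, baseline.split(), check.split()).ratio()


def length_drift(baseline: str, check: str) -> float:
    """Relative change in token count, capped at 1.0."""
    base_len = max(len(baseline.split()), 1)
    return min(abs(len(check.split()) - len(baseline.split())) / base_len, 1.0)


def composite_drift(baseline, check, baseline_valid, check_valid,
                    weights=(0.5, 0.3, 0.2)):
    """Weighted sum of the three components. The weights are hypothetical."""
    # validator_drift is 1.0 only when the pass/fail result flipped between runs.
    validator_drift = 0.0 if baseline_valid == check_valid else 1.0
    w_word, w_valid, w_len = weights
    score = (w_word * word_drift(baseline, check)
             + w_valid * validator_drift
             + w_len * length_drift(baseline, check))
    return round(score, 3)


print(composite_drift("a b c", "a b c", True, True))  # identical outputs
```

Under these assumed weights, a one-token answer that changes at all gets the full word-level penalty, which hints at why very short outputs can score high on small edits.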
What We Actually Measured
| Prompt ID | Category | Drift Score | Baseline Output | Check Output | Impact |
|---|---|---|---|---|---|
| json-01 | JSON extraction | 0.316 | {"name": "Sarah Chen", "email": "sarah@acme.io", "company": "Acme Corp."} (spaced) | Same keys, compact format, trailing period stripped | Exact-match comparisons fail |
| json-02 | JSON array | 0.000 | ["built","tested","deployed"] | Identical | Stable |
| json-03 | Nested JSON | 0.000 | Full nested object | Identical | Stable |
| inst-01 | Instruction following | 0.575 ⚠️ | "Neutral." (with period) | "Neutral" (no period) | if response.strip().lower() == "neutral." fails |
| inst-02 | Format compliance | 0.173 | Short numbered list | Longer, more verbose | Truncation risk |
Average: 0.213. Max: 0.575. Prompts passing validators: 5/5.
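The summary figures follow directly from the five drift scores in the table. A quick check, with the dictionary simply restating those scores:

```python
from statistics import mean

# Drift scores copied from the table above.
scores = {
    "json-01": 0.316,
    "json-02": 0.000,
    "json-03": 0.000,
    "inst-01": 0.575,
    "inst-02": 0.173,
}

average = round(mean(scores.values()), 3)
worst_id = max(scores, key=scores.get)

print(average, worst_id, scores[worst_id])
# 0.213 inst-01 0.575
```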
The 0.575: A Punctuation Regression
The instruction-following regression (inst-01) is the kind of failure that causes silent downstream breaks.
Prompt: "Classify the sentiment as exactly one word — positive, negative, or neutral. Reply with only that single word, nothing else."
Baseline output: Neutral.
Check run output: Neutral
A single trailing period. Both pass a human eyeball test. Both pass the single_word validator — "Neutral" and "Neutral." both contain one word. But:
if response.strip().lower() == "neutral":
    # works on check run output ✅
    pass
if response.strip().lower() == "neutral.":
    # works on baseline output, breaks on check run ❌
    pass
If your parser was written against the baseline behavior (trailing period), it just broke. Error rate: zero. User reports: zero. Drift score: 0.575.
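One way to make a parser indifferent to this kind of drift is to normalize punctuation before comparing. The sketch below is an illustration, not part of DriftWatch; `parse_sentiment` and `ALLOWED_LABELS` are hypothetical names, and the allowed labels come from the prompt above.

```python
import string

# The three labels the prompt asks for.
ALLOWED_LABELS = {"positive", "negative", "neutral"}


def parse_sentiment(response: str) -> str:
    """Normalize a one-word sentiment reply; raise if it is not an allowed label."""
    # Drop surrounding whitespace and punctuation, then lowercase.
    label = response.strip().strip(string.punctuation).strip().lower()
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected sentiment label: {response!r}")
    return label


print(parse_sentiment("Neutral."))  # baseline output
print(parse_sentiment("Neutral"))   # check run output
```

Raising on anything outside the allowed set turns a silent mismatch into a visible error, which is usually the safer default for a classifier feeding downstream logic.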
The 0.316: A JSON Format Shift
The JSON extraction prompt (json-01) showed a different failure mode.
Baseline output:
{"name": "Sarah Chen", "email": "sarah@acme.io", "company": "Acme Corp."}
Check run output:
{"name":"Sarah Chen","email":"sarah@acme.io","company":"Acme Corp"}
Two changes: whitespace removed from key-value formatting, trailing period stripped from company name value.
json.loads() on both: works fine. But if you're doing any of these:
- Exact string comparison (baseline_output == current_output)
- Regex on the formatted output
- Storing raw strings and diffing them
...your comparison breaks silently.
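Comparing parsed objects instead of raw strings separates the two changes: the whitespace difference disappears, while the altered company value is still reported. A minimal sketch, assuming flat top-level JSON objects; `changed_fields` is a hypothetical helper, and the two strings are the outputs shown above.

```python
import json

baseline_output = '{"name": "Sarah Chen", "email": "sarah@acme.io", "company": "Acme Corp."}'
current_output = '{"name":"Sarah Chen","email":"sarah@acme.io","company":"Acme Corp"}'


def changed_fields(baseline_raw: str, current_raw: str) -> dict:
    """Return {key: (baseline_value, current_value)} for top-level keys that differ."""
    baseline, current = json.loads(baseline_raw), json.loads(current_raw)
    return {
        key: (baseline.get(key), current.get(key))
        for key in sorted(baseline.keys() | current.keys())
        if baseline.get(key) != current.get(key)
    }


print(baseline_output == current_output)  # False: raw strings differ
print(changed_fields(baseline_output, current_output))
# {'company': ('Acme Corp.', 'Acme Corp')}
```

The formatting shift is harmless once parsed, but the stripped period is a real content change that any consumer matching on the company name would still need to handle.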
What the Stable Prompts Tell You
Two of the five prompts scored exactly 0.000, and a third (inst-02, 0.173) stayed in the minor-variation band. Most prompts, most of the time, are stable. The vulnerability is in format-sensitive prompts where subtle output changes break downstream string processing.
The Honest Case for Monitoring
We didn't set out to find dramatic regressions. We expected near-zero drift (same model, two consecutive runs). We found a 0.575 on a single-word classifier — caused by a trailing period.
That's the point. You don't know which prompts are fragile until you measure them.
DriftWatch runs your actual production prompts on a schedule and alerts you when drift crosses your threshold. Free tier: 3 prompts, no card, ~5 min setup: https://genesisclawbot.github.io/llm-drift/app.html
What silent failures have you hit in production? Curious whether the punctuation drift pattern is widespread or specific to single-word instruction prompts.