Jamie Cole

I Found a 0.575 Drift Score Between Two Consecutive LLM Runs. Here's Exactly What Changed.

I was testing our drift detection algorithm before launch. I ran the same five prompts twice against the same model checkpoint. Expected: near-zero drift. Got: 0.575 on a single-word classifier.

Here's the exact data, why it matters, and what it tells you about LLM reliability monitoring.

The Setup

Prompt (inst-01):

Classify the sentiment of this review as exactly one word — positive, negative, or neutral.
Reply with only that single word, nothing else.

Review: "The product works fine but the packaging was damaged."

Validators: must return a single word; must be one of: positive, negative, neutral.

Expected result: near-zero drift (same model, two consecutive runs, no update between them).
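The two validators above are simple to sketch. This is my own minimal reconstruction (the function name and punctuation handling are assumptions, not DriftWatch's actual code), but it captures the behavior that matters later in this post: a trailing period still passes.

```python
ACCEPTED = {"positive", "negative", "neutral"}

def passes_validators(output: str) -> bool:
    # Validator 1: must be exactly one word
    words = output.strip().split()
    if len(words) != 1:
        return False
    # Validator 2: must be in the accepted label set.
    # Trailing punctuation is tolerated, so "Neutral." counts.
    return words[0].strip(".,!?").lower() in ACCEPTED
```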

What Actually Happened

Baseline output: "Neutral."

Check run output: "Neutral"

The difference: a trailing period. That's it.

Why It Scored 0.575

Our drift score is a composite of three components:

| Component | Value | What happened |
| --- | --- | --- |
| validator_drift | 0.5 | The outputs aren't exact string matches, which triggers the validator delta weight |
| length_drift | 0.125 | Length changed: 8 chars ("Neutral.") → 7 chars ("Neutral") |
| word_similarity | 0.0 | Edit distance between "Neutral." and "Neutral" is non-zero |
| Overall | 0.575 | Weighted composite |

Note: both outputs technically pass the validators: "Neutral." and "Neutral" each contain exactly one word from the accepted set. The format checks are satisfied, but the composite score still catches the difference.
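Both length_drift values reported in this post are consistent with a simple relative-length formula. To be clear, this is my reconstruction from the published numbers, not DriftWatch's source code:

```python
def length_drift(baseline: str, current: str) -> float:
    # Relative length change, measured against the baseline output
    return abs(len(baseline) - len(current)) / len(baseline)

length_drift("Neutral.", "Neutral")  # 1/8 = 0.125, matching the table
```

The same formula gives 6/73 ≈ 0.082 for the json-01 example later in this post, which is a good sign the reconstruction is close.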

Why This Matters in Production

Here's where it gets interesting. Let's say you wrote your parser against the baseline behavior:

response = call_llm(prompt)

# Written when baseline returned "Neutral."
if response.strip() == "Neutral.":
    sentiment = "neutral"
elif response.strip() == "Positive.":
    sentiment = "positive"
else:
    sentiment = "negative"  # fallback: anything unexpected is silently labeled negative

After the drift, every input gets classified as "negative" because the trailing period disappeared.

No exception. No log entry. Your sentiment pipeline just went silent-wrong.

Alternative: if you normalized properly:

# This works on both behaviors
sentiment = response.strip().rstrip('.').lower()

But how many production parsers do that? And how many teams have this kind of defensive coding everywhere?
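A more defensive shape for the parser above combines both fixes: normalize first, then fail loudly instead of defaulting to a label. (The function name here is mine, for illustration.)

```python
VALID_LABELS = {"positive", "negative", "neutral"}

def parse_sentiment(response: str) -> str:
    # Normalize: tolerate whitespace, trailing periods, and case drift
    label = response.strip().rstrip(".").lower()
    # Fail loudly: an unexpected label should raise, not default
    if label not in VALID_LABELS:
        raise ValueError(f"unexpected classifier output: {response!r}")
    return label
```

Raising on unknown output turns silent-wrong into a visible failure you can log and alert on, which is the whole game here.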

This Is Natural Variance, Not a Model Update

The important context: this happened between two consecutive runs on the same model. No update. No provider announcement. Just natural variance in model output.

When a model actually gets updated, the drift is higher and more consistent. What we're showing here is that even without an update, the baseline and current outputs can diverge in ways your parser notices.

This means:

  1. Establishing a good baseline matters — run it multiple times to smooth out natural variance before you start tracking drift against it
  2. Passing validators ≠ stable outputs: a 0.575 drift score on a test that passes every format check is a real signal, not noise
  3. Your validators may not catch what matters — passing format checks doesn't mean your parser handles the output
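Point 1 is easy to operationalize. One approach (a hypothetical helper, not a DriftWatch feature): run the prompt several times and keep the modal output as the baseline, so a one-off trailing period never becomes the reference you drift against.

```python
from collections import Counter

def establish_baseline(run_prompt, n: int = 5) -> str:
    # Run the same prompt n times and keep the most common output,
    # smoothing out one-off natural variance
    outputs = [run_prompt() for _ in range(n)]
    return Counter(outputs).most_common(1)[0][0]
```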

The Other Result: json-01 (0.316)

The JSON extraction prompt also drifted:

Baseline: {"name": "Sarah Chen", "email": "sarah@acme.io", "company": "Acme Corp."}

Check run: {"name":"Sarah Chen","email":"sarah@acme.io","company":"Acme Corp"}

Two changes:

  • Whitespace removed from key-value formatting (spaced → compact)
  • Trailing period stripped from the company field value: "Acme Corp." → "Acme Corp"

json.loads() works fine on both. But baseline_output == current_output is false. Any regex written against the spaced format breaks. And the data itself changed — the company name lost a period.
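One cheap check that catches this class of change: compare the parsed structures, not just the raw strings. A sketch using the two outputs above:

```python
import json

baseline = '{"name": "Sarah Chen", "email": "sarah@acme.io", "company": "Acme Corp."}'
current = '{"name":"Sarah Chen","email":"sarah@acme.io","company":"Acme Corp"}'

b, c = json.loads(baseline), json.loads(current)
# Both parse cleanly, but structural comparison still flags the data change
changed = {k for k in b if b[k] != c[k]}
print(b == c, changed)  # False {'company'}
```

Structural comparison ignores the cosmetic whitespace change (which is usually what you want) while surfacing the real one: the company name lost its period.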

Drift score breakdown: length_drift: 0.082, word_similarity: 0.0, overall: 0.316.

What This Means for Your Monitoring Strategy

Two lessons:

1. Monitor at the output level, not just the parse level. json.loads() succeeding doesn't mean your downstream code is happy. The whitespace and trailing period changes in json-01 produce valid JSON but can break downstream string operations.

2. Single-token outputs are highest-risk. The fewer characters in the expected output, the more impact any small change has. A classifier returning "yes" or "no" is more fragile than one returning a paragraph — one punctuation mark is a larger fraction of the total output.

DriftWatch runs your actual production prompts on a schedule and alerts when outputs cross your drift threshold. Free tier — 3 prompts, no card: https://genesisclawbot.github.io/llm-drift/app.html


Curious: have you hit trailing punctuation or whitespace changes in production? I suspect this pattern is more common than people realize because it's hard to spot in logs.
