It started with a Slack message from a user: "Your summaries look weird." Not an error. Not a crash. Just... weird.
By the time I'd triaged, reproduced, and traced it back to the root cause, the model had been returning malformed JSON on about 12% of requests for 72 hours. Our error handling was swallowing the parse failures and returning stale cache. Users were getting yesterday's data labeled as today's. Nobody's monitoring caught it because no exception was raised. The model just quietly stopped following the format instructions it had followed reliably for months.
The model version hadn't changed. The prompt hadn't changed. But the behaviour had.
## The Silent Update Problem
LLM providers push updates to their models constantly — safety tuning, RLHF adjustments, infrastructure changes. The uncomfortable truth is that even pinned, dated model versions are not immutable. OpenAI has documented this explicitly; Anthropic's model cards acknowledge that behaviour can shift between deployments of the same version. There's no changelog. There's no deprecation notice. The endpoint just starts returning something slightly different.
The failure modes I've seen personally or read about in detail:
- JSON format compliance breaks — the model starts wrapping responses in markdown code fences it wasn't using before, or drops a required field
- Instruction following degrades — "respond only in Spanish" starts getting ignored on edge cases, then more cases
- Output verbosity shifts — responses get longer or shorter, which breaks downstream parsing or UX assumptions
- Semantic meaning drifts — sentiment classification that was calibrated against the model's output starts returning different distributions without the labels changing
None of these trigger a 500. None of them show up in your p99 latency. They show up in user complaints, A/B test anomalies, and that sinking feeling when you go back through logs.
## What to Actually Measure
"Test your prompts" is advice that sounds obvious and is almost useless without specifics. Here's what actually gives you signal:
### 1. Format compliance rate
If your prompt asks for structured output, track the percentage of responses that parse successfully against your schema. Not just "did JSON.parse() succeed" — validate the shape. A model that starts wrapping JSON in backticks will still technically return valid JSON if you strip the fences, but it's telling you something changed.
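As a sketch, here's a parser that tracks fence-stripping and shape validation as two separate signals, so you can see the moment fences start appearing even while parses still succeed. The `REQUIRED_KEYS` schema is a stand-in for your own:

```python
import json
import re

REQUIRED_KEYS = {"summary", "sentiment"}  # stand-in schema; use your own


def parse_response(raw: str) -> tuple[bool, bool]:
    """Return (parsed_ok, fences_present) so both signals get tracked separately."""
    stripped = raw.strip()
    fences_present = stripped.startswith("```")
    # Remove markdown code fences the model may have started adding
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", stripped)
    try:
        parsed = json.loads(cleaned)
    except json.JSONDecodeError:
        return False, fences_present
    # Validate the shape, not just that json.loads() succeeded
    parsed_ok = isinstance(parsed, dict) and REQUIRED_KEYS <= parsed.keys()
    return parsed_ok, fences_present
```

Log both booleans per request: a rising `fences_present` rate with a flat `parsed_ok` rate is exactly the kind of silent change this post is about.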
### 2. Instruction adherence on canary prompts
Maintain a small set of prompts with deterministic expectations. "List exactly three items." "Respond in French." "Do not include any preamble." Run these on a schedule and check the output programmatically, not with another LLM.
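A canary runner can be a few lines. The prompts and checks below are illustrative, and `call_model` is a stand-in for however you invoke your model:

```python
# Each canary pairs a prompt with a deterministic, code-level check.
# These examples are illustrative; write checks for your own production calls.
CANARIES = [
    ("List exactly three fruits, one per line, nothing else.",
     lambda out: len([line for line in out.splitlines() if line.strip()]) == 3),
    ("Reply with the single word OK and no preamble.",
     lambda out: out.strip() == "OK"),
]


def run_canaries(call_model) -> list[dict]:
    """call_model: a function mapping prompt text to response text."""
    results = []
    for prompt, check in CANARIES:
        output = call_model(prompt)
        results.append({"prompt": prompt, "passed": check(output)})
    return results
```

The point of `lambda`-style checks is that they're binary and cheap; the moment "no preamble" starts failing, you know without an LLM judge in the loop.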
### 3. Response length distribution
Track token count per request category. Sudden mean shift is a reliable early warning. A model that starts adding "Certainly! Here's..." preamble will show up in your length stats before it shows up in your error logs.
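One way to sketch this is a rolling window with a z-score check; the window size, warm-up count, and threshold here are assumptions you'd tune per category:

```python
import statistics
from collections import deque


class LengthMonitor:
    """Track token counts for one request category; flag sudden outliers."""

    def __init__(self, window: int = 500, z_threshold: float = 3.0):
        self.window = deque(maxlen=window)
        self.z_threshold = z_threshold

    def observe(self, token_count: int) -> bool:
        """Record a count; return True if it's an outlier vs the rolling window."""
        flagged = False
        if len(self.window) >= 30:  # need enough history for a stable estimate
            mean = statistics.mean(self.window)
            stdev = statistics.stdev(self.window) or 1.0  # guard a zero stdev
            flagged = abs(token_count - mean) / stdev > self.z_threshold
        self.window.append(token_count)
        return flagged
```

Keep one monitor per request category; a global distribution hides a shift that only hits one prompt type.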
### 4. Semantic similarity against a baseline
For non-deterministic outputs, cosine similarity between a current response and a stored baseline response (for the same prompt) gives you a drift signal without requiring exact matching. Embed both and compare.
## A Minimal Drift Detector
Here's the core of a home-rolled detection loop. This runs a canary prompt, compares it to a stored baseline, and flags when similarity drops below threshold:
```python
import json
from datetime import datetime, timezone

import numpy as np
import openai

client = openai.OpenAI()

CANARY_PROMPT = (
    "Summarise the following in exactly two sentences, JSON only, "
    "keys: summary, sentiment.\n\n"
    "Text: The product launch exceeded expectations with strong pre-orders."
)

BASELINE_EMBEDDING = None  # load from your store before calling check_drift()


def get_embedding(text: str) -> list[float]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding


def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def check_drift(threshold: float = 0.92) -> dict:
    if BASELINE_EMBEDDING is None:
        raise RuntimeError("No baseline loaded; store one while you trust the output")

    resp = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": CANARY_PROMPT}],
        temperature=0,
    )
    current_text = resp.choices[0].message.content

    # Format check: must be a JSON object with the required keys, not just valid JSON
    try:
        parsed = json.loads(current_text)
        format_ok = isinstance(parsed, dict) and {"summary", "sentiment"} <= parsed.keys()
    except json.JSONDecodeError:
        format_ok = False

    # Semantic check against the stored baseline embedding
    current_embedding = get_embedding(current_text)
    similarity = cosine_similarity(BASELINE_EMBEDDING, current_embedding)

    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "format_ok": format_ok,
        "semantic_similarity": similarity,
        "drifted": not format_ok or similarity < threshold,
    }
```
Run this on a cron every hour. Store the results. Alert when `drifted` is true for two consecutive checks (one failure is noise; two is a pattern).
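The consecutive-failure rule is small enough to get wrong; applied to the stored `drifted` flags, oldest first, it's just:

```python
def should_alert(history: list[bool], consecutive: int = 2) -> bool:
    """history: drift flags, oldest first. Alert only on N consecutive failures."""
    return len(history) >= consecutive and all(history[-consecutive:])
```

A transient API blip produces `[..., True, False]`; a real shift produces `[..., True, True]` and keeps producing it.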
## The Honest Limitations of Rolling Your Own
The code above will catch the obvious cases. It won't catch subtle semantic drift unless your canary prompts are carefully designed for your actual use cases. It won't tell you which behaviour changed. It won't give you a baseline comparison across multiple models or versions. And maintaining a suite of canary prompts that actually reflects production behaviour is genuinely tedious — most teams let it rot.
The other problem is operational: who owns the alert at 3am when your canary fails? How do you distinguish a transient API blip from a real behaviour change? How do you know if the drift you're seeing is consistent across all users or only on certain input patterns?
I've been evaluating DriftWatch as a way to handle this without maintaining the infrastructure myself. It runs your test prompts hourly across model versions, tracks the signals I described above, and alerts when something shifts. Their demo dashboard shows real drift data from Claude-3-Haiku — worth looking at if you want a sense of how frequently "stable" models actually move. It's in early access, starts at £99/month, which is roughly the cost of one engineer-hour of incident response per month if your team is anything like mine.
Whether you use a service or build your own, the key point is that passive monitoring — watching for errors and latency — is not enough.
## What to Do Right Now
- Pick two or three prompts that represent your highest-stakes LLM calls
- Store a baseline response and its embedding today, while you trust the model's output
- Run a comparison hourly and alert on format failures or similarity below 0.90
- Log token counts per request category so you have a length baseline
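For the baseline step, even a flat JSON file is enough to start; a minimal sketch, where the `baselines.json` filename is an arbitrary choice and you'd swap in your real store later:

```python
import json
from pathlib import Path


def store_baseline(prompt: str, response_text: str, embedding: list[float],
                   path: str = "baselines.json") -> None:
    """Save a trusted response and its embedding, keyed by prompt."""
    p = Path(path)
    baselines = json.loads(p.read_text()) if p.exists() else {}
    baselines[prompt] = {"response": response_text, "embedding": embedding}
    p.write_text(json.dumps(baselines))
```

The important part is the timing, not the storage: capture the baseline today, while the output is known-good, because you can't reconstruct one after the drift.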
The goal isn't perfect detection. It's cutting that 72-hour discovery window down to something you can actually respond to before users notice.
Your users shouldn't be your monitoring system.