You tested your prompts. They worked. Six weeks later, something broke. Here's why prompt testing isn't enough.
The Testing Gap
Most LLM integration testing looks like this:
```python
def test_classifier():
    result = llm("Classify: 'I love this product'")
    assert "positive" in result.lower()
```
This is fine for launch. It's useless for production.
What Prompt Testing Can't Catch
1. Model Updates
OpenAI, Anthropic, Google — they all update models without advance notice. Your prompt might work on GPT-4o from March 1st and fail on GPT-4o from March 15th.
The model name stays the same. The behavior doesn't.
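One mitigation is to pin a dated model snapshot rather than a floating alias, and log the model identifier with every stored result so a silent swap is visible. A minimal sketch, assuming your provider offers dated snapshots; `gpt-4o-2024-08-06` is an illustrative snapshot name, and `record_result`/`model_changed` are hypothetical helpers:

```python
# Pin a dated snapshot (illustrative name), not the floating "gpt-4o" alias.
PINNED_MODEL = "gpt-4o-2024-08-06"

def record_result(output: str, model: str) -> dict:
    """Attach the model identifier to each stored output."""
    return {"model": model, "output": output}

def model_changed(results: list[dict]) -> bool:
    """True if any logged model id differs from the pin."""
    return any(r["model"] != PINNED_MODEL for r in results)
```

If `model_changed` ever returns True, you know the ground shifted under you before you start debugging the prompt.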
2. Context Degradation
Long conversations cause context pollution. A prompt that works in isolation fails after 20 messages.
Your test suite runs clean prompts. Production runs messy conversations.
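One way to close that gap is to run the same prompt cold and again after a long synthetic history. A sketch under stated assumptions: `llm` is a stand-in for your real client call, and the filler turns are fabricated here purely to pollute the context:

```python
def build_messy_context(n_turns: int) -> list[dict]:
    """Fabricate n_turns of unrelated chatter to pollute the context."""
    history = []
    for i in range(n_turns):
        history.append({"role": "user", "content": f"Unrelated question #{i}"})
        history.append({"role": "assistant", "content": f"Unrelated answer #{i}"})
    return history

def classify_with_history(llm, prompt: str, n_turns: int = 20) -> str:
    """Run the test prompt at the end of a 20-message conversation."""
    messages = build_messy_context(n_turns)
    messages.append({"role": "user", "content": prompt})
    return llm(messages)
```

Assert that the cold answer and the polluted-context answer agree; when they stop agreeing, you've measured context degradation instead of guessing at it.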
3. Distribution Shift
Your training data and your production users behave differently. The classifier that worked on your test set drifts when real users throw edge cases at it.
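You can quantify this instead of discovering it in bug reports: flag production inputs that land far from your test-set centroid in embedding space, and watch the flag rate. A minimal, dependency-free sketch, assuming inputs are already embedded as float vectors; `centroid` and `out_of_distribution` are hypothetical helpers and the radius is something you tune:

```python
def centroid(vectors: list[list[float]]) -> list[float]:
    """Mean vector of the test-set embeddings."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def out_of_distribution(x: list[float], center: list[float], radius: float) -> bool:
    """True if an input embedding falls outside the chosen radius."""
    dist = sum((a - b) ** 2 for a, b in zip(x, center)) ** 0.5
    return dist > radius
```

A rising out-of-distribution rate is distribution shift made visible, before accuracy metrics catch up.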
What Actually Catches Drift
Continuous Monitoring
```python
# Run against baseline every N hours
def check_drift():
    baseline = load_baseline("baseline.json")
    current = llm_batch(test_prompts)
    return cosine_similarity(baseline, current)
```
You compare outputs, not just correctness. A slightly different word choice is an early warning sign.
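The similarity step above can be made concrete without any dependencies. A sketch assuming `baseline` and `current` are embedding vectors from your embedding model (here, plain lists of floats); the 0.9 threshold is illustrative:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def drifted(baseline: list[float], current: list[float], threshold: float = 0.9) -> bool:
    """Flag drift when output embeddings diverge below the threshold."""
    return cosine_similarity(baseline, current) < threshold
```

Identical embeddings score 1.0, orthogonal ones 0.0; drift lives in the slide between them.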
Semantic Checks, Not String Matching
```python
# Bad: exact match
assert output == expected  # Fragile, catches nothing real

# Good: semantic similarity
assert similarity(output, expected) > 0.85  # Catches drift before it breaks things
```
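In production you'd back `similarity` with an embedding model; for a self-contained sketch, token-level Jaccard overlap is a crude stand-in that is already far more drift-tolerant than exact string matching. This helper is an assumption, not the article's implementation:

```python
def similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercase token sets: 1.0 identical, 0.0 disjoint."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)
```

Swap it for embedding similarity once one is wired in; the assertion shape stays the same.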
Production Traffic Analysis
```python
# Flag outputs that look unusual
def is_anomalous(output):
    score = embedding_similarity(output, historical_average)
    return score < 0.7  # Low similarity = investigate
```
Anomalies in production traffic often precede user-visible bugs by hours or days.
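The `historical_average` above implies some rolling state. One way to sketch it, assuming similarity scores are computed upstream; the window size and 0.7 floor are illustrative, and `AnomalyTracker` is a hypothetical name:

```python
from collections import deque

class AnomalyTracker:
    """Keep a rolling window of similarity scores; flag low outliers."""

    def __init__(self, window: int = 100, floor: float = 0.7):
        self.scores = deque(maxlen=window)  # oldest scores age out
        self.floor = floor

    def observe(self, score: float) -> bool:
        """Record a score; return True if it warrants investigation."""
        anomalous = score < self.floor
        self.scores.append(score)
        return anomalous

    def mean(self) -> float:
        """Rolling average, the 'historical_average' to compare against."""
        return sum(self.scores) / len(self.scores) if self.scores else 0.0
```

The bounded deque keeps memory flat, and the rolling mean gives you the baseline to alert against.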
The Missing Piece
Prompt testing tells you: "Does this work right now?"
Drift monitoring tells you: "Is this still working?"
You need both.
The drift detector I built: DriftWatch — free, open source, deploys to Railway in minutes.
The gap between "tested" and "production-safe" is where most LLM incidents live.