Jamie Cole
Why Prompt Testing Alone Won't Catch LLM Drift (And What Will)

You tested your prompts. They worked. Six weeks later, something broke. Here's why prompt testing isn't enough.

The Testing Gap

Most LLM integration testing looks like this:

def test_classifier():
    result = llm("Classify: 'I love this product'")
    assert "positive" in result.lower()

This is fine for launch. It's useless for production.


What Prompt Testing Can't Catch

1. Model Updates

OpenAI, Anthropic, Google — they all update models without advance notice. Your prompt might work on GPT-4o from March 1st and fail on GPT-4o from March 15th.

The model name stays the same. The behavior doesn't.

2. Context Degradation

Long conversations cause context pollution. A prompt that works in isolation fails after 20 messages.

Your test suite runs clean prompts. Production runs messy conversations.
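One way to close that gap is to replay the same prompt at several conversation depths and check that the answer stays stable. A minimal sketch, assuming a `call_model` function you supply (a stub stands in here so the example runs; the helper names are illustrative, not from any particular SDK):

```python
def run_at_depth(call_model, prompt, filler, depth):
    """Prepend `depth` filler exchanges, then send the real prompt."""
    messages = []
    for _ in range(depth):
        messages.append({"role": "user", "content": filler})
        messages.append({"role": "assistant", "content": "Noted."})
    messages.append({"role": "user", "content": prompt})
    return call_model(messages)

def stability_report(call_model, prompt, filler, depths=(0, 10, 20)):
    """Map each depth to the model's answer so drift across depths is visible."""
    return {d: run_at_depth(call_model, prompt, filler, d) for d in depths}

# Stub standing in for a real API call -- replace with your client.
def fake_model(messages):
    return "positive" if "love" in messages[-1]["content"] else "neutral"

report = stability_report(fake_model, "Classify: 'I love this product'",
                          "Tell me about shipping.")
```

If the answer at depth 20 disagrees with the answer at depth 0, context pollution is already affecting your users.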

3. Distribution Shift

Your training data and your production users behave differently. The classifier that worked on your test set drifts when real users throw edge cases at it.
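You can catch some of this shift before quality metrics move by comparing simple statistics of production inputs against your test set. A rough sketch using plain word-length statistics (real setups usually compare embedding distributions; the function name and threshold are illustrative):

```python
from statistics import mean, stdev

def length_shift(test_inputs, prod_inputs, z_threshold=2.0):
    """Flag when production input lengths drift far from the test-set distribution."""
    lengths = [len(t.split()) for t in test_inputs]
    mu, sigma = mean(lengths), stdev(lengths)
    prod_mu = mean(len(t.split()) for t in prod_inputs)
    z = abs(prod_mu - mu) / (sigma or 1.0)  # guard against zero spread
    return z > z_threshold
```

Length is a crude proxy, but it is cheap enough to run on every batch, and a sudden jump usually means users are sending something your test set never saw.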


What Actually Catches Drift

Continuous Monitoring

# Run against baseline every N hours
def check_drift():
    baseline = load_baseline("baseline.json")  # embeddings of known-good outputs
    current = embed(llm_batch(test_prompts))   # embed fresh outputs for the same prompts
    return cosine_similarity(baseline, current)

You compare outputs, not just correctness. A slightly different word choice is an early warning sign.
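A concrete way to compare outputs is cosine similarity between embeddings of the baseline run and the current run. Here's a self-contained sketch that uses a toy bag-of-words vector as a stand-in for a real embedding model (swap in your provider's embeddings in production):

```python
from collections import Counter
from math import sqrt

def embed(text):
    """Toy embedding: word counts. Replace with a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def drift_score(baseline_outputs, current_outputs):
    """Average pairwise similarity; a drop between runs is the drift signal."""
    sims = [cosine(embed(b), embed(c))
            for b, c in zip(baseline_outputs, current_outputs)]
    return sum(sims) / len(sims)
```

Track `drift_score` over time rather than judging a single run: the trend is the signal.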

Semantic Checks, Not String Matching

# Bad: exact match
assert output == expected  # Fragile: fails on harmless rewording

# Good: semantic similarity
assert similarity(output, expected) > 0.85  # Catches drift before it breaks things
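A drop-in helper for the pattern above, using `difflib.SequenceMatcher` from the standard library as a cheap stand-in for embedding similarity (real semantic checks should use an embedding model):

```python
from difflib import SequenceMatcher

def similarity(output, expected):
    """Character-level similarity in [0, 1]; swap in embedding cosine for true semantics."""
    return SequenceMatcher(None, output.lower(), expected.lower()).ratio()

def assert_semantic(output, expected, threshold=0.85):
    score = similarity(output, expected)
    assert score > threshold, f"Drift suspected: similarity {score:.2f} <= {threshold}"
```

The threshold is the knob: 0.85 tolerates rewording but trips on a changed label or a refusal.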

Production Traffic Analysis

# Flag outputs that look unusual
def is_anomalous(output):
    score = embedding_similarity(output, historical_average)
    return score < 0.7  # Low similarity = investigate

Anomalies in production traffic often precede user-visible bugs by hours or days.
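One way to maintain that `historical_average` is a running aggregate over embeddings of recent outputs. A sketch with the same toy bag-of-words embedding standing in for a real model (class and method names are illustrative):

```python
from collections import Counter
from math import sqrt

def embed(text):
    # Toy word-count embedding; use a real embedding model in production.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class AnomalyDetector:
    def __init__(self, threshold=0.7):
        self.history = Counter()  # summed word counts of all observed outputs
        self.threshold = threshold

    def observe(self, output):
        """Fold a known-good output into the historical aggregate."""
        self.history.update(embed(output))

    def is_anomalous(self, output):
        """Low similarity to history = investigate."""
        return cosine(embed(output), self.history) < self.threshold
```

In practice you'd also decay or window the history so the baseline tracks legitimate gradual change instead of flagging it forever.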


The Missing Piece

Prompt testing tells you: "Does this work right now?"

Drift monitoring tells you: "Is this still working?"

You need both.


The drift detector I built: DriftWatch — free, open source, deploys to Railway in minutes.


The gap between "tested" and "production-safe" is where most LLM incidents live.
