DEV Community

Jamie Cole
The LLM Failure Mode Nobody Tests For (Until It Breaks Production)

There's a failure mode that breaks production silently. It happens after 10-15 messages in a conversation. Nobody tests for it until users report the problem.

The Problem: Context Pollution

LLMs reprocess the full conversation context on every message, so earlier messages shape how the model interprets each new one.

After ~10-15 messages, you start seeing:

  1. Responses getting longer — the model starts summarizing previous messages instead of answering
  2. Specificity decreasing — the model starts hedging more, qualifying everything
  3. Format breaking — JSON or structured outputs start failing when they worked fine at the start
  4. Forgetting constraints — "never do X" instructions get silently ignored

Why This Happens

The model has a finite context window. As conversation history accumulates, your current prompt becomes a smaller and smaller fraction of the total input, and the model increasingly weights the history over the instruction in front of it.
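A rough back-of-envelope illustration of that dilution (the per-turn token counts here are assumptions for the example, not measurements):

```python
def history_fraction(history_tokens: int, prompt_tokens: int) -> float:
    """Fraction of the model's input occupied by history rather than
    the new prompt."""
    return history_tokens / (history_tokens + prompt_tokens)

# Assuming ~150 tokens per turn, after 15 turns a 50-token prompt is
# competing with 2,250 tokens of history:
share = history_fraction(15 * 150, 50)
print(f"{share:.0%} of the input is history")  # ~98%
```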


How to Detect It

Run 20 standardized prompts with increasing conversation lengths. Track cosine similarity against baseline. Look for the drop after message 10.
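A minimal sketch of that check, assuming you already have response embeddings from whatever embedding model you use (`detect_drift` and the 0.85 threshold here are illustrative defaults, not part of any specific tool):

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def detect_drift(baseline_vecs, turn_vecs, threshold=0.85):
    """Compare each turn's response embedding against the baseline
    response for the same prompt. Returns the per-turn scores and the
    index of the first turn that falls below the threshold (or None)."""
    scores = [cosine_similarity(b, t) for b, t in zip(baseline_vecs, turn_vecs)]
    first_bad = next((i for i, s in enumerate(scores) if s < threshold), None)
    return scores, first_bad
```

In practice the turn index where `first_bad` lands is the interesting number: if it consistently appears around turn 10, that is your budget before you need summarization or a reset.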


How to Fix It

  1. Summarization — highest output quality, but adds latency and cost for the summary call.
  2. Sliding window — simplest: only keep the last N messages.
  3. Separate threads — best for user-facing chatbots, since each task starts with a clean context.
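The sliding-window option is small enough to sketch inline. This assumes the common `{"role", "content"}` message format and always pins the system prompt so your constraints survive the trim:

```python
def sliding_window(messages, max_turns=10):
    """Keep the system prompt plus the last max_turns messages.
    Older user/assistant turns are dropped from the model input."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_turns:]
```

Pinning the system prompt matters: a naive `messages[-N:]` eventually slices off your "never do X" instructions, which is exactly the silent failure described above.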


The Test You Should Run

Before shipping any conversation feature, run a 15-turn conversation test with a known-good prompt. If output quality degrades, your production users will notice.
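A minimal harness for that test might look like this. `call_model` is a placeholder for whatever client wrapper you use (hypothetical, not a specific vendor API), and the filler prompts are arbitrary stand-ins for real conversation:

```python
def turn_test(call_model, probe_prompt, filler_turns=14):
    """Run the same known-good prompt fresh and again after
    filler_turns of conversation, returning both responses so you can
    compare quality, format, and constraint adherence.
    call_model(messages) -> str is assumed."""
    baseline = call_model([{"role": "user", "content": probe_prompt}])

    history = []
    for i in range(filler_turns):
        msg = {"role": "user", "content": f"Filler question {i}"}
        reply = call_model(history + [msg])
        history += [msg, {"role": "assistant", "content": reply}]

    late = call_model(history + [{"role": "user", "content": probe_prompt}])
    return baseline, late
```

Diff the two responses by hand or with the embedding check above; a format break (e.g. valid JSON at turn 1, prose at turn 15) is an automatic fail.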


The Monitoring Solution

I added conversation length tracking to DriftWatch. It alerts when any conversation exceeds 15 messages without summary or reset.

Try DriftWatch — from £9.90/mo

Context pollution is the silent killer of LLM conversation quality. Test for it before your users find it.
