There's a failure mode in LLM apps that breaks production silently. It shows up after 10-15 messages in a conversation, and nobody tests for it until users report the problem.
The Problem: Context Pollution
LLMs reprocess the full conversation context with every message. The problem: earlier messages in the conversation shape how the model interprets new ones.
After ~10-15 messages, you start seeing:
- Responses getting longer — the model starts summarizing previous messages instead of answering
- Specificity decreasing — the model starts hedging more, qualifying everything
- Format breaking — JSON or structured outputs start failing when they worked fine at the start
- Forgetting constraints — "never do X" instructions get silently ignored
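Of these symptoms, format breaking is the easiest to catch automatically. A minimal sketch, assuming your feature expects JSON responses (the function name is mine, not from any library):

```python
import json

def json_failure_rate(responses):
    """Fraction of model responses that fail to parse as JSON."""
    failures = 0
    for text in responses:
        try:
            json.loads(text)
        except json.JSONDecodeError:
            failures += 1
    return failures / len(responses) if responses else 0.0
```

Plot this rate against conversation depth and the degradation curve becomes visible immediately.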
Why This Happens
The model has a finite context window, and its attention is spread across everything in it. As conversation history accumulates, your current prompt and its constraints compete with an ever-growing pile of earlier tokens, and the history increasingly wins.
How to Detect It
Run a fixed set of 20 standardized prompts at increasing conversation lengths. Embed each response and track cosine similarity against the short-conversation baseline. Look for the drop after message 10.
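A minimal sketch of that comparison, assuming you already have response embeddings from whatever embedding model you use (the 0.85 threshold is an illustrative value, not a recommendation):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def detect_drift(baseline_vecs, response_vecs, threshold=0.85):
    """Flag turns whose response embedding drifts from the baseline answer.

    baseline_vecs[i] and response_vecs[i] are embeddings of the same
    standardized prompt's answer at short vs. long conversation depth.
    Returns (turn_number, similarity) pairs that fall below the threshold.
    """
    flagged = []
    for turn, (base, resp) in enumerate(zip(baseline_vecs, response_vecs), start=1):
        sim = cosine_similarity(base, resp)
        if sim < threshold:
            flagged.append((turn, round(sim, 3)))
    return flagged
```

If the flagged turns cluster after a particular depth, that depth is where your context pollution starts.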
How to Fix It
- Summarization works best: compress older turns into a short summary and drop the originals. It preserves the most context, but adds latency and cost.
- Sliding window is simplest: only keep the last N messages (plus the system prompt).
- Separate threads work best for user-facing chatbots: prompt users to start a fresh conversation when the topic changes.
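The sliding window option fits in a few lines. A sketch, assuming the common list-of-dicts message format with `role`/`content` keys; note it keeps the system prompt separately so it never falls out of the window:

```python
def sliding_window(messages, n=10):
    """Keep any system messages plus the last n non-system messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-n:]
```

Call it on the history right before each model request; the model then never sees more than n turns of pollution.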
The Test You Should Run
Before shipping any conversation feature, run a 15-turn conversation test with a known-good prompt. If output quality degrades, your production users will notice.
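That test can be automated. A sketch of the harness, assuming a `send(history)` callable wrapping your LLM call and a `score(reply)` quality metric such as embedding similarity to a reference answer (both names are mine; the 0.85 threshold is illustrative):

```python
def run_degradation_test(send, probe_prompt, filler_turns, score, threshold=0.85):
    """Re-ask a known-good prompt at increasing conversation depth.

    send(history)  -> assistant reply text (your LLM call)
    score(reply)   -> quality score for the reply
    Returns the first turn depth where quality drops below
    threshold * baseline, or None if it never does.
    """
    history = []
    baseline = score(send(history + [{"role": "user", "content": probe_prompt}]))
    for depth, filler in enumerate(filler_turns, start=1):
        # Grow the conversation by one real exchange.
        history.append({"role": "user", "content": filler})
        history.append({"role": "assistant", "content": send(history)})
        # Re-ask the probe at this depth, without polluting the history.
        reply = send(history + [{"role": "user", "content": probe_prompt}])
        if score(reply) < threshold * baseline:
            return depth
    return None
```

Run it with 15 filler turns in CI; a non-None result before depth 15 means your users will hit the degradation in normal use.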
The Monitoring Solution
I added conversation length tracking to DriftWatch. It alerts when any conversation exceeds 15 messages without summary or reset.
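DriftWatch's internals aren't shown here, but the core check is simple enough to sketch. A hypothetical minimal version, not the actual product code:

```python
def conversations_over_limit(turn_counts, limit=15):
    """Return ids of conversations exceeding the turn limit.

    turn_counts maps conversation id -> number of messages since the
    last summary or reset; anything over the limit should trigger an alert.
    """
    return [cid for cid, count in turn_counts.items() if count > limit]
```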
Try DriftWatch — from £9.90/mo
Context pollution is the silent killer of LLM conversation quality. Test for it before your users find it.