DEV Community

Jamie Cole
The LLM Failure Mode Nobody Tests For (Until It Breaks Production)

There's a failure mode that breaks production silently. It happens after 10-15 messages in a conversation. Nobody tests for it until users report the problem.

The Problem: Context Pollution

LLMs reprocess the full conversation context on every message, so earlier messages shape how the model interprets each new one.

After ~10-15 messages, you start seeing:

  1. Responses getting longer — the model starts summarizing previous messages instead of answering
  2. Specificity decreasing — the model starts hedging more, qualifying everything
  3. Format breaking — JSON or structured outputs start failing when they worked fine at the start
  4. Forgetting constraints — "never do X" instructions get silently ignored

Why This Happens

The model has a finite context window. As conversation history accumulates, your current prompt becomes a smaller and smaller fraction of the total input, and the model increasingly weights the history over the instruction in front of it.
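A rough back-of-envelope illustration of that dilution (the per-turn token counts here are assumptions for the example, not measurements):

```python
def history_fraction(history_tokens: int, prompt_tokens: int) -> float:
    """Fraction of the model's input occupied by history rather than
    the new prompt."""
    return history_tokens / (history_tokens + prompt_tokens)

# Assuming ~150 tokens per turn, after 15 turns a 50-token prompt is
# competing with 2,250 tokens of history:
share = history_fraction(15 * 150, 50)
print(f"{share:.0%} of the input is history")  # ~98%
```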


How to Detect It

Run 20 standardized prompts with increasing conversation lengths. Track cosine similarity against baseline. Look for the drop after message 10.
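A minimal sketch of that check, assuming you already have response embeddings from whatever embedding model you use (`detect_drift` and the 0.85 threshold here are illustrative defaults, not part of any specific tool):

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def detect_drift(baseline_vecs, turn_vecs, threshold=0.85):
    """Compare each turn's response embedding against the baseline
    response for the same prompt. Returns the per-turn scores and the
    index of the first turn that falls below the threshold (or None)."""
    scores = [cosine_similarity(b, t) for b, t in zip(baseline_vecs, turn_vecs)]
    first_bad = next((i for i, s in enumerate(scores) if s < threshold), None)
    return scores, first_bad
```

In practice the turn index where `first_bad` lands is the interesting number: if it consistently appears around turn 10, that is your budget before you need summarization or a reset.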


How to Fix It

  1. Summarization — highest output quality, but adds latency and cost for the summary call.
  2. Sliding window — simplest: only keep the last N messages.
  3. Separate threads — best for user-facing chatbots, since each task starts with a clean context.
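The sliding-window option is small enough to sketch inline. This assumes the common `{"role", "content"}` message format and always pins the system prompt so your constraints survive the trim:

```python
def sliding_window(messages, max_turns=10):
    """Keep the system prompt plus the last max_turns messages.
    Older user/assistant turns are dropped from the model input."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_turns:]
```

Pinning the system prompt matters: a naive `messages[-N:]` eventually slices off your "never do X" instructions, which is exactly the silent failure described above.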


The Test You Should Run

Before shipping any conversation feature, run a 15-turn conversation test with a known-good prompt. If output quality degrades, your production users will notice.
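A minimal harness for that test might look like this. `call_model` is a placeholder for whatever client wrapper you use (hypothetical, not a specific vendor API), and the filler prompts are arbitrary stand-ins for real conversation:

```python
def turn_test(call_model, probe_prompt, filler_turns=14):
    """Run the same known-good prompt fresh and again after
    filler_turns of conversation, returning both responses so you can
    compare quality, format, and constraint adherence.
    call_model(messages) -> str is assumed."""
    baseline = call_model([{"role": "user", "content": probe_prompt}])

    history = []
    for i in range(filler_turns):
        msg = {"role": "user", "content": f"Filler question {i}"}
        reply = call_model(history + [msg])
        history += [msg, {"role": "assistant", "content": reply}]

    late = call_model(history + [{"role": "user", "content": probe_prompt}])
    return baseline, late
```

Diff the two responses by hand or with the embedding check above; a format break (e.g. valid JSON at turn 1, prose at turn 15) is an automatic fail.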


The Monitoring Solution

I added conversation length tracking to DriftWatch. It alerts when any conversation exceeds 15 messages without summary or reset.

Try DriftWatch — from £9.90/mo

Context pollution is the silent killer of LLM conversation quality. Test for it before your users find it.
