- Book: Agents in Production — Building, Tracing, and Shipping Multi-Step AI You Can Trust
- Also by me: Observability for LLM Applications — the companion book in The AI Engineer's Library (2-book series)
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
You give an agent a long task. It runs fine for six steps. On step nine the latency doubles. On step fourteen the answers get vague, and it starts repeating a tool call it already made. On step twenty the request dies with a context-length error, and your on-call engineer inherits a 300KB message array to read at 2am.
Nothing broke. The context window filled up. That is context bloat, and it is the failure mode nobody warns you about when you wire your first agent loop together. The window is finite. A long-running agent will find the edge of it, every time, unless you decide up front what to keep, what to summarize, and what to throw away.
Why context grows without bound
Each turn of an agent loop appends. The model asks for a tool. You run it. You feed the result back. That result gets stapled to the messages list and rides along in every future prompt until the task ends.
Tool results are the problem. A single JSON blob from a search API or a database query can be over a thousand tokens on its own. Ten of those and your buffer is heavier than a forty-turn chat from yesterday. The model has a big window, but big is not infinite. Claude Opus 4.6 gives you 1M tokens. You will still hit it if every step drags its full history forward.
Two things degrade before you hit the hard wall. Latency climbs roughly linearly with input tokens, so a bloated buffer makes every turn slower. And quality drops on long inputs because of the "lost in the middle" effect, where the model attends worse to material sitting in the interior of a long prompt. The goal you gave it on step one is now buried under nineteen tool results, and the model is paying less attention to it than to the noise.
The rule of thumb: start compressing at around 60% of the window. Do not wait for the error.
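To make the 60% rule concrete, here is a minimal sketch of the arithmetic. The function names and the numbers in the usage note are hypothetical, chosen only for illustration; plug in your own window size and your own average tool-result size.

```python
def compression_threshold(window_tokens: int, percent: int = 60) -> int:
    """Token count at which compression should start (integer math, no float rounding)."""
    return window_tokens * percent // 100


def first_step_over(threshold: int, base_tokens: int, tokens_per_step: int) -> int:
    """First step (1-indexed) whose cumulative prompt size exceeds the threshold.

    Assumes every step appends the same number of tokens, which is a
    simplification. Returns 0 if the base prompt is already over the threshold.
    """
    if tokens_per_step <= 0:
        raise ValueError("tokens_per_step must be positive")
    step = 0
    total = base_tokens
    while total <= threshold:
        step += 1
        total += tokens_per_step
    return step
```

With a hypothetical 200,000-token window, `compression_threshold(200_000)` is 120,000. If the prompt starts at 2,000 tokens and each step appends 1,500, `first_step_over(120_000, 2_000, 1_500)` returns 79. Bigger tool results pull that step number down fast.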
Measure tokens, not turns
The instinct is to prune on turn count. Keep the last 20 messages, drop the rest. That instinct is wrong because turns are not the same size. One turn holding a large tool result outweighs ten short chat turns.
Count tokens instead. Anthropic gives you an endpoint for this, and it is worth the call because tool results and image blocks do not map cleanly to character length.
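```python
# budget.py -- pip install "anthropic==0.94.1"
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
COUNTER_MODEL = "claude-sonnet-4-6"
# Start compressing here. Tune this toward roughly 60% of the window you target.
TOKEN_BUDGET = 60_000


def count_tokens(messages: list[dict]) -> int:
    """Ask the token-counting endpoint how large this message list is."""
    r = client.messages.count_tokens(
        model=COUNTER_MODEL,
        messages=messages,
    )
    return r.input_tokens


def over_budget(messages: list[dict]) -> bool:
    """True once the buffer has crossed the compression threshold."""
    return count_tokens(messages) > TOKEN_BUDGET
```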
Call over_budget() before every model call. When it trips, you prune. The budget is your threshold, and you tune it so pruning fires on roughly one turn in ten under normal load, not on every single turn.
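Tuning toward "one turn in ten" is easier if you measure how often pruning fires. A minimal sketch, assuming you pass in your own counter and pruner: `BudgetGate` and `prune_rate` are hypothetical names for this post, not part of any SDK.

```python
from typing import Callable

Messages = list[dict]


class BudgetGate:
    """Wraps a token counter and a pruner, and records how often pruning fires."""

    def __init__(
        self,
        count: Callable[[Messages], int],
        prune: Callable[[Messages], Messages],
        budget: int,
    ) -> None:
        self.count = count
        self.prune = prune
        self.budget = budget
        self.turns = 0
        self.prunes = 0

    def check(self, messages: Messages) -> Messages:
        """Call once before every model call; returns the list to send."""
        self.turns += 1
        if self.count(messages) > self.budget:
            self.prunes += 1
            return self.prune(messages)
        return messages

    @property
    def prune_rate(self) -> float:
        """Fraction of turns on which pruning fired (0.0 before any turn)."""
        return self.prunes / self.turns if self.turns else 0.0
```

In the loop you would build it once, for example with `count_tokens` as the counter and one of the eviction policies below as the pruner, then log `prune_rate` at the end of each task. A rate near 1.0 suggests the budget is too tight or the pruner is not freeing enough room.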
Eviction policy 1: the sliding window
The bluntest policy that works: keep the last N messages, discard everything older.
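```python
KEEP_LAST = 12


def slide(messages: list[dict]) -> list[dict]:
    # Keep any system-role entries, then only the most recent KEEP_LAST others.
    # (With the Anthropic Messages API the system prompt is a top-level
    # parameter, so this list is usually empty there.)
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-KEEP_LAST:]
```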
This is one line of real logic and it is right for stateless tool-use tasks where context beyond the last few actions is actively confusing. A code-review agent stepping through a diff does not need step three once it is on step fifteen. The pull request is the state, and the last few actions are all that matter.
The trade-off is total. Anything the user said twenty turns back is gone. For a conversational assistant that is a bug, because the diet preference the user mentioned at the start vanishes the moment it scrolls out of the window. Match the policy to the workload: a sliding window for stateless trajectories, something with memory for dialogue.
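One practical wrinkle with slicing by position: the cut can land between a tool call and its result. The Anthropic Messages API expects every `tool_result` block to follow an assistant message holding the matching `tool_use`, so an orphaned result at the head of the tail is likely to be rejected. Here is a minimal sketch of a slice that trims those orphans. `slide_safe` and `has_tool_result` are hypothetical helpers, and the sketch assumes the pinned goal described in the next section supplies the opening user message.

```python
def has_tool_result(message: dict) -> bool:
    """True if the message carries at least one tool_result content block."""
    content = message.get("content")
    if not isinstance(content, list):
        return False  # plain string content has no blocks
    return any(
        isinstance(block, dict) and block.get("type") == "tool_result"
        for block in content
    )


def slide_safe(messages: list[dict], keep_last: int = 12) -> list[dict]:
    """Keep the last keep_last messages, then drop leading tool results
    whose matching tool_use was cut off by the slice."""
    tail = messages[-keep_last:] if keep_last > 0 else []
    start = 0
    while start < len(tail) and has_tool_result(tail[start]):
        start += 1
    return tail[start:]
```

The result can begin with an assistant message, which is why this only makes sense together with a pinned user message in front of it.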
Pin the goal so it never gets evicted
Here is the mistake that makes any eviction policy dangerous, whether it is the sliding window above or the summarizer below. If you prune the head of the buffer, you can evict the original instruction. The agent then wanders, because nothing in its context points back at what it was asked to do.
Pin the goal. Hold the task statement outside the prunable buffer and re-inject it on every turn, so no eviction policy can ever touch it.
```python
def build_prompt(goal: str, buffer: list[dict]) -> list[dict]:
    pinned = {
        "role": "user",
        "content": f"[current goal, do not lose this]\n{goal}",
    }
    return [pinned] + buffer
```
The goal costs a few hundred tokens and buys you an agent that still knows its objective on step thirty. Do the same for hard constraints (a budget cap, an output schema, a "never email the customer" rule). Constraints that must survive to the end of the task do not belong in the prunable history. They belong pinned.
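Pinning constraints works the same way as pinning the goal. A minimal sketch, assuming constraints are plain strings: `build_pinned_prompt` and the `[hard constraints]` label are hypothetical, not something the model treats specially.

```python
def build_pinned_prompt(
    goal: str,
    constraints: list[str],
    buffer: list[dict],
) -> list[dict]:
    """Prepend one pinned message holding the goal and any hard constraints.

    The buffer is whatever survived pruning; the pinned message is rebuilt
    on every turn, so no eviction policy can remove it.
    """
    lines = ["[current goal, do not lose this]", goal]
    if constraints:
        lines.append("[hard constraints]")
        lines.extend(f"- {c}" for c in constraints)
    pinned = {"role": "user", "content": "\n".join(lines)}
    return [pinned] + buffer
```

Call it with the pruned buffer, for example `build_pinned_prompt(goal, ["never email the customer"], slide(buffer))`, and the constraint rides along on step thirty exactly as it did on step one.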
Eviction policy 2: summarize, then drop
For dialogue and long research tasks, dropping is too lossy. Summarize instead. Keep the last few turns verbatim, hand everything older to a cheap model, and glue the summary back on as a synthetic message.
```python
# compress.py
from budget import client, over_budget  # defined in budget.py above

KEEP_VERBATIM = 6
SUMMARIZER = "claude-sonnet-4-6"


def summarize(old: list[dict]) -> str:
    # Flatten the old turns into "role: content" lines for the summarizer.
    text = "\n".join(
        f"{m['role']}: {m['content']}" for m in old
    )
    resp = client.messages.create(
        model=SUMMARIZER,
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": (
                "Summarize this agent history in under "
                "200 words. Preserve decisions, user "
                "preferences, unresolved questions, and "
                "any tool result the agent still needs.\n\n"
                + text
            ),
        }],
    )
    return resp.content[0].text
```
The summarizer gets its own model call. The compressor decides when to invoke it, keeps the recent tail verbatim, and glues the summary on:
```python
# compress.py (continued)
def compress(messages: list[dict], keep: int = KEEP_VERBATIM) -> list[dict]:
    if not over_budget(messages):
        return messages
    if keep < 1 or len(messages) <= keep:
        # nothing older than the tail is left to summarize
        return messages
    old = messages[:-keep]
    recent = messages[-keep:]
    summary = {
        "role": "user",
        "content": "[prior history summary]\n" + summarize(old),
    }
    result = [summary] + recent
    if over_budget(result):
        # tail alone blows the budget: halve the verbatim tail and recurse.
        # Without shrinking keep, this call would re-summarize the summary
        # and never terminate.
        return compress(result, keep // 2)
    return result
```
Three things earn their place. The summarizer is Claude Sonnet 4.6, cheap enough that the call disappears in the noise and smart enough not to lose the thread. The prompt names exactly what to preserve, so the summary keeps decisions and preferences instead of collapsing to "the user asked some questions." And the recursion handles the case where even the verbatim tail is too big, halving until it fits, which is cheaper than letting the main call fail and retrying the whole loop.
The catch: summaries are lossy by design. The agent will sometimes claim it does not remember a fact you gave it earlier. If a fact must survive compression, do not trust the summary to carry it. Promote it. Write it to a key-value store or a durable memory the way you would write anything you need next week, and read it back deterministically. Short-term buffer is for the flow of the task. Anything you would write on an index card belongs somewhere the summarizer cannot erase.
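Promotion can be as small as a JSON file. This is a minimal sketch of that idea under stated assumptions: `FactStore`, its method names, and the `[durable facts]` label are all hypothetical, and a real system would likely use a database or a memory service instead of a local file.

```python
import json
from pathlib import Path
from typing import Optional


class FactStore:
    """Tiny durable key-value memory that lives outside the prunable buffer."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.facts: dict[str, str] = {}
        if path.exists():
            self.facts = json.loads(path.read_text(encoding="utf-8"))

    def promote(self, key: str, value: str) -> None:
        """Record a fact and persist it immediately."""
        self.facts[key] = value
        self.path.write_text(
            json.dumps(self.facts, indent=2, sort_keys=True),
            encoding="utf-8",
        )

    def as_message(self) -> Optional[dict]:
        """Render all facts as one message to pin next to the goal."""
        if not self.facts:
            return None
        body = "\n".join(f"{k}: {v}" for k, v in sorted(self.facts.items()))
        return {"role": "user", "content": "[durable facts]\n" + body}
```

The read path is deterministic: the same facts come back in the same order on every turn, no matter what the summarizer kept or lost. The diet preference from earlier would be written once with `promote("diet", ...)` and re-injected alongside the pinned goal.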
The decision, per item
The three verbs (keep, summarize, drop) map to three questions you can ask about any message in the buffer.
- Keep it verbatim if it is the goal, a hard constraint, or one of the last few turns. Pin the first two so eviction can never reach them.
- Summarize it if it carries context the agent still needs but not word-for-word. Old dialogue, early tool results that shaped a decision.
- Drop it if it is noise. An acknowledgement, a retried tool call, an aside that changed nothing.
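The three rules above can be sketched as a single function. This assumes each message has been annotated upstream with two hypothetical boolean flags, `pinned` and `noise`; how you set those flags (heuristics, a classifier, the tool layer) is outside this sketch.

```python
from enum import Enum


class Action(Enum):
    KEEP = "keep"
    SUMMARIZE = "summarize"
    DROP = "drop"


def decide(index: int, total: int, message: dict, keep_verbatim: int = 6) -> Action:
    """Pick an action for the message at position index in a buffer of size total."""
    if message.get("pinned"):
        return Action.KEEP  # goal or hard constraint
    if index >= total - keep_verbatim:
        return Action.KEEP  # one of the last few turns
    if message.get("noise"):
        return Action.DROP  # acknowledgement, retried call, dead-end aside
    return Action.SUMMARIZE  # older context the agent still needs in gist
```

The order of the checks is the design choice: pinned items and the recent tail win before anything is judged as noise, so a mislabeled recent message is kept rather than lost.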
Run that decision by token budget, not turn count. Fire it at 60% of the window, not at the error. Pin the goal so the agent never forgets why it is running. Do that, and a long-running agent stops finding the edge of its context and falling off it.
If this was useful
Context bloat is one of the failure modes that only shows up once your agent runs longer than a demo, and the fix lives in how you build the loop and how you watch it. Agents in Production covers the memory architecture behind these policies, from scratchpad hygiene to durable recall. Observability for LLM Applications, its companion in The AI Engineer's Library, is the tracing and evals side, so you can see a summary drop the one fact that mattered before a user does.
