Gabriel Anhaia

Posted on May 24

Multi-Turn Agent Context Window: 4 Truncation Strategies That Don't Break the Agent

#agents #ai #llm #python

Book: AI Agents Pocket Guide: Patterns for Building Autonomous Systems with LLMs

Your agent crosses the context window at turn 47 and the API rejects the request: prompt is too long: 207843 tokens > 200000 maximum. You reach for the first fix that comes to mind (usually "drop the oldest messages"), and the agent forgets the original task. The bug gets harder to debug, because now the agent answers, but it answers something nobody asked for.

There are four real strategies. Each one trades a different piece of fidelity. If you pick the wrong one for the workload, the agent gets dumber in a way your eval suite probably won't catch.

What blows up at turn 47

Take a coding agent. It opens a repo, runs grep, reads four files, runs the test suite, reads the failures, edits a file, runs the tests again. Each tool result lands as a tool_result message. A single cat of a 400-line file is ~5,000 tokens. A failing pytest log can be 20,000. The system prompt and the original instructions sit at the top, immutable.

By turn 30 you're at 80k tokens. By turn 47 you've crossed 200k. The provider rejects the next request:

anthropic.BadRequestError: Error code: 400 - {
  'type': 'error',
  'error': {
    'type': 'invalid_request_error',
    'message': 'prompt is too long: 207843 tokens > 200000 maximum'
  }
}

You need to ship messages with fewer tokens. There are four ways to do that.

Strategy 1: Sliding window

Drop the oldest user/assistant pairs until you fit. Easiest to write, worst default.

import anthropic

client = anthropic.Anthropic()
SYSTEM = "You are a coding agent. Use tools to read and edit files."
MAX_INPUT_TOKENS = 180_000  # leave headroom

def count_tokens(messages: list[dict]) -> int:
    # token counting is a separate API call: free, but rate-limited,
    # and the number it returns is an estimate
    return client.messages.count_tokens(
        model="claude-opus-4-5",
        system=SYSTEM,
        messages=messages,
    ).input_tokens

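def starts_clean(messages: list[dict]) -> bool:
    # a valid head is a user message that carries no tool_result blocks
    if not messages or messages[0]["role"] != "user":
        return False
    content = messages[0]["content"]
    if isinstance(content, str):
        return True
    return not any(b.get("type") == "tool_result" for b in content)

def slide(messages: list[dict]) -> list[dict]:
    # keep dropping from the front until we fit
    while count_tokens(messages) > MAX_INPUT_TOKENS and len(messages) > 2:
        messages = messages[1:]
        # keep walking so we never start on an orphaned tool_result
        while len(messages) > 1 and not starts_clean(messages):
            messages = messages[1:]
    return messages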

What breaks: the agent forgets the original instruction. At turn 47, the user message at turn 0 ("refactor the auth module to use the new Session class, do not touch the database layer") is the first thing to go. The agent keeps doing useful-looking work, but it has no anchor. It "fixes" the database layer because that's what it sees most recently.

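You also have to drop on a clean boundary. If you drop an assistant turn that contained a tool_use block and keep the tool_result that answers it, the API rejects the request because the tool_result refers to a tool_use_id that no longer exists. Same the other way. The pair that has to travel together is the assistant tool_use and the user tool_result that follows it, so after dropping from the front, keep walking until the first remaining message is a user message with no tool_result blocks.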
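A pre-flight check makes this failure visible before the API does. The sketch below assumes messages are plain dicts in the shape used throughout this post; find_orphans and _has_block are hypothetical helper names, not part of the SDK.

```python
def _has_block(msg, block_type: str, key: str, value: str) -> bool:
    # True if msg holds a block of block_type whose `key` equals `value`
    if msg is None or not isinstance(msg.get("content"), list):
        return False
    return any(
        b.get("type") == block_type and b.get(key) == value
        for b in msg["content"]
    )

def find_orphans(messages: list[dict]) -> list[str]:
    """Return ids of tool_use/tool_result blocks whose partner is missing
    from the adjacent message."""
    orphans = []
    for i, msg in enumerate(messages):
        content = msg.get("content")
        if not isinstance(content, list):
            continue
        for block in content:
            if block.get("type") == "tool_result":
                # the matching tool_use must sit in the previous message
                prev = messages[i - 1] if i > 0 else None
                if not _has_block(prev, "tool_use", "id", block["tool_use_id"]):
                    orphans.append(block["tool_use_id"])
            elif block.get("type") == "tool_use":
                # a tool_use in the final message is still pending, not orphaned
                if i + 1 == len(messages):
                    continue
                nxt = messages[i + 1]
                if not _has_block(nxt, "tool_result", "tool_use_id", block["id"]):
                    orphans.append(block["id"])
    return orphans
```

Running this after any truncation step, and refusing to send when the list is non-empty, turns a 400 from the provider into a local assertion you can debug.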

Use sliding windows for chat: single-task conversations where the latest turn really is the most relevant. Don't use them for agents.

Strategy 2: Summarisation checkpoints

Every N turns, replace the older messages with a single paragraph summary written by the model itself. The agent keeps continuity, you stay under the limit.

SUMMARY_PROMPT = """You are summarising an agent trace for context compression.
Write a single paragraph (max 400 tokens) capturing:
- the original task the user gave the agent
- what the agent has tried and the outcome of each attempt
- any decisions, constraints, or facts discovered from tool results
- what state the agent is currently in (file open, command running, etc.)

No filler. No "the agent did X, then Y, then Z" narration. Extract facts."""

def summarise(messages: list[dict]) -> str:
    resp = client.messages.create(
        model="claude-haiku-4-5",  # cheap model is fine for compression
        max_tokens=600,
        temperature=0,  # keep the compression as deterministic as possible
        system=SUMMARY_PROMPT,
        messages=[{
            "role": "user",
            "content": (
                "Compress this trace into a checkpoint paragraph:\n\n"
                + format_trace(messages)
            ),
        }],
    )
    return resp.content[0].text

def checkpoint(messages: list[dict], keep_recent: int = 6) -> list[dict]:
    if len(messages) <= keep_recent:
        return messages
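    # assumes strict user/assistant alternation ending on a user message;
    # an even keep_recent then starts `recent` on an assistant turn, so no
    # tool_result is left without its tool_use
    old, recent = messages[:-keep_recent], messages[-keep_recent:]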
    summary = summarise(old)
    # the summary becomes a synthetic user message at the start
    return [
        {"role": "user", "content": f"[CHECKPOINT]\n{summary}"},
        {"role": "assistant", "content": "Understood. Continuing."},
    ] + recent
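The summarise function above calls format_trace, which is not defined in this post. Here is a minimal sketch of what it could look like, assuming dict-shaped messages and a per-block character cap (max_chars_per_block is an invented parameter) so one huge tool result cannot crowd out the rest of the trace.

```python
import json

def format_block(block: dict) -> str:
    # render one content block as a single line of text
    kind = block.get("type")
    if kind == "text":
        return block.get("text", "")
    if kind == "tool_use":
        args = json.dumps(block.get("input", {}), sort_keys=True)
        return f"[tool_use {block.get('name')}] {args}"
    if kind == "tool_result":
        body = block.get("content", "")
        if not isinstance(body, str):
            body = json.dumps(body, default=str)
        return f"[tool_result] {body}"
    return f"[{kind}]"

def format_trace(messages: list[dict], max_chars_per_block: int = 2000) -> str:
    lines = []
    for msg in messages:
        content = msg["content"]
        # string content is shorthand for a single text block
        blocks = (
            [{"type": "text", "text": content}]
            if isinstance(content, str) else content
        )
        for block in blocks:
            text = format_block(block)
            if len(text) > max_chars_per_block:
                text = text[:max_chars_per_block] + " [truncated]"
            lines.append(f"{msg['role'].upper()}: {text}")
    return "\n".join(lines)
```

Truncating here is a design choice, not a requirement: the summariser only needs enough of each tool result to say what happened, and the cap keeps the compression call itself under the context limit.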

Two warnings the docs don't shout about.

First, the summariser is a model call. It can hallucinate. Run it with temperature=0 and a tight system prompt that says "extract facts, don't infer." Even then, audit a sample of your traces. The model will sometimes invent a "decision" the agent never made.

Second, summaries compound. If you checkpoint every 6 turns, by turn 30 you're summarising a summary of a summary. Each pass loses fidelity. A clean implementation keeps the original task message verbatim at position 0, and only summarises everything between it and the recent window.

def checkpoint_pinned(messages: list[dict], keep_recent: int = 6) -> list[dict]:
    # pin the original task; it never gets summarised
    pinned, rest = messages[0], messages[1:]
    if len(rest) <= keep_recent:
        return messages
    old, recent = rest[:-keep_recent], rest[-keep_recent:]
    summary = summarise(old)
    return [
        pinned,
        {"role": "assistant", "content": "Acknowledged."},
        {"role": "user", "content": f"[CHECKPOINT]\n{summary}"},
        {"role": "assistant", "content": "Continuing."},
    ] + recent
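For the hallucination warning, part of the audit can be mechanical. The sketch below is a rough heuristic, not a verifier: it pulls tokens that look like file paths or dotted/underscored identifiers out of the summary and flags any that never appear in the text that was summarised. The regex and the name unsupported_identifiers are assumptions made for illustration.

```python
import re

# tokens joined by '/', '.' or '_' (src/auth.py, db.models, run_tests)
IDENT = re.compile(r"\b\w+(?:[./_]\w+)+\b")

def unsupported_identifiers(summary: str, trace_text: str) -> list[str]:
    """Identifiers mentioned in the summary that never occur in the trace."""
    found = set(IDENT.findall(summary))
    return sorted(tok for tok in found if tok not in trace_text)
```

A non-empty result does not prove the summary is wrong, and an empty one does not prove it is right. It only tells you which checkpoints to read by hand first.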

Strategy 3: Tool-result pruning

The cheapest win in the whole game. Most of your tokens aren't from reasoning. They're from tool outputs the agent never refers to again.

A typical coding agent trace, broken down by token type:

System prompt + tool schemas: ~2,000 tokens (constant)
User messages: ~500 tokens total
Assistant reasoning + tool calls: ~15,000 tokens
Tool results: ~180,000 tokens
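To get the same breakdown for your own traces, something like the following works as a first pass. It is a sketch that assumes dict-shaped messages and uses the four-characters-per-token heuristic, so treat the output as proportions rather than exact counts. token_breakdown is a hypothetical helper name.

```python
from collections import Counter

def estimate_tokens(s: str) -> int:
    # rough heuristic: about four characters per token
    return len(s) // 4

def token_breakdown(messages: list[dict]) -> dict[str, int]:
    totals = Counter()
    for msg in messages:
        content = msg["content"]
        blocks = (
            [{"type": "text", "text": content}]
            if isinstance(content, str) else content
        )
        for block in blocks:
            if block.get("type") == "tool_result":
                bucket = "tool_results"
                body = block.get("content", "")
            else:
                bucket = "assistant" if msg["role"] == "assistant" else "user"
                # text blocks carry "text"; tool_use blocks carry "input"
                body = block.get("text") or block.get("input", "")
            text = body if isinstance(body, str) else str(body)
            totals[bucket] += estimate_tokens(text)
    return dict(totals)
```

The system prompt and tool schemas are not part of the message list, so they are not counted here.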

You can drop the result body and keep the tool_use_id shell, so the structure stays valid and the agent still sees "I ran X, and the full output is no longer available."

PRUNE_AFTER_TURNS = 10  # how far back a tool_result has to be before pruning
MAX_RESULT_TOKENS = 2_000  # keep small results in full

def prune_tool_results(messages: list[dict]) -> list[dict]:
    pruned = []
    for i, msg in enumerate(messages):
        # two messages (assistant + user) make one turn
        turns_ago = (len(messages) - 1 - i) // 2
        if msg["role"] != "user" or turns_ago < PRUNE_AFTER_TURNS:
            pruned.append(msg)
            continue
        # user messages can hold tool_result blocks
        if not isinstance(msg.get("content"), list):
            pruned.append(msg)
            continue
        new_content = []
        for block in msg["content"]:
            if block.get("type") != "tool_result":
                new_content.append(block)
                continue
            body = block.get("content", "")
            body_str = body if isinstance(body, str) else str(body)
            if estimate_tokens(body_str) <= MAX_RESULT_TOKENS:
                new_content.append(block)
                continue
            # replace the body with a stub
            new_content.append({
                "type": "tool_result",
                "tool_use_id": block["tool_use_id"],
                "content": (
                    f"[pruned: {estimate_tokens(body_str)} tokens, "
                    f"{turns_ago} turns ago. Re-run the tool if needed.]"
                ),
            })
        pruned.append({"role": "user", "content": new_content})
    return pruned

def estimate_tokens(s: str) -> int:
    # rough heuristic; use count_tokens for accuracy
    return len(s) // 4

The agent learns to re-run a tool if it actually needs the data. That's the right behaviour. The alternative is dragging 50k tokens of a pytest log through every turn for the rest of the conversation.

Watch the threshold. If you prune too aggressively (say, after 3 turns), the agent re-runs the same tool in a loop, which is its own failure mode. 8–12 turns is a sensible floor for coding agents.
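The re-run loop is easy to detect from the trace itself. This sketch counts identical tool calls (same name, same input) in the most recent messages. The window and threshold defaults are arbitrary placeholders, and repeated_tool_calls is a hypothetical helper name.

```python
import json
from collections import Counter

def repeated_tool_calls(
    messages: list[dict], window: int = 12, threshold: int = 3
) -> list[tuple[str, dict, int]]:
    """Return (tool name, input, count) for calls repeated >= threshold
    times within the last `window` messages."""
    counts = Counter()
    for msg in messages[-window:]:
        if msg["role"] != "assistant" or not isinstance(msg.get("content"), list):
            continue
        for block in msg["content"]:
            if block.get("type") == "tool_use":
                # serialise the input with sorted keys so equal dicts compare equal
                key = (block["name"], json.dumps(block.get("input", {}), sort_keys=True))
                counts[key] += 1
    return [
        (name, json.loads(args), n)
        for (name, args), n in counts.items()
        if n >= threshold
    ]
```

If this fires right after you lowered PRUNE_AFTER_TURNS, the pruning threshold is the first suspect.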

Strategy 4: Selective recall

For agents that run for hours, no fixed window is enough. You keep the recent messages in context and stuff the older trace into a vector store. When the agent needs old information, it retrieves it.

from chromadb import PersistentClient

chroma = PersistentClient(path="./agent_memory")
collection = chroma.get_or_create_collection("trace_001")

def archive(messages_to_archive: list[dict], start_turn: int = 0) -> None:
    # store one chunk per message; start_turn keeps ids unique across calls
    for offset, msg in enumerate(messages_to_archive):
        turn = start_turn + offset
        text = format_message(msg)
        collection.add(
            documents=[text],
            metadatas=[{"turn": turn, "role": msg["role"]}],
            ids=[f"turn-{turn}"],
        )

def recall(query: str, k: int = 3) -> str:
    results = collection.query(query_texts=[query], n_results=k)
    return "\n---\n".join(results["documents"][0])

# expose recall as a tool to the agent itself
RECALL_TOOL = {
    "name": "recall_earlier_trace",
    "description": (
        "Search the archived portion of this conversation for relevant "
        "context. Use when you need information from earlier turns that "
        "are no longer in your direct context."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
        },
        "required": ["query"],
    },
}

The agent decides when to recall. That's the whole point: you stop guessing what's relevant on its behalf.
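On the loop side, the recall tool needs a handler that turns the model's tool_use block into a tool_result. A minimal sketch, assuming the recall function is passed in (so it can be the Chroma-backed one or a stub in tests). handle_recall is a hypothetical name.

```python
from typing import Callable

def handle_recall(block: dict, recall_fn: Callable[[str], str]) -> dict:
    """Build the tool_result for a recall_earlier_trace tool_use block."""
    query = block.get("input", {}).get("query", "")
    if not query:
        # report bad input to the model instead of raising in the loop
        return {
            "type": "tool_result",
            "tool_use_id": block["id"],
            "content": "Error: 'query' is required.",
            "is_error": True,
        }
    text = recall_fn(query) or "No archived turns matched."
    return {
        "type": "tool_result",
        "tool_use_id": block["id"],
        "content": text,
    }
```

Returning an explicit "nothing matched" string matters: an empty result tends to read as a tool failure, and the agent may retry the same query.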

Selective recall is the right answer for long-running agents (research, multi-day workflows, anything where "the task" runs for more than 50 turns). For a 20-turn coding session it's overkill, and the embedding latency adds up.

A hybrid that actually ships

Sliding windows alone are dangerous. Pure summarisation is expensive and lossy. Tool-result pruning is great but doesn't help once the reasoning gets long. Selective recall is heavy machinery.

What works for most production agents is two stages, in order, with a sliding window as the last resort:

1. Prune tool results older than 10 turns. Cheap, deterministic, no model call.
2. If still over budget, checkpoint everything except the original task message and the last 6 turns. One model call, pinned task, summarised middle.

def manage_context(messages: list[dict]) -> list[dict]:
    # step 1: always prune old tool results
    messages = prune_tool_results(messages)
    if count_tokens(messages) <= MAX_INPUT_TOKENS:
        return messages
    # step 2: checkpoint if still too large
    messages = checkpoint_pinned(messages, keep_recent=6)
    if count_tokens(messages) <= MAX_INPUT_TOKENS:
        return messages
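    # step 3: last resort, slide the recent window itself
    # assumes checkpoint_pinned ran, so messages[:4] is the pinned task,
    # its acknowledgement, the checkpoint, and "Continuing."
    head, recent = messages[:4], messages[4:]
    while count_tokens(head + recent) > MAX_INPUT_TOKENS and len(recent) > 2:
        # recent starts on an assistant turn, so dropping two messages removes
        # a tool_use together with the tool_result that answers it
        recent = recent[2:]
    return head + recent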

Call manage_context before every turn. The first stage is free of model calls and handles 80% of bloat. The checkpoint stage runs only when needed.

The gotcha: prompt cache invalidation

Here's the one nobody warns you about.

Anthropic, OpenAI, and Google all support some form of prompt caching: the provider matches the prefix of your request against recent ones, and if it has seen that prefix before, the cached portion is billed at a steep discount (about 10% of the normal input price for cache reads on Anthropic; the discount and the opt-in mechanics differ by provider). For long-running agents this is the difference between a $200/day API bill and a $40/day one.

Caching is prefix-based. The cache hit is invalidated the moment the prefix changes, even by one token.
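You can estimate how much of a request is even eligible for a cache hit by comparing it with the previous one. The sketch below works at message granularity, which is a simplification: providers match on the serialised prompt, not on Python dicts. shared_prefix_len is a hypothetical helper name.

```python
def shared_prefix_len(previous: list[dict], current: list[dict]) -> int:
    """Number of leading messages that are identical in both requests."""
    n = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        n += 1
    return n
```

Appending a turn leaves the whole previous request as a shared prefix. Rewriting message 1 drops the shared prefix to a single message, no matter how long the rest of the conversation is.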

Now think about what summarisation checkpoints do. They rewrite the start of your message array. Every checkpoint invalidates the prompt cache for everything that follows it. The next request pays full input cost for the whole prompt, and so does every turn after it unless a cache breakpoint covers the new prefix.

# BAD: every checkpoint kills the cache
messages = [
    pinned_original_task,
    {"role": "user", "content": f"[CHECKPOINT]\n{NEW_SUMMARY}"},  # changes!
    *recent,
]

# BETTER: keep the checkpoint stable for as long as you can,
# and put the cache breakpoint on the checkpoint message itself
messages = [
    pinned_original_task,
    {"role": "user", "content": [{
        "type": "text",
        "text": f"[CHECKPOINT v{checkpoint_id}]\n{summary}",
        "cache_control": {"type": "ephemeral"},
    }]},
    {"role": "assistant", "content": "Continuing."},
    *recent,
]

Two things help. First, only re-checkpoint when you have to, not on a fixed cadence. If you checkpoint every 6 turns "to be safe," you pay the cache penalty on every one of those rewrites, whether or not you were anywhere near the limit. Checkpoint when count_tokens(messages) > MAX_INPUT_TOKENS * 0.85 and not before.

Second, mark the checkpoint message itself with cache_control: ephemeral (Anthropic) so the prefix up to and including the checkpoint stays cached until the next rewrite. Your recent turns no longer share a cache prefix with the previous version of the conversation, but at least the system prompt + tool schemas + checkpoint are reused on every following turn.
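The threshold rule from the first point fits in a few lines. This sketch takes the counting and checkpointing functions as arguments so it can wrap count_tokens and checkpoint_pinned from earlier, or stubs in a test. maybe_checkpoint is a hypothetical name, and 0.85 is the ratio suggested above.

```python
from typing import Callable

def maybe_checkpoint(
    messages: list,
    count_fn: Callable[[list], int],
    checkpoint_fn: Callable[[list], list],
    max_input_tokens: int = 180_000,
    ratio: float = 0.85,
) -> tuple[list, bool]:
    """Checkpoint only when the prompt is close to the budget.

    Returns the (possibly rewritten) messages and whether a rewrite happened,
    so the caller can log how often the cache prefix was invalidated."""
    if count_fn(messages) <= max_input_tokens * ratio:
        return messages, False
    return checkpoint_fn(messages), True
```

Logging the boolean alongside the turn number gives you the checkpoint cadence for free, which is the number to compare against your cache-hit rate.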

The cost difference is real. A 100-turn coding agent with naive checkpointing every 10 turns can burn through ~$8 in input tokens. The same agent with threshold-triggered checkpointing and ephemeral markers lands closer to ~$1.40. Same task, same model, same final output.
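The arithmetic behind a comparison like this is simple enough to run on your own numbers. The sketch below is a cost model, not a reproduction of the figures above. The 0.10 read multiplier matches the ~10% mentioned earlier, the 1.25 write multiplier is an assumption based on Anthropic's pricing for short-lived cache writes, and the price per million tokens is whatever your model charges.

```python
def input_cost(
    uncached_tokens: int,
    cache_read_tokens: int,
    cache_write_tokens: int,
    price_per_mtok: float,
    read_mult: float = 0.10,   # assumed discount for cache reads
    write_mult: float = 1.25,  # assumed premium for cache writes
) -> float:
    """Input-side cost in the same currency as price_per_mtok."""
    weighted = (
        uncached_tokens
        + cache_read_tokens * read_mult
        + cache_write_tokens * write_mult
    )
    return weighted * price_per_mtok / 1_000_000
```

Feed it the per-turn token counts from a real trace under each strategy. The gap comes almost entirely from how many tokens move from the uncached column to the cache-read column.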

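Tool-result pruning, by the way, is much gentler on the cache. A prune only changes the prefix from the first stubbed tool_result onward, so the system prompt, tool schemas, and pinned task stay cached. It still invalidates whatever followed the stubbed message, so prune in batches rather than one result per turn. That's another reason it's the cheapest win.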

Pick the right tool for the workload

Chat assistant (one task per conversation): sliding window is fine.
Coding agent (20–50 turns, lots of tool output): tool-result pruning + threshold-triggered checkpointing.
Research agent (hours, hundreds of turns): selective recall as a tool, with pruning + checkpointing as the safety net.

The agent loop you ship is going to outlive your first guess about context management. Build the four strategies as composable functions you can swap, measure cache-hit rate and task-completion rate per strategy, and let the data pick.
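Cache-hit rate can be computed from the usage data the API already returns. This sketch assumes Anthropic-style usage fields (input_tokens, cache_creation_input_tokens, cache_read_input_tokens) collected as one dict per request. cache_hit_rate is a hypothetical helper name.

```python
def cache_hit_rate(usages: list[dict]) -> float:
    """Share of all input tokens that were served from the prompt cache."""
    read = 0
    total = 0
    for u in usages:
        r = u.get("cache_read_input_tokens") or 0
        w = u.get("cache_creation_input_tokens") or 0
        fresh = u.get("input_tokens") or 0
        read += r
        # total input is the sum of cached reads, cache writes, and uncached tokens
        total += r + w + fresh
    return read / total if total else 0.0
```

Track it per strategy next to task-completion rate. A strategy that completes the same tasks at a visibly lower hit rate is costing you money for nothing.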

Which strategy bit you first in production, and what did you switch to? Drop the war story in the comments.

If this was useful

This piece pulls from the context-management chapter of the AI Agents Pocket Guide: Patterns for Building Autonomous Systems with LLMs, which walks through the same four strategies with the failure traces that go with each one, plus the eval setup to compare them on your own workload. If you're shipping anything that runs more than 20 turns, the chapter on hybrid context management is the one to read first.