Context-Window Eviction: What to Drop When the Agent Fills Up

#python #ai #agents #llm

Book: AI Agents Pocket Guide: Patterns for Building Autonomous Systems with LLMs
Me: xgabriel.com | GitHub

You give the agent a long task. Refactor a module, walk a directory, call a dozen tools along the way. Twenty turns in, the response comes back truncated, or the provider returns a context-length error, or worse: the model quietly forgets what you asked it to do and starts answering a question from turn three. The window filled up. Something had to give, and the thing that gave was the wrong thing.

This is the part of agent design nobody draws in the architecture diagram. The loop has a budget. The window has a ceiling. Every turn appends an assistant message and a pile of tool results, and the conversation only grows. At some point the running history plus the next prompt no longer fits. You either decide what to drop, or the provider decides for you, and the provider's decision is "reject the request."

So you need an eviction policy. The same problem a CPU cache has, with one extra rule: some messages are load-bearing and must never be evicted.

What "full" means before you hit the wall

Do not wait for the 400. Pick a soft ceiling below the model's hard limit and act on it. If the limit is 200K tokens, your manager should start evicting somewhere around 150K, leaving headroom for the next turn's tool results and the response.
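One way to arrive at a number like 150K is to subtract explicit reserves from the hard limit instead of picking a round figure. This is a minimal sketch; soft_ceiling and both reserve sizes are illustrative assumptions, not provider recommendations.

```python
def soft_ceiling(hard_limit: int, response_reserve: int, tool_reserve: int) -> int:
    """Token count at which eviction should start.

    All three inputs are assumptions you tune per model and workload.
    """
    ceiling = hard_limit - response_reserve - tool_reserve
    if ceiling <= 0:
        raise ValueError("reserves leave no room for history")
    return ceiling


# 200K hard limit, 10K held back for the reply, 40K for tool results.
print(soft_ceiling(200_000, 10_000, 40_000))  # 150000
```

Naming the reserves makes the tradeoff visible: an agent whose tools return large payloads wants a bigger tool reserve and therefore a lower ceiling.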

You need a token count to make any of this work. Use the provider's tokenizer where you can: OpenAI publishes the tiktoken library and Anthropic exposes a token-counting endpoint, and a count that is right to the token beats a guess. When you cannot reach a tokenizer, the rough heuristic len(text) // 4 is close enough for English prose to drive an eviction decision (code and non-English text tend to use more tokens per character, so leave extra headroom for them). Being approximately right early beats being exactly right after the request already failed.

def estimate_tokens(text: str) -> int:
    # Swap for the provider tokenizer in prod.
    return len(text) // 4


def message_tokens(message) -> int:
    return estimate_tokens(str(message["content"]))

Three policies, from blunt to careful

Oldest-first. Drop the front of the history until you fit. This is FIFO, and it is the cheapest thing that works. It maps to how a conversation ages: turn two matters less than turn nineteen once the task has moved on. The risk is that turn two held the only mention of a constraint the user cares about, and now it is gone.

Least-relevant-first. Score each message against the current goal and evict the lowest scorers. The score can be recency, an embedding similarity to the active task, or a cheap keyword overlap. This keeps the messages that still matter and drops the dead branches: the tool call that returned an error you already routed around, the search result you never used. It costs more to compute and it can misjudge relevance, but it keeps the window denser with useful material.
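The cheapest version of that score is keyword overlap with the goal. The sketch below assumes the same message-dict shape and len // 4 estimate used elsewhere in this post; keyword_overlap and evict_least_relevant are hypothetical names, and a real scorer would likely blend in recency or embeddings.

```python
import re


def estimate_tokens(text: str) -> int:
    # Same rough heuristic as the rest of the post.
    return len(text) // 4


def keyword_overlap(text: str, goal: str) -> float:
    """Fraction of the goal's distinct words that also appear in text."""
    goal_words = set(re.findall(r"\w+", goal.lower()))
    if not goal_words:
        return 0.0
    text_words = set(re.findall(r"\w+", text.lower()))
    return len(goal_words & text_words) / len(goal_words)


def evict_least_relevant(messages, goal, limit, pinned=frozenset()):
    """Drop the lowest-scoring unpinned messages until the total fits."""
    sizes = [estimate_tokens(str(m["content"])) for m in messages]
    total = sum(sizes)
    # Lowest score first; ties are broken oldest-first by index.
    candidates = sorted(
        (i for i in range(len(messages)) if i not in pinned),
        key=lambda i: (keyword_overlap(str(messages[i]["content"]), goal), i),
    )
    dropped = set()
    for i in candidates:
        if total <= limit:
            break
        dropped.add(i)
        total -= sizes[i]
    kept = [m for i, m in enumerate(messages) if i not in dropped]
    evicted = [m for i, m in enumerate(messages) if i in dropped]
    return kept, evicted
```

Surviving messages keep their original order, so the model still reads a chronological history with holes in it, not a reshuffled one.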

Summarize-then-drop. Before evicting a block of old turns, ask a cheap model to compress them into a few sentences, then replace the block with that summary. You keep the information and lose the token weight. A summary of ten tool-heavy turns might be 80 tokens instead of 8,000. The cost is an extra model call and the chance the summary flattens a detail you needed. This is the right policy for long agentic runs where early turns carry decisions, not just chatter.

Most production agents end up combining them. Summarize the oldest chunk, then fall back to oldest-first if you are still over after summarizing. Start with oldest-first because it is the one you can ship this afternoon, and add the others when traces tell you the blunt policy is dropping things it should not.

Pin the goal, always

Here is the rule that separates a context manager from a cache: the system prompt and the original task statement are not evictable. Ever.

The failure mode where the agent "forgets what it was doing" is almost always an eviction policy that treated the goal as just another old message. It was the oldest, so oldest-first dropped it. Now the model is steering off a summary of its own intermediate work with no anchor to the actual objective.

Mark those messages pinned and skip them in every eviction pass. The pinned set is the system message that defines the agent's role, the user's original request, and any hard constraint the user stated up front. Everything else is fair game. The pinned set should be small and stable; if you find yourself pinning half the history, the policy underneath it is doing nothing.
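A small helper can pick that default set for you. This sketch assumes messages are dicts with a "role" key and that the original task is the first user message; default_pins is a hypothetical name, and constraints stated later in the conversation still need pinning by hand.

```python
def default_pins(messages) -> set:
    """Indices of every system message plus the first user message."""
    pins = {i for i, m in enumerate(messages) if m["role"] == "system"}
    for i, m in enumerate(messages):
        if m["role"] == "user":
            pins.add(i)  # assumed to be the original task statement
            break
    return pins
```

Run it once, right after the conversation is built, so the pins describe the goal and not whatever happens to be oldest later on.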

A 40-line context manager

This wires the pieces together. It is provider-agnostic: it works on a list of message dicts, counts tokens through the function above, keeps pinned messages, and evicts oldest-first until the history fits under the soft ceiling.

from dataclasses import dataclass, field


@dataclass
class ContextManager:
    soft_limit: int = 150_000
    pinned: set = field(default_factory=set)

    def pin(self, index: int) -> None:
        self.pinned.add(index)

    def total(self, messages) -> int:
        return sum(message_tokens(m) for m in messages)

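    def fit(self, messages):
        # Count once, then subtract as we evict. Re-totalling `kept`
        # inside the loop would hit the None placeholders and raise.
        running = self.total(messages)
        if running <= self.soft_limit:
            return messages, []

        kept = list(messages)
        evicted = []
        # Walk evictable messages oldest-first.
        order = [
            i for i in range(len(messages))
            if i not in self.pinned
        ]
        for i in order:
            if running <= self.soft_limit:
                break
            evicted.append(messages[i])
            running -= message_tokens(messages[i])
            kept[i] = None  # placeholder keeps later indices stable

        kept = [m for m in kept if m is not None]
        return kept, evicted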

Call fit before every model call. It returns the trimmed message list plus whatever it dropped, so the caller can log the eviction or stash it somewhere durable. Pin index 0 (the system prompt) and the index of the original user task right after you build the conversation, and they survive every pass.

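The kept[i] = None then filter-out is doing real work: it lets you remove messages by their original position without shifting indices mid-loop, so the pinned set stays correct for the whole pass. Pinned indices refer to positions in the list you pass in, though, so a pinned message that sits behind evicted ones moves forward in the returned list. Pinning only the leading messages avoids this; otherwise, re-pin after each pass. Drop oldest evictable first, stop the moment you are under the ceiling, and keep the rest intact.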

Add summarization without rewriting the loop

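When oldest-first alone starts dropping turns that carried decisions, slot summarization in front of it. Pull the oldest evictable block, compress it, and splice the summary back as a single message in the block's place. Treat that message as pinned-ish: it is now the oldest evictable item, so pin it if it has to outlive the next oldest-first pass.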

def summarize_block(model_call, block) -> dict:
    text = "\n".join(str(m["content"]) for m in block)
    prompt = (
        "Summarize the key facts, decisions, and "
        "open threads in these agent turns. Be terse.\n\n"
        + text
    )
    summary = model_call(prompt)
    return {"role": "user", "content": f"[earlier turns] {summary}"}

Replace the block with one summary message, then run fit again. If you are still over, oldest-first finishes the job on what is left. The order matters: summarize the cheap-to-lose detail into prose first, evict raw turns second, and never touch the pinned goal.
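The replace step is where indices go wrong, because removing a block shifts every message after it. The sketch below is one way to do the splice while remapping an index-based pinned set; splice_summary is a hypothetical helper, and summarize is any callable that turns a list of messages into one message dict.

```python
def splice_summary(messages, pinned, block_size, summarize):
    """Replace the oldest unpinned block with one summary message.

    Returns (new_messages, new_pinned, summary_index). The block is the
    first `block_size` unpinned messages, which may straddle a pinned one;
    the summary lands where the first of them used to be.
    """
    block = [i for i in range(len(messages)) if i not in pinned][:block_size]
    if not block:
        return list(messages), set(pinned), None
    in_block = set(block)
    summary = summarize([messages[i] for i in block])

    new_messages, new_pinned, summary_index = [], set(), None
    for i, m in enumerate(messages):
        if i == block[0]:
            summary_index = len(new_messages)
            new_messages.append(summary)
        elif i in in_block:
            continue  # folded into the summary
        else:
            if i in pinned:
                new_pinned.add(len(new_messages))  # remap to new position
            new_messages.append(m)
    return new_messages, new_pinned, summary_index
```

Hand new_messages and new_pinned to the fit pass afterwards. Whether to add summary_index to the pinned set is a judgment call: pinning keeps the compressed history alive, leaving it unpinned lets oldest-first reclaim it under pressure.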

What to log

Every eviction is a small bet that you dropped the right thing. Make the bet visible. Send to your trace, per turn:

tokens_before and tokens_after the fit pass
count of evicted messages and their roles
whether a summarization ran, and the summary's token cost
the pinned set size

The signal you watch for is a turn where the model's output stops referencing the original task. Correlate that with an eviction that ran on the same turn and you have found a policy that is too aggressive, or a goal that was not pinned. Without the log you are guessing why a long run went sideways.
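A record with those fields can be built from values the fit pass already has in hand. This is a sketch under assumptions: the field names are hypothetical and should match whatever your tracing schema expects, and the token counts use the same len // 4 estimate as the rest of the post.

```python
from collections import Counter


def estimate_tokens(text: str) -> int:
    return len(text) // 4


def eviction_record(before, after, evicted, pinned_count, summary=None):
    """Build one trace entry for a single fit pass."""
    def count(msgs):
        return sum(estimate_tokens(str(m["content"])) for m in msgs)

    return {
        "tokens_before": count(before),
        "tokens_after": count(after),
        "evicted_count": len(evicted),
        "evicted_roles": dict(Counter(m["role"] for m in evicted)),
        "summarized": summary is not None,
        "summary_tokens": count([summary]) if summary is not None else 0,
        "pinned_count": pinned_count,
    }
```

Emit one record per turn even when nothing was evicted; a run of zero-eviction turns followed by a large one is itself a useful pattern in the trace.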

Where this stops

A context manager handles one conversation that grew too long. It does not handle a task that genuinely needs more state than any window holds. For that you reach for external memory: write the decisions and intermediate results to a store, and let the agent retrieve the slice it needs per turn instead of carrying all of it inline. Eviction buys you a longer run. Retrieval buys you a longer memory. Most real agents end up using both, with eviction as the fast path and a store behind it.
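To make the write-then-retrieve idea concrete, here is a deliberately small stand-in for external memory. NoteStore is hypothetical and keeps everything in a Python list with keyword matching; a real agent would more likely sit on a database or a vector index.

```python
import re
from dataclasses import dataclass, field


@dataclass
class NoteStore:
    """In-memory stand-in for an external memory store."""
    notes: list = field(default_factory=list)

    def write(self, text: str) -> None:
        self.notes.append(text)

    def retrieve(self, query: str, k: int = 3) -> list:
        """Return up to k notes sharing the most words with the query."""
        q = set(re.findall(r"\w+", query.lower()))
        scored = [
            (len(q & set(re.findall(r"\w+", n.lower()))), i, n)
            for i, n in enumerate(self.notes)
        ]
        hits = [s for s in scored if s[0] > 0]
        hits.sort(key=lambda s: (-s[0], -s[1]))  # best match, then newest
        return [n for _, _, n in hits[:k]]
```

The agent writes a note when it makes a decision and calls retrieve with the current subtask before each model call, so only the relevant slice spends window tokens.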

Start with oldest-first and a pinned goal. Ship it before your first context-length error, not after. Then read your traces and add summarization to the spots where the blunt policy is dropping things it should not.

If this was useful

The AI Agents Pocket Guide covers context management alongside the other moving parts of an agent loop: bounded iteration, tool design, memory, and the recovery patterns each one needs. The chapter on managing long-running context pairs directly with the manager in this post.