Mukunda Rao Katta

Posted on May 25

agent-message-window: Slide Your Context Window Without Dropping Tool Calls

#hermeschallenge #ai #python #agents

The Error That Should Not Exist

anthropic.BadRequestError: messages: tool_use block with id 'tu_abc123' 
must be followed by a tool_result block

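The agent had been running for 40 minutes. It had used 12 tools. The context window was getting full, so a naive sliding window dropped the oldest messages. One of those messages was an assistant message containing a tool_use block. Its paired tool_result was newer and stayed, so the history now held an orphaned tool_result with no matching tool_use. Trim the other way and you get an orphaned tool_use instead, which is what the error above complains about. The API rejects either half of a broken pair.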

Anthropic's API rejected the entire request. The agent crashed.

This is not a corner case. Any long-running agent that uses tools and needs to trim history will hit this. The pairing constraint is strict: every tool_use block in an assistant message must be answered by a tool_result block that references its ID, in the user message that immediately follows. You cannot drop one without dropping the other.
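To make the failure concrete, here is a minimal sketch of a naive trim breaking a pair. The history and the tool_ids helper are made up for illustration and are not part of the library.

```python
# Hypothetical four-message history with one tool_use / tool_result pair.
history = [
    {"role": "user", "content": "Find the file"},
    {"role": "assistant", "content": [
        {"type": "tool_use", "id": "tu_001", "name": "search",
         "input": {"q": "file"}},
    ]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "tu_001", "content": "found it"},
    ]},
    {"role": "assistant", "content": "Here it is."},
]


def tool_ids(messages, block_type, key):
    """Collect one ID field from every content block of the given type."""
    return {
        block[key]
        for message in messages
        if isinstance(message["content"], list)  # plain strings carry no blocks
        for block in message["content"]
        if block.get("type") == block_type
    }


# Naive sliding window: keep the newest two messages, whatever they contain.
trimmed = history[-2:]

orphaned_results = (
    tool_ids(trimmed, "tool_result", "tool_use_id")
    - tool_ids(trimmed, "tool_use", "id")
)
print(orphaned_results)  # {'tu_001'}
```

The slice kept the tool_result and threw away the tool_use it answers. Sending trimmed to the API is what triggers the rejection.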

agent-message-window is a Python library that maintains a sliding window of messages with this constraint built in. When the window needs to drop messages to stay under the limit, it always drops tool_use and tool_result pairs together. You never get a half-pair.

Shape of the Fix

from agent_message_window import MessageWindow

window = MessageWindow(max_messages=20)

# Add messages as the conversation grows:
window.push({"role": "user", "content": "Search for recent papers on diffusion models"})

window.push({
    "role": "assistant",
    "content": [
        {"type": "text", "text": "I will search for that."},
        {"type": "tool_use", "id": "tu_001", "name": "web_search",
         "input": {"query": "diffusion models 2025"}},
    ]
})

window.push({
    "role": "user",
    "content": [
        {"type": "tool_result", "tool_use_id": "tu_001",
         "content": "Found 8 papers..."},
    ]
})

# Continue the conversation...
window.push({"role": "assistant", "content": "Here is what I found..."})
window.push({"role": "user", "content": "Now summarize the top 3"})

# Get a context-safe slice for the next LLM call:
messages = window.get()
# Always a valid Anthropic message list. No orphaned tool blocks.

The get() method returns a list you can pass directly to client.messages.create(messages=...).

When the window is at capacity and a new message arrives, the oldest messages are dropped in safe units. A plain user or assistant text message drops alone. A tool_use plus its tool_result drops together as a unit.
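One way to picture "safe units" is to group a history into units first and then drop whole units from the front. The sketch below does that eagerly on a plain list. It assumes the tool_result sits in the message right after its tool_use. The names has_block, group_into_units, and trim_to are hypothetical and are not the library's API.

```python
def has_block(message, block_type):
    """True if the message content holds at least one block of this type."""
    content = message["content"]
    if isinstance(content, str):
        return False
    return any(block.get("type") == block_type for block in content)


def group_into_units(messages):
    """Pair each tool_use message with the tool_result message after it.

    Every other message becomes a unit of its own.
    """
    units = []
    i = 0
    while i < len(messages):
        current = messages[i]
        following = messages[i + 1] if i + 1 < len(messages) else None
        if (
            has_block(current, "tool_use")
            and following is not None
            and has_block(following, "tool_result")
        ):
            units.append([current, following])
            i += 2
        else:
            units.append([current])
            i += 1
    return units


def trim_to(messages, max_messages):
    """Drop whole units from the front until the message count fits."""
    units = group_into_units(messages)
    total = len(messages)
    while units and total > max_messages:
        total -= len(units.pop(0))
    return [message for unit in units for message in unit]


history = [
    {"role": "user", "content": "Search for recent papers"},
    {"role": "assistant", "content": [
        {"type": "tool_use", "id": "tu_001", "name": "web_search",
         "input": {"query": "diffusion models"}},
    ]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "tu_001",
         "content": "Found 8 papers..."},
    ]},
    {"role": "assistant", "content": "Here is what I found..."},
    {"role": "user", "content": "Now summarize the top 3"},
]

kept = trim_to(history, max_messages=3)
print([m["role"] for m in kept])  # ['assistant', 'user']
```

Note that the result has two messages, not three. The pair cannot be split, so the trim overshoots rather than leaving half of it behind.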

What It Does NOT Do

It does not count tokens. max_messages is a count, not a token budget. If you need token-budget-based trimming, combine this with a token counter: call window.get(), measure the token count, and drop further messages yourself if needed, taking care to remove tool_use and tool_result pairs together.

It does not summarize dropped messages. Old context is simply removed. No compression, no summary injection. If you want to inject a "summary of earlier conversation" message, do that in your agent loop before calling window.get().

It does not track dependencies between tool calls, where one tool_result feeds into a later tool_use. Multiple tool_use blocks in a single assistant message are kept or dropped as a unit. Chains that span several messages are covered only by the pairing constraint, pair by pair.

It does not validate that tool IDs in tool_result blocks actually match a prior tool_use. It trusts that your agent built the messages correctly. The library's job is to trim safely, not to audit correctness.
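If you want that audit, it is small enough to run yourself before each model call. The sketch below is an assumption about what such a check could look like. find_pairing_problems is a hypothetical helper, not something the library ships.

```python
def find_pairing_problems(messages):
    """Describe every tool block whose partner is not in the adjacent message.

    A tool_use must be answered in the next message. A tool_result must
    answer a tool_use in the previous message.
    """
    def ids(message, block_type, key):
        if message is None or isinstance(message["content"], str):
            return set()
        return {
            block[key]
            for block in message["content"]
            if block.get("type") == block_type
        }

    problems = []
    for i, message in enumerate(messages):
        previous = messages[i - 1] if i > 0 else None
        following = messages[i + 1] if i + 1 < len(messages) else None

        unanswered = ids(message, "tool_use", "id") - ids(
            following, "tool_result", "tool_use_id")
        for tool_id in sorted(unanswered):
            problems.append(
                f"message {i}: tool_use {tool_id} has no tool_result "
                "in the next message")

        unmatched = ids(message, "tool_result", "tool_use_id") - ids(
            previous, "tool_use", "id")
        for tool_id in sorted(unmatched):
            problems.append(
                f"message {i}: tool_result {tool_id} has no tool_use "
                "in the previous message")
    return problems


orphaned = [
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "tu_001",
         "content": "Found 8 papers..."},
    ]},
    {"role": "assistant", "content": "Here is what I found..."},
]
print(find_pairing_problems(orphaned))
# ['message 0: tool_result tu_001 has no tool_use in the previous message']
```

A history that ends on a pending tool_use is reported too, so run the check after the tool_result has been appended.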

Inside the Lib

The window stores messages in a deque and enforces the maximum length itself in push(), rather than relying on the deque's maxlen. The pairing map is maintained as messages arrive.

from collections import deque

class MessageWindow:
    def __init__(self, max_messages=20):
        self._max = max_messages
        self._messages = deque()
        self._pairs = {}  # tool_use_id -> pairing state; set to "dropped" once its tool_use is evicted

    def push(self, message):
        self._index_pairs(message)
        self._messages.append(message)
        while len(self._messages) > self._max:
            self._drop_oldest()

The _index_pairs step scans each new message for tool_use content blocks and records their IDs. When a tool_result arrives, the map is updated with the pairing.
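The indexing step is not shown in the excerpt, so here is a rough sketch of what it might look like in isolation. The PairIndex class and the "open" and "paired" state names are assumptions made for illustration. The library's real map may store something different.

```python
class PairIndex:
    """Tracks each tool_use ID: "open" until its result arrives, then "paired"."""

    def __init__(self):
        self.state = {}  # tool_use_id -> "open" or "paired"

    def index(self, message):
        content = message.get("content")
        if not isinstance(content, list):
            return  # plain-text message, nothing to index
        for block in content:
            if block.get("type") == "tool_use":
                self.state[block["id"]] = "open"
            elif block.get("type") == "tool_result":
                tool_use_id = block["tool_use_id"]
                # Only pair results whose tool_use was seen earlier.
                if tool_use_id in self.state:
                    self.state[tool_use_id] = "paired"


pairs = PairIndex()
pairs.index({"role": "assistant", "content": [
    {"type": "tool_use", "id": "tu_001", "name": "web_search",
     "input": {"query": "diffusion models 2025"}},
]})
print(pairs.state)  # {'tu_001': 'open'}

pairs.index({"role": "user", "content": [
    {"type": "tool_result", "tool_use_id": "tu_001",
     "content": "Found 8 papers..."},
]})
print(pairs.state)  # {'tu_001': 'paired'}
```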

The _drop_oldest step pops the front of the deque. If the popped message contains tool_use blocks, their IDs are marked as dropped in the pairing map. On the next get() call, any message whose tool_result references a dropped ID is excluded as well.

# Methods of MessageWindow, shown here without the class indentation.

def _drop_oldest(self):
    oldest = self._messages.popleft()
    # Remember every tool_use ID that just left the window.
    for tool_use_id in self._extract_tool_use_ids(oldest):
        self._pairs[tool_use_id] = "dropped"

def get(self):
    result = []
    for msg in self._messages:
        tool_result_ids = self._extract_tool_result_ids(msg)
        if any(self._pairs.get(tid) == "dropped" for tid in tool_result_ids):
            continue  # skip orphaned tool_result
        result.append(msg)
    return result

The design avoids modifying messages in place. The deque holds the original message dicts. The pairing map is the only mutable state. This makes it straightforward to reason about what get() will return.
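To watch the two excerpts work together, here is a self-contained sketch that fills in the parts the excerpts leave out. It is not the library's source. MiniWindow and the _ids helper are hypothetical, and a plain set of dropped IDs stands in for the pairing map.

```python
from collections import deque


def _ids(message, block_type, key):
    """Return one ID field from every content block of the given type."""
    content = message.get("content")
    if not isinstance(content, list):
        return []
    return [block[key] for block in content if block.get("type") == block_type]


class MiniWindow:
    def __init__(self, max_messages=20):
        self._max = max_messages
        self._messages = deque()
        self._dropped = set()  # IDs of tool_use blocks already evicted

    def push(self, message):
        self._messages.append(message)
        while len(self._messages) > self._max:
            oldest = self._messages.popleft()
            self._dropped.update(_ids(oldest, "tool_use", "id"))

    def get(self):
        # Stored messages are never mutated; orphans are filtered on read.
        return [
            message for message in self._messages
            if not any(tool_id in self._dropped
                       for tool_id in _ids(message, "tool_result", "tool_use_id"))
        ]


window = MiniWindow(max_messages=3)
window.push({"role": "user", "content": "Search for recent papers"})
window.push({"role": "assistant", "content": [
    {"type": "tool_use", "id": "tu_001", "name": "web_search",
     "input": {"query": "diffusion models"}},
]})
window.push({"role": "user", "content": [
    {"type": "tool_result", "tool_use_id": "tu_001",
     "content": "Found 8 papers..."},
]})
window.push({"role": "assistant", "content": "Here is what I found..."})
window.push({"role": "user", "content": "Now summarize the top 3"})

print([m["role"] for m in window.get()])  # ['assistant', 'user']
```

In this sketch the orphaned tool_result still sits in the deque and takes up one of the three slots until it ages out, so get() returns two messages. That is a consequence of filtering on read instead of evicting the pair at once.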

The full library is under 200 lines. No external dependencies. Type annotations throughout. Python 3.10 and above.

When Useful / When Not

Useful for any long-running agent that uses tools and does not have unlimited context. The Anthropic API's tool call pairing constraint is strict, so anything that trims history without respecting it will eventually crash.

Useful when you want a simple drop-in solution that does not require rethinking your agent loop. Push messages as they arrive. Call get() before each model call. Done.

Not useful if your agent never needs to trim history. If the total conversation fits in the context window, this is unnecessary overhead. Not useful if you need token-based trimming instead of message-count-based trimming. Not useful if you need sophisticated context management like summarization or memory retrieval. This library only does the sliding window part.

The critical use case is unattended agents running multi-step tasks over periods long enough that the raw message count would exceed the window. Research agents, code review agents, data processing pipelines. Any agent that runs more than a few tool calls before completing.

Install

pip install agent-message-window

The PyPI release is still pending, so the command above will not work yet. Until it is published, install from a GitHub clone:

git clone https://github.com/MukundaKatta/agent-message-window
cd agent-message-window
pip install -e .

No runtime dependencies. Python 3.10 and above. Run the tests with:

pytest tests/

The test suite has 18 tests. Key scenarios covered: dropping plain messages, dropping tool pairs together, multi-tool calls in a single assistant message, window at exact capacity, and the get() view after drops.

Siblings

| Library | What it does | Language |
| --- | --- | --- |
| `agentfit` | Tool argument schema validation | Python |
| `agentfit-rs` | Tool argument schema validation | Rust |
| `llm-stop-conditions` | Composable agent loop stop conditions | Python |
| `agent-deadline` | Cooperative per-task deadline | Python |
| `prompt-token-counter` | Approximate token count for messages | Python |
| `llm-context-rotate` | Stateful rolling chat history | Python |
| `agent-message-sanitize` | Sanitize LLM message lists | Python |

The closest sibling in spirit is llm-context-rotate. That library manages rolling history with more configuration options. This library is narrower: it only solves the tool pair constraint problem. If you need full context management, llm-context-rotate is more capable. If you just need safe sliding window trimming for a tool-using agent, this is the smaller, more focused choice.

prompt-token-counter pairs well with this library. Use the token counter to decide when to trim, then use the message window to do the trim safely.
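That combination might look roughly like the sketch below. It works on a plain message list and calls neither library. approx_tokens is a crude stand-in for a real counter (about four characters per token, an assumption), and trim_to_budget is a hypothetical helper.

```python
import json


def approx_tokens(messages):
    """Crude stand-in for a real counter: about four characters per token."""
    return sum(len(json.dumps(message)) for message in messages) // 4


def _ids(message, block_type, key):
    content = message.get("content")
    if not isinstance(content, list):
        return set()
    return {block[key] for block in content if block.get("type") == block_type}


def trim_to_budget(messages, max_tokens, count=approx_tokens):
    """Drop the oldest messages until count(messages) fits the budget.

    When a dropped message holds a tool_use, the tool_result message
    right after it is dropped in the same step.
    """
    kept = list(messages)
    while kept and count(kept) > max_tokens:
        dropped = kept.pop(0)
        dropped_ids = _ids(dropped, "tool_use", "id")
        if kept and dropped_ids & _ids(kept[0], "tool_result", "tool_use_id"):
            kept.pop(0)
    return kept


history = [
    {"role": "user", "content": "Search for recent papers"},
    {"role": "assistant", "content": [
        {"type": "tool_use", "id": "tu_001", "name": "web_search",
         "input": {"query": "diffusion models"}},
    ]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "tu_001",
         "content": "Found 8 papers..."},
    ]},
    {"role": "assistant", "content": "Here is what I found..."},
    {"role": "user", "content": "Now summarize the top 3"},
]

# The budget is arbitrary; swap in your own counter through the count argument.
trimmed = trim_to_budget(history, max_tokens=40)
print(len(history), "->", len(trimmed), "messages")
```

The counter decides when to stop. The pairing rule decides what goes in each step.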

What Is Next

v0.2.0 targets:

Token-budget mode. Instead of max_messages, accept max_tokens and a counting function. The window drops messages until the token count is under budget.
Summary injection. An optional callback that receives the dropped messages and returns a summary string. The window injects a system or user message at position 0 containing the summary.
Async push. A non-blocking variant for agents that build messages on async streams and need to push without blocking.
Persistence. A method to serialize and restore the window state, so long-running agents can checkpoint and resume.

The pairing constraint is a real operational hazard for tool-using agents. Most teams discover it after a production crash. Having a tested, drop-in fix removes the hazard before it happens.

Pull requests welcome at MukundaKatta/agent-message-window. Part of the Hermes Agent Challenge sprint.
