Mukunda Rao Katta

Sliding Message Window for Agent Loops: Trim Context Without Splitting Tool Pairs

Your agent loop runs long. You are trimming the message list to stay inside the context window. You drop the oldest messages. You send the request. The API returns an error: tool_use block in turn 5 has no matching tool_result block.

You split a tool_use/tool_result pair. The model sent a tool call. You kept the result but dropped the call, or kept the call but dropped the result. Either way, the message list is invalid and the API rejects it.
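
To make the failure concrete, here is a minimal sketch of a checker that reports orphaned blocks after a naive trim. The helper name find_split_pairs and the sample history are illustrative assumptions, not part of agent-message-window.

```python
def find_split_pairs(messages: list[dict]) -> tuple[set[str], set[str]]:
    """Return (tool_use ids with no result, tool_result ids with no call)."""
    use_ids: set[str] = set()
    result_ids: set[str] = set()
    for msg in messages:
        content = msg.get("content")
        if not isinstance(content, list):
            continue  # plain string content carries no tool blocks
        for block in content:
            if not isinstance(block, dict):
                continue
            if msg["role"] == "assistant" and block.get("type") == "tool_use":
                use_ids.add(block["id"])
            elif msg["role"] == "user" and block.get("type") == "tool_result":
                result_ids.add(block["tool_use_id"])
    return use_ids - result_ids, result_ids - use_ids


history = [
    {"role": "user", "content": "What is the weather in Paris?"},
    {"role": "assistant", "content": [
        {"type": "tool_use", "id": "toolu_01", "name": "get_weather",
         "input": {"city": "Paris"}},
    ]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "toolu_01", "content": "18C, cloudy"},
    ]},
    {"role": "assistant", "content": "It is 18C and cloudy in Paris."},
]

# Naive trimming: keep only the last two messages
naive = history[-2:]
orphan_calls, orphan_results = find_split_pairs(naive)
print(orphan_calls, orphan_results)  # set() {'toolu_01'}
```

The naive slice keeps the tool_result but drops the tool_use that produced it, which is the kind of list the API rejects.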

agent-message-window is a sliding window that trims messages while preserving tool_use/tool_result pairs.


The Shape of the Fix

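import anthropic
from agent_message_window import MessageWindow

anthropic_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
system_prompt = "You are a helpful assistant."  # placeholder

window = MessageWindow(max_messages=20, preserve_system=True)

messages = []  # grows with each turn; must be non-empty before the first call

# After each LLM turn, trim before the next call
messages = window.trim(messages)

response = anthropic_client.messages.create(
    model="claude-sonnet-4-6",
    system=system_prompt,
    messages=messages,
    max_tokens=1024,
)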

trim() drops the oldest messages until the list fits within max_messages. A tool_use message and its paired tool_result are always dropped together or kept together, so the result never contains half of a pair.


What It Does NOT Do

agent-message-window does not count tokens. It trims by message count, not by token count. If your model has a 200K token context window and your messages are very long, message count is only a loose proxy for context usage. For token-based limits, use prompt-token-counter to estimate tokens and gate trimming on that estimate instead of message count.

It does not summarize dropped messages. Messages dropped to enforce the window are gone. For summarization-based context management, make a separate LLM call to summarize the old messages before they are dropped.

It does not handle all possible Anthropic message formats. It handles the standard role: assistant with content: [{type: tool_use}] and role: user with content: [{type: tool_result}] pattern. Non-standard message structures (interleaved tool calls, batch tool results) may not be detected as pairs.


Inside the Library

The pair detection walks the message list to find tool_use IDs and their corresponding tool_result IDs:

def _find_pairs(self, messages: list[dict]) -> set[frozenset[int]]:
    """Return a set of frozenset({tool_use_index, tool_result_index}) pairs."""
    # Find all tool_use blocks and their IDs
    tool_use_idx: dict[str, int] = {}  # tool_use_id -> message index
    for i, msg in enumerate(messages):
        if msg["role"] == "assistant":
            for block in msg.get("content", []):
                if isinstance(block, dict) and block.get("type") == "tool_use":
                    tool_use_idx[block["id"]] = i

    # Match with tool_result blocks
    pairs = set()
    for i, msg in enumerate(messages):
        if msg["role"] == "user":
            for block in msg.get("content", []):
                if isinstance(block, dict) and block.get("type") == "tool_result":
                    tool_id = block.get("tool_use_id")
                    if tool_id in tool_use_idx:
                        pairs.add(frozenset([tool_use_idx[tool_id], i]))

    return pairs
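
As a worked example of the pair detection, the sketch below restates the same walk as a free function (find_pairs, a name chosen here for illustration) and runs it on a made-up history that includes two tool calls in a single assistant message.

```python
def find_pairs(messages: list[dict]) -> set[frozenset[int]]:
    """Standalone restatement of the pair detection, for illustration only."""
    tool_use_idx: dict[str, int] = {}  # tool_use id -> message index
    for i, msg in enumerate(messages):
        if msg["role"] == "assistant":
            for block in msg.get("content", []):
                if isinstance(block, dict) and block.get("type") == "tool_use":
                    tool_use_idx[block["id"]] = i

    pairs: set[frozenset[int]] = set()
    for i, msg in enumerate(messages):
        if msg["role"] == "user":
            for block in msg.get("content", []):
                if isinstance(block, dict) and block.get("type") == "tool_result":
                    tool_id = block.get("tool_use_id")
                    if tool_id in tool_use_idx:
                        pairs.add(frozenset([tool_use_idx[tool_id], i]))
    return pairs


def call(tool_id: str) -> dict:
    return {"type": "tool_use", "id": tool_id, "name": "lookup", "input": {}}


def result(tool_id: str) -> dict:
    return {"type": "tool_result", "tool_use_id": tool_id, "content": "ok"}


history = [
    {"role": "user", "content": "Compare Paris and Rome."},        # 0
    {"role": "assistant", "content": [call("a"), call("b")]},      # 1
    {"role": "user", "content": [result("a"), result("b")]},       # 2
    {"role": "assistant", "content": "Both are mild today."},      # 3
    {"role": "user", "content": "And Oslo?"},                      # 4
    {"role": "assistant", "content": [call("c")]},                 # 5
    {"role": "user", "content": [result("c")]},                    # 6
]

print(sorted(tuple(sorted(p)) for p in find_pairs(history)))  # [(1, 2), (5, 6)]
```

Because pairs are stored as message indices, the two calls in message 1 and their two results in message 2 collapse into a single pair.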

The trim logic:

def trim(self, messages: list[dict]) -> list[dict]:
    if len(messages) <= self._max_messages:
        return messages

    pairs = self._find_pairs(messages)

    # Map each paired message index to the indices it must be dropped with
    partners: dict[int, set[int]] = {}
    for pair in pairs:
        for i in pair:
            partners.setdefault(i, set()).update(pair - {i})

    # Find how many we need to drop
    to_drop = len(messages) - self._max_messages
    drop_indices: set[int] = set()

    # Walk from the oldest message forward
    for i in range(len(messages)):
        if len(drop_indices) >= to_drop:
            break
        if i in drop_indices:
            continue  # already dropped together with an earlier partner
        # Drop the message and its partners together, never one without the other.
        # This can overshoot to_drop by a message rather than split a pair.
        drop_indices.add(i)
        drop_indices.update(partners.get(i, ()))

    return [msg for i, msg in enumerate(messages) if i not in drop_indices]
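
To see pair-preserving trimming next to naive slicing, here is a self-contained sketch. It is a standalone approximation written for this example (the function trim_preserving_pairs and the sample history are assumptions), not the library's code, and it ignores options such as preserve_system.

```python
def tool_blocks(msg: dict, block_type: str) -> list[dict]:
    """Return the dict blocks of the given type, or [] for string content."""
    content = msg.get("content")
    if not isinstance(content, list):
        return []
    return [b for b in content if isinstance(b, dict) and b.get("type") == block_type]


def trim_preserving_pairs(messages: list[dict], max_messages: int) -> list[dict]:
    if len(messages) <= max_messages:
        return messages

    # tool_use id -> index of the assistant message that issued it
    call_index: dict[str, int] = {}
    for i, msg in enumerate(messages):
        if msg["role"] == "assistant":
            for block in tool_blocks(msg, "tool_use"):
                call_index[block["id"]] = i

    # message index -> indices that must be dropped along with it
    partners: dict[int, set[int]] = {}
    for i, msg in enumerate(messages):
        if msg["role"] == "user":
            for block in tool_blocks(msg, "tool_result"):
                j = call_index.get(block.get("tool_use_id"))
                if j is not None:
                    partners.setdefault(i, set()).add(j)
                    partners.setdefault(j, set()).add(i)

    to_drop = len(messages) - max_messages
    drop: set[int] = set()
    for i in range(len(messages)):  # oldest first
        if len(drop) >= to_drop:
            break
        if i not in drop:
            drop.add(i)
            drop |= partners.get(i, set())
    return [m for i, m in enumerate(messages) if i not in drop]


def call(tool_id: str) -> dict:
    return {"type": "tool_use", "id": tool_id, "name": "lookup", "input": {}}


def result(tool_id: str) -> dict:
    return {"type": "tool_result", "tool_use_id": tool_id, "content": "ok"}


history = [
    {"role": "user", "content": "Check Paris."},             # 0
    {"role": "assistant", "content": [call("t1")]},          # 1
    {"role": "user", "content": [result("t1")]},             # 2
    {"role": "assistant", "content": "Paris is cloudy."},    # 3
    {"role": "user", "content": "Now check Oslo."},          # 4
    {"role": "assistant", "content": [call("t2")]},          # 5
    {"role": "user", "content": [result("t2")]},             # 6
    {"role": "assistant", "content": "Oslo is clear."},      # 7
]

naive = [history.index(m) for m in history[-6:]]
kept = [history.index(m) for m in trim_preserving_pairs(history, 6)]
print(naive)  # [2, 3, 4, 5, 6, 7]  (message 2 is an orphaned tool_result)
print(kept)   # [3, 4, 5, 6, 7]     (the t1 pair was dropped as a unit)
```

The sketch returns five messages for a limit of six because it drops the whole t1 pair rather than splitting it. The trimmed list here begins with an assistant message; role ordering is a separate concern that this sketch does not address.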

When to Use It

Use it in any agent loop with a bounded message history. Long conversations, multi-step tasks, and loops that run for dozens of turns can eventually hit context limits. The window keeps trimming from causing API errors through split pairs.

Use it alongside token-based limits. Check token count with prompt-token-counter. If the estimate is over 80% of the model's context limit, call window.trim() to reduce message count. The two approaches are complementary.
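
The 80% gate described above could look roughly like this. The estimator is a crude characters-divided-by-four stand-in, used here because this sketch does not assume anything about the prompt-token-counter API; maybe_trim and its parameters are hypothetical names.

```python
import json
from typing import Callable


def estimate_tokens(messages: list[dict]) -> int:
    """Rough stand-in estimator: about 4 characters per token (an assumption)."""
    return len(json.dumps(messages)) // 4


def maybe_trim(
    messages: list[dict],
    trim_fn: Callable[[list[dict]], list[dict]],
    context_limit: int,
    threshold: float = 0.8,
    estimator: Callable[[list[dict]], int] = estimate_tokens,
) -> list[dict]:
    """Call trim_fn only when the estimate exceeds threshold * context_limit."""
    if estimator(messages) > threshold * context_limit:
        return trim_fn(messages)
    return messages


# Stand-in for window.trim so the example runs without the library installed
keep_last_two = lambda msgs: msgs[-2:]

small = [{"role": "user", "content": "hi"}]
large = [{"role": "user", "content": "x" * 400} for _ in range(10)]

print(len(maybe_trim(small, keep_last_two, context_limit=1000)))  # 1
print(len(maybe_trim(large, keep_last_two, context_limit=1000)))  # 2
```

In a real loop, trim_fn would be window.trim and estimator would be whatever token estimate you trust for your model.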

Use it for stateful agents that maintain conversation history across user turns. Each user message grows the history. Without trimming, the history grows unbounded. With trimming, the agent forgets old context but never sends malformed message lists.

Skip it for short single-shot interactions. If your agent runs for 3-5 turns and terminates, context trimming is not needed. The window adds overhead that only pays off for longer-running loops.


Install

pip install git+https://github.com/MukundaKatta/agent-message-window

# Or from PyPI
pip install agent-message-window

Usage in a full agent loop:

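import json

import anthropic
from agent_message_window import MessageWindow

# system_prompt, tool_schemas, execute_tool, and extract_text are placeholders
# that you define for your own agent.
anthropic_client = anthropic.AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

window = MessageWindow(
    max_messages=30,
    preserve_system=True,  # Never drop the first system message
    min_keep_recent=4,     # Always keep at least 4 recent messages
)

messages = []

async def agent_loop(initial_task: str) -> str:
    messages.append({"role": "user", "content": initial_task})

    while True:
        # Trim before every call
        trimmed = window.trim(messages)

        response = await anthropic_client.messages.create(
            model="claude-sonnet-4-6",
            system=system_prompt,
            messages=trimmed,
            max_tokens=1024,
            tools=tool_schemas,
        )

        # Store content blocks as plain dicts: the pair detection shown above
        # only inspects dict blocks, not SDK objects.
        messages.append({
            "role": "assistant",
            "content": [block.model_dump(exclude_none=True) for block in response.content],
        })

        # Stop on anything other than a tool call (end_turn, max_tokens, ...),
        # so an empty tool_result message is never appended.
        if response.stop_reason != "tool_use":
            return extract_text(response)

        # Process tool calls, append results
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = execute_tool(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": json.dumps(result),
                })

        messages.append({"role": "user", "content": tool_results})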

Sibling Libraries

| Library | What it solves |
| --- | --- |
| prompt-token-counter | Approximate token count before trimming |
| agent-message-sanitize | Structural fixer: merge consecutive roles, enforce alternation |
| llm-context-rotate | Stateful rolling chat history with configurable retention |
| llm-token-split | Split long documents into chunks for context injection |
| agent-resume | Checkpoint/resume for when context fills and the run must stop |

The context management stack: agent-message-window for sliding window trimming, prompt-token-counter for token estimates, agent-message-sanitize for structural validation, agent-resume for checkpointing when the run grows too large to continue.


What's Next

Token-aware trimming: accept a max_tokens parameter and a token estimator function, trim by estimated token count rather than message count. This would make the window track the token budget directly instead of through a message-count proxy.
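
One way this planned feature might work, sketched under assumptions: the function trim_to_token_budget, its parameters, and the adjacency rule for grouping pairs are all hypothetical, since the feature does not exist in the library yet. The sketch treats an assistant tool_use message and the user tool_result message right after it as one unit, then drops the oldest units until the estimate fits.

```python
from typing import Callable


def has_block(msg: dict, block_type: str) -> bool:
    content = msg.get("content")
    return isinstance(content, list) and any(
        isinstance(b, dict) and b.get("type") == block_type for b in content
    )


def group_units(messages: list[dict]) -> list[list[dict]]:
    """Group each tool_use message with the tool_result message that follows it."""
    units: list[list[dict]] = []
    i = 0
    while i < len(messages):
        nxt = messages[i + 1] if i + 1 < len(messages) else None
        is_pair = (
            messages[i]["role"] == "assistant"
            and has_block(messages[i], "tool_use")
            and nxt is not None
            and nxt["role"] == "user"
            and has_block(nxt, "tool_result")
        )
        step = 2 if is_pair else 1
        units.append(messages[i:i + step])
        i += step
    return units


def trim_to_token_budget(
    messages: list[dict],
    max_tokens: int,
    estimate: Callable[[list[dict]], int],
) -> list[dict]:
    units = group_units(messages)
    # Drop the oldest unit while over budget, always keeping the newest unit
    while len(units) > 1 and estimate([m for u in units for m in u]) > max_tokens:
        units.pop(0)
    return [m for u in units for m in u]


def rough_estimate(messages: list[dict]) -> int:
    """Crude stand-in: about 4 characters of content per token (an assumption)."""
    return sum(len(str(m["content"])) for m in messages) // 4


history = [
    {"role": "user", "content": "Fetch the report."},
    {"role": "assistant", "content": [
        {"type": "tool_use", "id": "t1", "name": "fetch", "input": {}},
    ]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "t1", "content": "x" * 400},
    ]},
    {"role": "assistant", "content": "done"},
]

trimmed = trim_to_token_budget(history, max_tokens=50, estimate=rough_estimate)
print(len(trimmed))  # 1
```

The large tool_result pushes the estimate over budget, so the sketch removes the whole call/result unit and keeps only the final assistant message.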

Summarization hook: a summarize_fn callback that receives dropped messages and returns a summary string. The summary is prepended to the remaining messages as an injected user message. This preserves context semantics at the cost of an extra LLM call.
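
A hedged sketch of how such a hook could be wired around an existing trim function. The wrapper trim_with_summary is a hypothetical name, and it assumes the trim function returns the same dict objects it was given, which holds for a filter-style trim like the one shown earlier.

```python
from typing import Callable


def trim_with_summary(
    messages: list[dict],
    trim_fn: Callable[[list[dict]], list[dict]],
    summarize_fn: Callable[[list[dict]], str],
) -> list[dict]:
    """Trim, then prepend a summary of whatever the trim removed."""
    kept = trim_fn(messages)
    kept_ids = {id(m) for m in kept}  # identity check: assumes trim_fn does not copy
    dropped = [m for m in messages if id(m) not in kept_ids]
    if not dropped:
        return kept
    summary = summarize_fn(dropped)
    injected = {"role": "user", "content": f"Summary of earlier conversation: {summary}"}
    return [injected] + kept


# Stand-ins so the example runs without an LLM call or the library installed
keep_last_two = lambda msgs: msgs[-2:]
fake_summarize = lambda dropped: f"{len(dropped)} earlier messages omitted"

history = [
    {"role": "user", "content": "Plan a trip to Rome."},
    {"role": "assistant", "content": "Which month?"},
    {"role": "user", "content": "May."},
    {"role": "assistant", "content": "May is a good choice."},
]

result = trim_with_summary(history, keep_last_two, fake_summarize)
print(result[0]["content"])  # Summary of earlier conversation: 2 earlier messages omitted
```

In practice summarize_fn would make the extra LLM call. Note that prepending a user message can leave two user messages in a row, so the combined list may still need a structural pass before it is sent.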

Trim stats: window.trim_stats() returning how many messages and turns were dropped in the last trim, and how many tool pairs were preserved intact. Useful for tuning max_messages.
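
Until something like this ships, similar numbers can be computed outside the library by comparing the list before and after a trim. The sketch below is an assumption about what such stats might contain; it leaves out dropped turns because the article does not define how a turn is counted.

```python
from dataclasses import dataclass


@dataclass
class TrimStats:
    messages_dropped: int
    pairs_preserved: int


def count_complete_pairs(messages: list[dict]) -> int:
    """Count tool_use ids that also have a tool_result in the same list."""
    use_ids: set[str] = set()
    result_ids: set[str] = set()
    for msg in messages:
        content = msg.get("content")
        if not isinstance(content, list):
            continue
        for block in content:
            if not isinstance(block, dict):
                continue
            if block.get("type") == "tool_use":
                use_ids.add(block.get("id"))
            elif block.get("type") == "tool_result":
                result_ids.add(block.get("tool_use_id"))
    return len(use_ids & result_ids)


def trim_stats(before: list[dict], after: list[dict]) -> TrimStats:
    return TrimStats(
        messages_dropped=len(before) - len(after),
        pairs_preserved=count_complete_pairs(after),
    )


before = [
    {"role": "user", "content": "Look up the order."},
    {"role": "assistant", "content": [
        {"type": "tool_use", "id": "t1", "name": "lookup", "input": {}},
    ]},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "t1", "content": "shipped"},
    ]},
    {"role": "assistant", "content": "Your order has shipped."},
]
after = before[1:]  # stand-in for the output of a trim call

print(trim_stats(before, after))  # TrimStats(messages_dropped=1, pairs_preserved=1)
```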


Built as part of the agent-stack family: composable Python primitives for production LLM agents.
