Mukunda Rao Katta

Posted on May 25

Compress Your Agent's Context Without Losing What Matters

#hermeschallenge #ai #python #agents

The 180k Token Wall

Fifty turns into a research agent run, you hit the context limit.

The conversation history is bloated. The agent called a web search tool that returned 4,000 words of raw HTML. It called a database tool that returned 300 rows as a JSON array. Each tool result sits in the context, uncompressed, for the rest of the run.

You are at 180k tokens. The model supports 200k. You have maybe 10 more turns before you hit the ceiling and the run fails.

The standard answers to this are summarization (call the LLM to compress its own history) or RAG (retrieve only relevant chunks). Both work. Both also add latency, cost, and complexity.

There is a simpler tier that runs before you need either of those: structural compression. No LLM call required. Just trim what does not need to be full-sized.

This post shows three techniques that compose into a context management pipeline.

Three Tools, One Pipeline

Sliding message window (agent-message-window): keep only the last N message pairs. Old tool calls and their results drop out of context. The agent operates on recent history only.
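To make the idea concrete, here is a minimal sketch of a sliding window in plain Python. It is not the implementation of agent-message-window: the function name `apply_window` and the rule that a turn starts at each user message are assumptions made for illustration.

```python
def apply_window(messages, max_pairs=20, preserve_system=True):
    """Keep system messages plus the most recent `max_pairs` turns.

    A turn starts at a user message and runs until the next user message,
    so tool calls and their results stay attached to the turn that made them.
    """
    if max_pairs < 1:
        raise ValueError("max_pairs must be at least 1")

    if preserve_system:
        system = [m for m in messages if m["role"] == "system"]
        rest = [m for m in messages if m["role"] != "system"]
    else:
        system, rest = [], list(messages)

    # Indices where each turn begins (every user message starts a new turn).
    starts = [i for i, m in enumerate(rest) if m["role"] == "user"]
    if len(starts) <= max_pairs:
        return system + rest

    cut = starts[-max_pairs]  # first message of the oldest turn we keep
    return system + rest[cut:]


history = [{"role": "system", "content": "You are a research agent."}]
for i in range(5):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

windowed = apply_window(history, max_pairs=2)
print([m["content"] for m in windowed])
```

Cutting on turn boundaries rather than on a raw message count avoids sending a tool result whose originating request has already been dropped.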

Tool output truncation (tool-output-truncate-py): intercept every tool result before it enters context. Trim long strings, truncate lists, clip nested objects. The agent sees a short version, not 4,000 words of raw text.
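As a rough illustration of what truncation does, the sketch below clips strings, lists, and nesting depth with plain recursion. It is an assumption-laden stand-in, not the code of tool-output-truncate-py: the function name `truncate`, its parameters, and the exact notice wording for nested values are hypothetical.

```python
def truncate(value, max_string_chars=800, max_list_items=30, max_depth=3, _depth=0):
    """Recursively clip strings, lists, and nested containers, leaving a notice."""
    if isinstance(value, str):
        if len(value) <= max_string_chars:
            return value
        omitted = len(value) - max_string_chars
        return value[:max_string_chars] + f"[...{omitted} chars omitted]"

    # Containers nested deeper than max_depth are replaced by a placeholder.
    if isinstance(value, (list, dict)) and _depth >= max_depth:
        return f"[...nested {type(value).__name__} omitted]"

    if isinstance(value, list):
        kept = [
            truncate(v, max_string_chars, max_list_items, max_depth, _depth + 1)
            for v in value[:max_list_items]
        ]
        if len(value) > max_list_items:
            kept.append(f"[...{len(value) - max_list_items} items omitted]")
        return kept

    if isinstance(value, dict):
        return {
            k: truncate(v, max_string_chars, max_list_items, max_depth, _depth + 1)
            for k, v in value.items()
        }

    return value  # numbers, booleans, None pass through unchanged


rows = [{"id": i, "body": "x" * 2000} for i in range(300)]
clipped = truncate(rows)
print(len(clipped), clipped[-1])
```

The notice strings are what let the model tell a partial result from a complete one.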

Output format optimization (tool-output-format): convert verbose tool results into compact representations. A list of 50 dicts becomes a markdown table. A JSON tree becomes an indented outline. Same information, fewer tokens.
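The saving comes from stating each key once instead of once per row. The sketch below shows the list-of-dicts case in plain Python. It is not the implementation of tool-output-format, and the helper name `to_markdown_table` is hypothetical.

```python
import json


def to_markdown_table(rows):
    """Render a list of flat dicts as a markdown table, stating each key once."""
    if not rows:
        return ""
    # Preserve first-seen key order across all rows.
    columns = list(dict.fromkeys(key for row in rows for key in row))
    lines = [
        "| " + " | ".join(columns) + " |",
        "| " + " | ".join("---" for _ in columns) + " |",
    ]
    for row in rows:
        # Escape pipes so cell content cannot break the table layout.
        cells = [str(row.get(col, "")).replace("|", "\\|") for col in columns]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


rows = [{"id": i, "name": f"user{i}", "active": True} for i in range(50)]
table = to_markdown_table(rows)
as_json = json.dumps(rows)
print(f"JSON: {len(as_json):,} chars, table: {len(table):,} chars")
```

Character counts are only a proxy for tokens, so measure with a real tokenizer before relying on the ratio.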

Main Code Example

import asyncio
from agent_message_window import MessageWindow
from tool_output_truncate_py import Truncator, TruncateConfig
from tool_output_format import Formatter, FormatMode

# Sliding window: keep the last 20 message pairs (user + assistant each)
window = MessageWindow(max_pairs=20, preserve_system=True)

# Truncator: clip strings at 800 chars, lists at 30 items, depth at 3
truncator = Truncator(
    config=TruncateConfig(
        max_string_chars=800,
        max_list_items=30,
        max_dict_depth=3,
        add_truncation_notice=True,  # appends "[...N chars omitted]"
    )
)

# Formatter: auto-detect and convert verbose structures to markdown
formatter = Formatter(mode=FormatMode.AUTO)


def compress_tool_result(tool_name: str, raw_result: object) -> str:
    """
    Compress a tool result before it enters the conversation.
    Returns a string suitable for the tool_result message content.
    """
    # Step 1: truncate deep/long structures
    truncated = truncator.truncate(raw_result)

    # Step 2: format as compact markdown
    formatted = formatter.format(truncated, hint=tool_name)

    return formatted


async def run_agent_turn(
    messages: list[dict],
    user_input: str,
) -> tuple[str, list[dict]]:
    """
    Run one turn of the agent loop with context compression.
    Returns (agent_response, updated_messages).
    """
    # Add user message
    messages.append({"role": "user", "content": user_input})

    # Apply sliding window BEFORE sending to model
    windowed = window.apply(messages)

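    # Call the model (your_llm_client and dispatch_tool are placeholders
    # for your own async model client and tool router)
    response = await your_llm_client(windowed)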

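    # Process tool calls if any
    updated = list(messages)  # keep full history on our side
    if response.tool_calls:
        # Record the assistant's tool-call message first: OpenAI-style chat
        # APIs reject tool results that do not follow one. The exact field
        # names depend on your client.
        updated.append({
            "role": "assistant",
            "content": response.content,
            "tool_calls": response.tool_calls,
        })
        for tool_call in response.tool_calls: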
            raw = await dispatch_tool(tool_call)

            # Compress before appending to context
            compressed = compress_tool_result(tool_call.name, raw)

            updated.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": compressed,
            })

        # One more model call with tool results
        windowed_with_tools = window.apply(updated)
        response = await your_llm_client(windowed_with_tools)

    updated.append({"role": "assistant", "content": response.content})
    return response.content, updated


async def main():
    messages = []
    conversation = [
        "Search for Python async patterns from the last year.",
        "What were the three most discussed topics?",
        "Summarize the first one in two sentences.",
        "Compare it to the second topic.",
        # ... agent runs for many more turns
    ]

    for user_input in conversation:
        reply, messages = await run_agent_turn(messages, user_input)
        print(f"Agent: {reply[:100]}...")

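        # Track the size of what is actually sent (the windowed slice),
        # not the full local history
        sent = window.apply(messages)
        total_chars = sum(len(str(m.get("content", ""))) for m in sent)
        print(f"  Context size: {total_chars:,} chars ({len(sent)} of {len(messages)} messages)")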


if __name__ == "__main__":
    asyncio.run(main())

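The pipeline runs in order: tool dispatch, truncate, format, window. By the time the model sees the result, it is already compressed. Note that the local updated list holds the full, un-windowed history, but its tool results are the compressed versions. If you need the raw output for logging or debugging, save raw yourself before calling compress_tool_result.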

What This Does NOT Do

This approach does not compress semantically. If the tool returned 300 rows of data and your agent needs all 300, truncating to 30 items throws away data it needs. You will need RAG or summarization for that case.

It does not guarantee the window contains the most relevant messages. A sliding window keeps the most recent messages. If a critical instruction appeared in turn 3 and you are now on turn 25, it may be gone. Use a system message or a pinned context block for persistent instructions.

It does not reduce your prompt cost to zero. Compression reduces context size, which reduces cost. But you still pay for what remains. A 50k-token context compressed to 20k still costs money.

It does not handle multi-modal content. Image and audio tokens in context are a separate problem. These libraries work on text content only.

Design Reasoning

The pipeline order matters. Truncate before format. Raw structures can be deeply nested, with the same keys repeated in every row, and that inflates the character count. Truncating first bounds the size of the input, so the formatter never renders rows or branches that would be discarded anyway. Formatting then produces a cleaner, more predictable output.

Compression happens at insertion time, not at retrieval time. Some implementations apply compression lazily when the context gets too long. That means you store large tool results, pay for them in memory, and then trim them later. Compressing on insert is cheaper and simpler.

The full history stays local. Your agent process keeps every message (with tool results in their compressed form). Only the windowed slice goes to the API. This gives you two things: a complete local record of the conversation for debugging and the ability to re-process history with different window settings without re-running the agent.

Adding truncation notices is important. When tool-output-truncate-py clips a list from 300 items to 30, it appends [...270 items omitted]. The agent sees this notice and knows the list was partial. Without the notice, the agent might treat the 30 items as the complete result.

When This Applies

Long-running agents that call data-heavy tools. Database agents, search agents, document processing agents. Any agent that accumulates large tool results over many turns.

Background batch agents where you cannot tune the model context window interactively. You set the compression policy up front and let it run.

Cost-sensitive agents where you need to fit within a token budget per run. Structural compression needs no LLM call, so it adds no token cost of its own. It is the first optimization to apply before adding cost to the pipeline.

This does NOT fit short agents with 3-5 turns. The overhead of setting up the pipeline is not worth it. Use direct API calls without a message manager for short interactions.

It also does NOT fit agents where historical accuracy is critical. A legal or compliance agent may need every detail from every prior turn. For those, store the full history externally and retrieve by query rather than by recency.
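For the cost-sensitive case above, a rough budget check can run before each API call. The sketch below is an assumption, not part of any of the libraries named here: the helper names are hypothetical, and four characters per token is a crude heuristic for English text. Use a real tokenizer when you need accurate counts.

```python
def estimate_tokens(messages, chars_per_token=4):
    """Very rough token estimate from character counts (heuristic only)."""
    total_chars = sum(len(str(m.get("content", ""))) for m in messages)
    return total_chars // chars_per_token


def within_budget(messages, budget_tokens, reserve_for_reply=1000):
    """True if the estimated prompt plus a reply reserve fits the budget."""
    return estimate_tokens(messages) + reserve_for_reply <= budget_tokens


context = [
    {"role": "system", "content": "You are a research agent."},
    {"role": "user", "content": "x" * 4000},
]
print(estimate_tokens(context), within_budget(context, budget_tokens=8000))
```

If the check fails, tighten the window or the truncation limits before reaching for summarization.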

Quick-Start Snippet

pip install agent-message-window tool-output-truncate-py tool-output-format

from agent_message_window import MessageWindow
from tool_output_truncate_py import Truncator, TruncateConfig
from tool_output_format import Formatter, FormatMode

window = MessageWindow(max_pairs=20, preserve_system=True)
truncator = Truncator(config=TruncateConfig(max_string_chars=800, max_list_items=30))
formatter = Formatter(mode=FormatMode.AUTO)

# In your tool dispatch loop:
raw_result = await dispatch_tool(tool_call)
compressed = formatter.format(truncator.truncate(raw_result))
# Append compressed to messages, apply window before each API call.

Three lines of setup. Drop into any existing agent loop.

Siblings

Library	What it does	When to reach for it
`agent-message-window`	Sliding window over message history	Context overflow from long conversations
`tool-output-truncate-py`	Clip long strings, lists, nested dicts	Large raw tool results
`tool-output-format`	Convert verbose structures to compact markdown	JSON arrays or dicts that can be tables
`agentfit`	Measure agent loop performance	After compression, verify it helped
`prompt-token-counter`	Estimate token count before sending	Know exactly how big the context is
`llm-token-split`	Split long docs into overlapping chunks	Source documents too large to include whole

What's Next

The next level is semantic compression: use a small, cheap model to summarize old message pairs before they drop out of the window. You keep the meaning without the tokens. The sliding window from agent-message-window gives you the hook: it fires a callback before dropping messages. In that callback, you summarize and inject a condensed note into the system prompt.
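The shape of that idea can be sketched without the library. The code below is an assumption for illustration only: it does not use the agent-message-window callback (whose signature is not shown in this post), the function name `window_with_summary` is hypothetical, and the `summarize` argument stands in for a call to a small model.

```python
def window_with_summary(messages, max_messages, summarize):
    """Drop old non-system messages and pin a summary of them in the system prompt."""
    if max_messages < 1:
        raise ValueError("max_messages must be at least 1")

    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if len(rest) <= max_messages:
        return list(messages)

    dropped, kept = rest[:-max_messages], rest[-max_messages:]
    note = summarize(dropped)  # in practice: one call to a small, cheap model

    base = "\n".join(m["content"] for m in system)
    pinned = {
        "role": "system",
        "content": f"{base}\n\nSummary of earlier turns: {note}".strip(),
    }
    return [pinned] + kept


def fake_summarize(dropped):
    # Placeholder summarizer so the sketch runs without a model.
    return f"{len(dropped)} earlier messages condensed."


history = [{"role": "system", "content": "You are a research agent."}]
for i in range(3):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

compact = window_with_summary(history, max_messages=2, summarize=fake_summarize)
print(compact[0]["content"])
```

Unlike structural compression, this step costs a model call each time messages are dropped, so it is worth batching the drops.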

For agents with external memory (vector stores, databases), the pattern shifts. Instead of compressing what's in context, you store tool results externally and retrieve by query. The agent-context-builder library handles composing the system prompt from retrieved chunks. That is a larger architectural change, but compression is the right first step.
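A toy version of the store-and-retrieve pattern is sketched below. It is not agent-context-builder and not a vector store: the class `ToolResultStore` is hypothetical, and plain keyword overlap stands in for embedding search so that the example runs with the standard library alone.

```python
class ToolResultStore:
    """Keep raw tool results out of context and look them up by query."""

    def __init__(self):
        self._items = {}

    def put(self, key, text):
        self._items[key] = text

    def get(self, key):
        return self._items[key]

    def search(self, query, top_k=3):
        """Return keys of stored results ranked by shared words with the query."""
        terms = set(query.lower().split())
        scored = []
        for key, text in self._items.items():
            score = len(terms & set(text.lower().split()))
            if score:
                scored.append((score, key))
        # Highest score first; ties broken by key for stable output.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [key for _, key in scored[:top_k]]


store = ToolResultStore()
store.put("search-1", "asyncio task groups and structured concurrency")
store.put("db-1", "customer orders table with 300 rows")

hits = store.search("structured concurrency in asyncio")
print(hits, [store.get(k) for k in hits])
```

Only the retrieved results enter the prompt, so context size depends on the query rather than on how long the run has been going.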

Tool output truncation and format optimization are the cheapest wins. Apply them first, measure with agentfit, and only add complexity if you still need it.

DEV Community