DEV Community

Mukunda Rao Katta

My Hermes agent loop blew the context window at turn 47. llm-context-trim fixed it.

Hermes Agent Challenge Submission: Build With Hermes Agent

This is a submission for the Hermes Agent Challenge.

My Hermes research agent ran in a loop: the supervisor asked a question, a worker searched, the supervisor synthesized the results, and the cycle repeated. After 47 turns, the API returned a context length error.

I knew this would happen eventually. I needed to trim the message list before each call, but with two hard rules: never drop the system prompt, and never drop the last two turns (the current question and the previous answer). Everything else was fair game.

That's llm-context-trim.

The problem with rolling windows

The obvious fix is a rolling window: keep the last N messages. But what should N be? If I keep 20 messages and the system prompt alone is 800 tokens, I still have to count tokens to know whether the request fits. If a few of those messages are unusually long tool call results, 20 messages can still overflow. A fixed N doesn't adapt to what the messages actually contain.
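To make the problem concrete, here is a small, self-contained illustration (not code from llm-context-trim). The helper names `rough_tokens` and `last_n`, and all the message sizes, are made up for the demo; token counts use the same rough chars / 4 + 4 per message heuristic described under Token estimation.

```python
import math


def rough_tokens(text: str) -> int:
    """Rough estimate: ceil(chars / 4) plus 4 tokens of per-message overhead."""
    return math.ceil(len(text) / 4) + 4


def last_n(messages: list[dict], n: int) -> list[dict]:
    """Naive rolling window: keep the last n messages, whatever their size."""
    return messages[-n:]


# Made-up history (roles simplified): 20 short turns, one huge tool result,
# then the current question.
history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": "x" * 200}
    for i in range(20)
]
history.append({"role": "user", "content": "TOOL RESULT: " + "y" * 40_000})
history.append({"role": "user", "content": "What does the tool output say?"})

budget = 4096
window = last_n(history, 20)
total = sum(rough_tokens(m["content"]) for m in window)
print(f"kept {len(window)} messages, ~{total} tokens, budget {budget}")
print("overflow" if total > budget else "fits")
```

The window holds exactly 20 messages, yet one long tool result pushes it far past the budget. Counting tokens, not messages, is the only way to know whether the request fits.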

What I wanted instead: always keep the system prompt and the tail, then fill whatever budget remains with as many middle messages as fit, newest first.

One function

```python
import anthropic
from llm_context_trim import trim_messages

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

result = trim_messages(
    messages,         # the full conversation history
    max_tokens=4096,  # my budget for the messages portion of the call
    keep_last=2,      # always keep the last 2 messages
)

print(f"Was {result.original_count} messages, {result.trimmed_count} removed")
print(f"~{result.estimated_tokens} tokens")

# Pass the trimmed list to the next LLM call
response = client.messages.create(
    model="claude-sonnet-4-6",
    messages=result.messages,
    max_tokens=1024,  # output token limit, separate from the trim budget above
)
```

What it keeps

In priority order:

  1. System message: kept whenever the first message in the list has role="system". Disable with keep_system=False.
  2. Last keep_last messages: always kept. Default is 2 (the current user turn and the previous assistant turn).
  3. Middle messages: added newest-first until the budget runs out, so the oldest middle messages are the first to be dropped.
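The priority rules above fit in a short function. The sketch below is my own illustration under stated assumptions, not llm-context-trim's source: `trim_sketch`, `estimate`, and `BudgetError` (standing in for ContextTrimError) are hypothetical names, content is assumed to be a plain string, and the loop stops at the first middle message that doesn't fit (the library might instead skip it and keep looking).

```python
import math


class BudgetError(ValueError):
    """Hypothetical stand-in for ContextTrimError."""


def estimate(message: dict) -> int:
    """ceil(chars / 4) + 4 per message; assumes plain string content."""
    return math.ceil(len(message["content"]) / 4) + 4


def trim_sketch(messages: list[dict], max_tokens: int, keep_last: int = 2,
                keep_system: bool = True) -> list[dict]:
    """Keep system + last `keep_last` messages, then fill the budget newest-first."""
    if max_tokens <= 0:
        raise BudgetError("max_tokens must be positive")
    if keep_last < 0:
        raise BudgetError("keep_last must be >= 0")

    # 1. A leading system message is mandatory (unless keep_system=False).
    head: list[dict] = []
    rest = list(messages)
    if keep_system and rest and rest[0].get("role") == "system":
        head, rest = rest[:1], rest[1:]

    # 2. The last `keep_last` messages are mandatory too.
    split = max(len(rest) - keep_last, 0)
    middle, tail = rest[:split], rest[split:]

    used = sum(estimate(m) for m in head + tail)
    if used > max_tokens:
        raise BudgetError(f"mandatory messages use ~{used} tokens, over max_tokens={max_tokens}")

    # 3. Walk the middle newest-to-oldest; stop at the first message that doesn't fit.
    kept: list[dict] = []
    for message in reversed(middle):
        cost = estimate(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost

    # Reverse back so the result is in the original chronological order.
    return head + kept[::-1] + tail


msgs = [
    {"role": "system", "content": "s" * 40},     # ~14 tokens
    {"role": "user", "content": "a" * 400},      # ~104 tokens
    {"role": "assistant", "content": "b" * 40},  # ~14 tokens
    {"role": "user", "content": "c" * 40},
    {"role": "assistant", "content": "d" * 40},
    {"role": "user", "content": "e" * 40},
]
kept_msgs = trim_sketch(msgs, max_tokens=60, keep_last=2)
print([m["content"][0] for m in kept_msgs])  # → ['s', 'c', 'd', 'e']
```

Stopping at the first miss keeps the surviving middle messages contiguous with the tail, so the model never sees a conversation with a hole in it. Skipping oversized messages instead would pack the budget tighter at the cost of gaps.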

Integration in an agent loop

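```python
import anthropic
from llm_context_trim import trim_messages, ContextTrimError

client = anthropic.Anthropic()


def run_loop(system_prompt, history, new_user_msg, max_context_tokens=6000):
    """Run one turn: append the user message, trim, call the model, record the reply.

    max_context_tokens budgets the messages only; the system prompt and the
    1024 output tokens come on top of it.
    """
    history.append({"role": "user", "content": new_user_msg})

    try:
        # The system prompt travels in `system=`, not in history, so keep_system=False
        trimmed = trim_messages(
            history,
            max_tokens=max_context_tokens,
            keep_last=3,
            keep_system=False,
        )
    except ContextTrimError as e:
        # The last 3 messages alone are over budget: lower keep_last or raise the budget
        raise RuntimeError(f"Context too tight: {e}") from e

    response = client.messages.create(
        model="claude-sonnet-4-6",
        system=system_prompt,
        messages=trimmed.messages,
        max_tokens=1024,
    )

    reply = response.content[0].text
    history.append({"role": "assistant", "content": reply})
    return reply
```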

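Anthropic's API takes the system prompt as a separate system parameter, so in that setup I never put a system message in my history list and I pass keep_system=False. Either pattern works: keep the system message at the head of the list and let trim_messages preserve it, or keep it out of the list entirely.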
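If the history has to stay in the system-message-first shape (say, because other code shares it) but the call goes to an API that takes system separately, a tiny helper can bridge the two patterns: trim with the default keep_system so the system message is protected, then split it off just before the call. `split_system` is a hypothetical name for illustration, not part of llm-context-trim.

```python
from typing import Optional


def split_system(messages: list[dict]) -> tuple[Optional[str], list[dict]]:
    """Hypothetical helper: split a leading role="system" message off the list.

    Returns (system_text, remaining_messages). system_text is None when the
    list has no leading system message.
    """
    if messages and messages[0].get("role") == "system":
        return messages[0]["content"], list(messages[1:])
    return None, list(messages)


history = [
    {"role": "system", "content": "You are a research supervisor."},
    {"role": "user", "content": "Find recent work on context trimming."},
]
system_text, chat = split_system(history)
print(system_text)
print([m["role"] for m in chat])
```

With the library, you would call it on `trimmed.messages` and pass the two parts as `system=` and `messages=`.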

Token estimation

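No tokenizer dependency. The estimate is chars / 4 plus 4 tokens per message, a common rule of thumb for planning (OpenAI, for example, cites roughly 4 characters per token for English text). The per-message padding is meant to err on the high side so trimming doesn't cut too close to the edge, but it is still a heuristic: code, non-English text, and long digit strings can tokenize more densely than 4 characters per token, so leave some headroom in max_tokens.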
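A minimal sketch of that heuristic, for reference. This is not the library's source: the function name `estimate_message_tokens`, rounding up with `math.ceil`, and counting only the text of `type: "text"` blocks in Anthropic-style content lists are all assumptions.

```python
import math
from typing import Any


def estimate_message_tokens(message: dict[str, Any]) -> int:
    """Rough estimate: ceil(chars / 4) + 4 per message.

    Handles plain string content and Anthropic-style lists of content blocks,
    counting only the "text" of text blocks (other block types add nothing here).
    """
    content = message.get("content", "")
    if isinstance(content, str):
        chars = len(content)
    else:
        chars = sum(
            len(block.get("text", ""))
            for block in content
            if isinstance(block, dict) and block.get("type") == "text"
        )
    return math.ceil(chars / 4) + 4


history = [
    {"role": "user", "content": "x" * 8},
    {"role": "assistant", "content": [{"type": "text", "text": "y" * 100}]},
]
print(sum(estimate_message_tokens(m) for m in history))  # → 35
```

Swapping this for a real tokenizer only changes the per-message cost; the trimming logic doesn't care where the numbers come from.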

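If you need exact token counts, use your provider's tokenizer to work out the budget and pass that number as max_tokens. To see what the built-in estimate reports for a history, call estimate_tokens directly: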

```python
from llm_context_trim import estimate_tokens

rough_estimate = sum(estimate_tokens(m["content"]) for m in messages)
```

Error handling

If the system message + last keep_last messages alone already exceed max_tokens, the function raises ContextTrimError instead of returning a list that's already over budget. You get an explicit failure rather than a silent overflow:

```
ContextTrimError: System + last 2 messages already use ~4800 tokens
which exceeds max_tokens=4096. Increase max_tokens or reduce keep_last.
```

Technical notes

19 tests. Zero runtime dependencies. Python 3.10+. The test suite covers the basic no-trim case, zero/negative budget errors, the mandatory-exceeds-budget error path, system message preservation, keep_system=False, keep_last edge cases (zero, all), order preservation after trimming, Anthropic content blocks, and TrimResult metadata correctness.

Repo: https://github.com/MukundaKatta/llm-context-trim

```
pip install llm-context-trim
```
