This is a submission for the Hermes Agent Challenge.
My Hermes research agent ran in a loop. The supervisor asked a question, a worker searched, the supervisor synthesized, repeat. After 47 turns, the API returned a context length error.
I knew this would happen eventually. I needed to trim the message list before each call, but with two hard rules: never drop the system prompt, and never drop the last two turns (the current question and the previous answer). Everything else was fair game.
That's llm-context-trim.
The problem with rolling windows
The obvious fix is a rolling window: keep the last N messages. But N what? If I keep 20 messages and the system prompt is 800 tokens, I still need to count. If a few messages have tool call results that are unusually long, 20 messages might still overflow. And a fixed N doesn't adapt to the actual content of the messages.
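To make the overflow concrete, here is a small sketch of a fixed-size window meeting one oversized tool result. The history, the 40,000-character payload, and the `estimate_tokens` helper (about 4 characters per token plus 4 per message) are all assumptions for illustration, not output from the real agent.

```python
def estimate_tokens(text: str) -> int:
    # Assumed heuristic: ~4 characters per token, plus 4 tokens of per-message overhead.
    return len(text) // 4 + 4


def rolling_window(messages: list[dict], n: int) -> list[dict]:
    # Naive approach: keep the last n messages regardless of how large they are.
    return messages[-n:]


# Hypothetical history: 30 short turns, one of which is a very long tool result.
history = [{"role": "user", "content": "short question"} for _ in range(30)]
history[-5] = {"role": "user", "content": "x" * 40_000}

budget = 4096
window = rolling_window(history, 20)
total = sum(estimate_tokens(m["content"]) for m in window)

print(len(window), total, total > budget)  # 20 10137 True
```

Twenty messages sounds safe, but a single long message puts the window at more than twice the budget.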
What I wanted was: keep as many middle messages as fit in the remaining budget, newest first, always guarantee system + tail.
One function
from llm_context_trim import trim_messages

result = trim_messages(
    messages,         # the full conversation history
    max_tokens=4096,  # my budget for the messages portion of the call
    keep_last=2,      # always keep the last 2 messages
)
print(f"Was {result.original_count} messages, {result.trimmed_count} removed")
print(f"~{result.estimated_tokens} tokens")

# Pass to the next LLM call
response = client.messages.create(
    model="claude-sonnet-4-6",
    messages=result.messages,
    max_tokens=1024,
)
What it keeps
In priority order:
- System message: always kept if it's the first message with role="system". Disable with keep_system=False.
- Last keep_last messages: always kept. Default is 2 (the current user turn and the previous assistant turn).
- Middle messages: added newest-first until the budget runs out. Older middle messages are dropped first.
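The priority order above can be sketched in a few lines. This is an illustration of the idea, not the library's actual source: `sketch_trim` and its `estimate_tokens` helper are hypothetical names, the error is a plain `ValueError`, and stopping at the first middle message that does not fit (rather than skipping it and trying older ones) is my assumption.

```python
def estimate_tokens(text: str) -> int:
    # Assumed heuristic, not necessarily the library's exact formula.
    return len(text) // 4 + 4


def sketch_trim(messages, max_tokens, keep_last=2, keep_system=True):
    """Illustrative priority trim: system first, then the tail, then newest middle."""
    def cost(m):
        return estimate_tokens(m["content"])

    has_system = keep_system and bool(messages) and messages[0]["role"] == "system"
    head = messages[:1] if has_system else []
    body = messages[1:] if has_system else list(messages)

    tail = body[-keep_last:] if keep_last > 0 else []
    middle = body[:-keep_last] if keep_last > 0 else body

    used = sum(cost(m) for m in head + tail)
    if used > max_tokens:
        raise ValueError(f"mandatory messages need ~{used} tokens, budget is {max_tokens}")

    kept = []
    for msg in reversed(middle):  # walk the middle newest-first
        if used + cost(msg) > max_tokens:
            break  # budget exhausted: everything older is dropped
        used += cost(msg)
        kept.append(msg)
    kept.reverse()  # restore chronological order
    return head + kept + tail


# Hypothetical conversation: one system prompt and ten ~55-token turns.
msgs = [{"role": "system", "content": "You are a researcher."}]
msgs += [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"turn {i} " + "x" * 200}
    for i in range(10)
]
trimmed = sketch_trim(msgs, max_tokens=200, keep_last=2)
print([m["content"][:6] for m in trimmed[1:]])  # ['turn 7', 'turn 8', 'turn 9']
```

With a 200-token budget, the system prompt and the last two turns take about 119 tokens, one more middle turn fits, and the older seven are dropped.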
Integration in an agent loop
import anthropic
from llm_context_trim import trim_messages, ContextTrimError

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def run_loop(system_prompt, history, new_user_msg, max_context_tokens=6000):
    history.append({"role": "user", "content": new_user_msg})
    try:
        trimmed = trim_messages(history, max_tokens=max_context_tokens, keep_last=3)
    except ContextTrimError as e:
        # System + last 3 already over budget: need to shorten keep_last or the system prompt
        raise RuntimeError(f"Context too tight: {e}") from e
    response = client.messages.create(
        model="claude-sonnet-4-6",
        system=system_prompt,
        messages=trimmed.messages,
        max_tokens=1024,
    )
    history.append({"role": "assistant", "content": response.content[0].text})
    return response.content[0].text
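Anthropic's API takes the system prompt as a separate system parameter, so in this loop I never add a system message to the history list, and keep_system=False is the matching setting. If your API expects the system prompt as the first message instead, leave it in the list and keep the default. Either pattern works.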
Token estimation
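No tokenizer dependency. The estimate uses chars / 4 + 4 per message: the roughly-four-characters-per-token rule of thumb that providers commonly quote for English text, plus a small per-message overhead. It's meant to be conservative, erring toward over-estimating so trimming doesn't cut too close to the edge. It is still an estimate, so leave some headroom if your messages are mostly code or non-English text, which can tokenize less efficiently.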
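As a rough picture of what that heuristic looks like, here is a minimal sketch. The function name `estimate_message_tokens`, the floor division, and the way Anthropic-style content blocks are flattened (only the `text` fields are counted) are assumptions on my part; the library's own rounding and block handling may differ.

```python
def estimate_message_tokens(message: dict) -> int:
    """Rough per-message estimate: characters / 4, plus 4 tokens of overhead."""
    content = message.get("content", "")
    if isinstance(content, list):
        # Anthropic-style content blocks: count only the text fields (an assumption).
        text = "".join(b.get("text", "") for b in content if isinstance(b, dict))
    else:
        text = str(content)
    return len(text) // 4 + 4


msgs = [
    {"role": "user", "content": "a" * 400},
    {"role": "assistant", "content": [{"type": "text", "text": "b" * 100}]},
]
estimates = [estimate_message_tokens(m) for m in msgs]
print(estimates, sum(estimates))  # [104, 29] 133
```

The fixed +4 matters more than it looks: in a history of many short messages, the per-message overhead can be a large share of the total.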
If you need exact token counts, count with your own tokenizer and size max_tokens from that. The rough estimator is also exposed directly if you want to see the figure for a message list:
from llm_context_trim import estimate_tokens
rough_estimate = sum(estimate_tokens(m["content"]) for m in messages)
Error handling
If the system message + last keep_last messages alone already exceed max_tokens, the function raises ContextTrimError instead of returning a list that's already over budget. You get an explicit failure rather than a silent overflow:
ContextTrimError: System + last 2 messages already use ~4800 tokens
which exceeds max_tokens=4096. Increase max_tokens or reduce keep_last.
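One way to act on that message is to retry with a smaller keep_last before giving up. The helper below is a hypothetical sketch, not part of the library: `trim_with_fallback` takes the trim function and the error type as arguments so it stays self-contained, and it assumes the trim function accepts `max_tokens` and `keep_last` as keyword arguments, as in the calls shown earlier.

```python
def trim_with_fallback(trim_fn, error_type, messages, max_tokens, keep_last):
    """Call trim_fn, shrinking keep_last by one each time it raises error_type."""
    last_error = None
    for k in range(keep_last, 0, -1):
        try:
            return trim_fn(messages, max_tokens=max_tokens, keep_last=k)
        except error_type as exc:
            last_error = exc  # remember why this attempt failed, then try a smaller tail
    raise RuntimeError("mandatory messages do not fit even with keep_last=1") from last_error
```

With the library installed, the call would presumably look like `trim_with_fallback(trim_messages, ContextTrimError, history, 4096, 3)`. Whether silently shrinking the tail is acceptable depends on the agent; failing loudly, as the library does by default, is often the safer choice.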
Technical notes
19 tests. Zero runtime dependencies. Python 3.10+. The test suite covers the basic no-trim case, zero/negative budget errors, the mandatory-exceeds-budget error path, system message preservation, keep_system=False, keep_last edge cases (zero, all), order preservation after trimming, Anthropic content blocks, and TrimResult metadata correctness.
Repo: https://github.com/MukundaKatta/llm-context-trim
pip install llm-context-trim