DEV Community

Mukunda Rao Katta

agentfit: Trim Agent Message History to Fit Your Context Window

The agent that started failing after ten turns

I had a support agent that handled long conversations. It had a system prompt, a few example turns for style, and then the real conversation. For short threads it worked fine. Then, after ten or fifteen turns in a real session, the API started returning errors. Context window exceeded.

The naive fix is to drop the oldest messages from the array. Most people reach for this first. Slice the front of the array, keep the tail, send what remains. That often works until it does not. The problem is tool calling. Many LLMs enforce a pairing requirement: every tool_use message must have a corresponding tool_result message. If you slice the array at an arbitrary index and cut a tool_use without its matching tool_result, the API rejects the request. The error is not always a clear "broken tool pair." Sometimes it is a generic invalid message array error, which takes a while to trace back to the pairing issue.
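To make the pairing problem concrete, here is a minimal sketch in TypeScript. The ChatMessage shape (with its toolUseId and toolResultFor fields) and the findOrphanedResults helper are hypothetical simplifications for illustration, not a provider's wire format and not part of agentfit. It shows how keeping only the tail of the array can leave a tool result whose request was cut.

```typescript
// Simplified message shape, assumed for illustration only.
type ChatMessage = {
  role: "system" | "user" | "assistant";
  content: string;
  toolUseId?: string; // set on a message that requests a tool call
  toolResultFor?: string; // set on the message that carries that call's result
};

// Returns the ids of tool results whose matching tool_use is not present earlier in the array.
function findOrphanedResults(msgs: ChatMessage[]): string[] {
  const seen = new Set<string>();
  const orphans: string[] = [];
  for (const m of msgs) {
    if (m.toolUseId) seen.add(m.toolUseId);
    if (m.toolResultFor && !seen.has(m.toolResultFor)) orphans.push(m.toolResultFor);
  }
  return orphans;
}

const convo: ChatMessage[] = [
  { role: "user", content: "Where is my order?" },
  { role: "assistant", content: "Looking it up.", toolUseId: "call_1" },
  { role: "user", content: "Order 1234: shipped", toolResultFor: "call_1" },
  { role: "assistant", content: "It shipped yesterday." },
];

// Naive trim: keep only the last two messages.
const naive = convo.slice(-2);

console.log(findOrphanedResults(convo)); // no orphans in the full history
console.log(findOrphanedResults(naive)); // call_1 is orphaned: its tool_use was sliced off
```

A check like this is only a diagnostic. The harder part is choosing a cut point that never produces an orphan in the first place.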

Beyond tool pairs, I also needed to keep the system prompt at position zero in the array. The system prompt sets the agent's behavior, its constraints, the persona it uses. Losing it mid-conversation meant the agent's behavior started drifting in ways that were hard to predict. The system prompt was the most important thing in the array and also the thing most likely to get dropped by a naive slice.

So the "drop oldest" implementation was not three lines. It was: identify the system prompt and lock it, scan forward from the oldest non-system messages, find the first safe drop boundary that does not break a tool pair, drop that group, recalculate the token count, repeat until the count is under the limit. I wrote this from scratch twice across different projects. The second time, I extracted it into a library.

That library is agentfit.

Shape of the fix

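import Anthropic from "@anthropic-ai/sdk";
import { AgentFit } from "@mukundakatta/agentfit";

// Client used in the request below; reads ANTHROPIC_API_KEY from the environment.
const anthropic = new Anthropic();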

const fit = new AgentFit({
  maxTokens: 100000,
  strategy: "drop_oldest",
  keepSystem: true,
});

// Pass your full message history:
const trimmed = fit.trim(messages);

// Safe to pass directly to your LLM client:
const response = await anthropic.messages.create({
  model: "claude-sonnet-4-6",
  messages: trimmed,
  max_tokens: 4096,
});

The trim method returns a new array and does not mutate the input. If the messages fit within maxTokens, it returns them unchanged. If they do not, it applies the strategy until the messages fit or it determines that no further trimming is possible without breaking required message constraints. In the latter case, it throws with a clear error explaining why trimming could not complete.

The token count is approximate by default. agentfit uses a character-based heuristic (roughly four characters per token) unless you pass a custom tokenizer. For development and prototyping, the heuristic is usually close enough. For production use where the context window limit is tight, pass a real tokenizer from your model provider's SDK.
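As a rough illustration of what a four-characters-per-token heuristic looks like, here is a minimal sketch. The function names are hypothetical, and the rounding (always up) and the absence of any per-message overhead are assumptions; the library's exact formula may differ.

```typescript
// Rough estimate: about four characters per token, rounded up.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// The estimate for a whole history is the sum over its message contents.
function estimateTotal(contents: string[]): number {
  return contents.reduce((sum, text) => sum + estimateTokens(text), 0);
}

console.log(estimateTokens("Where is my order?")); // 18 characters -> 5
```

A heuristic like this tends to drift on code, non-English text, and long identifiers, which is the reason to swap in a real tokenizer when the limit is tight.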

What it does NOT do

agentfit does not summarize dropped messages. It drops them. The history before the trim boundary is gone. If you want conversation compression, where dropped turns are replaced with a short summary that preserves context, that requires an LLM call. agentfit does not make LLM calls. It is a pure function on your message array.

agentfit also does not know your model's context window size. You pass maxTokens explicitly, based on the model you are using. If you pass the wrong value, the trim will target the wrong threshold. Check your model's documentation. For claude-sonnet-4-6 the context window is 200,000 tokens, but you typically want to leave headroom for the response, so a value like 190,000 or 180,000 is more realistic in practice.
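The headroom arithmetic can be written down explicitly. This is a small sketch, not something agentfit provides: inputBudget and its safetyMargin parameter are hypothetical names, and the margin value is an arbitrary choice that you would tune for your own tokenizer accuracy.

```typescript
// Budget for the message history = context window - reserved response tokens - safety margin.
function inputBudget(contextWindow: number, responseTokens: number, safetyMargin = 0): number {
  const budget = contextWindow - responseTokens - safetyMargin;
  if (budget <= 0) throw new Error("no room left for input messages");
  return budget;
}

// 200,000-token window, 4,096 tokens reserved for the response, and a margin
// that absorbs heuristic error: 200000 - 4096 - 5904 = 190000.
const maxTokensForTrim = inputBudget(200_000, 4_096, 5_904);
console.log(maxTokensForTrim); // 190000
```

The resulting number is what you would pass as maxTokens. A larger margin makes sense when you rely on the character heuristic rather than a real tokenizer.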

Multimodal messages with image content blocks are not supported. Image token counts vary significantly by resolution and model-specific processing, and there is no reliable character-based heuristic for them. If your messages include images, you will need to handle those token counts yourself before calling trim.

Inside the lib

The drop_oldest strategy works by scanning from the oldest non-system messages and identifying the first safe drop point. A safe drop point is a position where removing a message or group of messages does not orphan a tool_use or leave a tool_result without its corresponding tool_use. The scanner walks forward through the array, identifies message boundaries (user turns, assistant turns, complete tool pairs), drops the oldest complete unit, recalculates the token count using the tokenizer, and repeats until the count is under the limit.
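The loop described above can be sketched in a few dozen lines. This is an illustrative reimplementation under stated assumptions, not agentfit's source: the ChatMessage shape, the toUnits and dropOldest names, and the four-characters-per-token estimate are all hypothetical stand-ins.

```typescript
// Simplified message shape, assumed for illustration only.
type ChatMessage = {
  role: "system" | "user" | "assistant";
  content: string;
  toolUseId?: string; // set on a message that requests a tool call
  toolResultFor?: string; // set on the message that carries that call's result
};

const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Group messages into droppable units: a tool_use stays with everything up to its tool_result.
function toUnits(msgs: ChatMessage[]): ChatMessage[][] {
  const units: ChatMessage[][] = [];
  let open: ChatMessage[] | null = null; // unit still waiting for its tool_result
  let pendingId: string | undefined;
  for (const m of msgs) {
    if (open) {
      open.push(m);
      if (m.toolResultFor === pendingId) open = null; // pair is complete
    } else {
      const unit = [m];
      units.push(unit);
      if (m.toolUseId) {
        open = unit;
        pendingId = m.toolUseId;
      }
    }
  }
  return units;
}

function dropOldest(msgs: ChatMessage[], maxTokens: number, keepSystem = true): ChatMessage[] {
  const [first, ...tail] = msgs;
  // Lock the first message only when it is a system message.
  const locked: ChatMessage[] = keepSystem && first?.role === "system" ? [first] : [];
  const units = toUnits(locked.length > 0 ? tail : msgs);
  const current = () => locked.concat(...units);
  const total = () => current().reduce((n, m) => n + estimateTokens(m.content), 0);
  while (total() > maxTokens) {
    if (units.length === 0) throw new Error("cannot fit: locked messages exceed maxTokens");
    units.shift(); // drop the oldest complete unit, then recount
  }
  return current(); // a new array; the input is left untouched
}

const supportThread: ChatMessage[] = [
  { role: "system", content: "You are a support agent." },
  { role: "user", content: "Where is my order?" },
  { role: "assistant", content: "Looking it up.", toolUseId: "call_1" },
  { role: "user", content: "Order 1234: shipped", toolResultFor: "call_1" },
  { role: "assistant", content: "It shipped yesterday." },
];

// The full thread is estimated at 26 tokens. With a budget of 20, the first user turn
// goes first, then the tool pair is dropped as one unit, never half of it.
console.log(dropOldest(supportThread, 20).map((m) => m.role));
```

Grouping first and dropping whole units is what keeps the result valid at every step: there is no intermediate state in which a tool_use exists without its tool_result.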

There is also a drop_middle strategy for cases where both the beginning and end of a conversation are important, but the middle can be sacrificed. This comes up in task-oriented agents where the user's original intent (stated at the start of the conversation) and the current task state (the most recent turns) both need to stay in context, but the intermediate back-and-forth in between can be dropped. The drop_middle strategy identifies the oldest half of the non-system messages and targets those for removal before trimming from the recent end.
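One way to picture a middle-first strategy is the sketch below. It is a deliberately simple interpretation, not the library's selection rule: the Unit type, the dropMiddle name, and the choice to always remove the unit at the centre index are assumptions made for illustration. Units are assumed to be pre-grouped so that a tool pair is never split.

```typescript
// A unit is any group that must be kept or dropped together (for example a complete tool pair).
type Unit = { label: string; tokens: number };

function dropMiddle(units: Unit[], maxTokens: number): Unit[] {
  const kept = [...units]; // copy so the caller's array is not mutated
  const total = () => kept.reduce((n, u) => n + u.tokens, 0);
  // Keep the first and last units; remove from the centre until the total fits.
  while (total() > maxTokens && kept.length > 2) {
    kept.splice(Math.floor(kept.length / 2), 1);
  }
  if (total() > maxTokens) throw new Error("cannot fit without dropping the first or last unit");
  return kept;
}

const turns: Unit[] = [
  { label: "original request", tokens: 10 },
  { label: "clarification", tokens: 20 },
  { label: "tool pair", tokens: 30 },
  { label: "follow-up", tokens: 40 },
  { label: "current task state", tokens: 50 },
];

// 150 tokens in total; with a budget of 70 only the first and last units survive (60 tokens).
console.log(dropMiddle(turns, 70).map((u) => u.label));
```

The point of the design is that the two ends are protected by construction, so the opening intent and the latest state are the last things to go.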

The keepSystem flag, when true, marks the first message in the array as permanent if it has role: "system". It is never a candidate for dropping regardless of how many rounds of trimming are required. If your messages array does not start with a system message, the flag has no effect: the first message is locked because of its role, never merely because it sits at position zero.

The custom tokenizer interface is a single function: (text: string) => number. Pass it in the constructor as tokenizer. This function is called once per message when calculating the current token count. If you are using the Anthropic SDK, you can use the countTokens beta method. If you are using OpenAI, tiktoken produces accurate counts. The interface accepts any function that takes a string and returns a number, so any tokenizer works.
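Because the count is recalculated after every drop, a slow tokenizer may end up being called on the same message text many times. The sketch below wraps any (text: string) => number function in a cache. The memoizeTokenizer name and the word-counting stand-in are hypothetical, and whether agentfit already caches counts internally is not something this article states, so treat the wrapper as optional.

```typescript
type Tokenizer = (text: string) => number;

// Wrap any tokenizer so repeated counts of the same text are served from a cache.
function memoizeTokenizer(inner: Tokenizer): Tokenizer {
  const cache = new Map<string, number>();
  return (text) => {
    const hit = cache.get(text);
    if (hit !== undefined) return hit;
    const count = inner(text);
    cache.set(text, count);
    return count;
  };
}

// Stand-in tokenizer for the example: counts whitespace-separated words.
let innerCalls = 0;
const wordCount: Tokenizer = (text) => {
  innerCalls++;
  return text.split(/\s+/).filter(Boolean).length;
};

const cachedCount = memoizeTokenizer(wordCount);
cachedCount("Where is my order?"); // computed by the inner tokenizer
cachedCount("Where is my order?"); // served from the cache
console.log(innerCalls); // 1
```

The wrapped function has the same signature as the original, so it can be passed as tokenizer in the constructor in place of the raw one.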

When useful

  • Your agent runs long conversations and hits context window limits in production
  • You have tool-calling agents where a broken tool pair causes API validation errors
  • You want to keep the system prompt intact regardless of how aggressively the history is trimmed
  • You need a predictable, deterministic drop strategy rather than summarization or compression

When not useful

  • You need conversation compression where dropped history is replaced with a summary to preserve context
  • Your messages include images and you need accurate token counts for multimodal content
  • You want the trim to happen server-side or as part of the LLM request rather than client-side
  • Your conversations are short enough that context overflow is not a concern in your current deployment

Install

npm install @mukundakatta/agentfit
# or
yarn add @mukundakatta/agentfit

Requires Node 18+. Zero runtime dependencies. BYO tokenizer for accurate production token counts.

Siblings

| Library | What it does | Registry |
| --- | --- | --- |
| agentfit-rs | Rust port of the same message trimming concept | crates.io |
| agent-message-window | Sliding window with tool_use/tool_result pairing (Python) | PyPI |
| prompt-token-counter | Approximate token counts across model providers | crates.io / PyPI |
| @mukundakatta/agentsnap | Snapshot tests for agent tool call sequences | npm |
| @mukundakatta/agenttrace | Cost and latency tracking across LLM calls | GitHub |

What is next

The most useful addition is a summarize strategy that calls your LLM once to summarize the group of messages being dropped and injects a synthetic assistant message at the trim boundary. This preserves context continuity without keeping the full history. It would require passing an LLM caller function to the constructor, which the current interface does not need, so it will likely be an optional config field that enables that strategy.

Better multimodal support, where image blocks are counted using a model-specific formula rather than ignored, is also on the list. That needs per-model rules for how images convert to tokens and a way to identify image block sizes, but it is a natural follow-on once the text-only core is stable.
