Mukunda Rao Katta

Posted on May 25

agentfit-rs: Token-Aware Message Truncation for Rust LLM Agents

#hermeschallenge #ai #rust #agents

The 400 that took hours to debug

The agent had been running in production for a while. Short conversations worked fine. Then a user opened a long support thread. Four hundred messages back and forth. The agent kept up for the first hundred or so, then started returning errors.

The HTTP status was 400. The error body said something about context length. The fix seemed easy: drop the oldest messages until the history fits.

The code to do that was three lines. But those three lines dropped a tool_use message without dropping the tool_result that followed it. The API rejected the conversation because a tool_result without a paired tool_use is an invalid message sequence. The error was now different. It took two hours to understand why.

Then the fix for that was another few lines. But those lines dropped messages from the wrong end, removing the most recent context instead of the oldest. The agent started answering as if the last several user turns had never happened.

Every conversation that hit the length limit was broken in a different way depending on which fix was in place.

agentfit-rs handles all of this in one place.

The shape of the fix

[dependencies]
agentfit-rs = "0.1"

Trim a message list to fit a target token budget:

use agentfit_rs::{fit, Strategy, Message};

let messages: Vec<Message> = vec![
    Message::system("You are a helpful assistant."),
    Message::user("Hello"),
    Message::assistant("Hi! How can I help?"),
    // ... 400 more messages
];

// Target: 8192 tokens. Drop oldest user/assistant turns first.
let trimmed = fit(&messages, 8192, Strategy::Head)?;

Strategy::Head drops from the oldest end. Strategy::Tail drops from the newest end. Strategy::Middle drops from the center, preserving recent and early context.
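To make the three strategies concrete, here is a minimal sketch of which positions survive under each one. It is not the crate's implementation: the `Strategy` enum below is a local stand-in, `kept_indices` is a hypothetical helper, and it drops a fixed number of messages rather than working against a token budget.

```rust
/// Local stand-in for the crate's `Strategy`; variant names mirror the post.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Strategy {
    Head,   // drop from the oldest end
    Tail,   // drop from the newest end
    Middle, // drop from the center
}

/// Hypothetical helper: returns the indices (into a list of `len` messages)
/// that survive after dropping `drop_count` messages under `strategy`.
pub fn kept_indices(len: usize, drop_count: usize, strategy: Strategy) -> Vec<usize> {
    // Never try to drop more messages than exist.
    let drop_count = drop_count.min(len);
    match strategy {
        Strategy::Head => (drop_count..len).collect(),
        Strategy::Tail => (0..len - drop_count).collect(),
        Strategy::Middle => {
            // Center the dropped window. When an odd number of messages is
            // kept, the extra one lands on the recent side.
            let start = (len - drop_count) / 2;
            let end = start + drop_count;
            (0..start).chain(end..len).collect()
        }
    }
}
```

With six messages and two to drop, Head keeps positions 2 through 5, Tail keeps 0 through 3, and Middle keeps 0, 1, 4, and 5.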

The system message is never touched. By default it is excluded from the count, and it is always excluded from truncation. If you need to account for its tokens against the budget, you can mark it included:

use agentfit_rs::{FitOptions, Strategy};

let trimmed = FitOptions::new(8192, Strategy::Head)
    .system_counts_toward_budget(true)
    .fit(&messages)?;

If the system message alone exceeds the target budget, the crate returns Err(AgentFitError::SystemPromptTooLarge). It does not truncate the system message silently.

Paired tool calls are protected automatically:

use serde_json::json; // assumption: json! here is serde_json's macro

let messages = vec![
    Message::user("Fetch the current price."),
    Message::tool_use("call_01", "get_price", json!({ "ticker": "AAPL" })),
    Message::tool_result("call_01", "183.25"),
    Message::assistant("The current price is $183.25."),
];

// fit() will not drop the tool_use without also dropping the tool_result:
let trimmed = fit(&messages, 100, Strategy::Head)?;

The pairing check walks the message list and treats each tool_use / tool_result pair as an atomic unit. Both are dropped or neither is.
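One way to picture the atomic-unit idea is the sketch below. It is an illustration under stated assumptions, not the crate's code: `Msg` is a simplified local type, `atomic_units` and `drop_oldest_units` are hypothetical helpers, a tool_result is assumed to sit directly after its tool_use, and the per-message token counts are supplied by the caller.

```rust
/// Simplified local message type (an assumption, not the crate's `Message`).
#[derive(Debug, Clone, PartialEq)]
pub enum Msg {
    User(String),
    Assistant(String),
    ToolUse { id: String },
    ToolResult { id: String },
}

/// Groups message indices into droppable units. A `ToolUse` immediately
/// followed by a `ToolResult` with the same id forms one two-message unit;
/// every other message is a unit on its own.
pub fn atomic_units(messages: &[Msg]) -> Vec<Vec<usize>> {
    let mut units = Vec::new();
    let mut i = 0;
    while i < messages.len() {
        if let (Msg::ToolUse { id: use_id }, Some(Msg::ToolResult { id: result_id })) =
            (&messages[i], messages.get(i + 1))
        {
            if use_id == result_id {
                units.push(vec![i, i + 1]);
                i += 2;
                continue;
            }
        }
        units.push(vec![i]);
        i += 1;
    }
    units
}

/// Drops whole units from the oldest end until the surviving messages fit
/// `budget`. `tokens[i]` is the caller-supplied token count of message `i`.
/// Returns the indices of the messages that survive.
pub fn drop_oldest_units(units: &[Vec<usize>], tokens: &[usize], budget: usize) -> Vec<usize> {
    let unit_cost = |u: &Vec<usize>| u.iter().map(|&i| tokens[i]).sum::<usize>();
    let mut total: usize = units.iter().map(&unit_cost).sum();
    let mut first_kept = 0;
    while total > budget && first_kept < units.len() {
        total -= unit_cost(&units[first_kept]);
        first_kept += 1;
    }
    units[first_kept..].iter().flatten().copied().collect()
}
```

Because dropping works on units rather than single messages, the result can contain both halves of a tool call or neither, but never a lone tool_result.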

Counting tokens

The default estimator is chars / 4, a rough rule of thumb for English text rather than an exact count. For exact counts, enable the tiktoken feature:

[dependencies]
agentfit-rs = { version = "0.1", features = ["tiktoken"] }

With the feature enabled, the crate uses tiktoken's cl100k_base encoding by default. You can also supply a custom tokenizer via the Tokenizer trait if you need a provider-specific encoding.

The difference matters for code-heavy conversations. Code tokenizes at roughly 1 token per 3-4 chars in many encodings, and the ratio varies by language and formatting, so chars / 4 can undercount: at 3 chars per token, the true count is about a third higher than the estimate. If your agent handles code, use the tiktoken feature.
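The gap between ratios is easy to see with a small sketch. The `TokenCounter` trait and `CharRatio` type below are hypothetical and written only for this example; the crate's actual Tokenizer trait may have a different name and signature, and its estimator may round differently.

```rust
/// Hypothetical counting trait for this example; the crate's `Tokenizer`
/// trait may look different.
pub trait TokenCounter {
    fn count(&self, text: &str) -> usize;
}

/// Rough estimator: one token per `chars_per_token` characters.
pub struct CharRatio {
    pub chars_per_token: usize,
}

impl TokenCounter for CharRatio {
    fn count(&self, text: &str) -> usize {
        // Guard against a zero ratio so the division cannot panic.
        let per = self.chars_per_token.max(1);
        // Integer division rounds down, like a literal `chars / 4`.
        text.chars().count() / per
    }
}
```

For a 1,200-character snippet, a ratio of 4 estimates 300 tokens while a ratio of 3 estimates 400. If the budget check trusts the lower number, a code-heavy conversation can pass the check and still overflow the real context window.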

What it does NOT do

It does not make any LLM API calls. It trims a Vec of messages and hands it back. What you do with the trimmed list is your business.
It does not count image or document tokens. It tokenizes text content only. If your messages include base64 image blocks, those byte counts are not included.
It does not split a single long message. If one message exceeds the budget on its own, the crate returns an error rather than splitting mid-message.
It does not implement multi-modal token estimation. Vision models have different per-image token costs that are not covered by a text tokenizer.

Inside the lib: SystemPromptTooLarge is an error, not a silent truncation

Most message-trimming utilities treat the system prompt as just another message. If the history is too long, they start dropping from the top, and the system prompt can end up dropped or truncated along with everything else.

That is wrong for LLM agents. The system prompt defines the agent's persona, its tool definitions, its output format constraints, its safety instructions. Truncating it does not produce a shorter-context version of the same agent. It produces a different, usually worse, agent that does not know what it is supposed to do.

agentfit-rs treats the system message as untouchable. The truncation logic only ever drops non-system turns: user and assistant messages and paired tool calls. The system message is never shortened or dropped; it is passed through to the output unchanged.

But that creates a different problem: what if the system prompt itself is longer than the target budget? If the crate silently passes it through, the caller sends a request that will still be rejected. The caller has no way to know why.

The answer is to make it a hard error. SystemPromptTooLarge tells the caller exactly what happened. The system prompt is too long for the target window. The caller needs to shorten it, or increase the target, or choose a model with a larger context window. The crate cannot make that decision. It can only surface the problem clearly.

This feels strict, but the alternative is a silent failure that produces an agent behaving unexpectedly. Explicit errors are easier to debug than mysteriously bad agent output.
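The hard-error approach can be sketched in a few lines for the mode where the system prompt counts toward the budget. This is a hypothetical illustration, not the crate's source: `FitError` is a local type modeled on the SystemPromptTooLarge variant named above, its fields are assumptions, and `remaining_budget` is an invented helper.

```rust
use std::fmt;

/// Local error type modeled on the post's `SystemPromptTooLarge` variant.
/// The fields are assumptions added for illustration.
#[derive(Debug, PartialEq)]
pub enum FitError {
    SystemPromptTooLarge { system_tokens: usize, budget: usize },
}

impl fmt::Display for FitError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            FitError::SystemPromptTooLarge { system_tokens, budget } => write!(
                f,
                "system prompt needs {system_tokens} tokens but the budget is {budget}"
            ),
        }
    }
}

impl std::error::Error for FitError {}

/// Hypothetical helper: returns the tokens left for conversation turns, or
/// an error when the system prompt alone does not fit. It never shortens
/// the prompt to make it fit.
pub fn remaining_budget(system_tokens: usize, budget: usize) -> Result<usize, FitError> {
    if system_tokens > budget {
        return Err(FitError::SystemPromptTooLarge { system_tokens, budget });
    }
    Ok(budget - system_tokens)
}
```

Returning a Result forces the caller to decide what to do: shorten the prompt, raise the target, or switch models. The error carries both numbers, so the log line explains itself.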

When this is useful

Use agentfit-rs when:

You have a multi-turn agent and conversations can get long. Support agents, coding assistants, research agents, anything where the user interacts for more than a few turns.
You need to stay within a context window budget and want tool_use/tool_result pairing to be handled automatically.
You want to control the truncation strategy. Dropping from the oldest end (Head) is the default for most use cases, but tail or middle strategies are useful when you want to preserve early context (session setup, key facts established early).
You use tiktoken and want exact token counts rather than chars / 4 estimates.

When NOT to use it

Skip agentfit-rs when:

Your conversations are short and you have confirmed they will never approach the context window. The overhead is small but nonzero.
Your provider has native conversation summarization or compression that handles context management. Some APIs now support automatic context pruning server-side.
You need to truncate within a single message rather than across messages. For that you need a different tool.

Install

[dependencies]
agentfit-rs = "0.1"

# Optional: exact token counts via tiktoken (use this line instead of the one above)
# agentfit-rs = { version = "0.1", features = ["tiktoken"] }

GitHub: MukundaKatta/agentfit-rs

Requires Rust stable. Default build has zero additional dependencies. The tiktoken feature adds the tiktoken-rs crate.

Siblings

Lib	Boundary	Repo
agentfit (Python)	Same truncation semantics for Python agents	MukundaKatta/agentfit
agent-message-window	Python sliding window with the same paired-tool protection	MukundaKatta/agent-message-window
llm-content-blocks-rs	Build the content blocks that go into each message	MukundaKatta/llm-content-blocks-rs
prompt-token-counter-rs	Count tokens for budget checks upstream of truncation	MukundaKatta/prompt-token-counter-rs

The typical flow: prompt-token-counter-rs counts the current conversation and you compare the count against your budget. If the conversation is over budget, you call agentfit-rs with the appropriate strategy, and the trimmed message list goes to the API.
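The count, compare, trim flow can be sketched as a single generic function. Everything here is a stand-in: `prepare_for_request` is a hypothetical name, and the `count` and `trim` closures take the place of prompt-token-counter-rs and agentfit-rs, whose real signatures are not shown in this post.

```rust
/// Hypothetical glue function: trims only when the conversation is over
/// budget. `count` stands in for a token counter and `trim` for a
/// truncation call; both are assumptions, not real crate signatures.
pub fn prepare_for_request<M: Clone>(
    messages: &[M],
    budget: usize,
    count: impl Fn(&[M]) -> usize,
    trim: impl Fn(&[M], usize) -> Vec<M>,
) -> Vec<M> {
    if count(messages) <= budget {
        // Under budget: send the history unchanged.
        messages.to_vec()
    } else {
        // Over budget: let the trimming step decide what to drop.
        trim(messages, budget)
    }
}
```

Keeping the count and the trim as separate steps means the common case, a conversation that already fits, skips trimming entirely.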

What's next

Summarization hook. A callback that fires when the oldest block is about to be dropped, giving the caller a chance to summarize the dropped section and prepend a summary message before the remaining history. This preserves long-term context at the cost of one extra LLM call.
Target buffer. A configurable safety margin so the trimmed list targets budget - buffer tokens, leaving headroom for the response without re-trimming.
Per-message importance weights. A way to mark certain messages as high-priority so they are dropped last regardless of position. Useful for messages that contain key facts established early in the conversation.
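The target buffer is not shipped yet, but the arithmetic behind it is small enough to sketch. `effective_target` is a hypothetical helper, and the final API may look different.

```rust
/// Hypothetical helper: the token target after reserving `buffer` tokens of
/// headroom for the response. Saturates at zero instead of underflowing
/// when the buffer is larger than the budget.
pub fn effective_target(budget: usize, buffer: usize) -> usize {
    budget.saturating_sub(buffer)
}
```

With a budget of 8192 and a buffer of 1024, the trimmed list would target 7168 tokens.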

v0.1.0 shipped 2026-05-09. The core truncation and pairing protection are stable. If you find a message sequence where the pairing check drops a pair incorrectly, open an issue with the sequence.

Part of the Hermes Agent Challenge sprint. The full agent-stack series is at MukundaKatta on GitHub.
