Rust: Check If Your Payload Fits a Model's Context Window Before You Send It

#hermeschallenge #ai #rust #agents

The 400 came back from the API with "prompt too long."

The Rust agent had been building context by appending tool results to a message list. At some point the accumulated context exceeded the model's limit. The error landed in the error handler, which had to log it, truncate the context, and retry. That worked, but it was backward. The truncation decision belonged before the API call, not after it.

Adding a pre-flight check was three lines.

The shape of the fix

use prompt_token_counter::{FitsRequest, SystemSpec};

let messages_json = serde_json::json!([
    {"role": "user", "content": "What are the quarterly results?"},
    {"role": "assistant", "content": long_tool_output},
]);

let fits = FitsRequest::new(messages_json)
    .system("You are a financial analyst.")
    .limit(SystemSpec::Claude3Sonnet)
    .fits();

if !fits {
    // truncate before calling the API
}

That is the whole check. No API call. No SDK. No tokenizer install required.

The default counting method is chars / 4. That is a rough approximation: it tracks English prose reasonably well, but it can undercount code, JSON, and non-English text. Leave some headroom below the limit rather than treating the estimate as exact. If you need precision, you can plug in a real tokenizer.
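To make the heuristic concrete, here is a minimal std-only sketch of a chars / 4 estimate with an explicit headroom margin. It is not the crate's code: the names `estimate_tokens` and `fits_with_headroom`, the rounding-up behaviour, and the percentage-based margin are all assumptions made for illustration.

```rust
/// Rough token estimate: one token per four characters, rounded up.
/// Hypothetical helper; the crate's own rounding may differ.
fn estimate_tokens(text: &str) -> usize {
    // Count Unicode scalar values rather than bytes so multi-byte
    // characters are not counted several times.
    let chars = text.chars().count();
    (chars + 3) / 4
}

/// Returns true when the estimate fits inside `limit` after reserving
/// `headroom_pct` percent of the window as a safety margin.
fn fits_with_headroom(text: &str, limit: usize, headroom_pct: usize) -> bool {
    let usable = limit - limit * headroom_pct.min(100) / 100;
    estimate_tokens(text) <= usable
}
```

Reserving a slice of the window this way is one option for absorbing the error in the estimate without pulling in a tokenizer.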

use prompt_token_counter::{FitsRequest, SystemSpec, TokenCounter};

struct TiktokenCounter;

impl TokenCounter for TiktokenCounter {
    fn count(&self, text: &str) -> usize {
        // call your tiktoken binding here
        text.len() / 3 // placeholder: byte length / 3, not a real token count
    }
}

let fits = FitsRequest::new(messages_json)
    .counter(TiktokenCounter)
    .limit(SystemSpec::Gpt4o)
    .fits();
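The reason a trait works well here is that the fit check only needs a `count` method, not a specific tokenizer. The sketch below redefines a local trait with the same method signature as the example above and shows two interchangeable counters. `CharsPerFour`, `TwoPerWord`, and `fits_with` are hypothetical names, and neither counter is a real tokenizer.

```rust
/// Same method signature as the trait used in the example above.
trait TokenCounter {
    fn count(&self, text: &str) -> usize;
}

/// Default-style heuristic: one token per four characters.
struct CharsPerFour;

impl TokenCounter for CharsPerFour {
    fn count(&self, text: &str) -> usize {
        text.chars().count() / 4
    }
}

/// Deliberately pessimistic counter: every whitespace-separated word
/// is charged as two tokens. Illustrative only.
struct TwoPerWord;

impl TokenCounter for TwoPerWord {
    fn count(&self, text: &str) -> usize {
        text.split_whitespace().count() * 2
    }
}

/// The fit check depends only on the trait, not on a concrete counter.
fn fits_with(counter: &dyn TokenCounter, parts: &[&str], limit: usize) -> bool {
    let total: usize = parts.iter().map(|p| counter.count(p)).sum();
    total <= limit
}
```

The same payload can pass under one counter and fail under another, which is why the choice of counter matters most when the context is close to the limit.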

What it does NOT do

It does not call any LLM API. It counts locally.
It does not handle tokenizer installation or model-specific vocabulary files. If you want tiktoken or sentencepiece accuracy, you wire that in through the TokenCounter trait.
It does not truncate for you. It tells you whether the payload fits. Truncation strategy is your decision.
It does not account for tool schema token overhead. It counts message content. If your API client adds tool definitions to the payload, those tokens are not included unless you pass them in explicitly.
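The last point can be handled by estimating the serialized tool definitions separately and adding them to the total. A minimal sketch, assuming the same chars / 4 heuristic and hypothetical helper names (`estimate_tokens`, `total_with_tools`); it does not use the crate's API:

```rust
/// chars / 4 estimate, rounded down (hypothetical helper).
fn estimate_tokens(text: &str) -> usize {
    text.chars().count() / 4
}

/// Sums the estimates for the system prompt, each message body, and the
/// serialized tool definitions the API client will attach to the request.
fn total_with_tools(system: &str, messages: &[&str], tools_json: &str) -> usize {
    let message_tokens: usize = messages.iter().map(|m| estimate_tokens(m)).sum();
    estimate_tokens(system) + message_tokens + estimate_tokens(tools_json)
}
```

Tool schemas are usually static for the lifetime of an agent, so their estimate can be computed once and reused on every turn.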

Inside the lib

SystemSpec is an enum of known models with their context window sizes:

pub enum SystemSpec {
    Claude3Sonnet,   // 200_000 tokens
    Claude3Opus,     // 200_000 tokens
    Gpt4o,           // 128_000 tokens
    Gpt4Turbo,       // 128_000 tokens
    Custom(usize),   // caller-specified
}

The interesting design choice: SystemSpec::max_tokens() returns the window size for known variants, but the builder does not require you to pick a known model. You can pass a raw usize limit through SystemSpec::Custom. Unknown models are not rejected; the caller sets the limit. This keeps the crate from going stale as new models ship, because there is no internal allowlist that has to be updated.

// works for any model, now or future
FitsRequest::new(payload)
    .limit(SystemSpec::Custom(1_000_000))
    .fits()
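For readers who want to see how little machinery this takes, here is a sketch of what a `max_tokens()` lookup over that enum could look like. The window sizes come from the comments in the enum above. The derive list and the by-value `self` receiver are assumptions, not the crate's actual signature.

```rust
/// Mirror of the enum shown above, reduced to what the sketch needs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SystemSpec {
    Claude3Sonnet,
    Claude3Opus,
    Gpt4o,
    Gpt4Turbo,
    Custom(usize),
}

impl SystemSpec {
    /// Context window size in tokens.
    pub fn max_tokens(self) -> usize {
        match self {
            SystemSpec::Claude3Sonnet | SystemSpec::Claude3Opus => 200_000,
            SystemSpec::Gpt4o | SystemSpec::Gpt4Turbo => 128_000,
            // No lookup table involved: the caller's number is the limit.
            SystemSpec::Custom(limit) => limit,
        }
    }
}
```

Because `Custom` carries its own value, a new model only needs a number from its provider's documentation, not a new release of the crate.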

FitsRequest counts system prompt tokens, message tokens, and any extra payload you pass. It sums them, compares to the limit, and returns a bool. The builder pattern keeps it composable.

let token_count = FitsRequest::new(messages)
    .system(system_prompt)
    .count(); // returns usize instead of bool

count() is useful when you want to log the token estimate before deciding what to do.
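The sum-and-compare logic described above is small enough to sketch in plain std Rust. `FitsSketch` below is a hypothetical stand-in, not the crate's implementation: it takes messages as plain strings instead of serde_json::Value, takes the limit as a raw usize, and assumes a chars / 4 estimate rounded down.

```rust
/// Hypothetical stand-in for the crate's builder; not its real code.
#[derive(Default)]
struct FitsSketch {
    system: Option<String>,
    messages: Vec<String>,
    limit: usize, // defaults to 0, so nothing non-empty fits until a limit is set
}

impl FitsSketch {
    fn new(messages: Vec<String>) -> Self {
        FitsSketch { messages, ..Default::default() }
    }

    fn system(mut self, prompt: &str) -> Self {
        self.system = Some(prompt.to_string());
        self
    }

    fn limit(mut self, max_tokens: usize) -> Self {
        self.limit = max_tokens;
        self
    }

    /// Sum of the estimates for the system prompt and every message.
    fn count(&self) -> usize {
        let system = self.system.as_deref().map_or(0, estimate);
        let messages: usize = self.messages.iter().map(|m| estimate(m)).sum();
        system + messages
    }

    /// True when the summed estimate is within the limit.
    fn fits(&self) -> bool {
        self.count() <= self.limit
    }
}

/// chars / 4 estimate, rounded down.
fn estimate(text: &str) -> usize {
    text.chars().count() / 4
}
```

Keeping `count()` and `fits()` as separate methods over the same state is what lets a caller log the estimate and make the decision in one pass.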

The serde_json dependency is the only required dep. The library accepts serde_json::Value for message payloads so it stays compatible with any JSON-shaped message format.

When useful

You are building an agent loop in Rust that accumulates context across turns. Before each API call, you want a cheap local check that the payload fits. This is cheaper than handling a 400 error, which requires a retry path, error logging, and extra latency.

You are doing context management: picking which messages to keep in the sliding window. You use count() to measure each candidate window before selecting one to send.
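As an illustration of that second case, the sketch below keeps the newest messages that fit a budget and drops older ones first. It uses a local chars / 4 estimate in place of the crate's count(), and `estimate` and `newest_window` are hypothetical names. It also ignores pairing between tool calls and tool results, which a real agent would need to respect.

```rust
/// chars / 4 estimate, rounded down (stand-in for a real count).
fn estimate(text: &str) -> usize {
    text.chars().count() / 4
}

/// Returns the longest run of most-recent messages whose combined
/// estimate stays within `budget`. Older messages are dropped first.
fn newest_window(messages: &[String], budget: usize) -> &[String] {
    let mut used = 0;
    let mut start = messages.len();
    // Walk backwards from the newest message, growing the window
    // until the next message would exceed the budget.
    for (i, msg) in messages.iter().enumerate().rev() {
        let cost = estimate(msg);
        if used + cost > budget {
            break;
        }
        used += cost;
        start = i;
    }
    &messages[start..]
}
```

If even the newest message is over budget, the returned slice is empty, and the caller has to decide whether to truncate that message or fail.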

You are writing tests for context management logic. The crate has no external side effects, so tests are fast and deterministic.

When NOT useful

If your payload is always a single short user message and a short system prompt, the check adds no value. You already know it fits.

If you need exact token counts that match the API's billing math, the chars / 4 default will not be precise enough. Integrate your own tokenizer through the TokenCounter trait, or accept that the estimate is approximate and move on.

If you are working in Python, look at the Python sibling instead. The Rust crate is for Rust agent codebases.

Install

[dependencies]
prompt-token-counter = "0.1"
serde_json = "1"

Siblings

Lib	Boundary	Repo
prompt-token-counter (Python)	Same concept, Python API	MukundaKatta/prompt-token-counter
agentfit-rs	Token-aware message truncation	MukundaKatta/agentfit-rs
agent-message-window (Python)	Sliding window with paired-call protection	MukundaKatta/agent-message-window
llm-stop-conditions (Python)	MaxTokens stop condition composable	MukundaKatta/llm-stop-conditions
token-budget-pool	Concurrent token/USD budget pool	MukundaKatta/token-budget-pool

What is next

A natural addition is a split_at_limit() helper that takes a message list and a limit and returns the largest prefix that fits. That covers the most common use case: given all accumulated messages, what is the most context I can send?
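One possible shape for such a helper, sketched with std only. The crate does not ship this function; the signature, the tuple return type, and the chars / 4 estimate are all assumptions.

```rust
/// chars / 4 estimate, rounded down (stand-in for a real count).
fn estimate(text: &str) -> usize {
    text.chars().count() / 4
}

/// Splits `messages` into the largest prefix that fits `limit`
/// and the remainder that does not.
fn split_at_limit(messages: &[String], limit: usize) -> (&[String], &[String]) {
    let mut used = 0;
    let mut cut = 0;
    for msg in messages {
        let cost = estimate(msg);
        if used + cost > limit {
            break;
        }
        used += cost;
        cut += 1;
    }
    messages.split_at(cut)
}
```

Returning the remainder alongside the prefix lets the caller summarize or archive the messages that did not fit instead of silently losing them.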

Per-model overhead constants for tool schema tokens would make the estimate more accurate for tool-heavy agents. That data exists in Anthropic's documentation and could be encoded as constants per SystemSpec variant.

Source: MukundaKatta/prompt-token-counter-rs

Part of the Hermes Agent Challenge sprint.