Mukunda Rao Katta

Posted on May 25

llm-budget-window: Per-Minute and Per-Hour Token Caps That Actually Work

#hermeschallenge #ai #rust #agents

The 3 AM invoice

The agent ran overnight. Scheduled job, 2 AM start, supposed to process a backlog of 5,000 documents. By 3 AM it had processed 200 documents. By 6 AM the job was still running.

The invoice the next morning was $800. For a $20 job.

The loop had a bug. One document category triggered a recursive summarization chain that did not terminate correctly. Each document in that category launched five more LLM calls. Those five calls each produced output that triggered more calls. The agent did not have any way to stop itself.

There was a session budget limit. It was a total cap, not a rate cap. The session cap was set to $500 because that was what the whole batch was expected to cost. The runaway job stayed under that ceiling for hours before the recursive calls compounded enough to blow through it. By then, $800 was gone.

A per-minute token cap would have killed the job within minutes. When one document triggers 200 LLM calls in 60 seconds instead of the expected 2, a rate window detects that immediately and stops the loop. That is what llm-budget-window does.

Shape of the fix

[dependencies]
llm-budget-window = "0.1"

Define windows and record token usage after each call:

use llm_budget_window::{BudgetWindow, WindowSpec, BudgetError};

let budget = BudgetWindow::new(vec![
    WindowSpec::per_minute(10_000),   // 10k tokens per minute
    WindowSpec::per_hour(200_000),    // 200k tokens per hour
    WindowSpec::per_day(2_000_000),   // 2M tokens per day
]);

// After each LLM response, record actual usage:
let tokens_used = response.usage.input_tokens + response.usage.output_tokens;

match budget.record(tokens_used) {
    Ok(()) => { /* continue the loop */ }
    Err(BudgetError::WindowExceeded { window, used, limit, reset_in_secs }) => {
        println!(
            "Budget hit: {} used {}/{} tokens. Resets in {}s.",
            window, used, limit, reset_in_secs
        );
        // pause, retry after reset_in_secs, or abort
    }
}

Multiple windows are checked atomically. If the per-minute window is the binding constraint, the error says so and gives the reset time. If per-hour is binding, that window is reported. The first exceeded window wins.

You can also query current state without recording:

let status = budget.status();
for window in &status.windows {
    println!(
        "{}: {}/{} tokens, resets in {}s",
        window.name, window.used, window.limit, window.reset_in_secs
    );
}

USD budgets work the same way. Pair with claude-cost or bedrock-cost to compute cost per call, then record USD instead of tokens:

let budget_usd = BudgetWindow::new(vec![
    WindowSpec::per_minute_usd(0.10),   // $0.10 per minute
    WindowSpec::per_hour_usd(1.50),     // $1.50 per hour
]);

let turn_cost = estimate_cost(&model_id, &token_counts)?;
budget_usd.record_usd(turn_cost.total_usd)?;

What it does NOT do

This crate does not share state across threads or processes by itself. The BudgetWindow is a single-process, in-memory structure. If you run 10 parallel agent workers and want a shared rate limit across all of them, you need to wrap the window in an Arc<Mutex<BudgetWindow>> or move the budget check to a shared service. The crate gives you the window logic; you provide the coordination layer. For cross-process limits, a Redis-backed token bucket or a sidecar rate limiter is the right tool.
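
The Arc<Mutex<...>> wrapping mentioned above can be sketched as follows. StandInBudget is a hypothetical placeholder, not the crate's BudgetWindow: it is a plain counter with a limit, used only so the example compiles on its own. The part that carries over is the sharing pattern, where every worker clones the Arc and takes the lock for the duration of one record call.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for BudgetWindow: a counter with a hard limit.
/// Unlike the crate, it rejects before counting and has no time window.
pub struct StandInBudget {
    used: u64,
    limit: u64,
}

impl StandInBudget {
    pub fn new(limit: u64) -> Self {
        Self { used: 0, limit }
    }

    /// Returns Err(tokens already used) if this call would cross the limit.
    pub fn record(&mut self, tokens: u64) -> Result<(), u64> {
        if self.used + tokens > self.limit {
            return Err(self.used);
        }
        self.used += tokens;
        Ok(())
    }
}

/// Spawns `workers` threads that share one budget.
/// Returns (accepted calls, rejected calls) summed over all workers.
pub fn run_workers(limit: u64, workers: usize, calls_each: usize, tokens: u64) -> (usize, usize) {
    let shared = Arc::new(Mutex::new(StandInBudget::new(limit)));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let budget = Arc::clone(&shared);
            thread::spawn(move || {
                let mut accepted = 0usize;
                let mut rejected = 0usize;
                for _ in 0..calls_each {
                    // The lock guard is dropped at the end of this statement.
                    let outcome = budget.lock().expect("budget mutex poisoned").record(tokens);
                    match outcome {
                        Ok(()) => accepted += 1,
                        Err(_) => rejected += 1,
                    }
                }
                (accepted, rejected)
            })
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .fold((0usize, 0usize), |(a, r), (x, y)| (a + x, r + y))
}
```

Holding the lock only around the record call keeps contention low, because the LLM call itself happens outside the critical section.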

Inside the lib

The sliding window implementation is a ring buffer of time-stamped token counts.

Each call to record(n) appends an entry (now, n) to the ring buffer. Before returning, it sweeps the buffer to drop entries older than the window duration, sums the remaining entries, and checks the sum against the limit. The ring buffer has a fixed maximum capacity. If the buffer is full and a new entry arrives, the oldest entry is evicted. This caps memory use regardless of call frequency, at the cost of undercounting if an evicted entry was still inside a window.

Ring buffer: [(t1, 500), (t2, 800), (t3, 1200), ...]
                                    ^--- now

For a 60-second window:
Drop all entries where (now - t_i) > 60 seconds.
Sum the rest. Compare to limit.
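
The append, sweep, sum, and compare steps can be sketched with a VecDeque. This is a minimal illustration under assumptions, not the crate's source: SlidingWindow, record_at, and the explicit `now` parameter are hypothetical names chosen so the behavior can be exercised without waiting on a real clock.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Hypothetical single-window sketch; not the crate's real type.
pub struct SlidingWindow {
    width: Duration,
    limit: u64,
    capacity: usize,
    entries: VecDeque<(Instant, u64)>,
}

impl SlidingWindow {
    pub fn new(width: Duration, limit: u64, capacity: usize) -> Self {
        Self { width, limit, capacity, entries: VecDeque::with_capacity(capacity) }
    }

    /// Records `tokens` at time `now`.
    /// Returns Ok(total in window), or Err(total) if the total exceeds the limit.
    pub fn record_at(&mut self, now: Instant, tokens: u64) -> Result<u64, u64> {
        // Fixed capacity: evict the oldest entry when the buffer is full.
        if self.entries.len() == self.capacity {
            self.entries.pop_front();
        }
        self.entries.push_back((now, tokens));

        // Sweep: drop entries older than the window width.
        while let Some(&(t, _)) = self.entries.front() {
            if now.saturating_duration_since(t) > self.width {
                self.entries.pop_front();
            } else {
                break;
            }
        }

        // Sum what is left and compare to the limit.
        let used: u64 = self.entries.iter().map(|&(_, n)| n).sum();
        if used > self.limit { Err(used) } else { Ok(used) }
    }
}
```

Because entries are appended in time order, the sweep only needs to look at the front of the queue and stops at the first entry that is still inside the window.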

The alternative design would be a count-per-second array: one bucket per second, 60 buckets for a per-minute window. That approach uses fixed memory proportional to the window width in seconds. For a per-day window at 1-second resolution, that is 86,400 buckets per window. The ring buffer instead holds one entry per call that is still inside the window, so its memory tracks the number of recent calls rather than the window width in seconds. At 10 calls per minute, a one-minute window holds about 10 entries; a 24-hour window at the same rate would hold up to 14,400 (10 × 1,440 minutes), still well under 86,400 buckets, and never more than the buffer's fixed capacity.
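
For comparison, the count-per-second alternative might look like the sketch below. BucketWindow and its methods are hypothetical names, and time is passed in as whole seconds to keep the example deterministic. The point is the memory shape: the bucket vector is allocated once at the window width and never grows or shrinks with traffic.

```rust
/// Hypothetical fixed-bucket window: one (second stamp, tokens) slot
/// per second of window width. Not part of the crate.
pub struct BucketWindow {
    buckets: Vec<(u64, u64)>,
    limit: u64,
}

impl BucketWindow {
    pub fn new(width_secs: usize, limit: u64) -> Self {
        assert!(width_secs > 0, "window width must be at least one second");
        Self { buckets: vec![(0, 0); width_secs], limit }
    }

    /// Memory is fixed by the window width, not by call frequency.
    pub fn bucket_count(&self) -> usize {
        self.buckets.len()
    }

    /// Records `tokens` at `now_secs` (whole seconds on any monotonic clock).
    /// Returns Ok(total in window), or Err(total) if the total exceeds the limit.
    pub fn record_at(&mut self, now_secs: u64, tokens: u64) -> Result<u64, u64> {
        let width = self.buckets.len() as u64;
        let slot = (now_secs % width) as usize;
        // Reuse the slot if it still holds an older second.
        if self.buckets[slot].0 != now_secs {
            self.buckets[slot] = (now_secs, 0);
        }
        self.buckets[slot].1 += tokens;

        // Sum only buckets stamped within the last `width` seconds.
        let used: u64 = self
            .buckets
            .iter()
            .filter(|&&(sec, _)| now_secs.saturating_sub(sec) < width)
            .map(|&(_, n)| n)
            .sum();
        if used > self.limit { Err(used) } else { Ok(used) }
    }
}
```

Every call here scans all buckets, so a per-day window pays for 86,400 slots even when the agent makes a handful of calls per hour.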

Multiple windows share one ring buffer. Each entry is checked against all window durations on every call. This is O(entries) per call per window, which is fine when call rates are moderate. For extremely high-frequency applications (thousands of calls per second), a dedicated high-performance rate limiter is a better fit.

The atomic guarantee is that all windows are checked against the same snapshot of the buffer state. You do not get a situation where the per-minute window says OK but the per-hour window rejects the call based on a stale read.
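
A minimal sketch of the shared-buffer check, assuming windows are evaluated in the order they were declared. MultiWindow, Exceeded, and record_at are hypothetical names, not the crate's API. Each record call appends once, trims against the widest window, and then evaluates every window against that same buffer state.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Hypothetical error payload: which window tripped and by how much.
#[derive(Debug, PartialEq)]
pub struct Exceeded {
    pub window_index: usize,
    pub used: u64,
    pub limit: u64,
}

/// Hypothetical multi-window budget backed by one shared buffer.
pub struct MultiWindow {
    windows: Vec<(Duration, u64)>, // (width, limit), in declaration order
    entries: VecDeque<(Instant, u64)>,
}

impl MultiWindow {
    pub fn new(windows: Vec<(Duration, u64)>) -> Self {
        Self { windows, entries: VecDeque::new() }
    }

    pub fn record_at(&mut self, now: Instant, tokens: u64) -> Result<(), Exceeded> {
        self.entries.push_back((now, tokens));

        // Entries must survive as long as the widest window needs them.
        let widest = self.windows.iter().map(|&(w, _)| w).max().unwrap_or(Duration::ZERO);
        while let Some(&(t, _)) = self.entries.front() {
            if now.saturating_duration_since(t) > widest {
                self.entries.pop_front();
            } else {
                break;
            }
        }

        // Every window is checked against the same snapshot of the buffer.
        for (i, &(width, limit)) in self.windows.iter().enumerate() {
            let used: u64 = self
                .entries
                .iter()
                .filter(|&&(t, _)| now.saturating_duration_since(t) <= width)
                .map(|&(_, n)| n)
                .sum();
            if used > limit {
                // The first exceeded window wins.
                return Err(Exceeded { window_index: i, used, limit });
            }
        }
        Ok(())
    }
}
```

Taking `&mut self` means no other caller can modify the buffer between the per-minute and per-hour checks, which is what makes the snapshot consistent within a single thread.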

When useful

Overnight or long-running agent jobs where a runaway loop could burn a large budget before anyone notices.
Batch processors where each item should cause a predictable number of LLM calls. A per-minute cap that is 5x the expected rate will catch runaway items without blocking normal operation.
Multi-turn agents that can accidentally loop. Detecting a high per-minute rate is often faster than detecting a logical loop condition.
Development and staging environments where you want a hard ceiling on API spend regardless of what the code does.
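
The "5x the expected rate" rule of thumb for batch processors is simple arithmetic. The helper below is a hypothetical sketch, not part of the crate, and the inputs (items per minute, calls per item, tokens per call) are assumptions you would measure from a normal run.

```rust
/// Hypothetical helper: expected tokens per minute times a headroom factor.
/// Saturating multiplication avoids overflow on absurd inputs.
pub fn suggested_per_minute_cap(
    items_per_minute: u64,
    calls_per_item: u64,
    tokens_per_call: u64,
    headroom: u64,
) -> u64 {
    items_per_minute
        .saturating_mul(calls_per_item)
        .saturating_mul(tokens_per_call)
        .saturating_mul(headroom)
}
```

With an assumed 1 item per minute, 2 calls per item, and 1,000 tokens per call, a 5x headroom gives a cap of 10,000 tokens per minute.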

When NOT

Single-user, interactive applications where one request per user turn is normal. A per-minute cap on a chat app would incorrectly block a user who sends many messages quickly.
Applications where the request rate is genuinely variable and high peaks are expected. A product that runs batch jobs at unpredictable sizes needs total-budget caps, not rate caps.
When the bottleneck is already the LLM provider's own rate limit. If the provider is returning 429s, the provider is already enforcing a rate limit. Adding a client-side rate cap on top will not help and adds complexity.

Install

[dependencies]
llm-budget-window = "0.1"

crates.io: llm-budget-window
GitHub: MukundaKatta/llm-budget-window

Siblings

Crate / Package	What it does
token-budget-pool	Thread-safe total session cap, not time-windowed; pair for belt-and-suspenders
token-budget-py	Python port of token-budget-pool
llm-cost-cap	Pre-flight cost gate: estimates cost before the call and rejects if over budget
llm-circuit-breaker	Open/closed/half-open circuit breaker; stops calling a provider that keeps failing
claude-cost	Compute per-call USD cost for Anthropic API calls, including cache read rates

What is next

The most requested feature is a backpressure mode that does not reject the call but instead blocks (or returns a future that resolves after the reset time) until the window clears. Right now, record returns Err immediately when a window is exceeded, and the caller decides whether to sleep, abort, or queue. A built-in backpressure mode with an async record_or_wait function would cut the boilerplate for the common "wait and retry" pattern.
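
Until something like that ships, the caller-side "wait and retry" boilerplate might look like the sketch below. Everything here is an assumption for illustration: wait_for_budget is a hypothetical helper, `try_acquire` is a closure that reports Err(reset_in_secs) while the budget is exhausted, and `sleep` is injected so the loop can be tested without real delays. It is synchronous, not the async design described above.

```rust
use std::time::Duration;

/// Hypothetical retry loop. Calls `try_acquire` up to `max_attempts` times,
/// sleeping for the reported reset time between failed attempts.
/// Returns Ok(attempt number that succeeded) or Err(last reset_in_secs).
pub fn wait_for_budget<A, S>(mut try_acquire: A, mut sleep: S, max_attempts: u32) -> Result<u32, u64>
where
    A: FnMut() -> Result<(), u64>,
    S: FnMut(Duration),
{
    let mut last_reset = 0;
    for attempt in 1..=max_attempts {
        match try_acquire() {
            Ok(()) => return Ok(attempt),
            Err(reset_in_secs) => {
                last_reset = reset_in_secs;
                // No point sleeping after the final attempt.
                if attempt < max_attempts {
                    sleep(Duration::from_secs(reset_in_secs));
                }
            }
        }
    }
    Err(last_reset)
}
```

In production the `sleep` argument could simply be std::thread::sleep. Bounding the attempts matters for the runaway-loop case: an agent that keeps hitting the cap should eventually abort, not wait forever.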

Separate per-model windows are also on the list. A single budget window covers all models combined. If you are running multiple models simultaneously and want separate per-minute caps per model (to match per-model rate limits from the provider), you need multiple BudgetWindow instances today. A BudgetWindowMap that routes by model ID would make that cleaner.
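
The multiple-instances workaround can be wrapped in a small router keyed by model ID. PerModelBudgets below is a hypothetical sketch of that idea, not the planned BudgetWindowMap. It is generic over the budget type `B` so it makes no assumptions about the crate's API.

```rust
use std::collections::HashMap;

/// Hypothetical router: one budget instance per model ID.
pub struct PerModelBudgets<B> {
    budgets: HashMap<String, B>,
}

impl<B> PerModelBudgets<B> {
    pub fn new() -> Self {
        Self { budgets: HashMap::new() }
    }

    /// Registers (or replaces) the budget for `model_id`.
    pub fn insert(&mut self, model_id: &str, budget: B) {
        self.budgets.insert(model_id.to_string(), budget);
    }

    /// Runs `f` against the budget for `model_id`.
    /// Returns None if no budget is registered for that model.
    pub fn with_model<T>(&mut self, model_id: &str, f: impl FnOnce(&mut B) -> T) -> Option<T> {
        self.budgets.get_mut(model_id).map(f)
    }
}
```

Returning None for an unregistered model forces the caller to decide whether an unknown model should be blocked or allowed through uncapped.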

Part of the Hermes Agent Challenge sprint. All crates shipped on crates.io.

DEV Community