I run 30 agents in parallel. They share one budget. Here is the pool primitive that makes that safe.

#hermeschallenge #ai #rust #concurrency

I run a fleet of agents. They are independent of each other. They share two things: the same Anthropic API key, and the same monthly budget.

The naive way to enforce a budget is one shared counter. Every agent increments the counter after each call. When the counter hits the cap, refuse new calls.

The naive way is wrong in two specific ways.

First, agents that check the counter at the same instant can race past the cap together. Two agents both see "$99 spent, $1 left, my next call costs $0.75, looks safe" and both proceed. Now you have spent $100.50. The cap was a hint, not a wall.

Second, after-the-fact accounting means an agent that submits an expensive long prompt does not learn about budget exhaustion until after the request has already been billed. You wanted the budget to refuse the call. You got an accounting report.
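The first failure mode can be replayed deterministically on a single thread. This is a minimal sketch, assuming a $100.00 cap with $99.00 already spent and two calls costing $0.75 each, all tracked in cents. `NaiveCounter` and `replay_race` are illustrative names, not part of any crate.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical naive budget: one shared counter, amounts in cents.
pub struct NaiveCounter {
    spent_cents: AtomicU64,
    cap_cents: u64,
}

impl NaiveCounter {
    pub fn new(spent_cents: u64, cap_cents: u64) -> Self {
        Self { spent_cents: AtomicU64::new(spent_cents), cap_cents }
    }

    /// Step 1: read the counter and decide whether the call "looks safe".
    pub fn looks_safe(&self, cost_cents: u64) -> bool {
        self.spent_cents.load(Ordering::SeqCst) + cost_cents <= self.cap_cents
    }

    /// Step 2: record the spend once the call has returned.
    pub fn record(&self, cost_cents: u64) {
        self.spent_cents.fetch_add(cost_cents, Ordering::SeqCst);
    }

    pub fn spent_cents(&self) -> u64 {
        self.spent_cents.load(Ordering::SeqCst)
    }
}

/// Replays the bad interleaving: both agents check before either records.
pub fn replay_race() -> u64 {
    let counter = NaiveCounter::new(9_900, 10_000); // $99.00 spent of $100.00
    let agent_a_ok = counter.looks_safe(75);
    let agent_b_ok = counter.looks_safe(75);
    if agent_a_ok {
        counter.record(75);
    }
    if agent_b_ok {
        counter.record(75);
    }
    counter.spent_cents() // 10_050 cents, i.e. $100.50
}
```

Each individual operation here is atomic. The bug is that the check and the increment are two separate steps, so nothing stops another agent from slipping in between them.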

token-budget-pool is the Rust crate I wrote so that neither happens. It is on crates.io under that name. The whole library is 250 lines.

The shape of the fix

The pool exposes a two-phase API: reserve then commit (or release).

use token_budget_pool::{BudgetPool, BudgetExceeded};

let pool = BudgetPool::new()
    .with_token_cap(1_000_000)
    .with_usd_cap(50.0);

// Before the LLM call, reserve the expected cost.
let estimate = estimate_tokens(&messages);  // your estimator
let estimated_usd = estimate as f64 * 0.000003;

let reservation = pool.reserve(estimate, estimated_usd)?;  // Err(BudgetExceeded) if the reservation would exceed a cap

// Now make the LLM call. The cap is held.
let response = call_llm(messages).await?;

// Commit the actual usage, release any unused part of the reservation.
let actual_tokens = response.usage.total_tokens;
let actual_usd = response.cost_usd;
reservation.commit(actual_tokens, actual_usd);

The ? on reserve is the wall. If two agents try to reserve at the same time and only one reservation fits, the second is refused. No race, no overrun.
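One way to get that guarantee is to make the check and the increment a single atomic step. The following is a sketch of the idea using a compare-and-swap loop from the standard library, not the crate's actual internals. `TokenCap` and `race_for_last_slot` are hypothetical names, and the sketch tracks tokens only.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// Hypothetical stand-in for the token side of a pool.
pub struct TokenCap {
    held: AtomicU64, // tokens currently reserved
    cap: u64,
}

impl TokenCap {
    pub fn new(cap: u64) -> Self {
        Self { held: AtomicU64::new(0), cap }
    }

    /// Check and increment as one atomic step. `fetch_update` retries the
    /// closure if another thread changed the value in the meantime.
    pub fn try_reserve(&self, tokens: u64) -> bool {
        self.held
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |held| {
                let next = held.checked_add(tokens)?;
                if next <= self.cap { Some(next) } else { None }
            })
            .is_ok()
    }

    pub fn held(&self) -> u64 {
        self.held.load(Ordering::SeqCst)
    }
}

/// Eight threads each ask for 600 tokens out of 1,000. Only one can fit.
pub fn race_for_last_slot() -> (usize, u64) {
    let cap = TokenCap::new(1_000);
    let winners = thread::scope(|s| {
        let mut handles = Vec::new();
        for _ in 0..8 {
            handles.push(s.spawn(|| cap.try_reserve(600)));
        }
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .filter(|won| *won)
            .count()
    });
    (winners, cap.held())
}
```

Whichever thread gets there first wins, and the other seven are refused. The held total never passes the cap regardless of scheduling.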

The commit step is what makes the system honest. The reservation is your best estimate before the call. The commit is what actually happened. If actual usage was less than reserved, the difference returns to the pool. If actual usage exceeded the reservation (the response was bigger than the estimator predicted), the commit still succeeds and the pool records the real total, so the next reservation from any agent is checked against that total and refused if it no longer fits under the cap.
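The reconciliation arithmetic is easier to follow with numbers. Here is a minimal sketch, assuming a pool that tracks tokens only and keeps separate "reserved" and "committed" totals behind a mutex. `SketchPool` and its method signatures are made up for illustration and differ from the crate's API shown above.

```rust
use std::sync::Mutex;

#[derive(Default)]
struct Ledger {
    reserved: u64,  // estimates for calls still in flight
    committed: u64, // actual usage of finished calls
}

/// Hypothetical pool used only to illustrate reserve/commit bookkeeping.
pub struct SketchPool {
    cap: u64,
    ledger: Mutex<Ledger>,
}

impl SketchPool {
    pub fn new(cap: u64) -> Self {
        Self { cap, ledger: Mutex::new(Ledger::default()) }
    }

    /// Refuse if reserved + committed + estimate would pass the cap.
    pub fn reserve(&self, estimate: u64) -> Option<u64> {
        let mut ledger = self.ledger.lock().unwrap();
        if ledger.reserved + ledger.committed + estimate > self.cap {
            return None;
        }
        ledger.reserved += estimate;
        Some(estimate)
    }

    /// Swap the estimate for the actual usage. Never refuses: the call
    /// has already been billed, so the ledger has to reflect it.
    pub fn commit(&self, reserved: u64, actual: u64) {
        let mut ledger = self.ledger.lock().unwrap();
        ledger.reserved -= reserved;
        ledger.committed += actual;
    }

    pub fn remaining(&self) -> u64 {
        let ledger = self.ledger.lock().unwrap();
        self.cap.saturating_sub(ledger.reserved + ledger.committed)
    }
}
```

With a cap of 1,000: reserving 400 and committing 250 leaves 750 remaining. Reserving 700 and committing 900 pushes the committed total to 1,150, so remaining drops to 0 and the next reservation of any size is refused.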

What it does NOT do

It does not call the LLM. The pool only counts.
It does not estimate tokens. You pass in the estimate from your own estimator. The crate ships no tokenizer dependency because tokenizers are heavy and provider-specific.
It does not handle time windows. If you want a per-minute budget that resets each minute, use llm-budget-window.
It does not handle multiple cost types beyond tokens and USD. If your model has a separate cache-hit cost or a separate image-input cost, sum into USD upstream and pass the total.
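Summing cost types upstream, as the last point suggests, might look like the sketch below. The `Usage` and `Prices` structs and every rate used with them are placeholders for illustration, not real provider prices; substitute the numbers from your own price sheet.

```rust
/// Hypothetical per-call usage breakdown.
pub struct Usage {
    pub input_tokens: u64,
    pub cache_hit_tokens: u64,
    pub output_tokens: u64,
    pub images: u64,
}

/// Hypothetical price sheet: USD per million tokens, plus a flat per-image fee.
pub struct Prices {
    pub input_per_mtok: f64,
    pub cache_hit_per_mtok: f64,
    pub output_per_mtok: f64,
    pub per_image: f64,
}

/// Collapse every cost type into one USD figure for the pool.
pub fn total_usd(usage: &Usage, prices: &Prices) -> f64 {
    let per_mtok = |tokens: u64, rate: f64| tokens as f64 / 1_000_000.0 * rate;
    per_mtok(usage.input_tokens, prices.input_per_mtok)
        + per_mtok(usage.cache_hit_tokens, prices.cache_hit_per_mtok)
        + per_mtok(usage.output_tokens, prices.output_per_mtok)
        + usage.images as f64 * prices.per_image
}
```

The single total is then what you would hand to the pool as the USD amount, both for the estimate at reserve time and for the actual figure at commit time.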

Inside the lib: one design choice worth showing

The hard part was deciding what happens when a reservation is dropped without commit.

A naive implementation says: reservation held while the Reservation struct lives, released on Drop. That gives clean RAII but it means a dropped reservation silently returns budget to the pool without any signal to the caller. If the agent crashed mid-call, you have no record of the failed call in your budget.

A stricter implementation says: every reservation must be explicitly committed or released. Drop without commit is a panic. That gives strong correctness but it means every error path in the caller has to remember to release.

The crate's answer is a middle ground. Drop releases the reservation but logs a BudgetEvent::DroppedReservation to a configured sink. RAII is preserved (no panics, no deadlocks) but you still get a visible signal that something went sideways.

let pool = BudgetPool::new()
    .with_token_cap(1_000_000)
    .with_event_sink(Box::new(my_logger));

The logger gets every reserve, commit, release, and dropped reservation. Production teams wire this into observability and notice when an agent crashes mid-flight before the metrics dashboard would have shown anything wrong.
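To make the middle ground concrete, here is a sketch of a Drop-based release that also leaves a trace. It is an assumption about how such a design could be built with the standard library, not the crate's source. `Event`, `Pool`, `Reservation`, and `failing_call` are stand-in names, and the "sink" is just an in-memory Vec.

```rust
use std::sync::{Arc, Mutex};

/// Stand-in for the crate's event type.
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    Reserved(u64),
    Committed(u64),
    DroppedReservation(u64),
}

#[derive(Default)]
struct Inner {
    held: u64,
    events: Vec<Event>, // plays the role of the configured sink
}

#[derive(Clone, Default)]
pub struct Pool {
    inner: Arc<Mutex<Inner>>,
}

pub struct Reservation {
    pool: Pool,
    tokens: u64,
    settled: bool,
}

impl Pool {
    pub fn reserve(&self, tokens: u64) -> Reservation {
        let mut inner = self.inner.lock().unwrap();
        inner.held += tokens;
        inner.events.push(Event::Reserved(tokens));
        Reservation { pool: self.clone(), tokens, settled: false }
    }

    pub fn held(&self) -> u64 {
        self.inner.lock().unwrap().held
    }

    pub fn events(&self) -> Vec<Event> {
        self.inner.lock().unwrap().events.clone()
    }
}

impl Reservation {
    /// Replace the estimate with actual usage and mark the reservation settled.
    pub fn commit(mut self, actual: u64) {
        self.settled = true; // Drop will see this and do nothing
        let mut inner = self.pool.inner.lock().unwrap();
        inner.held = inner.held - self.tokens + actual;
        inner.events.push(Event::Committed(actual));
    }
}

impl Drop for Reservation {
    fn drop(&mut self) {
        if self.settled {
            return;
        }
        // No panic in Drop: if the lock is poisoned, skip the bookkeeping.
        if let Ok(mut inner) = self.pool.inner.lock() {
            inner.held -= self.tokens;
            inner.events.push(Event::DroppedReservation(self.tokens));
        }
    }
}

/// Simulates an agent whose call fails before it can commit.
pub fn failing_call(pool: &Pool) -> Result<(), String> {
    let _reservation = pool.reserve(500);
    Err("upstream timeout".to_string()) // early return drops the reservation
}
```

After `failing_call`, the held total is back to 0 and the event list ends with `DroppedReservation(500)`, which is the signal a plain RAII release would have swallowed.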

When this is useful

You run many concurrent agents that share a budget and you do not want a race to push you over.
You want pre-call refusal (the safety we actually want) instead of after-call accounting (the safety we usually get).
You are wiring the pool into a hierarchy: per-agent caps, per-team caps, per-org caps. Each layer is its own BudgetPool instance.
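For the hierarchy case, one question is what happens when an inner layer accepts a reservation and an outer layer refuses it. A possible approach, sketched here under my own assumptions, is to reserve layer by layer and roll back the earlier layers on the first refusal. `Layer` and `reserve_through` are hypothetical; the crate may coordinate layers differently or leave this to the caller.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for one pool in the hierarchy (tokens only).
pub struct Layer {
    pub name: &'static str,
    held: AtomicU64,
    cap: u64,
}

impl Layer {
    pub fn new(name: &'static str, cap: u64) -> Self {
        Self { name, held: AtomicU64::new(0), cap }
    }

    fn try_reserve(&self, tokens: u64) -> bool {
        self.held
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |held| {
                let next = held.checked_add(tokens)?;
                if next <= self.cap { Some(next) } else { None }
            })
            .is_ok()
    }

    fn release(&self, tokens: u64) {
        self.held.fetch_sub(tokens, Ordering::SeqCst);
    }

    pub fn held(&self) -> u64 {
        self.held.load(Ordering::SeqCst)
    }
}

/// Reserve in every layer, innermost first. If a layer refuses, undo the
/// layers that already accepted and report which layer said no.
pub fn reserve_through(layers: &[&Layer], tokens: u64) -> Result<(), &'static str> {
    for (i, layer) in layers.iter().enumerate() {
        if !layer.try_reserve(tokens) {
            for earlier in &layers[..i] {
                earlier.release(tokens);
            }
            return Err(layer.name);
        }
    }
    Ok(())
}
```

Returning the name of the refusing layer tells the caller whether it hit its own cap or someone else's, which matters when deciding whether to retry later.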

When this is NOT what you want

For single-agent CLI runs. Use token-budget-py or just a counter. The concurrency safety here is overkill if you never have parallel calls.
For per-call hard caps. A single call that exceeds the per-call limit is a different problem. Reject before submission with a token estimator and a simple if-statement.
For cost-attribution dashboards. The pool counts but does not group by tool, model, or callsite. Pair with agenttrace for attribution.
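The per-call hard cap mentioned above really is just an estimator and an if-statement. A sketch, assuming a crude "about four characters per token" heuristic as the estimator; `rough_estimate`, `check_per_call`, and `CallTooLarge` are illustrative names, and a real tokenizer for your provider will be more accurate.

```rust
#[derive(Debug, PartialEq)]
pub struct CallTooLarge {
    pub estimated_tokens: u64,
    pub limit: u64,
}

/// Hypothetical estimator: roughly four characters per token, rounded up.
pub fn rough_estimate(prompt: &str) -> u64 {
    (prompt.chars().count() as u64 + 3) / 4
}

/// Reject an oversized call before it is ever submitted.
pub fn check_per_call(prompt: &str, limit: u64) -> Result<u64, CallTooLarge> {
    let estimated_tokens = rough_estimate(prompt);
    if estimated_tokens > limit {
        return Err(CallTooLarge { estimated_tokens, limit });
    }
    Ok(estimated_tokens)
}
```

No shared state is involved, so there is nothing to race on and no need for a pool.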

Install

[dependencies]
token-budget-pool = "0.1"

Repo: https://github.com/MukundaKatta/token-budget-pool
Python port: https://github.com/MukundaKatta/token-budget-py

Sibling libraries

Lib	Boundary	Repo
token-budget-pool	Shared concurrent budget (Rust)	this repo
token-budget-py	Same, in Python	https://github.com/MukundaKatta/token-budget-py
llm-budget-window	Time-windowed budget cap	https://github.com/MukundaKatta/llm-budget-window
tool-call-budgets	Per-tool call-count caps	https://github.com/MukundaKatta/tool-call-budgets
claude-cost	Cache-aware Claude cost calc	https://github.com/MukundaKatta/claude-cost

What's next

A BudgetPool::with_back_pressure(...) mode that does not refuse calls but instead returns a Future that waits until the pool has capacity. Useful for batch jobs where you would rather wait than fail. Already prototyped, semantics still under discussion.
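To show the waiting behavior without an async runtime, the sketch below swaps the Future for a blocking wait on a standard-library Condvar. It is a different technique from the planned one and makes no claim about the final semantics. `WaitingPool`, `reserve_blocking`, and `demo` are hypothetical names.

```rust
use std::sync::{Condvar, Mutex};
use std::thread;

/// Hypothetical pool that blocks instead of refusing (tokens only).
pub struct WaitingPool {
    cap: u64,
    held: Mutex<u64>,
    freed: Condvar,
}

impl WaitingPool {
    pub fn new(cap: u64) -> Self {
        Self { cap, held: Mutex::new(0), freed: Condvar::new() }
    }

    /// Block the calling thread until the reservation fits.
    /// Caveat: a request larger than the cap itself would wait forever.
    pub fn reserve_blocking(&self, tokens: u64) {
        let mut held = self.held.lock().unwrap();
        while *held + tokens > self.cap {
            held = self.freed.wait(held).unwrap();
        }
        *held += tokens;
    }

    /// Return capacity and wake every waiter so each can re-check.
    pub fn release(&self, tokens: u64) {
        let mut held = self.held.lock().unwrap();
        *held -= tokens;
        self.freed.notify_all();
    }

    pub fn held(&self) -> u64 {
        *self.held.lock().unwrap()
    }
}

/// One holder takes 80 of 100; a second thread asking for 50 has to wait
/// until the first 80 are released.
pub fn demo() -> u64 {
    let pool = WaitingPool::new(100);
    pool.reserve_blocking(80);
    thread::scope(|s| {
        let waiter = s.spawn(|| pool.reserve_blocking(50));
        pool.release(80);
        waiter.join().unwrap();
    });
    pool.held()
}
```

Waiting only helps when capacity can actually come back, for example when in-flight reservations are released or commit below their estimate. Spend that has been committed against a fixed cap never frees up, which is one reason the semantics need care.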