I was running a multi-agent setup over a weekend. Three workers in parallel, each calling Claude, each with their own retry logic. I woke up on Sunday to a bill alert.
Forty bucks. Eighteen minutes. One worker had gotten into a retry loop on a malformed tool response and had hammered the API until I happened to glance at my dashboard.
The annoying part: I had per-call cost logging. Every call printed its USD cost. I just had no shared cap across the three workers. Each one thought it was being polite by capping at $5. Three workers, three caps. The actual ceiling was $15+ per minute and nothing stopped them.
So I built token-budget-pool. One shared atomic counter that all workers check before every call.
The API
use token_budget_pool::Budget;
let budget = Budget::new_usd(5.00); // $5 cap across all consumers
let cost_estimate = 0.018; // claude sonnet on a short prompt
budget.reserve(cost_estimate)?; // errors if pool exhausted
let actual_cost = call_claude(...)?;
budget.commit(actual_cost)?; // adjusts the running total
reserve and commit are the two-phase part. You reserve before the call so concurrent workers cannot all squeeze through at once. You commit after with the actual cost from the API response (input tokens + output tokens + cache reads + cache writes).
If the budget is blown, reserve returns BudgetExceeded and the worker can decide to back off, page someone, or just exit clean.
Why two phase
The single-phase version (just record(cost) after the call) loses the race. Two workers can both check the budget, both see room, both call the API, both blow past the cap.
With reserve-then-commit, the reserve is atomic. The commit just adjusts up or down to match the real cost. If the actual cost was lower than the estimate, the pool gains a tiny bit back. If it was higher (long generation), the pool loses extra and the next reserve fails sooner.
let res = budget.reserve(0.02);
if res.is_err() {
eprintln!("budget gone, this worker is done");
return Ok(());
}
let actual = call_llm(...).await?;
budget.commit(actual)?;
Python port
I ported it to Python because most of my agent code is Python and the import was annoying via FFI for this kind of thing.
from token_budget import Budget
budget = Budget.new_usd(5.00)
try:
budget.reserve(0.018)
except BudgetExceededError:
print("budget gone")
return
cost = call_claude(...)
budget.commit(cost)
Same shape. 18 tests cover the threading edge cases. Locks are released on context exit.
Time windows
Just a static cap is not enough. You want "no more than $20 per hour" or "no more than 200k tokens per minute" so a slow drip does not still blow your budget over a day.
That is what llm-budget-window does. Same author. It tracks multiple windows at once.
use llm_budget_window::WindowedBudget;
use std::time::Duration;
let budget = WindowedBudget::builder()
.add_window(Duration::from_secs(60), 100_000) // 100k tokens/min
.add_window(Duration::from_secs(3600), 5_000_000) // 5M tokens/hour
.build();
budget.record(tokens_used)?; // errors if either window exceeded
Records are atomic across all windows. If you blow the minute window first, you get told which window. The day window still has room but the agent backs off for the minute.
What it does not do
It does not bill you. It does not refund you. It is a pre-call cap. If the upstream API charges you for the failed retry, you still pay. The point is to stop the next call.
It also does not understand model-specific pricing. You pass cost estimates in. I have a separate crate (claude-cost, openai-cost, bedrock-cost) for the per-model math. Composing them is two lines:
let est = claude_cost::estimate(model, input_tokens, max_output_tokens);
budget.reserve(est)?;
The lesson
Per-call cost logging is necessary but not sufficient. If you have two or more workers, or any retry loop, you need a shared cap. The cost of writing one is an afternoon. The cost of not having one was $40 in my case. Probably more for whoever wakes up to a four-figure alert.
Repos:
- crates.io: https://crates.io/crates/token-budget-pool
- PyPI: https://pypi.org/project/token-budget-py/
- GitHub: https://github.com/MukundaKatta/token-budget-pool
One shared cap. Two-phase. Sleep better.
Top comments (1)
Solid pattern, especially the two-phase reserve/commit — that's the part most "budget cap" implementations get wrong. A few extensions worth thinking about as you scale this past the weekend-fix stage:
Adaptive cost estimation feeding
reserve(). A static estimate works for short prompts but skews badly for agent loops with variable output length. An EWMA per(model, prompt_class)onactual_output_tokens / max_tokensconverges fast (~50-100 samples) and gives a much tighter envelope. Theclaude-cost/openai-costcrates could publish aDistribution, not a scalar, andreserve()takes a percentile (p95 estimate = conservative reserve).Inter-worker fairness when budget is tight. With three workers sharing one pool, the worker that loops fastest captures the budget at the expense of slower-but-higher-value workers. Two cheap fixes: (a) per-worker rate-limit underneath the pool, (b) a priority field on
reserve()that lets the pool admit the highest-priority pending call rather than first-come.Budget-aware fallback chains, not just hard-stop. When
BudgetExceededreturns, the natural next move is "downgrade to a cheaper model" not "exit". Areserve_with_fallback(estimate, fallback_model)helper auto-retrying against a lower-cost model when the high-tier estimate doesn't fit is the missing ergonomic. Most production code writes this glue by hand and gets it slightly wrong.The picker-between-workers itself is a bandit problem. Once you have N workers + 1 budget pool + a quality signal (was the result correct, was the retry warranted), you can let the system learn which workers to prioritize when budget is constrained. UCB1 or Thompson Sampling treats each worker as an arm and budget-spend-per-success as the reward signal.
Probably overkill for the immediate "stop bleeding $40/weekend" fix, but item 4 is where this naturally extends past 3 workers. I've packaged UCB1/Thompson and 19 other decision algorithms as an MCP server in case it's useful as a reference for that piece: github.com/Whatsonyourmind/oraclaw. The two-phase reserve/commit pattern stands alone fine either way — nice work shipping it.