Mukunda Rao Katta

Posted on May 19

How one bad prompt burned $40 of my Claude budget in 18 minutes

#hermesagent #ai #llm #rust

Shared atomic budget for parallel agents

I was running a multi-agent setup over a weekend. Three workers in parallel, each calling Claude, each with their own retry logic. I woke up on Sunday to a bill alert.

Forty bucks. Eighteen minutes. One worker had gotten into a retry loop on a malformed tool response and had hammered the API until I happened to glance at my dashboard.

The annoying part: I had per-call cost logging. Every call printed its USD cost. I just had no shared cap across the three workers. Each one thought it was being polite by capping at $5. Three workers, three caps. The actual ceiling was $15+ per minute and nothing stopped them.

So I built token-budget-pool. One shared atomic counter that all workers check before every call.

The API

use token_budget_pool::Budget;

let budget = Budget::new_usd(5.00);  // $5 cap across all consumers

let cost_estimate = 0.018;  // claude sonnet on a short prompt
budget.reserve(cost_estimate)?;  // errors if pool exhausted

let actual_cost = call_claude(...)?;
budget.commit(actual_cost)?;  // adjusts the running total

reserve and commit are the two-phase part. You reserve before the call so concurrent workers cannot all squeeze through at once. You commit after with the actual cost from the API response (input tokens + output tokens + cache reads + cache writes).

If the budget is blown, reserve returns BudgetExceeded and the worker can decide to back off, page someone, or just exit clean.

Why two phase

The single-phase version (just record(cost) after the call) loses the race. Two workers can both check the budget, both see room, both call the API, both blow past the cap.

With reserve-then-commit, the reserve is atomic. The commit just adjusts up or down to match the real cost. If the actual cost was lower than the estimate, the pool gains a tiny bit back. If it was higher (long generation), the pool loses extra and the next reserve fails sooner.

let res = budget.reserve(0.02);
if res.is_err() {
    eprintln!("budget gone, this worker is done");
    return Ok(());
}

let actual = call_llm(...).await?;
budget.commit(actual)?;

Python port

I ported it to Python because most of my agent code is Python, and calling the Rust crate over FFI was more hassle than it was worth for something this small.

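# Assumes BudgetExceededError is exported alongside Budget
from token_budget import Budget, BudgetExceededError

budget = Budget.new_usd(5.00)  # $5 cap shared by every worker

def run_worker():
    try:
        budget.reserve(0.018)  # estimate for one short call
    except BudgetExceededError:
        print("budget gone")
        return  # pool exhausted: stop before calling the API

    cost = call_claude(...)  # actual USD cost reported for the call
    budget.commit(cost)  # reconcile the reservation with the real cost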

Same shape. 18 tests cover the threading edge cases. Locks are released on context exit.
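
For readers who want to see the shape of the mechanism, here is a minimal sketch of a reserve/commit pool built on a `threading.Lock`. It is not the token-budget-py source. The class name `SharedBudget`, the integer micro-dollar units, the `count_admitted` helper, and passing the estimate back into `commit` are all assumptions made for illustration.

```python
import threading


class BudgetExceeded(Exception):
    """Raised when a reservation would push the pool past its cap."""


class SharedBudget:
    """Illustrative two-phase pool (hypothetical, not the library's code).

    Amounts are integer micro-dollars so the arithmetic stays exact.
    """

    def __init__(self, cap_micro_usd):
        self._cap = cap_micro_usd
        self._spent = 0   # committed, real cost
        self._held = 0    # reserved but not yet committed
        self._lock = threading.Lock()

    def reserve(self, estimate):
        # Check and hold happen under one lock, so two workers can
        # never both claim the same headroom.
        with self._lock:
            if self._spent + self._held + estimate > self._cap:
                raise BudgetExceeded(f"cannot reserve {estimate}")
            self._held += estimate

    def commit(self, estimate, actual):
        # Swap the hold for the real cost; the pool gains or loses the difference.
        with self._lock:
            self._held -= estimate
            self._spent += actual

    def remaining(self):
        with self._lock:
            return self._cap - self._spent - self._held


def count_admitted(budget, estimate, workers=8, attempts=50):
    """Hammer reserve() from several threads and count the calls that get in."""
    admitted = []

    def worker():
        for _ in range(attempts):
            try:
                budget.reserve(estimate)
                admitted.append(1)  # list.append is atomic in CPython
            except BudgetExceeded:
                return

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(admitted)
```

With a cap of 100,000 micro-dollars ($0.10) and reservations of 18,000 each, five calls fit and the sixth does not, however the threads interleave. That is the property the single-phase version cannot give you.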

Time windows

Just a static cap is not enough. You want "no more than $20 per hour" or "no more than 200k tokens per minute" so a slow drip does not still blow your budget over a day.

That is what llm-budget-window does. Same author. It tracks multiple windows at once.

use llm_budget_window::WindowedBudget;
use std::time::Duration;

let budget = WindowedBudget::builder()
    .add_window(Duration::from_secs(60), 100_000)        // 100k tokens/min
    .add_window(Duration::from_secs(3600), 5_000_000)    // 5M tokens/hour
    .build();

budget.record(tokens_used)?;  // errors if either window exceeded

Records are atomic across all windows. If you blow the minute window first, you get told which window. The hour window still has room but the agent backs off for the minute.
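
A rough Python sketch of the same idea, assuming a sliding window over a log of timestamped events. This is not how llm-budget-window is implemented; `SlidingWindows`, `WindowExceeded`, and the injectable `now` argument are hypothetical names chosen for the example.

```python
import threading
import time
from collections import deque


class WindowExceeded(Exception):
    def __init__(self, window_secs, used, limit):
        super().__init__(f"{window_secs}s window: {used} used of {limit}")
        self.window_secs = window_secs  # tells the caller which window tripped


class SlidingWindows:
    """Hypothetical multi-window token limiter; names are illustrative."""

    def __init__(self, windows):
        self._windows = list(windows)   # [(seconds, token_limit), ...]
        self._longest = max(secs for secs, _ in self._windows)
        self._events = deque()          # (timestamp, tokens), oldest first
        self._lock = threading.Lock()

    def record(self, tokens, now=None):
        now = time.monotonic() if now is None else now
        with self._lock:
            # Drop events that no window can still see.
            while self._events and self._events[0][0] <= now - self._longest:
                self._events.popleft()
            # Check every window before recording, so a rejected call
            # is not counted against any of them.
            for secs, limit in self._windows:
                used = sum(n for ts, n in self._events if ts > now - secs)
                if used + tokens > limit:
                    raise WindowExceeded(secs, used, limit)
            self._events.append((now, tokens))
```

Passing `now` explicitly makes the behaviour easy to test without sleeping. In real use you would leave it out and let `time.monotonic()` supply the clock.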

What it does not do

It does not bill you. It does not refund you. It is a pre-call cap. If the upstream API charges you for the failed retry, you still pay. The point is to stop the next call.

It also does not understand model-specific pricing. You pass cost estimates in. I have separate crates (claude-cost, openai-cost, bedrock-cost) for the per-model math. Composing them is two lines:

let est = claude_cost::estimate(model, input_tokens, max_output_tokens);
budget.reserve(est)?;

The lesson

Per-call cost logging is necessary but not sufficient. If you have two or more workers, or any retry loop, you need a shared cap. The cost of writing one is an afternoon. The cost of not having one was $40 in my case. Probably more for whoever wakes up to a four-figure alert.

Repos:

crates.io: https://crates.io/crates/token-budget-pool
PyPI: https://pypi.org/project/token-budget-py/
GitHub: https://github.com/MukundaKatta/token-budget-pool

One shared cap. Two-phase. Sleep better.

Top comments (31)

Whatsonyourmind • May 19

Solid pattern, especially the two-phase reserve/commit — that's the part most "budget cap" implementations get wrong. A few extensions worth thinking about as you scale this past the weekend-fix stage:

1. Adaptive cost estimation feeding reserve(). A static estimate works for short prompts but skews badly for agent loops with variable output length. An EWMA per (model, prompt_class) on actual_output_tokens / max_tokens converges fast (~50-100 samples) and gives a much tighter envelope. The claude-cost/openai-cost crates could publish a Distribution, not a scalar, and reserve() takes a percentile (p95 estimate = conservative reserve).
2. Inter-worker fairness when budget is tight. With three workers sharing one pool, the worker that loops fastest captures the budget at the expense of slower-but-higher-value workers. Two cheap fixes: (a) per-worker rate-limit underneath the pool, (b) a priority field on reserve() that lets the pool admit the highest-priority pending call rather than first-come.
3. Budget-aware fallback chains, not just hard-stop. When BudgetExceeded returns, the natural next move is "downgrade to a cheaper model" not "exit". A reserve_with_fallback(estimate, fallback_model) helper auto-retrying against a lower-cost model when the high-tier estimate doesn't fit is the missing ergonomic. Most production code writes this glue by hand and gets it slightly wrong.
4. The picker-between-workers itself is a bandit problem. Once you have N workers + 1 budget pool + a quality signal (was the result correct, was the retry warranted), you can let the system learn which workers to prioritize when budget is constrained. UCB1 or Thompson Sampling treats each worker as an arm and budget-spend-per-success as the reward signal.

Probably overkill for the immediate "stop bleeding $40/weekend" fix, but item 4 is where this naturally extends past 3 workers. I've packaged UCB1/Thompson and 19 other decision algorithms as an MCP server in case it's useful as a reference for that piece: github.com/Whatsonyourmind/oraclaw. The two-phase reserve/commit pattern stands alone fine either way — nice work shipping it.

Mukunda Rao Katta • May 21

Solid extensions. The bandit angle in (4) is where I want to take this next. I've been thinking of worker selection as fairness but you're right it's reward driven. EWMA on actual/max output ratio (1) is the smallest piece to ship and probably the most useful right now since static estimates blow up on long generations. Will look at oraclaw.

Whatsonyourmind • May 27

Glad the framing landed. Two quick adds for the EWMA piece since that's your immediate target:

Bias correction matters more than the α choice early on. The naive recursion s_t = α·x_t + (1-α)·s_{t-1} initialized to s_0 = 0 biases low for the first ~1/α samples, which is exactly when budget decisions are riskiest. The Welford-style fix:

s = 0.0
w = 0.0
for x in observations:
    s = α * x + (1 - α) * s
    w = α      + (1 - α) * w
    estimate = s / w   # bias-corrected

One extra multiply, gets you usable estimates from sample 1 instead of sample ~30.
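
The recursion above can be wrapped into a small runnable class to compare the corrected and uncorrected estimates side by side. This is a sketch only: `DebiasedEwma`, `reserve_tokens`, and the `safety` multiplier are hypothetical names and parameters, not part of any of the crates discussed here.

```python
class DebiasedEwma:
    """EWMA with the running-weight correction from the snippet above."""

    def __init__(self, alpha):
        self.alpha = alpha
        self.s = 0.0  # exponentially weighted sum of observations
        self.w = 0.0  # sum of the weights applied so far

    def update(self, x):
        self.s = self.alpha * x + (1 - self.alpha) * self.s
        self.w = self.alpha + (1 - self.alpha) * self.w
        return self.s / self.w  # bias-corrected estimate

    @property
    def naive(self):
        # What the uncorrected recursion (s_0 = 0) would report.
        return self.s


def reserve_tokens(ratio_estimate, max_tokens, safety=1.5):
    """Hypothetical helper: turn an actual/max ratio into a token reservation.

    The safety multiplier is an assumption; the result never exceeds max_tokens.
    """
    return min(max_tokens, int(ratio_estimate * max_tokens * safety))
```

After a single observation of 0.5 with alpha = 0.1, the corrected estimate is 0.5 while the naive value is 0.05, which is the cold-start low bias in concrete terms.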

α for actual/max ratio: 0.1–0.2 is the sweet spot (effective window 5–10 generations, matches the cadence at which worker performance actually drifts). Below 0.05 is over-smoothed; above 0.3 starts tracking per-prompt noise. If you want adaptive α down the line, the GD-EWMA trick (gradient descent on prediction error) auto-tunes per worker — useful when one worker is more variable than the others.

Path to bandits when you're ready: once EWMA-stabilized actual/max is your per-worker reward signal, UCB1 drops in directly as score[i] = ratio_ewma[i] + c·√(ln(t)/n[i]). Use c ≈ 0.5 (lower than textbook √2 because rewards are bounded [0,1]). Converges to the right allocation at ~50 total generations across N=3 workers. Thompson Sampling with a Beta posterior is better when rewards are sparse/binary (success vs failure) — slightly more setup.

The @oraclaw/bandit package keeps the same contract for both (selectArm(arms) → id, update(arm_id, reward)), so you can ship UCB first and swap to Thompson later without touching call sites. Happy to look at a PR if you want a second pair of eyes.
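
As a reference point for the UCB1 score quoted above, here is a minimal standalone sketch. It is not the @oraclaw/bandit API. The class name `Ucb1` and its methods are hypothetical, it keeps a plain running mean per arm rather than an EWMA, and it assumes rewards are already scaled to [0, 1].

```python
import math


class Ucb1:
    """Minimal UCB1: score[i] = mean[i] + c * sqrt(ln(t) / n[i])."""

    def __init__(self, n_arms, c=0.5):
        self.c = c
        self.counts = [0] * n_arms   # n[i]: times each arm was played
        self.means = [0.0] * n_arms  # running mean reward per arm

    def select(self):
        # Play every arm once before trusting the scores.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        t = sum(self.counts)
        scores = [
            mean + self.c * math.sqrt(math.log(t) / n)
            for mean, n in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental mean: no need to store past rewards.
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```

Each round is `arm = bandit.select()`, run that worker, then `bandit.update(arm, reward)`. The exploration bonus shrinks as an arm is played more, so a worker with a consistently higher reward ends up with most of the calls while the others still get sampled occasionally.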

Mukunda Rao Katta • May 28

Good catch on bias correction. The cold-start low-bias is exactly the danger window for budget gating, before one bad call has moved the average. I'll fold the Welford-style debias into the EWMA section. Thanks for the accumulator detail.

Sol • May 19

Useful pattern. One thing from Anthropic's pricing model that bit us: cache write/read tokens are priced differently from base input/output, so a single USD estimate can under-reserve after cache misses. Did you consider reserving by token class first (input/output/cache-write/cache-read) and converting to USD at commit?

Curious whether that reduced false-safe budget checks in your runs.

Mukunda Rao Katta • May 21

Honest answer is no, I reserve in USD only right now and that does under-reserve when cache writes and reads are not separated. Reserving by token class first then converting at commit would be the right shape. Adding it to the next version.
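
To make the under-reserve concrete, here is a small sketch of costing by token class. The rates are placeholders invented for the example, not real prices for any model, and `TokenUsage` and `cost_usd` are hypothetical names.

```python
from dataclasses import dataclass

# Placeholder rates in USD per million tokens. NOT real pricing:
# substitute the provider's published numbers.
RATES = {"input": 3.0, "output": 15.0, "cache_write": 4.0, "cache_read": 0.5}


@dataclass
class TokenUsage:
    input: int = 0
    output: int = 0
    cache_write: int = 0
    cache_read: int = 0


def cost_usd(usage, rates=RATES):
    """Price each token class separately, then sum."""
    total = (
        usage.input * rates["input"]
        + usage.output * rates["output"]
        + usage.cache_write * rates["cache_write"]
        + usage.cache_read * rates["cache_read"]
    )
    return total / 1_000_000


# Same 50k-token context: once served from cache, once written to it.
cache_hit = TokenUsage(input=1_000, output=500, cache_read=50_000)
cache_miss = TokenUsage(input=1_000, output=500, cache_write=50_000)
```

Under these placeholder rates the cache-miss request costs several times the cache-hit one, so a USD estimate calibrated on cache hits under-reserves whenever the cache misses.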

Sol • May 21

Helpful, thank you for confirming this. The next-version shape you described, reserve by token class first and convert at commit, matches the under-reserve failure mode we keep seeing in cost-control audits.

One implementation boundary I am curious about: when a request mixes cache write and cache read token paths across retries, will you keep separate reservation buckets and idempotency keys per token class, or one reservation record with per-class deltas? We saw reconciliation drift with the single-record pattern.

Sol • May 21

Thanks for confirming this. When you move from USD-only reservation to token-class reservation, do you plan to persist cache_write and cache_read as separate usage buckets on the root run so downstream LangSmith or Langfuse rollups can attribute under-reserve deltas per workflow step? I am trying to separate reservation error from trace aggregation error in multi-model graphs.

Sol • May 21

Checked against OpenTelemetry GenAI semantic conventions issue #35 (still open, updated 2026-05-20): task/action/agent/team/artifact/memory semantics are now explicit, but I still do not see a canonical pair for cost-centre attribution plus token-to-cost joins at root-run level.

Without that mapping, cache_write/cache_read plus prompt/completion deltas remain hard to reconcile across LangSmith and Langfuse rollups. Are you planning a standard reservation-scope plus usage-bucket mapping so under-reserve variance can be attributed per workflow step?

Sol • May 21

Thanks, this is very useful. For your next version, are you planning separate reservation buckets and idempotency keys per token class when cache write/read paths split across retries?

Valentin Monteiro • May 20

The thread keeps reaching for better reservation math, but the incident isn't really an algo bug. It's a missing budget and eval layer before traffic hits the model. The pattern that actually catches this kind of blowup is a cost-cap per feature flag plus token-spend alerts scoped per tenant or route, not per worker. Worker-level fairness is fine until one route quietly eats most of the spend; the cap is what stops the fire, the bandit just optimizes how fast it burns.

Mukunda Rao Katta • May 21

Fair point on the layer. In my case there was no routing layer, just three workers in one process, so the pool is the cap at the right scope for that. Once you have tenants or routes the cap belongs higher up, agreed. Bandit at the worker level is downstream of that anyway.

Valentin Monteiro • May 21

Yeah, even at three workers there's a subtle gap: shared pool stops the fire but doesn't tell you which worker burned it. Tagging reserve() with a worker_id (or prompt class) makes the post-mortem one query instead of three log greps. Worth it before you scale past the single-process case.

Mukunda Rao Katta • May 29

Agreed, and it's cheap to add. reserve() already takes the call, tagging it with worker_id or prompt class turns the post-mortem into one query instead of grepping three logs. Doing this before the single-process assumption breaks. Thanks.
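
One way the tagging could look, as a sketch: a small ledger kept next to the pool that accumulates spend per (worker_id, prompt_class). `TaggedLedger` and its methods are hypothetical and not part of token-budget-pool.

```python
import threading
from collections import Counter


class TaggedLedger:
    """Hypothetical attribution ledger: who spent what, in micro-dollars."""

    def __init__(self):
        self._spend = Counter()  # (worker_id, prompt_class) -> micro-USD
        self._lock = threading.Lock()

    def record(self, micro_usd, worker_id, prompt_class="default"):
        with self._lock:
            self._spend[(worker_id, prompt_class)] += micro_usd

    def by_worker(self):
        # Collapse prompt classes to get one total per worker.
        with self._lock:
            totals = Counter()
            for (worker_id, _), amount in self._spend.items():
                totals[worker_id] += amount
            return totals

    def top_spender(self):
        """Return (worker_id, micro_usd) for the biggest spender."""
        return self.by_worker().most_common(1)[0]
```

Calling `record` at commit time with the actual cost means the post-mortem question "which worker burned it" is a single `top_spender()` call.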

TxDesk • May 21

The two-phase reserve+commit is the right primitive, and the worker-race story is exactly the failure mode the single-phase version loses. Two things worth flagging that I hit running this in production and ended up baking into the metering layer.

The estimate-then-actual delta gets ugly for streaming calls. Your two-phase catches concurrent kickoffs, but if three workers all reserve estimates of 0.02 each and start streaming responses that actually run 0.08 each, the budget shows headroom the whole time the calls are in flight and you only learn the real cost on commit. By then you're 3-4× over. The fix I landed on: reserve a worst-case token budget per turn, not a midpoint estimate, and downgrade to a cheaper model before the call if worst-case won't fit. Pessimistic reservation costs you headroom on the happy path but it's the only way the cap is real on streaming.

Second one: a flat pool treats all calls as equal priority. In any system with user-facing requests plus background work (retries, indexing, periodic jobs), you want priority categories on the same pool. Background work reserves from leftover after user-facing demand. Saves you the case where a background retry loop is technically inside cap but starves a user turn that needed the budget.

The afternoon-of-work claim holds though. The cost of not having this was the $40. The cost of getting it nuanced enough for production is maybe a week. Worth the week.
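
A sketch of the worst-case-then-downgrade idea described above. The model names and per-token rates are made up for the example, and `worst_case_cost` and `pick_model` are hypothetical helpers rather than anything from the crates in the post.

```python
# Hypothetical fallback chain, most capable first.
# Rates are placeholder micro-USD per token, not real prices.
CHAIN = [
    ("big-model", 3, 15),    # (name, input_rate, output_rate)
    ("small-model", 1, 5),
]


def worst_case_cost(input_tokens, max_output_tokens, in_rate, out_rate):
    """Assume the model uses its entire output allowance."""
    return input_tokens * in_rate + max_output_tokens * out_rate


def pick_model(remaining, input_tokens, max_output_tokens, chain=CHAIN):
    """Return (name, reservation) for the first model whose worst case fits.

    Returns None when even the cheapest model would not fit.
    """
    for name, in_rate, out_rate in chain:
        cost = worst_case_cost(input_tokens, max_output_tokens, in_rate, out_rate)
        if cost <= remaining:
            return name, cost
    return None
```

The returned reservation is what you would pass to `reserve` before starting the stream. Because it is the ceiling for that turn, the commit can only hand budget back, never overshoot.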

Mukunda Rao Katta • May 29

This matches what I hit. Reserving the midpoint is what makes the cap fake on streaming, you only learn real cost on commit and by then you're 3-4x over. Worst-case-per-turn reservation plus downgrading when it won't fit is the same shape I landed on. Priority pool I don't have yet, background retries can starve a user turn while under cap. Adding leftover-only reservation for background work. Best comment on the post, thanks.
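
One possible reading of leftover-only reservation, sketched below: keep a floor of headroom that only user-facing calls may consume, and admit background calls only from what sits above that floor. The `user_floor` parameter and the `PriorityPool` class are assumptions for illustration. The design in the next version may differ.

```python
import threading


class BudgetExceeded(Exception):
    pass


class PriorityPool:
    """Hypothetical pool where background work cannot eat into a user floor."""

    def __init__(self, cap, user_floor):
        self._cap = cap
        self._user_floor = user_floor  # headroom reserved for user-facing calls
        self._used = 0
        self._lock = threading.Lock()

    def reserve(self, amount, background=False):
        with self._lock:
            remaining = self._cap - self._used
            # Background calls only see what is left above the floor.
            available = remaining - self._user_floor if background else remaining
            if amount > available:
                raise BudgetExceeded(f"cannot reserve {amount}")
            self._used += amount

    def remaining(self):
        with self._lock:
            return self._cap - self._used
```

With a cap of 100 and a floor of 40, background work can take at most 60 in total, so a background retry loop runs out of budget while a user turn can still get through.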

TxDesk • May 31

Convergent independent landings on this stuff is usually a sign the primitive is right. The leftover-only reservation for background is good - what I ended up adding on top is a hard ceiling on background-pool's per-turn delta. Background can claim leftover but never more than X% of an active user-turn's reservation, even if leftover is huge. Otherwise a user with a long quiet stretch suddenly sees a 4x latency hit when background catches up and the model's queued in front of them.

The thing I don't have a clean answer for: starvation on background tasks that are technically optional but operationally important. Logging, telemetry, derived caches. They starve, user doesn't notice for a week, then a metric is wrong. Curious if you've thought about that tier.

Xidao • May 21

This reserve-then-commit pattern is a really practical fix for the "polite local cap, broken global cap" failure mode. One thing I've found useful in similar agent setups is separating a hard shared budget from a smaller per-run circuit breaker, because retry storms often show up as bursts long before the total daily budget is exhausted.

I'm also curious whether you've considered attaching the reservation to an idempotency key or request fingerprint. That makes it easier to reconcile duplicate retries after timeouts, especially when workers can crash between reserve and commit and you need a cleanup story for stale reservations.

Mukunda Rao Katta • May 29

Yes, the idempotency key is in the next version for exactly that. Crash between reserve and commit was leaving phantom holds. Each reservation now carries a fingerprint plus a TTL, and a sweeper reclaims anything that never commits. The per-run circuit breaker is a good separate axis, the daily pool was too coarse for retry-storm bursts.
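
A sketch of how holds keyed by a fingerprint with a TTL might work. This is a guess at the shape, not the code planned for the next version: `ReservationTable`, the injectable `now` argument, and sweeping lazily inside `reserve` (instead of running a separate sweeper) are all assumptions.

```python
import threading
import time


class BudgetExceeded(Exception):
    pass


class ReservationTable:
    """Hypothetical holds table: one hold per idempotency key, each with a TTL."""

    def __init__(self, cap, ttl_secs):
        self._cap = cap
        self._ttl = ttl_secs
        self._spent = 0
        self._holds = {}  # key -> (amount, expires_at)
        self._lock = threading.Lock()

    def _sweep(self, now):
        # Reclaim holds whose worker never came back to commit.
        for key in [k for k, (_, exp) in self._holds.items() if exp <= now]:
            del self._holds[key]

    def _remaining(self):
        return self._cap - self._spent - sum(a for a, _ in self._holds.values())

    def reserve(self, key, amount, now=None):
        now = time.monotonic() if now is None else now
        with self._lock:
            self._sweep(now)
            if key in self._holds:
                return False  # retry of a live request: keep the existing hold
            if amount > self._remaining():
                raise BudgetExceeded(key)
            self._holds[key] = (amount, now + self._ttl)
            return True

    def commit(self, key, actual):
        with self._lock:
            # The spend is real even if the hold already expired, so always count it.
            self._holds.pop(key, None)
            self._spent += actual

    def remaining(self, now=None):
        now = time.monotonic() if now is None else now
        with self._lock:
            self._sweep(now)
            return self._remaining()
```

A retry that reuses the same key does not double-hold, and a worker that crashes between reserve and commit stops blocking the pool once its TTL passes. This sketch does not guard against committing the same key twice. A real version would need to remember committed keys as well.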

Theo Valmis • May 20

The unbounded tool loop is the classic failure mode — the model keeps calling because nothing in the prompt defines "done" in terms the model can evaluate, so it defaults to continuing. What makes it expensive is that LLM calls scale with context length, and each loop iteration grows the context with the previous tool outputs.

The fix you landed on (explicit stop conditions in the prompt) is right, but there's a deeper design principle here: agents need a success criterion that's as precise as the task description. "Summarize the codebase" is a task. "Stop after producing one summary under 500 words" is a task with a termination condition. Without the second part, you're relying on the model's judgment about when enough is enough — which is exactly where cost surprises come from.

Mukunda Rao Katta • May 21

Yeah this is the root cause for me. The prompt did not have a stop condition the model could check against, so it kept calling. Budget caps are the seatbelt for when that fails. The prompt fix is the actual fix.

mote • May 21

The $40 surprise is a symptom of a deeper problem: LLM token budgets don't behave like compute budgets. With CPU, you can profile once and predict. With LLMs, a small prompt change can cascade into 10x more generation tokens depending on how the model interprets its instruction boundary.

What's helped me: treating the system prompt as a contract with explicit output constraints. Instead of "summarize this codebase," something like "summarize in at most 3 bullet points, each under 20 words." The model still interprets creatively, but you've bounded the worst case. You can also add a token-budget field to the API call itself on providers that support it.

Have you experimented with soft constraints in the prompt vs. hard limits via API parameters? Curious which gave you more predictable behavior.

Mukunda Rao Katta • May 29

Hard limits via max_tokens were more predictable, but only as a backstop. The prompt contract is what shaped behavior. "At most 3 bullets, under 20 words each" changed the generation; max_tokens just stopped the runaway. Soft constraints move the median, the hard cap bounds the tail. You need both.

Harjot Singh • May 30

$40 in 18 minutes from one bad prompt is the perfect micro-case for why the cost model is so dangerous: a single vague instruction sent the agent into a loop, and there's no friction at the moment of spend to stop it. You find out after, not during.

Two structural defenses fall out of this: (1) a hard budget cap / per-task ceiling so a runaway loop physically can't drain $40 silently, and (2) not running that exploratory loop on the most expensive model in the first place - a cheaper model fails cheaper while you figure out the right prompt, then you escalate once you know what you want. The expensive mistakes almost always happen on the premium model during the figuring-it-out phase. Brutal but instructive - this kind of post does more to change behavior than any pricing page. Hope you clawed the $40 back.

江欢（JackSoul） • Jun 2

This is a great concrete example of why “log the token cost after the call” is necessary but not sufficient. The reserve/commit pattern is especially useful when multiple workers share a pool.

One thing I’d add for production teams is tagging each reservation with dimensions like user_id, project_id, agent, model, and retry_reason, so the same budget guard can later answer “who/what burned the spend?” rather than only “did we exceed the cap?” Curious if you’ve considered exposing that attribution layer alongside the shared cap.

Alex Shev • Jun 12

This is a good reminder that prompt cost is not just price per token. The expensive part is usually a vague task that causes retries, over-broad context, and long outputs nobody can use. I like setting a budget before the run: scope, stop condition, max context, and what evidence counts as done.
