I was running a multi-agent setup over a weekend. Three workers in parallel, each calling Claude, each with their own retry logic. I woke up on Sun...
Solid pattern, especially the two-phase reserve/commit — that's the part most "budget cap" implementations get wrong. A few extensions worth thinking about as you scale this past the weekend-fix stage:
1. Adaptive cost estimation feeding reserve(). A static estimate works for short prompts but skews badly for agent loops with variable output length. An EWMA per (model, prompt_class) on actual_output_tokens / max_tokens converges fast (~50-100 samples) and gives a much tighter envelope. The claude-cost/openai-cost crates could publish a Distribution, not a scalar, and reserve() takes a percentile (p95 estimate = conservative reserve).
2. Inter-worker fairness when budget is tight. With three workers sharing one pool, the worker that loops fastest captures the budget at the expense of slower-but-higher-value workers. Two cheap fixes: (a) per-worker rate-limit underneath the pool, (b) a priority field on reserve() that lets the pool admit the highest-priority pending call rather than first-come.
3. Budget-aware fallback chains, not just hard-stop. When BudgetExceeded returns, the natural next move is "downgrade to a cheaper model" not "exit". A reserve_with_fallback(estimate, fallback_model) helper auto-retrying against a lower-cost model when the high-tier estimate doesn't fit is the missing ergonomic. Most production code writes this glue by hand and gets it slightly wrong.
4. The picker-between-workers itself is a bandit problem. Once you have N workers + 1 budget pool + a quality signal (was the result correct, was the retry warranted), you can let the system learn which workers to prioritize when budget is constrained. UCB1 or Thompson Sampling treats each worker as an arm and budget-spend-per-success as the reward signal.
Probably overkill for the immediate "stop bleeding $40/weekend" fix, but item 4 is where this naturally extends past 3 workers. I've packaged UCB1/Thompson and 19 other decision algorithms as an MCP server in case it's useful as a reference for that piece: github.com/Whatsonyourmind/oraclaw. The two-phase reserve/commit pattern stands alone fine either way — nice work shipping it.
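For readers following along, here is a minimal sketch of the two-phase reserve/commit pool plus the fallback helper from item 3. It is not the article's code: the class layout, the integer-cents unit, and the model names are assumptions made for illustration, and the helper takes a ranked list of candidates rather than the reserve_with_fallback(estimate, fallback_model) signature suggested above.

```python
import threading


class BudgetExceeded(Exception):
    """Raised when a reservation would push the pool past its cap."""


class BudgetPool:
    """Shared budget in integer cents (hypothetical layout, for illustration)."""

    def __init__(self, cap_cents):
        self.cap_cents = cap_cents
        self.committed = 0          # real spend, known after calls finish
        self.reserved = {}          # reservation id -> cents held in flight
        self._next_id = 0
        self._lock = threading.Lock()

    def reserve(self, estimate_cents):
        # Phase 1: hold the estimate before the call starts, so concurrent
        # workers see each other's in-flight spend.
        with self._lock:
            held = sum(self.reserved.values())
            if self.committed + held + estimate_cents > self.cap_cents:
                raise BudgetExceeded(f"{estimate_cents} cents does not fit")
            self._next_id += 1
            self.reserved[self._next_id] = estimate_cents
            return self._next_id

    def commit(self, reservation_id, actual_cents):
        # Phase 2: swap the hold for the real cost once it is known.
        with self._lock:
            self.reserved.pop(reservation_id)
            self.committed += actual_cents


def reserve_with_fallback(pool, candidates):
    """Try (model, estimate_cents) pairs in order, most capable first."""
    for model, estimate_cents in candidates:
        try:
            return model, pool.reserve(estimate_cents)
        except BudgetExceeded:
            continue
    raise BudgetExceeded("no candidate model fits the remaining budget")
```

With a 100-cent cap, a first call reserving 80 cents gets the premium candidate; a second call made while the first is still in flight falls through to the cheaper one instead of exiting.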
Solid extensions. The bandit angle in (4) is where I want to take this next. I've been thinking of worker selection as fairness but you're right it's reward driven. EWMA on actual/max output ratio (1) is the smallest piece to ship and probably the most useful right now since static estimates blow up on long generations. Will look at oraclaw.
Glad the framing landed. Two quick adds for the EWMA piece since that's your immediate target:
Bias correction matters more than the α choice early on. The naive recursion s_t = α·x_t + (1-α)·s_{t-1} initialized to s_0 = 0 biases low for the first ~1/α samples, which is exactly when budget decisions are riskiest. The Welford-style fix costs one extra multiply and gets you usable estimates from sample 1 instead of sample ~30.
α for actual/max ratio: 0.1–0.2 is the sweet spot (effective window 5–10 generations, matches the cadence at which worker performance actually drifts). Below 0.05 is over-smoothed; above 0.3 starts tracking per-prompt noise. If you want adaptive α down the line, the GD-EWMA trick (gradient descent on prediction error) auto-tunes per worker — useful when one worker is more variable than the others.
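The snippet for the debias step did not survive in this thread, so the following is a sketch under an assumption: that the fix is the standard zero-initialization correction, dividing the raw EWMA by 1 - (1-α)^t, with the decay term kept in an accumulator (the one extra multiply). The class name is hypothetical.

```python
class DebiasedEwma:
    """EWMA started at s_0 = 0, corrected for the cold-start low bias."""

    def __init__(self, alpha):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.raw = 0.0      # naive s_t, biased low early on
        self.decay = 1.0    # accumulates (1 - alpha) ** t
        self.count = 0

    def update(self, x):
        self.raw = self.alpha * x + (1.0 - self.alpha) * self.raw
        self.decay *= 1.0 - self.alpha   # the one extra multiply
        self.count += 1
        return self.value

    @property
    def value(self):
        if self.count == 0:
            raise ValueError("no samples yet")
        # Dividing by the total weight seen so far removes the bias toward 0.
        return self.raw / (1.0 - self.decay)
```

With α = 0.1 and a first sample of 0.6, the naive value is 0.06 while the corrected value is approximately 0.6, which is the difference that matters when the estimate gates a budget decision.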
Path to bandits when you're ready: once EWMA-stabilized actual/max is your per-worker reward signal, UCB1 drops in directly as score[i] = ratio_ewma[i] + c·√(ln(t)/n[i]). Use c ≈ 0.5 (lower than textbook √2 because rewards are bounded [0,1]). Converges to the right allocation at ~50 total generations across N=3 workers. Thompson Sampling with a Beta posterior is better when rewards are sparse/binary (success vs failure), with slightly more setup.
The @oraclaw/bandit package keeps the same contract for both (selectArm(arms) → id, update(arm_id, reward)), so you can ship UCB first and swap to Thompson later without touching call sites. Happy to look at a PR if you want a second pair of eyes.
Good catch on bias correction. The cold-start low-bias is exactly the danger window for budget gating, before one bad call has moved the average. I'll fold the Welford-style debias into the EWMA section. Thanks for the accumulator detail.
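The UCB1 score quoted earlier in this thread can be sketched in a few lines. This is a standalone illustration, not the @oraclaw/bandit API; the function name and argument layout are assumptions, and the default c = 0.5 is the value suggested in the comment.

```python
import math


def ucb1_select(mean_reward, pulls, c=0.5):
    """Return the index of the worker (arm) to run next.

    mean_reward[i] is the running reward estimate for worker i (for example
    a smoothed value in [0, 1]); pulls[i] is how many times it has run.
    """
    # Every arm has to be tried once before the bonus term is defined.
    for i, n in enumerate(pulls):
        if n == 0:
            return i
    t = sum(pulls)
    scores = [
        m + c * math.sqrt(math.log(t) / n)   # estimate + exploration bonus
        for m, n in zip(mean_reward, pulls)
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

An arm that has rarely run gets a large bonus, so it is retried even when its current estimate is lower; as pulls accumulate, the bonus shrinks and the estimate dominates.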
Useful pattern. One thing from Anthropic's pricing model that bit us: cache write/read tokens are priced differently from base input/output, so a single USD estimate can under-reserve after cache misses. Did you consider reserving by token class first (input/output/cache-write/cache-read) and converting to USD at commit?
Curious whether that reduced false-safe budget checks in your runs.
Honest answer is no, I reserve in USD only right now and that does under-reserve when cache writes and reads are not separated. Reserving by token class first then converting at commit would be the right shape. Adding it to the next version.
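A rough sketch of the shape described here: hold tokens per class, convert to USD only when checking the cap and at commit, and return the per-class delta so under-reserves are visible. The prices are placeholders, not real provider rates, and the class and function names are hypothetical. The sketch is single-threaded; a shared pool would need a lock.

```python
# Placeholder USD prices per million tokens. NOT real provider rates.
PRICE_PER_MTOK = {
    "input": 1.00,
    "output": 5.00,
    "cache_write": 1.25,
    "cache_read": 0.10,
}


def to_usd(tokens):
    """Convert a {token_class: count} mapping to USD."""
    return sum(PRICE_PER_MTOK[k] * n for k, n in tokens.items()) / 1_000_000


class TokenClassPool:
    def __init__(self, cap_usd):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0
        self.held = {}   # reservation id -> {token_class: count}

    def reserve(self, rid, tokens):
        held_usd = sum(to_usd(t) for t in self.held.values())
        if self.spent_usd + held_usd + to_usd(tokens) > self.cap_usd:
            return False
        self.held[rid] = dict(tokens)
        return True

    def commit(self, rid, actual_tokens):
        reserved = self.held.pop(rid)
        self.spent_usd += to_usd(actual_tokens)   # USD conversion at commit
        classes = set(reserved) | set(actual_tokens)
        # Positive delta = under-reserved for that class.
        return {k: actual_tokens.get(k, 0) - reserved.get(k, 0) for k in classes}
```

A call reserved as a cache hit that turns into a cache miss shows up as a negative cache_read delta and a positive cache_write delta, rather than disappearing into one USD figure.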
Thanks for confirming this. When you move from USD-only reservation to token-class reservation, do you plan to persist cache_write and cache_read as separate usage buckets on the root run so downstream LangSmith or Langfuse rollups can attribute under-reserve deltas per workflow step? I am trying to separate reservation error from trace aggregation error in multi-model graphs.
Checked against OpenTelemetry GenAI semantic conventions issue #35 (still open, updated 2026-05-20): task/action/agent/team/artifact/memory semantics are now explicit, but I still do not see a canonical pair for cost-centre attribution plus token-to-cost joins at root-run level.
Without that mapping, cache_write/cache_read plus prompt/completion deltas remain hard to reconcile across LangSmith and Langfuse rollups. Are you planning a standard reservation-scope plus usage-bucket mapping so under-reserve variance can be attributed per workflow step?
Helpful, thank you for confirming this. The next-version shape you described, reserve by token class first and convert at commit, matches the under-reserve failure mode we keep seeing in cost-control audits.
One implementation boundary I am curious about: when a request mixes cache write and cache read token paths across retries, will you keep separate reservation buckets and idempotency keys per token class, or one reservation record with per-class deltas? We saw reconciliation drift with the single-record pattern.
Thanks, this is very useful. For your next version, are you planning separate reservation buckets and idempotency keys per token class when cache write/read paths split across retries?
The thread keeps reaching for better reservation math, but the incident isn't really an algo bug. It's a missing budget and eval layer before traffic hits the model. The pattern that actually catches this kind of blowup is a cost-cap per feature flag plus token-spend alerts scoped per tenant or route, not per worker. Worker-level fairness is fine until one route quietly eats most of the spend; the cap is what stops the fire, the bandit just optimizes how fast it burns.
Fair point on the layer. In my case there was no routing layer, just three workers in one process, so the pool is the cap at the right scope for that. Once you have tenants or routes the cap belongs higher up, agreed. Bandit at the worker level is downstream of that anyway.
Yeah, even at three workers there's a subtle gap: shared pool stops the fire but doesn't tell you which worker burned it. Tagging reserve() with a worker_id (or prompt class) makes the post-mortem one query instead of three log greps. Worth it before you scale past the single-process case.
Agreed, and it's cheap to add. reserve() already takes the call, tagging it with worker_id or prompt class turns the post-mortem into one query instead of grepping three logs. Doing this before the single-process assumption breaks. Thanks.
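One way the tagging could look, as a sketch: every recorded spend carries free-form tags, and the post-mortem becomes a group-by. The SpendLedger name and its methods are hypothetical, and amounts are integer cents to keep the arithmetic exact.

```python
from collections import defaultdict


class SpendLedger:
    """Append-only record of spend with attribution tags."""

    def __init__(self):
        self.rows = []   # list of (cents, tags) tuples

    def record(self, cents, **tags):
        # Tags are whatever dimensions the caller has: worker_id, prompt_class, ...
        self.rows.append((cents, tags))

    def by(self, dimension):
        """Total cents per value of one tag, e.g. by('worker_id')."""
        totals = defaultdict(int)
        for cents, tags in self.rows:
            totals[tags.get(dimension, "untagged")] += cents
        return dict(totals)
```

Calling `ledger.by("worker_id")` after an incident answers which worker burned the budget without touching the logs.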
The two-phase reserve+commit is the right primitive, and the worker-race story is exactly the failure mode the single-phase version loses. Two things worth flagging that I hit running this in production and ended up baking into the metering layer.
The estimate-then-actual delta gets ugly for streaming calls. Your two-phase catches concurrent kickoffs, but if three workers all reserve estimates of 0.02 each and start streaming responses that actually run 0.08 each, the budget shows headroom the whole time the calls are in flight and you only learn the real cost on commit. By then you're 3-4× over. The fix I landed on: reserve a worst-case token budget per turn, not a midpoint estimate, and downgrade to a cheaper model before the call if worst-case won't fit. Pessimistic reservation costs you headroom on the happy path but it's the only way the cap is real on streaming.
Second one: a flat pool treats all calls as equal priority. In any system with user-facing requests plus background work (retries, indexing, periodic jobs), you want priority categories on the same pool. Background work reserves from leftover after user-facing demand. Saves you the case where a background retry loop is technically inside cap but starves a user turn that needed the budget.
The afternoon-of-work claim holds though. The cost of not having this was the $40. The cost of getting it nuanced enough for production is maybe a week. Worth the week.
This matches what I hit. Reserving the midpoint is what makes the cap fake on streaming, you only learn real cost on commit and by then you're 3-4x over. Worst-case-per-turn reservation plus downgrading when it won't fit is the same shape I landed on. Priority pool I don't have yet, background retries can starve a user turn while under cap. Adding leftover-only reservation for background work. Best comment on the post, thanks.
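The two ideas in this exchange can be sketched together. The worst-case figure assumes every output token up to max_tokens is billed, and the leftover-only rule is interpreted here as a floor of budget that background work can never touch. Both the floor interpretation and all names are assumptions; amounts are integer micro-dollars.

```python
def worst_case_cost(input_tokens, max_tokens, in_price, out_price):
    """Upper bound for one turn, in the same unit as the per-token prices."""
    return input_tokens * in_price + max_tokens * out_price


class PriorityPool:
    """One pool, two tiers: user-facing calls and background work."""

    def __init__(self, cap, user_floor):
        self.cap = cap
        self.user_floor = user_floor   # budget background work may never consume
        self.used = 0                  # committed plus in-flight reservations

    def reserve(self, amount, background=False):
        # Background calls only see the cap minus the user-facing floor.
        limit = self.cap - self.user_floor if background else self.cap
        if self.used + amount > limit:
            return False
        self.used += amount
        return True
```

Reserving the worst case costs headroom on the happy path, but a streaming call can no longer finish above its hold, and a background retry loop stops being admitted while user-facing budget is still available.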
Convergent independent landings on this stuff is usually a sign the primitive is right. The leftover-only reservation for background is good - what I ended up adding on top is a hard ceiling on background-pool's per-turn delta. Background can claim leftover but never more than X% of an active user-turn's reservation, even if leftover is huge. Otherwise a user with a long quiet stretch suddenly sees a 4x latency hit when background catches up and the model's queued in front of them.
The thing I don't have a clean answer for: starvation on background tasks that are technically optional but operationally important. Logging, telemetry, derived caches. They starve, user doesn't notice for a week, then a metric is wrong. Curious if you've thought about that tier.
This is a great concrete example of why “log the token cost after the call” is necessary but not sufficient. The reserve/commit pattern is especially useful when multiple workers share a pool.
One thing I’d add for production teams is tagging each reservation with dimensions like user_id, project_id, agent, model, and retry_reason, so the same budget guard can later answer “who/what burned the spend?” rather than only “did we exceed the cap?” Curious if you’ve considered exposing that attribution layer alongside the shared cap.
The unbounded tool loop is the classic failure mode: the model keeps calling because nothing in the prompt defines "done" in terms the model can evaluate, so it defaults to continuing. What makes it expensive is that LLM calls scale with context length, and each loop iteration grows the context with the previous tool outputs.
The fix you landed on (explicit stop conditions in the prompt) is right, but there's a deeper design principle here: agents need a success criterion that's as precise as the task description. "Summarize the codebase" is a task. "Stop after producing one summary under 500 words" is a task with a termination condition. Without the second part, you're relying on the model's judgment about when enough is enough — which is exactly where cost surprises come from.
Yeah this is the root cause for me. The prompt did not have a stop condition the model could check against, so it kept calling. Budget caps are the seatbelt for when that fails. The prompt fix is the actual fix.
The $40 surprise is a symptom of a deeper problem: LLM token budgets don't behave like compute budgets. With CPU, you can profile once and predict. With LLMs, a small prompt change can cascade into 10x more generation tokens depending on how the model interprets its instruction boundary.
What's helped me: treating the system prompt as a contract with explicit output constraints. Instead of "summarize this codebase," something like "summarize in at most 3 bullet points, each under 20 words." The model still interprets creatively, but you've bounded the worst case. You can also add a token-budget field to the API call itself on providers that support it.
Have you experimented with soft constraints in the prompt vs. hard limits via API parameters? Curious which gave you more predictable behavior.
Hard limits via max_tokens were more predictable, but only as a backstop. The prompt contract is what shaped behavior. "At most 3 bullets, under 20 words each" changed the generation; max_tokens just stopped the runaway. Soft constraints move the median, the hard cap bounds the tail. You need both.
This reserve-then-commit pattern is a really practical fix for the "polite local cap, broken global cap" failure mode. One thing I've found useful in similar agent setups is separating a hard shared budget from a smaller per-run circuit breaker, because retry storms often show up as bursts long before the total daily budget is exhausted.
I'm also curious whether you've considered attaching the reservation to an idempotency key or request fingerprint. That makes it easier to reconcile duplicate retries after timeouts, especially when workers can crash between reserve and commit and you need a cleanup story for stale reservations.
Yes, the idempotency key is in the next version for exactly that. Crash between reserve and commit was leaving phantom holds. Each reservation now carries a fingerprint plus a TTL, and a sweeper reclaims anything that never commits. The per-run circuit breaker is a good separate axis, the daily pool was too coarse for retry-storm bursts.
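A sketch of how the fingerprint, TTL, and sweeper might fit together. It is not the author's next version: the ReservationTable name, the injectable clock, and the method names are all assumptions made so the behavior can be shown without waiting on real time.

```python
import time


class ReservationTable:
    """Holds keyed by an idempotency key, each with an expiry time."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.holds = {}   # key -> (amount, expires_at)

    def reserve(self, key, amount):
        # A retry with the same key reuses the existing hold instead of
        # stacking a second one on top of it.
        if key not in self.holds:
            self.holds[key] = (amount, self.clock() + self.ttl)
        return self.holds[key][0]

    def commit(self, key):
        """Release the hold; False means it was already swept or never existed."""
        return self.holds.pop(key, None) is not None

    def sweep(self):
        """Reclaim holds whose worker never committed (for example after a crash)."""
        now = self.clock()
        stale = [k for k, (_, expires_at) in self.holds.items() if expires_at <= now]
        for k in stale:
            del self.holds[k]
        return stale

    def held(self):
        return sum(amount for amount, _ in self.holds.values())
```

A commit that returns False after a sweep is the signal that a slow call outlived its TTL, which is worth logging separately from a crash.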
$40 in 18 minutes from one bad prompt is the perfect micro-case for why the cost model is so dangerous: a single vague instruction sent the agent into a loop, and there's no friction at the moment of spend to stop it. You find out after, not during.
Two structural defenses fall out of this: (1) a hard budget cap / per-task ceiling so a runaway loop physically can't drain $40 silently, and (2) not running that exploratory loop on the most expensive model in the first place - a cheaper model fails cheaper while you figure out the right prompt, then you escalate once you know what you want. The expensive mistakes almost always happen on the premium model during the figuring-it-out phase. Brutal but instructive - this kind of post does more to change behavior than any pricing page. Hope you clawed the $40 back.
Primary-source note you may find useful: in BerriAI/litellm PR #26471 (Apr 25, 2026), the team added per-model per-team-member budget enforcement with 429 stops, then reviewers pushed hardening on atomic DB increments and reseeding counters on cache miss after restarts.
If you are reproducing this $40 burn pattern, two checks are worth running in your stack: (1) per-model caps still isolate user/team spend under concurrency, and (2) cache restart/flush does not reset enforcement counters.
Have you tested those two failure modes yet?
Useful reference. Haven't tested those two yet, the restart counter-reseed one worries me most since an in-memory counter silently resets enforcement on a crash. On the token-class question from the other thread: yes, separate buckets per class reserved first and converted to USD at commit, since USD-only under-reserves after cache misses.
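To illustrate the restart concern, here is a minimal sketch that keeps the counter in SQLite instead of process memory and does the cap check and the increment in one UPDATE statement. It is not the litellm implementation; the table layout and function names are hypothetical, and amounts are integer cents.

```python
import sqlite3


def open_counter(path, cap_cents):
    """Open (or create) a single-row budget table that survives restarts."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS budget ("
        "id INTEGER PRIMARY KEY CHECK (id = 1), "
        "spent INTEGER NOT NULL, cap INTEGER NOT NULL)"
    )
    # INSERT OR IGNORE keeps the existing spent value on reopen.
    db.execute("INSERT OR IGNORE INTO budget (id, spent, cap) VALUES (1, 0, ?)",
               (cap_cents,))
    db.commit()
    return db


def try_spend(db, cents):
    """Atomically add to spent only if the result stays within the cap."""
    cur = db.execute(
        "UPDATE budget SET spent = spent + ? WHERE id = 1 AND spent + ? <= cap",
        (cents, cents),
    )
    db.commit()
    return cur.rowcount == 1   # 0 rows updated means the cap would be exceeded
```

Because the check and the increment are the same statement, two workers cannot both pass the check on a stale read, and reopening the database after a crash resumes from the persisted total rather than zero.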
We hit the same shared-cap problem on our Rust orchestration layer. Per-call ceilings looked safe, but a multi-agent fanout still exploded until we wrapped a global token budget around the runtime. The fanout was the killer: one parent agent spawning 6 children, each spending under their cap but together blowing through 5x. We now treat the budget like a memory allocator with explicit ownership semantics. Each child requests from the parent, parent denies past 80% of envelope
The fanout is the case a flat pool misses entirely. Six children under cap, 5x over together, parent never sees it until commit. Treating it like an allocator with parent-owned envelopes is the right model. I've used a single shared pool, fine in one process but breaks the moment there's a parent/child tree. Where do you put the deny threshold, per-child or total envelope?
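The allocator framing can be sketched as a parent-owned envelope that children must request from. The thread leaves open whether the deny threshold is per-child or on the total envelope; this sketch assumes the total, with the 80% figure from the comment above as the default. Names are hypothetical and amounts are integers.

```python
class Envelope:
    """Budget owned by a parent agent; children borrow from it explicitly."""

    def __init__(self, total, deny_percent=80):
        self.total = total
        self.deny_at = total * deny_percent // 100   # parent stops granting here
        self.granted = {}                            # child id -> amount outstanding

    def request(self, child_id, amount):
        outstanding = sum(self.granted.values())
        if outstanding + amount > self.deny_at:
            return False
        self.granted[child_id] = self.granted.get(child_id, 0) + amount
        return True

    def release(self, child_id, unused):
        """Child hands back what it did not spend."""
        self.granted[child_id] -= unused
```

With an envelope of 600 and six children asking for 100 each, the first four are granted and the last two are denied, so the fanout is visible to the parent at request time instead of at commit.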