Give ten agents a shared, metered tool — a paid search or research API where every call is real
money — and you've handed ten of them the same company credit card. Each one reasons "I'll just run
a quick search." You find out the total on the invoice.
Anthropic's own multi-agent write-up clocks a fleet at roughly 15x the tokens of a single chat.
That's the token pool. The paid external tools are the line item the orchestration playbooks skip —
and it's the one that shows up in actual dollars. I run four production projects on Claude Code with
no external agent framework, and this is the pattern that keeps that line item from surprising me.
The fix that doesn't work
You can't solve this by asking agents to be frugal. "Be mindful of the budget" in a prompt is a
wish, and an LLM will sail past "max 8 searches" the moment the task still feels unfinished. If your
budget lives in the prompt, you don't have a budget — you have a hope.
The split most write-ups collapse
Cost governance is two layers, and conflating them is why prompt-level "budgets" fail:
1. Enforcement is deterministic and lives below the model. A hard counter in the harness owns the
paid credential and denies the call after N. The model can't argue with it, jailbreak it, or sneak
"just one more search" past it. This is what actually bounds the dollars — and it's plain middleware,
not an agent. A dumb metered proxy in front of the API does this fine.
2. Judgment is the half a proxy can't do. A gateway can cap spend, but it can't decide whether
this task deserves to spend at all. That decision is model-shaped, and it's where the agent earns
its place:
- A novelty-gate. Most tasks don't qualify to spend: CRUD, mechanical edits, known facts → zero research. The biggest budget win isn't a smaller cap — it's that the majority of work never reaches the paid tool. A proxy can't make that call; it has no notion of "is this architectural or trivial." The agent does.
- A tier + cap per qualifying call (quick-fact vs deep-dive), declared as policy that the harness then enforces mechanically — the agent proposes the tier, the counter imposes the limit.
- Honest degradation. When the paid source fails, the agent falls back to a free one and flags the reliability downgrade in its output — so a weaker citation never silently poses as authoritative.
So the division is clean: the gateway enforces, the agent judges.
The novel part isn't "centralize the credential" — that's least-privilege, decades old, and
centralizing alone doesn't cap anything (N callers can still demand N searches through one door). The
leverage is pairing a deterministic cap with model-shaped judgment about whether to spend at all —
with the gate defaulting to "no."
The tradeoff nobody mentions
If the budget is a single shared number across parallel agents, you're holding a distributed-counter
problem: two agents can both read "budget remaining," both decide they're clear, and both spend —
blowing the cap. You have two honest options. The simplest is to serialize every paid call through
the one budgeted agent: correct and easy to reason about, but that agent becomes a throughput
bottleneck for the whole fleet. The other is an atomic reservation — each call reserves its slice of
the quota before spending and releases the remainder after, which keeps agents parallel at the cost
of a little more plumbing. For metered money, paying either price beats discovering the race
condition on the invoice — just don't pretend it isn't there.
The takeaway
Most "agent budget" advice optimizes the cap. The leverage is one level up: a gate that makes most
work never spend at all, and a hard counter the model can't talk its way past. Stop asking your fleet
to be frugal. Give exactly one agent a budget the harness enforces — and the judgment to know when
not to use it.
Top comments (0)