- Book: AI Agents Pocket Guide
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
An engineer I work with walked me through a "multi-agent" customer-support system their team had shipped. Five agents: triage, retrieval, drafting, sentiment, escalation. They were unhappy with the latency, the token bill was higher than expected, and the eval pass rate had dropped after a model upgrade.
Their fix, after a few hours of pulling it apart: delete four of the five agents.
The remaining single-agent system was several times faster, ran roughly an order of magnitude cheaper, and the eval pass rate recovered above where it had been before. Same prompts, same tools, same retrieval, collapsed into one loop. The "multi-agent architecture" had been an org chart pasted into a Python file. (Numbers as reported by the team, not a controlled benchmark; your mileage will vary, but the shape generalizes.)
Anthropic's own research-system writeup is blunt about the cost shape: their multi-agent system uses about 15x more tokens than chat, and agents already use 4x more than chat. That math only pencils out when the workload genuinely needs the extra compute. Most workloads do not.
Here are the three signals that say it does. If none of them fire for your problem, the cheapest, fastest, most debuggable system is one agent in a loop.
Signal 1: tool-call concurrency that exceeds one model's context budget
A single agent runs sequentially by default. It calls a tool, reads the result, plans, calls the next tool. If you have a task that fans out to 40 web searches, 40 result documents averaging 8k tokens each, and a final synthesis step, you have 320k tokens of intermediate state competing with your prompt and your output budget. Even a 200k-context model like Claude Sonnet 4.5 or older Opus 4.x is blown out before the synthesis runs, and a 1M-context model still pays for every token kept in scope.
Multi-agent fixes this with token isolation: a lead agent dispatches subagents that each carry their own context window. Subagent #7 reads its 8k document, returns a 400-token summary to the lead, and the document never enters the lead's context. The same pattern is what let Anthropic's research system, with Claude Opus 4 as lead and Sonnet 4 subagents, outperform single-agent Claude Opus 4 by 90.2% on their internal research eval. In their analysis, token usage by itself explained 80% of the variance.
The signal: your task's intermediate evidence exceeds your answering context. Not your final answer. The stuff you need to look at before you can answer. If a single agent would need to keep 30 documents in scope to write 3 paragraphs, you have the signal. If it would need to keep 3 documents in scope to write 30 paragraphs, you do not.
Concrete check: run the task with a single agent that logs token counts per step. If max(context_used) over the run exceeds 60% of your model's window before the final step, you have the signal.
Signal 2: distinct policy/safety domains per role
A single agent inherits all permissions of its tool list. If it can read the customer database and write to the billing API, every tool call needs to be safe against the worst-case prompt injection in any input it touches. That works for narrow tool surfaces. It falls apart when one role is "summarize untrusted email" and another role is "issue refunds."
Multi-agent gives you a clean break: a reader agent with read-only tools and access to untrusted text, and a writer agent with write tools and only the reader's structured output as input. The writer never sees the email. A prompt-injection payload in the email cannot reach the refund tool because the refund tool isn't in scope where the payload is.
This is the strongest case for multi-agent in production today. It's also the case people skip because it doesn't feel like "agent design". It feels like permissions. But "two principals with different blast radii" is the textbook reason to split a system into separate processes.
The signal: you can name two roles such that the union of their tool sets is something you'd never grant to a single principal. If you'd never give one human both abilities, don't give one agent both either.
Signal 3: genuinely parallel sub-tasks with independent state
The keyword is independent. "Look up the board members of all 500 S&P IT companies" is independent: you can fire 500 lookups, the answer to one doesn't change another, and you can join results at the end. "Refactor the auth module across the codebase" is not. Every change influences every other change, and the agents would spend more tokens reconciling than refactoring.
Anthropic's own writeup acknowledges this: most coding tasks involve fewer truly parallelizable subtasks than research, and current LLMs are not good at coordinating with each other in real time. The multi-agent wins are in research, broad-fanout retrieval, and document synthesis. The losses are in code, planning, and anything where state has to stay coherent.
The signal: you can write the task as a map, then a reduce, with no loop between them. If you can't, you're going to spend the multi-agent token premium on coordination overhead and earn nothing.
A 50-line single-agent that looks multi-agent (and isn't)
Here's the kind of task that pattern-matches to "multi-agent" and shouldn't. The user asks: "For our enterprise customer Acme, find the open support tickets, identify the ones blocking renewal, draft a status update, and slack the AE."
It looks like four roles: search, classify, draft, notify. It is one loop with four tools.
import anthropic
client = anthropic.Anthropic()
TOOLS = [
{
"name": "search_tickets",
"description": "Search support tickets by account.",
"input_schema": {
"type": "object",
"properties": {
"account": {"type": "string"},
"status": {"type": "string"},
},
"required": ["account"],
},
},
{
"name": "get_renewal_context",
"description": "Renewal date and risk flags for an account.",
"input_schema": {
"type": "object",
"properties": {"account": {"type": "string"}},
"required": ["account"],
},
},
{
"name": "send_slack",
"description": "Send a message to an AE's Slack DM.",
"input_schema": {
"type": "object",
"properties": {
"ae_email": {"type": "string"},
"body": {"type": "string"},
},
"required": ["ae_email", "body"],
},
},
]
SYSTEM = (
"You handle account status updates. Search tickets, "
"cross-reference renewal context, draft a concise update, "
"send it on Slack. Stop after sending."
)
def run(user_msg: str) -> str:
msgs = [{"role": "user", "content": user_msg}]
while True:
r = client.messages.create(
model="claude-opus-4-7",
max_tokens=2048,
system=SYSTEM,
tools=TOOLS,
messages=msgs,
)
msgs.append({"role": "assistant", "content": r.content})
if r.stop_reason == "end_turn":
return r.content[0].text
tool_results = [
{
"type": "tool_result",
"tool_use_id": b.id,
"content": dispatch(b.name, b.input),
}
for b in r.content if b.type == "tool_use"
]
msgs.append({"role": "user", "content": tool_results})
dispatch is a regular Python function that maps tool name to implementation. The model picks the order. The model decides when to stop. There is no orchestrator, no message bus, no shared scratchpad. One agent, four tools, one context window, full traceability.
If you tried to "split this up," you'd reach for a graph framework, define four nodes, write transitions, debug why the drafting node hallucinates the AE's name because you forgot to thread it through. You'd ship in two weeks instead of two days. The eval would be worse. The bill would be 4x.
Why people reach for multi-agent anyway
Three reasons, all bad on their own.
The org-chart projection. "We have a triage team and a drafting team, so the agent system should too." Org charts exist because humans don't share context cheaply. LLMs do, inside their window. Modeling the org chart as code wastes tokens on coordination that the human org needs and the model does not.
The "specialist" intuition. "A drafting agent will draft better than a generalist." Empirically, a single Sonnet 4 with the drafting instructions in its system prompt drafts as well as a "drafting agent" with the same instructions in its system prompt. The specialization is in the prompt, not the process boundary.
The framework gravity. LangGraph, CrewAI, and AutoGen are excellent at multi-agent because that's what they're for (side-by-side comparison if you want it). If you've already learned LangGraph, every problem looks like a graph. The fact that a framework lets you build something doesn't mean the problem requires it. The single-agent loop above is 50 lines of stdlib plus the Anthropic SDK.
The decision rule
Run this checklist before you split:
- Will a single agent's intermediate context exceed 60% of the model window?
- Are there two roles whose tool sets you'd never grant to one principal?
- Is the task a clean
map→reducewith independent sub-tasks?
Zero hits: single agent. Build the loop. Add tools. Ship.
One hit: probably single agent, with an extra mitigation. Hit on signal 2? Use a "tool-call mediator" that re-validates inputs before write tools run, still inside one loop. Hit on signal 1? Stream summaries instead of full documents back into context.
Two or three hits: multi-agent earns its premium. Use the framework that matches your control needs. LangGraph for explicit state machines, CrewAI for fast role-based prototyping, AutoGen (and its AG2 fork) for conversational coordination. Budget the multi-agent token premium upfront. Anthropic reported ~15× chat for their research system, and your number will land somewhere in that order of magnitude, so the finance team isn't surprised.
The mistake is reading "multi-agent" as a step up from "single-agent" on a sophistication ladder. It is not a ladder. It is a tradeoff between coordination cost and isolation benefit, and most production tasks sit on the coordination-heavy side. The teams that know this ship single-agent systems with confident tool design and stop trying to recreate the org chart in YAML.
If this was useful
The AI Agents Pocket Guide covers the loop, the tool surface, the failure modes, and the production patterns that decide single vs. multi without the framework marketing. If you're choosing an architecture this week and the team is already arguing about LangGraph nodes, it's the short answer.
Top comments (0)