SapotaCorp

Posted on May 24 • Originally published at sapotacorp.vn

Multi-agent: what 5x the cost actually buys you

A SaaS founder forwarded us a vendor invoice with one comment: "this is 6.4 times what they projected." The system was a "multi-agent crew" that a consultancy had pitched as the upgrade from his single-agent customer support chatbot. The pitch had projected $0.04 per query. Production was running at $0.255.

The accuracy lift over the previous single-agent setup was 4 percentage points (from 78% to 82%) on the team's eval set. The latency had gone from 4 seconds to 19 seconds at the 95th percentile. Customer satisfaction had dropped because users were giving up before the response arrived.

The diagnosis was that the system did not need most of the agents the vendor had built. Multi-agent is a real and useful pattern. It is also frequently the wrong tool for the job, sold to teams who do not have the framework to push back. Here is the math we walk founders through.

The actual cost multiplier

A single-agent system handling a customer query costs roughly:

1 LLM call to understand the query: $0.002
1 to 2 tool calls (RAG search, account lookup): $0.001
1 LLM call to generate the response: $0.003

Total: around $0.006 per query for a well-tuned single agent.

A multi-agent system for the same query, depending on how the vendor designed it:

1 router LLM call to dispatch: $0.002
3 to 5 specialist agents, each running its own ReAct loop with 2 to 3 LLM calls: $0.020 to $0.060
1 synthesizer LLM call to merge results: $0.005
1 critic LLM call to validate output: $0.005

Total: $0.032 to $0.072 per query. That is 5x to 12x the single-agent cost.
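The arithmetic above can be reproduced in a few lines. This is a minimal sketch that uses only the per-call figures quoted in this section. The dictionary keys and function names are ours for illustration, and the specialist line is taken as the quoted $0.020 to $0.060 total rather than derived per agent.

```python
# Per-query cost figures quoted in the text (USD). Key names are illustrative.
SINGLE_AGENT = {
    "understand_query_llm": 0.002,
    "tool_calls": 0.001,
    "generate_response_llm": 0.003,
}

# (low, high) ranges for the multi-agent design described above.
MULTI_AGENT = {
    "router_llm": (0.002, 0.002),
    "specialist_agents": (0.020, 0.060),
    "synthesizer_llm": (0.005, 0.005),
    "critic_llm": (0.005, 0.005),
}


def single_agent_cost() -> float:
    """Sum the single-agent line items."""
    return round(sum(SINGLE_AGENT.values()), 6)


def multi_agent_cost_range() -> tuple[float, float]:
    """Sum the low and high ends of every multi-agent line item."""
    low = sum(lo for lo, _ in MULTI_AGENT.values())
    high = sum(hi for _, hi in MULTI_AGENT.values())
    return round(low, 6), round(high, 6)


if __name__ == "__main__":
    single = single_agent_cost()
    low, high = multi_agent_cost_range()
    print(f"single agent: ${single:.3f}")
    print(f"multi-agent:  ${low:.3f} to ${high:.3f}")
    print(f"multiplier:   {low / single:.1f}x to {high / single:.1f}x")
```

Swapping in your own vendor's per-call prices is the quickest way to check whether a quoted per-query figure sits at the low end of its own range.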

The vendor projection of $0.04 was at the optimistic end of this range. Production hit $0.255 because the actual queries were more complex than the test set, the agents looped more often, and a few popular query types triggered cascading sub-agent calls.

This is the math that founders are not shown during the pitch. The cost projection assumes best-case behavior. Production behavior is usually 3 to 6 times worse than that.
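To make the gap concrete, the overrun in this case can be computed directly from figures already quoted: the $0.04 projection, the $0.255 production cost, and the roughly $0.006 single-agent estimate. A minimal sketch; the function name is hypothetical.

```python
def cost_ratio(actual: float, reference: float) -> float:
    """How many times the reference cost the actual cost turned out to be."""
    if reference <= 0:
        raise ValueError("reference cost must be positive")
    return actual / reference


if __name__ == "__main__":
    projected = 0.04       # vendor pitch, USD per query
    production = 0.255     # observed in production, USD per query
    single_agent = 0.006   # well-tuned single-agent estimate from above

    print(f"vs projection:   {cost_ratio(production, projected):.1f}x")
    print(f"vs single agent: {cost_ratio(production, single_agent):.1f}x")
```

The first ratio is the 6.4x on the founder's invoice. The second compares production spend with the single-agent estimate, which is the comparison that matters when the accuracy lift is small.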

When the cost is actually justified

Multi-agent systems do produce better outputs, but only on certain task profiles. The accuracy lift is roughly:

Homogeneous tasks (FAQ, single-domain support, simple Q&A): 0 to 5 percentage points. Often 0. The single-agent baseline is already as good as the multi-agent system, because there is no specialization to distribute.
Heterogeneous tasks with clear specialization (research with different domains, code review with different concerns, due diligence with different angles): 15 to 30 percentage points. The math works here because each agent genuinely contributes a perspective the others cannot.
Multi-step reasoning tasks (synthesis, planning, complex decision-making): 10 to 20 percentage points. Worth it when the task is genuinely complex.

The founder's customer support use case fell into the first category. The queries were 90% billing, account, and product questions. There was no genuine specialization that justified four specialist agents. A single well-prompted agent with three tools (KB search, account lookup, billing API) would have hit the same 82% accuracy at a fraction of the cost.

The five-question test Sapota uses

Before recommending multi-agent, we ask:

Does the task have three or more genuine specialties? Specialty means "this agent uses different tools, different knowledge, or different reasoning patterns than the others." If the agents are doing similar work with slightly different prompts, that is not specialization. That is duplication.

Are there parallel sub-tasks? If the work is sequential (A then B then C), multi-agent saves no time over a multi-step single agent. The latency multiplier kicks in immediately. Multi-agent only saves time when sub-tasks can run concurrently.

Does the output need verification from a different angle? Debate patterns (proposer plus critic) earn their cost when the cost of being wrong is high: legal review, medical second opinions, financial recommendations. They do not earn their cost when the user just wants a quick answer.

Do different sub-tasks need different models? Vision tasks need a vision model, and code tasks benefit from code-tuned models. If the entire workflow runs on the same model with the same parameters, the multi-agent split is mostly cosmetic.

Does the team have the operational capacity? Multi-agent systems are 3 to 5 times harder to debug than single-agent systems. Each agent has its own failure modes, the inter-agent communication has its own failure modes, and the cascade behaviors are non-obvious. A two-engineer team usually cannot maintain a five-agent system.

If three or more answers are no, do not build multi-agent. Use a well-prompted single agent with the right tools.
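The test above is simple enough to encode as a checklist. The sketch below is one way to do it. The class and field names are hypothetical, and the "multi-agent candidate" label for fewer than three "no" answers is our wording, since the rule above only says when not to build.

```python
from dataclasses import dataclass, fields


@dataclass
class MultiAgentChecklist:
    # One boolean per question in the test above; True means "yes".
    three_or_more_specialties: bool
    parallel_subtasks: bool
    needs_independent_verification: bool
    needs_different_models: bool
    team_has_ops_capacity: bool


def count_no(answers: MultiAgentChecklist) -> int:
    """Number of questions answered 'no'."""
    return sum(1 for f in fields(answers) if not getattr(answers, f.name))


def recommend(answers: MultiAgentChecklist) -> str:
    """Apply the rule: three or more 'no' answers means stay single-agent."""
    return "single-agent" if count_no(answers) >= 3 else "multi-agent candidate"


if __name__ == "__main__":
    # Illustrative answers for a homogeneous support bot, not recorded client data.
    support_bot = MultiAgentChecklist(
        three_or_more_specialties=False,
        parallel_subtasks=False,
        needs_independent_verification=False,
        needs_different_models=False,
        team_has_ops_capacity=True,
    )
    print(count_no(support_bot), recommend(support_bot))
```

Writing the answers down as explicit booleans also gives the team something concrete to challenge when a vendor claims a task needs specialists.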

What we shipped for the founder

We replaced the multi-agent system with a single-agent setup:

One specialist agent role: "Senior customer support engineer for B2B SaaS"
Three tools: knowledge base search (RAG), account lookup (CRM API), billing data lookup (internal API)
Faithfulness gate before responding (catches hallucinations)
Conversation memory (sliding window, last 5 turns)
Fallback to human handoff for low-confidence responses
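The shape of that setup can be sketched in a few dozen lines. This is an illustration under stated assumptions, not the code we shipped: the model call, the faithfulness check, and the tools are injected as plain callables, the 0.7 confidence threshold is a made-up value, and the sketch calls every tool on every query for simplicity, where a real agent would let the model choose.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable

ROLE = "Senior customer support engineer for B2B SaaS"
HANDOFF = "Routing you to a human agent."


@dataclass
class Draft:
    text: str
    confidence: float  # 0.0 to 1.0, reported by the (hypothetical) generator


@dataclass
class SupportAgent:
    # Tool name -> callable that returns context text for a query.
    tools: dict[str, Callable[[str], str]]
    # generate(role, history, query, context) -> Draft (stands in for the LLM call).
    generate: Callable[[str, list, str, dict], Draft]
    # is_faithful(answer, context) -> bool (stands in for the faithfulness gate).
    is_faithful: Callable[[str, dict], bool]
    min_confidence: float = 0.7  # assumed threshold, tune on your own eval set
    # Sliding window: only the last 5 turns are kept.
    memory: deque = field(default_factory=lambda: deque(maxlen=5))

    def answer(self, query: str) -> str:
        context = {name: tool(query) for name, tool in self.tools.items()}
        draft = self.generate(ROLE, list(self.memory), query, context)
        ok = (draft.confidence >= self.min_confidence
              and self.is_faithful(draft.text, context))
        reply = draft.text if ok else HANDOFF  # low confidence or unfaithful -> human
        self.memory.append((query, reply))
        return reply


if __name__ == "__main__":
    agent = SupportAgent(
        tools={
            "kb_search": lambda q: "Invoices are issued on the 1st of each month.",
            "account_lookup": lambda q: "plan=team",
            "billing_lookup": lambda q: "last_invoice=paid",
        },
        # Toy generator: echoes the KB hit with high confidence.
        generate=lambda role, history, q, ctx: Draft(ctx["kb_search"], 0.9),
        # Toy gate: the answer must appear verbatim in some tool output.
        is_faithful=lambda text, ctx: any(text in v for v in ctx.values()),
    )
    print(agent.answer("When are invoices issued?"))
```

Every query makes one generation call plus the tool lookups, so per-query cost stays bounded in a way that nested agent loops do not.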

Cost dropped from $0.255 to $0.018 per query (14x reduction). Latency dropped from 19 seconds to 3.2 seconds at p95. Accuracy stayed at 82%. Customer satisfaction recovered within two weeks.

The vendor had built a sophisticated multi-agent system that solved a problem the customer did not have. The actual problem was a single agent that needed better tools and a faithfulness check. Sometimes the right architecture is the boring one.

When multi-agent is genuinely the right call

We recommend multi-agent for:

Research synthesis: a researcher agent finds sources, an analyst agent extracts insights, a writer agent synthesizes a report. Each agent has different prompting needs and different tool access. Latency is 30 to 60 seconds, which is acceptable for batch report generation.
Code review: a security agent looks for vulnerabilities, a performance agent checks complexity, a maintainability agent reviews structure. Different agents catch different classes of issues. The accuracy lift over a single reviewer agent is dramatic.
High-stakes decisions (investment, hiring, legal): a debate pattern with a proposer and a critic forces the system to defend its reasoning. Cost is 2 to 3x a single agent, but the auditability and quality lift justify it for decisions that matter.
Multi-domain customer support at scale (banking with billing, fraud, lending, mortgage as separate domains, each with their own KB and tools): routing to specialist agents beats trying to maintain a single mega-agent that knows everything.

The pattern: multi-agent earns its cost when the task is genuinely heterogeneous and the agents are doing meaningfully different work. It does not earn its cost when the task is uniform and the agents are mostly just splitting up a single conversation.

The Sapota recommendation

Before approving a multi-agent build:

List each proposed agent's role, tools, and the specific output it produces
For each agent, ask: could a different prompt to a single agent produce this output? If yes, you do not need this agent
Run the cost projection at realistic production volume, not best-case
Plan for 3 to 6x the cost the vendor projects, because that is what production does
If the math still works after that adjustment, build it

If the math does not work, the architecture is wrong. That is a feature of the analysis, not a bug. A founder who avoided an eighteen-agent system saved six months of build time and 90% of the projected ongoing cost.
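The cost steps in the checklist above can be sketched as a small stress test. The function names are hypothetical, and the query volume and budget in the example are placeholder numbers, not client data. Gating the decision on the 6x end of the band is our conservative reading of "if the math still works".

```python
def monthly_cost(cost_per_query: float, queries_per_month: int) -> float:
    """Per-query cost scaled to monthly production volume."""
    return cost_per_query * queries_per_month


def stress_test(
    vendor_cost_per_query: float,
    queries_per_month: int,
    monthly_budget: float,
    multipliers: tuple[float, float] = (3.0, 6.0),  # the 3 to 6x planning band
) -> dict:
    """Inflate the vendor projection by the band and compare with the budget."""
    low, high = (
        monthly_cost(vendor_cost_per_query * m, queries_per_month)
        for m in multipliers
    )
    return {"monthly_low": low, "monthly_high": high, "build": high <= monthly_budget}


if __name__ == "__main__":
    # $0.04 is the vendor projection from the case above.
    # 100,000 queries and a $10,000 budget are hypothetical.
    result = stress_test(0.04, 100_000, 10_000)
    print(f"${result['monthly_low']:,.0f} to ${result['monthly_high']:,.0f} per month")
    print("build" if result["build"] else "do not build")
```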

If your multi-agent costs feel out of control

If your team has built or inherited a multi-agent system and the costs are running 3 to 10 times the original projection, the right intervention is usually consolidation, not optimization. Removing a redundant agent eliminates every LLM call in its loop, not just a fraction of them.

Sapota offers a multi-agent audit that takes your current architecture, identifies which agents are doing meaningful specialty work and which are just adding overhead, and ships the consolidated single-agent or reduced multi-agent setup as a working PR. We have done this for half a dozen B2B SaaS clients. The cost reduction is typically 60 to 90%.

Reach out via the AI engineering page with your current architecture, monthly cost, and a sample of production queries. The first conversation usually surfaces 2 to 3 agents that can be removed.

Top comments (1)

joinwell52 • Jun 1

Great article. I agree that multi-agent systems must justify what the extra cost actually buys.

More agents are not automatically better. Each agent should have a distinct responsibility, tool boundary, permission level, or deliverable. Otherwise, the system just adds duplicated inference cost, latency, and more failure points.

For me, the key question is:

What decision or evidence does this agent uniquely contribute?

This is why I think cost should also become part of the collaboration ledger. A multi-agent system should record not only tasks, reports, reviews, and blockers, but also agent count, retries, latency, and token/API cost per task.

I am exploring this with FCoP / CodeFlowMu: using files as the protocol layer so multi-agent collaboration becomes visible, auditable, and measurable.

Multi-agent is worth it when it buys real specialization, independent verification, parallelism, or resilience. Otherwise, a single agent with better tools is often the better architecture.