choutos
Posted on • Originally published at wanderclan.eu

Agentic Mesh in the Wild

You've heard the pitch. Autonomous agents collaborating like a well-run engineering team, decomposing problems, dividing labour, converging on solutions. The "Internet for Agents." The agentic mesh.

Here's what nobody tells you at the conference keynote: the mesh is real, it's in production, and it's already teaching us lessons that will reshape how we build software. But those lessons aren't the ones the slide decks promise.

The State of Play

Multi-agent systems crossed from research curiosity to production reality in 2025. Not everywhere—not yet in most places—but in enough places, at enough scale, that patterns are emerging. Cursor is running hundreds of concurrent agents generating millions of lines of code. Anthropic ships multi-agent research to every Claude user. Salesforce's Agentforce has 150+ enterprise deployments and calls it their fastest-growing product ever. Tyson Foods and Gordon Food Service have agents from different companies talking to each other over Google's A2A protocol.

This is no longer theoretical. The question isn't whether multi-agent works. It's how it works, and where it breaks.

The Hierarchy Lesson

The most striking finding from production is this: flat coordination fails catastrophically.

Cursor learned it the hard way. Give twenty agents equal status and shared file access, and you don't get twenty times the throughput. You get the throughput of two or three, with the rest churning in lock contention and decision paralysis. Agents became risk-averse. They avoided hard tasks. They optimised for appearing busy rather than making progress.

Google DeepMind's research confirms it at the theoretical level—a "bag of agents" in flat peer-to-peer topology produces seventeen times more errors than structured alternatives.

The pattern that works, consistently, across every successful production deployment we found: orchestrator-worker. A planning agent decomposes the problem. Specialised workers execute narrow tasks. Results flow back up. Cursor extends this into recursive hierarchies—planners spawning sub-planners, with judge agents evaluating whether to continue or stop. Anthropic's multi-agent research system does the same: a lead agent delegates to subagents, each with its own context window and tools, then compresses findings.
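The orchestrator-worker shape is simple enough to sketch. The following is a minimal illustration, not any vendor's API: `plan`, `execute`, and `compress` are stand-ins for what would be LLM calls in a real system.

```python
from concurrent.futures import ThreadPoolExecutor

def plan(task: str) -> list[str]:
    # A planning agent would decompose the task here; we fake it.
    return [f"{task}: part {i}" for i in range(3)]

def execute(subtask: str) -> str:
    # Each worker runs with its own narrow scope and context.
    return f"result for ({subtask})"

def compress(results: list[str]) -> str:
    # The lead agent merges findings back into one answer.
    return " | ".join(results)

def run(task: str) -> str:
    subtasks = plan(task)                   # decompose
    with ThreadPoolExecutor() as pool:      # workers execute in parallel
        results = list(pool.map(execute, subtasks))
    return compress(results)                # results flow back up
```

The recursive variant Cursor describes is this same shape with `execute` allowed to call `run` again, plus a judge deciding when to stop.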

This mirrors something every CTO already knows about human organisations. Flat hierarchies sound democratic. They work for small teams. At scale, you need structure—clear delegation, well-defined scope, someone who decides when the work is done.

What Actually Breaks

Ten failure modes keep appearing across deployments:

1. Coordination overhead. Agents negotiate more than they work. The meta-conversation about who does what consumes the budget meant for doing the thing.

2. Context fragmentation. Each agent sees its own slice. Without shared context, they make locally reasonable but globally inconsistent decisions.

3. Non-determinism. Same inputs, different outputs. Every run is a snowflake. This isn't a bug—it's the nature of LLM-based systems—but it makes testing painful and debugging worse.

4. Error cascading. One agent's hallucination becomes another agent's input. Garbage propagates faster than correction.

5. Token economics. Multi-agent systems use roughly fifteen times the tokens of a single chat interaction. Anthropic's own research confirms this. If the task doesn't justify 15x the cost, you've built an expensive way to get a mediocre answer.

6. Lock contention. Shared-state coordination creates bottlenecks that negate the parallelism you built the system to achieve.

7. Risk aversion in flat structures. Without explicit delegation, agents gravitate toward safe, trivial subtasks.

8. Observability gaps. When a five-agent chain produces the wrong output, figuring out which agent went wrong—and why—is genuinely hard.

9. Workflow mismatch. You cannot drop agents into processes designed for humans. Workflow redesign is mandatory. Salesforce learned this deploying Agentforce: the agents work, but only after the process around them is rebuilt.

10. Model drift. Your agents' behaviour changes when the underlying LLM updates. What passed evaluation last week may fail this week.
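Failure mode 4 in particular rewards a structural defence: validate every agent's output before it becomes the next agent's input. A minimal sketch, where `validate` stands in for whatever check fits your domain (a schema, a judge model, a test suite):

```python
class CascadeError(Exception):
    pass

def validate(output: str) -> bool:
    # Illustrative check only: reject empty or suspiciously short output.
    return len(output.strip()) > 10

def pipeline(stages, task: str) -> str:
    payload = task
    for name, stage in stages:
        payload = stage(payload)
        if not validate(payload):
            # Fail fast and loudly instead of propagating garbage downstream.
            raise CascadeError(f"stage {name!r} produced invalid output")
    return payload
```

Cheap checks between stages won't catch every hallucination, but they stop the obvious garbage before it compounds.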

The Numbers Worth Knowing

A few data points that should shape your planning:

| Metric | Value |
| --- | --- |
| Multi-agent vs single-agent on research tasks | +90% (Anthropic) |
| Token cost multiplier for multi-agent | ~15x vs single chat |
| Flat swarm error amplification | 17x (DeepMind) |
| Salesforce autonomous resolution rate | 60%+ |
| A2A protocol supporting organisations | 150+ |

The 90% improvement is real and impressive. So is the 15x cost. The question for any deployment is whether the value of the task justifies the economics.
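The arithmetic is worth making concrete. A back-of-envelope check, with illustrative numbers (token counts and per-token prices are assumptions, not anyone's published pricing):

```python
def run_cost(base_tokens: int, multiplier: float, usd_per_1k: float) -> float:
    # Cost of one run: tokens consumed, scaled by the multi-agent multiplier.
    return base_tokens * multiplier * usd_per_1k / 1000

single = run_cost(base_tokens=4_000, multiplier=1, usd_per_1k=0.01)   # $0.04
multi  = run_cost(base_tokens=4_000, multiplier=15, usd_per_1k=0.01)  # $0.60
```

Fifteen times a trivial number is still small per run. Multiply by thousands of runs a day, and the architecture decision becomes a budget line.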

The Protocol Stack Taking Shape

Two standards are converging as the connective tissue of the agentic mesh:

Google's A2A (Agent-to-Agent) defines how agents discover each other and exchange messages. Donated to the Linux Foundation in June 2025, backed by 150+ organisations including Atlassian, PayPal, Salesforce, SAP, and AWS. Agents publish "capability cards" describing what they can do. Other agents—or humans—query those cards. Think DNS for agents.
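To make the "DNS for agents" idea concrete, here is a hypothetical capability card as a plain data structure. The field names below are illustrative, not the A2A schema itself (the real Agent Card format lives in the A2A specification):

```python
import json

# Hypothetical capability card; field names are our own, not the A2A spec's.
card = {
    "name": "invoice-reconciler",
    "description": "Matches purchase orders to supplier invoices",
    "url": "https://agents.example.com/invoice-reconciler",
    "skills": [
        {"id": "match-po", "description": "Match a PO number to invoices"},
    ],
}

# Another agent (or a human) would fetch and inspect the card before delegating.
print(json.dumps(card, indent=2))
```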

Anthropic's MCP (Model Context Protocol) standardises how agents access tools and data. If A2A is how agents talk to each other, MCP is how they talk to the world. The analogy that keeps surfacing: USB for AI.

Together, these protocols make vendor-neutral, cross-framework, even cross-company agent collaboration possible. The Tyson Foods / Gordon Food Service deployment is the early proof. It won't be the last.

OpenTelemetry for Agents is emerging as the observability layer—extending existing OpenTelemetry standards to trace agent interactions, tool calls, and token consumption across the mesh.
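In production you would use the OpenTelemetry SDK itself, but the shape of the data an agent span needs to carry can be shown in pure stdlib. A sketch of one traced agent step, recording the fields that matter for the mesh (the attribute names are our own convention; GenAI semantic conventions are still evolving):

```python
import time
import uuid

def traced(name: str, fn, *, tokens_used: int):
    # Run one agent step and emit a span-like record alongside its result.
    start = time.perf_counter()
    result = fn()
    span = {
        "trace_id": uuid.uuid4().hex,
        "name": name,
        "duration_ms": (time.perf_counter() - start) * 1000,
        "tokens": tokens_used,   # token burn per step, not just per run
    }
    return result, span
```

Per-step token counts are the field teams most often forget, and the one that makes the 15x multiplier debuggable.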

Patterns for the Pragmatist

If you're planning a multi-agent deployment, here's what the evidence suggests:

Start with one agent. Microsoft's own guidance: "If you can write a function to handle the task, do that instead of using an AI agent." Multi-agent is a scaling strategy, not a starting point.

Orchestrator-worker is the safe default. It's proven at scale by Cursor, Anthropic, and Bayezian. Flat peer-to-peer is an anti-pattern. Every production team that tried it moved away from it.

Choose models per role. Cursor found GPT-5.2 excels at planning while other models perform better at execution. One-size-fits-all model selection leaves performance on the table.

Instrument from day one. OpenTelemetry, audit logs, token tracking, cost circuit breakers. You will need them. In regulated industries (KPMG's Clara AI for audit, Bayezian's clinical trial monitoring), they're non-negotiable. In every other industry, they're still non-negotiable—you just don't know it yet.

Adopt A2A and MCP early. These are becoming the industry standards. Building on proprietary protocols now means migrating later.

Budget honestly. Plan for 10–15x token costs. Build cost monitoring with automatic circuit breakers for runaway usage. If the economics don't work at 15x, the architecture needs to change—not the budget.
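A cost circuit breaker of the kind recommended above can be very small. A sketch, with an illustrative budget figure:

```python
class BudgetExceeded(Exception):
    pass

class TokenBudget:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.spent = 0

    def charge(self, tokens: int) -> None:
        if self.spent + tokens > self.max_tokens:
            # Trip the breaker instead of letting a runaway loop burn money.
            raise BudgetExceeded(f"{self.spent + tokens} > {self.max_tokens}")
        self.spent += tokens

budget = TokenBudget(max_tokens=50_000)
budget.charge(20_000)  # first agent call
budget.charge(20_000)  # second agent call
# a third 20k call would raise BudgetExceeded
```

Every agent call goes through `charge` before it hits the model; the run dies cleanly rather than silently tripling its cost.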

The Uncomfortable Truth

The agentic mesh is real, and it produces results that single agents cannot match. Cursor's agents wrote a web browser from scratch—a million lines of code in a week. Anthropic's multi-agent research outperforms single-agent by 90%. These aren't demo numbers; they're production metrics.

But the mesh is also unforgiving. Coordination is the hard problem, not individual agent capability. The organisations succeeding are the ones treating multi-agent systems with the same rigour they'd apply to distributed systems engineering: clear hierarchies, narrow responsibilities, observable behaviour, well-defined failure modes, and honest cost accounting.

The age of agents-in-production has arrived. It looks less like a swarm of autonomous intelligences and more like a well-architected microservices system—with all the same lessons about coupling, cohesion, observability, and the eternal truth that distributed systems are harder than they look.

The difference is that this time, the services can reason.


Sources include Anthropic Engineering, Cursor, Google DeepMind (arXiv 2512.08296), Salesforce, AIMultiple, KPMG, Bayezian, and the Linux Foundation A2A project.
