chunxiaoxx

Posted on Apr 10

A2A Is Not Enough: Production Multi-Agent Systems Need a Control Plane

#agents #ai #architecture #devops

Most discussions about multi-agent systems still stop at communication.

That is necessary, but it is not sufficient.

If you want real production systems, you need to solve three different layers:

Agent-to-agent coordination
Agent-to-tool integration
Observability and governance

In 2025 and 2026, the open ecosystem got much better on the first two.

On June 23, 2025, the Linux Foundation launched the Agent2Agent (A2A) project, built around the open protocol Google created for secure agent-to-agent communication.
On December 9, 2025, Anthropic announced that it was donating the Model Context Protocol (MCP) to the Agentic AI Foundation, and reported 10,000+ active public MCP servers and 97M+ monthly SDK downloads across Python and TypeScript.
In 2026, observability vendors started saying the quiet part out loud: observability is the control plane for agentic systems.

Those three developments together define the real architecture shift.

The stack is getting clearer

A useful way to think about modern agent systems is:

A2A handles agent-to-agent interoperability
MCP handles agent-to-tool connectivity
Observability/control-plane infrastructure handles traceability, policy, reliability, and cost control

That separation matters because teams often try to force one layer to do the job of another.

What A2A is good at

According to the Linux Foundation announcement, A2A exists to let agents:

discover one another
exchange information securely
collaborate across systems
reduce vendor lock-in
interoperate across platforms and frameworks

That is the collaboration layer.

If you have specialized agents for planning, coding, research, operations, or QA, A2A gives you a standard way to route work between them.

But A2A does not tell you:

which tool call caused a bad decision
why costs spiked
which memory retrieval corrupted the context
whether a guardrail fired too late
how to replay a failure path

That is not a protocol failure. It is simply not the protocol’s job.

What MCP is good at

MCP solved a different problem: connecting models and agents to external systems through a common interface.

The December 2025 Anthropic announcement matters because it showed MCP was no longer a niche developer experiment. The protocol had:

10,000+ active public servers
support across major products and platforms
97M+ monthly SDK downloads

That is what production gravity looks like.

MCP gives teams a standard way to expose tools, data sources, connectors, and workflows to agent runtimes.

But again, MCP does not solve the entire production problem.

Even with perfect tool access, you still need to answer:

Who approved this action?
What prompt and context produced it?
What chain of tools did the agent invoke?
How much latency, token usage, and spend did the run create?
What happened when the system degraded?

Why observability is the missing layer

Arthur AI’s 2026 observability playbook describes observability as the layer that turns autonomous behavior into measurable, auditable outcomes.

That framing is correct.

Traditional monitoring tells you whether a service is up.
Agent observability tells you:

what the agent attempted
why it chose that path
which tools it called
what context it consumed
where the workflow stalled or drifted
how much it cost
whether it stayed inside policy

That is the difference between a demo and a system you can actually operate.

The production architecture pattern

A practical production stack looks like this:

1. Specialized agents

Each agent should own a bounded role:

planner
researcher
coder
reviewer
operator
customer-facing executor

Do not create a committee of identical generalists.

2. Standardized coordination via A2A

Use A2A for:

capability discovery
task routing
handoffs
delegation
status exchange

This keeps the collaboration layer interoperable.
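
To make "capability discovery" and "task routing" concrete, here is a minimal in-memory sketch. It does not use the A2A SDK or wire format: `AgentCard`, `Registry`, and the status strings are hypothetical names chosen for illustration, and a real deployment would discover agents over the network rather than from a local list.

```python
from dataclasses import dataclass, field


@dataclass
class AgentCard:
    """Hypothetical, simplified description of what an agent can do."""
    name: str
    skills: set[str]


@dataclass
class Registry:
    """In-memory directory of agents (an assumption, not a real A2A component)."""
    cards: list[AgentCard] = field(default_factory=list)

    def register(self, card: AgentCard) -> None:
        self.cards.append(card)

    def discover(self, skill: str) -> list[AgentCard]:
        # Capability discovery: every agent that advertises the skill.
        return [c for c in self.cards if skill in c.skills]

    def route(self, task: dict) -> dict:
        # Task routing: assign to the first capable agent and return an
        # explicit status instead of failing silently.
        candidates = self.discover(task["skill"])
        if not candidates:
            return {"task_id": task["id"], "status": "rejected", "assignee": None}
        return {"task_id": task["id"], "status": "submitted",
                "assignee": candidates[0].name}


registry = Registry()
registry.register(AgentCard("planner", {"plan"}))
registry.register(AgentCard("reviewer", {"review", "qa"}))

handoff = registry.route({"id": "t-1", "skill": "review"})
unroutable = registry.route({"id": "t-2", "skill": "deploy"})
```

The point of the structured return value is that a handoff nobody can accept becomes a visible event, not a prompt that quietly goes nowhere.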

3. Standardized tools via MCP

Use MCP for:

databases
code execution
retrieval
external APIs
internal services
enterprise systems

This keeps tool access portable.
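
The idea of putting every tool behind one listing and one call path can be sketched without any protocol library. The code below is not the MCP SDK: `TOOLS`, `tool`, `list_tools`, and `call_tool` are hypothetical helpers, and the two example tools are stand-ins for a real database or service.

```python
from typing import Any, Callable

# Hypothetical registry: tool name -> (description, handler).
TOOLS: dict[str, tuple[str, Callable[..., Any]]] = {}


def tool(name: str, description: str):
    """Register a function under a stable tool name."""
    def decorator(fn):
        TOOLS[name] = (description, fn)
        return fn
    return decorator


@tool("kv.get", "Read a value from an in-memory store")
def kv_get(key: str) -> str:
    store = {"region": "eu-west-1"}  # stand-in for a database
    return store.get(key, "")


@tool("math.add", "Add two integers")
def add(a: int, b: int) -> int:
    return a + b


def list_tools() -> list[dict]:
    # What an agent runtime sees: names and descriptions, not implementations.
    return [{"name": n, "description": d} for n, (d, _) in sorted(TOOLS.items())]


def call_tool(name: str, arguments: dict) -> dict:
    # One call path: unknown tools and handler errors become structured results.
    if name not in TOOLS:
        return {"ok": False, "error": f"unknown tool: {name}"}
    try:
        return {"ok": True, "result": TOOLS[name][1](**arguments)}
    except Exception as exc:
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
```

Because agents only depend on tool names and argument shapes, the handler behind `kv.get` can be swapped for a real backend without touching agent code.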

4. A real control plane

Instrument the system so every important step emits:

traces
tool invocations
memory retrieval events
prompts and outputs where policy allows
latency
token usage
error classes
approval events
policy/guardrail triggers

This is where OpenTelemetry-first design becomes attractive. You want the telemetry path to be portable too.
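
As a rough illustration of "every important step emits" something, here is a standard-library-only sketch of a span-like recorder. It is a stand-in, not OpenTelemetry: `EVENTS`, `step`, and the event fields are hypothetical, and in practice you would map them onto whatever tracing SDK and exporter you adopt.

```python
import time
import uuid
from contextlib import contextmanager

EVENTS: list[dict] = []  # stand-in for an exporter; real systems ship these out


@contextmanager
def step(run_id: str, kind: str, name: str, **attrs):
    """Record one step of a run: kind, attributes, latency, and error class."""
    event = {"run_id": run_id, "span_id": uuid.uuid4().hex[:8],
             "kind": kind, "name": name, "attrs": dict(attrs),
             "error_class": None}
    start = time.perf_counter()
    try:
        yield event["attrs"]  # caller may attach token counts and similar data
    except Exception as exc:
        event["error_class"] = type(exc).__name__
        raise
    finally:
        # Runs on success and failure, so failed steps are never missing.
        event["latency_ms"] = (time.perf_counter() - start) * 1000
        EVENTS.append(event)


run_id = "run-1"
with step(run_id, "tool_call", "search", query="status page") as attrs:
    attrs["tokens"] = 120

try:
    with step(run_id, "tool_call", "deploy"):
        raise TimeoutError("upstream did not respond")
except TimeoutError:
    pass  # the failure is still recorded in EVENTS
```

Recording in the `finally` branch is the design choice that matters here: the steps you most need during an incident are the ones that raised.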

Common failure mode: protocol maximalism

A common mistake is thinking:

“If we adopt the right protocol, the system becomes production-ready.”

It does not.

Protocols reduce integration friction. They do not magically produce:

good decomposition
cost discipline
sane governance
testability
rollback paths
useful failure handling

Production multi-agent systems fail less often because a protocol is missing and more often because operational structure is missing.

What teams should build now

If you are building an agent platform in 2026, prioritize in this order:

First: make actions visible

Before adding more agent roles, ensure you can answer:

What did the system do?
Why did it do it?
Which tool calls happened?
What did the run cost?
Where did it fail?
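
Most of these questions reduce to simple queries once each step is recorded. The sketch below assumes a hypothetical event shape and made-up per-token prices (`PRICE_PER_1K`); substitute your own schema and your provider's real rates. The "why" question is not covered here, since it needs the recorded prompts and context, not just counters.

```python
# Hypothetical prices per 1,000 tokens, for illustration only.
PRICE_PER_1K = {"input": 0.003, "output": 0.015}

# Hypothetical recorded run: two model calls and two tool calls.
events = [
    {"step": 1, "kind": "llm", "name": "plan",
     "input_tokens": 1000, "output_tokens": 200, "error": None},
    {"step": 2, "kind": "tool", "name": "search",
     "input_tokens": 0, "output_tokens": 0, "error": None},
    {"step": 3, "kind": "llm", "name": "summarize",
     "input_tokens": 2000, "output_tokens": 400, "error": None},
    {"step": 4, "kind": "tool", "name": "deploy",
     "input_tokens": 0, "output_tokens": 0, "error": "TimeoutError"},
]


def summarize_run(events: list[dict]) -> dict:
    ordered = sorted(events, key=lambda e: e["step"])
    cost = sum(
        e["input_tokens"] / 1000 * PRICE_PER_1K["input"]
        + e["output_tokens"] / 1000 * PRICE_PER_1K["output"]
        for e in ordered
    )
    # First step that recorded an error, if any.
    failed = next((e for e in ordered if e["error"]), None)
    return {
        "steps": [e["name"] for e in ordered],                  # what it did
        "tool_chain": [e["name"] for e in ordered if e["kind"] == "tool"],
        "cost_usd": round(cost, 4),                             # what it cost
        "failed_at": failed["name"] if failed else None,        # where it failed
        "error": failed["error"] if failed else None,
    }


summary = summarize_run(events)
```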

Second: make collaboration explicit

Use A2A or an equivalent contract so delegation is structured, not ad hoc prompt passing.

Third: make tools standardized

Use MCP or an equivalent abstraction so agents are not tightly coupled to one-off integrations.

Fourth: add governance where it matters

Add policy checks on:

sensitive tool access
irreversible actions
customer-impacting outputs
cost thresholds
escalation boundaries
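
A policy check of this kind can start as a plain function that runs before any action executes. This is a minimal sketch under stated assumptions: `Action`, `SENSITIVE_TOOLS`, `COST_LIMIT_USD`, and the three decision strings are all hypothetical, and the rule order is one reasonable choice, not a standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    tool: str
    irreversible: bool = False
    customer_facing: bool = False
    estimated_cost_usd: float = 0.0


# Hypothetical policy values for illustration only.
SENSITIVE_TOOLS = {"payments.refund", "db.drop_table"}
COST_LIMIT_USD = 5.00


def check(action: Action, spent_usd: float) -> tuple[str, str]:
    """Return (decision, reason): 'allow', 'needs_approval', or 'deny'."""
    # Hard stop first: budget overruns are denied, not escalated.
    if spent_usd + action.estimated_cost_usd > COST_LIMIT_USD:
        return "deny", "cost threshold exceeded"
    # Everything below escalates to a human instead of blocking outright.
    if action.tool in SENSITIVE_TOOLS:
        return "needs_approval", "sensitive tool"
    if action.irreversible:
        return "needs_approval", "irreversible action"
    if action.customer_facing:
        return "needs_approval", "customer-impacting output"
    return "allow", "within policy"


routine = check(Action("search"), spent_usd=1.0)
refund = check(Action("payments.refund"), spent_usd=1.0)
over_budget = check(Action("search", estimated_cost_usd=4.5), spent_usd=1.0)
```

Returning a reason alongside the decision means the check itself produces the approval and guardrail events that the control plane is supposed to record.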

Fifth: optimize for replayability

If you cannot replay an incident, you cannot improve the system reliably.
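
One simple way to get replayability is to record every tool result during a live run and serve those results back during replay. The sketch below assumes tool results are JSON-friendly and calls happen in a deterministic order; `Recorder` and `run` are hypothetical names, and model outputs would need the same treatment in a real system.

```python
import json
from typing import Optional


class Recorder:
    """In 'record' mode, call the tool and save the result. In 'replay' mode,
    return the saved result for the same call, in the same order."""

    def __init__(self, mode: str = "record", log: Optional[list] = None):
        self.mode = mode
        self.log = log if log is not None else []
        self._cursor = 0

    def call(self, name: str, fn, **kwargs):
        key = json.dumps({"tool": name, "args": kwargs}, sort_keys=True)
        if self.mode == "record":
            result = fn(**kwargs)
            self.log.append({"key": key, "result": result})
            return result
        entry = self.log[self._cursor]
        if entry["key"] != key:
            # The replayed run asked for something the original run did not.
            raise RuntimeError(f"replay diverged at call {self._cursor}: {key}")
        self._cursor += 1
        return entry["result"]


def run(rec: Recorder, fetch_price) -> float:
    # A tiny "agent run": two tool calls feeding one decision.
    a = rec.call("price", fetch_price, sku="A")
    b = rec.call("price", fetch_price, sku="B")
    return max(a, b)


live_prices = {"A": 10.0, "B": 12.5}
recorder = Recorder("record")
original = run(recorder, lambda sku: live_prices[sku])

# Later the live source has changed, but replay reproduces the original run.
live_prices["B"] = 1.0
replayed = run(Recorder("replay", log=recorder.log), lambda sku: live_prices[sku])
```

The divergence check is deliberate: if a code change makes the agent call different tools, replay fails loudly instead of returning results that belong to a different path.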

Final point

The future is not one giant agent.
It is also not a swarm with no discipline.

The winning pattern is:

A2A for coordination
MCP for tool use
observability as the control plane

Communication matters.
Tool access matters.
But if you cannot inspect, govern, and debug autonomous behavior, you do not have a production system.
You have a live experiment.

Sources

Linux Foundation, “Linux Foundation Launches the Agent2Agent Protocol Project to Enable Secure, Intelligent Communication Between AI Agents” (June 23, 2025)
Anthropic, “Donating the Model Context Protocol and establishing the Agentic AI Foundation” (Dec 9, 2025)
Arthur AI, “Agentic AI Observability Playbook 2026: Standards Every Executive Must Adopt” (Apr 2, 2026)

Top comments (1)

Krrish Dholakia • Jun 11

Interesting. I feel like a lot of what you're describing is covered by an AI Gateway. You do lose observability though when you swap to a managed agent runtime.

Curious if you're looking at running coding harnesses in a sandbox, and exploring a control plane above that?