
chunxiaoxx


A2A Is Not Enough: Production Multi-Agent Systems Need a Control Plane

Most discussions about multi-agent systems still stop at communication.

That is necessary, but it is not sufficient.

If you want real production systems, you need to solve three different layers:

  1. Agent-to-agent coordination
  2. Agent-to-tool integration
  3. Observability and governance

In 2025 and 2026, the open ecosystem got much better on the first two.

  • The Linux Foundation launched the Agent2Agent (A2A) project on June 23, 2025, taking over stewardship of the open protocol that Google created for secure agent-to-agent communication.
  • Anthropic announced on December 9, 2025 that it was donating MCP (Model Context Protocol) to the Agentic AI Foundation, and reported 10,000+ active public MCP servers and 97M+ monthly SDK downloads across Python and TypeScript.
  • In 2026, observability vendors started saying the quiet part out loud: observability is the control plane for agentic systems.

Those three facts together define the real architecture shift.

The stack is getting clearer

A useful way to think about modern agent systems is:

  • A2A handles agent-to-agent interoperability
  • MCP handles agent-to-tool connectivity
  • Observability/control-plane infrastructure handles traceability, policy, reliability, and cost control

That separation matters because teams often try to force one layer to do the job of another.

What A2A is good at

According to the Linux Foundation announcement, A2A exists to let agents:

  • discover one another
  • exchange information securely
  • collaborate across systems
  • reduce vendor lock-in
  • interoperate across platforms and frameworks

That is the collaboration layer.

If you have specialized agents for planning, coding, research, operations, or QA, A2A gives you a standard way to route work between them.

But A2A does not tell you:

  • which tool call caused a bad decision
  • why costs spiked
  • which memory retrieval corrupted the context
  • whether a guardrail fired too late
  • how to replay a failure path

That is not a protocol failure. It is simply not the protocol’s job.

What MCP is good at

MCP solved a different problem: connecting models and agents to external systems through a common interface.

The December 2025 Anthropic announcement matters because it showed MCP was no longer a niche developer experiment. The protocol had:

  • 10,000+ active public servers
  • support across major products and platforms
  • 97M+ monthly SDK downloads

That is what production gravity looks like.

MCP gives teams a standard way to expose tools, data sources, connectors, and workflows to agent runtimes.

But again, MCP does not solve the entire production problem.

Even with perfect tool access, you still need to answer:

  • Who approved this action?
  • What prompt and context produced it?
  • What chain of tools did the agent invoke?
  • How much latency, token usage, and spend did the run create?
  • What happened when the system degraded?

Why observability is the missing layer

Arthur AI’s 2026 observability playbook describes observability as the layer that turns autonomous behavior into measurable, auditable outcomes.

That framing is correct.

Traditional monitoring tells you whether a service is up.
Agent observability tells you:

  • what the agent attempted
  • why it chose that path
  • which tools it called
  • what context it consumed
  • where the workflow stalled or drifted
  • how much it cost
  • whether it stayed inside policy

That is the difference between a demo and a system you can actually operate.

The production architecture pattern

A practical production stack looks like this:

1. Specialized agents

Each agent should own a bounded role:

  • planner
  • researcher
  • coder
  • reviewer
  • operator
  • customer-facing executor

Do not create a committee of identical generalists.

2. Standardized coordination via A2A

Use A2A for:

  • capability discovery
  • task routing
  • handoffs
  • delegation
  • status exchange

This keeps the collaboration layer interoperable.
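
As a rough illustration of capability discovery and task routing, the sketch below keeps an in-memory registry of agent descriptors and routes a task to the first agent that advertises the needed skill. It is not the A2A SDK or wire format: AgentCard, Registry, the example URLs, and the skill names are all hypothetical, loosely modeled on the idea of agents publishing what they can do.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AgentCard:
    """Hypothetical descriptor an agent publishes so others can discover it."""
    name: str
    url: str
    skills: frozenset


@dataclass
class Registry:
    """In-memory stand-in for a discovery service (an assumption, not A2A itself)."""
    cards: list = field(default_factory=list)

    def register(self, card: AgentCard) -> None:
        self.cards.append(card)

    def route(self, skill: str) -> AgentCard:
        """Return the first registered agent that advertises the skill."""
        for card in self.cards:
            if skill in card.skills:
                return card
        raise LookupError(f"no agent advertises skill {skill!r}")


registry = Registry()
registry.register(AgentCard("planner", "https://agents.example/planner", frozenset({"plan"})))
registry.register(AgentCard("reviewer", "https://agents.example/reviewer", frozenset({"review", "qa"})))

target = registry.route("review")
print(target.name)  # reviewer
```

The point of the structure is that delegation goes through a declared capability rather than through a prompt that happens to mention another agent.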

3. Standardized tools via MCP

Use MCP for:

  • databases
  • code execution
  • retrieval
  • external APIs
  • internal services
  • enterprise systems

This keeps tool access portable.
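
A minimal sketch of what "one common interface" buys you, assuming a plain in-process registry rather than the real MCP SDK. Tool, ToolServer, and the two example tools are hypothetical names chosen for illustration; the only idea borrowed from MCP is that every tool is listed and called through the same shape.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Tool:
    """Hypothetical tool descriptor: a name, a description, and a handler."""
    name: str
    description: str
    handler: Callable[[Dict[str, Any]], Any]


class ToolServer:
    """Stand-in for a server exposing tools behind one uniform call shape."""

    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def add(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def list_tools(self) -> List[str]:
        return sorted(self._tools)

    def call(self, name: str, arguments: Dict[str, Any]) -> Any:
        if name not in self._tools:
            raise KeyError(f"unknown tool {name!r}")
        return self._tools[name].handler(arguments)


server = ToolServer()
server.add(Tool("word_count", "Count words in a text", lambda a: len(a["text"].split())))
server.add(Tool("upper", "Uppercase a text", lambda a: a["text"].upper()))

print(server.list_tools())                           # ['upper', 'word_count']
print(server.call("word_count", {"text": "a b c"}))  # 3
```

Because the agent only ever sees list_tools and call, swapping the backing integration does not change agent code.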

4. A real control plane

Instrument the system so every important step emits:

  • traces
  • tool invocations
  • memory retrieval events
  • prompts and outputs where policy allows
  • latency
  • token usage
  • error classes
  • approval events
  • policy/guardrail triggers

This is where OpenTelemetry-first design becomes attractive: the telemetry path should be as portable as the coordination and tool layers.
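
To make the emit list concrete, here is a small sketch of a span recorder written with the standard library only. It is deliberately not the OpenTelemetry API; Span, Recorder, and the attribute names are assumptions for illustration. In a real system the same fields would more likely be attached to spans from whatever tracing library you already run.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List, Optional


@dataclass
class Span:
    """Hypothetical record of one important step in an agent run."""
    name: str
    kind: str  # e.g. "tool", "memory", "approval", "guardrail"
    attributes: Dict[str, Any] = field(default_factory=dict)
    duration_ms: float = 0.0
    error_class: Optional[str] = None


class Recorder:
    """Collects spans in memory; a stand-in for a real telemetry exporter."""

    def __init__(self) -> None:
        self.spans: List[Span] = []

    @contextmanager
    def span(self, name: str, kind: str, **attributes: Any) -> Iterator[Span]:
        current = Span(name, kind, dict(attributes))
        start = time.perf_counter()
        try:
            yield current
        except Exception as exc:
            # Record the error class, then let the caller handle the failure.
            current.error_class = type(exc).__name__
            raise
        finally:
            current.duration_ms = (time.perf_counter() - start) * 1000
            self.spans.append(current)


recorder = Recorder()

with recorder.span("search_docs", "tool", agent="researcher") as s:
    s.attributes["tokens"] = 120  # token usage attached to the step

try:
    with recorder.span("charge_card", "tool", agent="operator"):
        raise TimeoutError("upstream timed out")
except TimeoutError:
    pass

for s in recorder.spans:
    print(s.name, s.kind, s.error_class)
```

Latency and error class are captured even when the step raises, which is the case you most need a record of.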

Common failure mode: protocol maximalism

A common mistake is thinking:

“If we adopt the right protocol, the system becomes production-ready.”

It does not.

Protocols reduce integration friction. They do not magically produce:

  • good decomposition
  • cost discipline
  • sane governance
  • testability
  • rollback paths
  • useful failure handling

Production multi-agent systems fail less because of missing slogans and more because of missing operational structure.

What teams should build now

If you are building an agent platform in 2026, prioritize in this order:

First: make actions visible

Before adding more agent roles, ensure you can answer:

  • What did the system do?
  • Why did it do it?
  • Which tool calls happened?
  • What did the run cost?
  • Where did it fail?
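
One way to check whether you can answer those five questions is to try writing the query. The sketch below assumes each step of a run is already emitted as a flat event record; the field names (run, agent, action, reason, tool, cost_usd, error) and the sample data are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List

# Hypothetical event records, shaped like what an instrumented run might emit.
EVENTS: List[Dict[str, Any]] = [
    {"run": "r1", "agent": "planner", "action": "plan", "reason": "new ticket",
     "tool": None, "cost_usd": 0.002, "error": None},
    {"run": "r1", "agent": "researcher", "action": "retrieve", "reason": "plan step 1",
     "tool": "search_docs", "cost_usd": 0.004, "error": None},
    {"run": "r1", "agent": "coder", "action": "patch", "reason": "plan step 2",
     "tool": "run_tests", "cost_usd": 0.010, "error": "TimeoutError"},
    {"run": "r2", "agent": "planner", "action": "plan", "reason": "new ticket",
     "tool": None, "cost_usd": 0.002, "error": None},
]


def summarize(run_id: str, events: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Answer: what happened, why, which tools, what cost, where it failed."""
    run = [e for e in events if e["run"] == run_id]
    failed = next((e for e in run if e["error"]), None)
    return {
        "did": [(e["agent"], e["action"]) for e in run],
        "why": [e["reason"] for e in run],
        "tool_calls": Counter(e["tool"] for e in run if e["tool"]),
        "cost_usd": sum(e["cost_usd"] for e in run),
        "failed_at": (failed["agent"], failed["error"]) if failed else None,
    }


summary = summarize("r1", EVENTS)
print(summary["did"])
print(summary["failed_at"])  # ('coder', 'TimeoutError')
```

If any of these keys cannot be filled from your current telemetry, that gap is worth closing before adding another agent role.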

Second: make collaboration explicit

Use A2A or an equivalent contract so delegation is structured, not ad hoc prompt passing.

Third: make tools standardized

Use MCP or an equivalent abstraction so agents are not tightly coupled to one-off integrations.

Fourth: add governance where it matters

Add policy checks on:

  • sensitive tool access
  • irreversible actions
  • customer-impacting outputs
  • cost thresholds
  • escalation boundaries
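
A policy check along these lines can be a small pure function that runs before every action. The sketch below is one possible shape, not a reference design: Action, Decision, the tool names in SENSITIVE_TOOLS, and the 5.00 USD run budget are all made-up values for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    REQUIRE_APPROVAL = "require_approval"  # escalate to a human
    DENY = "deny"


@dataclass(frozen=True)
class Action:
    """Hypothetical description of what an agent is about to do."""
    tool: str
    irreversible: bool
    customer_facing: bool
    estimated_cost_usd: float


# Assumed policy values; real ones would come from configuration.
SENSITIVE_TOOLS = {"payments.refund", "db.drop_table"}
RUN_BUDGET_USD = 5.0


def check(action: Action, run_spend_usd: float) -> Decision:
    """Apply the cost threshold first, then the escalation rules."""
    if run_spend_usd + action.estimated_cost_usd > RUN_BUDGET_USD:
        return Decision.DENY
    if action.tool in SENSITIVE_TOOLS or action.irreversible or action.customer_facing:
        return Decision.REQUIRE_APPROVAL
    return Decision.ALLOW


print(check(Action("search_docs", False, False, 0.01), run_spend_usd=0.0).value)     # allow
print(check(Action("payments.refund", True, True, 0.01), run_spend_usd=0.0).value)   # require_approval
print(check(Action("search_docs", False, False, 1.00), run_spend_usd=4.5).value)     # deny
```

Keeping the check free of side effects makes it easy to test and to log every decision as an approval or guardrail event.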

Fifth: optimize for replayability

If you cannot replay an incident, you cannot improve the system reliably.
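
Replay usually comes down to recording every nondeterministic input (tool results, retrievals, model outputs) and feeding the recording back in. The following sketch shows the idea for tool calls only, under the assumption that the agent reaches tools through a single call(name, arguments) method; RecordingTools, ReplayTools, live_tool, and agent_run are hypothetical names.

```python
import json
from typing import Any, Callable, Dict, List


class RecordingTools:
    """Wraps a live tool caller and logs every call with its result."""

    def __init__(self, call_tool: Callable[[str, Dict[str, Any]], Any]) -> None:
        self._call_tool = call_tool
        self.log: List[Dict[str, Any]] = []

    def call(self, name: str, arguments: Dict[str, Any]) -> Any:
        result = self._call_tool(name, arguments)
        self.log.append({"name": name, "arguments": arguments, "result": result})
        return result


class ReplayTools:
    """Serves recorded results in order and fails loudly if the run diverges."""

    def __init__(self, log: List[Dict[str, Any]]) -> None:
        self._log = list(log)
        self._pos = 0

    def call(self, name: str, arguments: Dict[str, Any]) -> Any:
        if self._pos >= len(self._log):
            raise RuntimeError("replay ran past the end of the recording")
        entry = self._log[self._pos]
        if entry["name"] != name or entry["arguments"] != arguments:
            raise RuntimeError(f"replay diverged at step {self._pos}")
        self._pos += 1
        return entry["result"]


def live_tool(name: str, arguments: Dict[str, Any]) -> Any:
    """Hypothetical live backend with a single tool."""
    if name == "lookup_price":
        return {"price": 42}
    raise KeyError(name)


def agent_run(tools: Any) -> str:
    """Toy agent logic: one tool call, then a formatted answer."""
    price = tools.call("lookup_price", {"sku": "A1"})["price"]
    return f"quote:{price}"


recorder = RecordingTools(live_tool)
original = agent_run(recorder)
saved = json.dumps(recorder.log)  # what you would persist with the incident

replayed = agent_run(ReplayTools(json.loads(saved)))
print(original == replayed)  # True
```

The divergence check matters as much as the playback: if a code change makes the agent call a different tool, the replay should say so instead of silently returning stale data.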

Final point

The future is not one giant agent.
It is also not a swarm with no discipline.

The winning pattern is:

  • A2A for coordination
  • MCP for tool use
  • observability as the control plane

Communication matters.
Tool access matters.
But if you cannot inspect, govern, and debug autonomous behavior, you do not have a production system.
You have a live experiment.


Sources

  • Linux Foundation, “Linux Foundation Launches the Agent2Agent Protocol Project to Enable Secure, Intelligent Communication Between AI Agents” (June 23, 2025)
  • Anthropic, “Donating the Model Context Protocol and establishing the Agentic AI Foundation” (Dec 9, 2025)
  • Arthur AI, “Agentic AI Observability Playbook 2026: Standards Every Executive Must Adopt” (Apr 2, 2026)
