chunxiaoxx

Posted on Apr 10

A2A Is Not MCP: Routing Messages Is Not the Same as Scheduling Work

#ai #architecture #distributed #devops

Protocols move messages. Control planes decide what runs, where, and under which constraints.

If you build multi-agent systems in production, this distinction matters more than most architecture diagrams admit.

A2A-style protocols are useful. They let agents exchange messages, advertise capabilities, and interoperate across boundaries. That is real value.

But a message protocol is not the same thing as a control plane.

Conflating the two creates a predictable failure mode: the demo looks elegant, the architecture slide looks clean, and the production system collapses the moment work has to be scheduled, retried, isolated, observed, and governed under load.

The short version

A protocol answers questions like:

How do agents send and receive messages?
How are requests represented?
How is identity or capability described?
How can heterogeneous systems interoperate?

A control plane answers a different class of questions:

What should run now?
Where should it run?
Under what resource and security constraints?
What happens if it fails halfway through?
How do retries, idempotency, and backoff work?
How do we track state across multi-step execution?
How do we enforce tenancy, quotas, approvals, and policy?

Those are not interchangeable responsibilities.
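
One way to see the difference is in the data each layer has to keep. The sketch below is illustrative only: `Message`, `TaskRecord`, `TaskState`, and all of their fields are hypothetical names, not part of A2A or any other real protocol.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


@dataclass(frozen=True)
class Message:
    """Roughly what a protocol standardizes: who is asking whom for what."""
    sender: str
    recipient: str
    capability: str
    payload: dict


class TaskState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass
class TaskRecord:
    """Roughly what a control plane tracks around that message (hypothetical fields)."""
    message: Message
    tenant: str
    state: TaskState = TaskState.PENDING
    attempts: int = 0
    max_attempts: int = 3
    idempotency_key: str = ""
    assigned_worker: Optional[str] = None
```

The message can be identical in both worlds. Everything in `TaskRecord` beyond the `message` field is information the protocol has no reason to carry.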

What protocols are good at

Protocols are the interoperability layer.

They help systems avoid bespoke glue for every pair of tools or agents. They define shared contracts. They reduce friction at the boundary.

That matters. Without a protocol, every integration becomes a custom negotiation.

In a multi-agent environment, a good protocol gives you:

A common message format
Capability discovery
Transport-agnostic communication patterns
Loose coupling between participants

This is important infrastructure. It is just not orchestration.

What protocols do not do

A protocol can tell one agent how to ask another agent for work.

It does not decide:

whether that work should be admitted into the system,
whether there is enough capacity to run it,
whether it should be sandboxed,
whether it conflicts with another task,
whether it should be retried,
whether it must wait on another dependency,
whether the result should be cached,
whether the request violates policy,
whether the execution must be auditable.

That gap is where the control plane lives.
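
As a rough illustration of the first few decisions in that list, here is a minimal admission-control sketch. `AdmissionController`, `Request`, `Decision`, and the reason strings are all made up for this example; a real system would likely back the counters with shared storage rather than in-process fields.

```python
from dataclasses import dataclass


@dataclass
class Request:
    tenant: str
    capability: str


@dataclass
class Decision:
    admitted: bool
    reason: str


class AdmissionController:
    """Decides whether a request may enter the system at all (illustrative only)."""

    def __init__(self, max_in_flight, tenant_quota, denied_capabilities):
        self.max_in_flight = max_in_flight
        self.tenant_quota = tenant_quota
        self.denied_capabilities = set(denied_capabilities)
        self.in_flight = 0
        self.per_tenant = {}

    def admit(self, req):
        # Policy check first: some work should never run, regardless of capacity.
        if req.capability in self.denied_capabilities:
            return Decision(False, "policy: capability not allowed")
        # Global capacity check.
        if self.in_flight >= self.max_in_flight:
            return Decision(False, "capacity: system is full")
        # Per-tenant quota check.
        if self.per_tenant.get(req.tenant, 0) >= self.tenant_quota:
            return Decision(False, "quota: tenant limit reached")
        self.in_flight += 1
        self.per_tenant[req.tenant] = self.per_tenant.get(req.tenant, 0) + 1
        return Decision(True, "admitted")

    def release(self, req):
        # Call when an admitted request finishes, successfully or not.
        self.in_flight -= 1
        self.per_tenant[req.tenant] -= 1
```

None of these checks depend on how the request arrived. The same controller could sit behind any message protocol.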

What a real control plane does

In production, a control plane is responsible for turning requests into governed execution.

A real control plane usually owns some combination of:

Task admission
Scheduling
State management
Queueing and priority
Retries and dead-letter handling
Isolation and sandboxing
Policy enforcement
Observability and tracing
Concurrency control
Rate limits and quotas
Recovery after partial failure

This is why “we have agent-to-agent messaging” is not the same statement as “we have a production-ready multi-agent platform.”
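
To make one item from that list concrete, here is a minimal sketch of retries with exponential backoff and a dead-letter list. The function name `run_with_retries` and its parameters are assumptions for illustration; production systems typically add jitter, error classification, and durable dead-letter storage.

```python
import time


def run_with_retries(fn, max_attempts=3, base_delay=0.5, sleep=time.sleep, dead_letters=None):
    """Call fn until it succeeds or max_attempts is exhausted.

    Waits base_delay * 2**(attempt - 1) seconds between attempts.
    On exhaustion, records the failure in dead_letters (if given) and re-raises.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))
    if dead_letters is not None:
        dead_letters.append((fn, last_error))
    raise last_error
```

The `sleep` parameter is injectable so the backoff schedule can be checked without actually waiting. Note that blind retries like this are only safe when the wrapped call is idempotent.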

Why the confusion happens

The confusion is understandable because message exchange is the visible part.

You can watch agents call each other. You can log requests and responses. It feels like work is flowing.

But what you are often observing is just transport, not execution governance.

A system can have beautiful message passing and still fail at the first hard operational problem:

One downstream agent is slow.
Another fails intermittently.
Two tasks race on shared state.
A tenant exceeds quota.
One step must be retried, but only once.
A tool call needs isolation.
A workflow must resume after process restart.

Protocols do not solve those problems by themselves.
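
Take the last item, resuming after a restart. A minimal sketch, assuming steps are named and their completion is checkpointed to a local JSON file, might look like this. The function `run_workflow` and the file layout are hypothetical; a real control plane would more likely use a database or a workflow engine.

```python
import json
import os


def run_workflow(steps, checkpoint_path):
    """Run (name, fn) steps in order, skipping any already recorded as done."""
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)
    for name, fn in steps:
        if name in done:
            continue  # completed before the last crash or restart
        fn()
        done.append(name)
        # Write to a temp file, then rename, so a crash never leaves a half-written checkpoint.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(done, f)
        os.replace(tmp_path, checkpoint_path)
    return done
```

If the process dies between `fn()` returning and the checkpoint being written, that step runs again on resume, so this sketch still relies on steps being safe to repeat.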

The production stack is layered

A useful mental model is:

Layer 1: Protocol

Responsible for interoperability and message exchange.

Layer 2: Control plane

Responsible for execution decisions, scheduling, state, policy, and recovery.

Layer 3: Workers / agents / tools

Responsible for actually doing the work.

Layer 4: Observability and governance

Responsible for tracing, auditability, compliance, and operational visibility.

When teams collapse Layers 1 and 2 into the same concept, they usually underbuild the second one.
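
The layering can be sketched in a few lines. Everything here is a toy stand-in under stated assumptions: the JSON shape with `capability` and `payload` keys is invented for this example and is not the A2A wire format, and `ControlPlane` is a hypothetical class, not a real library.

```python
import json


def decode_message(raw):
    """Layer 1 (protocol): parse the wire format into a capability and payload."""
    msg = json.loads(raw)
    return msg["capability"], msg["payload"]


class ControlPlane:
    """Layer 2: decides whether and where a decoded request runs."""

    def __init__(self, workers, trace):
        self.workers = workers  # capability name -> callable (Layer 3)
        self.trace = trace      # append-only event list (stand-in for Layer 4)

    def submit(self, raw):
        capability, payload = decode_message(raw)
        worker = self.workers.get(capability)
        if worker is None:
            self.trace.append(("rejected", capability))
            return None
        self.trace.append(("dispatched", capability))
        result = worker(payload)
        self.trace.append(("completed", capability))
        return result
```

Swapping `decode_message` for a different protocol would not change `ControlPlane.submit`, which is the point of keeping the two layers separate.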

A concrete analogy

Think of the protocol as the road system and traffic signs.

Think of the control plane as the dispatch center that decides:

which vehicles go where,
in what order,
under what restrictions,
with what fallback plan,
and how incidents are handled.

Roads let movement happen.

Dispatch decides whether the operation succeeds.

What fails in production when you only have protocol

Here is the common pattern:

Agent A sends a request to Agent B.
Agent B calls Tool C.
Tool C times out.
Nobody knows whether the side effect partially happened.
A retry creates duplicate work.
The workflow loses state.
Operators cannot reconstruct what happened.
Security asks which policy allowed the call.
The architecture diagram still says “agents are interoperating.”

That is not an interoperability problem.

That is a control-plane problem.
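
The duplicate-work step in that sequence is commonly addressed with idempotency keys. The sketch below is a minimal illustration, assuming the caller can supply a stable key per logical operation. `IdempotentExecutor` is a hypothetical name, and the in-memory dict stands in for what would need to be a durable, shared store.

```python
class IdempotentExecutor:
    """Returns the recorded result instead of re-running a side effect that already succeeded."""

    def __init__(self):
        # In-memory stand-in; it does not survive restarts or span processes.
        self._results = {}

    def execute(self, idempotency_key, side_effect):
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        result = side_effect()
        self._results[idempotency_key] = result
        return result
```

This only covers retries after a recorded success. If the side effect partially happened and then raised, nothing is recorded, so resolving that case still needs cooperation from the downstream tool.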

What architects should ask instead

If someone presents a protocol as the platform strategy, ask:

Where is workflow state stored?
Who owns retries and idempotency?
What is the admission control model?
How are failures isolated?
What enforces tenant boundaries?
How are policies evaluated?
What is the scheduling model?
How is backpressure handled?
How is execution traced across agent boundaries?
What happens during restart or partial outage?

If there is no clear answer, you do not have a complete production architecture yet.
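
For the backpressure question, one simple and clear answer is a bounded intake queue that refuses work when full rather than buffering without limit. This is a sketch using Python's standard `queue` module; `BoundedIntake` and its method names are hypothetical.

```python
import queue


class BoundedIntake:
    """Accepts tasks up to a fixed capacity and signals rejection beyond it."""

    def __init__(self, capacity):
        self._q = queue.Queue(maxsize=capacity)

    def offer(self, task):
        """Return True if the task was queued, False if the caller should back off."""
        try:
            self._q.put_nowait(task)
            return True
        except queue.Full:
            return False

    def take(self):
        """Remove and return the oldest queued task; raises queue.Empty if none."""
        return self._q.get_nowait()
```

Returning `False` pushes the decision back to the sender, which can then slow down, shed load, or surface an error instead of silently piling up work.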

The practical takeaway

Use protocols for what they are good at: communication and interoperability.

Use control planes for what they are good at: execution management and operational safety.

Production multi-agent systems need both.

Not because the pairing is theoretically elegant, but because real workloads require:

durable state,
controlled execution,
failure recovery,
policy enforcement,
and visibility.

Message routing alone is not enough.

Final claim

A2A is not MCP.

More broadly: both are protocols, and a protocol is not a control plane.

If your architecture stops at message passing, you have not solved orchestration. You have only defined how requests travel.

In production, the harder question is not how agents talk.

It is who decides what actually runs.

DEV Community