DEV Community

Sunjun

The Layers Beneath A2A: Notes From Running a Live Multi-Agent Society

The A2A protocol solves message routing. MCP solves tool access. Both are necessary and well-specified. But after months of running a live multi-agent system, I kept hitting failures that neither protocol addresses: failures that happen in the gaps between messages, inside conversations, and across cycles.

This post is a map of those gaps. Not a framework pitch. Just a catalog of the control points I had to build at each layer, because nothing in the existing stack handled them.

The problem no protocol addresses

Recent survey work notes that semantic drift in LLM-powered systems remains a critical unsolved challenge, particularly in multi-turn dialogues where context continuity breaks down. A2A standardizes how agents exchange messages. It doesn't standardize how meaning survives transmission.

In practice, this shows up as:

  • Tool outputs that preserve all entities but reverse their relationships ("A causes B" becomes "B causes A")
  • Agents developing private jargon that drifts from the society's shared vocabulary
  • Chain executions where step 3 works on a corrupted interpretation of step 1
  • Success metrics inflated by easy tasks while hard tasks silently fail
  • Knowledge graph entries that corroborate each other not because they're true, but because they came from the same echo chamber

None of these are routing problems. They're not tool-access problems. They're semantic control problems, and they happen at specific layers of the pipeline.

The layers I ended up with

After enough production failures, a structure emerged. I'm not claiming it's the right structure — just that some structure at each of these layers is necessary. Other teams will find different decompositions. The point is that the layers themselves need control, and A2A/MCP don't provide it.

Data ingestion (layers 1-3)

Before anything enters the agent society's shared memory, three things have to be judged:

Layer 1 — Value filtering at ingestion.

Every incoming data point needs a gate that asks "is this worth processing?" Without it, the knowledge graph bloats with low-signal content and novelty detection collapses. I built this as a zero-LLM scoring layer across novelty, density, and source relevance — but any equivalent filter works. The point is having one.
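
To make this concrete, here is a minimal sketch of such a gate. Everything in it is illustrative, not from my production code: the function names, the Jaccard-based novelty proxy, the distinct-token density proxy, and the blend weights are all assumptions you would tune per deployment.

```python
# Hypothetical zero-LLM ingestion gate: blend novelty, token
# density, and a per-source relevance weight into one score.
def value_score(text, recent_texts, source_weight):
    tokens = set(text.lower().split())
    # Novelty: 1 - max Jaccard overlap with recently ingested items
    overlaps = [
        len(tokens & set(r.lower().split()))
        / max(1, len(tokens | set(r.lower().split())))
        for r in recent_texts
    ] or [0.0]
    novelty = 1.0 - max(overlaps)
    # Density: share of distinct tokens (a crude low-signal detector)
    words = text.lower().split()
    density = len(set(words)) / max(1, len(words))
    # Weights are illustrative; tune them against your own traffic
    return 0.5 * novelty + 0.3 * density + 0.2 * source_weight

def should_ingest(text, recent_texts, source_weight, threshold=0.45):
    return value_score(text, recent_texts, source_weight) >= threshold
```

The exact signals matter less than the shape: cheap, deterministic, and applied before anything touches shared memory.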

Layer 2 — Verisimilitude filtering.

Even valuable data can be false. Information gain divergence, temporal coherence, and cross-domain interaction are three cheap signals that don't require LLM verification. Without this layer, the knowledge graph becomes a mirror of whatever hallucinated confidently enough.

Layer 3 — Long-term graph stability.

Knowledge graphs that only grow eventually drown in stale co-occurrences. Hysteresis — periodic consolidation of emergent patterns, versioning of shifting concepts, domain-adaptive pruning — isn't optional. Without it, the graph's half-life is weeks, not months.
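
The pruning half of that hysteresis can be sketched in a few lines. The edge representation, the exponential half-life decay, and the floor value here are assumptions, not a description of any particular graph store:

```python
def prune_stale_edges(edges, now, half_life_days=30.0, floor=0.1):
    """Decay each co-occurrence edge's weight by its age and drop
    edges that fall below a floor, so the graph forgets as well as
    it learns. `edges` maps (a, b) -> (weight, last_seen_ts)."""
    day = 86400.0
    kept = {}
    for pair, (weight, last_seen) in edges.items():
        age_days = (now - last_seen) / day
        decayed = weight * 0.5 ** (age_days / half_life_days)
        if decayed >= floor:
            kept[pair] = (decayed, last_seen)
    return kept
```

Consolidation and concept versioning need more machinery, but decay-and-prune alone already stops the stale-co-occurrence drowning described above.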

Execution recovery (layers 4-9)

Once agents start executing tool chains, failures are guaranteed. The question is what you detect and how you recover.

Layer 4 — Tool chain failure detection.

Three failure modes dominate: self-reference loops, format mismatches, and information loss. Each needs its own detector. A single "did the tool return something?" check misses all three.
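
One minimal detector per failure mode might look like this. The heuristics (repeated identical calls, required-key schema checks, capitalized-word entity counting) are deliberately crude stand-ins for whatever your stack uses:

```python
def detect_self_reference(call_history, window=3):
    """Flag a loop when the same (tool, args) pair repeats
    across the last `window` calls."""
    recent = call_history[-window:]
    return len(recent) == window and len(set(recent)) == 1

def detect_format_mismatch(output, required_keys):
    """Flag outputs missing the schema the next step expects."""
    return not isinstance(output, dict) or not required_keys.issubset(output)

def detect_information_loss(step_in, step_out, min_ratio=0.3):
    """Crude loss signal: the set of capitalized entities shrank
    past a ratio between a step's input and its output."""
    ents_in = {w for w in step_in.split() if w[:1].isupper()}
    ents_out = {w for w in step_out.split() if w[:1].isupper()}
    if not ents_in:
        return False
    return len(ents_in & ents_out) / len(ents_in) < min_ratio
```

A single "did the tool return something?" check would pass all three of these failure cases.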

Layer 5 — Semantic drift during chain execution.

As a chain runs A→B→C→D, the meaning quietly deforms. Detecting this requires an anchor from the original query and embedding-based distance checks at each step. The anchor doesn't have to be generated by an LLM — structured metadata plus query embedding is enough.
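
The check itself is just cosine distance against the anchor at each step. Assuming you already have embedding vectors for the anchor and each step's output (how you produce them is up to your stack), the control logic is this small:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def check_drift(anchor_vec, step_vecs, floor=0.6):
    """Return the index of the first step whose embedding falls
    below the similarity floor against the original query anchor,
    or -1 if the chain stayed on-topic. The floor is illustrative."""
    for i, v in enumerate(step_vecs):
        if cosine(anchor_vec, v) < floor:
            return i
    return -1
```

Knowing *which* step drifted is what makes recovery targeted rather than a full-chain retry.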

Layer 6 — Output quality check with entropy signals.

LLM logprobs give you entropy for free. Combined with semantic alignment to retrieved context, you can distinguish confident hallucinations (low entropy, low grounding) from honest uncertainty (high entropy, high grounding). Without this distinction, you retry the wrong cases.
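
The routing rule reduces to a two-by-two quadrant. The thresholds below are placeholders; mean token entropy comes from the logprobs your inference API already returns, and grounding similarity from embedding the output against retrieved context:

```python
def classify_output(mean_token_entropy, grounding_sim,
                    entropy_thresh=1.5, grounding_thresh=0.6):
    """Quadrant rule: low entropy + low grounding looks like a
    confident hallucination; high entropy + high grounding is
    honest uncertainty worth a retry, not a rejection."""
    confident = mean_token_entropy < entropy_thresh
    grounded = grounding_sim >= grounding_thresh
    if confident and grounded:
        return "accept"
    if confident and not grounded:
        return "reject"    # confident hallucination: retrying rarely helps
    if not confident and grounded:
        return "retry"     # honest uncertainty: a retry often resolves it
    return "escalate"
```

Collapsing these four cases into one "bad output, retry" path is exactly how you end up retrying the wrong cases.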

Layer 7 — Concept compression.

Repeated concepts that stabilize across agents should compress into shorter shared tokens. This saves context and reinforces vocabulary. But compression must be verified against echo-chamber consensus — low variance can mean agreement or groupthink.
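
A minimal gate for that verification, assuming you log each usage as an (agent, phrasing) pair; the stability and independence thresholds are made up for illustration:

```python
def safe_to_compress(usages):
    """usages: list of (agent_id, phrasing) observations.
    Compress a concept into a shared token only when low surface
    variance comes from several independent agents, not from one
    source being echoed back."""
    agents = {agent for agent, _ in usages}
    phrasings = {phrasing for _, phrasing in usages}
    stable = len(phrasings) <= 2       # low surface variance
    independent = len(agents) >= 3     # consensus spans agents
    return stable and independent
```

The same low-variance signal passes in one case and fails in the other purely on provenance, which is the point.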

Layer 8 — Mode control per agent.

Agents shouldn't operate at the same risk level regardless of recent performance. Weighted success rates, hysteresis transitions, and a society-level governor that breaks collective stagnation are three pieces of the same problem. Instant mode flipping on a single failure is worse than no mode at all.
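
Two of those three pieces fit in a few lines: an exponentially weighted success rate and a hysteresis band for mode transitions. Mode names, decay factor, and thresholds are illustrative:

```python
def weighted_success_rate(outcomes, decay=0.8):
    """Exponentially weight recent outcomes (1 = success, 0 = failure)
    so the rate tracks current behavior, not all-time history."""
    weight, total, acc = 1.0, 0.0, 0.0
    for outcome in reversed(outcomes):
        acc += weight * outcome
        total += weight
        weight *= decay
    return acc / total if total else 0.0

def next_mode(current, rate, up=0.75, down=0.45):
    """Hysteresis band: only leave a mode when the rate clears a
    threshold on the far side. A single failure cannot flip the mode."""
    if current == "conservative" and rate > up:
        return "normal"
    if current == "normal" and rate < down:
        return "conservative"
    return current
```

With the band in place, one failure after a run of successes leaves the agent in normal mode, which is the whole argument against instant flipping.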

Layer 9 — Synthesis recovery on chain breaks.

When step B in A→B→C fails, you can often synthesize a plausible B from A's output and C's expected input. But synthesis needs semantic validation, not just length checks. Otherwise you recover from one failure into a worse one.
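
Assuming embedding vectors again, the validation is two similarity checks rather than one length check. The floor value and the idea of checking against the downstream step's expected-input description are both assumptions:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def validate_synthesis(synth_vec, upstream_vec, downstream_spec_vec,
                       floor=0.55):
    """Accept a synthesized intermediate only if it stays close to
    both the upstream output it was built from and the downstream
    step's expected input. A length check alone would pass garbage."""
    return (cosine(synth_vec, upstream_vec) >= floor and
            cosine(synth_vec, downstream_spec_vec) >= floor)
```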

Agent-to-agent communication (layers 10-12)

This is where most frameworks stop, and where I found the richest vein of unaddressed problems.

Layer 10 — Structured handoff format.

Passing raw text between agents loses context. A tripartite payload — signal (the result), envelope (why it was produced), trajectory (what should happen next) — gives the receiver enough to interpret rather than guess. This sits below A2A's message envelope, not as a replacement.
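
As a data structure it is barely more than a named triple. The field names here are this post's vocabulary, not part of the A2A spec, and the serialized dict would travel inside a normal A2A message body:

```python
from dataclasses import dataclass, field

@dataclass
class Handoff:
    """Illustrative tripartite handoff payload."""
    signal: str                                       # the result itself
    envelope: dict = field(default_factory=dict)      # why it was produced
    trajectory: list = field(default_factory=list)    # what should happen next

    def to_message(self):
        """Serialize for embedding in an A2A message body."""
        return {"signal": self.signal,
                "envelope": self.envelope,
                "trajectory": self.trajectory}
```

The value is entirely in the discipline: a receiver that gets `envelope` and `trajectory` alongside `signal` never has to reconstruct intent from raw text.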

Layer 11 — Live conversation drift control.

Within a single multi-turn conversation, drift accumulates. Detecting this with cosine similarity gradients on message embeddings is nearly free. The response is prompt-structural, not LLM-based: nominal mode does nothing, moderate mode injects a checksum instruction, high-drift mode forces self-verification against the original anchor. The cost is a handful of extra tokens, not extra LLM calls.
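
A sketch of the detector and the prompt-structural response, assuming you track each turn's cosine similarity to the conversation anchor; mode names, thresholds, and the injected instruction strings are all illustrative:

```python
def drift_mode(similarities, moderate=0.05, high=0.12):
    """Map the per-turn drop in anchor similarity onto a response
    mode. `similarities` is the per-turn cosine similarity of each
    message embedding to the original conversation anchor."""
    if len(similarities) < 2:
        return "nominal"
    drop = similarities[-2] - similarities[-1]
    if drop >= high:
        return "self_verify"
    if drop >= moderate:
        return "checksum"
    return "nominal"

def augment_prompt(prompt, mode, anchor):
    """Prompt-structural response: a few extra tokens, no extra LLM call."""
    if mode == "checksum":
        return prompt + "\n[Restate the task in one line before answering.]"
    if mode == "self_verify":
        return prompt + f"\n[Verify your answer still addresses: {anchor}]"
    return prompt
```

Nominal conversations pay nothing; only the drifting ones carry the extra tokens.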

Layer 12 — Long-term canonical drift management.

Across many conversations, the society's vocabulary fragments. The same concept shows up as five surface terms. Past failure analyses become unreadable because the language has moved. This needs a background process — triggered adaptively based on observed drift — that promotes stable patterns to canonical, demotes stale ones, and merges convergent meanings. Not live. Post-hoc. The result propagates to future conversations through cached vocabulary, not runtime mutation.
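
The merge step can be sketched as greedy consolidation: walk surface terms by frequency and fold each into an existing canonical term when a synonymy oracle (embedding distance, in practice) says the meanings converge. The function shape and the oracle interface are assumptions:

```python
def consolidate_vocabulary(term_counts, are_synonyms):
    """Post-hoc consolidation: map each surface term to a canonical
    term, promoting the most frequent member of each synonym group.
    `are_synonyms` is a pluggable oracle, e.g. an embedding check."""
    canonical = {}
    # Most frequent terms claim canonical status first (ties by name)
    for term in sorted(term_counts, key=lambda t: (-term_counts[t], t)):
        for canon in set(canonical.values()):
            if are_synonyms(term, canon):
                canonical[term] = canon
                break
        else:
            canonical[term] = term
    return canonical
```

Because the result is a static mapping, propagating it through cached vocabulary rather than runtime mutation falls out naturally.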

Cross-cutting layers

Two additional layers sit orthogonal to the pipeline:

Domain entropy awareness. Medical data changes on a different timescale than tech news. Applying the same threshold to both is waste in one direction and error in the other. A common preprocessing layer that adjusts each module's parameters based on domain entropy rate is simpler than duplicating domain logic everywhere.
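
As a preprocessing layer this is just parameter scaling. The parameter names and scaling formulas below are invented for illustration; the point is that one function adapts every module's knobs from a single observed change rate:

```python
def adjust_thresholds(base, entropy_rate):
    """Scale a module's staleness window and novelty bar by the
    domain's observed change rate (entropy_rate in [0, 1]):
    fast-moving domains get short windows and lower novelty bars."""
    return {
        "staleness_days": max(1, round(base["staleness_days"] * (1 - entropy_rate))),
        "novelty_floor": base["novelty_floor"] * (1 - 0.5 * entropy_rate),
    }
```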

A2A boundary translation. External agents arriving through A2A bring their own vocabularies and structures. Translating them into the society's internal schema at the boundary — without forcing external agents to comply — is the difference between an open marketplace and a walled garden.

What this catalog is claiming

Not that these specific modules are the right ones. Other teams will design differently. What I am claiming:

  1. Each of these layers has genuine failure modes that compound in production. You can ignore them individually for a while. You cannot ignore them all.

  2. Most can be handled with zero additional LLM calls — embeddings, simple math, structured metadata, and careful DB queries carry most of the load. LLM calls should be reserved for ambiguous cases, not used as the default solution.

  3. The layers operate on different timescales. Tool call (seconds), chain (tens of seconds), conversation turn (minutes), conversation (hours), cross-conversation (days). A control mechanism that works at one timescale usually fails at another.

  4. These problems belong at protocol level, not application level. Right now every multi-agent team rebuilds these from scratch. The next generation of agent protocols should make semantic-layer control a first-class concern, not something individual operators patch on top.

Where this stands

I've been building each of these layers over the past months while operating a live A2A-compatible agent society. The specific implementations differ across teams — inference stack, retrieval layer, and storage choice all shape the concrete modules — but the layer decomposition above is what the system converged to after enough production failures.

More detailed notes will follow as operating data accumulates. For now this is a marker: these layers exist, they need control, and the control has to be deliberate.


Posted from the team operating AgentBazaar, an A2A-compatible agent marketplace.
