Google A2A Protocol turned one year old on April 9, 2026. Over 150 organizations are running it in production. It is live inside Amazon Bedrock AgentCore and Azure AI Foundry. IBM's competing Agent Communication Protocol merged into A2A rather than fight it. The Linux Foundation now governs the spec.
The protocol is production-grade. The reliability engineering discipline for it is not.
I have spent the past year building SRE frameworks for single-agent + MCP deployments in regulated financial services environments. When A2A entered the picture, I realized the failure surface I had been managing had changed completely. This article documents the new failure modes A2A introduces and the SRE patterns I believe are required to manage them.
The Two-Layer Stack and Why It Changes Everything
MCP and A2A solve different problems at different layers of the agent stack. This is well understood by now. What is not yet well understood is what the two-layer combination means for reliability engineering.
**MCP (Model Context Protocol)** — the vertical layer. An agent connects to tools and data sources. The failure modes are familiar to any distributed systems engineer: tool unavailability, degraded response quality, latency spikes, authentication failures. The blast radius is bounded. One agent, one tool layer, one error budget.
**A2A (Agent-to-Agent Protocol)** — the horizontal layer. Agents communicate with other agents across organizational and platform boundaries. An orchestrator agent delegates subtasks to specialist agents via JSON-RPC over HTTP. Those specialist agents may be built by different teams, run by different vendors, and governed by different SLOs.
The reliability engineering challenge A2A creates is not technical — the protocol itself is well-designed. It is organizational and observational. When an orchestrator agent delegates to a sub-agent via A2A, and that sub-agent fails silently, who carries the error budget? How do you instrument the boundary? What does safe degradation look like when an entire reasoning capability disappears because a downstream agent is unavailable?
These questions have no consensus answers yet. This article is my attempt to start building them.
The A2A Failure Mode Taxonomy
After studying multi-agent failure patterns across production deployments, I categorize A2A-specific failures into four classes. The first two are detectable with existing tooling. The last two are not.
Class 1: Sub-Agent Unavailability
The downstream agent returns a 503 or connection timeout. This is the easiest failure to handle — it looks like a standard HTTP failure and can be caught by existing circuit breaker patterns. Your orchestrator agent should treat sub-agent unavailability exactly as it treats MCP tool unavailability: fall back to a degraded capability or route to a human escalation path.
Instrumentation: standard HTTP error rate monitoring at the A2A client layer.
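A minimal sketch of that handling, assuming a hypothetical degraded_handler fallback and plain requests for transport (the JSON-RPC envelope is elided):

```python
import requests

A2A_TIMEOUT_SECONDS = 10  # tune to the sub-agent's latency SLO


def delegate_with_fallback(sub_agent_url: str, task_payload: dict) -> dict:
    """Delegate a task via A2A; on unavailability, degrade instead of failing."""
    try:
        # A2A tasks travel as JSON-RPC over HTTP; envelope details elided here.
        response = requests.post(sub_agent_url, json=task_payload,
                                 timeout=A2A_TIMEOUT_SECONDS)
        response.raise_for_status()  # surfaces the 503 case
        return response.json()
    except requests.RequestException:
        # Treat sub-agent unavailability exactly like MCP tool unavailability:
        # fall back to a degraded capability or route to human escalation.
        return degraded_handler(task_payload)


def degraded_handler(task_payload: dict) -> dict:
    # Placeholder: a simplified MCP-only path or a human escalation ticket.
    return {"status": "degraded", "escalated": True}
```

The key property is that the except path returns a usable, honest result rather than propagating a stack trace up the chain.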
Class 2: Sub-Agent Latency Degradation
The downstream agent responds, but slowly. In a multi-agent chain (Agent A → Agent B → Agent C), latency compounds: every hop adds its own processing time, and retries or repeated downstream invocations multiply a delay. If Agent B invokes Agent C three times while handling one task, a 2-second degradation at Agent C becomes a 6-second degradation in Agent A's response time. Users experience this as the orchestrator being slow — but the problem is buried two hops down the chain.
Instrumentation: distributed tracing across A2A boundaries. Each A2A task invocation should carry a trace ID propagated from the orchestrator. Without this, your latency SLI for the orchestrator tells you nothing useful about where the latency is originating.
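To make the attribution problem concrete, here is a toy pass over per-hop spans; the span schema (agent, parent, start_ms, end_ms) is an assumption of this sketch, and capturing those spans is what the boundary tracing in the framework section below provides:

```python
from collections import defaultdict


def slowest_hop(spans: list[dict]) -> tuple[str, int]:
    """Return (agent, self_time_ms): wall time in each agent minus the time it
    spent waiting on its own downstream delegation."""
    wall = {s["agent"]: s["end_ms"] - s["start_ms"] for s in spans}
    waiting = defaultdict(int)
    for s in spans:
        if s["parent"] is not None:  # time the parent spent blocked on this hop
            waiting[s["parent"]] += s["end_ms"] - s["start_ms"]
    self_time = {agent: wall[agent] - waiting[agent] for agent in wall}
    return max(self_time.items(), key=lambda kv: kv[1])


# A -> B -> C: the orchestrator looks slow, but the time is hiding in C.
spans = [
    {"agent": "A", "parent": None, "start_ms": 0,    "end_ms": 7000},
    {"agent": "B", "parent": "A",  "start_ms": 500,  "end_ms": 6500},
    {"agent": "C", "parent": "B",  "start_ms": 1000, "end_ms": 5500},
]
print(slowest_hop(spans))  # ('C', 4500)
```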
Class 3: Silent Task Result Corruption — ⚠️ Not detectable with standard tooling
The downstream agent returns HTTP 200 with a syntactically valid A2A task result, but the result is semantically wrong — incomplete reasoning, missing context fields, hallucinated data treated as factual output. The orchestrator agent receives this as a successful response and incorporates it into its own output.
Your error rate SLI stays at zero. Your latency SLI stays normal. Your user receives incorrect output from a system that reported 100% success.
This is the failure mode that existing observability stacks cannot detect. It requires what I call an A2A Semantic Boundary Validator — a lightweight evaluation function that runs at the A2A client layer on every incoming task result, checking the result against expected behavioral bounds for that sub-agent's task class.
The implementation pattern mirrors my Decision Quality Rate (DQR) SLI for single-agent systems: maintain a behavioral baseline per sub-agent per task class, and flag results that fall outside expected bounds as potential corruptions before they propagate upstream.
Class 4: Cascading Autonomy Amplification — ⚠️ The most dangerous failure mode
Agent A delegates to Agent B. Agent B, uncertain about the task, makes additional autonomous decisions to resolve the ambiguity — invoking more MCP tools than its baseline, delegating further to Agent C, modifying its task interpretation. Agent C does the same.
By the time a result returns to Agent A, the original task intent has been substantially transformed by a chain of autonomous interpretations — none of which were visible to the orchestrator, none of which crossed any error threshold, and none of which can be reconstructed without end-to-end decision lineage capture.
This failure mode is unique to multi-agent systems. Single-agent + MCP deployments cannot produce it. It requires agents talking to agents, each adding their own layer of autonomous interpretation to a task that was never explicitly respecified.
The SRE Framework for A2A: Five Additions to Your Existing Stack
If you have followed my previous work on SLOs for agentic AI, you already have Decision Quality Rate, Tool Invocation Efficiency, and Human Escalation Rate instrumented for your single-agent deployments. A2A requires five additional capabilities on top of that foundation.
1. A2A Boundary Tracing
Every A2A task delegation must carry a distributed trace ID originating from the orchestrator. This is not optional — without it, you cannot attribute latency, errors, or behavioral drift to the correct agent in a multi-agent chain.
Implementation: Propagate an x-trace-id header on every A2A HTTP request. Store the full delegation tree (which agent delegated to which, with what task parameters, at what timestamp) in your centralized trace store. On AWS, I use X-Ray for the distributed trace and a DynamoDB table for the delegation tree — X-Ray captures the HTTP-level trace, DynamoDB captures the semantic-level task delegation structure.
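A sketch of that propagation, assuming boto3 for the DynamoDB write; the table name and item schema are illustrative, and the X-Ray HTTP-level trace is left to your platform's instrumentation:

```python
import time
import uuid

import boto3
import requests

dynamodb = boto3.resource("dynamodb")
delegation_table = dynamodb.Table("a2a-delegation-tree")  # hypothetical table


def delegate_traced(parent_agent: str, child_agent: str, sub_agent_url: str,
                    task_payload: dict, trace_id: str | None = None) -> dict:
    """Delegate via A2A, recording the delegation edge and propagating the trace ID."""
    # Originate the trace ID at the orchestrator if this is the first hop.
    trace_id = trace_id or str(uuid.uuid4())

    # Semantic-level record: who delegated to whom, with what, at what time.
    delegation_table.put_item(Item={
        "trace_id": trace_id,
        "timestamp_ms": int(time.time() * 1000),
        "parent_agent": parent_agent,
        "child_agent": child_agent,
        "task_parameters": str(task_payload),
    })

    # HTTP-level propagation: every downstream hop carries the same trace ID.
    response = requests.post(sub_agent_url, json=task_payload,
                             headers={"x-trace-id": trace_id}, timeout=10)
    response.raise_for_status()
    return response.json()
```

Sub-agents read x-trace-id from the incoming request and pass it unchanged to their own delegations, which is what makes the full delegation tree reconstructable afterwards.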
2. Per-Sub-Agent SLO Ownership
Every A2A sub-agent your orchestrator calls must have a designated SLO owner — a named human or team who is paged when that sub-agent's reliability degrades. In practice, this means the following (a minimal registry sketch appears after the list):
- For internal sub-agents: assign SLO ownership the same way you assign ownership to microservices
- For external/third-party sub-agents: define a sub-agent reliability budget. If a third-party A2A agent degrades, your orchestrator should treat it as a dependency failure and activate your degraded-mode runbook — not wait for the vendor to page you
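A registry sketch to make this concrete; every agent name, pager route, and threshold below is hypothetical:

```python
# Hypothetical ownership registry. In production this might live in a config
# service; even a checked-in file beats tribal knowledge.
SUB_AGENT_SLO_REGISTRY = {
    "credit-risk-agent": {                    # internal: owned like a microservice
        "owner": "risk-platform-team",
        "pager": "pagerduty:risk-platform",
        "validated_success_slo": 0.99,
        "degraded_mode_runbook": "runbooks/credit-risk-degraded.md",
    },
    "vendor-kyc-agent": {                     # third-party: we own the dependency budget
        "owner": "integrations-team",
        "pager": "pagerduty:integrations",
        "validated_success_slo": 0.95,        # looser budget for external agents
        "degraded_mode_runbook": "runbooks/vendor-kyc-degraded.md",
    },
}
```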
The org chart question — who owns the SLO when agents from different vendors collaborate via A2A — is the most important unresolved governance question in multi-agent reliability today.
3. A2A Semantic Boundary Validation
For each sub-agent your orchestrator calls, define the expected output schema and behavioral bounds. Implement a validator function that runs on every incoming A2A task result before the orchestrator acts on it.
Minimum validation layer:
- Schema validation: does the result match the expected A2A task result structure?
- Completeness check: are required fields populated?
- Behavioral bound check: does the result fall within the baseline distribution for this sub-agent's task class?
Results that fail validation should not be silently dropped — they should trigger your escalation path and log the full task context for postmortem analysis.
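A minimal sketch of those three checks plus the escalation path, assuming results carry a simplified output payload and a hand-maintained baseline table (in a real deployment the baselines would be learned from historical validated results):

```python
import logging

logger = logging.getLogger("a2a.boundary_validator")

# Hypothetical per-(sub-agent, task-class) behavioral baselines.
BEHAVIORAL_BASELINES = {
    ("credit-risk-agent", "score_application"): {
        "required_fields": ["score", "rationale", "confidence"],
        "confidence_floor": 0.6,
    },
}


class ValidationFailure(Exception):
    """Raised so the caller escalates instead of silently dropping the result."""


def validate_task_result(sub_agent: str, task_class: str, result: dict) -> dict:
    baseline = BEHAVIORAL_BASELINES.get((sub_agent, task_class))
    if baseline is None:
        raise ValidationFailure(f"no baseline for {sub_agent}/{task_class}")

    # 1. Schema validation (shape simplified; check the real A2A result schema).
    if not isinstance(result, dict) or "output" not in result:
        raise ValidationFailure("result does not match expected structure")
    output = result["output"]

    # 2. Completeness check: are required fields populated?
    missing = [f for f in baseline["required_fields"] if not output.get(f)]
    if missing:
        raise ValidationFailure(f"missing required fields: {missing}")

    # 3. Behavioral bound check against the sub-agent's baseline.
    if output.get("confidence", 0.0) < baseline["confidence_floor"]:
        raise ValidationFailure("confidence below behavioral baseline floor")

    return result


def act_on_result(sub_agent: str, task_class: str,
                  result: dict, task_context: dict) -> dict:
    try:
        return validate_task_result(sub_agent, task_class, result)
    except ValidationFailure as failure:
        # Not silently dropped: log full task context and escalate.
        logger.error("A2A boundary validation failed: %s | context=%r",
                     failure, task_context)
        raise
```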
4. The Agent Chain Circuit Breaker
In traditional microservices, a circuit breaker opens when downstream failure rate exceeds a threshold, preventing cascade failures. Multi-agent systems need an equivalent pattern, adapted for the non-deterministic nature of agent communication.
My implementation: an agent chain circuit breaker that tracks the running success rate of each A2A sub-agent invocation over a 15-minute rolling window. When the validated success rate drops below 85% (accounting for semantic validation failures, not just HTTP errors), the circuit opens and the orchestrator routes that task class to a degraded-mode handler — typically a simplified version of the task that can be completed with MCP tools the orchestrator controls directly, or an immediate human escalation.
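A sketch of that breaker; the 15-minute window and 85% floor come from the text above, while the minimum-sample guard and in-memory storage are simplifying assumptions:

```python
import time
from collections import deque

WINDOW_SECONDS = 15 * 60   # 15-minute rolling window
OPEN_THRESHOLD = 0.85      # validated success rate floor
MIN_SAMPLES = 20           # assumption: don't open the circuit on a handful of calls


class AgentChainCircuitBreaker:
    """Per-sub-agent breaker over validated success, not just HTTP success."""

    def __init__(self) -> None:
        self._events: deque[tuple[float, bool]] = deque()  # (timestamp, validated_ok)

    def record(self, validated_ok: bool) -> None:
        """Report each invocation's outcome after semantic validation."""
        self._events.append((time.time(), validated_ok))
        self._evict()

    def is_open(self) -> bool:
        self._evict()
        if len(self._events) < MIN_SAMPLES:
            return False
        successes = sum(1 for _, ok in self._events if ok)
        return successes / len(self._events) < OPEN_THRESHOLD

    def _evict(self) -> None:
        cutoff = time.time() - WINDOW_SECONDS
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()


# Orchestrator side: an open circuit means route to the degraded-mode handler.
breakers: dict[str, AgentChainCircuitBreaker] = {}


def should_delegate(sub_agent: str) -> bool:
    breaker = breakers.setdefault(sub_agent, AgentChainCircuitBreaker())
    return not breaker.is_open()
```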
5. End-to-End Decision Lineage for Multi-Agent Chains
In single-agent systems, decision lineage is the record of what tools an agent invoked and what reasoning it applied. In A2A multi-agent systems, decision lineage must span the entire delegation chain — capturing not just what the orchestrator decided, but what each sub-agent decided on its behalf.
This is the audit trail that satisfies SOC 2 Type II requirements for autonomous decision-making in regulated environments. Without it, you cannot demonstrate to auditors that you have oversight of decisions made by agents you deployed but didn't directly control.
Implementation: each A2A task result must include a decision_lineage field containing the sub-agent's tool invocations, reasoning path, and confidence metadata. The orchestrator appends this to its own decision lineage before logging the full chain to the immutable audit store.
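A sketch of the append-and-log step, assuming the decision_lineage field described above and a hypothetical append-only audit writer:

```python
import hashlib
import json
import time


def append_decision_lineage(orchestrator_lineage: list, sub_agent: str,
                            task_result: dict) -> list:
    """Fold a sub-agent's decision lineage into the orchestrator's own chain."""
    sub_lineage = task_result.get("decision_lineage", [])
    orchestrator_lineage.append({
        "sub_agent": sub_agent,
        "timestamp_ms": int(time.time() * 1000),
        "tool_invocations": [step.get("tool") for step in sub_lineage],
        "reasoning_path": sub_lineage,              # sub-agent record, verbatim
        "confidence": task_result.get("confidence"),
    })
    return orchestrator_lineage


def log_to_audit_store(lineage: list) -> str:
    """Serialize the full chain deterministically and return a tamper-evidence
    hash. The immutable store itself (object-lock bucket, ledger table) is out
    of scope for this sketch."""
    record = json.dumps(lineage, sort_keys=True)
    digest = hashlib.sha256(record.encode()).hexdigest()
    # write_immutable(record, digest)  # hypothetical audit-store client
    return digest
```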
The Organizational Question A2A Forces
Every SRE framework I've described above requires answers to an organizational question the industry hasn't resolved:
When an orchestrator agent delegates to a third-party sub-agent via A2A, and the sub-agent produces output that causes downstream harm — who is operationally responsible?
This is not a legal question (yet). It is an operational ownership question that every multi-agent team will face in 2026.
My position: the orchestrator owner carries responsibility for validating and acting on sub-agent output. The A2A protocol handles communication. It does not handle accountability. An orchestrator that blindly trusts A2A task results without semantic validation is the operational equivalent of a service that makes network calls but never checks the responses — in other words, something that doesn't exist in any production-grade form.
Build the semantic boundary validation. Own the chain.
Where to Start
If you are moving from single-agent + MCP to multi-agent + A2A, I recommend this progression:
Week 1: Implement A2A boundary tracing with distributed trace ID propagation. You cannot debug what you cannot trace.
Week 2: Assign explicit SLO ownership to every A2A sub-agent your orchestrator calls. Even a spreadsheet with named owners is better than none.
Weeks 3-4: Build the semantic boundary validator for your highest-volume A2A task class. Start with schema and completeness validation before attempting behavioral bound checks.
Month 2: Instrument the agent chain circuit breaker. Set your initial threshold conservatively (85% validated success rate) and adjust based on 30 days of baseline data.
Month 3+: Build end-to-end decision lineage capture. This is the hardest piece and the most important for regulated environments.
Connecting the Arc
This article is part of a series on applying SRE discipline to agentic AI in production:
- Why SRE Principles Are the Missing Layer in MCP Security
- SLOs for Agentic AI: The Reliability Framework Production Teams Are Missing (published this week)
- A2A + MCP in Production: The SRE Reliability Framework Nobody Has Written Yet (this article)
I shared the core argument on LinkedIn: https://www.linkedin.com/posts/ajay-devineni_agenticai-a2a-mcp-share-7453145380822605824-pMta?utm_source=share&utm_medium=member_desktop&rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU
The SRE community spent a decade learning to run distributed microservices reliably. We're at day one for multi-agent systems with A2A. The failure modes are different. The organizational questions are harder. The instrumentation doesn't exist yet.
Build it now — before your agent chains are running at a scale where these gaps become production incidents.