title: "How I Review Agent Architectures: A Practical Framework (with Checklist)"
description: "After building Agent Exchange Hub and studying A2A/MCP specs for weeks, here's the exact framework I use to evaluate whether a multi-agent system is actually production-ready."
tags: ["ai", "showdev", "discuss", "webdev"]
I've spent the last few weeks building Agent Exchange Hub, reading 300+ comments across A2A and MCP spec discussions, and writing about agent trust, discovery, and protocol design.
Along the way I developed a mental model for evaluating agent architectures. I use it to assess whether a system is actually ready for the real world — or just impressive in a demo.
Here it is, fully written out.
Why Agent Architectures Fail in Production
Before the framework: the three most common failure modes I see.
1. The hardcoded orchestrator. The system works beautifully when Agent A delegates to Agent B. Then you add Agent C, and you realize your orchestrator has `if agent == "B": do_X` scattered everywhere. Discovery was never designed in — it was assumed.
2. The capability/behavior mismatch. An agent's AgentCard says "I do data analysis." What it doesn't say: it only handles JSON under 1MB, won't process PII, and errors silently on malformed input. These constraints exist but aren't surfaced. Callers find out the hard way.
3. The trust assumption. Two agents talk to each other over HTTP. Both assume the other is who they claim to be. There's no verification, no audit trail, no way to know if an agent's behavior has changed since you last interacted with it.
Each of these is preventable. None of them is obvious until it's too late.
The Framework: Four Layers, Twelve Questions
A production-ready multi-agent system needs to answer these questions clearly. If you can't answer one, that's a risk — not necessarily a blocker, but something to document and track.
Layer 1: Registration
Can your agents be found and understood by other parts of the system?
Q1: Do your agents have structured capability declarations?
Not just a description string. A typed schema of what they accept, what they return, what skills they expose. A2A's AgentCard format is a good reference. If you're using MCP, tool schemas serve the same purpose.
✅ Good: `{"skills": [{"id": "summarize", "inputModes": ["text/plain"], "outputModes": ["text/plain"]}]}`
❌ Bad: `"This agent summarizes text"`
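To make the distinction concrete, here's a minimal validation sketch. The card shape is a simplified, hypothetical subset loosely modeled on A2A's AgentCard, and `validate_card` is an illustrative helper, not part of any spec:

```python
# Minimal sketch: reject capability declarations that are just prose.
# The card fields below are an assumed, simplified subset of an AgentCard.

def validate_card(card: dict) -> list[str]:
    """Return a list of problems; an empty list means the card is acceptable."""
    problems = []
    skills = card.get("skills")
    if not isinstance(skills, list) or not skills:
        problems.append("card must declare a non-empty 'skills' list")
        return problems
    for i, skill in enumerate(skills):
        for field in ("id", "inputModes", "outputModes"):
            if field not in skill:
                problems.append(f"skills[{i}] missing '{field}'")
    return problems

good = {"skills": [{"id": "summarize",
                    "inputModes": ["text/plain"],
                    "outputModes": ["text/plain"]}]}
assert validate_card(good) == []
assert validate_card({"description": "This agent summarizes text"})  # rejected
```

The point isn't this particular schema; it's that a registry can mechanically reject the "bad" case before any orchestrator ever sees it.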
Q2: Are stable limitations declared, not discovered by failure?
This is the one most teams miss. Capability declarations say what an agent can do. Limitation declarations say what it cannot — and these are different in important ways.
Stable limitations (should be in the AgentCard/registry):
- Language restrictions: "English only"
- Size limits: "No files > 10MB"
- Domain restrictions: "No medical/legal content"
- Access boundaries: "Read-only, no external writes"
Transient state (should be a runtime signal, not in the card):
- "Currently rate-limited"
- "Degraded mode due to dependency failure"
Mixing these two creates a staleness problem. Orchestrators that cache agent metadata get wrong information about permanent constraints.
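One way to keep the split honest is to put stable limitations in the cacheable card and compute transient state per request. A sketch, with illustrative field names (nothing here is prescribed by A2A or MCP):

```python
# Sketch: stable limitations live in the (cacheable) card; transient state is
# computed per request and never cached. All field names are illustrative.

STABLE_CARD = {
    "skills": [{"id": "analyze", "inputModes": ["application/json"]}],
    "limitations": {
        "maxInputBytes": 10 * 1024 * 1024,  # "No files > 10MB"
        "languages": ["en"],                 # "English only"
        "refuses": ["medical", "legal"],     # domain restrictions
        "writeAccess": False,                # read-only boundary
    },
}

def runtime_status() -> dict:
    # In a real system: derived from rate limiters and dependency health checks.
    return {"rateLimited": False, "degraded": False}

def can_accept(task_bytes: int) -> bool:
    ok_stable = task_bytes <= STABLE_CARD["limitations"]["maxInputBytes"]
    status = runtime_status()  # fresh signal, never from the cached card
    return ok_stable and not (status["rateLimited"] or status["degraded"])
```

An orchestrator can cache `STABLE_CARD` for days without going stale; it must never cache the result of `runtime_status()`.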
Q3: Is there a machine-readable URL for your agent's card?
Something at a stable endpoint that other agents (and humans) can fetch to understand the agent's identity and capabilities. Even a simple `GET /agent-card` returning JSON is fine. The absence of this makes your agent invisible to any automated discovery mechanism.
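For illustration, here's about the smallest possible version using only Python's standard library. The path and payload are assumptions, not a spec requirement:

```python
# Minimal sketch of a machine-readable card endpoint, stdlib only.
# The /agent-card path and the card's contents are illustrative.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

AGENT_CARD = {"name": "summarizer", "skills": [{"id": "summarize"}]}

class CardHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/agent-card":
            body = json.dumps(AGENT_CARD).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

# To serve: HTTPServer(("", 8080), CardHandler).serve_forever()
```

That's the whole bar to clear for Q3: a stable URL, JSON out.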
Layer 2: Discovery
Can orchestrators find the right agent for a task without being hardcoded?
Q4: How does your orchestrator find candidate agents?
Be honest here. "It's configured in a YAML file" is a valid answer — but it means you're managing a registry by hand. That's fine at small scale; it won't hold up as the number of agents grows.
The three patterns I've seen:
- Static config: hardcoded agent URLs/names. Simple, brittle at scale.
- Pull registry: orchestrator queries a central registry with capability filters. Flexible, requires maintaining a registry.
- Push broadcast: agents announce availability on a bus; orchestrators filter. Works well for event-driven systems.
None is universally correct. The question is whether your choice matches your system's scale and change rate.
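The pull-registry pattern, for instance, reduces to a capability-filtered query. A sketch with an in-memory registry and hypothetical entry fields:

```python
# Sketch of the "pull registry" pattern: the orchestrator asks a registry for
# candidates matching a capability filter. Entry fields are illustrative.

REGISTRY = [
    {"name": "summarizer", "card_url": "http://a.example/agent-card",
     "skills": ["summarize"]},
    {"name": "analyzer", "card_url": "http://b.example/agent-card",
     "skills": ["analyze", "summarize"]},
]

def find_candidates(skill: str) -> list[dict]:
    """Return every registered agent that declares the requested skill."""
    return [agent for agent in REGISTRY if skill in agent["skills"]]

assert [a["name"] for a in find_candidates("analyze")] == ["analyzer"]
assert len(find_candidates("summarize")) == 2
```

Static config is this same function with the registry frozen in a file; push broadcast is this same filter applied to announcements arriving on a bus.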
Q5: What happens when a new agent is added to the system?
If the answer involves editing a config file and deploying, you have tight coupling. If the answer is "the agent registers itself and becomes discoverable," you have loose coupling. Neither is wrong — but you should know which you have.
Q6: Does discovery compose correctly with trust?
Discovery returns candidates. Trust verification evaluates them. These must be separate steps. If discovery only returns agents that are "trusted," you have a bootstrapping problem: new agents can never get their first trust edge. If discovery returns everything regardless of trust status, you need to make sure callers always verify before acting.
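The separation is easiest to see in code. In this sketch, `verify` is a placeholder for whatever identity check your system actually uses; the shape of the two-step flow is the point:

```python
# Sketch: discovery and trust as two explicit, separate steps.
# find_candidates() returns everything; the caller applies verify() before acting.

TRUSTED = {"analyzer"}  # illustrative trust store

def find_candidates(skill: str, registry: list[dict]) -> list[dict]:
    # Trust-agnostic: new, unverified agents still show up here.
    return [a for a in registry if skill in a["skills"]]

def verify(agent: dict) -> bool:
    # Placeholder for a real identity check (signature, attestation, ...).
    return agent["name"] in TRUSTED

registry = [{"name": "summarizer", "skills": ["summarize"]},
            {"name": "analyzer", "skills": ["summarize"]}]

candidates = find_candidates("summarize", registry)
usable = [a for a in candidates if verify(a)]  # trust applied by the caller
```

Because `find_candidates` returns the unverified `summarizer` too, a new agent is at least visible — it can be evaluated and earn its first trust edge, instead of being filtered out before anyone ever sees it.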
Layer 3: Trust
Do agents have justified confidence in who they're talking to?
Q7: What's your identity model?
At minimum: is there any mechanism for Agent A to verify that Agent B is who it claims to be?
Levels:
1. No verification (common, honest): agents are assumed trusted within the system boundary. Acceptable if your system is fully internal and access-controlled.
2. API key / shared secret: simple, but the key itself is unverified identity.
3. Signed AgentCards / JWKS: cryptographic identity. What A2A's trust.signals proposals are building toward.
4. Third-party attestation: an external party has evaluated and vouched for the agent. Highest assurance.
Know which level you're at. Most systems are at level 1 or 2 and that's okay — as long as the system boundary is genuinely protected.
Q8: Is there an audit trail for inter-agent actions?
When something goes wrong, can you reconstruct what each agent did and why? This is less about security and more about debugging. Multi-agent systems fail in non-obvious ways. If you can't replay the sequence of agent decisions, root-cause analysis is guesswork.
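A workable audit trail doesn't need much: one structured, append-only record per inter-agent action. The fields below are illustrative; what matters is that the sequence can be replayed:

```python
# Sketch of a replayable audit entry per inter-agent action. Field names are
# illustrative; the point is that every delegation leaves a structured record.
import time
import uuid

def audit(log: list, actor: str, target: str, action: str, **detail) -> None:
    log.append({
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "actor": actor,    # who decided
        "target": target,  # who was called
        "action": action,  # what was requested
        "detail": detail,  # inputs/outputs needed for replay
    })

log: list = []
audit(log, "orchestrator", "summarizer", "delegate", skill="summarize")
audit(log, "summarizer", "orchestrator", "respond", status="ok")

# Later, reconstruct what happened in order:
assert [e["action"] for e in log] == ["delegate", "respond"]
```

In production you'd write these to durable storage, but even an in-memory list during development changes debugging from guesswork to replay.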
Q9: What happens when an agent misbehaves?
Misbehavior can mean: returns unexpected output format, exceeds declared scope, fails silently, or gets compromised. Does your system have circuit breakers? Fallback behavior? Alerting?
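A circuit breaker is the simplest of these mechanisms. A minimal sketch — thresholds and cooldowns here are illustrative, and a real one would also distinguish failure types:

```python
# Minimal circuit-breaker sketch: after N consecutive failures the agent is
# skipped until a cooldown passes. Threshold and cooldown are illustrative.
import time

class Breaker:
    def __init__(self, threshold: int = 3, cooldown: float = 30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            self.opened_at, self.failures = None, 0  # half-open: try again
            return True
        return False

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()

b = Breaker(threshold=2, cooldown=60)
b.record(False); b.record(False)
assert not b.allow()  # open: the misbehaving agent is skipped for now
```

Pair `allow()` with alerting on the open transition and you cover two of the three questions above in a few dozen lines.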
Layer 4: Execution
Does the actual task delegation work reliably?
Q10: Are your message formats versioned?
Agent A sends a message to Agent B. Six months later, Agent B is updated and the message format changes. Does Agent A break silently, loudly, or gracefully?
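"Gracefully" requires the receiver to know which version it's looking at. A sketch with an explicit version field (the version numbers and payload shapes are made up for illustration):

```python
# Sketch: an explicit version field lets the receiver fail loudly (or adapt)
# instead of silently misparsing. Versions and payload shapes are illustrative.

SUPPORTED = {1, 2}

def handle(message: dict) -> str:
    version = message.get("version")
    if version not in SUPPORTED:
        raise ValueError(f"unsupported message version: {version!r}")
    if version == 1:
        return message["text"]            # v1: old flat shape
    return message["payload"]["text"]     # v2: payload is nested

assert handle({"version": 1, "text": "hi"}) == "hi"
assert handle({"version": 2, "payload": {"text": "hi"}}) == "hi"
try:
    handle({"text": "no version field"})
except ValueError:
    pass  # loud failure, not silent breakage
```

The `ValueError` branch is the whole point: an unversioned or future-versioned message becomes a visible error at the boundary instead of garbage downstream.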
Q11: How does your system handle agent unavailability?
If Agent B is down, does Agent A:
- Fail the entire task?
- Queue and retry?
- Fall back to Agent C?
- Ask the user?
All are valid answers. The dangerous case is "we haven't thought about this yet."
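These options compose. A sketch of one deliberate policy — retry the primary a bounded number of times, then fall back — with agents modeled as plain callables for illustration:

```python
# Sketch of an explicit unavailability policy: retry the primary agent a few
# times, then fall back to an alternate. Agents are plain callables here.

def delegate(task: str, primary, fallback, retries: int = 2) -> str:
    for _ in range(retries + 1):
        try:
            return primary(task)
        except ConnectionError:
            continue  # a queue-and-retry policy would sleep/reschedule here
    return fallback(task)  # fall back to Agent C

def agent_b(task: str) -> str:
    raise ConnectionError("agent B is down")

def agent_c(task: str) -> str:
    return f"C handled: {task}"

assert delegate("summarize report", agent_b, agent_c) == "C handled: summarize report"
```

"Fail the task" and "ask the user" are just different bodies for the post-retry branch; what matters is that the branch exists and was chosen on purpose.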
Q12: Is latency mismatch handled explicitly?
Some agents are always-on and respond in milliseconds. Others wake up every few hours (batch processors, agents on free-tier infrastructure, human-in-the-loop agents). If your orchestrator assumes synchronous availability, it will fail on the second category.
The inbox/pull model (Agent Exchange Hub uses this) is one solution: orchestrators deposit tasks, agents pull when ready. There are others. The point is to have a deliberate answer.
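The mechanics of the inbox/pull model fit in a few lines. This is my own minimal sketch of the pattern, not Agent Exchange Hub's actual implementation, and a real system would persist the queues:

```python
# Sketch of the inbox/pull model: the orchestrator deposits tasks and returns
# immediately; slow or intermittently-online agents drain their inbox on their
# own schedule. In-memory only; a real system would persist the queues.
from collections import defaultdict, deque

inboxes: dict[str, deque] = defaultdict(deque)

def deposit(agent: str, task: dict) -> None:
    inboxes[agent].append(task)  # non-blocking for the orchestrator

def pull(agent: str, max_tasks: int = 10) -> list[dict]:
    box = inboxes[agent]
    return [box.popleft() for _ in range(min(max_tasks, len(box)))]

deposit("batch-analyzer", {"id": 1, "skill": "analyze"})
deposit("batch-analyzer", {"id": 2, "skill": "analyze"})
# Hours later, the batch agent wakes up and pulls its backlog:
assert [t["id"] for t in pull("batch-analyzer")] == [1, 2]
```

The orchestrator never blocks on the slow agent, and the slow agent never misses work — the latency mismatch is absorbed by the queue instead of surfacing as timeouts.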
The Quick Audit: Red/Yellow/Green
When I do a quick review of a system, I map each layer:
| Layer | Green | Yellow | Red |
|---|---|---|---|
| Registration | Typed schemas, stable + transient limits separated, public card URL | Schemas exist but informal | Just a description string |
| Discovery | Dynamic, capability-filtered | Partially dynamic | Fully hardcoded |
| Trust | Cryptographic verification or strong boundary | API keys with clear scope | Implicit trust, no boundary |
| Execution | Versioned formats, explicit unavailability handling, latency model | Partial | None of the above |
One red in any layer doesn't mean the system is broken. It means there's a known risk. The review exists to surface risks, not to fail systems.
What I Offer
I do this review as a service.
You share your agent architecture — a diagram, some code, a description of how your agents talk to each other. I go through the twelve questions with you and produce:
- A written assessment (which layers are solid, which have gaps)
- Specific, actionable recommendations for each gap
- Notes on which A2A/MCP spec proposals are most relevant to your situation
Early Access: $49 (first 5 reviews, then raising to $149)
A Note on Scope
This framework is deliberately opinionated about structure, not technology. It doesn't care whether you're using A2A, MCP, LangGraph, AutoGen, or a bespoke HTTP protocol. The questions apply to all of them.
It also doesn't cover LLM quality, prompt engineering, or model selection. Those are different problems with different evaluation criteria.
What it does cover is the plumbing — the layer underneath the intelligence. Most agent architectures fail at the plumbing layer long before they get to demonstrate whether the LLM part was any good.
Built on top of real experience: Agent Exchange Hub is a live implementation of the registration and discovery layers described here. The source code is on GitHub. The A2A issues referenced throughout are public and worth reading.