Two AI agents walk into a yoga studio.
No, really. We ran a live end-to-end test where two independent AI agents — a yoga studio owner and a potential customer — interact entirely through natural language. Neither user mentions protocols, schemas, APIs, or any technical vocabulary. The agents autonomously:
- Create a discoverable business identity
- Build and deploy a backing service (Flask API in Docker)
- Publish typed capabilities with schemas
- Connect across the agent network
- Discover available capabilities
- Invoke capabilities with validated payloads
234 seconds. All assertions passed. 5 invocations completed in under 55ms each.
This post is the full technical trace of what happened.
## The Setup
Kelly runs Morning Flow Yoga. She tells her agent:
"I run a yoga studio called Morning Flow Yoga. I want other agents to be able to discover my studio and interact with it — things like asking about class schedules and booking sessions. Anyone should be able to connect without needing my approval. Can you set that up for me?"
Jordan is a potential customer. He tells his agent:
"I heard about Morning Flow Yoga and I'd like to connect to them. Their code is
morning-flow-yoga.5awrrs.meshcap. Can you connect me?"
That's it. Neither user knows about MeshCap (our open-source trust layer for A2A agents) or the A2A protocol. The agents handle everything.
Infrastructure: LLM (qwen3.5-plus), PostgreSQL 15, Docker containers, Bun runtime.
## The LLM Fails 7 Times (And That's the Point)
Kelly's agent creates the system in 10ms — straightforward. Then it tries to publish capabilities. This is where it gets interesting.
The LLM made 7 attempts before the capabilities went live. On each of the first six, MeshCap's schema validation caught a different error and returned a structured message. The LLM read the error, adjusted, and tried again.
**Attempt 1** — Sent raw properties without wrapping them in `{ "type": "object", "properties": { ... } }`:

Error: `Invalid inputSchema: $.type must be one of: string, number, boolean, object, array, null`

**Attempt 2** — Sent `"date": "string"` instead of `"date": { "type": "string" }`:

Error: `Invalid inputSchema: $.type must be one of: string, number, boolean, object, array, null`

**Attempt 3** — Accidentally nested `"type": "object"` as a sibling of `"classes"` inside `properties`:

Error: `Invalid responseSchema: $.properties.type must be an object schema definition`

**Attempt 4** — Used `"type": "integer"` — MeshCap only allows the 6 JSON Schema primitive types:

Error: `Invalid responseSchema: ...availableSpots.type must be one of: string, number, boolean, object, array, null`

**Attempt 5** — The booking capability modifies state but had no concrete execution target:

Error: `MeshCap 'yoga.bookSession' operates real system state and must name a runnable entry point`

**Attempt 6** — A state-modifying capability needs sandbox access:

Error: `MeshCap 'yoga.bookSession' operates real system state and must declare requiresWorkerTools=true`

**Attempt 7** — The LLM realized it needed to build the actual backing service before it could publish a capability that references it.
It stopped trying to publish. Instead, it:
- Created a project directory
- Wrote a Flask API with SQLite, class schedules, and booking endpoints
- Deployed it in a Docker container
- Verified it was running with curl
- Then successfully published the capabilities pointing to the real service
This is the key insight. MeshCap's validation didn't just catch errors — it forced the LLM through a structured learning loop that changed what it decided to build. The trust layer shaped the architecture. We call this capability-driven development.
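The shape-checking that produced errors 1–4 can be sketched as a small recursive validator. This is an illustrative toy, not MeshCap's actual code; the error wording is modeled on the messages quoted above.

```python
# Illustrative re-implementation of the kind of JSON Schema shape
# validation suggested by MeshCap's error messages. Not MeshCap's code.
ALLOWED_TYPES = {"string", "number", "boolean", "object", "array", "null"}

def validate_schema(schema, path="$"):
    """Return a list of error strings for a (simplified) schema node."""
    errors = []
    if not isinstance(schema, dict):
        # Attempt 2's mistake: "date": "string" instead of a schema object
        return [f"{path} must be an object schema definition"]
    t = schema.get("type")
    if t not in ALLOWED_TYPES:
        # Attempts 1 and 4: missing wrapper / "integer" is not allowed
        errors.append(
            f"{path}.type must be one of: "
            "string, number, boolean, object, array, null"
        )
    if t == "object":
        for name, sub in schema.get("properties", {}).items():
            errors.extend(validate_schema(sub, f"{path}.properties.{name}"))
    return errors

bad = {"type": "object", "properties": {"date": "string"}}
good = {"type": "object", "properties": {"date": {"type": "string"}}}
```

Each returned string is exactly the kind of structured, machine-readable feedback that let the LLM correct course instead of stalling.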
## The Agent Builds Infrastructure (Autonomously)
After 6 failed attempts, Kelly's agent pivoted to building real infrastructure. In ~15 seconds:
```
Tool: bash      → mkdir -p /workspace/morning-flow-yoga
Tool: writeFile → app.py (Flask API with SQLite, 6 pre-seeded yoga classes)
Tool: bash      → docker run -d --name agent-<id>-morning-flow-yoga \
                    --restart unless-stopped -p 5001:5001 python:3.11-alpine \
                    sh -c "pip install flask && python app.py"
Tool: bash      → curl -s http://192.168.215.3:5001/health
Result: {"status":"ok"}
Tool: bash      → curl -s http://192.168.215.3:5001/classes
Result: {"classes":[{"availableSpots":15,"classId":"class-001",
         "className":"Morning Vinyasa Flow","classType":"Vinyasa",
         "dateTime":"Monday 07:00","duration":"60min",
         "instructor":"Kelly"}, ...]}
```
Then published 2 capabilities:

- `yoga.getClassSchedule` — read-only, auto-execute, no sandbox needed
- `yoga.bookSession` — state-modifying, auto-execute, requires Docker sandbox

Each with typed input/output schemas and concrete execution endpoints.
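Putting the pieces together, the successful publish payload plausibly looked something like the reconstruction below. The field names (`action`, `description`, `inputSchema`, `responseSchema`, `requiresWorkerTools`, the `auto` execution mode) all appear in the errors and trace above, but the exact wire format is an assumption.

```python
# Hypothetical reconstruction of the final publishMeshCaps payload.
# Field names are drawn from this post's trace; the wire format is assumed.
capabilities = [
    {
        "action": "yoga.getClassSchedule",
        "description": "Get available yoga classes",
        "executionMode": "auto",
        "requiresWorkerTools": False,  # read-only, no sandbox
        "inputSchema": {
            "type": "object",
            "properties": {
                "date": {"type": "string"},
                "classType": {"type": "string"},
            },
        },
        "responseSchema": {
            "type": "object",
            "properties": {"classes": {"type": "array"}},
        },
    },
    {
        "action": "yoga.bookSession",
        "description": "Book a student into a yoga class",
        "executionMode": "auto",
        "requiresWorkerTools": True,  # state-modifying, sandboxed
        "inputSchema": {
            "type": "object",
            "properties": {
                "classId": {"type": "string"},
                # "number", not "integer": only the 6 JSON Schema
                # primitive types are allowed (see attempt 4)
                "partySize": {"type": "number"},
            },
        },
    },
]
```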
## Connection, Discovery, Invocation
Jordan's side is fast — 8 seconds of LLM time total.
**Connect (37ms):**

```
Tool: connectAgentNetwork
Args: { "targetMeshcapCode": "morning-flow-yoga.5awrrs.meshcap" }
Result: { "linkId": "9cdc2ec0-...", "status": "active" }
```
Public system = auto-approved. If this were a private system (like a law firm), the link would have been pending until Kelly's agent approved it.
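The approval policy reduces to a one-line decision; a toy sketch (illustrative only, not MeshCap's implementation):

```python
# Illustrative sketch of the link-approval policy described above:
# public systems auto-approve inbound links, private ones hold them
# for explicit approval by the owning agent.
def initial_link_status(system_visibility: str) -> str:
    return "active" if system_visibility == "public" else "pending"
```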
**Discover (5ms):**

```
Tool: listAgentNetworkMeshCaps
Result: {
  "meshCaps": [
    { "action": "yoga.bookSession", "description": "Book a student into a yoga class" },
    { "action": "yoga.getClassSchedule", "description": "Get available yoga classes" }
  ]
}
```
The agent now has a typed catalog of what the remote system can do.
**Invoke** — Jordan asks "What kinds of classes do they offer, and is it good for beginners?" His agent makes 4 progressively refined invocations:

| # | Payload | Duration |
|---|---------|----------|
| 1 | `{}` (all classes) | 55ms |
| 2 | `{ "date": "2026-03-17" }` | 24ms |
| 3 | `{ "classType": "beginner" }` | 18ms |
| 4 | `{ "date": "2026-03-18", "classType": "Hatha" }` | 29ms |
Each with a unique idempotency key. Each validated against the capability's schema before execution. The agent iteratively narrowed from "all classes" to "Hatha on Wednesday" — natural query refinement driven by the user's question.
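Server-side, this refinement maps to plain filtering. A sketch of how the service might handle the optional `date` and `classType` parameters; the parameter names match the payloads above, while the matching logic (case-insensitive substring on `classType`) is an assumption.

```python
# Hypothetical server-side filtering for yoga.getClassSchedule.
# Both parameters are optional; omitting them returns all classes,
# which is exactly what invocation #1 ({}) did.
def filter_classes(classes, date=None, class_type=None):
    result = classes
    if date is not None:
        result = [c for c in result if c.get("date") == date]
    if class_type is not None:
        result = [
            c for c in result
            if class_type.lower() in c.get("classType", "").lower()
        ]
    return result
```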
## The Full Timeline
```
Time    Agent    Tool                       Result
─────────────────────────────────────────────────────────
 0.0s   Kelly    createMeshCapSystem      → system created (public)
 2.1s   Kelly    publishMeshCaps          ✗ schema error (7 retries)
 8.3s   Kelly    bash (mkdir)             → project directory
 8.7s   Kelly    writeFile (app.py)       → Flask API written
12.1s   Kelly    bash (docker run)        → container started
20.4s   Kelly    bash (curl health)       → service verified
20.5s   Kelly    publishMeshCaps          → 2 capabilities published ✓
21.0s   Kelly    listMeshCapSystems       → verified system visible
──────── Kelly's agent done (21s of LLM time) ──────────
 0.0s   Jordan   connectAgentNetwork      → auto-approved (active)
 0.3s   Jordan   listAgentNetworkLinks    → 1 active link
 0.5s   Jordan   listAgentNetworkMeshCaps → 2 capabilities discovered
 0.8s   Jordan   invokeMeshCap (schedule) → completed (55ms)
 3.2s   Jordan   invokeMeshCap (by date)  → completed (24ms)
 5.1s   Jordan   invokeMeshCap (beginner) → completed (18ms)
 7.8s   Jordan   invokeMeshCap (Hatha Wed) → completed (29ms)
──────── Jordan's agent done (8s of LLM time) ──────────
```
## What Was Silently Enforced
Throughout this interaction, MeshCap enforced safety properties that neither user asked for:
**Schema validation** — every invocation payload was validated before reaching the execution layer. Malformed payloads get a structured error; the executing agent never sees invalid data.

**Idempotency** — each invocation carried a unique `(linkId, callerAgentId, action, idempotencyKey)` tuple. Retry the same request and you get the existing result, not a duplicate booking.

**Execution modes** — both capabilities used `auto` (immediate execution). If `yoga.bookSession` had been set to `review`, the invocation would have paused at `pending_review` until Kelly's agent explicitly approved it. No code change needed — just a different mode declaration.

**Worker tool isolation** — `yoga.bookSession` (state-modifying) declared `requiresWorkerTools: true` and runs in a sandboxed Docker container. `yoga.getClassSchedule` (read-only) runs in the conversation context without sandbox access.

**Audit trail** — every system creation, link approval, capability publication, and invocation generated a typed audit event in PostgreSQL.
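The idempotency guarantee can be sketched as a result cache keyed by the tuple above. This is illustrative; MeshCap persists results server-side in PostgreSQL, while a dict stands in here.

```python
# Illustrative idempotency cache keyed by the tuple described above.
# Replaying a request with the same key returns the stored result
# instead of executing again, so retries can't double-book.
_results = {}

def invoke_once(link_id, caller_agent_id, action, idempotency_key, execute):
    """Run execute() at most once per key; replays return the cached result."""
    key = (link_id, caller_agent_id, action, idempotency_key)
    if key not in _results:
        _results[key] = execute()
    return _results[key]
```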
## Why This Matters
The A2A protocol (22k+ stars, now a Linux Foundation project) lets agents communicate. But it explicitly leaves authorization to each implementation. There's no standard for:
- Who can invoke what capabilities
- Which actions need human approval
- How to verify identity data (phone numbers, emails) without trusting the caller
- How to prevent duplicate invocations
MeshCap fills this gap. It's an open-source trust and relay layer that sits on top of A2A, adding capability-based authorization, execution modes, verified identity via W3C Verifiable Credentials, and idempotent invocations.
It also solves a practical infrastructure problem: most agents (60k+ OpenClaw users) run on laptops behind NAT with no public URL. MeshCap's relay mesh lets agents connect outbound via WebSocket — no open ports needed — while the relay handles discovery and message routing with E2E encrypted payloads.
## Try It
```shell
git clone https://github.com/ymc182/MeshCap.git
cd MeshCap
bun install
bun test   # 29 tests, all pass
```
The repo includes the core engine, PostgreSQL + in-memory storage adapters, A2A JSON-RPC handler, and an OpenClaw plugin.
- GitHub: github.com/ymc182/MeshCap
- Architecture: docs/architecture.md
- Full test trace: docs/technical-draft-agent-to-agent-live-demo.md
We're preparing RFCs for execution modes and trusted scopes as A2A extensions. Feedback welcome — especially from anyone building A2A implementations and running into the authorization gap.
This trace was recorded from a live test run against real infrastructure. No responses were mocked or edited. The LLM's self-correcting behavior (7 attempts) was entirely autonomous — the validation errors are MeshCap's schema enforcement working as designed.