
Eric C (Eric)


An LLM Failed 7 Times to Publish an Agent Capability. Then It Built a Docker Service Instead.

Two AI agents walk into a yoga studio.

No, really. We ran a live end-to-end test where two independent AI agents — a yoga studio owner and a potential customer — interact entirely through natural language. Neither user mentions protocols, schemas, APIs, or any technical vocabulary. The agents autonomously:

  1. Create a discoverable business identity
  2. Build and deploy a backing service (Flask API in Docker)
  3. Publish typed capabilities with schemas
  4. Connect across the agent network
  5. Discover available capabilities
  6. Invoke capabilities with validated payloads

234 seconds end to end. All assertions passed. Every capability invocation completed in 55ms or less.

This post is the full technical trace of what happened.


The Setup

Kelly runs Morning Flow Yoga. She tells her agent:

"I run a yoga studio called Morning Flow Yoga. I want other agents to be able to discover my studio and interact with it — things like asking about class schedules and booking sessions. Anyone should be able to connect without needing my approval. Can you set that up for me?"

Jordan is a potential customer. He tells his agent:

"I heard about Morning Flow Yoga and I'd like to connect to them. Their code is morning-flow-yoga.5awrrs.meshcap. Can you connect me?"

That's it. Neither user knows about MeshCap (our open-source trust layer for A2A agents) or the A2A protocol. The agents handle everything.

Infrastructure: LLM (qwen3.5-plus), PostgreSQL 15, Docker containers, Bun runtime.


The LLM Fails 7 Times (And That's the Point)

Kelly's agent creates the system in 10ms — straightforward. Then it tries to publish capabilities. This is where it gets interesting.

The LLM attempted to publish capabilities 7 times before succeeding. On each failed attempt, MeshCap's schema validation caught a different error and returned a structured message. The LLM read the error, adjusted, and tried again.

Attempt 1 — Sent raw properties without wrapping them in { "type": "object", "properties": { ... } }:

Error: "Invalid inputSchema: $.type must be one of: string, number, boolean, object, array, null"

Attempt 2 — Sent "date": "string" instead of "date": { "type": "string" }:

Error: "Invalid inputSchema: $.type must be one of: string, number, boolean, object, array, null"

Attempt 3 — Accidentally nested "type": "object" as a sibling of "classes" inside properties:

Error: "Invalid responseSchema: $.properties.type must be an object schema definition"

Attempt 4 — Used "type": "integer" — MeshCap allows only the six core JSON types; "integer" is a JSON Schema addition, not one of them:

Error: "Invalid responseSchema: ...availableSpots.type must be one of: string, number, boolean, object, array, null"
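
The first four failures are all shape errors, and the error messages imply a small recursive check. Here is a Python sketch of a validator in that spirit; the real validator ships with MeshCap, so the names and exact rules below are reconstructions from the error strings, not code from the repo:

```python
# Reconstructed sketch of the schema checks implied by attempts 1-4.
ALLOWED_TYPES = {"string", "number", "boolean", "object", "array", "null"}

def validate_schema(schema, path="$"):
    """Return a list of error strings for a JSON-Schema-like definition."""
    if not isinstance(schema, dict):
        # Attempt 2/3 style: a bare string where a schema object belongs,
        # e.g. "date": "string" instead of "date": {"type": "string"}
        return [f"{path} must be an object schema definition"]
    errors = []
    if schema.get("type") not in ALLOWED_TYPES:
        # Attempt 1 (missing wrapper, so no "type" at all) and
        # attempt 4 ("integer" is not in the allowed set)
        errors.append(
            f"{path}.type must be one of: "
            "string, number, boolean, object, array, null")
    if schema.get("type") == "object":
        for name, prop in schema.get("properties", {}).items():
            errors.extend(validate_schema(prop, f"{path}.properties.{name}"))
    return errors

# Attempt 2's exact mistake reproduces attempt 2's exact error path:
print(validate_schema({"type": "object", "properties": {"date": "string"}}))
# ['$.properties.date must be an object schema definition']
```

Note that attempt 3's error falls out of the same recursion: a stray `"type": "object"` sitting as a sibling inside `properties` is visited as `$.properties.type` and rejected as "must be an object schema definition".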

Attempt 5 — The booking capability modifies state but had no concrete execution target:

Error: "MeshCap 'yoga.bookSession' operates real system state and must name a runnable entry point"

Attempt 6 — A state-modifying capability needs sandbox access:

Error: "MeshCap 'yoga.bookSession' operates real system state and must declare requiresWorkerTools=true"

Attempt 7 — The LLM realized it needed to build the actual backing service before it could publish a capability that references it.

It stopped trying to publish. Instead, it:

  1. Created a project directory
  2. Wrote a Flask API with SQLite, class schedules, and booking endpoints
  3. Deployed it in a Docker container
  4. Verified it was running with curl
  5. Then successfully published the capabilities pointing to the real service

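The data layer the agent wrote can be sketched with the standard library alone. The real app.py was a Flask API; the table and column names below follow the /classes response shown later in the trace, but the function names and booking rule are assumptions:

```python
# Stdlib-only sketch of the service's data layer: seeded class rows in
# SQLite, plus a booking that decrements availableSpots and refuses
# when the class is full.
import sqlite3

def init_db(conn):
    conn.execute("""CREATE TABLE classes (
        classId TEXT PRIMARY KEY, className TEXT, classType TEXT,
        dateTime TEXT, duration TEXT, instructor TEXT,
        availableSpots INTEGER)""")
    conn.execute(
        "INSERT INTO classes VALUES (?, ?, ?, ?, ?, ?, ?)",
        ("class-001", "Morning Vinyasa Flow", "Vinyasa",
         "Monday 07:00", "60min", "Kelly", 15))
    conn.commit()

def book_session(conn, class_id):
    """Decrement availableSpots in one UPDATE; no-op if no spots remain."""
    cur = conn.execute(
        "UPDATE classes SET availableSpots = availableSpots - 1 "
        "WHERE classId = ? AND availableSpots > 0", (class_id,))
    conn.commit()
    return {"booked": cur.rowcount == 1}

conn = sqlite3.connect(":memory:")
init_db(conn)
print(book_session(conn, "class-001"))  # {'booked': True}
```

The single conditional UPDATE keeps the spot check and the decrement in one statement, so two concurrent bookings cannot both take the last spot.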
This is the key insight. MeshCap's validation didn't just catch errors — it forced the LLM through a structured learning loop that changed what it decided to build. The trust layer shaped the architecture. We call this capability-driven development.


The Agent Builds Infrastructure (Autonomously)

After 6 failed attempts, Kelly's agent pivoted to building real infrastructure. In ~15 seconds:

Tool: bash → mkdir -p /workspace/morning-flow-yoga

Tool: writeFile → app.py (Flask API with SQLite, 6 pre-seeded yoga classes)

Tool: bash → docker run -d --name agent-<id>-morning-flow-yoga \
  --restart unless-stopped -p 5001:5001 python:3.11-alpine \
  sh -c "pip install flask && python app.py"

Tool: bash → curl -s http://192.168.215.3:5001/health
Result: {"status":"ok"}

Tool: bash → curl -s http://192.168.215.3:5001/classes
Result: {"classes":[{"availableSpots":15,"classId":"class-001",
  "className":"Morning Vinyasa Flow","classType":"Vinyasa",
  "dateTime":"Monday 07:00","duration":"60min",
  "instructor":"Kelly"}, ...]}

Then published 2 capabilities:

  • yoga.getClassSchedule — read-only, auto-execute, no sandbox needed
  • yoga.bookSession — state-modifying, auto-execute, requires Docker sandbox

Each with typed input/output schemas and concrete execution endpoints.
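
Assembled from the trace, the read-only capability plausibly looked something like this. The action and description come from the discovery result later in the post, and inputSchema, responseSchema, and requiresWorkerTools all appear in the validation errors; the "endpoint" field name is an assumption:

```python
# Illustrative reconstruction of the published read-only capability.
get_class_schedule = {
    "action": "yoga.getClassSchedule",
    "description": "Get available yoga classes",
    "inputSchema": {
        "type": "object",
        "properties": {
            "date": {"type": "string"},       # e.g. "2026-03-17"
            "classType": {"type": "string"},  # e.g. "Hatha"
        },
    },
    "responseSchema": {
        "type": "object",
        "properties": {"classes": {"type": "array"}},
    },
    "requiresWorkerTools": False,  # read-only, so no Docker sandbox
    "endpoint": "http://192.168.215.3:5001/classes",  # assumed field name
}
```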


Connection, Discovery, Invocation

Jordan's side is fast — 8 seconds of LLM time total.

Connect (37ms):

Tool: connectAgentNetwork
Args: { "targetMeshcapCode": "morning-flow-yoga.5awrrs.meshcap" }
Result: { "linkId": "9cdc2ec0-...", "status": "active" }

Public system = auto-approved. If this were a private system (like a law firm), the link would have been pending until Kelly's agent approved it.
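
The rule the relay applied here fits in one line. The "active" and "pending" statuses appear in the trace; the function name is an assumption:

```python
# Sketch of the link-approval rule: public systems auto-approve incoming
# links, private systems hold them for the owner's agent to approve.
def initial_link_status(system_is_public: bool) -> str:
    return "active" if system_is_public else "pending"
```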

Discover (5ms):

Tool: listAgentNetworkMeshCaps
Result: {
  "meshCaps": [
    { "action": "yoga.bookSession", "description": "Book a student into a yoga class" },
    { "action": "yoga.getClassSchedule", "description": "Get available yoga classes" }
  ]
}

The agent now has a typed catalog of what the remote system can do.

Invoke — Jordan asks "What kinds of classes do they offer, and is it good for beginners?" His agent makes 4 progressively refined invocations:

 #  Payload                                         Duration
 1  {}  (all classes)                                  55ms
 2  { "date": "2026-03-17" }                           24ms
 3  { "classType": "beginner" }                        18ms
 4  { "date": "2026-03-18", "classType": "Hatha" }     29ms

Each with a unique idempotency key. Each validated against the capability's schema before execution. The agent iteratively narrowed from "all classes" to "Hatha on Wednesday" — natural query refinement driven by the user's question.


The Full Timeline

Time   Agent    Tool                        Result
─────────────────────────────────────────────────────────
 0.0s  Kelly    createMeshCapSystem         → system created (public)
 2.1s  Kelly    publishMeshCaps             ✗ schema error (7 retries)
 8.3s  Kelly    bash (mkdir)                → project directory
 8.7s  Kelly    writeFile (app.py)          → Flask API written
12.1s  Kelly    bash (docker run)           → container started
20.4s  Kelly    bash (curl health)          → service verified
20.5s  Kelly    publishMeshCaps             → 2 capabilities published ✓
21.0s  Kelly    listMeshCapSystems          → verified system visible
──────── Kelly's agent done (21s of LLM time) ──────────
 0.0s  Jordan   connectAgentNetwork         → auto-approved (active)
 0.3s  Jordan   listAgentNetworkLinks       → 1 active link
 0.5s  Jordan   listAgentNetworkMeshCaps    → 2 capabilities discovered
 0.8s  Jordan   invokeMeshCap (schedule)    → completed (55ms)
 3.2s  Jordan   invokeMeshCap (by date)     → completed (24ms)
 5.1s  Jordan   invokeMeshCap (beginner)    → completed (18ms)
 7.8s  Jordan   invokeMeshCap (Hatha Wed)   → completed (29ms)
──────── Jordan's agent done (8s of LLM time) ──────────

What Was Silently Enforced

Throughout this interaction, MeshCap enforced safety properties that neither user asked for:

Schema validation — every invocation payload was validated before reaching the execution layer. Malformed payloads get a structured error; the executing agent never sees invalid data.

Idempotency — each invocation carried a unique (linkId, callerAgentId, action, idempotencyKey) tuple. Retry the same request and you get the existing result, not a duplicate booking.
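
That tuple makes the dedup logic simple to sketch. The store and function names below are illustrative assumptions, not MeshCap's actual API:

```python
# Sketch of idempotent invocation handling keyed on the four-tuple.
results: dict[tuple, dict] = {}

def invoke(link_id, caller_agent_id, action, idempotency_key, execute):
    """Execute at most once per key; replay the stored result on retries."""
    key = (link_id, caller_agent_id, action, idempotency_key)
    if key not in results:
        results[key] = execute()  # first time: actually run it
    return results[key]           # retry: same result, no duplicate booking
```

Retrying with the same four-tuple returns the cached result, so a network-level retry can never double-book a class.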

Execution modes — both capabilities used auto (immediate execution). If yoga.bookSession had been set to review, the invocation would have paused at pending_review until Kelly's agent explicitly approved it. No code change needed — just a different mode declaration.
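
The mode gate itself can be sketched in a few lines; the status strings follow the post's wording ("completed", "pending_review"), while the function name is an assumption:

```python
# Sketch of the execution-mode gate: "auto" runs immediately, "review"
# parks the invocation until the owning agent explicitly approves it.
def start_invocation(mode: str, execute):
    if mode == "auto":
        return {"status": "completed", "result": execute()}
    if mode == "review":
        return {"status": "pending_review"}  # waits for owner approval
    raise ValueError(f"unknown execution mode: {mode}")
```

Because the gate sits in front of the executor, switching a capability from auto to review is purely a declaration change; the capability's code never sees the difference.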

Worker tool isolation — yoga.bookSession (state-modifying) declared requiresWorkerTools: true and runs in a sandboxed Docker container. yoga.getClassSchedule (read-only) runs in the conversation context without sandbox access.

Audit trail — every system creation, link approval, capability publication, and invocation generated a typed audit event in PostgreSQL.


Why This Matters

The A2A protocol (22k+ stars, now a Linux Foundation project) lets agents communicate. But it explicitly leaves authorization to each implementation. There's no standard for:

  • Who can invoke what capabilities
  • Which actions need human approval
  • How to verify identity data (phone numbers, emails) without trusting the caller
  • How to prevent duplicate invocations

MeshCap fills this gap. It's an open-source trust and relay layer that sits on top of A2A, adding capability-based authorization, execution modes, verified identity via W3C Verifiable Credentials, and idempotent invocations.

It also solves a practical infrastructure problem: most agents (60k+ OpenClaw users) run on laptops behind NAT with no public URL. MeshCap's relay mesh lets agents connect outbound via WebSocket — no open ports needed — while the relay handles discovery and message routing with E2E encrypted payloads.


Try It

git clone https://github.com/ymc182/MeshCap.git
cd MeshCap
bun install
bun test  # 29 tests, all pass

The repo includes the core engine, PostgreSQL + in-memory storage adapters, A2A JSON-RPC handler, and an OpenClaw plugin.

GitHub: github.com/ymc182/MeshCap
Architecture: docs/architecture.md
Full test trace: docs/technical-draft-agent-to-agent-live-demo.md

We're preparing RFCs for execution modes and trusted scopes as A2A extensions. Feedback welcome — especially from anyone building A2A implementations and running into the authorization gap.


This trace was recorded from a live test run against real infrastructure. No responses were mocked or edited. The LLM's self-correcting behavior (7 attempts) was entirely autonomous — the validation errors are MeshCap's schema enforcement working as designed.
