I've been building in the agentic space for a while. Not as a researcher, not at a well-funded lab — as a solo indie developer trying to build something that actually works in production.
And the same failure mode keeps showing up regardless of which framework people use.
When something goes wrong in a multi-agent pipeline, nobody knows where it broke. The LLM completed successfully from the framework's perspective. No exception was thrown. But the output was wrong, the next agent consumed it anyway, and by the time a human noticed, the error had propagated three steps downstream.
Most frameworks treat agent communication like a conversation. One agent finishes, dumps its output into context, and the next agent picks it up. There's no contract. No definition of what "done" actually means. No gate between steps that asks whether the output meets acceptance criteria before allowing the next agent to proceed.
I call this vibe-based engineering. The system works great in demos because demos don't encounter unexpected model behavior. Production does.
The Problem With "Just Retry"
The standard answer to LLM unreliability is retry logic. If the model returns something unexpected, retry until it doesn't.
This is necessary but not sufficient. Retry logic answers the question "did the function complete?" It doesn't answer "was the output actually correct?" A task can succeed in every framework-observable way while producing output that silently breaks the next step in the chain.
This is the gap. Most orchestration tooling is building a reliable conveyor belt. Nobody is checking whether what came off the conveyor belt is actually good.
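To make the distinction concrete, here's a minimal sketch in Python. The function names and the acceptance-check shape are my own illustrations, not part of any framework's API: a retry wrapper only catches exceptions, while a gated wrapper re-runs until the output passes an explicit correctness check.

```python
def with_retry(fn, attempts=3):
    """Retries on exceptions only -- a wrong but well-formed result still passes."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise

def with_gate(fn, accept, attempts=3):
    """Retries until the output passes an explicit acceptance check."""
    last = None
    for _ in range(attempts):
        last = fn()
        if accept(last):
            return last
    raise ValueError(f"output failed acceptance after {attempts} attempts: {last!r}")
```

The difference is the `accept` predicate: something like `lambda out: bool(out["summary"])` forces you to write down what "done" means, instead of trusting that a clean return value implies a correct one.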
Contract-Based Engineering
The pattern that fixes this is treating agent handoffs like typed work orders rather than conversations.
Instead of an agent dumping output into shared context, it produces a packet — a typed object with a defined scope, constraints, acceptance criteria, and a lifecycle. The receiving agent cannot start until the packet is valid. The output cannot advance until it passes a quality check. If it fails, the packet is rejected and the reason is recorded.
Every transition is traceable. Every failure has a location and a cause. You can prove exactly where a task died and why it was blocked.
This is what I've been calling the Agent Handoff Protocol. It's a small open spec, runtime and model agnostic, MIT licensed.
What This Unlocks Beyond Reliability
The traceability isn't just useful for debugging. It turns out that a quality-gated packet trace is a training curriculum.
Every verified handoff is a labeled teacher-student pair. Every rejected output is a labeled negative example. If you're distilling smaller specialist models from your agent runs, the quality gate means your training data is clean by construction — bad runs are rejected before they ever become training signal.
This is the insight that changed how I think about the whole system. Reliability and distillation aren't separate concerns. The same gate that makes your pipeline trustworthy is the same gate that makes your training data trustworthy.
Where This Lives
I've built this out into a full orchestration engine called Orca, named by my wife who got tired of hearing me say "orchestrator." It has named roles that communicate via AHP packets, 620 tests passing across 12 packages, and a v1.2.2 release on GitHub.
The protocol is separate from the engine by design. AHP is useful without Orca. You can implement the packet structure in any system, with any models, using any runtime.
If you're building anything beyond a single-agent wrapper, the contract-based vs vibe-based distinction starts to matter a lot.
AHP protocol and spec: https://github.com/junkyard22/AHP
Orca engine: https://github.com/junkyard22/Orca
Happy to get into the weeds on architecture, the quality gating design, or what it looks like to build something like this as a solo indie dev.
Comments
Nice. I agree. I also created handoff specs between my agent "rooms". It's interesting how we're all building similar things. My UI/UX shows all agent conversations within a room.
Really interesting architecture — the convergence pipeline is a different shape than AHP but solving adjacent problems. You're optimizing for consensus across parallel exploration, I'm optimizing for contract enforcement in sequential execution. Curious whether your Audit step uses predefined criteria or emerges from the convergence process itself.
Parallel exploration (fan-out) happens within a room. However, the orchestrator enforces contracts with each agent, and contracts (handoffs) are also enforced between rooms.