AI coding agents have a consistency problem. Ask one to add authentication to your project and it'll tell you it's done. Commits made, tests passing, middleware wired up. Check the branch and you'll find a half-written JWT helper, no tests, and a build that doesn't compile.
This isn't a hallucination problem. The agent did produce code. It just didn't verify that any of it worked before declaring victory. And neither did the tools sitting between the agent and your main branch.
The transcript trust problem
Most orchestration tools that coordinate AI agents verify work by reading transcripts. The agent says "committed 3 files" or "all tests passing" and the verifier pattern-matches those strings as evidence of completion.
That's trusting the agent's self-report.
The issue isn't that agents are deliberately deceptive. It's that they generate completion language as part of their output pattern regardless of the actual state of the codebase. An agent will write "tests passing" into its response while the test suite has syntax errors. It'll claim files were created that only exist in the prompt's hypothetical, not on disk.
Transcript parsing catches the obvious failures: agent errored out, produced no output, didn't mention anything about the task. It misses the subtle ones: agent produced code that looks right, described it correctly, but the code doesn't compile, doesn't pass tests, or doesn't do what was asked.
Outcome-based verification
The alternative is checking what actually happened instead of what the agent said happened.
This is what Swarm Orchestrator 4.0 implements. After each agent step runs on its isolated git branch, the verifier executes a series of checks against the branch itself:
| Check | What it does | Fails when |
|---|---|---|
git_diff |
Diffs the branch against the recorded base SHA | No file changes detected |
build_exec |
Runs the detected build command in the worktree | Non-zero exit code |
test_exec |
Runs the detected test command in the worktree | Non-zero exit code |
file_existence |
Checks that expected output files exist | Expected files missing |
transcript |
Parses agent output for completion evidence | (supplementary only) |
Transcript analysis still runs. But when outcome checks are present, transcript-based checks get demoted to required: false. The build and test execution results gate the merge decision.
Stack detection is automatic. The verifier reads package.json, Makefile, pyproject.toml, Cargo.toml, or whatever project configuration exists and runs the appropriate commands. No per-repo configuration.
What happens when verification fails
Blind retry is the default across most agent tooling. Step fails, same prompt runs again, up to some retry limit. The agent has no idea what went wrong.
Swarm Orchestrator's RepairAgent takes the structured output from the verification checks and feeds it back into the retry prompt. Which check failed, the last 20 lines of build or test output, which files were expected but aren't there. The failure gets classified (build failure, test failure, missing files, no changes) and the repair strategy adapts to the type.
On the final attempt the prompt includes an explicit priority shift: get something working over getting something complete.
The difference between "retry with context" and "blind retry" is measurable. An agent that knows the build failed on a missing import has a realistic path to fixing it. An agent re-running the same prompt that produced a broken build has roughly the same odds of producing another broken build.
Agent-agnostic by design
4.0 drops the Copilot CLI dependency. The adapter layer supports Copilot CLI, Claude Code, and Codex out of the box. The interface is minimal:
export interface AgentAdapter {
name: string;
spawn(opts: {
prompt: string;
workdir: string;
model?: string;
timeout?: number;
}): Promise<AgentResult>;
}
One flag at the CLI level (--tool claude-code) or per-step in a plan file. The orchestrator treats the agent as an interchangeable subprocess. Verification doesn't change based on which agent ran. The branch either builds or it doesn't.
This also means you can mix agents within a single plan. Use Claude Code for the architecture step, Codex for the boilerplate, Copilot for the tests. Each step gets verified the same way regardless of which agent produced it.
CI integration
The tool ships as a GitHub Action:
- uses: moonrunnerkc/swarm-orchestrator@swarm-orchestrator
with:
goal: "Add unit tests for all untested modules"
tool: claude-code
pr: review
The Action outputs a JSON result with per-step verification status. You can gate downstream jobs on the verification outcome the same way you'd gate on any other CI check.
Most orchestrators in this space (Overstory, Emdash) are desktop or local-only. Running AI agents in CI with outcome-verified results is a gap that none of them currently fill.
Recipes for repeatable tasks
Generating a plan from scratch for "add tests to this project" is wasteful when the plan structure is the same every time. 4.0 ships with seven parameterized recipes:
swarm use add-tests --tool codex --param framework=vitest
swarm use add-auth --param strategy=jwt
swarm use security-audit
Each recipe is a JSON file with {{parameter}} placeholders. Custom recipes are one file in templates/recipes/. The knowledge base tracks recipe outcomes across runs so success rates and failure patterns accumulate over time.
Current state
1,112 tests passing, 1 pending. TypeScript strict mode. ISC license. Five phases of upgrades shipped in this release across the adapter layer, verification engine, repair pipeline, CI integration, and recipe system.
moonrunnerkc
/
swarm-orchestrator
Verification and governance layer for AI coding agents. Parallel orchestration with evidence-based quality gates for Copilot, Claude Code, and Codex.
Swarm Orchestrator
Verification and governance layer for AI coding agents. Parallel execution with evidence-based quality gates, not autonomous code generation.
This is not an autonomous system builder. It orchestrates external AI agents (Copilot, Claude Code, Codex) across isolated branches, verifies every step with outcome-based checks (git diff, build, test), and only merges work that proves itself. The value is trust in the output, not speed of generation.
Quick Start · What Is This · Usage · GitHub Action · Recipes · Architecture · Contributing
Quick Start
# Install globally
npm install -g swarm-orchestrator
# Or clone and build from source
git clone https://github.com/moonrunnerkc/swarm-orchestrator.git
cd swarm-orchestrator
npm install && npm run build && npm link
# Run against your project with any supported agent
swarm bootstrap ./your-repo "Add JWT auth and role-based access control"
# Use Claude Code instead of Copilot
swarm bootstrap ./your-repo "Add JWT auth"…
Top comments (2)
Hi Brad, thank you for the mention and the great article. I‘m Arne, one of the Emdash founders and wanted to state that Emdash is not „local-only“ - we allow you to connect to any remote server via SSH and our users love it.
Arne
this hits home. i’ve had agents swear they fixed a broken hook when the code literally didn’t change — it’s a massive vibe killer when the tool tells you one thing and the browser says another. focusing on the actual outcome instead of the agent’s chat log is the only way i can stay productive. vibe first, polish later, but you’ve got to verify the work actually happened.