Brad Kinnard
AI coding agents lie about their work. Outcome-based verification catches it.

AI coding agents have a consistency problem. Ask one to add authentication to your project and it'll tell you it's done. Commits made, tests passing, middleware wired up. Check the branch and you'll find a half-written JWT helper, no tests, and a build that doesn't compile.

This isn't a hallucination problem. The agent did produce code. It just didn't verify that any of it worked before declaring victory. And neither did the tools sitting between the agent and your main branch.

The transcript trust problem

Most orchestration tools that coordinate AI agents verify work by reading transcripts. The agent says "committed 3 files" or "all tests passing" and the verifier pattern-matches those strings as evidence of completion.

That's trusting the agent's self-report.

The issue isn't that agents are deliberately deceptive. It's that they generate completion language as part of their output pattern regardless of the actual state of the codebase. An agent will write "tests passing" into its response while the test suite has syntax errors. It'll claim files were created that only exist in the prompt's hypothetical, not on disk.

Transcript parsing catches the obvious failures: agent errored out, produced no output, didn't mention anything about the task. It misses the subtle ones: agent produced code that looks right, described it correctly, but the code doesn't compile, doesn't pass tests, or doesn't do what was asked.

Outcome-based verification

The alternative is checking what actually happened instead of what the agent said happened.

This is what Swarm Orchestrator 4.0 implements. After each agent step runs on its isolated git branch, the verifier executes a series of checks against the branch itself:

| Check | What it does | Fails when |
| --- | --- | --- |
| git_diff | Diffs the branch against the recorded base SHA | No file changes detected |
| build_exec | Runs the detected build command in the worktree | Non-zero exit code |
| test_exec | Runs the detected test command in the worktree | Non-zero exit code |
| file_existence | Checks that expected output files exist | Expected files missing |
| transcript | Parses agent output for completion evidence | (supplementary only) |
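A build or test execution check of this kind can be sketched in a few lines of TypeScript. This is an illustrative reconstruction, not the tool's actual API: `runExecCheck` and `CheckResult` are hypothetical names.

```typescript
import { spawnSync } from "node:child_process";

interface CheckResult {
  name: string;
  passed: boolean;
  required: boolean;
  evidence: string; // tail of command output, kept for the repair prompt
}

// Run a build or test command inside the step's git worktree and
// record the outcome. A non-zero exit code fails the check.
function runExecCheck(
  name: "build_exec" | "test_exec",
  command: string,
  args: string[],
  worktree: string
): CheckResult {
  const result = spawnSync(command, args, { cwd: worktree, encoding: "utf8" });
  const output = `${result.stdout ?? ""}${result.stderr ?? ""}`;
  return {
    name,
    passed: result.status === 0,
    required: true,
    // Keep only the last 20 lines as evidence for a later repair attempt.
    evidence: output.split("\n").slice(-20).join("\n"),
  };
}
```

The key property is that the verdict comes from the subprocess exit code, not from anything the agent wrote about its own work.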

Transcript analysis still runs. But when outcome checks are present, transcript-based checks get demoted to required: false. The build and test execution results gate the merge decision.
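The demotion logic described above can be sketched as follows. This is a hypothetical reconstruction of the gating rule, not the orchestrator's real types:

```typescript
interface Check {
  name: string;
  passed: boolean;
  required: boolean;
}

// If any outcome-based check ran, the transcript check no longer gates
// the merge: it is demoted to required: false and kept as evidence only.
function gateMerge(checks: Check[]): boolean {
  const outcomeNames = ["git_diff", "build_exec", "test_exec", "file_existence"];
  const hasOutcome = checks.some((c) => outcomeNames.includes(c.name));
  const effective = checks.map((c) =>
    hasOutcome && c.name === "transcript" ? { ...c, required: false } : c
  );
  // Merge only when every still-required check passed.
  return effective.filter((c) => c.required).every((c) => c.passed);
}
```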

Stack detection is automatic. The verifier reads package.json, Makefile, pyproject.toml, Cargo.toml, or whatever project configuration exists and runs the appropriate commands. No per-repo configuration.

What happens when verification fails

Blind retry is the default across most agent tooling. Step fails, same prompt runs again, up to some retry limit. The agent has no idea what went wrong.

Swarm Orchestrator's RepairAgent takes the structured output from the verification checks and feeds it back into the retry prompt. Which check failed, the last 20 lines of build or test output, which files were expected but aren't there. The failure gets classified (build failure, test failure, missing files, no changes) and the repair strategy adapts to the type.

On the final attempt the prompt includes an explicit priority shift: get something working over getting something complete.

The difference between "retry with context" and "blind retry" is measurable. An agent that knows the build failed on a missing import has a realistic path to fixing it. An agent re-running the same prompt that produced a broken build has roughly the same odds of producing another broken build.
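A repair prompt built from structured failure data might look like the sketch below. The field names and classification values are hypothetical, patterned on the failure types described above:

```typescript
type FailureKind = "build_failure" | "test_failure" | "missing_files" | "no_changes";

interface VerificationFailure {
  kind: FailureKind;
  failedCheck: string;
  evidenceTail: string; // last lines of build/test output
  missingFiles: string[];
}

// Turn a verification failure into a retry prompt that tells the
// agent exactly what broke, instead of re-running the same prompt.
function buildRepairPrompt(
  task: string,
  failure: VerificationFailure,
  finalAttempt: boolean
): string {
  const lines = [
    "The previous attempt at this task failed verification.",
    `Task: ${task}`,
    `Failed check: ${failure.failedCheck} (${failure.kind})`,
  ];
  if (failure.evidenceTail) lines.push(`Last output:\n${failure.evidenceTail}`);
  if (failure.missingFiles.length > 0)
    lines.push(`Expected but missing: ${failure.missingFiles.join(", ")}`);
  if (finalAttempt)
    lines.push("This is the final attempt: prioritize a working build over a complete feature.");
  return lines.join("\n\n");
}
```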

Agent-agnostic by design

4.0 drops the Copilot CLI dependency. The adapter layer supports Copilot CLI, Claude Code, and Codex out of the box. The interface is minimal:

```typescript
export interface AgentAdapter {
  name: string;
  spawn(opts: {
    prompt: string;
    workdir: string;
    model?: string;
    timeout?: number;
  }): Promise<AgentResult>;
}
```
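As a sketch of what satisfying that interface could look like, here is a generic factory that wraps any CLI agent as a subprocess. `AgentResult` isn't shown in the post, so a minimal shape is assumed, and the argument-building callback is illustrative since every agent CLI has its own invocation syntax:

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// AgentResult is not shown in the post; this minimal shape is an assumption.
interface AgentResult {
  exitCode: number;
  output: string;
}

interface AgentAdapter {
  name: string;
  spawn(opts: {
    prompt: string;
    workdir: string;
    model?: string;
    timeout?: number;
  }): Promise<AgentResult>;
}

// Hypothetical factory: wrap a CLI agent as a subprocess. buildArgs maps
// the prompt (and optional model) to that CLI's arguments.
function makeCliAdapter(
  name: string,
  command: string,
  buildArgs: (prompt: string, model?: string) => string[]
): AgentAdapter {
  return {
    name,
    async spawn({ prompt, workdir, model, timeout }) {
      try {
        const { stdout } = await execFileAsync(command, buildArgs(prompt, model), {
          cwd: workdir,
          timeout: timeout ?? 600_000, // default 10-minute cap
        });
        return { exitCode: 0, output: stdout };
      } catch (err: any) {
        // Non-zero exit or timeout: surface the exit code and any output.
        return {
          exitCode: typeof err.code === "number" ? err.code : 1,
          output: `${err.stdout ?? ""}${err.stderr ?? ""}`,
        };
      }
    },
  };
}
```

Because the orchestrator only sees exit codes and output, nothing downstream cares which agent sat behind the adapter.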

Switching agents is one flag at the CLI level (--tool claude-code) or a per-step setting in a plan file. The orchestrator treats the agent as an interchangeable subprocess, and verification doesn't change based on which agent ran: the branch either builds or it doesn't.

This also means you can mix agents within a single plan. Use Claude Code for the architecture step, Codex for the boilerplate, Copilot for the tests. Each step gets verified the same way regardless of which agent produced it.
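A mixed-agent plan might look like this. The field names are illustrative, not the tool's actual plan schema:

```json
{
  "goal": "Add JWT auth and role-based access control",
  "steps": [
    { "id": "design", "tool": "claude-code", "prompt": "Design the auth architecture." },
    { "id": "implement", "tool": "codex", "prompt": "Implement the JWT middleware." },
    { "id": "tests", "tool": "copilot", "prompt": "Write unit tests for the middleware." }
  ]
}
```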

CI integration

The tool ships as a GitHub Action:

```yaml
- uses: moonrunnerkc/swarm-orchestrator@swarm-orchestrator
  with:
    goal: "Add unit tests for all untested modules"
    tool: claude-code
    pr: review
```

The Action outputs a JSON result with per-step verification status. You can gate downstream jobs on the verification outcome the same way you'd gate on any other CI check.
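Gating a downstream job could look like the workflow fragment below. The output name `verified` is a hypothetical stand-in; check the Action's documentation for the real output names:

```yaml
jobs:
  swarm:
    runs-on: ubuntu-latest
    outputs:
      verified: ${{ steps.swarm.outputs.verified }}  # hypothetical output name
    steps:
      - uses: actions/checkout@v4
      - id: swarm
        uses: moonrunnerkc/swarm-orchestrator@swarm-orchestrator
        with:
          goal: "Add unit tests for all untested modules"
          tool: claude-code
          pr: review

  deploy:
    needs: swarm
    if: needs.swarm.outputs.verified == 'true'
    runs-on: ubuntu-latest
    steps:
      - run: echo "All steps verified"
```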

Most orchestrators in this space are desktop-first or local-development tools. Even those that support remote execution do not run natively in CI with outcome-verified results. That's the gap Swarm fills.

Recipes for repeatable tasks

Generating a plan from scratch for "add tests to this project" is wasteful when the plan structure is the same every time. 4.0 ships with seven parameterized recipes:

```shell
swarm use add-tests --tool codex --param framework=vitest
swarm use add-auth --param strategy=jwt
swarm use security-audit
```

Each recipe is a JSON file with {{parameter}} placeholders. Custom recipes are one file in templates/recipes/. The knowledge base tracks recipe outcomes across runs so success rates and failure patterns accumulate over time.
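A recipe file might look like the fragment below. The structure is a guess for illustration; only the {{parameter}} placeholder convention comes from the post:

```json
{
  "name": "add-tests",
  "params": ["framework"],
  "steps": [
    {
      "id": "write-tests",
      "prompt": "Add {{framework}} unit tests for all untested modules.",
      "verify": ["test_exec", "git_diff"]
    }
  ]
}
```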

Current state

1,112 tests passing, 1 pending. TypeScript strict mode. ISC license. Five phases of upgrades shipped in this release across the adapter layer, verification engine, repair pipeline, CI integration, and recipe system.

GitHub: moonrunnerkc/swarm-orchestrator

Verification and governance layer for AI coding agents. Parallel execution with evidence-based quality gates for Copilot, Claude Code, and Codex, not autonomous code generation.

This is not an autonomous system builder. It orchestrates external AI agents (Copilot, Claude Code, Codex) across isolated branches, verifies every step with outcome-based checks (git diff, build, test), and only merges work that proves itself. The value is trust in the output, not speed of generation.

License: ISC · Tests: 1159 passing · Node.js 20+ · TypeScript 5.x




[Screenshot: Swarm Orchestrator TUI dashboard showing parallel agent execution across waves]


Quick Start

```shell
# Install globally
npm install -g swarm-orchestrator

# Or clone and build from source
git clone https://github.com/moonrunnerkc/swarm-orchestrator.git
cd swarm-orchestrator
npm install && npm run build && npm link

# Run against your project with any supported agent
swarm bootstrap ./your-repo "Add JWT auth and role-based access control"
```

Top comments (7)

Kalpaka

The priority shift on the final attempt -- "get something working over something complete" -- is the detail that sticks. It mirrors what experienced developers do under time pressure: reduce scope to protect correctness. Most retry logic just throws the same prompt again.

One thing I'd push on: does the verification layer track whether the RepairAgent "fixes" a build by quietly dropping the feature? I've seen agents pass all checks by producing a no-op implementation. Technically green, semantically empty. Outcome-based verification handles "did it break?" well, but "did it actually do the thing?" is still mostly an open problem.

Brad Kinnard

Correct on the semantic completeness problem. The closest I have right now is verifyExpectedOutputs, which does term pattern matching against the diff. It could definitely be bypassed by an agent producing a technically present but behaviorally empty implementation.

In the next release I'm thinking about adding a semantic verification pass: a small, separate agent that reviews the final diff against the original goal and flags it if it looks like a no-op or a scope reduction. Adding it shouldn't be hard at all.

Jill Mercer

this hits home. i’ve had agents swear they fixed a broken hook when the code literally didn’t change — it’s a massive vibe killer when the tool tells you one thing and the browser says another. focusing on the actual outcome instead of the agent’s chat log is the only way i can stay productive. vibe first, polish later, but you’ve got to verify the work actually happened.

Brad Kinnard

Hi Jill, appreciate your comment! That's exactly why Swarm skips the chat log and only moves forward on real git diffs, builds, and test results. Thanks for chiming in!

Jill Mercer

that's the right call — git diffs don't lie. i've started doing the same thing manually, just checking what actually changed instead of trusting the agent's summary. if swarm automates that verification layer, that's worth watching. have you listed it on stackapps.app? fits right in with the indie dev tool crowd there.

Arne Elmar Strickmann

Hi Brad, thank you for the mention and the great article. I'm Arne, one of the Emdash founders, and wanted to note that Emdash is not "local-only": we allow you to connect to any remote server via SSH, and our users love it.
Arne

Brad Kinnard

Good to meet you, Arne, and thanks for the clarification. I've edited my post to better reflect the current state of things. I also took out the direct references to similar tools: naming them specifically wasn't intended, we address two different areas, and it's not right to call anyone out like that.

Swarm Orchestrator is a CLI-first verification and governance layer with no dashboard or UI at all. It's built purely for native CI execution and outcome-verified gates inside pipelines, not for interactive desktop workflows. Different tools for different needs. Appreciate the comment and hope my edits are sufficient!