Brad Kinnard
AI coding agents lie about their work. Outcome-based verification catches it.

AI coding agents have a consistency problem. Ask one to add authentication to your project and it'll tell you it's done. Commits made, tests passing, middleware wired up. Check the branch and you'll find a half-written JWT helper, no tests, and a build that doesn't compile.

This isn't a hallucination problem. The agent did produce code. It just didn't verify that any of it worked before declaring victory. And neither did the tools sitting between the agent and your main branch.

The transcript trust problem

Most orchestration tools that coordinate AI agents verify work by reading transcripts. The agent says "committed 3 files" or "all tests passing" and the verifier pattern-matches those strings as evidence of completion.

That's trusting the agent's self-report.

The issue isn't that agents are deliberately deceptive. It's that they generate completion language as part of their output pattern regardless of the actual state of the codebase. An agent will write "tests passing" into its response while the test suite has syntax errors. It'll claim files were created that only exist in the prompt's hypothetical, not on disk.

Transcript parsing catches the obvious failures: agent errored out, produced no output, didn't mention anything about the task. It misses the subtle ones: agent produced code that looks right, described it correctly, but the code doesn't compile, doesn't pass tests, or doesn't do what was asked.

Outcome-based verification

The alternative is checking what actually happened instead of what the agent said happened.

This is what Swarm Orchestrator 4.0 implements. After each agent step runs on its isolated git branch, the verifier executes a series of checks against the branch itself:

| Check | What it does | Fails when |
| --- | --- | --- |
| git_diff | Diffs the branch against the recorded base SHA | No file changes detected |
| build_exec | Runs the detected build command in the worktree | Non-zero exit code |
| test_exec | Runs the detected test command in the worktree | Non-zero exit code |
| file_existence | Checks that expected output files exist | Expected files missing |
| transcript | Parses agent output for completion evidence | (supplementary only) |
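A build or test execution check of this kind can be sketched in a few lines of TypeScript. This is an illustrative reconstruction, not the tool's actual API: `runExecCheck` and `CheckResult` are hypothetical names.

```typescript
import { spawnSync } from "node:child_process";

interface CheckResult {
  name: string;
  passed: boolean;
  required: boolean;
  evidence: string; // tail of command output, kept for the repair prompt
}

// Run a build or test command inside the step's git worktree and
// record the outcome. A non-zero exit code fails the check.
function runExecCheck(
  name: "build_exec" | "test_exec",
  command: string,
  args: string[],
  worktree: string
): CheckResult {
  const result = spawnSync(command, args, { cwd: worktree, encoding: "utf8" });
  const output = `${result.stdout ?? ""}${result.stderr ?? ""}`;
  return {
    name,
    passed: result.status === 0,
    required: true,
    // Keep only the last 20 lines as evidence for a later repair attempt.
    evidence: output.split("\n").slice(-20).join("\n"),
  };
}
```

The key property is that the verdict comes from the subprocess exit code, not from anything the agent wrote about its own work.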

Transcript analysis still runs. But when outcome checks are present, transcript-based checks get demoted to required: false. The build and test execution results gate the merge decision.
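The demotion logic described above can be sketched as follows. This is a hypothetical reconstruction of the gating rule, not the orchestrator's real types:

```typescript
interface Check {
  name: string;
  passed: boolean;
  required: boolean;
}

// If any outcome-based check ran, the transcript check no longer gates
// the merge: it is demoted to required: false and kept as evidence only.
function gateMerge(checks: Check[]): boolean {
  const outcomeNames = ["git_diff", "build_exec", "test_exec", "file_existence"];
  const hasOutcome = checks.some((c) => outcomeNames.includes(c.name));
  const effective = checks.map((c) =>
    hasOutcome && c.name === "transcript" ? { ...c, required: false } : c
  );
  // Merge only when every still-required check passed.
  return effective.filter((c) => c.required).every((c) => c.passed);
}
```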

Stack detection is automatic. The verifier reads package.json, Makefile, pyproject.toml, Cargo.toml, or whatever project configuration exists and runs the appropriate commands. No per-repo configuration.

What happens when verification fails

Blind retry is the default across most agent tooling. Step fails, same prompt runs again, up to some retry limit. The agent has no idea what went wrong.

Swarm Orchestrator's RepairAgent takes the structured output from the verification checks and feeds it back into the retry prompt. Which check failed, the last 20 lines of build or test output, which files were expected but aren't there. The failure gets classified (build failure, test failure, missing files, no changes) and the repair strategy adapts to the type.

On the final attempt the prompt includes an explicit priority shift: get something working over getting something complete.

The difference between "retry with context" and "blind retry" is measurable. An agent that knows the build failed on a missing import has a realistic path to fixing it. An agent re-running the same prompt that produced a broken build has roughly the same odds of producing another broken build.
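A repair prompt built from structured failure data might look like the sketch below. The field names and classification values are hypothetical, patterned on the failure types described above:

```typescript
type FailureKind = "build_failure" | "test_failure" | "missing_files" | "no_changes";

interface VerificationFailure {
  kind: FailureKind;
  failedCheck: string;
  evidenceTail: string; // last lines of build/test output
  missingFiles: string[];
}

// Turn a verification failure into a retry prompt that tells the
// agent exactly what broke, instead of re-running the same prompt.
function buildRepairPrompt(
  task: string,
  failure: VerificationFailure,
  finalAttempt: boolean
): string {
  const lines = [
    "The previous attempt at this task failed verification.",
    `Task: ${task}`,
    `Failed check: ${failure.failedCheck} (${failure.kind})`,
  ];
  if (failure.evidenceTail) lines.push(`Last output:\n${failure.evidenceTail}`);
  if (failure.missingFiles.length > 0)
    lines.push(`Expected but missing: ${failure.missingFiles.join(", ")}`);
  if (finalAttempt)
    lines.push("This is the final attempt: prioritize a working build over a complete feature.");
  return lines.join("\n\n");
}
```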

Agent-agnostic by design

4.0 drops the Copilot CLI dependency. The adapter layer supports Copilot CLI, Claude Code, and Codex out of the box. The interface is minimal:

```typescript
export interface AgentAdapter {
  name: string;
  spawn(opts: {
    prompt: string;
    workdir: string;
    model?: string;
    timeout?: number;
  }): Promise<AgentResult>;
}
```
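As a sketch of what satisfying that interface could look like, here is a generic factory that wraps any CLI agent as a subprocess. `AgentResult` isn't shown in the post, so a minimal shape is assumed, and the argument-building callback is illustrative since every agent CLI has its own invocation syntax:

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

// AgentResult is not shown in the post; this minimal shape is an assumption.
interface AgentResult {
  exitCode: number;
  output: string;
}

interface AgentAdapter {
  name: string;
  spawn(opts: {
    prompt: string;
    workdir: string;
    model?: string;
    timeout?: number;
  }): Promise<AgentResult>;
}

// Hypothetical factory: wrap a CLI agent as a subprocess. buildArgs maps
// the prompt (and optional model) to that CLI's arguments.
function makeCliAdapter(
  name: string,
  command: string,
  buildArgs: (prompt: string, model?: string) => string[]
): AgentAdapter {
  return {
    name,
    async spawn({ prompt, workdir, model, timeout }) {
      try {
        const { stdout } = await execFileAsync(command, buildArgs(prompt, model), {
          cwd: workdir,
          timeout: timeout ?? 600_000, // default 10-minute cap
        });
        return { exitCode: 0, output: stdout };
      } catch (err: any) {
        // Non-zero exit or timeout: surface the exit code and any output.
        return {
          exitCode: typeof err.code === "number" ? err.code : 1,
          output: `${err.stdout ?? ""}${err.stderr ?? ""}`,
        };
      }
    },
  };
}
```

Because the orchestrator only sees exit codes and output, nothing downstream cares which agent sat behind the adapter.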

Switching agents is one flag at the CLI level (--tool claude-code) or a per-step setting in a plan file. The orchestrator treats the agent as an interchangeable subprocess, and verification doesn't change based on which agent ran: the branch either builds or it doesn't.

This also means you can mix agents within a single plan. Use Claude Code for the architecture step, Codex for the boilerplate, Copilot for the tests. Each step gets verified the same way regardless of which agent produced it.
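A mixed-agent plan might look like this. The field names are illustrative, not the tool's actual plan schema:

```json
{
  "goal": "Add JWT auth and role-based access control",
  "steps": [
    { "id": "design", "tool": "claude-code", "prompt": "Design the auth architecture." },
    { "id": "implement", "tool": "codex", "prompt": "Implement the JWT middleware." },
    { "id": "tests", "tool": "copilot", "prompt": "Write unit tests for the middleware." }
  ]
}
```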

CI integration

The tool ships as a GitHub Action:

```yaml
- uses: moonrunnerkc/swarm-orchestrator@swarm-orchestrator
  with:
    goal: "Add unit tests for all untested modules"
    tool: claude-code
    pr: review
```

The Action outputs a JSON result with per-step verification status. You can gate downstream jobs on the verification outcome the same way you'd gate on any other CI check.
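Gating a downstream job could look like the workflow fragment below. The output name `verified` is a hypothetical stand-in; check the Action's documentation for the real output names:

```yaml
jobs:
  swarm:
    runs-on: ubuntu-latest
    outputs:
      verified: ${{ steps.swarm.outputs.verified }}  # hypothetical output name
    steps:
      - uses: actions/checkout@v4
      - id: swarm
        uses: moonrunnerkc/swarm-orchestrator@swarm-orchestrator
        with:
          goal: "Add unit tests for all untested modules"
          tool: claude-code
          pr: review

  deploy:
    needs: swarm
    if: needs.swarm.outputs.verified == 'true'
    runs-on: ubuntu-latest
    steps:
      - run: echo "All steps verified"
```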

Most orchestrators in this space are desktop-first or local-development tools. Even those that support remote execution do not run natively in CI with outcome-verified results. That's the gap Swarm fills.

Recipes for repeatable tasks

Generating a plan from scratch for "add tests to this project" is wasteful when the plan structure is the same every time. 4.0 ships with seven parameterized recipes:

```shell
swarm use add-tests --tool codex --param framework=vitest
swarm use add-auth --param strategy=jwt
swarm use security-audit
```

Each recipe is a JSON file with {{parameter}} placeholders. Custom recipes are one file in templates/recipes/. The knowledge base tracks recipe outcomes across runs so success rates and failure patterns accumulate over time.
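A recipe file might look like the fragment below. The structure is a guess for illustration; only the {{parameter}} placeholder convention comes from the post:

```json
{
  "name": "add-tests",
  "params": ["framework"],
  "steps": [
    {
      "id": "write-tests",
      "prompt": "Add {{framework}} unit tests for all untested modules.",
      "verify": ["test_exec", "git_diff"]
    }
  ]
}
```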

Current state

1,112 tests passing, 1 pending. TypeScript strict mode. ISC license. Five phases of upgrades shipped in this release across the adapter layer, verification engine, repair pipeline, CI integration, and recipe system.

GitHub: moonrunnerkc/swarm-orchestrator

Verification and governance layer for AI coding agents. Parallel execution with evidence-based quality gates for Copilot, Claude Code, and Codex, not autonomous code generation.

This is not an autonomous system builder. It orchestrates external AI agents (Copilot, Claude Code, Codex) across isolated branches, verifies every step with outcome-based checks (git diff, build, test), and only merges work that proves itself. The value is trust in the output, not speed of generation.

License: ISC · Tests: 1159 passing · Node.js 20+ · TypeScript 5.x




[Screenshot: Swarm Orchestrator TUI dashboard showing parallel agent execution across waves]


Quick Start

```shell
# Install globally
npm install -g swarm-orchestrator

# Or clone and build from source
git clone https://github.com/moonrunnerkc/swarm-orchestrator.git
cd swarm-orchestrator
npm install && npm run build && npm link

# Run against your project with any supported agent
swarm bootstrap ./your-repo "Add JWT auth and role-based access control"
```

Top comments (7)

Kalpaka

The priority shift on the final attempt -- "get something working over something complete" -- is the detail that sticks. It mirrors what experienced developers do under time pressure: reduce scope to protect correctness. Most retry logic just throws the same prompt again.

One thing I'd push on: does the verification layer track whether the RepairAgent "fixes" a build by quietly dropping the feature? I've seen agents pass all checks by producing a no-op implementation. Technically green, semantically empty. Outcome-based verification handles "did it break?" well, but "did it actually do the thing?" is still mostly an open problem.

Brad Kinnard

Correct on the semantic completeness problem. The closest I have right now is verifyExpectedOutputs, which does term pattern matching against the diff. It could definitely be bypassed by an agent producing a technically present but behaviorally empty implementation.

In the next release I'm thinking about adding a semantic verification pass: a small, separate agent that reviews the final diff against the original goal and flags it if it looks like a no-op or a scope reduction. Adding it shouldn't be hard at all.

Jill Mercer

this hits home. i’ve had agents swear they fixed a broken hook when the code literally didn’t change — it’s a massive vibe killer when the tool tells you one thing and the browser says another. focusing on the actual outcome instead of the agent’s chat log is the only way i can stay productive. vibe first, polish later, but you’ve got to verify the work actually happened.

Brad Kinnard

Hi Jill, appreciate your comment! That's exactly why Swarm skips the chat log and only moves forward on real git diffs, builds, and test results. Thanks for chiming in!

Jill Mercer

that's the right call — git diffs don't lie. i've started doing the same thing manually, just checking what actually changed instead of trusting the agent's summary. if swarm automates that verification layer, that's worth watching. have you listed it on stackapps.app? fits right in with the indie dev tool crowd there.

Arne Elmar Strickmann

Hi Brad, thank you for the mention and the great article. I'm Arne, one of the Emdash founders, and wanted to note that Emdash is not "local-only": we allow you to connect to any remote server via SSH, and our users love it.
Arne

Brad Kinnard

Good to meet you, Arne, and thanks for the clarification. I've edited my post to better reflect the current state of things. I also took out the direct references to similar tools: naming them specifically wasn't intended, we address two different areas, and it's not right to call anyone out like that.

Swarm Orchestrator is a CLI-first verification and governance layer with no dashboard or UI at all. It's built purely for native CI execution and outcome-verified gates inside pipelines, not for interactive desktop workflows. Different tools for different needs. Appreciate the comment and hope my edits are sufficient!