Copilot and Claude Code both ship with verification features now. Copilot's Agent mode runs terminal commands, detects build failures, and iterates fixes. Claude Code plans changes across files and can run your test suite after modifications. Both have improved significantly since 2025.
They're still not catching everything.
Developers consistently report agents declaring tasks complete while skipping accessibility attributes, test isolation, config externalization, dark mode, responsive layout, and meta tags. The agent runs the build, sees green, and moves on. But "build passes" and "the output is production-ready" are different bars. The reprompt cycle for quality attributes the agent never attempted in the first place is still a significant time sink on any non-trivial project.
That gap is where Swarm Orchestrator sits. Not replacing the agent's self-verification, but adding the checks it doesn't run.
## What It Does
You give it a goal. It builds a dependency-aware plan, assigns steps to specialized agents, and launches them in parallel on isolated git branches. Each step runs through outcome-based verification (build, test, diff, expected files) and eight quality gates covering scaffold leftovers, duplicate code, hardcoded config, README accuracy, test isolation, test coverage, accessibility, and runtime correctness.
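The dependency-aware planning step can be pictured as wave scheduling: every step whose dependencies are already done runs in parallel as one wave. The sketch below is illustrative only and assumes a simple `Step` shape, not the orchestrator's actual data model.

```typescript
// Hypothetical sketch: group steps into parallel waves by dependency.
interface Step {
  id: string;
  dependsOn: string[]; // ids of steps that must finish first
}

function planWaves(steps: Step[]): string[][] {
  const waves: string[][] = [];
  const done = new Set<string>();
  let remaining = [...steps];
  while (remaining.length > 0) {
    // A step is ready when every dependency has completed.
    const ready = remaining.filter((s) => s.dependsOn.every((d) => done.has(d)));
    if (ready.length === 0) throw new Error("Dependency cycle detected");
    waves.push(ready.map((s) => s.id));
    ready.forEach((s) => done.add(s.id));
    remaining = remaining.filter((s) => !done.has(s.id));
  }
  return waves;
}
```

Each wave maps naturally onto a batch of isolated git branches: independent steps land in the same wave and run concurrently.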
Before the agent runs, the orchestrator injects acceptance criteria based on the project type. For web apps, that's 16 requirements: semantic HTML, responsive layout, dark mode via CSS custom properties, prefers-reduced-motion, image alt attributes, heading hierarchy, ARIA labels, focus-visible styles, and more. For everything else, 6 baseline criteria covering error handling, documentation, input validation, logging, and test coverage.
The agent sees these as hard requirements. After execution, the quality gates check whether they were met. The agent's own verification handles "does it compile and do tests pass." The orchestrator handles "did it actually do what was asked, completely."
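Criteria injection can be sketched as prepending hard requirements to the agent's prompt. The type names, criterion ids, and prompt wording below are assumptions for illustration, showing only a subset of the 16 web-app and 6 baseline criteria described above.

```typescript
// Hypothetical sketch of acceptance-criteria injection by project type.
type ProjectType = "web-app" | "other";

interface Criterion {
  id: string;
  description: string;
}

// Illustrative subset of the web-app criteria named in the article.
const WEB_APP_CRITERIA: Criterion[] = [
  { id: "alt-text", description: "All images have alt attributes" },
  { id: "dark-mode", description: "Dark mode via CSS custom properties" },
  { id: "reduced-motion", description: "Respects prefers-reduced-motion" },
];

// Illustrative subset of the baseline criteria.
const BASELINE_CRITERIA: Criterion[] = [
  { id: "error-handling", description: "Errors are caught and surfaced" },
  { id: "input-validation", description: "External input is validated" },
];

function injectCriteria(projectType: ProjectType, goal: string): string {
  const criteria =
    projectType === "web-app" ? WEB_APP_CRITERIA : BASELINE_CRITERIA;
  const requirements = criteria
    .map((c) => `- [${c.id}] ${c.description}`)
    .join("\n");
  // The quality gates later check each id against the actual output.
  return `${goal}\n\nHard requirements (verified after execution):\n${requirements}`;
}
```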
Head-to-head runs against standalone Copilot CLI, Claude Code, and Codex on the same goals showed a consistent pattern: quality attributes the agent never attempted were absent from unassisted output. These aren't build failures the agent would catch on its own. They're requirements like skip-to-content links, prefers-reduced-motion media queries, CSS custom properties on :root, dual theme-color meta tags, module separation between logic and presentation, and zero-dependency test runners. The orchestrator caught and enforced all of them in a single pass. Unassisted, each is at least one follow-up prompt; several take 2-3 rounds.
Steps that fail don't get blindly retried. The orchestrator classifies the failure (build, test, missing artifact, dependency, timeout) and sends the agent back with the actual error output and context. This works alongside the agent's own retry capabilities, not instead of them.
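Failure classification might look something like the sketch below. The category names mirror the ones listed above, but the `StepResult` shape, the detection heuristics, and the repair-prompt wording are all assumptions, not the orchestrator's implementation.

```typescript
// Hypothetical failure classifier; heuristics are illustrative.
type FailureKind =
  | "build"
  | "test"
  | "missing-artifact"
  | "dependency"
  | "timeout"
  | "unknown";

interface StepResult {
  exitCode: number;
  stderr: string;
  timedOut: boolean;
  expectedFiles: string[];
  presentFiles: string[];
}

function classifyFailure(r: StepResult): FailureKind {
  if (r.timedOut) return "timeout";
  const missing = r.expectedFiles.filter((f) => !r.presentFiles.includes(f));
  if (missing.length > 0) return "missing-artifact";
  if (/cannot find module|ERESOLVE/i.test(r.stderr)) return "dependency";
  if (/\d+ (failing|failed)/i.test(r.stderr)) return "test";
  if (r.exitCode !== 0) return "build";
  return "unknown";
}

// The repair prompt carries the actual error output back to the agent,
// rather than blindly re-running the original instruction.
function repairPrompt(kind: FailureKind, stderr: string): string {
  return `Previous attempt failed (${kind}). Fix the issue below, then retry:\n${stderr}`;
}
```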
## What's New in v4.2.0
Three additions.
### Multi-Tool Adapters
The --tool flag existed in previous versions. It parsed from the CLI, reached the options object, and then did nothing. The orchestrator always spawned Copilot CLI internally regardless of what you passed.
That's fixed. resolveAdapter() now routes through real adapter implementations with a shared process supervisor.
```shell
swarm run --goal "Add auth" --tool copilot             # default, unchanged behavior
swarm run --goal "Add auth" --tool claude-code
swarm run --goal "Add auth" --tool claude-code-teams --team-size 3
```
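A resolver like the one sketched below could route each `--tool` value to an adapter behind one interface. The `AgentAdapter` shape and the argv values are placeholders for illustration, not the tools' real command lines or the project's actual `resolveAdapter()`.

```typescript
// Hypothetical adapter resolution; argv values are placeholders.
interface AgentAdapter {
  name: string;
  argv(goal: string): string[]; // command to spawn under the shared supervisor
}

function resolveAdapter(tool: string, teamSize = 1): AgentAdapter {
  switch (tool) {
    case "copilot":
      return { name: "copilot", argv: (g) => ["copilot-cli", g] };
    case "claude-code":
      return { name: "claude-code", argv: (g) => ["claude-cli", g] };
    case "claude-code-teams":
      return {
        name: "claude-code-teams",
        // Team mode wraps the goal for a per-wave team lead.
        argv: (g) => ["claude-cli", `Lead a team of ${teamSize} agents: ${g}`],
      };
    default:
      throw new Error(`Unknown tool: ${tool}`);
  }
}
```

The key design point is that every branch returns the same interface, so the scheduler and supervisor never care which tool is underneath.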
Agent Teams mode spawns a team lead per wave for native multi-agent coordination. If the team lead fails, it falls back to per-step sequential execution automatically.
Every adapter shares the same process supervisor: 5-minute stall timeout, 10-second heartbeat checking stdout activity, SIGTERM on stall, SIGKILL after 5-second grace. Previously only the Copilot path had stall detection. A hung claude process would block your entire run indefinitely. That's gone.
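The stall supervisor might be structured like this sketch: a pure stall check (split out so the policy is testable without timers) plus timer wiring around a Node child process. Function names are assumptions; the defaults match the 5-minute stall / 10-second heartbeat / 5-second grace values stated above.

```typescript
import type { ChildProcess } from "node:child_process";

// Pure policy: has the process gone quiet for longer than stallMs?
function isStalled(lastActivityMs: number, nowMs: number, stallMs: number): boolean {
  return nowMs - lastActivityMs > stallMs;
}

// Illustrative wiring: heartbeat polls stdout activity, SIGTERM on
// stall, SIGKILL after a grace period if the process is still alive.
function supervise(
  child: ChildProcess,
  stallMs = 5 * 60_000,
  heartbeatMs = 10_000,
  graceMs = 5_000,
): () => void {
  let lastActivity = Date.now();
  child.stdout?.on("data", () => { lastActivity = Date.now(); });

  const timer = setInterval(() => {
    if (isStalled(lastActivity, Date.now(), stallMs)) {
      clearInterval(timer);
      child.kill("SIGTERM"); // polite stop first
      setTimeout(() => {
        if (child.exitCode === null) child.kill("SIGKILL"); // force after grace
      }, graceMs).unref();
    }
  }, heartbeatMs);

  return () => clearInterval(timer); // call when the process exits normally
}
```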
### OWASP ASI Compliance Mapping
The orchestrator already enforced branch isolation (ASI-03: Excessive Agency), outcome-based verification (ASI-05: Improper Output Handling), and failure-classified repair. Those behaviors map directly to risks in the OWASP Top 10 for Agentic Applications.
--owasp-report formalizes that mapping. After every run, it generates a per-risk assessment with evidence pulled from actual execution metadata.
```shell
swarm run --goal "Build REST API" --governance --owasp-report
```
Six of the ten ASI risks are assessed. The other four are marked not-applicable with explicit rationale (the orchestrator doesn't store user data, doesn't communicate across networks, doesn't train models). If a risk doesn't apply, the report says so and explains why.
Which ASI risks are assessed?

| ASI Risk | Assessed | Rationale |
|---|---|---|
| ASI-01: Prompt Injection | Yes | Agent prompts controlled by orchestrator, user goals parameterized into plan steps |
| ASI-02: Insecure Tool Use | Yes | Tool invocations verified against transcript evidence |
| ASI-03: Excessive Agency | Yes | Scope enforcement via isolated worktrees and boundary declarations |
| ASI-04: Unreliable Execution | Yes | Failure classification, targeted repair, retry with error context |
| ASI-05: Improper Output Handling | Yes | Build/test/diff verification independent of agent self-reporting |
| ASI-10: Uncontrolled Autonomy | Yes | Governance mode with Critic scoring, human-in-the-loop approval |
| ASI-06, 07, 08, 09 | N/A | No model training, no data storage, no cross-network communication, no supply chain |
### Structured Run Reports
Every run already produced artifacts: session state, metrics, cost attribution, per-step verification reports, and now OWASP compliance. Pulling a coherent picture from those files meant opening each one individually.
```shell
swarm report runs/my-run-id                 # generate from any completed run
swarm report --latest --stdout              # most recent run, print to terminal
swarm report runs/my-run-id --format json   # JSON only
```
One command. Markdown and JSON. Missing sections (cost data, OWASP) are handled gracefully and just don't appear in the output.
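The graceful handling could be as simple as mapping each artifact file to a report section and skipping sections whose artifact never existed. The reader callback, section titles, and file names below are illustrative assumptions, not the orchestrator's actual layout.

```typescript
// Hypothetical report assembly: missing artifacts simply drop sections.
// `read` returns the artifact's contents, or null if it doesn't exist.
function buildReport(read: (file: string) => string | null): string {
  const sections: Array<[title: string, file: string]> = [
    ["Session", "session.json"],
    ["Metrics", "metrics.json"],
    ["Cost Attribution", "cost.json"],
    ["OWASP Compliance", "owasp-report.json"],
  ];
  const parts: string[] = [];
  for (const [title, file] of sections) {
    const body = read(file);
    if (body === null) continue; // missing sections just don't appear
    parts.push(`## ${title}\n\n${body}`);
  }
  return parts.join("\n\n");
}
```

Injecting the reader keeps the assembly logic pure, so the same function can back both the Markdown and JSON output paths.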
## Where This Sits
The agents have gotten better at self-verification. That's a good thing. The orchestrator isn't competing with that. It's adding a layer the agents don't cover: acceptance criteria enforcement, quality gates for attributes agents don't check on their own, independent verification that doesn't rely on the agent's self-reporting, and an auditable trail of everything that happened.
| | Standalone Agent (2026) | With Orchestrator |
|---|---|---|
| Build/test verification | Built-in (Copilot Agent, Claude Code) | Independent check on isolated branch |
| Quality attributes | Whatever you prompt for | 16 web-app / 6 baseline criteria injected and verified |
| Failure handling | Agent retries with some context | Classified failure, targeted repair prompt with error output |
| Audit trail | Chat history, some checkpoints | Transcripts, verification reports, cost attribution, OWASP compliance |
| Merge safety | Agent says it's done | Proof required across verification + 8 quality gates |
GitHub: moonrunnerkc/swarm-orchestrator
TypeScript. ISC license. Requires Node 20+ and Git.