Copilot and Claude Code both ship with verification features now. Copilot's Agent mode runs terminal commands, detects build failures, and iterates fixes. Claude Code plans changes across files and can run your test suite after modifications. Both have improved significantly since 2025.
They're still not catching everything.
Developers consistently report agents declaring tasks complete while skipping accessibility attributes, test isolation, config externalization, dark mode, responsive layout, and meta tags. The agent runs the build, sees green, and moves on. But "build passes" and "the output is production-ready" are different bars. The reprompt cycle for quality attributes the agent never attempted in the first place is still a significant time sink on any non-trivial project.
That gap is where Swarm Orchestrator sits. Not replacing the agent's self-verification, but adding the checks it doesn't run.
## What It Does
You give it a goal. It builds a dependency-aware plan, assigns steps to specialized agents, and launches them in parallel on isolated git branches. Each step runs through outcome-based verification (build, test, diff, expected files) and eight quality gates covering scaffold leftovers, duplicate code, hardcoded config, README accuracy, test isolation, test coverage, accessibility, and runtime correctness.
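The dependency-aware planning step can be pictured as wave scheduling: every step whose dependencies are already done runs in parallel as one wave. The sketch below is illustrative only and assumes a simple `Step` shape, not the orchestrator's actual data model.

```typescript
// Hypothetical sketch: group steps into parallel waves by dependency.
interface Step {
  id: string;
  dependsOn: string[]; // ids of steps that must finish first
}

function planWaves(steps: Step[]): string[][] {
  const waves: string[][] = [];
  const done = new Set<string>();
  let remaining = [...steps];
  while (remaining.length > 0) {
    // A step is ready when every dependency has completed.
    const ready = remaining.filter((s) => s.dependsOn.every((d) => done.has(d)));
    if (ready.length === 0) throw new Error("Dependency cycle detected");
    waves.push(ready.map((s) => s.id));
    ready.forEach((s) => done.add(s.id));
    remaining = remaining.filter((s) => !done.has(s.id));
  }
  return waves;
}
```

Each wave maps naturally onto a batch of isolated git branches: independent steps land in the same wave and run concurrently.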
Before the agent runs, the orchestrator injects acceptance criteria based on the project type. For web apps, that's 16 requirements: semantic HTML, responsive layout, dark mode via CSS custom properties, prefers-reduced-motion, image alt attributes, heading hierarchy, ARIA labels, focus-visible styles, and more. For everything else, 6 baseline criteria covering error handling, documentation, input validation, logging, and test coverage.
The agent sees these as hard requirements. After execution, the quality gates check whether they were met. The agent's own verification handles "does it compile and do tests pass." The orchestrator handles "did it actually do what was asked, completely."
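Criteria injection can be sketched as prepending hard requirements to the agent's prompt. The type names, criterion ids, and prompt wording below are assumptions for illustration, showing only a subset of the 16 web-app and 6 baseline criteria described above.

```typescript
// Hypothetical sketch of acceptance-criteria injection by project type.
type ProjectType = "web-app" | "other";

interface Criterion {
  id: string;
  description: string;
}

// Illustrative subset of the web-app criteria named in the article.
const WEB_APP_CRITERIA: Criterion[] = [
  { id: "alt-text", description: "All images have alt attributes" },
  { id: "dark-mode", description: "Dark mode via CSS custom properties" },
  { id: "reduced-motion", description: "Respects prefers-reduced-motion" },
];

// Illustrative subset of the baseline criteria.
const BASELINE_CRITERIA: Criterion[] = [
  { id: "error-handling", description: "Errors are caught and surfaced" },
  { id: "input-validation", description: "External input is validated" },
];

function injectCriteria(projectType: ProjectType, goal: string): string {
  const criteria =
    projectType === "web-app" ? WEB_APP_CRITERIA : BASELINE_CRITERIA;
  const requirements = criteria
    .map((c) => `- [${c.id}] ${c.description}`)
    .join("\n");
  // The quality gates later check each id against the actual output.
  return `${goal}\n\nHard requirements (verified after execution):\n${requirements}`;
}
```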
Head-to-head runs against standalone Copilot CLI, Claude Code, and Codex on the same goals showed a consistent pattern: quality attributes the agent never attempted were absent from unassisted output. These aren't build failures the agent would catch on its own. They're requirements like skip-to-content links, prefers-reduced-motion media queries, CSS custom properties on :root, dual theme-color meta tags, module separation between logic and presentation, and zero-dependency test runners. The orchestrator caught and enforced all of them in a single pass. Unassisted, each is at least one follow-up prompt; several take 2-3 rounds.
Steps that fail don't get blindly retried. The orchestrator classifies the failure (build, test, missing artifact, dependency, timeout) and sends the agent back with the actual error output and context. This works alongside the agent's own retry capabilities, not instead of them.
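Failure classification might look something like the sketch below. The category names mirror the ones listed above, but the `StepResult` shape, the detection heuristics, and the repair-prompt wording are all assumptions, not the orchestrator's implementation.

```typescript
// Hypothetical failure classifier; heuristics are illustrative.
type FailureKind =
  | "build"
  | "test"
  | "missing-artifact"
  | "dependency"
  | "timeout"
  | "unknown";

interface StepResult {
  exitCode: number;
  stderr: string;
  timedOut: boolean;
  expectedFiles: string[];
  presentFiles: string[];
}

function classifyFailure(r: StepResult): FailureKind {
  if (r.timedOut) return "timeout";
  const missing = r.expectedFiles.filter((f) => !r.presentFiles.includes(f));
  if (missing.length > 0) return "missing-artifact";
  if (/cannot find module|ERESOLVE/i.test(r.stderr)) return "dependency";
  if (/\d+ (failing|failed)/i.test(r.stderr)) return "test";
  if (r.exitCode !== 0) return "build";
  return "unknown";
}

// The repair prompt carries the actual error output back to the agent,
// rather than blindly re-running the original instruction.
function repairPrompt(kind: FailureKind, stderr: string): string {
  return `Previous attempt failed (${kind}). Fix the issue below, then retry:\n${stderr}`;
}
```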
## What's New in v4.2.0
Three additions.
### Multi-Tool Adapters
The --tool flag existed in previous versions. It parsed from the CLI, reached the options object, and then did nothing. The orchestrator always spawned Copilot CLI internally regardless of what you passed.
That's fixed. resolveAdapter() now routes through real adapter implementations with a shared process supervisor.
```shell
swarm run --goal "Add auth" --tool copilot             # default, unchanged behavior
swarm run --goal "Add auth" --tool claude-code
swarm run --goal "Add auth" --tool claude-code-teams --team-size 3
```
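A resolver like the one sketched below could route each `--tool` value to an adapter behind one interface. The `AgentAdapter` shape and the argv values are placeholders for illustration, not the tools' real command lines or the project's actual `resolveAdapter()`.

```typescript
// Hypothetical adapter resolution; argv values are placeholders.
interface AgentAdapter {
  name: string;
  argv(goal: string): string[]; // command to spawn under the shared supervisor
}

function resolveAdapter(tool: string, teamSize = 1): AgentAdapter {
  switch (tool) {
    case "copilot":
      return { name: "copilot", argv: (g) => ["copilot-cli", g] };
    case "claude-code":
      return { name: "claude-code", argv: (g) => ["claude-cli", g] };
    case "claude-code-teams":
      return {
        name: "claude-code-teams",
        // Team mode wraps the goal for a per-wave team lead.
        argv: (g) => ["claude-cli", `Lead a team of ${teamSize} agents: ${g}`],
      };
    default:
      throw new Error(`Unknown tool: ${tool}`);
  }
}
```

The key design point is that every branch returns the same interface, so the scheduler and supervisor never care which tool is underneath.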
Agent Teams mode spawns a team lead per wave for native multi-agent coordination. If the team lead fails, it falls back to per-step sequential execution automatically.
Every adapter shares the same process supervisor: 5-minute stall timeout, 10-second heartbeat checking stdout activity, SIGTERM on stall, SIGKILL after 5-second grace. Previously only the Copilot path had stall detection. A hung claude process would block your entire run indefinitely. That's gone.
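The stall supervisor might be structured like this sketch: a pure stall check (split out so the policy is testable without timers) plus timer wiring around a Node child process. Function names are assumptions; the defaults match the 5-minute stall / 10-second heartbeat / 5-second grace values stated above.

```typescript
import type { ChildProcess } from "node:child_process";

// Pure policy: has the process gone quiet for longer than stallMs?
function isStalled(lastActivityMs: number, nowMs: number, stallMs: number): boolean {
  return nowMs - lastActivityMs > stallMs;
}

// Illustrative wiring: heartbeat polls stdout activity, SIGTERM on
// stall, SIGKILL after a grace period if the process is still alive.
function supervise(
  child: ChildProcess,
  stallMs = 5 * 60_000,
  heartbeatMs = 10_000,
  graceMs = 5_000,
): () => void {
  let lastActivity = Date.now();
  child.stdout?.on("data", () => { lastActivity = Date.now(); });

  const timer = setInterval(() => {
    if (isStalled(lastActivity, Date.now(), stallMs)) {
      clearInterval(timer);
      child.kill("SIGTERM"); // polite stop first
      setTimeout(() => {
        if (child.exitCode === null) child.kill("SIGKILL"); // force after grace
      }, graceMs).unref();
    }
  }, heartbeatMs);

  return () => clearInterval(timer); // call when the process exits normally
}
```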
### OWASP ASI Compliance Mapping
The orchestrator already enforced branch isolation (ASI-03: Excessive Agency), outcome-based verification (ASI-05: Improper Output Handling), and failure-classified repair. Those behaviors map directly to risks in the OWASP Top 10 for Agentic Applications.
--owasp-report formalizes that mapping. After every run, it generates a per-risk assessment with evidence pulled from actual execution metadata.
```shell
swarm run --goal "Build REST API" --governance --owasp-report
```
Six of the ten ASI risks are assessed. The other four are marked not-applicable with explicit rationale (the orchestrator doesn't store user data, doesn't communicate across networks, doesn't train models). If a risk doesn't apply, the report says so and explains why.
Which ASI risks are assessed?

| ASI Risk | Assessed | Rationale |
|---|---|---|
| ASI-01: Prompt Injection | Yes | Agent prompts controlled by orchestrator, user goals parameterized into plan steps |
| ASI-02: Insecure Tool Use | Yes | Tool invocations verified against transcript evidence |
| ASI-03: Excessive Agency | Yes | Scope enforcement via isolated worktrees and boundary declarations |
| ASI-04: Unreliable Execution | Yes | Failure classification, targeted repair, retry with error context |
| ASI-05: Improper Output Handling | Yes | Build/test/diff verification independent of agent self-reporting |
| ASI-10: Uncontrolled Autonomy | Yes | Governance mode with Critic scoring, human-in-the-loop approval |
| ASI-06, 07, 08, 09 | N/A | No model training, no data storage, no cross-network communication, no supply chain |
### Structured Run Reports
Every run already produced artifacts: session state, metrics, cost attribution, per-step verification reports, and now OWASP compliance. Pulling a coherent picture from those files meant opening each one individually.
```shell
swarm report runs/my-run-id                 # generate from any completed run
swarm report --latest --stdout              # most recent run, print to terminal
swarm report runs/my-run-id --format json   # JSON only
```
One command. Markdown and JSON. Missing sections (cost data, OWASP) are handled gracefully and just don't appear in the output.
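The graceful handling could be as simple as mapping each artifact file to a report section and skipping sections whose artifact never existed. The reader callback, section titles, and file names below are illustrative assumptions, not the orchestrator's actual layout.

```typescript
// Hypothetical report assembly: missing artifacts simply drop sections.
// `read` returns the artifact's contents, or null if it doesn't exist.
function buildReport(read: (file: string) => string | null): string {
  const sections: Array<[title: string, file: string]> = [
    ["Session", "session.json"],
    ["Metrics", "metrics.json"],
    ["Cost Attribution", "cost.json"],
    ["OWASP Compliance", "owasp-report.json"],
  ];
  const parts: string[] = [];
  for (const [title, file] of sections) {
    const body = read(file);
    if (body === null) continue; // missing sections just don't appear
    parts.push(`## ${title}\n\n${body}`);
  }
  return parts.join("\n\n");
}
```

Injecting the reader keeps the assembly logic pure, so the same function can back both the Markdown and JSON output paths.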
## Where This Sits
The agents have gotten better at self-verification. That's a good thing. The orchestrator isn't competing with that. It's adding a layer the agents don't cover: acceptance criteria enforcement, quality gates for attributes agents don't check on their own, independent verification that doesn't rely on the agent's self-reporting, and an auditable trail of everything that happened.
| | Standalone Agent (2026) | With Orchestrator |
|---|---|---|
| Build/test verification | Built-in (Copilot Agent, Claude Code) | Independent check on isolated branch |
| Quality attributes | Whatever you prompt for | 16 web-app / 6 baseline criteria injected and verified |
| Failure handling | Agent retries with some context | Classified failure, targeted repair prompt with error output |
| Audit trail | Chat history, some checkpoints | Transcripts, verification reports, cost attribution, OWASP compliance |
| Merge safety | Agent says it's done | Proof required across verification + 8 quality gates |
GitHub: moonrunnerkc/swarm-orchestrator
TypeScript. ISC license. Requires Node 20+ and Git.