Brad Kinnard
How I Built a Verification Layer for Copilot CLI's Multi-Agent Output

GitHub shipped `/fleet` for parallel subagent dispatch in Copilot CLI earlier this year. It works. The problem is what happens afterward: you get code from multiple agents and no structured way to know whether any of it actually works.

I built Copilot Swarm Orchestrator to fill that gap. It runs `copilot -p` invocations as isolated subprocesses, each on its own git branch, and verifies every agent's output against its session transcript before anything merges.

What verification looks like

The orchestrator captures each agent's `/share` transcript and parses it for concrete evidence: commit SHAs, test runner output, build markers, file changes. Every claim the agent makes gets cross-referenced against that evidence. If the agent says "all tests pass" but the transcript shows a test failure, the step fails verification.
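A minimal sketch of that cross-check, with hypothetical names (`parseTranscript`, `verifyStep`) and deliberately simplified evidence markers; the real parser handles more signal types than two regexes:

```typescript
// Sketch only: extract evidence from a session transcript. The markers
// (Mocha's "N failing" summary, tsc's "error TSxxxx") are assumptions.
interface TranscriptEvidence {
  commitShas: string[];
  testFailures: number;
  buildSucceeded: boolean;
}

function parseTranscript(transcript: string): TranscriptEvidence {
  const failing = transcript.match(/\b(\d+) failing\b/);
  return {
    commitShas: [...transcript.matchAll(/\b[0-9a-f]{40}\b/g)].map(m => m[0]),
    testFailures: failing ? Number(failing[1]) : 0,
    buildSucceeded: !/error TS\d+/.test(transcript),
  };
}

// A claim like "all tests pass" must be backed by the transcript.
function verifyStep(claim: string, transcript: string): boolean {
  const evidence = parseTranscript(transcript);
  if (/all tests pass/i.test(claim) && evidence.testFailures > 0) return false;
  return evidence.buildSucceeded;
}
```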

Failed steps don't just get retried blindly. The Repair Agent classifies the failure (build error, test failure, missing artifact, dependency issue, timeout) and applies a strategy specific to that failure type. Context accumulates across retries so the agent doesn't repeat the same mistake.
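The classification step can be sketched like this; the five category names come from the post, but the matching heuristics and strategy descriptions are illustrative, not the project's actual rules:

```typescript
type FailureType =
  | "build-error"
  | "test-failure"
  | "missing-artifact"
  | "dependency-issue"
  | "timeout";

// Illustrative classifier: map a failure log to one of the five categories.
function classifyFailure(log: string): FailureType {
  if (/ETIMEDOUT|timed out/i.test(log)) return "timeout";
  if (/Cannot find module|ERESOLVE/i.test(log)) return "dependency-issue";
  if (/error TS\d+|build failed/i.test(log)) return "build-error";
  if (/\d+ failing/i.test(log)) return "test-failure";
  return "missing-artifact";
}

// Each category gets its own retry strategy; context from prior attempts is
// carried into the next prompt so the agent doesn't repeat the same mistake.
const repairStrategy: Record<FailureType, string> = {
  "build-error": "append compiler diagnostics to the retry prompt",
  "test-failure": "include failing test names and assertion output",
  "missing-artifact": "restate the expected file list explicitly",
  "dependency-issue": "resolve the missing dependency before retrying",
  "timeout": "split the step into smaller substeps",
};
```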

Quality gates

After merge, six automated gates scan the generated code:

| Gate | What it catches |
| --- | --- |
| Scaffold leftovers | TODO placeholders, Lorem ipsum |
| Duplicate detection | Repeated code blocks |
| Hardcoded config | Magic strings and values |
| README drift | Claims that don't match actual code |
| Test isolation | Cross-test dependencies and shared state |
| Runtime correctness | Execution-time failures |

Gates that fail can auto-inject follow-up remediation steps. The orchestrator spawns a targeted Copilot session to fix what the gate flagged.
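A gate can be as simple as a named check over generated files. This sketch (hypothetical `Gate` interface) shows the shape of the scaffold-leftover gate:

```typescript
// Illustrative gate interface; the six real gates live inside the orchestrator.
interface Gate {
  name: string;
  check(files: Map<string, string>): string[]; // returned strings are violations
}

const scaffoldLeftovers: Gate = {
  name: "scaffold-leftovers",
  check(files) {
    const violations: string[] = [];
    for (const [path, source] of files) {
      if (/\bTODO\b|lorem ipsum/i.test(source)) {
        violations.push(`${path}: placeholder content found`);
      }
    }
    return violations;
  },
};

// A failing gate's violations would seed the prompt for a targeted repair session.
function runGates(gates: Gate[], files: Map<string, string>): string[] {
  return gates.flatMap(g => g.check(files).map(v => `[${g.name}] ${v}`));
}
```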

Cost tracking

Every `copilot -p` invocation burns a premium request. Model multipliers compound this: o3 costs 20x per invocation, o4-mini costs 5x. An 8-step plan with retries on a 20x model can consume a month's Pro allowance in one run.

The cost estimator predicts consumption before execution starts, using model multipliers and historical failure rates from a persistent knowledge base. You can preview the estimate and exit, or set a hard budget that aborts if the estimate exceeds it. After execution, per-step attribution shows exactly where requests went.

Example: cost estimate output for the dashboard-showcase demo:

```
Plan Analysis:
  Steps: 8 (4 parallel waves)
  Model: o4-mini (5x multiplier)
  Estimated requests: 40-65 (accounting for ~30% retry rate)
  Budget impact: ~13-22% of monthly Pro allocation

Per-step breakdown:
  Step 1 (scaffold):     5 requests (low retry probability)
  Step 2 (API routes):   8 requests (moderate complexity)
  Step 3 (chart logic):  10 requests (high retry probability)
  ...
```
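The arithmetic behind an estimate like this is straightforward. A sketch, assuming a 300-request monthly Pro quota and illustrative per-step bounds (the real estimator pulls its retry rates from the knowledge base):

```typescript
const MONTHLY_PRO_ALLOWANCE = 300; // assumed premium-request quota per month

interface Estimate {
  low: number;
  high: number;
  budgetPct: [number, number]; // share of the monthly allowance, in percent
}

// Illustrative: the optimistic bound is steps * cheapest per-step cost; the
// pessimistic bound inflates the worst case by the historical retry rate.
function estimateRequests(
  steps: number,
  perStepLow: number,
  perStepHigh: number,
  retryRate: number,
): Estimate {
  const low = steps * perStepLow;
  const high = Math.ceil(steps * perStepHigh * (1 + retryRate));
  return {
    low,
    high,
    budgetPct: [
      Math.round((low / MONTHLY_PRO_ALLOWANCE) * 100),
      Math.round((high / MONTHLY_PRO_ALLOWANCE) * 100),
    ],
  };
}
```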

v3.2: the speed release

The latest release replaced the wave-barrier scheduler with greedy scheduling. Steps launch the moment their dependencies resolve instead of waiting for an entire wave to finish.

Other changes in this release:

  • Prompt compression extracts shared boilerplate into `.copilot-instructions.md` (which Copilot CLI reads natively), cutting ~60% of repeated tokens per step
  • Octopus merge for parallel branch completion: one merge commit instead of N
  • Event-driven dependency resolution via EventEmitter instead of file polling
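Greedy, event-driven dispatch can be sketched in a few lines with Node's EventEmitter (hypothetical `Step` shape; error handling and cycle detection omitted):

```typescript
import { EventEmitter } from "node:events";

// Sketch of v3.2-style greedy scheduling: a step launches the moment its
// dependencies are done, with no wave barrier and no file polling.
interface Step {
  id: string;
  deps: string[];
  run(): Promise<void>;
}

async function schedule(steps: Step[]): Promise<void> {
  const done = new Set<string>();
  const launched = new Set<string>();
  const bus = new EventEmitter();

  const tryLaunch = (): void => {
    for (const step of steps) {
      if (!launched.has(step.id) && step.deps.every(d => done.has(d))) {
        launched.add(step.id);
        step.run().then(() => {
          done.add(step.id);
          bus.emit("step-done"); // wakes dependents immediately
        });
      }
    }
  };

  await new Promise<void>(resolve => {
    bus.on("step-done", () => {
      if (done.size === steps.length) resolve();
      else tryLaunch();
    });
    tryLaunch();
    if (steps.length === 0) resolve();
  });
}
```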

The dashboard-showcase demo (4 agents building a React + Chart.js + Express app with 27 tests) dropped from 8m 48s to 7m 56s.

By the numbers

| Metric | Value |
| --- | --- |
| TypeScript source files | 71 |
| Lines of code | 17,903 |
| Tests passing | 649 (Mocha + Node.js assert) |
| Built-in demo scenarios | 6 (1-min smoke test to 40-min SaaS MVP) |
| Contributors | 1 |
| License | ISC |

View on GitHub

The quickest way to see it work:

```
npm start demo-fast
```

Runs two parallel agents in about a minute.

