Brad Kinnard
How I Built a Verification Layer for Copilot CLI's Multi-Agent Output

GitHub shipped `/fleet` for parallel subagent dispatch in Copilot CLI earlier this year. It works. The problem is what happens afterward: you get code from multiple agents and no structured way to know whether any of it actually works.

I built Copilot Swarm Orchestrator to fill that gap. It runs `copilot -p` invocations as isolated subprocesses, each on its own git branch, and verifies every agent's output against its session transcript before anything merges.

What verification looks like

The orchestrator captures each agent's `/share` transcript and parses it for concrete evidence: commit SHAs, test runner output, build markers, file changes. Every claim the agent makes gets cross-referenced against that evidence. If the agent says "all tests pass" but the transcript shows a test failure, the step fails verification.
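A minimal sketch of that cross-check, with hypothetical names (`parseTranscript`, `verifyStep`) and deliberately simplified evidence markers; the real parser handles more signal types than two regexes:

```typescript
// Sketch only: extract evidence from a session transcript. The markers
// (Mocha's "N failing" summary, tsc's "error TSxxxx") are assumptions.
interface TranscriptEvidence {
  commitShas: string[];
  testFailures: number;
  buildSucceeded: boolean;
}

function parseTranscript(transcript: string): TranscriptEvidence {
  const failing = transcript.match(/\b(\d+) failing\b/);
  return {
    commitShas: [...transcript.matchAll(/\b[0-9a-f]{40}\b/g)].map(m => m[0]),
    testFailures: failing ? Number(failing[1]) : 0,
    buildSucceeded: !/error TS\d+/.test(transcript),
  };
}

// A claim like "all tests pass" must be backed by the transcript.
function verifyStep(claim: string, transcript: string): boolean {
  const evidence = parseTranscript(transcript);
  if (/all tests pass/i.test(claim) && evidence.testFailures > 0) return false;
  return evidence.buildSucceeded;
}
```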

Failed steps don't just get retried blindly. The Repair Agent classifies the failure (build error, test failure, missing artifact, dependency issue, timeout) and applies a strategy specific to that failure type. Context accumulates across retries so the agent doesn't repeat the same mistake.
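The classification step can be sketched like this; the five category names come from the post, but the matching heuristics and strategy descriptions are illustrative, not the project's actual rules:

```typescript
type FailureType =
  | "build-error"
  | "test-failure"
  | "missing-artifact"
  | "dependency-issue"
  | "timeout";

// Illustrative classifier: map a failure log to one of the five categories.
function classifyFailure(log: string): FailureType {
  if (/ETIMEDOUT|timed out/i.test(log)) return "timeout";
  if (/Cannot find module|ERESOLVE/i.test(log)) return "dependency-issue";
  if (/error TS\d+|build failed/i.test(log)) return "build-error";
  if (/\d+ failing/i.test(log)) return "test-failure";
  return "missing-artifact";
}

// Each category gets its own retry strategy; context from prior attempts is
// carried into the next prompt so the agent doesn't repeat the same mistake.
const repairStrategy: Record<FailureType, string> = {
  "build-error": "append compiler diagnostics to the retry prompt",
  "test-failure": "include failing test names and assertion output",
  "missing-artifact": "restate the expected file list explicitly",
  "dependency-issue": "resolve the missing dependency before retrying",
  "timeout": "split the step into smaller substeps",
};
```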

Quality gates

After merge, six automated gates scan the generated code:

| Gate | What it catches |
| --- | --- |
| Scaffold leftovers | TODO placeholders, Lorem ipsum |
| Duplicate detection | Repeated code blocks |
| Hardcoded config | Magic strings and values |
| README drift | Claims that don't match actual code |
| Test isolation | Cross-test dependencies and shared state |
| Runtime correctness | Execution-time failures |

Gates that fail can auto-inject follow-up remediation steps. The orchestrator spawns a targeted Copilot session to fix what the gate flagged.
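A gate can be as simple as a named check over generated files. This sketch (hypothetical `Gate` interface) shows the shape of the scaffold-leftover gate:

```typescript
// Illustrative gate interface; the six real gates live inside the orchestrator.
interface Gate {
  name: string;
  check(files: Map<string, string>): string[]; // returned strings are violations
}

const scaffoldLeftovers: Gate = {
  name: "scaffold-leftovers",
  check(files) {
    const violations: string[] = [];
    for (const [path, source] of files) {
      if (/\bTODO\b|lorem ipsum/i.test(source)) {
        violations.push(`${path}: placeholder content found`);
      }
    }
    return violations;
  },
};

// A failing gate's violations would seed the prompt for a targeted repair session.
function runGates(gates: Gate[], files: Map<string, string>): string[] {
  return gates.flatMap(g => g.check(files).map(v => `[${g.name}] ${v}`));
}
```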

Cost tracking

Every `copilot -p` invocation burns a premium request. Model multipliers compound this: o3 costs 20x per invocation, o4-mini costs 5x. An 8-step plan with retries on a 20x model can consume a month's Pro allowance in one run.

The cost estimator predicts consumption before execution starts, using model multipliers and historical failure rates from a persistent knowledge base. You can preview the estimate and exit, or set a hard budget that aborts if the estimate exceeds it. After execution, per-step attribution shows exactly where requests went.

Example: cost estimate output for the dashboard-showcase demo:

```
Plan Analysis:
  Steps: 8 (4 parallel waves)
  Model: o4-mini (5x multiplier)
  Estimated requests: 40-65 (accounting for ~30% retry rate)
  Budget impact: ~13-22% of monthly Pro allocation

Per-step breakdown:
  Step 1 (scaffold):     5 requests (low retry probability)
  Step 2 (API routes):   8 requests (moderate complexity)
  Step 3 (chart logic):  10 requests (high retry probability)
  ...
```
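The arithmetic behind an estimate like this is straightforward. A sketch, assuming a 300-request monthly Pro quota and illustrative per-step bounds (the real estimator pulls its retry rates from the knowledge base):

```typescript
const MONTHLY_PRO_ALLOWANCE = 300; // assumed premium-request quota per month

interface Estimate {
  low: number;
  high: number;
  budgetPct: [number, number]; // share of the monthly allowance, in percent
}

// Illustrative: the optimistic bound is steps * cheapest per-step cost; the
// pessimistic bound inflates the worst case by the historical retry rate.
function estimateRequests(
  steps: number,
  perStepLow: number,
  perStepHigh: number,
  retryRate: number,
): Estimate {
  const low = steps * perStepLow;
  const high = Math.ceil(steps * perStepHigh * (1 + retryRate));
  return {
    low,
    high,
    budgetPct: [
      Math.round((low / MONTHLY_PRO_ALLOWANCE) * 100),
      Math.round((high / MONTHLY_PRO_ALLOWANCE) * 100),
    ],
  };
}
```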

v3.2: the speed release

The latest release replaced the wave-barrier scheduler with greedy scheduling. Steps launch the moment their dependencies resolve instead of waiting for an entire wave to finish.

Other changes in this release:

  • Prompt compression extracts shared boilerplate into `.copilot-instructions.md` (which Copilot CLI reads natively), cutting ~60% of repeated tokens per step
  • Octopus merge for parallel branch completion: one merge commit instead of N
  • Event-driven dependency resolution via EventEmitter instead of file polling
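Greedy, event-driven dispatch can be sketched in a few lines with Node's EventEmitter (hypothetical `Step` shape; error handling and cycle detection omitted):

```typescript
import { EventEmitter } from "node:events";

// Sketch of v3.2-style greedy scheduling: a step launches the moment its
// dependencies are done, with no wave barrier and no file polling.
interface Step {
  id: string;
  deps: string[];
  run(): Promise<void>;
}

async function schedule(steps: Step[]): Promise<void> {
  const done = new Set<string>();
  const launched = new Set<string>();
  const bus = new EventEmitter();

  const tryLaunch = (): void => {
    for (const step of steps) {
      if (!launched.has(step.id) && step.deps.every(d => done.has(d))) {
        launched.add(step.id);
        step.run().then(() => {
          done.add(step.id);
          bus.emit("step-done"); // wakes dependents immediately
        });
      }
    }
  };

  await new Promise<void>(resolve => {
    bus.on("step-done", () => {
      if (done.size === steps.length) resolve();
      else tryLaunch();
    });
    tryLaunch();
    if (steps.length === 0) resolve();
  });
}
```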

The dashboard-showcase demo (4 agents building a React + Chart.js + Express app with 27 tests) dropped from 8m 48s to 7m 56s.

By the numbers

| Metric | Value |
| --- | --- |
| TypeScript source files | 71 |
| Lines of code | 17,903 |
| Tests passing | 649 (Mocha + Node.js assert) |
| Built-in demo scenarios | 6 (1-min smoke test to 40-min SaaS MVP) |
| Contributors | 1 |
| License | ISC |

View on GitHub

The quickest way to see it work:

```
npm start demo-fast
```

Runs two parallel agents in about a minute.

