DEV Community

Brad Kinnard

I Shipped 6 Upgrades to My Copilot CLI Orchestrator. The SDK Had Other Plans.

Six weeks ago I had a working orchestrator for GitHub Copilot CLI. It ran agents in parallel on isolated branches, verified their output against transcripts, and merged what passed. 947 tests. Zero TypeScript errors. Solid.

Then I looked at what Copilot CLI shipped since GA and realized the tool was already falling behind the platform. Fleet mode, hooks, plugins, MCP servers, the /pr command. The CLI had grown a surface area that my orchestrator was either duplicating or ignoring.

So I planned six upgrades, built all six, and ran live integration tests against real Copilot CLI sessions, real GitHub PRs, and real MCP client connections. Some things worked exactly as designed. One thing didn't work at all, and the reason is worth knowing if you're building anything on top of Copilot CLI's hook system.

Here's what happened.

GitHub: moonrunnerkc / copilot-swarm-orchestrator

Verified, quality-gated orchestration for GitHub Copilot CLI. Every agent proves its work before anything merges.


Copilot Swarm Orchestrator


License: ISC    Node.js 18+    TypeScript 5.x    951 tests passing




Copilot Swarm Orchestrator TUI dashboard showing parallel agent execution across waves


Quick Start

```bash
git clone https://github.com/moonrunnerkc/copilot-swarm-orchestrator.git
cd copilot-swarm-orchestrator
npm install && npm run build
npm start demo-fast
```

That runs a two-step parallel demo in about a minute. Requires Node.js 18+, Git, and GitHub Copilot CLI installed and authenticated (gh copilot).




What Is This

AI coding agents produce code fast. The problem is knowing whether that code actually works before it reaches your codebase. This orchestrator exists to answer that question with evidence, not assumptions.

Every agent runs on its own isolated git branch. Every claim an agent makes is cross-referenced against its Copilot session transcript for concrete evidence: commit SHAs, test output, build results, file changes. Steps that can't prove their work don't merge. Steps…

Upgrade 1: Copilot CLI Plugin Packaging

Problem: Installing the orchestrator required cloning a repo, running npm install, and linking globally. Nobody tries a tool with that much friction.

Fix: Package the agents, skills, hooks, and quality gates as a Copilot CLI plugin.

```bash
copilot /plugin install moonrunnerkc/copilot-swarm-orchestrator
```

That's it. Eight agent profiles, three skills (orchestrate, verify, gates), scope enforcement hooks, and the MCP server config drop into your Copilot CLI environment. /agents list shows the installed agents. /swarm gates runs quality checks from inside a Copilot session.

The plugin format requires a plugin.json manifest, .agent.md files for agents (converted from the existing YAML definitions), SKILL.md files for skills, and hook JSON files. The conversion from YAML to .agent.md was mechanical, which is exactly what you want. The prompt text has been tested across 124 commits. Reformatting it shouldn't change behavior, and it didn't.

Two bugs in plugin.json surfaced during testing. Both were structural (wrong field names in the manifest). Caught by attempting a local install, which is the only real validation since there's no official schema validator for the plugin format yet.

The plugin is the lightweight entry point. Full parallel wave scheduling, cost governance, the repair pipeline, and PR integration still require the source install. But for people who just want the agents and quality gates, friction dropped to one command.

Upgrade 2: GitHub PR Integration

Problem: The orchestrator merged verified branches locally. The audit trail lived in runs/ directories on your machine. For teams, "trust me, verification passed" doesn't cut it.

Fix: A --pr flag that creates real GitHub Pull Requests with verification evidence attached as structured comments.

```bash
npm start swarm plan.json --pr auto
```

Each verified step becomes a PR. The PR body includes a verification table (pass/fail per evidence type), cost attribution (premium requests consumed), and quality gate results when governance mode is enabled. In --pr auto mode, verified PRs merge immediately. In --pr review mode, execution pauses at each wave until the PR is approved on GitHub.

Two real bugs surfaced during live testing. First, cli-handlers.ts wasn't forwarding the --target, --pr, and --hooks flags to the execution pipeline. Everything parsed correctly at the CLI layer, but the flags evaporated before reaching the orchestrator. Second, pr-manager.ts was calling gh pr create before pushing the branch to the remote. The PR creation failed silently because the branch didn't exist on GitHub yet.

Both are the kind of integration bugs that unit tests will never catch. The flag forwarding worked fine in isolation. The PR manager's unit tests mocked gh and didn't check whether the branch existed remotely. Only a live test that created an actual PR on an actual GitHub repo surfaced both issues.
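The ordering fix is easy to see in sketch form. The helper below is hypothetical (not the project's actual pr-manager.ts); it just makes the invariant explicit: the push command must come before gh pr create, because the PR's head branch has to exist on the remote first.

```typescript
// Hypothetical sketch of the ordering invariant from the pr-manager fix.
type Cmd = string[];

function prCommands(branch: string, title: string): Cmd[] {
  return [
    // 1. Push the verified branch so it exists on GitHub...
    ["git", "push", "-u", "origin", branch],
    // 2. ...and only then open a PR whose head is that branch.
    ["gh", "pr", "create", "--head", branch, "--title", title],
  ];
}
```

Unit tests that mock `gh` can't catch a violation of this ordering; only a live run against a real remote can.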

After the fixes, PRs #1 through #4 were created and auto-merged on the test repo. Verification evidence rendered correctly in GitHub's markdown. The review mode correctly paused until approval. Shipped.

Upgrade 3: Live Hook-Based Scope Enforcement

This is where things got interesting.

Problem: Verification happens after the agent finishes. By the time the orchestrator parses the transcript and discovers the agent touched files outside its scope, the damage is done. The step fails, but the bad code already exists on the branch.

Fix: Inject Copilot CLI hooks into each agent session that enforce scope boundaries during execution, not after.

Copilot CLI supports lifecycle hooks: preToolUse, postToolUse, sessionStart, errorOccurred. Each hook runs a shell command and receives JSON context about the event via stdin. The plan was straightforward: generate per-step hook files that block out-of-scope file operations in preToolUse and capture structured evidence in postToolUse.

The original hook-generator.ts implementation was wrong in four ways:

  1. Hooks load from <gitRoot>/.github/hooks/**/*.json, not ~/.copilot/hooks/
  2. The format is { "version": 1, "hooks": { "eventName": [{ "type": "command", "bash": "...", "timeoutSec": N }] } }, not what I had
  3. Context arrives via stdin as JSON, not through environment variables
  4. Plugin hooks are declared as a string path in plugin.json, not as inline config

All four were wrong because I built from documentation that was incomplete, or that I misread. The fix required reverse-engineering the actual SDK (index.js in the Copilot CLI package) to find the correct format, directory, and context mechanism. That meant a complete rewrite of hook-generator.ts.
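For reference, a generated per-step hook file under .github/hooks/ that follows the format described above looks roughly like this (the script paths and timeout values are placeholders, not the orchestrator's actual generated output):

```json
{
  "version": 1,
  "hooks": {
    "preToolUse": [
      { "type": "command", "bash": "node .swarm/hooks/check-scope.js", "timeoutSec": 10 }
    ],
    "postToolUse": [
      { "type": "command", "bash": "node .swarm/hooks/capture-evidence.js", "timeoutSec": 10 }
    ]
  }
}
```

Each command receives the event context as JSON on stdin, per point 3 above.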

After the rewrite, hooks fired correctly. postToolUse captured structured evidence (tool name, file paths, timestamps) to an evidence.jsonl file. The verifier cross-references this against transcript evidence. Contradictions between sources fail the step. This part works.
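Conceptually, the cross-reference is a two-way set comparison: hook evidence and transcript evidence describe the same events from two independent sources, so a file that appears in one but not the other is a contradiction. A minimal sketch with hypothetical types (not the project's actual verifier code):

```typescript
// Hypothetical evidence shape; the real records carry more fields
// (tool name, timestamps) per the evidence.jsonl description above.
interface EvidenceRecord { tool: string; filePath: string; }

function findContradictions(
  hookEvidence: EvidenceRecord[],
  transcriptEvidence: EvidenceRecord[]
): string[] {
  const hookFiles = new Set(hookEvidence.map((e) => e.filePath));
  const transcriptFiles = new Set(transcriptEvidence.map((e) => e.filePath));
  const contradictions: string[] = [];
  // A file the transcript claims was touched but the hooks never saw:
  for (const f of transcriptFiles) if (!hookFiles.has(f)) contradictions.push(`transcript-only: ${f}`);
  // A file the hooks observed that the transcript never mentions:
  for (const f of hookFiles) if (!transcriptFiles.has(f)) contradictions.push(`hook-only: ${f}`);
  return contradictions; // non-empty result fails the step
}
```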

The Deny Discovery

Then I tested preToolUse scope enforcement.

The hook fires. The script reads the tool context from stdin, checks the file path against the agent's boundary rules, and outputs a deny decision with a reason. Copilot CLI receives the deny decision.

And ignores it.

Copilot CLI SDK v1.0.7 does not honor deny decisions from preToolUse hook processes. The hook runs, the deny is emitted, and the agent proceeds to write the file anyway. I verified this multiple times with different agents, different file paths, and different deny formats. The SDK receives the hook's output and does nothing with it.

This isn't a bug in the orchestrator. It's a platform limitation. Hooks fire for monitoring, but execution-time enforcement doesn't exist in the SDK yet.

What I shipped instead

Scope enforcement moved to the verification layer. When preToolUse detects a boundary violation, it logs a scope_violation entry to evidence.jsonl with the tool name, file path, agent name, and which boundary rule was violated. The verifier reads these entries and fails any step that has scope violations.

The result: out-of-scope writes aren't blocked during execution, but they are caught before merge. The agent does the work, the hook logs the violation, the verifier kills the step. Not as clean as execution-time blocking, but the same outcome: scope violations never reach main.

When the SDK adds deny enforcement, I'll flip it on. The hook already outputs the correct deny format. It's just waiting for the other side to listen.

Hook generation runs at 0.11ms per step. No measurable latency impact on execution.

Upgrade 4: Agent Export from Execution History

Problem: Agent definitions are written by hand based on intuition about what works. The orchestrator's knowledge base has actual data about what works, but it sits in JSON files that nobody reads.

Fix: A swarm agents export command that generates .agent.md files from execution history.

```bash
npm start agents export --output-dir ./agents --min-runs 5
```

The exporter reads knowledge base data across all runs, aggregates per-agent statistics with recency weighting (30-day half-life), and produces Copilot CLI-compatible agent files. Each exported agent includes tool usage patterns, scope boundaries derived from actual file paths modified, failure prevention rules from patterns that broke things, and performance notes.

Example: if tester_elite consistently fails when asked for integration tests requiring database setup but succeeds 95% on unit tests with mocks, the exported agent includes that as a concrete instruction, not a statistic.
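The recency weighting is plain exponential decay. A minimal sketch of the 30-day half-life (function names are illustrative, not the exporter's actual code): a run from today counts fully, a 30-day-old run counts half, a 60-day-old run a quarter, and so on.

```typescript
const HALF_LIFE_DAYS = 30;

// Weight decays by half every HALF_LIFE_DAYS.
function recencyWeight(ageDays: number): number {
  return Math.pow(0.5, ageDays / HALF_LIFE_DAYS);
}

// Aggregate a per-agent success rate where recent runs dominate.
function weightedSuccessRate(runs: { ageDays: number; passed: boolean }[]): number {
  const total = runs.reduce((s, r) => s + recencyWeight(r.ageDays), 0);
  const passed = runs.reduce((s, r) => s + (r.passed ? recencyWeight(r.ageDays) : 0), 0);
  return total === 0 ? 0 : passed / total;
}
```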

One bug during testing: the exporter normalized agent names for lookup but the knowledge base stored them differently. backend_master in the config didn't match backendmaster in the knowledge base text. Stripping underscores during matching fixed it.

The real test: I deployed an exported backend_master to ~/.copilot/agents/ and asked it about its guidelines. Copilot CLI referenced the learned patterns from the exported file, specifically TypeScript strict mode preferences and error handling middleware guidance that came from execution data, not the base definition. The export-to-behavior pipeline works end to end.

Use --diff to see how agent definitions evolve between exports as the knowledge base grows.

Upgrade 5: Native /fleet Hybrid Mode

Problem: The orchestrator spawns independent copilot -p subprocesses for each step. Copilot CLI's /fleet command does native parallel subagent dispatch with lower-cost models by default. I was reimplementing parallel execution that Copilot already does, and paying full model price for it.

Fix: A --fleet flag that delegates intra-wave parallel dispatch to /fleet while the orchestrator handles inter-wave scheduling, verification, and quality gates.

```bash
npm start swarm plan.json --fleet
```

The fleet executor constructs a single /fleet prompt from all steps in a wave, including agent assignments and scope boundaries. After /fleet completes, it maps subtask results back to the orchestrator's step model and feeds them into the verification pipeline.

During testing, the result parser broke. The regex expected whitespace between the step number and the completion checkmark (\s*), but /fleet outputs **Subtask 1 (BackendMaster):** ... ✅ with arbitrary content on the same line. Changing the pattern to [^\n]* lets it match anything up to the end of the line. A small fix, but without it every fleet run would have silently broken.
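For illustration, the fixed matching logic amounts to something like this. The exact pattern is a reconstruction from the output format quoted above, not fleet-executor.ts verbatim:

```typescript
// [^\n]* accepts arbitrary text between the subtask header and the
// checkmark, instead of assuming only whitespace separates them.
const SUBTASK_LINE = /\*\*Subtask (\d+) \(([^)]+)\):\*\*[^\n]*(✅|❌)/g;

function parseFleetResults(output: string): { step: number; agent: string; ok: boolean }[] {
  return [...output.matchAll(SUBTASK_LINE)].map((m) => ({
    step: Number(m[1]),
    agent: m[2],
    ok: m[3] === "✅",
  }));
}
```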

Live result: 2 subtasks dispatched, completed in 1 minute 40 seconds, 1 premium request. Compared to 2 subprocess invocations at 1 premium request each, that's a 50% cost reduction on a minimal plan. On larger plans with 6-8 steps per wave, the savings scale.

If /fleet fails or produces results the orchestrator can't map back to steps, it automatically falls back to subprocess mode. No manual intervention.

Upgrade 6: MCP Server

Problem: The orchestrator is a standalone CLI tool. Other tools in the ecosystem (VS Code Copilot, Claude Code, other agents) can't query its state, trigger operations, or inspect results.

Fix: An MCP server that exposes orchestrator state and control operations to any MCP-compatible client.

```bash
claude mcp add copilot-swarm-orchestrator -- node dist/mcp-server.js
```

Five resources (runs, run detail, step detail, agents, knowledge base) and four tools (plan, gates, export agents, status). The server reads from the same runs/ and config/ files the CLI uses. No separate database. If a run is active, it serves live state by watching the filesystem.

One protocol bug during testing: the server was responding to notifications/initialized, which violates the MCP spec. Notifications don't get responses. Claude Code connected but the extra response confused the protocol handshake. Fix: return null for notifications and skip sending a response. After that, Claude Code listed all resources and tools, queried agent profiles, and ran quality gates through the MCP interface.
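In JSON-RPC terms (MCP messages are JSON-RPC 2.0), the fix boils down to: a message without an id is a notification, and notifications must never receive a response. A minimal sketch, with an illustrative function name:

```typescript
interface JsonRpcMessage {
  jsonrpc: "2.0";
  id?: number | string;
  method?: string;
}

// Returning null tells the transport layer to send nothing back.
function buildResponse(msg: JsonRpcMessage, result: unknown): object | null {
  if (msg.id === undefined) return null; // notification: no response
  return { jsonrpc: "2.0", id: msg.id, result };
}
```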

The MCP server auto-configures when you install via the plugin path. copilot /mcp show lists it.

What 951 Tests Look Like

Final state after all six upgrades and live integration testing:

  • 76 source files, ~19,500 lines of TypeScript
  • 66 test files, 951 passing, 1 pending, 0 failing
  • Zero TypeScript errors
  • Every upgrade tested against real Copilot CLI sessions, real GitHub PRs, real MCP client connections

The 1 pending test is intentional (placeholder for execution-time deny enforcement, waiting on SDK support).

Files modified during live integration testing

Bug fixes:

  • cli-handlers.ts: flag forwarding to execution pipeline
  • pr-manager.ts: git push before PR creation

Hooks rewrite:

  • hook-generator.ts: complete rewrite for correct SDK format
  • session-executor.ts: hook directory injection per subprocess
  • swarm-orchestrator.ts: hook lifecycle management

Fixes from final testing round:

  • verifier-engine.ts: scope violation enforcement in verification layer
  • agents-exporter.ts: underscore-stripped name matching for KB lookup
  • fleet-executor.ts: subtask completion regex for /fleet output parsing
  • mcp-server.ts: notification handling (MCP spec compliance)

Platform Limitations Worth Knowing

If you're building on top of Copilot CLI's extensibility surface, two things to be aware of:

Hook deny decisions are not enforced. As of SDK v1.0.7, preToolUse hooks fire and receive context correctly, but deny/block decisions from hook processes are ignored. Hooks work for monitoring and evidence capture. They don't work for execution-time enforcement. Build your enforcement at a different layer.

Plugin format has no schema validator. plugin.json errors fail silently. The plugin just doesn't appear. No error message, no log entry in most cases. Test by installing locally (copilot /plugin install /path/to/local/dir) before pushing to a marketplace. Check ~/.copilot/logs/ if things go wrong.

Neither of these is a showstopper. Both are things I wish I'd known before building, not after.

Try it: full source or plugin install


The orchestrator started as a Copilot CLI Challenge submission that ran agents in parallel and checked their transcripts. It's now a plugin-distributed, PR-integrated, hook-monitored, fleet-enabled orchestration system with an MCP server and empirically derived agent profiles.

If you're running Copilot CLI for anything beyond single-shot prompts, the verification layer alone is worth the install. Everything an agent claims gets cross-referenced against evidence before it touches your codebase.
