What I Built
Copilot Guardian -- a deterministic safety layer for GitHub Copilot that turns CI failures into auditable diagnosis, risk-stratified patches, and fail-closed quality verdicts.
When CI breaks, most developers stare at logs and guess. Most AI tools give you one answer and hope it's right. Guardian takes a different approach:
Multi-hypothesis reasoning -- Copilot generates 3 competing root-cause theories with confidence scores and evidence. You see why it picked one, not just that it did.
3 patch strategies -- Conservative, Balanced, Aggressive. Different risk profiles for different situations. You choose.
Deterministic quality guard -- An independent guardrail layer checks every patch for scope violations, bypass anti-patterns (`continue-on-error`, `--insecure`, `NODE_TLS_REJECT_UNAUTHORIZED=0`), and slop signals. Enforces `NO_GO` when safety conditions are violated.
Forced abstain policy -- For non-patchable failure classes (401/403 auth, rate limits, infrastructure), Guardian emits `NOT_PATCHABLE` and refuses to generate unsafe patches.
Full artifact trail -- `analysis.json`, `patch_options.json`, `quality_review.*.json`, raw model traces. Every word Copilot said is auditable.
This is not a one-shot demo. It's an engineering system designed for real CI workflows.
Repository: github.com/flamehaven01/copilot-guardian
Judge Quick Test (90 seconds)
Prerequisites (10 seconds):
- `gh auth status` succeeds
- Copilot access is enabled for your account/session
# Fastest path (no install)
npx copilot-guardian@latest run \
--repo flamehaven01/copilot-guardian \
--last-failed \
--show-options \
--fast \
--max-log-chars 20000
Expected outputs: analysis.json, patch_options.json, quality_review.*.json
Demo
Runtime: 3m43s | Profile: --fast --max-log-chars 20000
Quick Repro Paths
Judge Quick Test (`npx ... run`) is already shown above.
Use the paths below when you prefer install/build workflows.
# 2) Global install path
npm install -g copilot-guardian@latest
copilot-guardian run \
--repo flamehaven01/copilot-guardian \
--last-failed \
--show-options \
--fast \
--max-log-chars 20000
# 3) Source path (clone + build)
gh repo clone flamehaven01/copilot-guardian
cd copilot-guardian
npm install
npm run build
copilot-guardian run \
--repo flamehaven01/copilot-guardian \
--last-failed \
--show-options \
--fast \
--max-log-chars 20000
Multi-Hypothesis Diagnosis:
Three competing theories with confidence scores, evidence, and disconfirming signals. Guardian selects the strongest hypothesis but preserves the full reasoning trace.
Patch Spectrum with Quality Verdicts:
Three strategies at different risk levels. Each gets an independent quality review. The deterministic guard can override the model verdict and force NO_GO when it detects bypass patterns or scope violations.
Output Artifacts
.copilot-guardian/
analysis.json # Multi-hypothesis diagnosis
reasoning_trace.json # Full hypothesis audit trail
patch_options.json # 3 strategies + verdicts
fix.*.patch # Generated strategy patches
quality_review.*.json # Per-strategy quality results
copilot.*.raw.txt # Raw model responses
abstain.report.json # Forced abstain (if triggered)
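If you want to script a post-run check against this directory, a minimal sketch is below. The helper name, the artifact subset, and the default directory are assumptions for illustration, not part of Guardian's API:

```typescript
import * as fs from "fs";
import * as path from "path";

// Core artifacts that every successful run should write (per the tree above).
const required = ["analysis.json", "reasoning_trace.json", "patch_options.json"];

// Returns the names of any expected artifacts missing from the output dir.
function missingArtifacts(dir: string = ".copilot-guardian"): string[] {
  return required.filter((name) => !fs.existsSync(path.join(dir, name)));
}
```

An empty result means the core trail is present; anything else tells you exactly which artifact a run failed to produce.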
Challenge Rubric Mapping
Use of Copilot CLI -- Guardian is operated from terminal flows; the project demonstrates reproducible `copilot-guardian` and `gh`-based workflows for CI failure recovery.
Usability / UX -- The 90-second Judge Quick Test plus explicit expected outputs make validation fast and deterministic.
Originality -- Multi-hypothesis diagnosis + risk-stratified patches + deterministic fail-closed guard + forced abstain policy.
Receipts: Structured + Fail-Closed Evidence
Representative analysis.json excerpt (redacted):
{
"selected_hypothesis": {
"id": "h1",
"summary": "Missing API_URL in workflow environment",
"confidence": 0.89,
"evidence": [
"CI log contains: API_URL is not defined",
"Failure reproduces in Actions context only"
],
"disconfirming_signals": [
"No Node version mismatch in failing run"
]
}
}
Representative quality_review.aggressive.json excerpt (redacted):
{
"strategy": "aggressive",
"verdict": "NO_GO",
"deterministic_flags": [
"bypass_pattern: continue-on-error: true"
],
"slop_score": 0.73,
"reason": "Safety policy violation detected by deterministic guard"
}
My Experience with GitHub Copilot CLI and SDK
Most people use Copilot to write code faster. I used it to build a reasoning engine.
Copilot Guardian is a terminal CLI tool. By default it calls Copilot through `@github/copilot-sdk`; optional `gh copilot` flows are provided for reproducible, terminal-first local operation.
Pattern 1: Multi-Hypothesis Prompting
Instead of asking "what's wrong?", I structured the prompt to force multiple competing explanations:
# prompts/analysis.v2.txt (excerpt)
You must explore multiple hypotheses before selecting
the most likely root cause.
Produce exactly 3 hypotheses in descending confidence order.
Each hypothesis must include: evidence, disconfirming signals,
and a next_check action.
This eliminates confirmation bias. Copilot can't jump to conclusions -- it has to show competing theories with evidence for and against each one. The result is a structured JSON object validated against a schema, not free-form text.
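The shape of that validated object can be sketched from the analysis.json excerpt shown later in this post. The interface and the hand-rolled checks below are illustrative; Guardian's actual schema validation may differ:

```typescript
// Field names mirror the analysis.json excerpt; validation logic is a sketch.
interface Hypothesis {
  id: string;
  summary: string;
  confidence: number; // 0..1
  evidence: string[];
  disconfirming_signals: string[];
}

function isHypothesis(x: any): x is Hypothesis {
  return (
    typeof x?.id === "string" &&
    typeof x?.summary === "string" &&
    typeof x?.confidence === "number" &&
    x.confidence >= 0 && x.confidence <= 1 &&
    Array.isArray(x?.evidence) &&
    Array.isArray(x?.disconfirming_signals)
  );
}

// Fail closed: if the batch is not exactly 3 well-formed hypotheses, reject it.
function selectStrongest(hypotheses: unknown[]): Hypothesis | null {
  if (hypotheses.length !== 3 || !hypotheses.every(isHypothesis)) return null;
  return (hypotheses as Hypothesis[])
    .slice()
    .sort((a, b) => b.confidence - a.confidence)[0];
}
```

Rejecting the whole batch on a single malformed hypothesis is the same fail-closed posture the quality guard applies to patches.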
Pattern 2: Risk-Stratified Generation
A single "fix" is never enough in production CI. I prompt Copilot to generate three strategies at once:
# prompts/patch.options.v1.txt (excerpt)
Generate THREE alternative patch strategies:
1) conservative: minimal, safest change
2) balanced: standard best practice fix
3) aggressive: broader change (often over-engineered)
SAFETY CONSTRAINTS:
- Only touch files in allowed_files.
- Do NOT weaken security (no disabling SSL,
no continue-on-error, no force installs).
This gives developers actual choice. A production hotfix needs Conservative. A planned refactor might pick Balanced. The Aggressive option often gets flagged -- which is itself useful data.
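In code, offering that choice reduces to presenting only the strategies the guard approved. A minimal sketch, with field names mirroring the quality_review excerpts below (the helper itself is illustrative, not Guardian's code):

```typescript
// Each generated strategy carries the verdict from its quality review.
type Strategy = "conservative" | "balanced" | "aggressive";

interface PatchOption {
  strategy: Strategy;
  verdict: "GO" | "NO_GO";
}

// Only GO-verdict strategies are offered to the developer.
function selectable(options: PatchOption[]): Strategy[] {
  return options.filter((o) => o.verdict === "GO").map((o) => o.strategy);
}
```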
Pattern 3: AI Auditing AI (Anti-Slop)
After patch generation, I send each strategy back through Copilot with a quality audit prompt:
# prompts/quality.v1.txt (excerpt)
ANTI-SLOP CHECKS (Critical):
- Detect placeholder code (TODO, FIXME)
- Detect over-abstraction (unnecessary layers)
- Detect complexity explosion (>3x LOC for minimal fix)
- Detect deprecated / suspicious Actions usage
If any anti-slop signals detected,
MUST set verdict to NO_GO and include slop_score.
But I don't trust the model alone. A deterministic quality guard runs before the model review, checking for 15+ bypass anti-patterns:
// src/engine/patch_options.ts
// deterministicQualityReview() - hard-coded bypass detection
const bypassPatterns: RegExp[] = [
/continue-on-error:\s*true/,
/NODE_TLS_REJECT_UNAUTHORIZED\s*=\s*['"]?0/,
/GIT_SSL_NO_VERIFY/,
/curl\s+(?:-k|--insecure)/,
/npm\s+--insecure|strict-ssl\s+false/,
/\|\|\s*true|set\s+\+e/
];
If the deterministic guard says NO_GO, the model verdict is overridden. Fail-closed, always.
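The override rule itself is a few lines. This sketch redeclares two of the patterns above for self-containment; names are illustrative, and the real implementation lives in `src/engine/patch_options.ts`:

```typescript
type Verdict = "GO" | "NO_GO";

// Two representative bypass patterns (subset of the full list above).
const bypass: RegExp[] = [
  /continue-on-error:\s*true/,
  /curl\s+(?:-k|--insecure)/,
];

function deterministicVerdict(patchText: string): Verdict {
  return bypass.some((re) => re.test(patchText)) ? "NO_GO" : "GO";
}

// Fail closed: the guard can veto the model's verdict, never the reverse.
function finalVerdict(patchText: string, modelVerdict: Verdict): Verdict {
  return deterministicVerdict(patchText) === "NO_GO" ? "NO_GO" : modelVerdict;
}
```

Note the asymmetry: a deterministic NO_GO always wins, but a deterministic GO cannot rescue a patch the model review rejected.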
Pattern 4: Transparent Artifact Trail
Every Copilot interaction is persisted as raw text:
copilot.analysis.raw.txt # Exact model response
copilot.patch.options.raw.txt # Patch generation response
copilot.quality.*.raw.txt # Quality review responses
reasoning_trace.json # Complete audit trail
You can diff what Copilot said versus what Guardian decided. No black boxes.
Honest Take: What Worked, What Didn't
I use Claude Code, Codex CLI, Gemini CLI, and Copilot CLI in parallel across different projects. So this isn't my first AI CLI tool -- and I'm not going to pretend Copilot CLI was flawless.
What genuinely worked well:
GitHub-native context. Copilot CLI understands repos, issues, PRs, and Actions logs without extra configuration. For a project that lives entirely in GitHub, this was a real advantage over general-purpose AI CLIs that need manual context feeding.
Structured output compliance. Once I locked the prompts to strict JSON-only constraints, Copilot reliably produced schema-valid responses. The structured reasoning quality -- especially disconfirming evidence in hypothesis generation -- was better than I expected.
Terminal-first workflow. No editor, no browser, no context switching. For CI debugging specifically, staying in the terminal felt natural and fast.
What was frustrating:
Session drops on long prompts. This was the most persistent issue. When the input context grew large (deep log analysis + source files + MCP context), the session would disconnect mid-generation. I had to implement retry logic with exponential backoff and a `--max-log-chars` cap specifically to work around this. It happened often enough that it shaped the architecture -- the `--fast` mode exists partly because shorter prompts are more stable.
SDK maturity. The `@github/copilot-sdk` is still early. Error messages are sometimes opaque, and the documentation was thin when I started. I spent real time reverse-engineering behavior that should have been documented.
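The retry workaround is conceptually simple. A minimal sketch, assuming the session call is an async function; the attempt count and delays are illustrative, not Guardian's exact values:

```typescript
// Retry a flaky async call with exponential backoff between attempts.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts: number = 3,
  baseDelayMs: number = 500
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Delays grow as baseDelayMs * 2^i: 500ms, 1000ms, 2000ms, ...
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** i));
    }
  }
  throw lastError; // all attempts exhausted: fail closed
}
```

Combined with the `--max-log-chars` cap, this turned session drops from a blocker into an occasional retry.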
Compared to alternatives:
Honestly, once the retry and timeout handling was solid, the actual reasoning quality was competitive. The GitHub integration advantage is real -- other CLIs can't natively pull Actions logs, workflow context, and repo metadata the way Copilot does. For this specific use case (CI failure diagnosis on GitHub repos), nothing else fit as naturally.
The session stability issue is the main thing holding it back. Fix that, and Copilot CLI becomes a genuinely strong tool for GitHub-centric automation.
What I Learned
Raw AI output is not enough. Copilot produces good reasoning, but CI automation requires schema validation and deterministic safety checks on top of it.
Fail-closed beats fail-open. Malformed responses, bypass patterns, and scope creep must be blocked by default. The deterministic guard caught issues the model review missed.
Multi-hypothesis prompting produces better reasoning. Forcing 3 theories with disconfirming evidence significantly improved diagnosis quality compared to single-answer prompts.
Build around the tool's limits, not against them. The session drop issue could have been a blocker. Instead, it pushed me toward better architecture: shorter prompts, explicit timeouts, retry logic, and a fast mode. The constraints made the tool more robust.
Net Impact
GitHub Copilot accelerated both implementation and iteration. But the biggest gain came from combining Copilot with strict guardrails and explicit runtime policies. This turned "AI-generated suggestions" into a controllable CI engineering workflow -- and the honest friction along the way made the result more production-ready than a smooth ride would have.
Built by Flamehaven (Yun) -- Trust is built on receipts, not magic.



