Audit AI-Generated PRs Before You Merge Them (Swarm Orchestrator 10.3.0)

#ai #opensource #githubactions #devops

If you let Claude Code, Cursor, Devin, Aider, Copilot, or any other coding agent open PRs against your repo, you already know the problem. The diff looks fine on a fast read. CI is green. You merge it. A week later you find the test that "passed" got deleted, or the error handling is a silent catch {}, or the "fix" was a comment swap that never touched the bug.

Swarm Orchestrator looks at those PRs and flags the suspicious bits before you click merge.

What it is

A CLI and a GitHub Action. Open source. Node 20 or later. You point it at a PR (or a local diff) and it scores the patch against a set of cheat-pattern detectors. It posts a comment back to the PR with what it found and why.

swarm audit moonrunnerkc/swarm-orchestrator#42

That's the whole interface for most people.

What it does

The default detector set has four checks, all aimed at patterns AI agents actually produce on real PRs:

error-swallow: a new empty or comment-only catch block in non-test code.
mock-of-hallucination: a jest.mock or vi.mock against a module that doesn't exist anywhere in the repo.
no-op-fix: tests changed without source, or source changed without tests, when the diff claims to fix something.
fake-refactor: an exported symbol renamed in source, with no caller in the diff updated.
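
To make the first of those checks concrete, here is a minimal sketch of what an error-swallow style detector could look like. It is not Swarm Orchestrator's implementation: the regex, the `is_test_path` convention, and the `swallowed_errors` helper are all assumptions made for illustration, and a real detector would more likely work on a parsed syntax tree than on raw text.

```python
import re

# Hypothetical sketch, not the tool's actual detector.
# Matches a catch block whose body holds only whitespace and comments.
EMPTY_CATCH = re.compile(
    r"catch\s*(?:\([^)]*\))?\s*\{"   # `catch {` or `catch (err) {`
    r"(?:\s|//[^\n]*|/\*.*?\*/)*"    # whitespace, line comments, block comments
    r"\}",
    re.DOTALL,
)


def is_test_path(path: str) -> bool:
    """Assumed convention: *.test.*, *.spec.* and __tests__/ are test code."""
    return bool(re.search(r"(\.test\.|\.spec\.|__tests__/)", path))


def swallowed_errors(path: str, added_text: str) -> list[str]:
    """Return empty or comment-only catch blocks found in added non-test code."""
    if is_test_path(path):
        return []
    return [m.group(0) for m in EMPTY_CATCH.finditer(added_text)]


if __name__ == "__main__":
    snippet = "try { run(); } catch (e) {\n  // ignore\n}"
    print(swallowed_errors("src/api.ts", snippet))
```

A text-level regex like this will misfire on catch-like strings inside literals, which is one reason a finding from a pattern check is best treated as a prompt for review and not as a verdict.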

Six more detectors live behind --detectors experimental for shadow runs. They don't yet score well enough on real PRs to be on by default, and the README says so.

Every finding renders with its measured precision number inline, so a reviewer sees the false-positive rate every time the bot speaks.

If you need compliance artifacts, --emit-aibom cyclonedx-ml writes a CycloneDX 1.6 ML-BOM and an SPDX 3.0 AI-Profile per audit. That covers the EU AI Act Annex IV and CISA SBOM-for-AI minimums without bolting on a separate vendor.

Who it's for

Teams that let AI agents open PRs and want a second pair of eyes that runs in CI, costs nothing per call, and produces a deterministic comment instead of vibes. Also useful for procurement and security folks who need an AI-BOM next to their SBOM and don't want another tool in the chain.

If you have one developer eyeballing every line of every AI PR by hand, you probably don't need this yet. If you have ten agents pushing diffs to a queue at 2am, you do.

What's new in 10.3.0

Four things:

no-op-fix got a v2.0 with a gated LLM judge. The judge is off by default and only fires when you set --enable-llm-judge (or SWARM_AUDIT_LLM_JUDGE=1) and have an Anthropic key. Verdicts are content-addressed and cached, so the same diff and title always get the same answer. The model id is pinned in the ledger so replay stays deterministic.
--shadow-output <path>. One JSON file per audit with detector verdicts, judge call count, and the rendered comment. Drops into a directory you can jq later. The existing --shadow <repo> per-repo rollup still works.
Public leaderboard on GitHub Pages. Fetches the real-corpus score snapshot and renders precision, recall, F1, and a sortable per-detector table. No build step, no CDN, just an HTML page and one JS file: moonrunnerkc.github.io/swarm-orchestrator/docs/leaderboard/.
Real-corpus headline rescored against the v2.0 detectors. F1 moved from 0.109 (P 0.067, R 0.300) to 0.167 (P 0.100, R 0.500). mock-of-hallucination picked up two true positives the v1 shape missed.
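
The content-addressed verdict cache mentioned for the LLM judge is worth a quick illustration. The sketch below shows the general idea only: hash the inputs that determine a verdict (diff, title, pinned model id) and use the digest as the cache key. The names `verdict_key`, `VerdictCache`, and `PINNED_MODEL`, and the in-memory dict, are assumptions for the example; the project's real key format and storage are not described in this post.

```python
import hashlib
import json

# Assumed placeholder; the real model id is whatever the tool pins in its ledger.
PINNED_MODEL = "example-model-id"


def verdict_key(diff: str, title: str, model: str = PINNED_MODEL) -> str:
    """Derive a cache key from every input that can change the verdict."""
    payload = json.dumps(
        {"diff": diff, "title": title, "model": model}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class VerdictCache:
    """In-memory stand-in for a verdict store keyed by content hash."""

    def __init__(self, judge):
        self._judge = judge   # hypothetical callable: (diff, title) -> verdict
        self._store = {}
        self.judge_calls = 0

    def verdict(self, diff: str, title: str):
        key = verdict_key(diff, title)
        if key not in self._store:
            # Only an unseen (diff, title, model) triple reaches the judge.
            self.judge_calls += 1
            self._store[key] = self._judge(diff, title)
        return self._store[key]


if __name__ == "__main__":
    cache = VerdictCache(lambda diff, title: "needs-review")
    cache.verdict("- old\n+ new", "Fix login bug")
    cache.verdict("- old\n+ new", "Fix login bug")
    print(cache.judge_calls)
```

Including the model id in the hashed payload means a model change produces new keys instead of silently reusing verdicts from the old model.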

The honest part

The real-corpus F1 is 0.167 across 205 AI-labeled PRs (10 broken, 195 clean, eight agent vendors). Precision is 0.100. Recall is 0.500.
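
If you want to check the arithmetic yourself, the short script below recomputes the headline from confusion counts. The counts are inferred from the reported rates and the 10 broken PRs, not published by the project: recall 0.500 implies 5 true positives and 5 misses, and precision 0.100 then implies 50 flagged PRs, 45 of them false positives. The pre-10.3.0 numbers work out the same way from 3 true positives among 45 flags.

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from raw confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts below are inferred from the reported rates (an assumption).
for label, counts in {
    "before": dict(tp=3, fp=42, fn=7),
    "10.3.0": dict(tp=5, fp=45, fn=5),
}.items():
    p, r, f1 = prf(**counts)
    print(f"{label}: P={p:.3f} R={r:.3f} F1={f1:.3f}")
# before: P=0.067 R=0.300 F1=0.109
# 10.3.0: P=0.100 R=0.500 F1=0.167
```

Under those inferred counts, the two extra true positives came with three extra false positives, so precision still rose slightly while recall jumped.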

That precision number is exactly why the default mode is advise and not gate. At 0.100 precision, roughly nine in ten flags are false positives. The tool is calibrated to be useful as a reviewer-assist signal, not a merge blocker. If you want it to block, opt in: --mode gate.

The 205-PR corpus is currently labeled by an AI judge with "pending human review" stamped on every entry. That's the largest credibility hole in the project and the next milestone closes it. The labeling rubric, the kappa script, and the labels-v2 scaffold already live in the repo.

Don't read this as "ship this into your release gate today." Read it as "here's a tool you can run in shadow mode, look at what it flags, and decide for yourself if those flags are useful."

Try it

git clone https://github.com/moonrunnerkc/swarm-orchestrator.git
cd swarm-orchestrator
npm install
npm run build
npm link

# audit a PR (advisory, never blocks)
GITHUB_TOKEN=... swarm audit owner/repo#PR