A green test suite is supposed to mean the change works. It doesn't. A test can be weakened just enough to pass. An error can be caught and thrown away. A rename can stop halfway and still compile. None of that turns red, and none of it shows up in the linters most teams already run.
Swarm Orchestrator is built to catch exactly that class of problem in AI-written pull requests.
The gap linters leave
Semgrep and ESLint are built around risky APIs and known-bad code patterns. Whether a diff is honest is a different question. They won't tell you a test was edited until it passed, or that a catch block quietly eats the error it caught. That's the gap.
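To make the swallowed-error case concrete, here is a minimal sketch. The names `loadConfigSwallowing`, `loadConfigStrict`, and `DEFAULT_CONFIG` are hypothetical and are not taken from any of the pull requests discussed here. Both functions compile and neither calls a risky API, which is why a pattern-based linter has little to say about the difference.

```typescript
// Illustrative only: every name in this block is hypothetical.
type Config = { port: number };

const DEFAULT_CONFIG: Config = { port: 8080 };

// Swallowed error: the catch block discards the failure and returns a
// default, so callers cannot tell a valid config from a corrupt one.
function loadConfigSwallowing(raw: string): Config {
  try {
    return JSON.parse(raw) as Config;
  } catch {
    return DEFAULT_CONFIG;
  }
}

// Honest version: the failure is surfaced with context instead of hidden.
function loadConfigStrict(raw: string): Config {
  try {
    return JSON.parse(raw) as Config;
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    throw new Error(`invalid config: ${reason}`);
  }
}
```

A test that only feeds valid JSON passes against both versions, so the suite stays green either way.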
Two examples from merged Cloudflare pull requests:
| PR | Finding | Semgrep + ESLint |
|---|---|---|
| workers-sdk#14063 | Function renamed, some callers still using the old name | No finding |
| workers-sdk#14132 | Empty catch block hiding errors | No finding |
Across 72 known-bad pull requests from 12 repositories, that pair of analyzers produced one finding. The auditor flagged 67.
What the auditor checks
Eleven checks total. Eight run by default. The other three exist but stay off, because they haven't shown useful signal on real pull requests yet, and a noisy check is worse than no check.
The default set looks for things like:
- Errors caught and ignored
- Renames left unfinished
- Test coverage reduced
- Tests weakened
- Assertions removed
- New @ts-ignore or eslint-disable comments
- Test-only fixes with no code change behind them
- Mocks pointing at modules that don't exist
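As a rough idea of what a diff-level check can look like, the sketch below scans a unified diff for suppression comments that appear only on added lines. It is an assumption-laden illustration, not the auditor's implementation: `findNewSuppressions`, `SuppressionFinding`, and the sample diff are all made up for this example.

```typescript
// Hypothetical helper, not the tool's detector.
interface SuppressionFinding {
  line: string;      // the added line, without the leading "+"
  directive: string; // which suppression comment was matched
}

const SUPPRESSION_DIRECTIVES = ["@ts-ignore", "eslint-disable"];

function findNewSuppressions(diff: string): SuppressionFinding[] {
  const findings: SuppressionFinding[] = [];
  for (const raw of diff.split("\n")) {
    // Added lines start with "+". Lines starting with "+++" are treated as
    // file headers; that is a simplification a real diff parser would avoid.
    if (!raw.startsWith("+") || raw.startsWith("+++")) continue;
    const line = raw.slice(1);
    for (const directive of SUPPRESSION_DIRECTIVES) {
      if (line.includes(directive)) {
        findings.push({ line: line.trim(), directive });
        break; // one finding per added line is enough
      }
    }
  }
  return findings;
}

// Made-up diff: one suppression is added, another is removed.
const sampleDiff = [
  "--- a/src/user.ts",
  "+++ b/src/user.ts",
  "@@ -1,3 +1,3 @@",
  " const id = getId();",
  "+// @ts-ignore",
  " save(id as string);",
  "-// eslint-disable-next-line no-console",
].join("\n");

const sampleFindings = findNewSuppressions(sampleDiff);
```

Only the added `@ts-ignore` is reported. The removed `eslint-disable` line is ignored, because deleting a suppression is not the shortcut being looked for.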
Measured, not assumed
The detection rate isn't a guess. Known defects get injected into real pull requests, then the auditor runs against them. It caught 253 of 300, or 84 percent.
Reproduce it:
npm run benchmarks:full
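The headline figure is plain arithmetic and can be checked by hand. The helper name `detectionRate` is hypothetical; only the two counts come from the benchmark described above.

```typescript
// Detection rate = planted defects caught / planted defects total.
function detectionRate(caught: number, planted: number): number {
  if (planted <= 0) throw new RangeError("planted must be positive");
  return caught / planted;
}

const benchmarkRate = detectionRate(253, 300);          // 0.8433...
const benchmarkPercent = Math.round(benchmarkRate * 100); // 84
```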
Runtime mode (optional)
The checks can also execute code instead of only reading a diff: mutation testing, coverage, and reproducing reported issues.
On trpc#6098 it found mutations surviving on lines a later hotfix changed. The tests passed. They weren't actually exercising that code.
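For readers new to mutation testing, a surviving mutation looks roughly like this. The functions below are hypothetical and unrelated to the trpc code. The mutant changes one operator, and a test that never touches the boundary cannot tell the two versions apart.

```typescript
// Hypothetical function under test.
function isAdult(age: number): boolean {
  return age >= 18;
}

// A mutant: the boundary operator is changed from >= to >.
function isAdultMutant(age: number): boolean {
  return age > 18;
}

type AgePredicate = (age: number) => boolean;

// Weak test: only checks values far from the boundary.
function weakTest(fn: AgePredicate): boolean {
  return fn(30) === true && fn(5) === false;
}

// Stronger test: also pins the behavior at exactly 18.
function boundaryTest(fn: AgePredicate): boolean {
  return weakTest(fn) && fn(18) === true;
}

// The weak test passes for both versions, so the mutant survives.
const mutantSurvivesWeakTest = weakTest(isAdult) && weakTest(isAdultMutant);

// The boundary test passes for the original and fails for the mutant.
const mutantKilledByBoundaryTest =
  boundaryTest(isAdult) && !boundaryTest(isAdultMutant);
```

A surviving mutant is evidence that the tests execute a line without actually constraining what it does.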
Why this mode stays optional
Running code is louder than reading a diff: it averages about 3.4 findings on a clean pull request. That noise is fine when you're deliberately hunting, but it's too much to leave on by default, so it's opt-in.
Defining "done" with a contract
The second command is swarm run. You write down what done means:
obligations:
  - type: build-must-pass
    command: npm run build
  - type: test-must-pass
    command: npm test
A patch is accepted only if every obligation passes and the falsifier can't break it. The default provider is deterministic, so identical inputs give identical results, and every input and hash gets written to a hash-chained ledger.
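The acceptance rule can be sketched as a pure function. This is an illustration under assumptions, not the tool's API: `ObligationResult`, `PatchVerdict`, and `judgePatch` are hypothetical names, and a zero exit code is assumed to mean the obligation passed.

```typescript
// Hypothetical shapes that mirror the contract fields above.
interface ObligationResult {
  type: string;     // e.g. "build-must-pass"
  command: string;  // e.g. "npm run build"
  exitCode: number; // assumption: 0 means the command succeeded
}

interface PatchVerdict {
  accepted: boolean;
  reasons: string[]; // empty when the patch is accepted
}

// Accept only if every obligation passed AND the falsifier found nothing.
function judgePatch(
  results: ObligationResult[],
  falsifierBrokeIt: boolean,
): PatchVerdict {
  const reasons = results
    .filter((r) => r.exitCode !== 0)
    .map((r) => `${r.type} failed: ${r.command} exited ${r.exitCode}`);
  if (falsifierBrokeIt) {
    reasons.push("falsifier found a breaking case");
  }
  return { accepted: reasons.length === 0, reasons };
}
```

Because the function has no hidden state, the same results always give the same verdict, which is the property the deterministic default is meant to provide. Note that this sketch accepts an empty obligation list vacuously; a real implementation would likely reject a contract with no obligations.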
Blocking merges
Findings are advisory out of the box. Gate mode can block a merge, but only on reproducible evidence. The structural checks throw too many false positives to trust as automatic blockers on their own.
Right now no runtime signal has enough real-world evidence to justify auto-rejection, so the gate stays open and reports that fact directly instead of pretending otherwise.
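One way to picture that policy is a gate that blocks only on reproducible findings from checks explicitly trusted for blocking, and that says so plainly when the trusted set is empty. Everything here is hypothetical: `decideGate`, `GateFinding`, `GateDecision`, and the check names in the usage lines are invented for illustration.

```typescript
// Hypothetical model of gate mode, not the tool's implementation.
interface GateFinding {
  check: string;         // name of the check that produced the finding
  reproducible: boolean; // true if backed by reproducible evidence
}

interface GateDecision {
  blocked: boolean;
  note: string; // human-readable explanation of the decision
}

function decideGate(
  findings: GateFinding[],
  trustedChecks: ReadonlySet<string>,
): GateDecision {
  // With no trusted checks, the gate reports that it is open rather than
  // silently passing everything.
  if (trustedChecks.size === 0) {
    return {
      blocked: false,
      note: "gate open: no check is trusted for auto-rejection yet",
    };
  }
  const blocking = findings.filter(
    (f) => f.reproducible && trustedChecks.has(f.check),
  );
  if (blocking.length > 0) {
    const names = blocking.map((f) => f.check).join(", ");
    return { blocked: true, note: `blocked by: ${names}` };
  }
  return { blocked: false, note: "no blocking evidence; findings are advisory" };
}

// Usage with made-up check names.
const exampleFindings: GateFinding[] = [
  { check: "swallowed-error", reproducible: false },
  { check: "surviving-mutant", reproducible: true },
];
const openGate = decideGate(exampleFindings, new Set<string>());
const strictGate = decideGate(exampleFindings, new Set(["surviving-mutant"]));
```

With an empty trusted set, which matches the current state described above, `openGate.blocked` is `false` and the note explains why.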
Who it's for
If you review a lot of AI-written pull requests and want signals the usual linters skip, that's the case this is built for. It also emits CycloneDX-ML and SPDX AI BOM documents with --emit-aibom, supports TypeScript and JavaScript, and runs offline.
It points reviewers at the code worth inspecting. It doesn't claim to prove anything bug-free.
moonrunnerkc/swarm-orchestrator
Reviews pull requests for the shortcuts AI coding agents take to look done without being done: relaxed tests, swallowed errors, fake renames, 11 checks in all. Flags them for a human by default, or blocks the merge if you turn that on. Can also turn a goal into a checklist and only accept a patch once every check passes.
Swarm Orchestrator
A CLI for auditing AI-generated PRs and grading patches against typed contracts.
What This Does
Swarm Orchestrator reads a pull-request diff and flags the shortcuts an AI coding agent takes to look done without being done: relaxed tests, stripped assertions, swallowed errors, fake renames, eleven checks in all. On a benchmark of planted cheats it recovers 253 of 300 (84%, up 20.5% from the prior version), and on real merged Cloudflare PRs it caught two cheats that Semgrep and the ESLint security rules missed, both reproducible offline. Findings are advisory by default, so it never blocks a merge unless you turn that on.
Who it's for
- You review AI-written PRs at volume and want a "this change may be gaming the tests" signal that ordinary linters do not give you.
- You have…