Brad Kinnard

Catching the shortcuts AI coding agents take to look done

A green test suite is supposed to mean the change works. It doesn't. A test can be weakened just enough to pass. An error can be caught and thrown away. A rename can stop halfway and still compile. None of that turns red, and none of it shows up in the linters most teams already run.

Swarm Orchestrator is built to catch exactly that class of problem in AI-written pull requests.


Two parts. One audits AI-written PRs for the shortcuts that fake "done" (11 checks). The other gates a patch against a contract you define: it builds, passes tests, satisfies your requirement, and survives a falsifier that tries to break it.

TypeScript, Node 20, ISC license. The audit side runs with no model credentials.

The gap linters leave

Semgrep and ESLint are built around risky APIs and known-bad code patterns. Whether a diff is honest is a different question. They won't tell you a test was edited until it passed, or that a catch block quietly eats the error it caught. That's the gap.
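To make that gap concrete, here is a small TypeScript sketch of two such shortcuts. Every name in it is hypothetical and written for illustration; none of it is taken from the pull requests discussed in this post.

```typescript
// Hypothetical example: both shortcuts compile and keep a suite green.

// Shortcut 1: a swallowed error. The catch block discards the failure,
// so callers get a fallback value and never learn that parsing failed.
function parsePort(raw: string): number {
  try {
    const port = Number.parseInt(raw, 10);
    if (Number.isNaN(port)) throw new Error(`invalid port: ${raw}`);
    return port;
  } catch {
    return 0; // the error is gone; 0 looks like a real value
  }
}

// The honest test: invalid input must be rejected with an error.
function rejectsInvalidInput(): boolean {
  try {
    parsePort("eighty");
    return false; // no error was thrown, so the test fails
  } catch {
    return true;
  }
}

// Shortcut 2: a weakened assertion. It only checks the return type,
// so it passes for any number, including the silent fallback.
function weakenedCheck(): boolean {
  return typeof parsePort("eighty") === "number";
}

console.log(rejectsInvalidInput()); // false: the honest test fails
console.log(weakenedCheck());       // true: the weakened test passes
```

Nothing here calls a risky API or matches a known-bad pattern, which is why a rule set built around those has little to report.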

Two examples from merged Cloudflare pull requests:

| PR | Finding | Semgrep + ESLint |
| --- | --- | --- |
| workers-sdk#14063 | Function renamed, some callers still using the old name | No finding |
| workers-sdk#14132 | Empty catch block hiding errors | No finding |

Across 72 known-bad pull requests from 12 repositories, that pair of analyzers produced one finding. The auditor flagged 67.

What the auditor checks

Eleven checks total. Eight run by default. The other three exist but stay off, because they haven't shown useful signal on real pull requests yet, and a noisy check is worse than no check.

The default set looks for things like:

  • Errors caught and ignored
  • Renames left unfinished
  • Test coverage reduced
  • Tests weakened
  • Assertions removed
  • New @ts-ignore or eslint-disable comments
  • Test-only fixes with no code change behind them
  • Mocks pointing at modules that don't exist
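
As an illustration of how a diff-level check in this family might work, the sketch below scans the added lines of a unified diff for new suppression comments. It is a simplified assumption for explanation, not the auditor's actual implementation; `findNewSuppressions` and its regular expression are hypothetical.

```typescript
// Hypothetical sketch: flag suppression comments introduced by a unified diff.
// Names and heuristics are assumptions, not the auditor's implementation.

interface SuppressionFinding {
  file: string;
  text: string; // the added line, trimmed
}

const SUPPRESSION = /@ts-ignore|eslint-disable/;

function findNewSuppressions(diff: string): SuppressionFinding[] {
  const findings: SuppressionFinding[] = [];
  let file = "(unknown)";
  for (const line of diff.split("\n")) {
    if (line.startsWith("+++ ")) {
      // File header, e.g. "+++ b/src/app.ts". Simplification: an added line
      // whose own content starts with "++ " would be misread as a header.
      file = line.slice(4).replace(/^b\//, "");
      continue;
    }
    // Only added lines count; removed and context lines are ignored.
    if (line.startsWith("+") && SUPPRESSION.test(line)) {
      findings.push({ file, text: line.slice(1).trim() });
    }
  }
  return findings;
}

const sampleDiff = [
  "--- a/src/app.ts",
  "+++ b/src/app.ts",
  "@@ -1,2 +1,3 @@",
  " const x = load();",
  "+// @ts-ignore",
  "+x.run();",
  "-// eslint-disable-next-line no-console",
].join("\n");

// One finding: the added @ts-ignore. The removed eslint-disable line is ignored.
console.log(findNewSuppressions(sampleDiff));
```

Looking only at added lines is the design choice that matters: a suppression that already existed, or one the diff removes, is not a new shortcut.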

Measured, not assumed

The detection rate isn't a guess. Known defects get injected into real pull requests, then the auditor runs against them. It caught 253 of 300, or 84 percent.
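
The arithmetic behind that figure is plain recall: detected defects divided by injected defects. A minimal sketch, where `recall` is an illustrative helper rather than part of the tool:

```typescript
// Recall = detected / injected. Illustrative helper, not part of the tool.
function recall(detected: number, injected: number): number {
  if (injected <= 0) throw new RangeError("injected must be positive");
  return detected / injected;
}

const rate = recall(253, 300);
console.log(`${(rate * 100).toFixed(1)}%`); // prints "84.3%"
```

The other 47 injected defects went undetected, which is the number to watch when comparing versions.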

Reproduce it:

npm run benchmarks:full

Runtime mode (optional)

The checks can also execute code instead of only reading a diff: mutation testing, coverage, and reproducing reported issues.

On trpc#6098 it found mutations surviving on lines a later hotfix changed. The tests passed. They weren't actually exercising that code.
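
A surviving mutant is easiest to see in miniature. The sketch below uses hypothetical functions, not code from trpc: the weak suite passes against both the original and a mutated version, which shows it never exercised the boundary.

```typescript
// Hypothetical example of a surviving mutant; not code from trpc.

type Predicate = (age: number) => boolean;

const isAdult: Predicate = (age) => age >= 18;      // original
const isAdultMutant: Predicate = (age) => age > 18; // mutant: ">=" became ">"

// A suite that never tests the boundary value 18.
function weakSuite(fn: Predicate): boolean {
  return fn(30) === true && fn(5) === false;
}

// A suite that pins the boundary.
function boundarySuite(fn: Predicate): boolean {
  return weakSuite(fn) && fn(18) === true;
}

// A mutant "survives" when the suite still passes against it.
function survives(suite: (fn: Predicate) => boolean, mutant: Predicate): boolean {
  return suite(mutant);
}

console.log(survives(weakSuite, isAdultMutant));     // true: the mutant survives
console.log(survives(boundarySuite, isAdultMutant)); // false: the mutant is killed
```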

Why this mode stays optional
Running code is louder than reading a diff: it averages about 3.4 findings on a clean pull request. That noise is fine when you're deliberately hunting, but it's too much to leave on by default, so it's opt-in.

Defining "done" with a contract

The second command is swarm run. You write down what done means:

obligations:
  - type: build-must-pass
    command: npm run build
  - type: test-must-pass
    command: npm test
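
One way to picture how a contract like this could be checked: run each obligation's command and accept only if all of them succeed. The sketch below is an assumption about the general shape, not the actual `swarm run` implementation. The `Obligation` type mirrors the YAML fields above; `evaluateContract` and the injected `run` function are hypothetical, and the falsifier step is left out.

```typescript
// Hypothetical sketch of contract evaluation; not the swarm run implementation.

interface Obligation {
  type: string;    // e.g. "build-must-pass"
  command: string; // e.g. "npm run build"
}

interface ObligationResult {
  obligation: Obligation;
  passed: boolean;
}

// The runner returns a process exit code. It is injected so the logic can be
// exercised without spawning real processes.
type CommandRunner = (command: string) => number;

function evaluateContract(
  obligations: Obligation[],
  run: CommandRunner,
): { accepted: boolean; results: ObligationResult[] } {
  const results = obligations.map((obligation) => ({
    obligation,
    passed: run(obligation.command) === 0, // exit code 0 means success
  }));
  // Every obligation must pass. An empty contract is rejected here because
  // it defines nothing to verify (a choice made for this sketch).
  const accepted = results.length > 0 && results.every((r) => r.passed);
  return { accepted, results };
}

const contract: Obligation[] = [
  { type: "build-must-pass", command: "npm run build" },
  { type: "test-must-pass", command: "npm test" },
];

// Stub runner for illustration: pretend the build succeeds and the tests fail.
const stubRunner: CommandRunner = (command) => (command === "npm test" ? 1 : 0);

console.log(evaluateContract(contract, stubRunner).accepted); // false
```

In real use the runner could wrap Node's `child_process.spawnSync` and return its `status` field.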

A patch is accepted only if every obligation passes and the falsifier can't break it. The default provider is deterministic, so identical inputs give identical results, and every input and hash gets written to a hash-chained ledger.
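
For readers unfamiliar with the term, a hash-chained ledger stores, in each entry, the hash of the entry before it, so editing any earlier record invalidates everything after it. The sketch below shows the idea with Node's built-in `crypto` module. The entry layout and function names are assumptions for illustration, not the tool's actual format.

```typescript
import { createHash } from "node:crypto";

// Hypothetical sketch of a hash-chained ledger; the layout is an assumption.

interface LedgerEntry {
  index: number;
  payload: string;  // e.g. serialized inputs and results
  prevHash: string; // hash of the previous entry
  hash: string;     // SHA-256 over index, prevHash, and payload
}

const GENESIS = "0".repeat(64);

function hashEntry(index: number, prevHash: string, payload: string): string {
  return createHash("sha256")
    .update(`${index}\n${prevHash}\n${payload}`)
    .digest("hex");
}

function appendEntry(ledger: LedgerEntry[], payload: string): LedgerEntry[] {
  const last = ledger[ledger.length - 1];
  const prevHash = last ? last.hash : GENESIS;
  const index = ledger.length;
  const hash = hashEntry(index, prevHash, payload);
  return [...ledger, { index, payload, prevHash, hash }];
}

function verifyLedger(ledger: LedgerEntry[]): boolean {
  let prevHash = GENESIS;
  for (let i = 0; i < ledger.length; i++) {
    const entry = ledger[i];
    if (!entry || entry.index !== i || entry.prevHash !== prevHash) return false;
    if (entry.hash !== hashEntry(entry.index, entry.prevHash, entry.payload)) return false;
    prevHash = entry.hash;
  }
  return true;
}

let ledger: LedgerEntry[] = [];
ledger = appendEntry(ledger, JSON.stringify({ obligation: "build-must-pass", passed: true }));
ledger = appendEntry(ledger, JSON.stringify({ obligation: "test-must-pass", passed: true }));
console.log(verifyLedger(ledger)); // true

// Tampering with an earlier payload breaks verification.
const tampered = ledger.map((e, i) => (i === 0 ? { ...e, payload: "{}" } : e));
console.log(verifyLedger(tampered)); // false
```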

Blocking merges

Findings are advisory out of the box. Gate mode can block a merge, but only on reproducible evidence. The structural checks throw too many false positives to trust as automatic blockers on their own.

Right now no runtime signal has enough real-world evidence to justify auto-rejection, so the gate stays open and reports that fact directly instead of pretending otherwise.
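
The policy in this section can be read as a small decision function. The sketch below is one interpretation, with hypothetical field and check names rather than the tool's actual schema: advisory mode never blocks, and gate mode blocks only when at least one finding carries reproducible evidence.

```typescript
// Hypothetical sketch of the gating policy; field names are assumptions.

interface Finding {
  check: string;
  reproducible: boolean; // backed by evidence that can be re-run
}

type Mode = "advisory" | "gate";

interface Decision {
  block: boolean;
  reason: string;
}

function decide(findings: Finding[], mode: Mode): Decision {
  if (mode === "advisory") {
    return { block: false, reason: "advisory mode: findings are reported only" };
  }
  // Findings without reproducible evidence never block on their own.
  const blocking = findings.filter((f) => f.reproducible);
  if (blocking.length === 0) {
    return { block: false, reason: "gate open: no reproducible evidence to block on" };
  }
  return { block: true, reason: `blocked by: ${blocking.map((f) => f.check).join(", ")}` };
}

// A structural finding with no reproducible evidence leaves the gate open,
// and the reason string says so explicitly.
const findings: Finding[] = [{ check: "swallowed-error", reproducible: false }];
console.log(decide(findings, "gate"));
```

Returning a reason even when nothing is blocked mirrors the point above: an open gate should say why it is open.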

Who it's for

If you review a lot of AI-written pull requests and want signals the usual linters skip, that's the case this is built for. It also emits CycloneDX-ML and SPDX AI BOM documents with --emit-aibom, supports TypeScript and JavaScript, and runs offline.

It points reviewers at the code worth inspecting. It doesn't claim to prove anything bug-free.

View the repo on GitHub

moonrunnerkc / swarm-orchestrator

Reviews pull requests for the shortcuts AI coding agents take to look done without being done: relaxed tests, swallowed errors, fake renames, 11 checks in all. Flags them for a human by default, or blocks the merge if you turn that on. Can also turn a goal into a checklist and only accept a patch once every check passes.

Swarm Orchestrator

A CLI for auditing AI-generated PRs and grading patches against typed contracts.

license: ISC · node: >= 20 · version: 11.1.1 · oracle recall: 84% (253/300) · real-PR false alarms: 0.11/PR · real-PR cheats vs linters: 4 confirmed (Semgrep+ESLint: 1)



What This Does

Swarm Orchestrator reads a pull-request diff and flags the shortcuts an AI coding agent takes to look done without being done: relaxed tests, stripped assertions, swallowed errors, fake renames, eleven checks in all. On a benchmark of planted cheats it recovers 253 of 300 (84%, up 20.5% from the prior version), and on real merged Cloudflare PRs it caught two cheats that Semgrep and the ESLint security rules missed, both reproducible offline. Findings are advisory by default, so it never blocks a merge unless you turn that on.

Who it's for

  • You review AI-written PRs at volume and want a "this change may be gaming the tests" signal that ordinary linters do not give you.
  • You have…
