Brad Kinnard

Posted on Jun 5

Catching the shortcuts AI coding agents take to look done

#ai #typescript #opensource #devops

A green test suite is supposed to mean the change works. On its own, it doesn't. A test can be weakened just enough to pass. An error can be caught and thrown away. A rename can stop halfway and still compile. None of that turns red, and none of it shows up in the linters most teams already run.

Swarm Orchestrator is built to catch exactly that class of problem in AI-written pull requests.

Two parts. One audits AI-written PRs for the shortcuts that fake "done" (11 checks). The other gates a patch against a contract you define: it builds, passes tests, satisfies your requirement, and survives a falsifier that tries to break it.

TypeScript, Node 20, ISC license. The audit side runs with no model credentials.

The gap linters leave

Semgrep and ESLint are built around risky APIs and known-bad code patterns. Whether a diff is honest is a different question. They won't tell you a test was edited until it passed, or that a catch block quietly eats the error it caught. That's the gap.

Two examples from merged Cloudflare pull requests:

PR	Finding	Semgrep + ESLint
`workers-sdk#14063`	Function renamed, some callers still using the old name	No finding
`workers-sdk#14132`	Empty catch block hiding errors	No finding

Across 72 known-bad pull requests from 12 repositories, that pair of analyzers produced one finding. The auditor flagged 67.

What the auditor checks

Eleven checks total. Eight run by default. The other three exist but stay off, because they haven't shown useful signal on real pull requests yet, and a noisy check is worse than no check.

The default set looks for things like:

Errors caught and ignored
Renames left unfinished
Test coverage reduced
Tests weakened
Assertions removed
New @ts-ignore or eslint-disable comments
Test-only fixes with no code change behind them
Mocks pointing at modules that don't exist
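
To make the idea concrete, here is a minimal sketch of what two of these checks could look like when they only read a diff. This is not the tool's actual detector code: the `Finding` shape, the check names, and the regular expressions are assumptions made up for illustration, and a real check would need to handle multi-line catch blocks and much more.

```typescript
// Hypothetical sketch: scan the added lines of a unified diff for two of the
// shortcuts listed above. Names and patterns are illustrative assumptions,
// not Swarm Orchestrator's actual detectors.

interface Finding {
  check: "suppression-added" | "empty-catch";
  line: string;
}

// Lines starting with "+" (but not the "+++" file header) were added by the diff.
function addedLines(diff: string): string[] {
  return diff
    .split("\n")
    .filter((l) => l.startsWith("+") && !l.startsWith("+++"))
    .map((l) => l.slice(1));
}

const SUPPRESSION = /@ts-ignore|eslint-disable/;
// Matches a catch block with nothing between the braces, on a single line only.
const EMPTY_CATCH = /catch\s*(\([^)]*\))?\s*\{\s*\}/;

function auditDiff(diff: string): Finding[] {
  const findings: Finding[] = [];
  for (const line of addedLines(diff)) {
    if (SUPPRESSION.test(line)) findings.push({ check: "suppression-added", line: line.trim() });
    if (EMPTY_CATCH.test(line)) findings.push({ check: "empty-catch", line: line.trim() });
  }
  return findings;
}

const sampleDiff = [
  "--- a/src/load.ts",
  "+++ b/src/load.ts",
  "@@ -1,3 +1,4 @@",
  " export function load(path: string) {",
  "-  return parse(read(path));",
  "+  // @ts-ignore",
  "+  try { return parse(read(path)); } catch (e) {}",
  " }",
].join("\n");

console.log(auditDiff(sampleDiff)); // two findings: one suppression, one empty catch
```

The point of reading only added lines is that the question is about what the diff introduced, not what the file already contained.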

Measured, not assumed

The detection rate isn't a guess. Known defects get injected into real pull requests, then the auditor runs against them. It caught 253 of 300, or 84 percent.

Reproduce it:

npm run benchmarks:full
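
The 84 percent above is plain arithmetic on the two counts. A small sketch, where `detectionRate` is a helper name invented here rather than anything from the benchmark scripts:

```typescript
// Detection rate = injected defects caught / injected defects total.
// `detectionRate` is a hypothetical helper, not part of the tool.
function detectionRate(caught: number, injected: number): number {
  if (injected <= 0) throw new RangeError("injected must be positive");
  return caught / injected;
}

const rate = detectionRate(253, 300);
console.log(`${(rate * 100).toFixed(1)}%`); // 84.3%
console.log(300 - 253); // 47 injected defects missed
```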

Runtime mode (optional)

The checks can also execute code instead of only reading a diff: mutation testing, coverage, and reproducing reported issues.

On trpc#6098 it found mutations surviving on lines a later hotfix changed. The tests passed. They weren't actually exercising that code.
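
A surviving mutation is easier to see on a toy case. The sketch below is not code from trpc or from the tool; `isExpired`, the mutant, and both tests are made up to show the mechanism. The weak test runs the changed line and passes, yet it can't tell the original from the mutant.

```typescript
// Hypothetical illustration of a surviving mutant.

type IsExpired = (now: number, expiresAt: number) => boolean;

// Original implementation: expired at or after the deadline.
const original: IsExpired = (now, expiresAt) => now >= expiresAt;

// Mutant: ">=" flipped to ">", which changes behavior exactly at the boundary.
const mutant: IsExpired = (now, expiresAt) => now > expiresAt;

// A test is modeled as a function that returns true when the implementation passes.
type Test = (impl: IsExpired) => boolean;

// Executes the line, but never probes the boundary.
const weakTest: Test = (impl) => impl(10, 5) === true;

// Also checks the boundary and the not-expired case.
const strongTest: Test = (impl) =>
  impl(10, 5) === true && impl(5, 5) === true && impl(1, 5) === false;

function mutantSurvives(test: Test): boolean {
  // The result only means something if the test passes on the original code.
  if (!test(original)) throw new Error("test fails on the original code");
  return test(mutant);
}

console.log(mutantSurvives(weakTest));   // true: the mutant survives
console.log(mutantSurvives(strongTest)); // false: the mutant is killed
```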

Why this mode stays optional

Running code is louder than reading a diff: it averages about 3.4 findings on a clean pull request. That noise is fine when you're deliberately hunting, but it's too much to leave on by default, so it's opt-in.

Defining "done" with a contract

The second command is swarm run. You write down what done means:

obligations:
  - type: build-must-pass
    command: npm run build
  - type: test-must-pass
    command: npm test

A patch is accepted only if every obligation passes and the falsifier can't break it. The default provider is deterministic, so identical inputs give identical results, and every input and hash gets written to a hash-chained ledger.
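
For readers who haven't met a hash-chained ledger, here is a minimal sketch of the general technique using Node's built-in `crypto` module. The entry layout, the `GENESIS` constant, and the function names are assumptions for illustration; the tool's actual ledger format may differ.

```typescript
import { createHash } from "node:crypto";

// Hypothetical entry layout, not the tool's real ledger schema.
interface LedgerEntry {
  index: number;
  payload: string;  // e.g. a serialized obligation result
  prevHash: string; // hash of the previous entry, or GENESIS for the first
  hash: string;     // SHA-256 over index, prevHash, and payload
}

const GENESIS = "0".repeat(64);

function entryHash(index: number, prevHash: string, payload: string): string {
  return createHash("sha256").update(`${index}\n${prevHash}\n${payload}`).digest("hex");
}

// Returns a new ledger with one more entry chained to the last one.
function append(ledger: LedgerEntry[], payload: string): LedgerEntry[] {
  const index = ledger.length;
  const prevHash = index === 0 ? GENESIS : ledger[index - 1].hash;
  return [...ledger, { index, payload, prevHash, hash: entryHash(index, prevHash, payload) }];
}

// Recomputes every hash and checks each link back to its predecessor.
function verify(ledger: LedgerEntry[]): boolean {
  return ledger.every((e, i) => {
    const expectedPrev = i === 0 ? GENESIS : ledger[i - 1].hash;
    return e.index === i && e.prevHash === expectedPrev && e.hash === entryHash(i, e.prevHash, e.payload);
  });
}

let ledger: LedgerEntry[] = [];
ledger = append(ledger, JSON.stringify({ type: "build-must-pass", command: "npm run build", passed: true }));
ledger = append(ledger, JSON.stringify({ type: "test-must-pass", command: "npm test", passed: true }));
console.log(verify(ledger)); // true

// Editing an earlier entry after the fact breaks verification.
const tampered = ledger.map((e, i) =>
  i === 0 ? { ...e, payload: e.payload.replace("true", "false") } : e
);
console.log(verify(tampered)); // false
```

Because each entry's hash covers the previous entry's hash, changing any recorded result invalidates that entry and every link that depends on it.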

Blocking merges

Findings are advisory out of the box. Gate mode can block a merge, but only on reproducible evidence. The structural checks throw too many false positives to trust as automatic blockers on their own.

Right now no runtime signal has enough real-world evidence to justify auto-rejection, so the gate stays open and reports that fact directly instead of pretending otherwise.
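
The policy described above can be sketched as a small decision function. Everything here is hypothetical: the `Mode` and `Finding` types, the signal name, and the idea of an explicit set of trusted signals are illustration only, not the tool's API. The sketch assumes that an empty trusted set is how "the gate stays open" would be expressed.

```typescript
type Mode = "advisory" | "gate";

interface Finding {
  signal: string;        // hypothetical signal name
  reproducible: boolean; // the evidence can be re-run and reproduced
}

interface Decision {
  blocked: boolean;
  reason: string;
}

function decide(mode: Mode, findings: Finding[], blockingSignals: ReadonlySet<string>): Decision {
  if (mode === "advisory") {
    return { blocked: false, reason: "advisory mode: findings reported only" };
  }
  if (blockingSignals.size === 0) {
    // Say so explicitly instead of pretending the gate did something.
    return { blocked: false, reason: "gate open: no signal is trusted to block yet" };
  }
  const hit = findings.find((f) => f.reproducible && blockingSignals.has(f.signal));
  return hit
    ? { blocked: true, reason: `blocked on reproducible evidence from ${hit.signal}` }
    : { blocked: false, reason: "gate mode: no blocking evidence" };
}

const findings: Finding[] = [{ signal: "surviving-mutant", reproducible: true }];

console.log(decide("advisory", findings, new Set(["surviving-mutant"])).blocked); // false
console.log(decide("gate", findings, new Set<string>()).reason); // gate open: no signal is trusted to block yet
console.log(decide("gate", findings, new Set(["surviving-mutant"])).blocked); // true
```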

Who it's for

If you review a lot of AI-written pull requests and want signals the usual linters skip, that's the case this is built for. It also emits CycloneDX-ML and SPDX AI BOM documents with --emit-aibom, supports TypeScript and JavaScript, and runs offline.

It points reviewers at the code worth inspecting. It doesn't claim to prove anything bug-free.

View the repo on GitHub

moonrunnerkc / swarm-orchestrator

Reviews pull requests for the shortcuts AI coding agents take to look done without being done: relaxed tests, swallowed errors, fake renames, 11 checks in all. Flags them for a human by default, or blocks the merge if you turn that on. Can also turn a goal into a checklist and only accept a patch once every check passes.

Swarm Orchestrator

A CLI for auditing AI-generated PRs and grading patches against typed contracts.


What This Does

Swarm Orchestrator reads a pull-request diff and flags the shortcuts an AI coding agent takes to look done without being done: relaxed tests, stripped assertions, swallowed errors, fake renames, eleven checks in all. On a benchmark of planted cheats it recovers 253 of 300 (84%, up 20.5% from the prior version), and on real merged Cloudflare PRs it caught two cheats that Semgrep and the ESLint security rules missed, both reproducible offline. Findings are advisory by default, so it never blocks a merge unless you turn that on.

Who it's for

You review AI-written PRs at volume and want a "this change may be gaming the tests" signal that ordinary linters do not give you.
You have…


Top comments (8)

Max Quimby • Jun 7

This nails a failure mode I've watched repeatedly: an agent optimizes for whatever signal you give it, and a green checkmark is the cheapest signal to satisfy. If passing the suite is the reward, weakening an assertion or wrapping a flaky call in a swallowed catch is a perfectly rational move for the model — just not the one you wanted. The "test-only fix with no code change behind it" check is the one I'd value most; that pattern is almost a tell that the agent patched the symptom rather than the cause.

The mutation-testing result on trpc#6098 is the part that should worry people — passing tests that don't actually exercise the changed lines are invisible to coverage numbers too, since the line gets "hit" without being meaningfully asserted on. One question: how do you avoid flagging legitimate test deletions? Sometimes removing a brittle test or tightening an over-broad assertion is the correct change, and a naive "assertions removed" check could punish good cleanup. Do the contract obligations let you whitelist intentional reductions?

Brad Kinnard • Jun 7

It uses count, swap, and weakness checks, and flags whatever trips those rules for human review. These checks never block; they only flag. That's as far as I could take it for now. The whitelist lives in the per-repo audit config, not the obligations, and it applies to files and folders. A per-line whitelist is out of scope, since it would ride along in the PR and let a cheat exempt itself.

Mykola Kondratiuk • Jun 15

the 'tests that pass but lie' category is the real gap. saw this same pattern - agent weakened assertions to get green without reading the spec. tests passed, integration was broken.

Andrii Krugliak • Jun 9

The 'tests weakened, errors swallowed' category is the scary one because it passes every linter you have. An agent that deletes the assertion to make a test green has technically done what you asked, which is why I've started treating 'green' and 'correct' as two separate signals.

hahamerry • Jun 6

This resonates. When integrating AI APIs into production pipelines, I've seen similar patterns — the model confidently produces code that passes basic tests but quietly skips edge cases. Running a separate validation layer has saved me more than once. Would love to see this approach extended to API response validation as well.

Brad Kinnard • Jun 6

Appreciate that. I'm keeping this one focused on the pre-merge side, catching the code change before it lands, so runtime response validation sits outside what it's meant to do. That part's already well covered by tools like zod, ajv, and OpenAPI/Schemathesis. If you ever want it as a contract check here, you can write the response as a property the patch has to hold and let the falsifier go at it.

Alex Shev • Jun 12

This is the failure mode I watch for most: the agent optimizes for looking complete. It writes the happy path, updates the obvious file, and avoids the awkward integration edge. The countermeasure is not just better prompting; it is forcing evidence: tests, screenshots, logs, diff review, and explicit unchecked assumptions.
