DEV Community

Brad Kinnard

Your AI's tests pass. That doesn't mean the code works.

You ask a coding agent to fix a bug. It writes the code, writes the tests, CI goes green, you merge. The bug's still there.

The agent's job was to turn the check green. The honest way to do that is to fix the code. The lazy way is to write a test that passes no matter what the code does. CI can't tell those two apart. A green check means the tests passed, not that the code is right.

It's easy to miss in review, because the test sits right there looking like proof:

test("parses the config", () => {
  const result = parseConfig(rawInput);
  expect(result).toBeDefined();
});

That passes whether parseConfig works perfectly or returns an empty object on every input. The only thing it rules out is a literal undefined. Adding more tests like it just raises your coverage number, not your odds of catching a bad change.
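Here's a minimal sketch of the gap, in plain JavaScript so it runs without a test runner. Both parser implementations and the `key=value` input format are assumptions made up for illustration; the post doesn't show what parseConfig actually does.

```javascript
// Hypothetical implementations, for illustration only.
// A working parser: turns "key=value" lines into an object.
function parseConfigWorking(raw) {
  const config = {};
  for (const line of raw.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // skip blank lines
    const idx = trimmed.indexOf("=");
    if (idx === -1) continue; // skip lines without a separator
    config[trimmed.slice(0, idx).trim()] = trimmed.slice(idx + 1).trim();
  }
  return config;
}

// A broken parser: ignores its input and returns an empty object.
function parseConfigBroken(_raw) {
  return {};
}

const rawInput = "host=localhost\nport=8080";

// The weak check: "is the result defined?"
const weakCheck = (parse) => parse(rawInput) !== undefined;

// A stronger check: compare against the exact expected output.
const strongCheck = (parse) =>
  JSON.stringify(parse(rawInput)) ===
  JSON.stringify({ host: "localhost", port: "8080" });

console.log(weakCheck(parseConfigWorking), weakCheck(parseConfigBroken)); // true true
console.log(strongCheck(parseConfigWorking), strongCheck(parseConfigBroken)); // true false
```

The weak check can't tell the two parsers apart. The strong one can, because it pins down what the output should be.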

So I built ClaimCheck (https://github.com/moonrunnerkc/claimcheck). Instead of trusting the agent's tests, it tries to break them. If a test still passes after the supposedly fixed code is broken on purpose, the test was never really checking the fix, and it gets blocked. The check is deterministic: same input, same verdict, with no AI making the call. So far it's caught every cheat in a set of twelve hand-built cases. Twelve is small, and there's no public release yet, so treat that as a direction, not a finished result.
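The break-it-on-purpose idea fits in a few lines. This is a sketch of the principle only, not ClaimCheck's implementation or API: the `slugify` functions, the two tests, and the `verifiesFix` helper are all hypothetical names invented for this example.

```javascript
// Hypothetical fix under review: slugify should collapse repeated spaces.
const slugifyFixed = (s) => s.trim().toLowerCase().split(/\s+/).join("-");
// The same function with the fix deliberately reverted.
const slugifyReverted = (s) => s.trim().toLowerCase().split(" ").join("-");

// Two candidate tests. Each takes the implementation and throws on failure.
function vacuousTest(slugify) {
  if (slugify("Hello  World") === undefined) throw new Error("undefined");
}
function realTest(slugify) {
  const got = slugify("Hello  World");
  if (got !== "hello-world") throw new Error(`got ${got}`);
}

// Returns true if the test passes against the given implementation.
function passes(test, impl) {
  try {
    test(impl);
    return true;
  } catch {
    return false;
  }
}

// A test verifies the fix only if it passes on the fixed code
// and fails once the fix is reverted.
function verifiesFix(test, fixed, broken) {
  return passes(test, fixed) && !passes(test, broken);
}

console.log(verifiesFix(vacuousTest, slugifyFixed, slugifyReverted)); // false
console.log(verifiesFix(realTest, slugifyFixed, slugifyReverted)); // true
```

Nothing in `verifiesFix` is a judgment call. It runs the test twice and compares the results, which is why the verdict is repeatable.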

Some cheats slip through anyway. If the agent writes a real, solid test that locks in the wrong answer, every check passes. The only way to know the answer's wrong is to already know the right one, and nothing in the pull request can tell you that except the agent you're trying to catch. The one thing that helps is evidence from outside the pull request, like a human-written bug report you can run the fix against.
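To make that blind spot concrete, here's a sketch where a strict test locks in the wrong number. The `applyDiscount` functions and the bug report are invented for illustration; the only point is that the break-it check passes while the outside evidence disagrees.

```javascript
// Hypothetical bug report, written by a human before the agent ran:
// "applyDiscount(200, 10) should return 180, but returns 200."
const bugReport = { args: [200, 10], expected: 180 };

// Old code: ignores the discount entirely (the reported bug).
const applyDiscountOld = (price, percent) => price;
// Agent's "fix": changes the behavior, but in the wrong direction.
const applyDiscountAgent = (price, percent) => price + (price * percent) / 100;

// Agent's test: specific and strict, but it locks in the wrong answer.
function agentTest(applyDiscount) {
  return applyDiscount(200, 10) === 220;
}

// Break-the-fix check: the test passes on new code and fails on reverted code.
const survivesBreakCheck =
  agentTest(applyDiscountAgent) && !agentTest(applyDiscountOld);

// Outside evidence: replay the bug report against the new code.
const matchesBugReport =
  applyDiscountAgent(...bugReport.args) === bugReport.expected;

console.log(survivesBreakCheck); // true
console.log(matchesBugReport); // false
```

The test really does depend on the change, so breaking the change breaks the test. Only the expected value from the bug report shows that the change itself is wrong.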

There's a second, wider tool, Swarm Orchestrator (https://github.com/moonrunnerkc/swarm-orchestrator). It flags suspicious changes and keeps a tamper-evident record for audits. The record-keeping is the solid part. The catching is not: on real pull requests its accuracy is still low, and that's the half I'm hardening now.
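One common way to get a tamper-evident record is a hash chain, where each entry's hash covers the entry before it. The sketch below assumes that approach for illustration; it is not taken from Swarm Orchestrator, and the entry fields (`pr`, `verdict`) are made up.

```javascript
const { createHash } = require("node:crypto");

// Hash of one entry, bound to the hash of the entry before it.
function entryHash(prevHash, payload) {
  return createHash("sha256")
    .update(prevHash)
    .update(JSON.stringify(payload))
    .digest("hex");
}

// Append a payload to the log, chaining it to the previous entry.
function append(log, payload) {
  const prevHash = log.length ? log[log.length - 1].hash : "genesis";
  return [...log, { payload, prevHash, hash: entryHash(prevHash, payload) }];
}

// Recompute every hash; any edited or reordered entry breaks the chain.
function verify(log) {
  let prevHash = "genesis";
  for (const entry of log) {
    if (entry.prevHash !== prevHash) return false;
    if (entry.hash !== entryHash(prevHash, entry.payload)) return false;
    prevHash = entry.hash;
  }
  return true;
}

let log = [];
log = append(log, { pr: "example-1", verdict: "flagged" });
log = append(log, { pr: "example-2", verdict: "clean" });
console.log(verify(log)); // true

// Simulate tampering: rewrite an earlier verdict after the fact.
const tampered = log.map((e, i) =>
  i === 0 ? { ...e, payload: { ...e.payload, verdict: "clean" } } : e
);
console.log(verify(tampered)); // false
```

A chain like this only catches edits if the latest hash is also stored somewhere the editor can't rewrite. Otherwise someone could change an entry and recompute every hash after it.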

The next step is comparing the old code's behavior directly against the new code's. The catch is that a wrong change and a harmless cleanup can look the same from the outside, and a tool that blocks good code is worse than one that lets a bad change through. That's the part I'm still working out.
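A rough sketch of what that comparison could look like: run the old and new versions on the same inputs and collect every disagreement. The `average` functions and the `diffBehavior` helper are hypothetical, and a real tool would also need a way to choose inputs, which this skips.

```javascript
// Old implementation (hypothetical): average of a list, truncated.
const averageOld = (xs) =>
  Math.floor(xs.reduce((a, b) => a + b, 0) / xs.length);

// Candidate A: a cleanup with identical behavior on these inputs.
function averageCleanup(xs) {
  let sum = 0;
  for (const x of xs) sum += x;
  return Math.floor(sum / xs.length);
}

// Candidate B: a behavior change (rounds instead of truncating).
const averageChanged = (xs) =>
  Math.round(xs.reduce((a, b) => a + b, 0) / xs.length);

// Run both versions on the same inputs and collect every disagreement.
function diffBehavior(oldFn, newFn, inputs) {
  const diffs = [];
  for (const input of inputs) {
    const before = oldFn(input);
    const after = newFn(input);
    if (!Object.is(before, after)) diffs.push({ input, before, after });
  }
  return diffs;
}

const inputs = [[1, 2], [2, 4], [1, 2, 3], [5]];
console.log(diffBehavior(averageOld, averageCleanup, inputs).length); // 0
console.log(diffBehavior(averageOld, averageChanged, inputs).length); // 1
```

The one disagreement is on `[1, 2]`, where the result moves from 1 to 2. The diff says that behavior changed there, not whether rounding was the intended fix or a regression, so the comparison alone can't decide what to block.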
