Your AI's tests pass. That doesn't mean the code works.

#testing #ai #programming #opensource

You ask a coding agent to fix a bug. It writes the code, writes the tests, CI goes green, you merge. The bug's still there.

The agent's job was to turn the check green. The honest way to do that is to fix the code. The lazy way is to write a test that passes no matter what the code does. CI can't tell those two apart. A green check means the tests passed, not that the code is right.

It's easy to miss in review, because the test sits right there looking like proof:

test("parses the config", () => {
  const result = parseConfig(rawInput);
  // Only asserts that something came back, not what it contains.
  expect(result).toBeDefined();
});

That passes whether parseConfig works perfectly or returns nothing useful on every input. It checks nothing. Adding more tests like it just raises your coverage number, not your odds of catching a bad change.

So I built ClaimCheck (https://github.com/moonrunnerkc/claimcheck). Instead of trusting the agent's tests, it puts them under pressure: it breaks the supposedly fixed code on purpose and reruns them. If a test still passes, it was never really checking the fix, and it gets blocked. The verdict is deterministic: same answer every time, no AI making the call. So far it's caught every cheat in a set of twelve hand-built cases. Twelve is small, and there's no public release yet, so treat that as a direction, not a finished result.

Some cheats slip through anyway. If the agent writes a real, solid test that locks in the wrong answer, every check passes. The only way to know the answer's wrong is to already know the right one, and nothing in the pull request can tell you that except the agent you're trying to catch. The one thing that helps is a clue from outside the pull request, like a human-written bug report you can run the fix against.

There's a second, wider tool, Swarm Orchestrator (https://github.com/moonrunnerkc/swarm-orchestrator). It flags suspicious changes and keeps a tamper-evident record for audits. The record-keeping is the solid part. The catching is not: on real pull requests its accuracy is still low, and that's the half I'm hardening now.

The next step is comparing the old code's behavior directly against the new code's. The catch is that a wrong change and a harmless cleanup can look the same from the outside, and a tool that blocks good code is worse than one that lets a bad change through. That's the part I'm still working out.
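As a sketch of one way that ambiguity can show up, assuming the comparison just runs both versions on a fixed sample of inputs: a harmless cleanup and an unwanted change can both come back with zero differences if the sample never hits the input where they diverge. The functions and inputs below are invented for illustration and aren't taken from either tool.

```javascript
// Runs both versions on the same inputs and collects every observable difference.
function diffBehavior(oldImpl, newImpl, inputs) {
  const diffs = [];
  for (const input of inputs) {
    const before = oldImpl(input);
    const after = newImpl(input);
    if (!Object.is(before, after)) diffs.push({ input, before, after });
  }
  return diffs;
}

// Hypothetical original code.
const oldVersion = (x) => Math.floor(x);

// Harmless cleanup: same behavior, written differently.
const cleanup = (x) => {
  const whole = Math.floor(x);
  return whole;
};

// Unwanted change: agrees with floor on non-negative input, not on negative input.
const wrongChange = (x) => Math.trunc(x);

const sample = [0.4, 1.5, 2, 7.9];

const cleanupDiffs = diffBehavior(oldVersion, cleanup, sample);
const wrongChangeDiffs = diffBehavior(oldVersion, wrongChange, sample);
const wrongChangeWiderDiffs = diffBehavior(oldVersion, wrongChange, [...sample, -1.5]);

console.log({
  cleanup: cleanupDiffs.length,
  wrongChange: wrongChangeDiffs.length,
  wrongChangeWider: wrongChangeWiderDiffs.length,
});
```

On the original sample the two changes are indistinguishable. The difference only appears once a negative input is added, and even then the diff says the behavior changed, not which version is right.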
