No Agent Grades Its Own Homework

#codereview #ai #agents #selfpreferencebias

You ask Claude to review your code. It says "looks good, clean, well factored". Of course it does. It wrote that code five minutes ago. You just asked the author to grade his own paper, and he gave himself an A.

Having an AI review code works. But not by asking the one who just wrote it. Quality doesn't come from a smarter model, it comes from an architecture where no role checks itself.

The self-preference bias

This isn't a hunch, it's measured. A model evaluating its own output rates it higher than others' at equal quality: this is the self-preference bias, documented by Panickssery and co-authors in 2024. Their experiments point to a causal link rather than a mere correlation: the better a model recognizes its own style, the more it favors it.

In practice that means the naive loop "write, then review what you just wrote" is broken by construction. You don't get a review, you get a justification. The agent already decided its code was good the moment it produced it; asking again only confirms.

The blind reviewer

So the first rule: the reviewer is never the author. In my config, the review agents run in a clean context. They don't see the implementation prompt, they don't know what constraints the author set, they meet the diff like a colleague on Monday morning. And when the author model is known, the reviewer comes from a different model family, to break style recognition.

One detail matters as much as the rest: the developer's name never enters the reviewer's prompt. No "this was written by a senior", no "review this model's work". The author's identity is exactly the information that triggers the bias. We take it off the table.
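A minimal sketch of that routing, assuming a setup where you can choose which model family handles a request. The family labels, `ReviewRequest`, and both function names are hypothetical, not taken from any real tool. The point it illustrates: the author's family is used only to pick a different reviewer, and it never reaches the prompt.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical family labels; substitute whatever model families you actually run.
MODEL_FAMILIES = ("family_a", "family_b", "family_c")


@dataclass(frozen=True)
class ReviewRequest:
    reviewer_family: str
    prompt: str


def pick_reviewer_family(author_family: Optional[str]) -> str:
    """Return the first family that differs from the author's (any family if unknown)."""
    for family in MODEL_FAMILIES:
        if family != author_family:
            return family
    raise ValueError("no reviewer family differs from the author's")


def build_review_request(diff: str, author_family: Optional[str] = None) -> ReviewRequest:
    """Build a clean-context request: the prompt holds the diff and nothing else.

    author_family is used for routing only; it is never written into the prompt.
    """
    prompt = (
        "Review the following diff. Report only problems you can point to "
        "by file and line.\n\n" + diff
    )
    return ReviewRequest(pick_reviewer_family(author_family), prompt)
```

Nothing about the implementation prompt, the author's constraints, or the author's identity is passed in, so there is nothing for the reviewer to recognize or defer to.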

No finding without a receipt

The second trap is the opposite of the first. An AI reviewer, especially in a clean context, tends to over-flag: it invents problems to look useful, it flags "vulnerabilities" that aren't. A review that cries wolf on every line is no better than a complacent one: either way, you stop listening.

Hence the receipt rule. Every finding must cite a file:line and pass a check before it's surfaced: a grep proving the occurrence, a sandbox run, a failing test, a data-flow trace. A finding nobody can prove is dropped silently, no debate.

Finding: "non-parameterized SQL call, injection risk"
  → receipt required: grep the user-input → query flow
  → if the value is a code constant: dropped
  → if it comes from the HTTP request: kept, with the line

Proof comes before judgment. The reviewer isn't allowed to bother you over a hunch.
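The receipt rule can be sketched as a filter in front of the reviewer's output. This is an assumption-heavy illustration: `Finding`, `has_receipt`, and `surface` are hypothetical names, and the only receipt implemented here is the grep-style one (a regex that must match the cited line). A sandbox run, a failing test, or a data-flow trace would slot in as other checks.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Finding:
    path: str     # file the finding points at
    line: int     # 1-indexed line number
    claim: str    # what the reviewer asserts
    pattern: str  # regex that must match the cited line for the claim to hold


def has_receipt(finding: Finding, files: Dict[str, str]) -> bool:
    """True only if the cited file:line exists and the pattern matches that line."""
    source = files.get(finding.path)
    if source is None:
        return False
    lines = source.splitlines()
    if not 1 <= finding.line <= len(lines):
        return False
    return re.search(finding.pattern, lines[finding.line - 1]) is not None


def surface(findings: List[Finding], files: Dict[str, str]) -> List[Finding]:
    """Keep findings that come with a receipt; drop the rest without comment."""
    return [f for f in findings if has_receipt(f, files)]
```

With `files` mapping paths to file contents, a finding that cites a line where the pattern really occurs is kept, while one that cites a missing file, a line past the end of the file, or a line that doesn't match is dropped.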

The refute panel

For critical findings, the ones that would block a merge, I add a last layer: a panel of independent skeptics whose instruction isn't to confirm but to refute. Each one gets the finding and tries to tear it down: "here's why this isn't a bug". Because a finding needs a majority of the panel to keep it, plausible-sounding false alarms don't survive. The ones that remain took a demolition attempt and held.

It's the exact opposite of the naive loop. Instead of one model trying to be right, you have several trying to prove the finding wrong. The truth that comes out has been attacked, not self-proclaimed.
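The vote itself is small enough to sketch. Here a skeptic is any callable that returns `True` when it managed to refute the finding; in a real setup each would presumably be an independent model call, which this sketch deliberately leaves out. `Skeptic` and `survives_panel` are hypothetical names, and the strict-majority threshold is one reasonable reading of the rule, not the only one.

```python
from typing import Callable, Sequence

# A skeptic takes a finding and returns True if it managed to refute it.
Skeptic = Callable[[str], bool]


def survives_panel(finding: str, skeptics: Sequence[Skeptic]) -> bool:
    """Keep the finding only if a strict majority of skeptics failed to refute it."""
    if not skeptics:
        raise ValueError("the panel needs at least one skeptic")
    refutations = sum(1 for skeptic in skeptics if skeptic(finding))
    upheld = len(skeptics) - refutations
    return upheld > len(skeptics) / 2
```

A tie counts against the finding, so an even-sized panel leans toward dropping. That matches the intent: the burden of proof sits on the finding, not on the code.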

Separation of powers

Put end to end, this gives a team of agents where the roles never overlap. The one who writes the code isn't the one who writes the tests, which are written from the spec only, not from the code. The one who reviews didn't write. And before a human or an LLM even gives an opinion, an objective gate (build, lint, tests) has to be green: the model's judgment only comes after the machine, never instead of it.
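The ordering in that last sentence can be sketched as a pipeline where the reviewer is simply not called while any objective gate is red. `PipelineResult` and `run_pipeline` are hypothetical names; the gates are plain callables here, standing in for whatever build, lint, and test commands a project really uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class PipelineResult:
    failed_gates: List[str]
    review: Optional[str]  # None when the reviewer was never called


def run_pipeline(
    gates: Dict[str, Callable[[], bool]],
    review: Callable[[], str],
) -> PipelineResult:
    """Run every objective gate first; call the reviewer only if all are green."""
    failed = [name for name, check in gates.items() if not check()]
    if failed:
        return PipelineResult(failed_gates=failed, review=None)
    return PipelineResult(failed_gates=[], review=review())
```

All gates run even after one fails, so the report lists everything that is red at once. The reviewer's opinion is only ever produced on a change that already builds, lints, and passes its tests.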

This isn't gratuitous distrust of AI. It's the same principle that governs a newsroom, an accounting team, a court: you let no one sign off on their own work, because no one is a good judge of themselves. LLMs, with a proven self-preference bias, even less than the rest.

Conclusion

The temptation, with a model that codes well, is to hand it the whole cycle: write, test, review, sign off. That's exactly what you mustn't do, because each of those steps corrects the previous one, and a corrector that corrects itself corrects nothing. The quality of an AI review isn't measured by the model's intelligence. It's measured by how many times you stop it from grading itself.

Top comments (2)

Alex Shev • Jun 28

The independence of the reviewer matters more than the fact that it is another model. If the same context, same incentives, and same assumptions flow into the grader, it will often bless the original mistake. A useful review agent needs separate evidence and permission to disagree.

Raju Dandigam • Jun 30

This is a very practical framing of AI code review. The “no finding without a receipt” rule is especially strong because it turns review from opinion into evidence. I’ve seen similar issues when teams let AI generate code, tests, and review comments in the same loop — the workflow becomes fast, but the validation boundary becomes weak. Separating author, tester, reviewer, and objective gates feels much closer to how production engineering teams should use agents safely.