Ken Imoto

Posted on Jun 13

I Pointed Copilot, CodeRabbit, and Claude Sub-Agents at the Same 30 PRs. They Agreed on 22%.

#codereview #ai #claudecode #github

I had been quietly running three different AI code reviewers in parallel on a project for two months. GitHub Copilot's PR review, CodeRabbit, and a triple of Claude Code sub-agents wired into a pre-merge hook. The plan was always to pick one and turn the other two off. What stopped me was that whenever I read the reviews side by side, they did not look like they were reading the same code.

So I ran a small experiment. Thirty PRs, all from one real production service, all already merged so I knew which issues turned out to matter. I fed each PR to all three reviewers fresh, with no system-prompt tuning beyond defaults. I categorized every flagged issue. Then I asked one question: how often do all three agree?

22%.

This post is about the 78% where they did not agree, what the gap looked like by category, and why I stopped trying to pick a winner.

What I fed each one

The 30 PRs spanned about three months of work: refactors, bug fixes, a couple of new endpoints, two database migrations, one nasty performance regression. Average diff size was 240 lines added, 110 removed. Languages: mostly Python, some TypeScript on the frontend slice. The repo has CI, has tests, has a CHANGELOG, has a CLAUDE.md for the agents to read.

The three reviewers:

GitHub Copilot Code Review (the GA version, default settings, as of 2026-05). Inline comments at the line level.
CodeRabbit (PR-level review, "thorough" preset, no custom rules). Summary plus inline comments.
Three Claude Code sub-agents (architect / security / performance), wired into a pre-merge hook so all three pass before merge. Each sub-agent reviews the diff independently and posts comments via the GitHub API.

I did not give any of them special context beyond what the repo already has. The Claude sub-agents got the CLAUDE.md automatically (that is what CLAUDE.md is for). Copilot and CodeRabbit had what was checked into the repo.

How I counted "agreement"

For every PR I built a list of distinct issues across all three reviewers, then de-duped near-matches by hand (some called the same nested-loop issue by three different names). For each canonical issue I marked which of the three flagged it. "Agreement" meant all three flagged the same canonical issue.

Across the 30 PRs:

244 canonical issues total
54 flagged by all three (22%)
41 flagged by exactly two
149 flagged by exactly one

The "all three" rate of 22% is the headline. What was more interesting was the 149 single-reviewer findings, and which reviewer raised which.

What each reviewer was actually good at

When I bucketed the single-reviewer findings by category, the three reviewers were not just disagreeing randomly. They were each pulling weight in their own zone.

Copilot — was best at line-level style and best-practice nits. Things like "this exception is swallowed silently", "this f-string has no formatting", "consider using a set here." Of its 76 unique findings, 58 were in that bucket. It also had the lowest false-positive rate on the style category (around 9%) and the highest false-positive rate on anything architectural (around 41% when it tried to comment on structure). Worth keeping for style. Not worth listening to about structure.

CodeRabbit — was best at cross-file consistency and contract drift. "You changed the signature in module A but the call in module B still expects the old kwargs." "This new field is not in the migration script." Of its 64 unique findings, 47 were cross-file. This is the kind of thing one-file-diff reviewers (including humans on a Friday afternoon) miss because the relevant context is in a file the PR did not touch.

Claude sub-agents — were best at runtime, security, and behavior under load. "This new endpoint does not have auth on it; here is the line where you would add it." "This loop will allocate one dict per row at 800 rows per request." "Locking order here is opposite to what module_z.py does, deadlock risk." Of their 52 unique findings, 38 were in those categories. They also had the highest false-positive rate on style (about 23%) — they kept suggesting style "improvements" that the team had explicitly debated and rejected, which Copilot would have known not to bring up because it had seen the codebase's pattern more times.

Here is the thing that surprised me when I read all this back: none of the three single-reviewer findings were dominated by overlap with the others. The 149 unique findings were genuinely different findings, not just the same finding said differently. The "all three agreed" 22% was mostly the most obvious bugs, the ones any decent reviewer would catch.

What the 22% looked like

I read every all-three-agreed item. The pattern was depressing in a useful way. The 22% was almost entirely:

Obvious null/None handling mistakes
Off-by-one in pagination
Missing await on an async call
Typos in test names or assertion messages
Hardcoded values that looked like they belonged in config

None of the agreed-on items were the ones that, post-merge, actually caused trouble. The two issues that did cause trouble after merge (the asyncpg dataclass regression I have written about elsewhere, and a cache-key-collision bug) were flagged by exactly one reviewer each (the Claude perf sub-agent and CodeRabbit respectively), and ignored by me. That is on me.

Why I stopped trying to pick a winner

When you have three reviewers that agree 22% of the time, two framings are available:

"They are noisy. I should pick the most accurate one and turn the rest off."
"They are looking at different things. I should figure out which one is looking at what."

I tried framing 1 for a week and missed two real issues that the "winning" reviewer did not flag. I switched to framing 2 and have not gone back.

The routing I run now:

Copilot at pre-commit as inline IDE suggestions. The style/nit bucket. I take or leave each one in 2 seconds. Noise tolerance is high because it is in my editor, not in the PR.
CodeRabbit at PR open as the first PR comment. The cross-file bucket. I read the summary, scan inline comments for "you changed X but did not update Y" patterns, ignore the style stuff.
Claude sub-agents at pre-merge as a gating hook. The runtime/security/behavior bucket. Block merge until each sub-agent passes or I explicitly override with a reason. This is the highest-cost layer (around 35 cents per PR for the three sub-agents combined) so I do not run it on every commit, only at the merge gate.

Total cost increase per PR over running just one reviewer: about 45 cents and an extra 3 minutes of human reading. Total real bug catch rate: noticeably higher, although I cannot give you a clean number because half the comparison is counterfactual.

What I do not do

I do not let any single reviewer auto-approve a PR. I have read enough Anthropic release notes on auto-approve modes to know that the failure modes are not zero. The reviewers vote with a comment; a human still hits merge.

I also do not deduplicate the three reviewers' comments before I read them. I tried merging the three reviewer outputs into a single combined comment and lost information every time. The voice of each reviewer is part of the signal. CodeRabbit explaining cross-file drift sounds different from Claude explaining a deadlock, and the way the explanation is shaped is part of how you decide whether to take it seriously. Merging them flattens that.

The lesson, if there is one, is that AI code review has not yet collapsed into one tool. Today, three half-overlapping reviewers cover more ground than any single one. That is annoying for the "pick one and stop thinking" decision style and it is the truth.

The longer version, including the pre-merge hook script for wiring three Claude sub-agents into PR checks and the routing rules I use per project, is in Practical Claude Code. Chapter 11 is the sub-agent orchestration. Chapter 8 is the day-to-day workflow that wraps these gates.

Top comments (2)

Mehmet Can Farsak • Jun 14

Fascinating breakdown — that 22% agreement rate really highlights how each agent has a distinct "thinking mode." Copilot in style-review mode, CodeRabbit in cross-file analysis mode, Claude sub-agents in runtime analysis mode. They're not disagreeing; they're each in a different headspace.

That's exactly what got me building Brainstorm-Mode (mehmetcanfarsak on GitHub) — agents don't natively switch between thinking modes. The plugin adds three modes (divergent, actionable, academic) via PreToolUse hooks so agents stay in the right headspace instead of jumping between analysis types mid-review. Could be useful for anyone orchestrating multiple agent reviewers.

Ken Imoto • Jun 15

"Different headspace" is a better frame than the "agreement" one I reached for. Thanks for that. The twist in my case is that nobody picked the modes; each tool just defaulted to one. So the 22% is less a choice and more an artifact of how they're built, which is exactly why I didn't want them to converge. Your plugin flips that into a deliberate knob on a single agent, the opposite lever. I'll take a look at the divergent mode.