DEV Community

Jim L


I Tested 4 AI Code Review Tools on Real Pull Requests — Here's What They Actually Caught

My team had a problem that I suspect is fairly common: PRs were getting reviewed, but the review quality was inconsistent. Some reviewers would catch SQL injection patterns immediately. Others would approve code with obvious race conditions because they were focused on feature completeness.

So I ran an experiment: I tested four AI code review tools on roughly 40 real pull requests over eight weeks. Not toy examples — production code across a Django backend and React frontend. Here's what actually happened.

The Setup

Our codebase is mid-sized — around 180K lines. The PRs ranged from 15-line bug fixes to 800-line feature branches. I configured each tool to run automatically on every PR and tracked three things: bugs caught that humans missed, false positives that wasted time, and total time saved (or lost) per review cycle.

I tested CodeRabbit, GitHub Copilot's PR review feature, Codacy, and Cursor's inline review. Each got the same PRs over the same period.

CodeRabbit: The Surprising Winner

I went in skeptical. CodeRabbit is relatively new and the marketing felt aggressive. But the results were hard to argue with.

On our 40 PRs, CodeRabbit flagged around 23 genuine issues that made it through human review. That's roughly a 58% catch rate on the bugs that existed. More importantly, its false positive rate sat at about 22% — meaning roughly one in five flags was noise, which is tolerable.

What impressed me most was the contextual awareness. It caught a case where a developer changed a serializer field name in one file but didn't update the corresponding API test. It noticed that a new database query inside a loop would cause N+1 problems. These aren't pattern-matching catches — they require understanding how different parts of the codebase connect.
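To make the N+1 catch concrete, here's a minimal, framework-free sketch of the pattern using sqlite3 (the table and names are illustrative, not from our codebase — in Django the fix would be `select_related`, but the underlying issue is the same):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customer VALUES (1, 'Ada'), (2, 'Lin');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2);
""")

# N+1 pattern: one query for the orders, then one extra query per order
names_n_plus_1 = []
for (customer_id,) in conn.execute("SELECT customer_id FROM orders ORDER BY id"):
    row = conn.execute(
        "SELECT name FROM customer WHERE id = ?", (customer_id,)
    ).fetchone()
    names_n_plus_1.append(row[0])  # 1 + N round trips in total

# Fix: a single JOIN fetches the same data in one round trip
names_joined = [
    name for (name,) in conn.execute(
        "SELECT c.name FROM orders o "
        "JOIN customer c ON c.id = o.customer_id ORDER BY o.id"
    )
]

assert names_n_plus_1 == names_joined  # same result, far fewer queries
```

With three orders the difference is invisible; with three thousand rows inside a request handler, it's the kind of thing you want a reviewer — human or otherwise — to flag.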

The pricing is around $15/month per seat for teams, which felt reasonable given the time savings.

GitHub Copilot PR Review: Seamless but Shallow

Copilot's PR review is convenient because it's already in your GitHub workflow. You don't install anything extra — it just shows up in the review panel.

The catch rate came in just below CodeRabbit's — roughly 57% of real issues flagged. The bigger gap was the false positive rate, noticeably higher at about 33%. That meant more time spent evaluating whether Copilot's suggestions were actually worth acting on.

Where Copilot struggled was with cross-file context. It would review each file in isolation, missing issues that only become apparent when you understand how the changed file interacts with its dependencies. It caught surface-level issues well — missing null checks, potential type mismatches, obvious security patterns — but missed the architectural stuff.

The advantage is zero friction. If your team already uses GitHub Enterprise with Copilot, there's no additional cost or setup.

Codacy: Noisy but Deep on Security

Codacy is more of a static analysis platform than a pure AI review tool, but it has ML-powered features layered on top. The experience is different — it focuses heavily on code quality metrics, security vulnerabilities, and style consistency.

Catch rate on logic bugs was lower, around 51%. But on security-specific issues, it was the best of the four. It caught a SQL injection vector in a raw query that every other tool and every human reviewer missed. It also flagged several instances of hardcoded credentials in test fixtures that had been there for months.
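For readers who haven't hit this class of bug: the vector Codacy caught was string interpolation into a raw query. Here's a generic, self-contained illustration with sqlite3 (the table and payload are mine, not the actual code from our PR):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'a@example.com')")

user_input = "' OR '1'='1"  # classic injection payload

# Vulnerable: interpolating attacker input into the SQL string itself
query = f"SELECT id FROM users WHERE email = '{user_input}'"
leaked = conn.execute(query).fetchall()  # matches every row

# Safe: a bound parameter is treated as data, never as SQL
safe = conn.execute(
    "SELECT id FROM users WHERE email = ?", (user_input,)
).fetchall()  # matches nothing
```

The interpolated version evaluates `WHERE email = '' OR '1'='1'`, which is true for every row. Static analyzers are good at spotting exactly this shape, which is probably why Codacy won here.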

The downside is noise. Codacy generates a lot of findings, and many are style-related rather than bug-related. You need to invest time configuring which rules matter for your project. Out of the box, the signal-to-noise ratio is rough.

The trend tracking dashboard is genuinely useful though — being able to see code quality metrics over time helps make the case for technical debt investment.

Cursor: Great for Pre-Push, Not for PR Review

Cursor doesn't have a dedicated PR review feature. What it does have is strong inline code understanding that you can use during development to catch issues before they become PRs.

I tried using Cursor's chat to review diffs manually by pasting them in. It worked, but it was manual and didn't scale. The context window limitations meant larger PRs had to be reviewed in chunks.

Where Cursor genuinely shines is as a pre-commit sanity check. Writing code in Cursor and asking it to review your own changes before pushing caught several issues early. But as a team-wide PR review tool, it's not there yet.

What None of Them Caught

This is the part that matters most. Across all 40 PRs, there were about 7 bugs that no AI tool flagged. Every single one fell into the same category: business logic errors.

One PR changed how subscription renewals were calculated. The code was syntactically correct, the types were right, the tests passed (because the tests were also wrong). But the renewal date calculation didn't account for leap years in a specific edge case. No AI tool flagged it because the code looked correct — you needed to understand the business requirement to see the problem.
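To show why this is invisible to a tool that only reads the diff, here's a simplified reconstruction of the leap-day trap (the actual bug in our PR was more involved; this sketch and the `add_year` helper are my own illustration):

```python
import calendar
from datetime import date, timedelta

start = date(2024, 2, 29)  # subscription started on a leap day

# Naive approach 1: "a year is 365 days" — lands on Feb 28 and keeps
# drifting by a day every time a leap year passes
naive_renewal = start + timedelta(days=365)

# Naive approach 2: bumping the year raises on Feb 29,
# because 2025-02-29 does not exist
try:
    start.replace(year=start.year + 1)
except ValueError:
    pass  # "day is out of range for month"

def add_year(d: date) -> date:
    """Add one calendar year, clamping Feb 29 to the month's last valid day."""
    year = d.year + 1
    day = min(d.day, calendar.monthrange(year, d.month)[1])
    return d.replace(year=year, day=day)

renewal = add_year(start)  # date(2025, 2, 28) — by explicit policy, not by drift
```

Every one of these variants is syntactically valid and type-correct. Whether clamping to Feb 28 is even the *right* policy is a business decision — exactly the thing no AI reviewer can infer from the code alone.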

Another case: a developer added a caching layer that worked perfectly in isolation but created a stale data issue when combined with a separate feature that updated the same records through a different code path. The AI tools reviewed each PR independently and saw nothing wrong. The bug only existed in the interaction between two separately-correct changes.
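A stripped-down sketch of that interaction (the dict-based cache and the function names are illustrative — the real system used a proper cache backend):

```python
# Two code paths update the same record; only one invalidates the cache.
cache = {}
db = {"plan_42": {"price": 100}}

def get_plan(plan_id):
    """Read-through cache: fill on miss, serve from cache afterwards."""
    if plan_id not in cache:
        cache[plan_id] = dict(db[plan_id])
    return cache[plan_id]

def update_price_via_api(plan_id, price):
    db[plan_id]["price"] = price
    cache.pop(plan_id, None)  # PR #1: caching layer, invalidates correctly

def bulk_repricing_job(plan_id, price):
    db[plan_id]["price"] = price  # PR #2: separate path, never touches the cache

get_plan("plan_42")                   # warms the cache
bulk_repricing_job("plan_42", 150)    # updates the DB behind the cache's back
stale = get_plan("plan_42")["price"]  # still 100: stale read
```

Each PR, reviewed in isolation, looks fine — `update_price_via_api` even invalidates properly. The bug only exists in the combination, which is exactly the blind spot per-PR review tools have.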

These are the kinds of issues that require understanding intent, not just code. AI code review tools are getting better, but they're fundamentally analyzing what the code does, not what the code should do.

My Recommendation Framework

Use CodeRabbit if: You want the best overall catch rate and can afford ~$15/seat/month. Best for teams shipping 5+ PRs per day where review bottlenecks are real.

Use Copilot PR review if: You're already on GitHub Enterprise with Copilot. The convenience of zero setup outweighs the lower catch rate.

Use Codacy if: Security is your primary concern, or you need compliance-oriented code quality tracking.

Use Cursor for self-review if: You're a solo developer or small team where formal PR review isn't your bottleneck.

Use all of them if: Just kidding. Pick one, configure it properly, and invest the time you save into better test coverage. That's where the actual bug prevention happens.


Has anyone else run similar comparisons? I'm curious whether the results hold up on different tech stacks — our Django/React combo might bias the results in ways I'm not seeing.


I wrote a longer comparison of AI code review tools with benchmark results on my review site if you want more detail on each tool.
