Best AI Code Reviewer in 2026? We Ran 4 in Parallel for 3 Weeks (146 PRs, 679 Findings)

Disclosure and context. A small team running a backend SaaS application (PHP/ReactJS, moderate-sized codebase, no connection to Sentry) was running its own evaluation of four AI code reviewers: CodeRabbit, Sentry Seer, Greptile, and Cursor BugBot. They asked for help on the data side. I built the ingester, captured the comments, and crunched the numbers. The conclusions below are the team's, drawn from the data. I work at Sentry, so one of the four reviewers is my employer's product. All four ran in the default configuration their onboarding wizard sets up; no custom rules, no vendor outreach.

How the data lands:

  • Greptile: zero false positives across 120 findings, ~92% bug-shaped, largest precise top-tier pool (51 P1 findings, 40 solo). Leads on precision and signal density.
  • CodeRabbit: highest volume (281 findings), 68.3% one-click diff coverage. Leads on breadth and applyability.
  • Seer: 6/6 perfect at critical (the only reviewer in the dataset to use that label). Holds the strictest-label sub-claim.

If the team had to commit today, their lean is Greptile. They haven't picked a long-term winner; other reviewers in this space deserve a fair shot before they lock anything in. The entire dataset and ingester are open-sourced at vlad-ko/pr-review-bench so you can re-run the comparison on your own codebase and verify every number.

It started with a confession: nobody on the team could remember which AI reviewer was supposed to be the good one.

We had four of them turned on, all fighting for the same comment surface on every pull request. The conversations sounded like this:

"Did Greptile catch that bug last week, or was that the other one?"
"I think Seer flagged it. Or maybe CodeRabbit. They both say 'Critical' so it's hard to tell."
"Wait, do we even pay for BugBot, or does it come free?"

So we did the only thing engineers ever do when faced with vendor confusion: we instrumented it. A small Python script. A SQLite database. Every comment, every reviewer, every PR: captured verbatim, frozen in place, queryable forever.

Three and a half weeks later, the database held 679 findings across 146 merged pull requests, drawn from 446 review events. We had data. We had opinions. We had been wrong about almost everything.

This is what three weeks of side-by-side comparison taught us about the tools, about pricing, and about the weird ways AI reviewers reveal their personalities when you watch them long enough.

One methodology note before we start. We didn't spend weeks tuning these. For every reviewer in this comparison we ran the default configuration that ships out of the box, plus whatever minimal setup each vendor's onboarding wizard asks for (the GitHub App, the repo allowlist). No custom rules, no profile gymnastics, no per-language tweaks. That's the experience a team gets in week one of adoption, and it's the comparison most readers are actually shopping for.


The cast

Four AI code reviewers, all enabled simultaneously, on a backend SaaS codebase of moderate size:

| Reviewer | What it claims to do | First impression |
| --- | --- | --- |
| CodeRabbit | "Detailed code review with inline suggestions" | The one that always has notes |
| Sentry Seer | "Predicts bugs before they hit production" | The paranoid one |
| Greptile | "PR review with codebase context" | The quiet one |
| Cursor BugBot | "Spots real bugs in your changes" | The well-priced surprise (initially) |

After three weeks of watching these four bots argue about our code, we can confirm: those personalities are remarkably stable.


CodeRabbit will tell you everything

CodeRabbit is the reviewer that always has notes. On a four-line typo fix it will, with cheerful confidence, suggest tightening an assertion matcher and removing an unused import. On a large refactor it will produce a dozen findings, most of them flavors of "Add a PHPDoc to this new method."

In our window, CodeRabbit posted 281 findings across 82 PRs, an average of 3.4 per PR. That's the highest density of any reviewer. The eye-popping number lives somewhere else: roughly seven out of ten CodeRabbit findings come with a patch you can apply directly. Specifically, 68.3% ship a unified diff block, and 27.8% of all findings additionally ship a GitHub one-click suggestion block nested inside that diff (suggestions never appear standalone). Two-thirds of those patches are mechanical (formatting, docblocks, naming). The remaining third is real bug fixes you can paste in.

CodeRabbit re-runs after every commit, including fix-pushes. Its "CHILL" profile is exhaustive on every cycle, surfacing one to three new findings on a fix-push that just resolved the prior round. Plan on five-to-six review cycles per non-trivial PR. It's configurable, but the default is what you'll get if you don't think about it.

The quirk that bit us twice: CodeRabbit nests additional findings inside its review-body wrapper, in <details><summary> blocks. These nested findings have no comment ID and can't be replied to via GitHub's in_reply_to API. A merge-readiness audit that only queries inline comments will miss them. Once we knew where to look it was easy to fix; we now scan both inline comments and review bodies.
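If you're building a similar audit, the workaround is a few lines. A minimal sketch, assuming an authenticated gh CLI; the repo and PR number are placeholders, and the regex is a loose approximation of CodeRabbit's wrapper markup:

```python
# Sketch: surface findings nested in review bodies, which have no comment IDs
# and are invisible to inline-comment queries.
import json, re, subprocess

REPO, PR = "owner/name", 1234  # hypothetical

reviews = json.loads(subprocess.check_output(
    ["gh", "api", f"repos/{REPO}/pulls/{PR}/reviews", "--paginate"]))

for review in reviews:
    body = review.get("body") or ""
    # CodeRabbit wraps extra findings in <details><summary>...</summary>...</details>
    for summary, _detail in re.findall(
            r"<details>\s*<summary>(.*?)</summary>(.*?)</details>", body, flags=re.S):
        print(f"nested finding in review {review['id']}: {summary.strip()[:80]}")
```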

CodeRabbit's false-positive rate is 2.3%, the lowest among reviewers that produced false positives at all. The tradeoff is latency: it's the slowest reviewer in this comparison, mean 9.5 minutes from commit to first finding (more on this below). The CHILL profile's exhaustiveness has a wall-clock cost.

A signature CodeRabbit catch (paraphrased): A new command advertised a recovery action called acknowledge_partial with a default payload of 0. Three layers down, the validator rejected any value < 1. CodeRabbit traced the contract end-to-end: "this action is dead-on-arrival for exactly the empty-input case it was designed to handle." The kind of cross-file inconsistency a human reviewer misses on a 600-line PR. Shipped with a unified-diff patch and a regression test.


Sentry Seer is convinced production is on fire

Seer reads every PR like the on-call engineer paged twice already this week. Where CodeRabbit worries about indentation, Seer worries that your null-safe operator is going to NPE during a retry storm at 3am.

This pays off, hard. Seer flagged 40 high-severity and 6 critical-severity bugs in our window, more than the other three reviewers' combined high+critical-labeled findings (12 from CodeRabbit, 22 from BugBot, 0 from Greptile under those exact labels). Greptile's top-priority P1 tier is plausibly comparable to Seer's high and would close that gap considerably; the methodology footnote on cross-vendor labeling spells out why the apples-to-apples comparison is harder than the label count makes it look. Either way, Seer's critical tier (6/6 perfect, the only reviewer to use that label at all) is uncontested. These weren't speculative findings; they were "this WILL break in production" claims, often pointing at exact failure modes we'd shipped before.

The Seer trade-off lives in the severity column. Slice the false-positive rate by severity tier and a pattern emerges:

Sentry Seer false-positive rate by severity tier

Seer's "critical" tier is perfect: zero false positives across all six critical findings. Seer's "high" tier is the noisiest of any reviewer in our dataset, with a 15.0% false-positive rate (6 high-severity findings out of 40 turned out wrong). About one in seven "high" findings misses. Read the rule out loud: when Seer says critical, ship the fix. When Seer says high, evaluate carefully. When Seer says medium or low, treat as advisory.

This calibration is actively useful. Most reviewers don't telegraph their confidence; Seer effectively does: its severity column maps cleanly to precision.
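The mapping is easy to check against the raw data. A minimal sketch against the SQLite findings table, assuming the ingester's verdict labels are 'valid' and 'false_positive' and the DB file is named reviews.db (both illustrative):

```python
# Sketch: false-positive rate by reviewer and severity tier.
import sqlite3

con = sqlite3.connect("reviews.db")  # hypothetical filename
rows = con.execute("""
    SELECT reviewer, severity,
           COUNT(*) AS findings,
           ROUND(100.0 * SUM(verdict = 'false_positive') / COUNT(*), 1) AS fp_pct
    FROM findings
    WHERE verdict IS NOT NULL
    GROUP BY reviewer, severity
    ORDER BY reviewer, fp_pct DESC
""").fetchall()

for reviewer, severity, n, fp in rows:
    print(f"{reviewer:12} {severity:10} n={n:<4} FP={fp}%")
```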

Seer's check-run conclusion is the charming quirk that almost burned us. success means clean. neutral means findings. GitHub renders neutral as a flat grey square, which the human eye reads as "this check didn't fail" rather than "this check is telling you something." A row of mostly-green checkmarks with one grey square next to it gets pattern-matched as "passing-ish" at a glance. We once caught a clearly-failing PR almost merged for exactly that reason. (BugBot uses the same neutral-means-findings convention, so the trap isn't Seer-specific. We audited a sample of commits where each reviewer had a high-severity finding: BugBot consistently reported neutral, Seer reported neutral most of the time but sometimes reported success while findings were still live, which is the worst version of this bug.) The takeaway: a grey square next to "Seer Code Review" or "Cursor Bugbot" is the loudest thing on the checks list, not the quietest. And in Seer's case, you can't even fully trust a real green success if findings are live, because the conclusion sometimes lags the comment stream.
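If you automate around this, decode the conclusion explicitly rather than trusting the color. A sketch, again assuming an authenticated gh CLI, with placeholder repo and commit SHA:

```python
# Sketch: translate check-run conclusions into what they actually mean for the
# two reviewers that use neutral-means-findings.
import json, subprocess

REPO, SHA = "owner/name", "abc123"  # hypothetical

runs = json.loads(subprocess.check_output(
    ["gh", "api", f"repos/{REPO}/commits/{SHA}/check-runs"]))["check_runs"]

for run in runs:
    if run["name"] in ("Seer Code Review", "Cursor Bugbot"):
        verdict = {"neutral": "FINDINGS POSTED", "success": "clean"}.get(
            run["conclusion"], run["conclusion"])
        print(f"{run['name']}: {run['conclusion']} -> {verdict}")
```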

Seer posts zero applyable diffs or suggestions in its finding bodies. It tells you what's going to break and trusts you to figure out how. For high-stakes bugs that's fine: those need design judgment anyway. For lower-tier findings it's a real friction tax.

Seer is also the fastest reviewer in our dataset, mean 3.7 minutes from commit to first finding, with the tightest variance. Combined with the best-in-class high-severity detection, this is the reviewer that gets out of your way and only speaks when it matters.

A signature Seer catch (paraphrased): A new dedup helper used lockForUpdate() to serialize concurrent writers, except the row didn't exist yet on the first call. Seer flagged it cold: "this race window allows duplicate records under concurrent load, because lockForUpdate() is ineffective on rows that don't exist; the lock acquires nothing and the next-arriving concurrent request will create its own duplicate before the first one's INSERT commits." Severity: critical. Suggested fix: add a database-level UNIQUE constraint as the real backstop. We did. Two hours of staring at the code probably wouldn't have surfaced this; Seer surfaced it in 3.7 minutes.


Greptile is the friend who only speaks when it matters

Greptile is quiet. For three to five minutes after each push it says nothing: no status, no incremental signal. Then, all at once, two or three findings appear and Greptile is done. (Sometimes it takes longer; the variance is wider than the mean suggests.)

In our dataset Greptile posted 120 findings across 55 PRs, 2.2 findings per PR, the lowest density of any reviewer. The findings trend toward correctness over style: race conditions, null-handling bugs, off-by-one errors, places where the existing test doesn't actually exercise the changed branch.

Here's the thing that surprised us when we re-audited the data: 32.5% of Greptile's findings ship with a one-click suggestion block. GitHub renders these as "Apply suggestion" buttons in the inline-comment UI, same outcome as a CodeRabbit unified diff, different syntactic envelope. Greptile is not "the no-fix reviewer" we initially thought. About one in three findings can be applied without the developer touching the keyboard.

Greptile's false-positive rate in our dataset is 0%, with a methodology caveat we can't quietly wave away. Out of 120 findings, 118 got a "Fixed in" reply, 2 are still pending, and zero got a "Not applicable" pushback. Two readings:

  • Greptile is genuinely that precise. Plausible: its findings are the kind where the right answer is to fix the issue, not argue. Race conditions don't get a counterargument.
  • Greptile's findings are obvious enough that we never disagreed. Also plausible.

Either way: 0% means "no false positives caught the eye", not "could never produce one." It's still the strongest precision number in our dataset.

The head-to-head with CodeRabbit is where Greptile's quality is most visible. In our window, CodeRabbit and Greptile both flagged the same (file, line) on exactly six occasions. Both got the verdict right every time (six-for-six on each side, zero disagreements about whether the bug was real). But the framings diverged sharply. CodeRabbit's titles were almost always the same templated banner ("⚠️ Potential issue / 🟠 Major / ⚡ Quick win"), which tells you a finding exists but doesn't tell you what the bug is. Greptile's titles, on the same lines, named the actual failure mode: "--days=0 resolves to subDays(0) = now, causing occurred_at <= now() to match everything"; "Trigger button is permanently locked after the user cancels the modal"; "Test verifies Carbon's API, not the controller fix." Twice in those six cases Greptile escalated the severity over CodeRabbit's "Minor" rating because the user-visible consequence was actually a Major bug. The verdict-rate tie is real, but at the framing layer Greptile reads as the editor who already understood the story, while CodeRabbit reads as the linter that flagged something for further investigation.

Where Greptile bit us: its check-run is intermittent. It registered a Greptile Review check on some PRs and not others within the same week: same repo, same configuration. There's a knob somewhere (analysis depth, repo wiring, post-time delay) that determines whether the check shows up; we never figured out which knob to turn. Our merge gate doesn't wait on Greptile. It's advisory-only, and we audit Greptile's inline comments separately.

The other Greptile feature that should be off-by-default and isn't: sequence diagrams in PR reviews. Greptile auto-generates ASCII/markdown sequence diagrams attempting to visualize the control flow of your change. In our experience these were uniformly unhelpful, sometimes outright confusing, occasionally just wrong about call ordering. We turned them off in the repo settings and the reviewer's signal quality went up immediately. It belongs on a bottom shelf in a closet labeled "experimental features," not in the default reviewer config.

One bright spot worth flagging because it's poorly advertised: Greptile offers a 50% discount for early-stage startups (pre-Series A, under $2M revenue in the past 12 months) and free reviews for open-source repos under MIT/Apache/GPL licenses. Neither is on the front of the pricing page; both are buried in a "Special Programs" footer section. If you qualify, apply: at $15/seat/month (the 50%-off rate), Greptile becomes one of the most cost-efficient reviewers in this set.

A signature Greptile catch (paraphrased): A controller method made an external API call to a third-party vendor inside a DB::transaction() block that started with lockForUpdate(). Greptile pointed out the obvious-once-you-see-it problem: the vendor's API call routinely takes several seconds; holding a database row lock for that duration blocks every other handler that locks the same row, and under upstream slowness, risks exhausting the connection pool. Then, and this is the Greptile signature, it referenced a sibling method in the same class that already documented the correct pattern: make the network call outside the transaction, then enter the transaction only to persist the result. The lock's stated purpose (preventing concurrent writers from racing) is still achieved. Greptile didn't just find the bug; it found the convention the bug was violating.


Cursor BugBot was the one we cut

BugBot was the surprise of the dataset.

BugBot posted 128 findings across 50 PRs, 2.6 per PR, tied with Seer for second place in density. False-positive rate: 4.8%. Second-lowest. 22 high-or-critical-severity catches in three weeks, with specific, actionable framing that read like a senior engineer pointing at the screen.

BugBot's GitHub integration was the most predictable of the four: a Cursor Bugbot check-run from the cursor GitHub App, every time, same conclusion encoding, no review-body shenanigans. Only Seer and BugBot had stable, predictable status-check behavior. CodeRabbit thrashes; Greptile flickers; BugBot just worked.

A signature BugBot catch (paraphrased): A data-sync method called an upstream adapter that returned its data key as a wrapper Collection type (built via collect()->map(...)), not a plain PHP array. The downstream consumer guarded with ! is_array($result['data']), which always returns true for a Collection. BugBot caught it: "every iteration logs a warning and continues, so no records are ever actually synced." The bug had been live in main for two weeks; CI hadn't caught it because the integration tests mocked the adapter to return a plain array. The kind of catch that would have eventually surfaced as a Monday-morning support ticket asking why prices weren't updating.

And we cut it anyway.

When we were buyers, BugBot's pricing was welded to a Cursor IDE seat. To enable BugBot for a developer, that developer needed an active, paid Cursor IDE seat, even if they coded in VS Code, JetBrains, Vim, or Emacs. The capability was not sold standalone. The pricing page presented this as a feature: "AI bug review *included* with your Cursor seat!" The buyer's view was the inverse: "to use BugBot, force half my team off their preferred editor and onto Cursor." Compounding the IDE coupling, the per-charge billing was opaque. Line items on the invoice didn't cleanly map to dashboard activity, and clarifying what we paid for required more support tickets than any reviewer should require. So we cut it.

The counterfactual: If Cursor unbundled BugBot from the IDE seat, or already has, would we reactivate it? Yes, same day, with verification. The data is unambiguous: BugBot's quality, FP rate, predictable check-run, and high-severity catch density were all top-half of the field. As of mid-2026, Cursor's pricing page lists a BugBot tier at $40/user/month with a 200 PRs/month cap on Pro, but we can't tell from the page alone whether that represents true unbundling from the IDE or a renaming of the prior bundle. Anyone reactivating BugBot in 2026 should verify whether non-Cursor-IDE developers can be enrolled standalone. That's the load-bearing question. The product is fine; the packaging is where the wheels came off.


Slicing the data seven more ways

Now the part where the personalities turn into numbers from multiple angles, because no single number wins this argument.

Volume vs. precision

Findings by reviewer pie chart, n=679

False-positive rate by reviewer, lower is better

CodeRabbit posts close to 2× the volume of any other reviewer with one-third Seer's FP rate. Greptile is the inverse: lowest volume, highest precision (with the methodology footnote).

Who hands you a fix to apply?

How often does each reviewer ship a one-click fix?

The chart measures one specific thing: out of every finding a reviewer posts, what fraction comes with a fix you can apply with a single click inside GitHub's PR UI? Two formats qualify: unified diff blocks and GitHub's native suggestion blocks (the latter renders as a literal "Apply suggestion" button in the GitHub UI). Here's the breakdown:

| Reviewer | Unified diff | Suggestion block | One-click in GitHub | Other fix mechanism |
| --- | --- | --- | --- | --- |
| CodeRabbit | 68.3% | 27.8% | 68.3% | none |
| Greptile | 0.0% | 32.5% | 32.5% | none |
| Cursor BugBot | 0.0% | 0.0% | 0.0% | Every finding includes a "Fix In Cursor" deep-link button that opens the proposed fix in the Cursor IDE |
| Sentry Seer | 0.0% | 0.0% | 0.0% | Prose-only fix descriptions |

Note on the columns. CodeRabbit's 27.8% suggestion blocks always appear inside a finding that already ships a diff (a strict subset), so the "One-click in GitHub" total for CodeRabbit equals its diff coverage. Greptile's 32.5% suggestion blocks are standalone (no diff), so its one-click total equals its suggestion coverage. The columns are independent counts, not partitions.

CodeRabbit and Greptile use different formats for the same outcome: CodeRabbit favors unified diffs (often with a nested suggestion inside), Greptile favors GitHub's native suggestion blocks alone (which render as one-click Apply buttons). Functionally equivalent, syntactically different.
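The classification heuristic is small enough to show. A sketch of how the counting can work, assuming unified diffs arrive as fenced diff blocks (GitHub's one-click suggestions are always fenced suggestion blocks); the field names mirror the anonymized CSV columns:

```python
# Sketch: bucket finding bodies the way the table above counts them.
# Assumption: unified diffs are ```diff fences; one-click suggestions are
# ```suggestion fences (the "Apply suggestion" button in the PR UI).
import re

DIFF = re.compile(r"^```diff\b", re.M)
SUGGESTION = re.compile(r"^```suggestion\b", re.M)

def classify(body: str) -> dict:
    has_diff = bool(DIFF.search(body))
    has_suggestion = bool(SUGGESTION.search(body))
    return {
        "has_unified_diff": has_diff,
        "has_suggestion_block": has_suggestion,
        # One-click means either format renders an applyable fix in GitHub.
        # The columns overlap (CodeRabbit nests suggestions inside diffs),
        # so this is an OR, not a sum.
        "one_click": has_diff or has_suggestion,
    }
```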

The Seer 0% is a clean prose-only story: Seer describes what's broken and trusts you to write the fix. For high-severity bugs that's fine because those need design judgment anyway; for routine cases it's a friction tax.

The BugBot 0% needs a footnote. BugBot does have a fix mechanism: every finding ships with a "Fix In Cursor" deep-link that opens the proposed change pre-loaded inside the Cursor IDE. So it's not prose-only. But the fix mechanism is only one-click from Cursor itself. Engineers on VS Code, JetBrains, Vim, or Emacs see the prose only. This is another expression of the IDE-coupling story: BugBot's most useful affordance is invisible to anyone who didn't already adopt Cursor as their editor.

Review depth: how often a reviewer re-engages

```sql
SELECT reviewer,
       ROUND(AVG(distinct_commits_with_findings), 2) AS mean_review_events_per_pr,
       MAX(distinct_commits_with_findings) AS max_in_one_pr
FROM (SELECT reviewer, pr_number,
             COUNT(DISTINCT commit_sha) AS distinct_commits_with_findings
      FROM findings
      WHERE commit_sha IS NOT NULL
      GROUP BY reviewer, pr_number)
GROUP BY reviewer
ORDER BY mean_review_events_per_pr DESC;
```
| Reviewer | Mean review events / PR | Max in one PR | % of PRs where reviewer re-engaged (≥2 commits) |
| --- | --- | --- | --- |
| Seer | 2.02 | 10 | 51% |
| BugBot | 1.90 | 9 | 48% |
| CodeRabbit | 1.89 | 5 | 49% |
| Greptile | 1.47 | 4 | 36% |

This was the angle we didn't expect. Our intuition was that CodeRabbit would re-engage the most while Seer would "wake up once." The data flips that. Seer re-engaged on 51% of multi-commit PRs and posted findings on up to ten separate commits in a single PR. It's not a one-shot reviewer at all. CodeRabbit is just slightly behind (49%, max 5). Greptile is the only reviewer that's predominantly one-and-done.

Findings per review event (per-session chattiness)

| Reviewer | Findings per review event | Max in one event |
| --- | --- | --- |
| CodeRabbit | 1.80 | 15 |
| Greptile | 1.46 | 4 |
| BugBot | 1.35 | 9 |
| Seer | 1.33 | 8 |

When CodeRabbit shows up, it talks more per session than the others. The gap isn't enormous (1.8 vs 1.3) but it stacks across the 5–6 review cycles per PR.

Latency: who responds first?

```sql
WITH first_finding AS (
  SELECT f.reviewer, f.commit_sha,
         MIN(f.created_at) AS first_at, c.committed_at
  FROM findings f
  JOIN commits c ON c.sha = f.commit_sha
  GROUP BY f.reviewer, f.commit_sha, c.committed_at
)
SELECT reviewer,
       ROUND(AVG((julianday(first_at) - julianday(committed_at)) * 1440), 1) AS mean_min
FROM first_finding
GROUP BY reviewer
ORDER BY mean_min;
```
| Reviewer | Mean time to first finding | P95 |
| --- | --- | --- |
| Sentry Seer | 3.7 min | 6.8 min |
| Greptile | 4.9 min | 38.0 min |
| Cursor BugBot | 6.1 min | 13.3 min |
| CodeRabbit | 9.5 min | 41.3 min |

Reviewer latency: commit push to first finding

Seer is 2.5× faster than CodeRabbit. Seer also has the tightest variance: its P95 is 6.8 min, while CodeRabbit's P95 is 41 minutes. That's a real difference if you context-switch off a PR and have to come back later: Seer findings arrive while the code's still in your head; CodeRabbit findings arrive after you've made coffee. (CodeRabbit's status check also flips through pending and success two-to-four times during a single review, which makes "is this thing done yet?" a recurring question. We had to re-engineer our merge gate to wait a fixed window after every commit.)

Heroes vs noise: FP rate by severity

Top-tier severity FP rate across reviewers

The takeaway no vendor will telegraph: higher severity tiers tend to have higher false-positive rates. This is the bug-prediction tradeoff in concentrated form. The reviewer flagging "critical" is the one going out on a limb. Sometimes it's wrong.

The exception that proves the rule: Seer's critical tier is 6/6 perfect. When Seer escalates from "high" to "critical," that escalation itself is information.

Did the four reviewers ever agree?

Almost never. And never all four at once.

Out of 617 distinct (file, line) coordinates flagged across the merged-PR window:

  • 576 (93.4%) were caught by exactly one reviewer
  • 37 (6.0%) were caught by exactly two reviewers
  • 4 (0.6%) were caught by three reviewers
  • 0 (zero) were caught by all four reviewers

That's the headline. In three and a half weeks across 146 merged PRs, all four reviewers never once converged on the same line of code.
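The bucket counts come from one grouped query. A sketch, assuming the findings table stores the flagged coordinate in file_path and line columns (names illustrative, as is the reviews.db filename):

```python
# Sketch: how many reviewers flagged each distinct (file, line) coordinate.
import sqlite3
from collections import Counter

con = sqlite3.connect("reviews.db")  # hypothetical filename
overlap = Counter(
    n for (n,) in con.execute("""
        SELECT COUNT(DISTINCT reviewer)
        FROM findings
        WHERE file_path IS NOT NULL AND line IS NOT NULL
        GROUP BY pr_number, file_path, line
    """)
)
for reviewers_per_line in sorted(overlap):
    print(f"{reviewers_per_line} reviewer(s): {overlap[reviewers_per_line]} coordinates")
```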

Reviewer agreement at the same file:line, 93% unique-catch dominance

The breakdown of which pairs co-flagged is remarkably flat. Across the 37 two-way overlaps:

| Pair | Co-flags |
| --- | --- |
| CodeRabbit + BugBot | 7 |
| Greptile + Seer | 7 |
| BugBot + Greptile | 6 |
| BugBot + Seer | 6 |
| CodeRabbit + Greptile | 6 |
| CodeRabbit + Seer | 5 |

No pair dominates. No two reviewers are twins. No reviewer's catches are a subset of another's. They are genuinely seeing different things.

Who uncovered the most unique problems? Two answers, depending on what counts as a "problem":

| Reviewer | Solo findings | Solo rate | Top-tier label (own vocab) | Solo top-tier | FP rate at top tier |
| --- | --- | --- | --- | --- | --- |
| CodeRabbit | 259 | 92.2% | critical + high (12 total) | 11 | 2.3% |
| Seer | 126 | 84.0% | critical + high (46 total) | 39 | 0% at critical (6/6); 15% at high |
| BugBot | 103 | 80.5% | critical + high (22 total) | 16 | 4.8% |
| Greptile | 96 | 80.0% | P1 (51 total) | 40 | 0% |

By raw volume of unique catches, CodeRabbit wins decisively: 259 of its 281 findings sit at coordinates no other reviewer touched. Nine out of every ten CodeRabbit findings are something the other three reviewers did not see. That breadth deserves credit. CodeRabbit grinds a long tail of stylistic, structural, and conformance issues that simply isn't visible from where the other reviewers are looking.

By severity-weighted credit (unique findings at each reviewer's top-priority tier, the kind that page someone at 3am), the picture is contested between Seer and Greptile, with each holding a different strongest claim. Seer is the only reviewer that used the critical label at all and posted a perfect 6/6 record on it (zero false positives at the strictest tier in our dataset). Greptile labels its top-priority findings P1 and produced the largest top-priority pool of any reviewer (51 findings, 40 of those solo) at a 0% FP rate, compared with Seer's 15% FP rate at its broader high tier. The cross-vendor calibration caveat from the methodology section applies: Seer's critical is stricter than Greptile's P1, and Greptile's P1 is plausibly broader than Seer's high. A reader's preference between "perfect record at strictest label" and "largest precise top-tier pool" will decide which of the two wins on this axis. Both are credible answers to "which reviewer would have caught the bug that breaks production?", and neither cleanly dominates. BugBot solo-caught 16 top-tier; CodeRabbit 11.

All three kinds of unique-catch credit matter. They reward different things. CodeRabbit reaches more lines per PR than anyone else. Greptile and Seer both reach the highest-stakes lines that nobody else reaches at all, with Greptile holding the breadth advantage and Seer holding the calibrated-strictness advantage.

When the reviewers DID agree, each framed it through its own personality. Take one of the four three-way overlaps in the dataset: a single line of an identity-verification controller where three reviewers landed on the same logical bug (a wrapper method called a retry-session helper but discarded the helper's return value, leaving the caller's session record stale). Each framed it differently:

  • BugBot named the immediate consequence: "Retry result discarded, profile left with stale state." Severity: medium.
  • Greptile zoomed out to the downstream effect: "Retry result not persisted, stale session ID triggers redundant retries." Severity: minor.
  • Seer led with the production failure mode: "Bug: the result of the retry-session helper is discarded in the calling method, creating orphaned third-party-vendor sessions if a user repeatedly abandons and retries the verification flow." Severity: high.

Same line. Three reviewers. Three different bites at the same bug: state correctness (BugBot), semantic invariant (Greptile), production blast radius (Seer). Each framing carried the reviewer's signature, and each escalated severity in a different direction. (CodeRabbit didn't flag this one, which is part of the story too: high-stakes correctness bugs aren't where its 96%-Major/Minor personality spends most of its attention.)

The personalities are real, and the data validates them. Pull each reviewer's finding mix and the claims at the top of this post are visible in the numbers:

  • CodeRabbit: 96% of findings sit at "Major" or "Minor" severity, with almost nothing at "Critical" or "High." Every title is the same templated banner. Personality matches: the reviewer that always has notes, grinding through style, structure, and conformance.
  • Seer: the only reviewer with a material concentration at the top of the severity scale (4% critical, 27% high). Its finding titles, when they survive without being edited to a "Resolved in <sha>" link, open with "Bug:" and a failure-mode sentence. Personality matches: convinced production is on fire.
  • Greptile: finding titles read like one-liner bug reports: "HTTP call to third-party vendor made while holding a DB row lock", "fresh() can return null, unguarded downstream calls", "N+1 queries per escalating item." Narrow, architectural, surgical. Personality matches: the friend who only speaks when it matters.
  • BugBot: finding titles are concrete and action-oriented: "File action authorized via overly broad view policy", "Catch block safety broken by throwing DB call", "Submit button disabled synchronously prevents form submission." 16% at "High," no banners. Personality matches: spots real bugs in your changes.

Practical implication: adding a second reviewer is additive, not duplicative. If you can only afford one, you'll miss real findings the others would have caught. If you can afford two, the data suggests picking reviewers from different personality clusters (e.g., CodeRabbit + Seer, or BugBot + Greptile) gives you the widest catch surface for the smallest budget.


How we made sure the numbers were right

This section is the boring methodology one, but it earns its place because the numbers above only mean something if the ingestion is honest.

The data pipeline is small: a Python script that walks each PR's review threads, review bodies, and inline comments, freezes the contents verbatim into SQLite, and auto-classifies a verdict on each finding based on whether we replied "Fixed in <sha>" (valid) or "Not applicable" (false positive). Re-runs are idempotent. The numbers in this post are computed live from the DB; the queries are inline above so they're reproducible.
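The verdict heuristic is the whole trust chain, so here's its shape. A minimal sketch of the classification described above, with label strings and regex that are illustrative rather than the ingester's exact code:

```python
# Sketch: auto-classify a finding from the reply convention.
# "Fixed in <sha>" marks a valid finding; "Not applicable" marks a false
# positive; anything else leaves the verdict pending.
import re

def classify_verdict(reply: str) -> str | None:
    text = reply.strip()
    if re.match(r"(?i)^fixed in\s+[0-9a-f]{7,40}\b", text):
        return "valid"
    if text.lower().startswith("not applicable"):
        return "false_positive"
    return None  # pending: no verdict reply yet
```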

Before drafting this post we triple-checked the dataset against the source:

  1. Coverage audit. Every merged PR in the date window was cross-checked against the DB. Any PR present on GitHub but missing locally was re-ingested. The window contains 146 merged PRs; the DB now holds findings from all of them.
  2. Per-PR spot audit. We picked ten high-volume PRs at random and manually counted reviewer comments on each one, comparing the count to the DB. Variance was zero.
  3. Body-content audit. When verifying the diff-coverage numbers above, we re-read raw finding bodies from each reviewer to confirm what counts as an "applyable fix". That's how we caught a heuristic bug where Greptile's suggestion blocks weren't being counted alongside CodeRabbit's unified diffs. Fixed; numbers above reflect the corrected count.

The triple-check exposed exactly two surprises worth mentioning: (a) Greptile's diff coverage is meaningfully higher than the initial scan suggested (32.5%, not 0%), and (b) Cursor BugBot's per-PR average is higher than the initial scan suggested (its earlier-window activity was the most under-counted in pre-audit data). Both surprises landed in BugBot and Greptile's favor: the audit didn't reveal weakness anywhere, it revealed strength we'd previously underweighted.

The methodology footnotes that remain:

  • Greptile's 0% FP rate is real but underspecified. Out of 120 Greptile findings, 118 got "Fixed in" replies, 2 pending, zero "Not applicable." Treat as "no false positives caught the eye," not as a precision claim.
  • Seer's 6/6 critical-tier record is across only six findings. The perfect ratio is meaningful (especially given the high-tier FP rate of 15%), but the sample is small. Read it as "so far so good," not "this will hold at scale." The claim becomes stronger with each additional critical-tier finding Seer produces without a false positive; it isn't strong yet just by virtue of the 100% number.
  • Cross-vendor severity comparisons are not directly meaningful. Each reviewer uses its own vocabulary at the severity column. CodeRabbit emits critical, major, minor (and rarely high); Seer emits critical, high, medium, low; BugBot emits critical, high, medium, low; Greptile emits two priority tiers via badges rendered in the body (P1 for top-priority, P2 for second-tier), which the ingester normalizes to major/minor in the severity column. So a Greptile finding shown as severity = major in the DB is a P1 finding in the original body, and Greptile's top-tier pool is 51 P1 findings (not zero critical+high, which is what a verbatim-label query returns). A unified scheme would require manually re-tagging 679 findings against an agreed-on rubric, which we didn't do. Compare reviewers within their own vocabularies, not across labels.
  • Latency is per-commit, not per-review. We measured time from commit.committed_at to the first finding from each reviewer on that commit. P95 numbers are skewed by long-tail PRs (large changesets, late-breaking commits). Median is the better summary statistic; a median sketch follows this list.
  • Single codebase, single team. Everything here is N=1. A reviewer that's noisy on our stack might be tuned for a different one.
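SQLite has no built-in MEDIAN(), so the honest median takes a small detour through Python. A sketch reusing the table and column names from the latency query above (the reviews.db filename is illustrative):

```python
# Sketch: per-reviewer median commit-to-first-finding latency in minutes.
import sqlite3, statistics
from collections import defaultdict

con = sqlite3.connect("reviews.db")  # hypothetical filename
latencies = defaultdict(list)
for reviewer, minutes in con.execute("""
    SELECT f.reviewer,
           (julianday(MIN(f.created_at)) - julianday(c.committed_at)) * 1440
    FROM findings f JOIN commits c ON c.sha = f.commit_sha
    GROUP BY f.reviewer, f.commit_sha, c.committed_at
"""):
    latencies[reviewer].append(minutes)

for reviewer, vals in sorted(latencies.items()):
    print(f"{reviewer:12} median {statistics.median(vals):.1f} min (n={len(vals)})")
```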

Pricing: where the fine print hides

Vendor pricing pages tell you the headline number per developer per month. They do not tell you what happens at month-end when you blow through your quota during a flurry of reviewer-fix-cycle CI runs. Here's what we found when we read all four pricing pages line-by-line (as of May 2026; verify before purchasing):

| Vendor | Headline price | What's included | Where the fine print hides |
| --- | --- | --- | --- |
| CodeRabbit | $24/user/mo (Pro), $48/user/mo (Pro Plus) | Unlimited PRs | 5 reviews/hour on Pro, 10/hour on Pro Plus. A busy team that pushes 6 fix commits in an hour will hit the rate limit. Open source: free. |
| Greptile | $30/seat/mo (Pro) | 50 reviews per seat per month | "$1 per additional code review". And a review is a single re-run on a single commit, not a whole PR: a multi-round PR with 5 fix-pushes burns 5 reviews. 50% off for early-stage startups; free for OSS (both buried in a footer "Special Programs" section). |
| Sentry Seer | $40/active contributor/mo (new model, Jan 2026+) | Issue scans, fix runs, code review | "Active contributor" = anyone with 2+ PRs to a Seer-enabled repo, so contractors and one-off contributors don't count. Legacy customers may still be on per-action billing ($0.003/scan, $1/fix-run); check your subscription. |
| Cursor BugBot | Historically tied to a Cursor IDE seat | Per-IDE-seat coverage | At the time of our purchase, BugBot was sold only with a Cursor IDE seat. The pricing page as of May 2026 now lists BugBot at $40/user/mo with a 200 PRs/month cap on the Pro tier, which may represent an unbundling from the IDE. Verify current packaging before purchasing, especially for teams whose engineers don't use Cursor as their IDE. |

The hidden cost everyone underestimates: multi-round PRs eat into per-review quotas fast. A typical mid-size feature PR in our window went through 5-6 fix-cycle commits (CodeRabbit's CHILL profile surfaces 1-3 new findings on each fix push, which trigger more pushes). If you're paying per review, that's 5-6 reviews per PR, and 50 reviews per seat per month gets used up by a single developer shipping 8-10 substantive PRs. The vendor's pricing page won't tell you this; only the post-deploy invoice will.
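The quota math is worth making concrete. A back-of-envelope sketch using the numbers above (the per-developer PR volume is an assumption; plug in your own):

```python
# Greptile Pro: 50 reviews/seat/month, $1 per extra review; every fix-push
# re-run burns one review.
quota, overage_price = 50, 1.00
reviews_per_pr = 5.5        # mid-point of the 5-6 fix-cycle estimate above
prs_per_month = 12          # assumed per-developer PR volume
reviews = reviews_per_pr * prs_per_month   # 66 reviews
extra = max(0.0, reviews - quota)          # 16 over quota
print(f"~${extra * overage_price:.0f}/seat/month in overage")  # ~$16
```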

Sentry Seer's "active contributor" billing model is the friendliest of the four for small teams, because it bills per human not per review, and you don't pay extra during heavy fix-cycle weeks. CodeRabbit's hourly rate limit is structurally similar: you can blow through it but only briefly, and the limit resets cleanly. Greptile's per-review-overage model is the harshest at scale; budget accordingly.

For BugBot's history specifically: the buyer experience we had ("to enable BugBot for a developer, that developer needs an active Cursor IDE seat") is what drove us to cut it. If Cursor has since unbundled (the current pricing page is ambiguous to us), readers should verify before assuming it's still IDE-locked. The post's recommendation is shaped by the pricing reality we encountered as buyers.


Where the data lands

Time to take a position. The data doesn't pick a winner cleanly, but it does pick an order on each of three axes, and the answer depends on which axis you weight.

On the precision-and-signal-density axis, Greptile won.

Greptile is the only reviewer with a clean 0% false-positive rate in our dataset (0 out of 118 verdicts; 2 still pending). Its findings are overwhelmingly bug-shaped: a hand-classification of all 120 titles puts roughly 92% in correctness/perf/security/test-quality/architecture territory (race conditions, null-handling, N+1 queries, missing guards, security gaps), and roughly 8% in lighter territory (unused imports, redundant casts, naming nits, stale comments). What you mostly don't see in Greptile output is docblock chasing or "extract this constant" suggestions, which is where a meaningful slice of CodeRabbit's volume goes. The mean latency to first finding is 4.9 minutes, faster than CodeRabbit's 9.5 and trailing only Seer's 3.7. And in the head-to-head with CodeRabbit at the six co-flag coordinates, both reviewers got the verdict right every time, but Greptile's titles named the actual user-visible bug while CodeRabbit's titles were the same templated "⚠️ Potential issue / 🟠 Major" banner. If you weight quality and signal density above volume, Greptile gives you a higher fraction of finding-per-finding value than anyone else in this comparison. The only meaningful blocker is the intermittent check-run; the findings themselves are spot-on.

On the breadth-and-applyability axis, CodeRabbit won.

A reviewer producing routine-but-applyable findings on every change pays dividends every day in mechanical fixes a human reviewer would have to type by hand. CodeRabbit's 68.3% applyable-diff coverage is a generational gap nobody else in this comparison has closed. It produces more total fixes per developer-hour spent reading reviewer comments than any of the others, by a wide margin. Its 2.3% false-positive rate is competitive (lowest of the three reviewers that produced false positives at all). The cost is 5–6 review cycles per PR, a slower median turnaround (9.5 minutes), and a finding mix that includes a meaningful chunk of style/docblock/formatting work (useful, but not the bug-shaped findings Greptile spends its budget on). Of the 192 CodeRabbit findings that shipped a unified diff, roughly two-thirds were mechanical (formatting, docblocks, naming, redundant constants); the remaining third were real bug fixes.

On the severity-weighted axis, Greptile and Seer are co-leaders with different strongest claims.

If your costliest review failures are the ones that ship pre-production catastrophes (not the ones that miss a style nit on every PR), top-priority precision is what you should weight hardest. Two reviewers have strong cases here. Seer leads on calibrated-strictness: 6/6 perfect on its critical tier, the only reviewer in our dataset that used the critical label at all and the only one with a perfect record on it. Its top-severity findings concentrate on production-failure-mode framings (race-storm NPEs, transient retries, orphaned-session lifecycles). Greptile leads on top-tier volume and top-tier precision: 51 P1 findings (its top tier, more than any other reviewer's top-tier label count), 40 of those solo, 0% false positives at P1, vs Seer's 15% FP rate at its high tier. The cross-vendor calibration caveat (each reviewer's own vocabulary, see methodology) means a clean ordering at this axis depends on whether you prefer "perfect record at strictest label" or "largest precise top-tier pool." Where Seer falls short of Greptile and CodeRabbit is the prose-only fix style: every finding tells you what's going to break and trusts you to figure out how. Where Greptile falls short of Seer is the absence of an even-stricter top-tier label.

Honest answer for our specific situation: if we had to commit today, we'd lean Greptile. The findings are accurate, the framings name the actual bug, the latency is competitive, and the only material quirk is the intermittent check-run, a fixable footgun rather than a quality problem. But we haven't decided, and we're not in a hurry to. The data ranks the field on this dataset, but it doesn't end the experiment. There are other reviewers in this space we haven't yet evaluated, and they deserve a fair shot before we lock anything in. We'll keep instrumenting and we'll likely test newer entrants before committing; the AI reviewer space is moving fast enough that a Q1 2026 winner isn't guaranteed to be a Q3 2026 winner. Honestly, the experiment was fun, and crunching the numbers was the icing on the cake. We'd rather keep running it than declare a frozen winner.

If we could keep two, it's Greptile + Seer. The personality split is the load-bearing reason: Greptile handles the everyday correctness, architecture, and race-condition work; Seer handles the pre-production catastrophes and the rare-but-expensive blast-radius bugs. They have effectively zero overlap, Greptile ships one-click suggestion fixes for the cases where one applies, and both have mean latencies under five minutes.

If we could keep three, add CodeRabbit. The breadth axis matters too once the first two are covered, and CodeRabbit's 68.3% applyable-diff coverage handles the long tail of mechanical fixes a linter would otherwise have to chase down.

If we could keep four, and Cursor unbundled BugBot from the IDE seat, that's the configuration we'd actually run.


Ten things we wish vendors would fix

In rough order of how much they bit us:

  1. CodeRabbit: publish a definitive "I'm done" signal. Anything other than thrashing through pending and success four times. A single neutral/success/failure check-run conclusion would fix this overnight.
  2. CodeRabbit: put nested review-body findings behind their own comment IDs, or any other queryable surface. The current <details> approach is invisible to in_reply_to automation.
  3. Seer and BugBot: rename the neutral conclusion to findings, or introduce a third state. The grey-square rendering reads as "didn't fail," not as "telling you something." Both reviewers use the neutral-means-findings convention; both inherit this footgun.
  4. Greptile: make the check-run reliable, or remove it entirely. Sometimes-appearing is worse than always-absent for automation.
  5. BugBot: if the $40/user/month tier is truly unbundled from the Cursor IDE seat, problem solved. If not, unbundle it. Either way, make the packaging unambiguous on the pricing page; we'd buy it tomorrow once that's verifiable.
  6. All of them: publish a "highest confidence" tier (Seer's critical is the de facto example) and make it the highest-precision tier by design. Calibration is currency.
  7. All of them: use canonical severity vocabulary (critical/high/medium/low/info) so cross-vendor severity comparisons are even possible.
  8. Seer and BugBot: include a code diff or one-click suggestion when the fix is mechanical. CodeRabbit and Greptile show it's possible; the other two have no excuse.
  9. All of them: post a "reviewing now" status when work begins, not just when it ends. The 9.5-minute CodeRabbit silence makes "is this thing on?" a recurring question.
  10. All of them: commit to a clear, bottom-line latency SLA. Mean 3.7 min vs 9.5 min is a 2.5× difference and we shouldn't have had to instrument it to find out.

Get the data (and the tool)

Everything's open-sourced at github.com/vlad-ko/pr-review-bench, MIT-licensed:

  • ingest.py. The ingester. ~1,150 lines of Python, stdlib + the gh CLI only. Run it against any GitHub repo: ./ingest.py --repo owner/name --pr 1234.
  • schema.sql. The SQLite schema. ~170 lines. Idempotent: safe to run on every ingest. Six tables, three audit tables for frozen-body history.
  • data/findings_anonymized.csv. The full anonymized 679-row dataset from this post. Columns: anonymized PR ID, reviewer, severity, verdict, has_unified_diff (0/1), has_suggestion_block (0/1), latency_min_to_finding, body_word_count_approx, created_date. No file paths, no comment bodies, no real PR numbers, no commit SHAs.
  • data/summary_stats.csv. 28 pre-aggregated metrics. Every number cited in this post comes out of one of these rows; you can sanity-check the math without re-running any SQL.

If you build a similar comparison on your own codebase, the schema fits one screenshot and the ingester is ~1,150 lines of Python. The harder part is the patience to wait three and a half weeks before drawing conclusions, and the discipline to audit the ingestion before drawing them.

PRs welcome. Adapters for additional AI reviewers are particularly invited.


Three and a half weeks of side-by-side AI reviewer data. Four reviewers, 146 merged PRs, 679 findings, 446 review events. One reviewer cut for pricing reasons that have nothing to do with quality. If your team is shopping for a reviewer right now: instrument first, opinion second. Vendor marketing pages don't survive contact with a SQLite database.
