Qodo Research | March 2026
Anthropic launched Code Review for Claude Code, a multi-agent system that dispatches parallel agents to review pull requests, verify findings, and post inline comments on GitHub. It is a substantial engineering effort, and we wanted to see how it performs on a rigorous, standardized benchmark.
We maintain the Qodo Code Review Benchmark. When a new tool ships that positions itself as a deep, agentic code reviewer, we add it to the evaluation. That is what we did here.
This is what we found.
A Note on Methodology First
Before the results: we built this benchmark, which means the obvious question is whether we can be trusted to evaluate tools on it fairly.
The short answer is that the benchmark is publicly verifiable. The dataset covers 100 PRs with 580 injected issues across 8 production-grade open-source repositories spanning TypeScript, Python, JavaScript, C, C#, Rust, and Swift. The injection-based methodology evaluates both code correctness and code quality within full PR review scenarios rather than just isolated bug detection. Our initial evaluation covered eight leading AI code review tools, and Claude Code Review is the ninth.
If you want to run the methodology against your own tool, you can. That is intentional.
What We Evaluated
Claude Code Review was configured exactly as a new customer would set it up: default settings, running on the same forked repositories used for every other tool. AGENTS.md rules were generated from the codebase and committed to each repo root, and Claude Code Review ran automatically on PR submission. No tuning. No special configuration. Just a fair, head-to-head comparison.
The benchmark injected the same realistic defects across the same PRs, and findings were scored against the same validated ground truth with the same LLM-as-a-judge system used for every tool.
What Looked Competitive
Precision: 79%.
That is the same published precision as both Qodo configurations in this comparison. When Claude Code Review flags something, the signal quality is high. The multi-agent architecture appears to be doing what it is designed to do: produce high-signal findings rather than noisy output.
That is worth saying clearly before the rest of the analysis. Precision at this level is not easy to achieve and reflects genuine engineering depth.
Where the Gap Opened
Recall is where the results diverge.
| Configuration | Precision | Recall | F1 Score |
|---|---|---|---|
| Qodo (Extended) | 79% | 71% | 74.7% |
| Qodo (Default) | 79% | 60% | 68.2% |
| Claude Code Review | 79% | 52% | 62.7% |
Claude Code Review surfaces 52% of the ground-truth issues on this benchmark. Qodo's default configuration reaches 60%, and Qodo Extended reaches 71%. That puts Qodo Extended 12.0 F1 points ahead of Claude Code Review in the published comparison.
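For readers who want to check the arithmetic, F1 is the harmonic mean of precision and recall, so the table's scores follow directly from the two columns. A minimal sketch (any sub-point differences against the published table would reflect rounding of the underlying per-PR scores):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs from the comparison table
configs = {
    "Qodo (Extended)": (0.79, 0.71),
    "Qodo (Default)": (0.79, 0.60),
    "Claude Code Review": (0.79, 0.52),
}

for name, (p, r) in configs.items():
    print(f"{name}: F1 = {f1(p, r) * 100:.1f}%")
```

Because precision is identical across all three rows, the F1 ranking here is determined entirely by recall.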
Because this benchmark is a living evaluation rather than a static snapshot, Qodo's current production numbers are higher than those in the original research paper. These March 2026 figures are the updated baseline used for this comparison.
Why Recall Is the Hard Problem
The precision parity is interesting because it suggests both systems have made real progress on filtering out noise before posting comments. Where they diverge is coverage: how much of the real issue surface each system actually finds.
As we argued in the benchmark methodology, precision can be tightened with post-processing and stricter thresholds, but recall depends on whether the system detected the issue in the first place. That means recall is more tightly linked to deep codebase understanding, cross-file reasoning, and the ability to apply repository-specific standards.
Qodo Extended is designed around that problem. Rather than running a single review pass, it dispatches multiple agents tuned for different issue categories and merges their outputs through verification and deduplication. In the published comparison, that architectural layer raises recall from 60% to 71% while keeping precision at 79%.
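The merge logic itself is not published; the following is a minimal sketch of the dispatch-verify-deduplicate pattern described above, with a hypothetical `Finding` schema and a pluggable `verify` step standing in for whatever the real verification layer does:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Finding:
    # Hypothetical fields; the real system's schema is not published.
    file: str
    line: int
    category: str
    message: str

def merge_findings(
    agent_outputs: list[list[Finding]],
    verify: Callable[[Finding], bool] = lambda f: True,
) -> list[Finding]:
    """Merge the outputs of parallel category-tuned agents:
    run each finding through a verification check, then deduplicate
    by (file, line, category) so overlapping agents don't double-report."""
    seen: set[tuple[str, int, str]] = set()
    merged: list[Finding] = []
    for findings in agent_outputs:
        for f in findings:
            key = (f.file, f.line, f.category)
            if key not in seen and verify(f):
                seen.add(key)
                merged.append(f)
    return merged
```

The design trade-off this pattern encodes: multiple specialized passes widen coverage (recall), while the shared verification and deduplication gate is what keeps precision from degrading as agents are added.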
The Cost Question
Claude Code Review is priced at $15–$25 per review on a token-usage basis. Anthropic is positioning it as a premium, depth-first product, and the engineering behind it reflects that ambition.
For teams evaluating the cost model, the practical issue is how per-review pricing behaves at their actual PR volume. Qodo's position, stated in the accompanying post, is that its own platform delivers higher recall while scaling at materially lower cost.
Neither pricing model should be evaluated in the abstract. Your team should run the numbers against its real PR volume and review requirements.
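Running those numbers is a one-liner. A sketch using the $15-$25 per-review band cited above; the PR volume is a placeholder you would replace with your own:

```python
def monthly_review_cost(prs_per_month: int,
                        low: float = 15.0,
                        high: float = 25.0) -> tuple[float, float]:
    """Monthly spend range under per-review pricing.
    The $15-$25 band is from the article; PR volume is yours to supply."""
    return (prs_per_month * low, prs_per_month * high)

# Example: a team merging 200 PRs a month
print(monthly_review_cost(200))  # (3000.0, 5000.0)
```

At higher volumes the band widens proportionally, which is why per-review pricing is worth modeling against your real throughput rather than a single illustrative figure.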
What This Means
Claude Code Review is a capable system. Its precision is real, and its multi-agent architecture is substantive.
The benchmark shows a recall gap that matters in practice. On a dataset designed to test not only obvious bugs but also subtle best-practice violations, cross-file issues, and architectural concerns, the published Qodo results show meaningfully broader issue coverage.
The practical question for your team is whether the recall difference maps to the issue types that matter in your codebase, and whether the pricing model makes sense at your PR volume.
The dataset and evaluated reviews are public. If the numbers matter to your decision, you can inspect the evidence and run the methodology yourself.
The Qodo Code Review Benchmark 1.0 is publicly available in our benchmark GitHub organization. Full research paper: "Beyond Surface-Level Bugs: Benchmarking AI Code Review on Scale."