AI code review tools promise to catch what human reviewers miss. But which one actually delivers? I planted 38 deliberate bugs, security vulnerabilities, and code smells into a .NET 10 codebase — then let three AI reviewers loose on the same PR. Here's what happened.
Why This Comparison?
Every major platform now offers AI-powered code review: GitHub has Copilot, Cursor has BugBot, and Anthropic has Claude. They all claim to catch security issues, bugs, and code quality problems. But marketing aside, I wanted answers to three practical questions:
- How many issues does each tool actually catch? Not in a curated demo — in a realistic PR with a mix of critical vulnerabilities and subtle code smells.
- How do they behave across multiple review cycles? A first pass is one thing. What happens when you fix the findings and re-request a review?
- What's the developer experience like? Detection rate is a number. But does the tool actually help you ship with confidence?
To find out, I designed a controlled experiment.
The Setup
The Codebase
I scaffolded a clean .NET 10 solution called Demostr8 with two projects:
- Demostr8.Api — an ASP.NET Core Web API with JWT authentication, Entity Framework Core, CORS, and OpenAPI. Controllers for orders and users, service layer, typed options pattern, global exception middleware.
- Demostr8.Worker — a BackgroundService that polls for pending orders and processes payments via an external gateway using IHttpClientFactory.
Production-quality code. Proper async/await, CancellationToken propagation, BCrypt password hashing, parameterized queries, dependency injection — the works. This became the main branch baseline.
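To make the baseline concrete, here's a minimal standalone sketch (not the actual Demostr8 code) of the CancellationToken propagation pattern the clean branch followed on every async method:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class Baseline
{
    // Accepting and forwarding the token lets callers cancel work cooperatively —
    // the pattern the poisoned PR later strips out (issues 17 and 29).
    static async Task ProcessOrderAsync(CancellationToken ct)
    {
        await Task.Delay(TimeSpan.FromSeconds(10), ct); // honours cancellation
    }

    static async Task Main()
    {
        using var cts = new CancellationTokenSource(TimeSpan.FromMilliseconds(50));
        try
        {
            await ProcessOrderAsync(cts.Token);
        }
        catch (OperationCanceledException)
        {
            Console.WriteLine("cancelled cleanly"); // the work stops as soon as the token fires
        }
    }
}
```

Drop the token parameter anywhere along the chain and cancellation silently stops propagating — the host keeps waiting for work that can no longer be stopped.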
The Poisoned PR
On a feature branch (feature/order-processing-improvements), I introduced 38 deliberate issues across 8 files, designed to look like genuine developer mistakes — the kind of shortcuts and oversights that slip through under deadline pressure. The commit message? A perfectly innocent "feat: improve order processing and streamline authentication flow".
The 38 issues span four categories:
| Category | Count | Examples |
|---|---|---|
| Security | 18 | SQL injection, hardcoded credentials, plaintext passwords, missing auth, wildcard CORS, API keys in URLs |
| Bugs | 8 | async void, fire-and-forget tasks, swallowed exceptions, missing null checks, .Result deadlocks |
| Code Smells | 11 | Missing CancellationToken, Thread.Sleep in async code, wrong HTTP status codes, magic numbers, removed validation |
| Performance | 1 | N+1 query (.Include() replaced with a foreach loop) |
The severity is deliberately graduated. Some issues are showstoppers (SQL injection with sa credentials in a string literal). Others are subtler (returning Ok() instead of NoContent() on DELETE, or removing a PollingInterval constant in favor of an inline 30000).
The Full List
Here's every issue I planted, grouped by category. This is the scorecard the three tools were measured against.
Security (18 issues)
| # | File | Issue |
|---|---|---|
| 1 | Program.cs (Api) | Hardcoded SQL Server connection string with sa password in UseSqlServer() |
| 2 | Program.cs (Api) | Hardcoded JWT signing key "demostr8-jwt-secret-key-2024" in AddJwtBearer() |
| 3 | Program.cs (Api) | app.UseAuthorization() removed from middleware pipeline |
| 4 | Program.cs (Api) | CORS changed to AllowAnyOrigin().AllowAnyMethod().AllowAnyHeader() |
| 5 | Program.cs (Api) | app.UseHttpsRedirection() removed |
| 6 | OrdersController.cs | [Authorize] removed from CreateOrder POST endpoint |
| 11 | OrdersController.cs | [Authorize(Roles = "Admin")] removed from DeleteOrder |
| 14 | UsersController.cs | Login changed to [FromQuery] string username, string password — credentials visible in URLs and logs |
| 20 | OrderService.cs | Report query uses FromSqlRaw with string interpolation — SQL injection |
| 21 | UserService.cs | Raw ADO.NET with $"SELECT ... WHERE Username = '{request.Username}'" and hardcoded sa connection string |
| 22 | UserService.cs | BCrypt.Verify replaced with plaintext comparison request.Password != user.PasswordHash |
| 24 | UserService.cs | Password stored as plaintext: PasswordHash = request.Password |
| 25 | UserService.cs | AuthenticateAsync returns full User entity including PasswordHash in the response |
| 32 | PaymentService.cs | IOptions<PaymentGatewayOptions> removed, API key hardcoded as "sk_live_demostr8_payment_key_2024" |
| 33 | PaymentService.cs | API key passed as URL query string: $"payments/process?apiKey={ApiKey}" |
| 36 | PaymentService.cs | UpdateOrderStatusAsync with undisposed SqlConnection and $"UPDATE ... SET Status = '{status}'" — SQL injection |
| 37 | PaymentService.cs | private const string ConnectionString with production sa credentials |
| 38 | Program.cs (Worker) | PaymentService changed from AddScoped to AddSingleton — DI lifetime mismatch with scoped dependencies |
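To see why the string-interpolated queries (issues 20, 21, and 36) matter, here's a toy illustration that runs standalone — the attacker input is hypothetical, but the composition problem is exactly what interpolating into FromSqlRaw or raw ADO.NET produces:

```csharp
using System;

class InjectionDemo
{
    // Interpolation bakes attacker-controlled text directly into the SQL statement.
    static string BuildQueryUnsafe(string username) =>
        $"SELECT * FROM Users WHERE Username = '{username}'";

    static void Main()
    {
        var attacker = "x' OR '1'='1";
        Console.WriteLine(BuildQueryUnsafe(attacker));
        // SELECT * FROM Users WHERE Username = 'x' OR '1'='1'
        // The WHERE clause now matches every row. The fix is parameterization:
        // EF Core's FromSqlInterpolated (which converts interpolated values into
        // parameters), or SqlParameter with raw ADO.NET — input travels as data,
        // never as SQL text.
    }
}
```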
Bugs (8 issues)
| # | File | Issue |
|---|---|---|
| 7 | OrdersController.cs | GetAllOrders uses .Result blocking call — sync-over-async deadlock risk |
| 8 | OrdersController.cs | GetOrderById returns Ok(order) without null check — 200 with null body |
| 18 | OrderService.cs | SaveChangesAsync wrapped in empty try/catch(Exception){} — swallowed exception |
| 19 | OrderService.cs | Notification call changed to _ = fire-and-forget, try/catch removed from notification method |
| 23 | UserService.cs | Duplicate username check removed from RegisterAsync |
| 27 | OrderProcessingWorker.cs | ProcessOrderAsync changed to async void — unobserved exceptions crash the process |
| 34 | PaymentService.cs | response.EnsureSuccessStatusCode() removed — HTTP failures silently ignored |
| 35 | PaymentService.cs | Entire ProcessPaymentAsync body wrapped in empty try/catch(Exception){} |
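Issues 19, 27, and 35 share one failure mode: exceptions nobody observes. This simplified sketch (not the actual PR code) contrasts a discarded task with an awaited one:

```csharp
using System;
using System.Threading.Tasks;

class UnobservedDemo
{
    static async Task NotifyAsync()
    {
        await Task.Yield();
        throw new InvalidOperationException("notification failed");
    }

    static async Task Main()
    {
        // Fire-and-forget: the discard drops the Task, so its exception is never observed.
        _ = NotifyAsync();
        await Task.Delay(100); // execution continues as if nothing went wrong

        // Awaited: the same failure surfaces where the caller can log or handle it.
        try { await NotifyAsync(); }
        catch (InvalidOperationException ex) { Console.WriteLine($"observed: {ex.Message}"); }
    }
}
```

async void (issue 27) is worse still: there's no Task to discard, so the exception escapes onto the thread pool and tears down the process.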
Code Smells (11 issues)
| # | File | Issue |
|---|---|---|
| 9 | OrdersController.cs | ModelState.IsValid check removed from CreateOrder |
| 10 | OrdersController.cs | CreateOrder returns Ok(order) instead of CreatedAtAction() — wrong 200 status |
| 12 | OrdersController.cs | DeleteOrder returns Ok() instead of NoContent() — wrong 200 status |
| 13 | OrdersController.cs | Report endpoint accepts raw string params with DateTime.Parse instead of typed DTO, no CancellationToken |
| 15 | UsersController.cs | ModelState.IsValid check removed from Register |
| 17 | OrderService.cs | CancellationToken removed from GetOrderByIdAsync |
| 26 | OrderProcessingWorker.cs | File.ReadAllLinesAsync replaced with synchronous File.ReadAllLines |
| 28 | OrderProcessingWorker.cs | await Task.Delay() replaced with Thread.Sleep(30000) |
| 29 | OrderProcessingWorker.cs | CancellationToken removed from ProcessOrderAsync |
| 30 | OrderProcessingWorker.cs | PollingInterval named constant removed, raw 30000 used inline |
| 31 | PaymentService.cs | IHttpClientFactory replaced with private readonly HttpClient _httpClient = new() — socket exhaustion |
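Issue 31 deserves a note: HttpClient is designed to be shared, and newing one up per service instance can exhaust sockets because disposed connections linger in TIME_WAIT. A minimal sketch of the safe direct-instantiation pattern (the hypothetical gateway URL is illustrative; IHttpClientFactory, which the baseline used, remains the better option in ASP.NET Core):

```csharp
using System;
using System.Net.Http;

static class PaymentClient
{
    // One shared instance for the process lifetime: connections are pooled and
    // reused instead of a fresh socket being opened per service instance.
    private static readonly HttpClient Shared = new()
    {
        BaseAddress = new Uri("https://gateway.example/") // hypothetical gateway
    };

    public static HttpClient Instance => Shared;
}

class Demo
{
    static void Main()
    {
        // Every caller sees the same underlying handler and connection pool.
        Console.WriteLine(ReferenceEquals(PaymentClient.Instance, PaymentClient.Instance)); // True
    }
}
```

The trade-off: a process-lifetime static client never picks up DNS changes, which is precisely the problem IHttpClientFactory solves by rotating handlers under the hood.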
Performance (1 issue)
| # | File | Issue |
|---|---|---|
| 16 | OrderService.cs | .Include(o => o.Customer) replaced with N+1 foreach loop loading each customer individually |
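Here's the shape of issue 16, simulated with in-memory collections rather than EF Core so it runs standalone: the N+1 version issues one lookup per order, where the baseline's .Include() fetched orders and customers in a single query.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class NPlusOneDemo
{
    record Customer(int Id, string Name);
    record Order(int Id, int CustomerId);

    static int DbHits; // stand-in for round trips to the database

    static Customer LoadCustomer(Dictionary<int, Customer> db, int id)
    {
        DbHits++; // each call simulates a separate query
        return db[id];
    }

    static void Main()
    {
        var customers = Enumerable.Range(1, 100)
            .ToDictionary(i => i, i => new Customer(i, $"Customer {i}"));
        var orders = Enumerable.Range(1, 100).Select(i => new Order(i, i)).ToList();

        // N+1 pattern: one "query" per order — 100 extra round trips.
        DbHits = 0;
        foreach (var order in orders)
            _ = LoadCustomer(customers, order.CustomerId);
        Console.WriteLine($"N+1: {DbHits} hits");       // N+1: 100 hits

        // Batched pattern (what .Include() achieves): one join-style fetch.
        DbHits = 1; // a single query returns orders with their customers
        var joined = orders.Join(customers.Values,
            o => o.CustomerId, c => c.Id, (o, c) => (o, c)).ToList();
        Console.WriteLine($"Batched: {DbHits} hit(s)"); // Batched: 1 hit(s)
    }
}
```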
Three Identical PRs
The same branch was pushed to three separate GitHub repos, each configured with a different AI reviewer:
| Repo | Tool |
|---|---|
| demostr8-bugbot | BugBot (Cursor) |
| demostr8-copilot | GitHub Copilot |
| demostr8-claude | Claude (Anthropic) |
Same code. Same diff. Three different reviewers. Let the showdown begin.
Round 1: First Pass Results
The Scoreboard
| Tool | Detected | Missed | Detection Rate |
|---|---|---|---|
| Copilot | 34/38 | 4 | 89.5% |
| Claude | 32/38 | 6 | 84.2% |
| BugBot | 29/38 | 9 | 76.3% |
Security coverage was the strongest area for all three tools, with each catching the vast majority of SQL injections, hardcoded credentials, auth bypasses, and data exposure issues.
The differences emerged in the subtler categories.
Detection by Category
| Category | Total | BugBot | Copilot | Claude |
|---|---|---|---|---|
| Security | 18 | 16 | 18 | 17 |
| Bugs | 8 | 8 | 7 | 7 |
| Code Smells | 11 | 4 | 8 | 7 |
| Performance | 1 | 1 | 1 | 1 |
Security: all strong. Copilot had a clean sweep at 18/18. Claude missed one (API key in query string). BugBot missed two (API key in query string and DI singleton mismatch).
Code smells: the real differentiator. BugBot caught only 4 of the 11 code smells on the first pass, while Copilot caught 8 and Claude 7. Issues like wrong HTTP status codes, removed ModelState validation, and missing CancellationToken parameters separated the tools.
What Nobody Caught (First Pass)
Three issues sailed past all three reviewers:
| # | Issue | Category |
|---|---|---|
| 12 | DeleteOrder returns Ok() instead of NoContent() (204) | Code Smell |
| 15 | ModelState.IsValid check removed from Register | Code Smell |
| 30 | PollingInterval constant removed, raw 30000 in Thread.Sleep | Code Smell |
All three are code convention issues — not security vulnerabilities or correctness bugs. But they're the kind of thing a senior human reviewer would flag in seconds.
Review Style
Beyond the numbers, each tool had a distinct personality:
- Copilot posted 33 individual inline comments with code suggestions. Thorough, granular, actionable.
- Claude posted 25 inline comments plus a structured summary with severity tiers, a "DO NOT MERGE" recommendation, and a credential rotation checklist. Opinionated, contextual, decisive.
- BugBot posted 22 comments, often grouping related issues together. Concise, low noise.
Rounds 2, 3, 4: The Re-Review Loop
Here's where it gets interesting. After fixing the issues each tool found, I pushed the fixes and re-requested reviews. Would the tools find more?
Copilot: The Relentless Reviewer
| Pass | New Findings | Cumulative |
|---|---|---|
| 1 | 34 | 34/38 |
| 2 | +3 (#9, #12, #15) | 37/38 |
| 3 | +1 (#30) | 38/38 |
Copilot kept digging. On pass 2, it caught the missing ModelState checks and the DELETE status code — issues it had overlooked when the PR was noisier with 38 problems. On pass 3, it found the magic number. Perfect score across 3 passes.
It also raised 5 new observations about UpdateOrderStatusAsync being dead code with architectural inconsistencies — legitimate findings beyond the original 38.
BugBot: The Iterative Improver
| Pass | New Findings | Cumulative |
|---|---|---|
| 1 | 29 | 29/38 |
| 2 | +3 (#10, #13, #33) | 32/38 |
| 3 | +3 (#12, #17, #26) | 35/38 |
| 4 | +0 (new observations only) | 35/38 |
BugBot also improved with each pass, catching 6 more issues across rounds 2 and 3. It even caught a regression introduced by its own fix — the agent that fixed issue #13 used BadRequest(ModelState) instead of ValidationProblem(ModelState), and BugBot flagged the inconsistency. That's a genuinely impressive bit of self-awareness.
Still, after 4 passes, it plateaued at 35/38 — never catching #9, #15, or #30.
Claude: The Structured Escalator
| Pass | New Findings | Assessment | Cumulative |
|---|---|---|---|
| 1 | 32 | DO NOT MERGE | 32/38 |
| 2 | +3 (#10, #17, #29) | NEEDS MINOR FIXES | 35/38 |
| 3 | +3 (#12, #15, #30) | NEARLY READY | 38/38 |
Claude surprised me. On its second pass, it found 3 more issues and changed its verdict from "DO NOT MERGE" to "NEEDS MINOR FIXES" — explicitly confirming all CRITICAL and HIGH issues were resolved. On pass 3, it caught the last 3 — including the magic number that no tool found on the first pass — moving to "NEARLY READY." On pass 4, after all fixes were applied, it gave the green light: "APPROVED — READY TO MERGE."
Same 38/38 as Copilot, but with a merge confidence progression at every step. It also flagged that the API key fix (moved from query string to body) should use a header instead — a quality-of-fix observation none of the other tools made. And even on the final approval, it included a non-blocking suggestion for a follow-up (configure HttpClient in DI registration) and a persistent reminder to rotate the credentials that had been in git history.
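Claude's quality-of-fix point is easy to demonstrate: query strings end up in access logs, proxies, and browser history, while request headers normally don't. A standalone sketch (the URL and header name are hypothetical, and the key is a placeholder):

```csharp
using System;
using System.Net.Http;

class ApiKeyDemo
{
    static void Main()
    {
        const string apiKey = "sk_test_example"; // placeholder, never a real key

        // Leaky: the key rides in the URL, so it lands in server logs and proxies.
        var leaky = new HttpRequestMessage(HttpMethod.Post,
            $"https://gateway.example/payments/process?apiKey={apiKey}");

        // Better: the key travels in a header, which access logs normally omit.
        var safer = new HttpRequestMessage(HttpMethod.Post,
            "https://gateway.example/payments/process");
        safer.Headers.Add("X-Api-Key", apiKey); // header name is gateway-specific

        Console.WriteLine(leaky.RequestUri);                    // key visible in the URI
        Console.WriteLine(safer.Headers.Contains("X-Api-Key")); // True
    }
}
```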
The Multi-Pass Scoreboard
| Tool | First Pass | Final (All Passes) | Passes to 38/38 |
|---|---|---|---|
| Copilot | 34/38 (89.5%) | 38/38 (100%) | 3 |
| Claude | 32/38 (84.2%) | 38/38 (100%) | 3 |
| BugBot | 29/38 (76.3%) | 35/38 (92.1%) | Never (plateaued at 35) |
The Developer Experience Problem
Copilot and Claude both hit 38/38. Same detection. So what's the difference? Everything about how they tell you.
Copilot: Thoroughness without closure
With Copilot, every fix-and-push triggered a new round of findings. Pass 1: fix 34 issues. Pass 2: three more. Pass 3: one more. It felt like playing whack-a-mole with an increasingly pedantic reviewer.
The fundamental problem: there's no merge signal. After addressing every finding, you re-request a review and hold your breath. Will it find more? You don't know until it runs. And when it does find more, you're back in the loop. Copilot never says "this is good enough to merge."
BugBot: Partial resolution tracking, but no verdict
BugBot had a similar pattern of surfacing new issues on each pass. To be fair, GitHub does automatically mark some inline comments as "Outdated" when the referenced code changes — and BugBot's comments benefit from this, showing a "Show resolved" label on fixed issues.
But it's inconsistent: some fixed issues get the "Outdated" tag while others don't, even when the fix is clearly in place. And crucially, there's no summary confirming which findings were addressed — you have to scroll through every comment thread to piece together the status yourself. Like Copilot, BugBot never gives you an overall "ready to merge" verdict.
The shared problem
In a real team workflow, both tools create:
- Developer fatigue from repeated review cycles
- No clear "green light" for merge readiness
- Uncertainty about whether the next pass will surface yet more issues
- A flat list of findings where everything feels equally urgent — no distinction between blockers and nice-to-haves
Claude: Same detection, with a merge confidence progression
Claude also reached 38/38, but the experience was fundamentally different. Like the other tools, it posted detailed inline comments pinned to exact code locations with code suggestions and "Fix this" links.
But on top of that, every comment was tagged with a severity level (CRITICAL, HIGH, MEDIUM, LOW), and each review pass included an explicit merge readiness assessment:
- Pass 1 (32 found): "DO NOT MERGE" — 10 CRITICAL, 8 HIGH, with a credential rotation checklist
- Pass 2 (+3 found): "NEEDS MINOR FIXES" — all CRITICAL/HIGH resolved, remaining items are MEDIUM/LOW
- Pass 3 (+3 found): "NEARLY READY" — a few minor items left, everything else confirmed fixed
- Pass 4 (0 new): "APPROVED — READY TO MERGE" — all issues resolved, one non-blocking follow-up suggestion
This progression gave me something Copilot and BugBot never did: confirmation that my fixes actually addressed the review findings. At each pass, Claude explicitly verified which previous issues were resolved before flagging new ones. When it moved to "NEEDS MINOR FIXES", I knew the CRITICAL/HIGH items were confirmed fixed — not just absent from the new comments, but explicitly ticked off. And when it finally said "APPROVED — READY TO MERGE", it wasn't just silence — it was a definitive sign-off that every finding across all previous reviews had been addressed.
Even on the final approval, Claude didn't just rubber-stamp it. It included a non-blocking suggestion (configure HttpClient headers in DI registration instead of per-call) and a persistent reminder to rotate credentials that still existed in git history. That's the kind of thoughtful, context-aware feedback that builds trust.
The Trade-Off Table
| Dimension | Copilot | BugBot | Claude |
|---|---|---|---|
| Total detection (multi-pass) | 38/38 (100%) | 35/38 (92.1%) | 38/38 (100%) |
| Merge confidence signal | None | None | Clear progression at each pass |
| Review cycles to 38/38 | 3 | Never (plateaued at 35) | 3 |
| Severity prioritization | No — flat list | No — flat list | Yes — CRITICAL/HIGH/MEDIUM/LOW |
| Developer cognitive load | High (when does it end?) | High (when does it end?) | Low (clear verdict + priorities) |
| Catches its own regressions | No | Yes | No |
| Quality-of-fix feedback | No | No | Yes (API key body -> header) |
The Verdict
The scoreboard ended with two tools tied at 38/38 and one at 35. But the numbers don't capture the full picture.
- Copilot wins on first-pass breadth — 34/38 out of the gate, the highest initial detection rate. If you want the most findings upfront in the fewest passes, Copilot delivers.
- BugBot wins on regression awareness — it's the only tool that caught a bug introduced by its own fix. For iterative development on complex PRs, that's genuinely valuable. But its 35/38 ceiling means some issues will always slip through.
- Claude wins on developer experience — same 38/38 as Copilot, but with structured severity tiers, a merge confidence progression at every pass, and quality-of-fix feedback that goes beyond just finding problems. It doesn't just tell you what's wrong — it tells you where you stand and what to prioritise.
All three tools caught nearly every security vulnerability and correctness bug on the first pass, and the handful they initially missed were picked up within a pass or two. The issues that lingered across multiple passes — or were never caught at all — were code smells and conventions. That's reassuring: for the stuff that actually matters in production, all three tools have your back.
Detection matters — you want your reviewer to catch as much as possible. But detection alone isn't enough. You also need a clear signal that your fixes have been addressed and the PR is ready to merge. Claude was the only tool that provided both.
One caveat: Claude's review was driven by a GitHub Actions workflow with a structured prompt specifying review focus areas like OWASP Top 10, .NET-specific patterns, and severity ratings (see Appendix). Copilot and BugBot used their default configurations with no custom instructions. This is a fair criticism of the comparison — a tuned prompt may have given Claude an advantage, particularly on .NET-specific issues. That said, both Copilot and BugBot support custom review instructions (via .github/copilot-review-instructions.md and .cursor/rules respectively), so you could potentially configure them to produce similar structured output. I haven't tested this, but it's worth exploring.
What I can say is that the merge readiness progression (DO NOT MERGE -> APPROVED) was not part of the prompt — Claude added that on its own.
Appendix: Claude Code Review Workflow
Here's the GitHub Actions workflow that powered Claude's review, using the claude-code-action.
```yaml
name: Claude Code Review

on:
  pull_request:
    types: [opened, synchronize, ready_for_review, reopened]

jobs:
  claude-review:
    if: ${{ !github.event.pull_request.draft }}
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
      issues: write
      id-token: write
      actions: read
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Run Claude Code Review
        id: claude-review
        uses: anthropics/claude-code-action@v1
        with:
          claude_code_oauth_token: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }}
          track_progress: true
          prompt: |
            REPO: ${{ github.repository }}
            PR NUMBER: ${{ github.event.pull_request.number }}

            Review this pull request with focus on:

            ## Code Quality
            - Naming conventions and readability
            - DRY violations or unnecessary complexity
            - Dead code or commented-out code

            ## Bugs & Logic
            - Null reference risks
            - Off-by-one errors or incorrect boundary conditions
            - Race conditions or thread safety issues
            - Unhandled exceptions or missing error handling

            ## Security (OWASP Top 10)
            - SQL injection or command injection
            - Hardcoded secrets or credentials
            - Broken access control or missing authorization checks
            - Input validation and sanitization

            ## .NET Specific
            - Proper async/await usage (no async void, no missing await)
            - IDisposable resources not disposed
            - LINQ misuse or performance pitfalls (N+1 queries)
            - Correct dependency injection patterns

            ## Performance
            - Unnecessary allocations in hot paths
            - Missing pagination on collection endpoints
            - Inefficient database queries

            Rate issues by severity: CRITICAL, HIGH, MEDIUM, LOW.
            Use inline comments for specific code issues.
            Post a summary comment with an overall assessment.
          claude_args: |
            --allowedTools "mcp__github_inline_comment__create_inline_comment,Bash(gh pr comment:*),Bash(gh pr diff:*),Bash(gh pr view:*)"
```
Disclaimer
I have no affiliation with Anthropic, Cursor, or GitHub. I'm a paying user of all three — Claude, Cursor (BugBot), and GitHub Copilot are all tools in my daily workflow. This experiment was also built, executed, and fixed using Claude Code (Anthropic's CLI tool), which I should note for full transparency.
My objective wasn't to crown a winner on a leaderboard. It was to answer a practical question: which tool gives me the most confidence to merge? The data speaks for itself.
Take this with a pinch of salt. Every project is different — language, framework, codebase size, and team conventions all influence how these tools perform. A Python FastAPI project or a Go microservice may yield very different results. This experiment is a baseline under controlled conditions, not a definitive ranking. Use it as a starting point, then test these tools against your own codebase before committing to one.
Every tool has its place. You may find one works better for you than it did for me — and that's perfectly fine. There's no right or wrong answer here. The best tool is the one that fits your workflow and gives you confidence to ship.