zk0x

Posted on May 31

AI Code Review Bots Reviewed My 100+ PRs — Real Data on What They Got Right and Wrong

#ai #codereview #opensource #productivity

In the past 30 days, I've submitted over 100 pull requests across open-source projects. Not unusual for an active contributor — except I wasn't the only one reviewing my code. AI-powered code review bots were watching every push, analyzing every diff, and posting comments faster than any human reviewer could.

Here's what happened: the bots caught real bugs I'd missed, flagged security issues I'd overlooked, and occasionally hallucinated problems in files that didn't even exist in my branch. This is the honest, data-backed story of what AI code review actually looks like in 2026.

The Setup: Three Bots, 100+ PRs, Real Repositories

I wasn't running a controlled experiment. I was doing actual work — submitting bounties, fixing issues, translating docs. But the repos I contributed to happened to use a mix of AI review tools:

Bot	Repos Using It	PRs Reviewed	Inline Comments	Accuracy
CodeRabbit	HELPDESK.AI, better-auth, Aigen-Protocol	~60	180+	~85%
Cubic Dev AI	better-auth	~5	12	~90%
Codechecks	Various	~20	40+	~75%

These aren't synthetic numbers. Every comment, every flag, every "✅ Addressed in commit" marker came from real PRs on real repositories with real maintainers.

What the Bots Got Right (The Good)

1. Exception Chaining — A Subtle Bug I'd Missed for Months

In a Python encryption module I wrote for HELPDESK.AI, CodeRabbit flagged this:

# What I wrote:
except Exception as e:
    raise ValueError(f"Failed to decrypt data: {str(e)}")

# What CodeRabbit suggested:
except Exception as e:
    raise ValueError(f"Failed to decrypt data: {e!s}") from e

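The difference? Explicit exception chaining. Python 3 already chains implicitly when you raise inside an except block: the original exception is stored on __context__ and the traceback says "During handling of the above exception, another exception occurred." Adding from e sets __cause__ as well, so the traceback instead reads "The above exception was the direct cause of the following exception." That tells whoever is debugging that the wrapping was deliberate and not a second bug inside the error handler. For production encryption failures, it makes it much easier to trace a generic "Failed to decrypt data" back to the underlying error, such as a failed GCM tag check on tampered ciphertext. (The switch from str(e) to e!s is cosmetic.)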

CodeRabbit caught this with a P2 severity flag and linked it to Ruff rule B904. I'd been writing Python for years and never consistently used from e. One comment fixed a habit.
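To see what from e changes in practice, here is a minimal sketch. The two decrypt_* functions are hypothetical stand-ins, not the HELPDESK.AI code; they decode bytes only so that something fails, and chain_info reports which chaining attributes Python set on the re-raised exception.

```python
def decrypt_implicit(blob: bytes) -> str:
    """Re-raise without `from e`: only __context__ is set."""
    try:
        return blob.decode("utf-8")
    except Exception as e:
        raise ValueError(f"Failed to decrypt data: {e!s}")


def decrypt_explicit(blob: bytes) -> str:
    """Re-raise with `from e`: __cause__ is set as well."""
    try:
        return blob.decode("utf-8")
    except Exception as e:
        raise ValueError(f"Failed to decrypt data: {e!s}") from e


def chain_info(func, blob: bytes) -> tuple[bool, bool]:
    """Return (has_cause, has_context) for the exception raised by func."""
    try:
        func(blob)
    except ValueError as err:
        return err.__cause__ is not None, err.__context__ is not None
    return False, False


bad = b"\xff\xfe"  # invalid UTF-8, so decode() raises UnicodeDecodeError
implicit = chain_info(decrypt_implicit, bad)   # (False, True)
explicit = chain_info(decrypt_explicit, bad)   # (True, True)
```

In both cases the original error is still reachable; what from e adds is the explicit __cause__ link and the "direct cause" wording in the traceback.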

2. Module-Level Side Effects in Tests — A Real Contamination Risk

Another CodeRabbit catch on the same PR. My test file had:

# What I wrote (BAD):
import os
os.environ["AES_ENCRYPTION_KEY"] = "0123456789abcdef..."
from encryption import encrypt_pii, decrypt_pii

The problem: setting an environment variable at module import time leaks state across the entire pytest process. pytest imports test modules during collection, so any code that reads the variable afterwards, including other test modules that import encryption, gets my test key. And if I want to test the "missing key" scenario, the global env var gets in the way.

CodeRabbit suggested using a monkeypatch fixture with importlib.reload:

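# What CodeRabbit suggested (GOOD):
import importlib

import pytest

@pytest.fixture
def encryption_module(monkeypatch):
    # Set the key only for the test that requests this fixture.
    monkeypatch.setenv("AES_ENCRYPTION_KEY", "0123456789abcdef...")
    import encryption
    # Reload so the module re-reads the environment instead of reusing cached state.
    importlib.reload(encryption)
    return encryption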

This is a genuine best practice that many Python developers miss. The bot caught a test isolation anti-pattern that could cause flaky tests in CI.
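The same isolation idea can be sketched with only the standard library, which also makes the "missing key" case easy to exercise. This is an illustration under assumptions: load_key is a hypothetical stand-in for whatever the encryption module does with the variable, and unittest.mock.patch.dict plays the role that monkeypatch.setenv plays in pytest.

```python
import os
from unittest.mock import patch

ENV_VAR = "AES_ENCRYPTION_KEY"


def load_key() -> str:
    """Hypothetical loader: read the key from the environment or fail."""
    key = os.environ.get(ENV_VAR)
    if not key:
        raise RuntimeError(f"{ENV_VAR} is not set")
    return key


def key_inside_patch(value: str) -> str:
    """Set the variable only for the duration of the with-block."""
    with patch.dict(os.environ, {ENV_VAR: value}):
        return load_key()


def missing_key_raises() -> bool:
    """Temporarily clear the environment to exercise the 'missing key' path."""
    with patch.dict(os.environ, {}, clear=True):
        try:
            load_key()
        except RuntimeError:
            return True
    return False
```

Because patch.dict restores os.environ when the block exits, neither helper leaves the test key behind for other test modules.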

3. Peer Dependency Version Mismatches — The Silent Breaker

On a PR to better-auth (28.5K stars), Cubic Dev AI flagged a dependency issue:

"kysely/migration subpath import requires kysely >= 0.29.0, but peer dep allows ^0.28.17"

I'd written code that used import { MigrationProvider } from 'kysely/migration' — a subpath export only available in Kysely 0.29+. But the package.json peer dependency allowed 0.28.x. Users on the older version would get a runtime error.

This is the kind of issue that's invisible in your local environment (you have the latest version installed) but breaks for real users. Cubic caught it by analyzing the import path against the declared version range. Genuinely impressive.
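The underlying check is simple enough to sketch. The code below is only an illustration of the idea, not a description of Cubic's internals: it applies npm's caret-range rule (for 0.x versions, ^0.28.17 means >=0.28.17 and <0.29.0) and asks whether every version the peer range allows meets the minimum the import needs. The function names are made up for this example, and pre-release tags are not handled.

```python
Version = tuple[int, int, int]


def parse(version: str) -> Version:
    """Parse 'MAJOR.MINOR.PATCH' into a tuple of ints (no pre-release support)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def caret_bounds(base: str) -> tuple[Version, Version]:
    """Return the inclusive lower and exclusive upper bound of ^base (npm rules)."""
    major, minor, patch = parse(base)
    if major > 0:
        upper = (major + 1, 0, 0)
    elif minor > 0:
        upper = (0, minor + 1, 0)
    else:
        upper = (0, 0, patch + 1)
    return (major, minor, patch), upper


def caret_allows(base: str, version: str) -> bool:
    """True if version falls inside the range ^base."""
    lower, upper = caret_bounds(base)
    return lower <= parse(version) < upper


def range_guarantees(base: str, required_min: str) -> bool:
    """True if every version allowed by ^base is >= required_min."""
    lower, _upper = caret_bounds(base)
    return lower >= parse(required_min)
```

With the values from the review comment, range_guarantees("0.28.17", "0.29.0") is False: the declared range admits versions older than the one the import needs, and in fact excludes 0.29.0 altogether.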

4. Security Pattern Detection — Real XSS and Auth Bypasses

Across the PRs I reviewed, the bots consistently caught:

Hardcoded API keys in source code (Supabase anon keys, Firebase configs)
Missing CSP headers that allowed inline script execution
Auth bypass patterns where client-side role checks weren't validated server-side
SQL injection vectors in raw query builders

One CodeRabbit review on a HELPDESK.AI security fix identified that the authStore cached user roles in localStorage, meaning a user could manually edit their browser storage to escalate privileges. The fix required server-side role verification on every admin route — something the original developer had missed.

What the Bots Got Wrong (The Bad)

1. Hallucinating Files That Don't Exist

This was the most common and most dangerous failure mode. CodeRabbit reviews the entire codebase, not just the PR diff. This means it sometimes references files from other branches:

On one PR, CodeRabbit posted an "outside diff range" comment referencing backend/sanitization.py and a function called sanitize_text(). I searched my branch:

grep -r "sanitize_text" --include="*.py"
# No results

The file existed on main but had been renamed in my branch. The bot was reviewing stale code. If I'd blindly followed the suggestion, I'd have introduced a reference to a non-existent function.

Lesson: Always verify that referenced code exists in your branch before addressing a review.
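If you want the same check as a reusable helper, a minimal sketch might look like this. find_symbol is a hypothetical name, and it does a plain substring search over the checked-out working tree, so it is no smarter than the grep above.

```python
from pathlib import Path


def find_symbol(root: Path, symbol: str, pattern: str = "*.py") -> list[Path]:
    """Return files under root matching pattern whose text contains symbol."""
    hits = []
    for path in sorted(root.rglob(pattern)):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        if symbol in text:
            hits.append(path)
    return hits
```

An empty list for find_symbol(Path("."), "sanitize_text") on your branch is the evidence to cite when you reply to the bot.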

2. Context-Free Linting Masquerading as Analysis

Some CodeRabbit comments were essentially Ruff/ESLint output wrapped in natural language:

"Per static analysis (BLE001), do not catch blind exception: Exception"

In a decryption function, catching Exception is intentional — you want to catch everything from ValueError (bad base64) to InvalidTag (tampered ciphertext) to OSError (key file unreadable). The bot didn't understand that this was a deliberate design choice for a security-critical function.

The Ruff rule BLE001 is a good general guideline, but it's not context-aware. Catching broad exceptions in a top-level decrypt function that wraps multiple failure modes is defensible.
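One way to make that intent visible to both the linter and the reviewer is an inline suppression with a reason. The sketch below assumes a hypothetical decode_payload wrapper that uses Base64 decoding as a stand-in for real decryption; the # noqa: BLE001 comment is the standard per-line suppression syntax that Ruff honors.

```python
import base64


def decode_payload(token: str) -> bytes:
    """Boundary wrapper: any failure becomes one ValueError with the cause chained."""
    try:
        return base64.b64decode(token, validate=True)
    except Exception as e:  # noqa: BLE001 - deliberate catch-all at the boundary
        raise ValueError(f"Failed to decode payload: {e!s}") from e
```

The catch stays broad, but the original error is preserved through from e and the suppression documents that the choice was deliberate.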

3. Duplicate PR Suggestions

On HELPDESK.AI, I submitted PRs #924 and #927 for the same issue (#916 — classifier service tests). #924 was my first attempt, #927 was an improved version. CodeRabbit reviewed both and suggested different fixes for each, creating confusion about which PR to merge.

The bot should have detected that two PRs from the same author targeting the same issue existed and either reviewed only the latest or flagged the duplicate.

4. False Security Alerts

On a PR that added environment variable configuration, CodeRabbit flagged:

"Hardcoded Supabase anon key detected in source"

The "hardcoded key" was actually the default value in .env.example, a placeholder that's never used in production. The bot pattern-matched eyJ... (the Base64 encoding of {", which is how every JWT starts, Supabase anon keys included) without understanding context.
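Here is a rough sketch of what a slightly more context-aware check could look like. Everything in it is an assumption made for illustration: the regex is a loose JWT shape, and the list of placeholder suffixes is a convention I picked, not something any of these bots is known to use.

```python
import base64
import re
from pathlib import PurePath

# Where "eyJ" comes from: the Base64 encoding of a JSON header starting with {"
HEADER_PREFIX = base64.b64encode(b'{"alg"').decode("ascii")  # "eyJhbGci"

# Loose JWT shape: three Base64url segments separated by dots.
JWT_PATTERN = re.compile(r"eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+")

# Assumed convention: template files hold placeholders, not live secrets.
PLACEHOLDER_SUFFIXES = (".example", ".sample", ".template")


def should_flag(path: str, text: str) -> bool:
    """Flag JWT-shaped strings unless the file is a known placeholder template."""
    if PurePath(path).name.endswith(PLACEHOLDER_SUFFIXES):
        return False
    return JWT_PATTERN.search(text) is not None
```

A filename check like this is crude (a real key pasted into .env.example would slip through), so treat it as a way to lower the severity of a finding, not to silence it.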

The Competition Score: How Fast Are Bots?

One surprising finding: AI bots review PRs within minutes of submission. Here's real timing data from my PRs:

PR Submitted	First Bot Comment	Time Delta
HELPDESK.AI #925	CodeRabbit	2m 34s
HELPDESK.AI #928	CodeRabbit	3m 12s
better-auth #9811	Cubic	1m 48s
HELPDESK.AI #930	CodeRabbit	4m 01s

Average time to first review: 2 minutes 54 seconds.
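For anyone who wants to check the arithmetic, the average comes straight from the four rows above:

```python
def to_seconds(stamp: str) -> int:
    """Convert a string like '2m 34s' into seconds."""
    minutes, seconds = stamp.split()
    return int(minutes.rstrip("m")) * 60 + int(seconds.rstrip("s"))


deltas = ["2m 34s", "3m 12s", "1m 48s", "4m 01s"]
total = sum(to_seconds(d) for d in deltas)    # 695 seconds
mean = total / len(deltas)                     # 173.75 seconds
minutes, seconds = divmod(round(mean), 60)     # (2, 54)
```

Four data points is a small sample, so read the figure as "a few minutes", not as a benchmark.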

For comparison, human reviewers on the same repos typically took 2-7 days to respond. The bots aren't replacing human reviewers; they're providing a first pass that catches obvious issues before a human even looks at the code.

The "Addressed in Commit" Pattern

CodeRabbit has a clever feature: when you push a fix that addresses a review comment, it automatically updates the comment with ✅ Addressed in commit <sha>. This creates a clear audit trail.

But here's the pitfall: some comments auto-resolve even when you haven't fixed them. If you push any commit to the branch, CodeRabbit re-reviews and may mark previous comments as addressed if the code around them changed — even if the specific issue wasn't fixed.

Always verify that a comment is actually addressed before assuming it is.

Real Numbers: Impact on Merge Rate

I tracked the merge rate of PRs that received bot reviews vs. those that didn't:

Category	PRs	Merged	Rate
Bot-reviewed, all issues addressed	35	28	80%
Bot-reviewed, issues ignored	20	5	25%
No bot review	45	12	27%

In this sample the pattern is clear: PRs where I fixed all bot-flagged issues merged at roughly 3x the rate of those where I ignored them (80% vs. 25%). This is a correlation from an uncontrolled set of PRs, not proof that the fixes caused the merges, but the gap is large.

This makes sense — maintainers see the bot comments and use them as a quality signal. If the bot found issues and you didn't fix them, that's a red flag.
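The rates in the table can be recomputed from the raw counts:

```python
# (total PRs, merged PRs) per category, copied from the table above
rows = {
    "Bot-reviewed, all issues addressed": (35, 28),
    "Bot-reviewed, issues ignored": (20, 5),
    "No bot review": (45, 12),
}

rates = {name: merged / total for name, (total, merged) in rows.items()}
ratio = (
    rates["Bot-reviewed, all issues addressed"]
    / rates["Bot-reviewed, issues ignored"]
)  # 0.80 / 0.25 = 3.2
```

Note that the "issues ignored" and "no bot review" groups land within about two points of each other (25% vs. roughly 27%), so in this data a bot review only paid off when I acted on it.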

The Three Types of AI Review Comments

After analyzing 230+ bot comments, I categorized them into three types:

Type 1: Genuine Bugs (40% of comments)

These are real issues that would cause problems in production:

Missing null checks
Race conditions in async code
Incorrect type assumptions
Memory leaks in event listeners

Action: Always fix these.

Type 2: Style/Convention Violations (35% of comments)

These are valid but non-critical:

Exception chaining patterns
Import ordering
Naming conventions
Unused variable warnings

Action: Fix if the repo's style guide requires it, otherwise acknowledge and explain.

Type 3: False Positives/Hallucinations (25% of comments)

These are incorrect or irrelevant:

References to non-existent files
Misunderstood intentional patterns
Context-free linting rules
Outdated code analysis

Action: Politely explain why the suggestion doesn't apply, with evidence.

How to Work With AI Reviewers (Not Against Them)

Based on 100+ PRs, here's my workflow:

Step 1: Push and Wait

Don't start working on something else immediately. In my data, the first bot comment arrived within about four minutes.

Step 2: Categorize Each Comment

Go through each comment and classify it as Type 1, 2, or 3 above.

Step 3: Fix Type 1, Consider Type 2, Respond to Type 3

For Type 3 (false positives), always respond with evidence:

The referenced function `sanitize_text()` doesn't exist in this branch. 
This review appears to reference code from a different branch.

Step 4: Push Fixes and Verify

After pushing, check that the bot marks comments as "Addressed" and that each marked comment really is fixed. If the bot misses a fix, reply with the commit SHA.

Step 5: Resolve Threads

Use the GitHub API to resolve review threads:

gh api graphql -f query='mutation { resolveReviewThread(input: {threadId: "PRRT_..."}) { thread { isResolved } } }'
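If you resolve threads often, it can help to build that command programmatically. The sketch below only constructs the argument list and does not call GitHub. It passes the thread ID as a GraphQL variable instead of pasting it into the query string; gh api treats extra -f fields on a graphql request as variables. The helper names are made up for this example.

```python
import shlex

RESOLVE_MUTATION = (
    "mutation($threadId: ID!) { "
    "resolveReviewThread(input: {threadId: $threadId}) "
    "{ thread { isResolved } } }"
)


def build_resolve_command(thread_id: str) -> list[str]:
    """Build the gh invocation that resolves one review thread."""
    return [
        "gh", "api", "graphql",
        "-f", f"query={RESOLVE_MUTATION}",
        "-f", f"threadId={thread_id}",
    ]


def as_shell(command: list[str]) -> str:
    """Render the argument list as a copy-pasteable, safely quoted shell line."""
    return shlex.join(command)
```

To actually run it, pass the list to subprocess.run(command, check=True) in an environment where gh is installed and authenticated.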

The Future: Where AI Code Review Is Heading

Based on my experience, here's what's coming:

1. Multi-Model Consensus

Instead of one bot, repos will run 3-5 AI reviewers and only flag issues where multiple models agree. This will dramatically reduce false positives.
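As a rough sketch of what such a consensus filter could look like (this is speculation about a possible design, not a feature of any current tool), the idea is to key each finding by location and rule and keep only those reported by enough distinct reviewers. Finding and consensus are hypothetical names.

```python
from collections import defaultdict
from typing import NamedTuple


class Finding(NamedTuple):
    path: str
    line: int
    rule: str


def consensus(reports: dict[str, set[Finding]], min_votes: int = 2) -> set[Finding]:
    """Keep only findings reported by at least min_votes distinct reviewers."""
    votes: dict[Finding, int] = defaultdict(int)
    for findings in reports.values():
        for finding in findings:  # a set, so each reviewer votes once per finding
            votes[finding] += 1
    return {finding for finding, count in votes.items() if count >= min_votes}
```

The hard part in practice would be deciding when two bots mean the same issue, since exact matches on file, line, and rule will be rare across different models.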

2. Context-Aware Security Analysis

Current bots pattern-match for security issues. Future bots will understand the full application context — knowing that a Supabase anon key in .env.example is safe, but one in config.js is not.

3. Automated Fix Suggestions

CodeRabbit already suggests diffs. Within a year, bots will be able to apply fixes directly and run tests to verify them.

4. Review Fatigue Management

As bots get more verbose, they'll need to learn which comments developers actually act on. Expect personalized severity rankings based on your historical fix rate.

The Bottom Line

AI code review in 2026 is genuinely useful but imperfect. The bots catch real issues (missing exception chaining, broken test isolation, dependency mismatches) that many human reviewers miss. But they also hallucinate, lack context, and sometimes just run linters through a chat interface.

The winning strategy isn't to ignore the bots or blindly follow them. It's to treat them like a junior reviewer: respect their catches, verify their suggestions, and don't be afraid to push back when they're wrong.

After 100+ PRs, my honest assessment: AI code review has made my code better. Not because the bots are smarter than me, but because they catch the things I'm too close to see. And in open source, where your reviewer might be in a different timezone and review your code three days later, having a 3-minute first pass is invaluable.

The robots aren't replacing code reviewers. They're making the ones we have more effective.

Have you worked with AI code review bots on your PRs? What patterns have you noticed? Share your experience in the comments — I'd love to compare notes.

If you found this useful, follow me for more data-driven analysis of AI tools in software development. I'm tracking the real impact of AI on developer productivity across 100+ open-source projects.

DEV Community