Hopkins Jesse

I Let AI Handle My PR Reviews for 30 Days — The Data Was Ugly

I stopped reviewing pull requests manually on March 1, 2026.

It wasn’t a strategic decision born from a desire to optimize my workflow. It was pure exhaustion. I had spent the previous two weeks staring at diffs that ranged from trivial whitespace changes to complex refactors of our legacy authentication module. My brain felt like mush.

So I hooked up "ReviewBot," a local LLM agent configured with our team’s style guide and security rules, to our GitHub repository. The promise was simple. It would catch syntax errors, flag potential security vulnerabilities, and enforce naming conventions. I would only step in for architectural decisions and logic validation.

I expected to save ten hours a week. I expected cleaner code. What I got was a massive increase in velocity and a subtle, creeping decay in code quality that took me three weeks to notice.

Here is exactly what happened during that month, including the metrics and the specific failure modes I encountered.

The Setup and Initial Wins

We are a team of six developers working on a mid-sized SaaS platform built with Next.js and Python. Our average PR size is about 400 lines of changed code. Before the experiment, the average time from "Open" to "Merged" was 18 hours. This included waiting for human reviewers who were often busy with their own tasks.

I configured the agent using a custom system prompt. I fed it our eslint config, our pylint rules, and a markdown file containing our internal best practices. I also gave it read-only access to our recent commit history so it could understand context.

The first week was fantastic.

The bot caught three actual bugs. One was a missing null check on a field defined in a TypeScript interface, which would have caused a runtime crash. Another was a SQL injection vulnerability in a raw query that standard linters missed because the variable interpolation looked safe at a glance.

My teammates loved it. They no longer had to wait for me to wake up and review their morning commits. The average merge time dropped to 4 hours. I felt like a genius. I thought I had solved the bottleneck.

Then the second week started, and the noise began.

The False Positive Flood

By day eight, the signal-to-noise ratio had plummeted. The AI became overly cautious. It started flagging valid patterns as anti-patterns simply because they didn't match the most common examples in its training data.

For instance, we use a specific pattern for error handling in our API routes that involves wrapping promises in a try-catch block with a custom logger. The AI flagged this as "redundant error handling" in twelve separate PRs. It suggested removing the try-catch blocks, which would have swallowed errors silently in production.

I had to spend an hour each day dismissing these false positives. This wasn't saving time. It was shifting the workload from "reading code" to "managing the bot."

Here is a breakdown of the comments generated by the AI during Week 2:

| Category | Count | Action Required | Time Spent |
| --- | --- | --- | --- |
| Valid Bug Catch | 4 | Fix Code | 20 mins |
| Style Nitpick | 142 | Dismiss/Ignore | 35 mins |
| Incorrect Logic Flag | 18 | Explain to Dev | 45 mins |
| Security False Alarm | 9 | Verify & Dismiss | 15 mins |

Total time spent managing the bot: ~1 hour 55 minutes.

This was worse than just reviewing the code myself. When I review code, I can skip the obvious stuff. The bot forced me to look at every single comment to ensure it wasn't hiding a real issue among the junk.

The Subtle Quality Decay

The real shock came in Week 4. I decided to do a random audit of the code merged during the experiment. I picked ten PRs that had been approved solely by the AI after I dismissed its initial comments.

I found a pattern of "lazy" coding.

Developers knew the AI wouldn't catch logical inefficiencies. It only checked for syntax and strict rule adherence. So, they started writing code that passed the checks but was structurally poor.

One developer nested five levels of conditional statements because the AI didn't flag cyclomatic complexity unless it exceeded a hard threshold of 15. The code worked, but it was unreadable. Another developer duplicated a helper function across three files because the AI didn't have global context to see that the function already existed elsewhere.

The AI was optimizing for compliance, not quality. And our team was optimizing for speed.

I looked at our bug tracking system. In the month prior to the experiment, we had logged four minor bugs related to new features. In the 30 days of AI-only reviews, we logged eleven. Three of those were direct results of the logical gaps the AI missed.

The Human Element Is Not Replaceable

I realized that code review is not just about finding bugs. It is about knowledge sharing. When I review a junior developer's code, I leave comments explaining why a certain approach is better. I link to documentation. I ask questions that force them to think about edge cases.

The AI does none of this. It gives binary feedback. Pass or fail. Fix this line. Delete that import.

Our junior developers stopped learning. They stopped asking questions. They just fixed what the bot told them to fix and merged the code. The mentorship loop was broken.

I also missed the context. The AI doesn't know that we are planning to deprecate the UserService class, so it happily approved new PRs that added fresh dependencies on code we were about to delete.

💡 Further Reading: I experiment with AI automation and open-source tools. Find more guides at Pi Stack.
