I have a confession. I hate reviewing pull requests.
Not the code itself. I enjoy solving problems. I hate the context switching. I hate reading three hundred lines of boilerplate just to find one missing null check. It drains my energy for actual feature work.
So in January 2026, I decided to stop doing it manually.
I set up an autonomous agent using the latest local LLM stack to handle first-pass reviews on my side projects. I gave it strict rules. I told it to block merges if it found security issues or style violations. I promised myself I would only step in for architectural decisions.
I ran this experiment for exactly 30 days.
The results were not what I expected. I thought I would save time. I did. But I also introduced a new category of bugs that I never saw coming. Here is the raw data and what I learned.
The Setup: Local Agents Only
I did not use any cloud-based API for this. Privacy matters, and sending proprietary code to a third party feels wrong in 2026.
I ran a quantized 70B-parameter model on my local workstation, which has dual RTX 4090s. Inference was fast enough for interactive use, but batch processing took a few minutes per PR.
The agent had three specific jobs:
- Check for consistent typing (TypeScript strict mode).
- Identify potential memory leaks in React effects.
- Verify that every new function has a corresponding unit test.
If a PR passed these checks, the agent approved it. If it failed, the agent commented with specific line numbers. I configured a required GitHub Actions check so that nothing could merge until the agent signed off.
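To make the approve-or-comment rule concrete, here is a minimal sketch of the decision step. It is not the agent's actual code: the `Finding` and `ReviewDecision` shapes, the check names, and the example file path are all hypothetical, since the post does not show the real output format.

```typescript
// Hypothetical shapes; the real agent output format is not shown in the post.
type CheckName = "strict-typing" | "effect-leak" | "missing-test";

interface Finding {
  check: CheckName;
  file: string;
  line: number;
  message: string;
}

interface ReviewDecision {
  verdict: "approve" | "request_changes";
  comments: string[];
}

// Approve only when no check produced a finding. Otherwise return one
// comment per finding, each pointing at a specific file and line.
function decide(findings: Finding[]): ReviewDecision {
  if (findings.length === 0) {
    return { verdict: "approve", comments: [] };
  }
  const comments = findings.map(
    (f) => `${f.file}:${f.line} [${f.check}] ${f.message}`
  );
  return { verdict: "request_changes", comments };
}

const decision = decide([
  {
    check: "effect-leak",
    file: "src/Profile.tsx", // made-up path for illustration
    line: 42,
    message: "useEffect subscribes without returning a cleanup function",
  },
]);
console.log(decision.verdict, decision.comments);
```

The point of the sketch is how narrow the gate is: anything the three checks do not model passes straight through to "approve."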
I tracked two metrics: time spent reviewing and bug rate post-merge.
The First Week: False Confidence
Days 1 through 7 felt amazing.
I merged 14 PRs without opening a single file. The agent caught three genuine type errors that I would have missed during a quick scan. It also flagged a missing cleanup function in a useEffect hook.
I felt like I had unlocked a superpower. I spent my review time building features instead of nitpicking semicolons.
Then day 8 happened.
I deployed a minor update to the staging environment. The app crashed immediately on load. The error was simple. A variable was named userList in one component and usersList in another. The agent had approved both because they were technically valid TypeScript variables. It didn't understand the semantic intent.
It wasn't a syntax error. It was a logic gap. The agent saw code that compiled. It didn't see code that made sense.
I spent four hours debugging what should have been a five-minute review.
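A cheap heuristic could have surfaced this one. The sketch below is something I did not have in place, and it is only an assumption about what would help: it flags identifier pairs that differ by a single character, using plain edit distance. The function names and the default threshold of 1 are mine.

```typescript
// Levenshtein edit distance, computed one row at a time.
function editDistance(a: string, b: string): number {
  // prev[j] is the distance between the first i-1 chars of a and the first j chars of b.
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Return every pair of distinct identifiers within maxDistance edits of each other.
function nearDuplicates(names: string[], maxDistance = 1): Array<[string, string]> {
  const unique = Array.from(new Set(names));
  const pairs: Array<[string, string]> = [];
  for (let i = 0; i < unique.length; i++) {
    for (let j = i + 1; j < unique.length; j++) {
      if (editDistance(unique[i], unique[j]) <= maxDistance) {
        pairs.push([unique[i], unique[j]]);
      }
    }
  }
  return pairs;
}

console.log(nearDuplicates(["userList", "usersList", "isLoading"]));
```

A check like this is noisy on short names (`i` and `j` are one edit apart), so it would need a minimum length filter before being useful on a real codebase.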
The Data: Where It Failed
I kept a spreadsheet of every issue the agent missed. By day 30, 42 PRs had gone through the pipeline. The agent handled the initial pass for all of them. I manually reviewed only 12 of them, because something felt "off."
Here is the breakdown of outcomes:
| Metric | Manual Review (Baseline) | AI Agent Review |
|---|---|---|
| Avg Time per PR | 18 minutes | 2 minutes (my time) |
| Syntax Errors Caught | 95% | 99% |
| Logic Bugs Missed | 2 per month | 5 per month |
| Developer Satisfaction | 6/10 | 8/10 (initially) |
| Post-Merge Hotfixes (per month) | 1 | 4 |
The time savings were real. I saved about 6 hours of pure review time. But those 4 hotfixes cost me nearly 10 hours of debugging and deployment stress.
Net loss: 4 hours.
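The arithmetic behind that number, spelled out. The hours and hotfix count come from the figures above and are approximate. The break-even line is a rough extension that assumes every hotfix costs the same, which is a simplification.

```typescript
// Figures quoted above; both hour values are approximate.
const reviewHoursSaved = 6;
const hotfixHoursSpent = 10;
const hotfixCount = 4;

// Negative means the experiment cost time overall.
const netHours = reviewHoursSaved - hotfixHoursSpent;

// Assumption for illustration: every hotfix costs the same amount of time.
const hoursPerHotfix = hotfixHoursSpent / hotfixCount;

// Number of hotfixes at which the cost would equal the review time saved.
const breakEvenHotfixes = reviewHoursSaved / hoursPerHotfix;

console.log(netHours); // -4
console.log(hoursPerHotfix); // 2.5
console.log(breakEvenHotfixes); // 2.4
```

Under that assumption, the automation only pays for itself if it causes two or fewer hotfixes in the month.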
But the real shock wasn't the time. It was the type of errors.
The agent was terrible at spotting "drift." Drift is when the code follows the pattern but violates the spirit of the architecture. For example, it allowed a direct database call in a UI component because the function signature was correct. It didn't care that we strictly separate data access layers.
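This kind of drift is easier to catch with a dumb rule than with a smart model. Here is a minimal sketch of a layer boundary check, assuming a hypothetical layout where UI code lives under `src/components/` and data access lives in a `db` folder. The paths, names, and the regex-based import scan are all illustrative; a real check would use a proper parser or a lint rule.

```typescript
// Hypothetical layout: UI under src/components/, data access in a "db" folder.
interface SourceFile {
  path: string;
  text: string;
}

const UI_PREFIX = "src/components/";
const FORBIDDEN_SEGMENT = "db";

// Collect module specifiers from static import statements.
// The regex is a rough approximation and ignores dynamic imports and require().
function importSpecifiers(text: string): string[] {
  const pattern = /import\s+(?:[^'"]*?\sfrom\s+)?['"]([^'"]+)['"]/g;
  const found: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    found.push(match[1]);
  }
  return found;
}

// Report every UI file that imports from the data access folder.
function findLayerViolations(files: SourceFile[]): string[] {
  const violations: string[] = [];
  for (const file of files) {
    if (!file.path.startsWith(UI_PREFIX)) continue;
    for (const specifier of importSpecifiers(file.text)) {
      if (specifier.split("/").includes(FORBIDDEN_SEGMENT)) {
        violations.push(`${file.path} imports ${specifier}`);
      }
    }
  }
  return violations;
}

const violations = findLayerViolations([
  {
    path: "src/components/UserTable.tsx",
    text: 'import React from "react";\nimport { query } from "../db/client";',
  },
  {
    path: "src/services/users.ts",
    text: 'import { query } from "../db/client";',
  },
]);
console.log(violations);
```

The service file is allowed to import the database client; only the component is reported.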
It also struggled with context outside the diff. If a PR changed a utility function, the agent didn't always check how that change impacted consumers in other files unless explicitly prompted to do a full repo scan. Full scans took 20 minutes. Nobody waits 20 minutes for a PR check.
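One way to narrow the gap without a full repo scan is to hand the agent only the files that depend on what changed. The sketch below assumes an import graph already exists as a plain object mapping each file to the files it imports. Building that graph from a real repo is out of scope here, and the file names are made up.

```typescript
// Hypothetical input: each file mapped to the files it imports (already resolved).
type ImportGraph = Record<string, string[]>;

// Return every file that depends on `changed`, directly or transitively.
function findConsumers(graph: ImportGraph, changed: string): string[] {
  // Reverse the edges: imported file -> files that import it.
  const dependents = new Map<string, string[]>();
  for (const [file, imports] of Object.entries(graph)) {
    for (const imported of imports) {
      const list = dependents.get(imported) ?? [];
      list.push(file);
      dependents.set(imported, list);
    }
  }

  // Breadth-first walk over the reversed graph, starting at the changed file.
  const seen = new Set<string>();
  const queue: string[] = [changed];
  while (queue.length > 0) {
    const current = queue.shift() as string;
    for (const dependent of dependents.get(current) ?? []) {
      if (dependent !== changed && !seen.has(dependent)) {
        seen.add(dependent);
        queue.push(dependent);
      }
    }
  }
  return Array.from(seen).sort();
}

const graph: ImportGraph = {
  "src/utils/date.ts": [],
  "src/hooks/useSchedule.ts": ["src/utils/date.ts"],
  "src/components/Calendar.tsx": ["src/hooks/useSchedule.ts"],
  "src/components/Header.tsx": [],
};
console.log(findConsumers(graph, "src/utils/date.ts"));
```

Here a change to the date utility pulls in the hook and the calendar component but leaves the unrelated header out, which keeps the extra context small.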
The Turning Point: Hybrid Workflow
On day 15, I changed the rules.
I stopped letting the agent approve PRs. Instead, I made it a strict advisor. It could comment, but it could not block or approve. I forced myself to read every comment it generated.
This shifted my role from "finder of errors" to "validator of insights."
I noticed the agent was great at spotting repetitive patterns. It caught three instances where I copied and pasted error handling logic instead of using our shared hook. I would have missed those because I was focused on the new feature logic.
It was also excellent at writing documentation. I asked it to generate JSDoc comments for every changed function. It did this perfectly. This saved me maybe 30 minutes of writing docs over the month.
The hybrid approach looked like this:
- Agent runs linting and type checks.
- Agent suggests improvements for readability.
- I read the suggestions.
- I verify the logic manually.
- I merge.
This added about 5 minutes to my review process compared to full automation. But