I spent three months training a custom AI to review my team's pull requests. The first month was a disaster. False positives everywhere, missed bugs, and my teammates hated it.
Then I fixed it. Here's exactly what I built and how you can copy it.
The Problem That Drove Me Crazy
My team of 6 engineers submits about 40 PRs per week. Each review takes 30-45 minutes if I'm thorough. That's 20-30 hours of code review every week.
I was burning out. My own code quality suffered because I rushed through reviews. And honestly? I was missing stuff. A production bug slipped through in January 2026 because I skimmed a 500-line PR at 11 PM.
I needed help. Not a linter, not a static analyzer, but something that actually understood context.
What I Actually Built
This isn't a ChatGPT wrapper. I built a pipeline that:
- Extracts the PR diff and commit history
- Feeds it into a fine-tuned Claude 3.5 model trained on our codebase
- Runs 5 specific checkers in parallel
- Generates a structured review with confidence scores
Here's the core setup I run on every PR:
import os
from github import Github
from anthropic import Anthropic
class PRReviewBot:
def __init__(self, repo_name):
self.gh = Github(os.environ['GITHUB_TOKEN'])
self.repo = self.gh.get_repo(repo_name)
self.claude = Anthropic(api_key=os.environ['ANTHROPIC_KEY'])
def review_pr(self, pr_number):
pr = self.repo.get_pull(pr_number)
files = pr.get_files()
checkers = {
'logic_errors': self.check_logic,
'security_issues': self.check_security,
'performance_regression': self.check_performance,
'style_consistency': self.check_style,
'test_coverage': self.check_tests
}
results = {}
for name, checker in checkers.items():
results[name] = checker(files)
return self.generate_summary(results)
def check_logic(self, files):
prompt = f"""Review this code for logic errors.
Our codebase uses Python 3.12 with FastAPI.
Focus on: race conditions, null pointer issues, incorrect state mutations.
Return a JSON array of issues with severity (critical/major/minor)."""
# truncated for brevity
The full version is about 400 lines. I'll share the complete repo link at the end.
The Data: What Actually Improved
I tracked metrics for 8 weeks after deployment. Here's what changed:
| Metric | Before AI | After AI | Change |
|---|---|---|---|
| Review time per PR | 38 min | 8 min | -79% |
| Bugs missed in review | 3.2/month | 0.8/month | -75% |
| Team satisfaction (1-10) | 6.1 | 8.4 | +38% |
| False flag rate | N/A | 12% | Improving |
The false flag rate dropped from 34% in week one to 12% by week eight. That's because I kept tuning the prompts and adding specific project context.
What Went Wrong (And How I Fixed It)
Month 1: The AI hated our coding style
It flagged our custom logging wrapper as "unnecessary abstraction" 47 times. I had to add a 200-line configuration file that defined our acceptable patterns. Painful but necessary.
Month 2: It missed SQL injection vulnerabilities
The model didn't understand our ORM layer. I had to feed it 30 example files showing safe vs unsafe patterns. After that, it caught 4 SQL issues in week 6 alone.
Month 3: False positives from test files
The AI kept complaining about test assertions being "too complex." I added a simple filter: skip any file in the tests/ directory unless it's been modified in the actual PR scope.
The Critical Pieces You Need
This isn't plug and play. Here's what made it work:
1. Project-specific context file
Create a CONTEXT.md that explains your architecture, naming conventions, and common patterns. I update this every sprint. The AI reads it before each review.
2. Confidence thresholds
Don't let the AI block PRs automatically. I set it to:
- Critical issues: flag immediately, block merge
- Major issues: comment, but don't block
- Minor issues: ignore unless there are 5+ in one PR
3. Human override system
Every comment has a "dismiss" button that feeds back into the training data. I reviewed 100 dismissed comments manually to fix the worst patterns.
The Cost Breakdown
Running this costs about $0.12 per PR in API calls. My time savings are worth roughly $150/week at my hourly rate. Total setup took about 40 hours spread over 3 months.
I also run it on my personal projects. Costs about $3/month for 20-30 PRs.
What I'd Do Differently
Start with a smaller scope. I tried to review everything at once. Should have just done security checks for the first month, then added logic reviews, then style.
Also, get your team on board first. I deployed silently and people were confused. Now I have a #ai-review channel where the bot posts its findings and people can react with
💡 Further Reading: I experiment with AI automation and open-source tools. Find more guides at Pi Stack.
💰 Want to make some smart bets? I've been using Polymarket — the world's largest prediction market platform — to bet on everything from election outcomes to tech trends. Real money, real probabilities, real payouts. Unlike crypto casinos, Polymarket is a legitimate information market where your edge comes from being better informed than the crowd. I've banked some solid wins calling AI regulation timelines and crypto ETF approvals. Sign up with my referral link and start trading: Polymarket.com
Top comments (0)