February 2026. My team had 47 open pull requests. I spent 3 hours each morning on code review, and most of what I flagged was boilerplate validation, style nits, and missing error handling. I was burning out.
So I built a PR review bot. Not the kind that comments "LGTM" on everything. Something that actually catches real bugs.
Here's what happened in the first 30 days.
The Problem With Manual Reviews
My team ships code fast. Too fast. We have 12 developers across 3 time zones. By the time I wake up, there are 8-15 new PRs waiting.
I used to spend:
| Activity | Hours/Week |
|---|---|
| Reading diffs | 8 |
| Writing review comments | 4 |
| Re-reviewing fixes | 3 |
| Total | 15 |
That's 15 hours. Every week. On reading other people's code.
And I was still missing things. A null pointer slipped through in January. A race condition in February. We shipped bugs because I was tired.
The Setup
I used a combination of tools:
- OpenAI's o3 model (released in April 2025) for deep code analysis
- GitHub Actions for automation
- A custom prompt template I refined over 3 weeks
- PostgreSQL to store review history and learn from false positives
The bot runs on every PR. It checks:
- Does the code compile? (obvious, but saves time)
- Are there any null safety issues?
- Are error messages helpful or generic?
- Is the test coverage adequate for the changes?
- Does the PR description match the actual diff?
The Prompt That Made It Work
After 17 failed attempts, I landed on this:
You are a senior developer reviewing a pull request.
Rules:
- Be concise. No fluff.
- Only flag things that would cause bugs or maintenance issues.
- Ignore style (we use prettier).
- If you don't see anything wrong, say nothing.
- If you see a real issue, explain why in 2 sentences max.
- Flag missing error handling, null references, and race conditions.
- Do NOT suggest refactors unless there's a concrete benefit.
- Rate confidence: HIGH, MEDIUM, LOW.
PR diff:
{diff}
PR description:
{description}
Changed files:
{files}
The key insight: the "say nothing" rule. Most AI review tools spam every PR with suggestions. That destroys trust. My bot stays quiet when there's nothing wrong.
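As a rough sketch of how the template above might be filled in and how the "say nothing" rule could be enforced on the bot's side, assuming the prompt is kept as a Python format string. The names `build_prompt` and `should_post` are hypothetical and not part of any library.

```python
# Template text copied from the prompt above; {diff}, {description} and
# {files} are the only placeholders.
PROMPT_TEMPLATE = """\
You are a senior developer reviewing a pull request.
Rules:
- Be concise. No fluff.
- Only flag things that would cause bugs or maintenance issues.
- Ignore style (we use prettier).
- If you don't see anything wrong, say nothing.
- If you see a real issue, explain why in 2 sentences max.
- Flag missing error handling, null references, and race conditions.
- Do NOT suggest refactors unless there's a concrete benefit.
- Rate confidence: HIGH, MEDIUM, LOW.
PR diff:
{diff}
PR description:
{description}
Changed files:
{files}
"""


def build_prompt(diff: str, description: str, files: list[str]) -> str:
    """Fill the placeholders. Braces inside the diff are not re-parsed."""
    return PROMPT_TEMPLATE.format(
        diff=diff,
        description=description.strip() or "(no description provided)",
        files="\n".join(files),
    )


def should_post(model_reply: str) -> bool:
    """Silence rule: an empty or whitespace-only reply means no comment."""
    return bool(model_reply.strip())
```

Keeping the silence check in code as well as in the prompt means an empty reply never turns into an empty comment on the PR.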
Results After 30 Days
| Metric | Before | After |
|---|---|---|
| Full manual reviews per day | 15 | 4 |
| Time per review | 20 min | 3 min |
| Bugs caught before prod | 2/month | 11/month |
| False positive comments | N/A | 3 total |
The bot caught 11 real bugs in 30 days. I only had to override it 3 times.
One specific example: a developer used map.get(key) without checking for null. The bot flagged it. The developer pushed a fix. That code would have crashed in production 2 hours later.
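A Python analogue of that bug, as a minimal sketch: `dict.get` returns `None` for a missing key, and the failure only shows up later when the value is used. The rate table and function names here are hypothetical, not taken from the real PR.

```python
RATES_CENTS = {"eu": 900, "us": 700}  # hypothetical lookup table
HANDLING_CENTS = 50


def fee_unchecked(region: str) -> int:
    rate = RATES_CENTS.get(region)  # None when the key is absent
    return rate + HANDLING_CENTS    # TypeError for an unknown region


def fee_checked(region: str) -> int:
    rate = RATES_CENTS.get(region)
    if rate is None:
        # Fail early with a message that names the missing key.
        raise ValueError(f"no shipping rate configured for region {region!r}")
    return rate + HANDLING_CENTS
```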
Another one: a database query inside a loop. The bot calculated it would make 47,000 queries per request. The developer refactored it to a batch query.
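The query-in-a-loop pattern and its batched replacement look roughly like this. It is a sketch using an in-memory SQLite database so it runs anywhere. The `users` table and both function names are made up for illustration.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)", [(1, "ana"), (2, "ben"), (3, "cy")]
    )
    return conn


def names_one_by_one(conn: sqlite3.Connection, ids: list[int]) -> dict[int, str]:
    """The flagged pattern: one query per id."""
    names = {}
    for user_id in ids:
        row = conn.execute(
            "SELECT name FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        if row is not None:
            names[user_id] = row[0]
    return names


def names_batched(conn: sqlite3.Connection, ids: list[int]) -> dict[int, str]:
    """The refactor: a single query with an IN list."""
    if not ids:
        return {}
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, name FROM users WHERE id IN ({placeholders})", list(ids)
    ).fetchall()
    return dict(rows)
```

Both functions return the same mapping, but the batched version issues one query regardless of how many ids are passed. For very long id lists the IN clause would need to be split into chunks, since databases cap the number of bound parameters.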
Where It Fails
I'm not going to pretend this is perfect.
The bot struggles with:
- Context-heavy logic (business rules that span 5 files)
- Framework-specific patterns (it doesn't know our internal libraries)
- Political decisions (should we deprecate this endpoint? that's a people problem)
I still review every PR before merging. But now I only read the diffs the bot flagged. The rest get a quick glance.
The Cost
Running o3 costs about $0.15 per review. For 15 reviews per day, that's $2.25 a day, or about $67 a month.
My time is worth more than that. Even at a modest $100/hour, the 12 hours I save per week are worth $1,200 a week. The ROI is absurd.
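The arithmetic behind those figures, worked through as a small sketch. The 30-day month and the 7-day week of API spend are assumptions. The other inputs are the numbers quoted above, and `review_economics` is a hypothetical helper name.

```python
def review_economics(
    cost_per_review_cents: int = 15,
    reviews_per_day: int = 15,
    days_per_month: int = 30,       # assumption: bot runs every day
    hours_saved_per_week: int = 12,
    hourly_rate_dollars: int = 100,
) -> dict[str, float]:
    daily_cost = cost_per_review_cents * reviews_per_day / 100
    return {
        "daily_cost": daily_cost,
        "monthly_cost": daily_cost * days_per_month,
        "weekly_cost": daily_cost * 7,
        "weekly_value": hours_saved_per_week * hourly_rate_dollars,
    }


numbers = review_economics()
print(f"API cost: ${numbers['daily_cost']:.2f}/day, ${numbers['monthly_cost']:.2f}/month")
# API cost: $2.25/day, $67.50/month
print(f"Time saved: ${numbers['weekly_value']:,}/week vs ${numbers['weekly_cost']:.2f}/week in API spend")
# Time saved: $1,200/week vs $15.75/week in API spend
```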
What I Learned
Silence is a feature. A bot that only speaks when something is wrong earns trust. A bot that comments on everything gets ignored.
Prompt engineering is 80% of the work. The difference between useful and useless is how you frame the task. Be specific. Give examples. Set constraints.
You need a feedback loop. I log every false positive. The bot learns from them. After 3 weeks, false positives dropped to near zero.
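One way such a feedback loop could work, offered as an assumption rather than a description of this bot: store each rejected comment, then feed the most recent ones back into the prompt as things not to repeat. In this sketch `sqlite3` stands in for PostgreSQL so it runs without a server, and the table and function names are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS false_positives (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    pr_number INTEGER NOT NULL,
    comment TEXT NOT NULL,
    reason TEXT NOT NULL
)
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def log_false_positive(
    conn: sqlite3.Connection, pr_number: int, comment: str, reason: str
) -> None:
    """Record a bot comment that a human reviewer overrode."""
    conn.execute(
        "INSERT INTO false_positives (pr_number, comment, reason) VALUES (?, ?, ?)",
        (pr_number, comment, reason),
    )
    conn.commit()


def feedback_section(conn: sqlite3.Connection, limit: int = 5) -> str:
    """Build a prompt suffix from the most recent false positives."""
    rows = conn.execute(
        "SELECT comment, reason FROM false_positives ORDER BY id DESC LIMIT ?",
        (limit,),
    ).fetchall()
    if not rows:
        return ""
    lines = ["Previously rejected comments (do not repeat these):"]
    lines += [f"- {comment} (rejected because: {reason})" for comment, reason in rows]
    return "\n".join(lines)
```

Appending the output of `feedback_section` to the review prompt keeps the correction cheap: no retraining, just a few extra lines of context on each run.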
Don't automate judgment. The bot catches factual issues. It doesn't decide architecture or team standards. That's still my job.
The Code
Here's the GitHub Actions workflow:
    name: AI PR Review
    on:
      pull_request:
        types: [opened, synchronize]
    jobs:
      review:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: AI Review
            uses: your-org/ai-pr-review@v2
            with:
              openai-key: ${{ secrets.OPENAI_KEY }}
              model: o3
              prompt-template: .github/review-prompt.txt
              confidence-threshold: MEDIUM
              max-comments: 5
The confidence-threshold: MEDIUM setting is critical. It filters out LOW-confidence suggestions, which are usually noise.
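The internals of the `your-org/ai-pr-review` action are not shown here, so the following is only a sketch of what a `confidence-threshold` and `max-comments` filter might do. The `Finding` class and `select_comments` function are hypothetical.

```python
from dataclasses import dataclass

CONFIDENCE_RANK = {"LOW": 0, "MEDIUM": 1, "HIGH": 2}


@dataclass
class Finding:
    path: str
    message: str
    confidence: str  # "HIGH", "MEDIUM" or "LOW", as rated by the model


def select_comments(
    findings: list[Finding], threshold: str = "MEDIUM", max_comments: int = 5
) -> list[Finding]:
    """Drop findings below the threshold, then keep the most confident few."""
    floor = CONFIDENCE_RANK[threshold]
    kept = [
        f for f in findings
        # Unrecognised confidence labels rank below LOW and are dropped.
        if CONFIDENCE_RANK.get(f.confidence.upper(), -1) >= floor
    ]
    # Stable sort: HIGH before MEDIUM, original order preserved within a level.
    kept.sort(key=lambda f: CONFIDENCE_RANK[f.confidence.upper()], reverse=True)
    return kept[:max_comments]
```

Sorting before truncating means that when a PR produces more findings than the cap allows, the HIGH-confidence ones are the ones that survive.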
Should You Do This?
If your team has