I spent 2025 drowning in pull requests. My team of 8 ships 40+ PRs a week, and I was the bottleneck. Every review took 20-45 minutes. Context switching alone ate my afternoons.
In January 2026, I built an AI review pipeline that handles 70% of my PR feedback automatically. I still review critical paths personally. But routine linting, style nits, missing tests, and API contract violations? The bot catches those before I even open the tab.
Here's exactly how I set it up. No fluff.
The Problem Wasn't Code Quality
My team writes decent code. The problem was me. I'd spend 30 minutes on a PR, find three issues, and realize I could have caught two of them with a static analysis rule I forgot to configure.
The real cost wasn't the review time. It was the 15-minute ramp-up to understand each PR's context. Switch to a new branch, read the description, scan the diff, remember the architecture. By PR number 4, my brain was pudding.
I needed something that could:
- Read the diff and the surrounding codebase
- Check against our team's style guide (we use a custom ESLint config)
- Validate API compatibility with our internal SDK
- Flag missing error handling patterns
- Generate review comments in our team's voice
No single tool did all this in 2025. By February 2026, I had a working pipeline.
The Architecture (Simple, Not Clever)
I run this as a GitHub Actions workflow triggered on the pull_request opened and synchronize events, so it fires when a PR is created and again whenever new commits are pushed to it. The key components:
PR Event → Context Collector → AI Analyzer → Comment Generator → GitHub API
The "AI" part is a fine-tuned Claude 3.5 model running on our own Vercel serverless functions. We pay $0.003 per analysis call. That's about $12/week for 400 PRs.
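To sanity-check the bill for your own volume, here is a minimal cost sketch. The only input taken from this post is the quoted $0.003 per call; the function name and the call counts are illustrative assumptions. At that rate 400 calls come to $1.20 and $12 corresponds to 4,000 calls, so the weekly total depends on how many analysis calls each PR ends up triggering (every synchronize event re-runs the workflow).

```javascript
// Hypothetical helper: weekly cost from a per-call price and a call count.
// Prices are passed in millionths of a dollar so the multiplication stays in
// integers and avoids floating-point drift (0.003 * 400 is not exactly 1.2).
function weeklyCostUsd(pricePerCallMicroUsd, callsPerWeek) {
  const totalMicroUsd = pricePerCallMicroUsd * callsPerWeek;
  return (totalMicroUsd / 1000000).toFixed(2);
}

const PRICE_MICRO_USD = 3000; // $0.003 per analysis call, as quoted above

console.log(weeklyCostUsd(PRICE_MICRO_USD, 400));  // "1.20"
console.log(weeklyCostUsd(PRICE_MICRO_USD, 4000)); // "12.00"
```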
Here's the core action config:
name: AI PR Review
on:
  pull_request:
    types: [opened, synchronize]
jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Collect context
        run: |
          # Get diff, changed files, and relevant type definitions
          git diff origin/main...HEAD > pr_diff.txt
          node scripts/gather-context.js
      - name: Run AI review
        run: node scripts/review-pr.js
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
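For a sense of what the review step might do with that API key, here is a rough sketch of the request-building half of a script like scripts/review-pr.js. It is not the actual script. The endpoint, headers, and body shape follow the public Anthropic Messages API; the function names, the max_tokens value, and the way context and diff are joined into one message are assumptions, and the model id is left as a parameter.

```javascript
// Hypothetical sketch, not the real scripts/review-pr.js.
const ANTHROPIC_URL = "https://api.anthropic.com/v1/messages";

// Pure function: turns the inputs into a fetch-ready request description.
function buildReviewRequest({ apiKey, model, systemPrompt, diff, context }) {
  return {
    url: ANTHROPIC_URL,
    options: {
      method: "POST",
      headers: {
        "x-api-key": apiKey,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
      },
      body: JSON.stringify({
        model, // supplied by the caller; no specific model id is assumed here
        max_tokens: 2048, // assumed limit, tune for your PR sizes
        system: systemPrompt,
        messages: [
          { role: "user", content: `Context:\n${context}\n\nDiff:\n${diff}` },
        ],
      }),
    },
  };
}

// Sends the request with Node's built-in fetch (Node 18+) and returns the
// model's text reply as a single string.
async function runReview(params) {
  const { url, options } = buildReviewRequest(params);
  const res = await fetch(url, options);
  if (!res.ok) throw new Error(`Anthropic API returned ${res.status}`);
  const data = await res.json();
  // Replies arrive as a list of content blocks; keep only the text ones.
  return data.content
    .filter((block) => block.type === "text")
    .map((block) => block.text)
    .join("");
}
```

Keeping the request builder separate from the network call makes the prompt wiring easy to unit test without spending API credits.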
The gather-context.js script is the secret sauce. It extracts:
- The full diff text
- Type definitions for any changed functions
- Related files from the same module
- Our team's review checklist (stored as a JSON file)
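The four items above can be sketched as a small bundling step. This is a simplified illustration, not the real gather-context.js: the function names and the shape of the bundle are assumptions, "module" is approximated as "same directory", and the diff parsing ignores edge cases such as paths that git quotes.

```javascript
// Hypothetical sketch of the pure parts of a context collector.

// Extract changed file paths from unified diff text produced by `git diff`.
function changedFiles(diffText) {
  const files = new Set();
  for (const line of diffText.split("\n")) {
    // New-side headers look like "+++ b/src/foo.js". Deleted files show
    // "+++ /dev/null" and are skipped because they do not match.
    const match = /^\+\+\+ b\/(.+)$/.exec(line);
    if (match) files.add(match[1]);
  }
  return [...files];
}

// Approximate "modules" as the set of directories containing changed files.
function modulesTouched(files) {
  const dirs = files.map((f) =>
    f.includes("/") ? f.slice(0, f.lastIndexOf("/")) : "."
  );
  return [...new Set(dirs)];
}

// Bundle everything the analyzer needs into one object.
function buildContext({ diffText, typeDefinitions, relatedFiles, checklist }) {
  const files = changedFiles(diffText);
  return {
    diff: diffText,
    changedFiles: files,
    modules: modulesTouched(files),
    typeDefinitions, // e.g. extracted signatures for changed functions
    relatedFiles,    // e.g. sibling files from the touched modules
    checklist,       // parsed contents of the team's JSON checklist
  };
}
```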
Without good context, the AI hallucinates. With it, the accuracy jumped from 60% to 92% in my tests.
What the AI Actually Catches
I ran this for 8 weeks. Here's the breakdown of issues flagged:
| Issue Type | AI Catch Rate | False Positives |
|---|---|---|
| Missing error handling | 89% | 3% |
| API contract violations | 94% | 1% |
| Style guide violations | 97% | 0.5% |
| Missing tests for new logic | 76% | 8% |
| Security concerns | 82% | 4% |
| Logic errors | 34% | 12% |
The logic errors row is the honest one. The AI still misses subtle bugs. But it catches the boring stuff every time.
The Prompt That Made It Work
I spent 3 days iterating on the system prompt. The version that finally clicked:
You are an experienced senior developer reviewing a pull request.
Your job is to catch issues that would waste a human's time.
Rules:
1. Only comment on things that matter. No "consider using const" nits.
2. Reference specific line numbers from the diff.
3. If the issue is subjective, flag it as "suggestion" not "blocking".
4. Check against our style guide at .github/review-guidelines.json.
5. Never comment on formatting (Prettier handles that).
6. If you're unsure, stay silent. False negatives are better than false positives.
Output format:
For each issue, return a JSON object with:
- file: path
- line: number
- severity: "blocking" | "warning" | "suggestion"
- message: string
- code_example: string (optional)
The key was rule 6. Early versions flooded PRs with noise. Developers ignored the bot after 2 days. Now it comments on maybe 3-5 things per PR. People actually read them.
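One way to enforce the output format from the prompt before anything reaches the PR is to validate each object, order by severity, and map the survivors onto a GitHub review payload. The sketch below is an illustration under assumptions: the function names are made up, and the hard cap of 5 comments is my own guard rail borrowed from the 3-5 range above. The payload fields (event, body, comments with path, line, side, body) are the ones accepted by GitHub's REST endpoint for creating a pull request review.

```javascript
// Hypothetical post-processing step between the model reply and GitHub.
const SEVERITIES = ["blocking", "warning", "suggestion"];

// True when an object matches the output format described in the prompt.
function isValidIssue(issue) {
  return (
    issue !== null &&
    typeof issue === "object" &&
    typeof issue.file === "string" && issue.file.length > 0 &&
    Number.isInteger(issue.line) && issue.line > 0 &&
    SEVERITIES.includes(issue.severity) &&
    typeof issue.message === "string" && issue.message.length > 0 &&
    (issue.code_example === undefined || typeof issue.code_example === "string")
  );
}

// Drop malformed issues, put the most severe first, and cap the total.
// filter() returns a new array, so the caller's input is not mutated.
function selectIssues(rawIssues, maxComments = 5) {
  return rawIssues
    .filter(isValidIssue)
    .sort((a, b) => SEVERITIES.indexOf(a.severity) - SEVERITIES.indexOf(b.severity))
    .slice(0, maxComments);
}

// Build the JSON body for POST /repos/{owner}/{repo}/pulls/{pull_number}/reviews.
function toReviewPayload(issues) {
  return {
    event: "COMMENT", // comment only; never approve or request changes
    body: `Automated review: ${issues.length} item(s).`,
    comments: issues.map((issue) => ({
      path: issue.file,
      line: issue.line,
      side: "RIGHT", // comment on the new version of the file
      body:
        `[${issue.severity}] ${issue.message}` +
        (issue.code_example ? `\n\n${issue.code_example}` : ""),
    })),
  };
}
```

Dropping malformed objects silently is in the spirit of rule 6: a missed comment costs less than a broken or noisy one.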
Where It Falls Short
Three things the AI still can't do well:
Business logic validation. If your PR changes the discount calculation for enterprise customers, the AI has no idea if the math is right. I still review those manually.
Architectural decisions. The AI can't tell if you should extract a shared service instead of duplicating the code. It flags duplication, but the solution requires human judgment.
Security edge cases. It catches obvious stuff like SQL injection patterns. But chained vulnerabilities or timing attacks? Nope. Our security review is still mandatory for any auth or