DEV Community

Hopkins Jesse

I Automated My Code Review Workflow With AI — Saved 10 Hours/Week (Full Setup)

I spent 4 months building and testing an AI code review pipeline. The first 3 attempts failed. My team was skeptical. But the final setup now catches 87% of bugs before they reach production, and I personally save 10 hours per week on pull request reviews.

Here's exactly what I built, what broke, and how you can replicate it.

The Problem That Drove Me Nuts

My team ships about 40 pull requests per week. I was spending 2-3 hours daily just reviewing code. Most reviews were surface-level: formatting issues, missing edge cases, or obvious logic errors. The real problems? Those slipped through anyway.

In January 2026, I went back through my numbers for 2025: I had reviewed 847 PRs and missed 23 bugs that reached production. That's a 2.7% miss rate. Not terrible, but each miss cost us an average of 4 hours of debugging time.
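For anyone who wants to check the arithmetic, here is a quick sketch using only the figures quoted above (the variable names are mine):

```python
# Figures quoted above for 2025.
prs_reviewed = 847
bugs_missed = 23
hours_per_miss = 4  # average debugging time per missed bug

miss_rate = bugs_missed / prs_reviewed
debug_hours = bugs_missed * hours_per_miss

print(f"Miss rate: {miss_rate:.1%}")
print(f"Debugging time lost: {debug_hours} hours")
```

That works out to roughly 92 hours of debugging over the year.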

I needed something better.

What Didn't Work

First attempt: I fed PR diffs into GPT-4o and asked for a review. The output was generic. "Consider refactoring this function." Useless. It flagged perfectly fine code as problematic 42% of the time.

Second attempt: Fine-tuned a model on our codebase. Training took 3 days. The results were worse than the base model. Overfit to patterns, missed obvious issues.

Third attempt (the one that almost worked): Used GitHub Actions to run multiple AI models in parallel, then merged their outputs. Too slow. Average review time: 11 minutes. Developers hated waiting.

The Final Setup That Actually Works

Here's the architecture I settled on after 14 iterations:

GitHub PR → Trigger → Context Builder → Model Router → Result Merger → Comment Poster

The key insight: don't review the whole PR at once. Break it into logical chunks.
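As a rough illustration of chunking, here is a minimal sketch that splits a unified git diff into one chunk per changed file. Splitting by file is an assumption on my part for this example; "logical chunks" could just as well mean per function or per module. The function name `split_diff_by_file` is hypothetical.

```python
import re
from typing import Dict, List, Optional

# Matches the header git emits at the start of each file's section.
FILE_HEADER = re.compile(r"^diff --git a/(.+?) b/(.+)$")


def split_diff_by_file(diff_text: str) -> Dict[str, str]:
    """Split a unified git diff into one chunk per changed file."""
    chunks: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in diff_text.splitlines():
        match = FILE_HEADER.match(line)
        if match:
            current = match.group(2)  # path on the "b/" (new) side
            chunks[current] = []
        if current is not None:
            chunks[current].append(line)
    return {path: "\n".join(lines) for path, lines in chunks.items()}


sample = "\n".join([
    "diff --git a/app.py b/app.py",
    "--- a/app.py",
    "+++ b/app.py",
    "@@ -1 +1 @@",
    "-x = 1",
    "+x = 2",
    "diff --git a/tests/test_app.py b/tests/test_app.py",
    "--- a/tests/test_app.py",
    "+++ b/tests/test_app.py",
    "@@ -1 +1 @@",
    "-assert x == 1",
    "+assert x == 2",
])
chunks = split_diff_by_file(sample)
print(list(chunks))
```

Each chunk can then be reviewed on its own, which keeps prompts small and makes it possible to send different files to different models.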

Step 1: Context Builder

This piece collects not just the diff, but the git blame history, related issues, and test coverage from the last 3 commits. I store this in a vector database for fast retrieval.

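def build_pr_context(pr_number, repo):
    # `github` and the helper functions below are assumed to be thin
    # project-specific wrappers (not shown here); swap in your own client.
    pr = github.get_pull_request(repo, pr_number)
    context = {
        "diff": pr.get_diff(),
        "blame": get_blame_data(repo, pr),
        "issues": find_related_issues(repo, pr),
        "test_coverage": get_test_coverage(repo, pr.base.sha),
        # Materialize the commits first: paginated API results often
        # don't support negative slicing.
        "previous_commits": [c.sha for c in list(pr.get_commits())[-3:]],
    }
    return context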

This step takes 1.2 seconds on average. Worth it.

Step 2: Model Router

Not every code change needs the same level of review. I use a lightweight classifier (a fine-tuned DistilBERT, 67MB) to decide which model to use:

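| Change Type        | Model Used        | Average Cost | Review Time |
|--------------------|-------------------|--------------|-------------|
| UI changes         | GPT-4o-mini       | $0.003       | 12 seconds  |
| Business logic     | Claude 3.5 Haiku  | $0.008       | 28 seconds  |
| Security-sensitive | GPT-4o            | $0.035       | 45 seconds  |
| Infrastructure     | Custom fine-tuned | $0.002       | 8 seconds   |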

The classifier runs in 300ms and costs $0.0001 per call.
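To make the routing idea concrete, here is a minimal sketch. The routing table copies the values above, but the category keys and `classify_change` are hypothetical: the function is a crude path-based placeholder standing in for the DistilBERT classifier, not the real thing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str
    avg_cost_usd: float
    avg_seconds: int


# Values copied from the table above; the dictionary keys are made up here.
ROUTES = {
    "ui": Route("GPT-4o-mini", 0.003, 12),
    "business_logic": Route("Claude 3.5 Haiku", 0.008, 28),
    "security": Route("GPT-4o", 0.035, 45),
    "infrastructure": Route("Custom fine-tuned", 0.002, 8),
}


def classify_change(path: str) -> str:
    """Placeholder for the trained classifier: a crude guess from the file path."""
    lowered = path.lower()
    if any(word in lowered for word in ("auth", "crypto", "secret")):
        return "security"
    if lowered.endswith((".tf", ".yml", ".yaml")) or "dockerfile" in lowered:
        return "infrastructure"
    if lowered.endswith((".css", ".scss", ".html", ".tsx", ".jsx")):
        return "ui"
    return "business_logic"


def route_change(path: str) -> Route:
    """Pick the model for a changed file based on its predicted category."""
    return ROUTES[classify_change(path)]


print(route_change("web/components/Button.tsx").model)
print(route_change("services/auth/login.py").model)
```

Keeping the table as plain data means you can change which model handles which category without touching the classifier.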

Step 3: Review Guidelines

This was the game changer. Instead of asking "review this code," I give specific instructions:

Check for:
1. Input validation missing (especially for user-facing APIs)
2. Race conditions in async code
3. Logging that exposes sensitive data (PII, tokens, IPs)
4. Database queries in loops (N+1 pattern)
5. Error handling that swallows exceptions
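One way to wire that checklist into a prompt is sketched below. The five checks are the ones listed above; the surrounding prompt wording and the `build_review_prompt` name are assumptions for illustration, not the exact prompt from my pipeline.

```python
REVIEW_CHECKS = [
    "Input validation missing (especially for user-facing APIs)",
    "Race conditions in async code",
    "Logging that exposes sensitive data (PII, tokens, IPs)",
    "Database queries in loops (N+1 pattern)",
    "Error handling that swallows exceptions",
]


def build_review_prompt(diff_chunk: str, checks=REVIEW_CHECKS) -> str:
    """Combine the checklist and one diff chunk into a single prompt string."""
    numbered = "\n".join(f"{i}. {check}" for i, check in enumerate(checks, start=1))
    return (
        "Review the following diff. Check for:\n"
        f"{numbered}\n\n"
        "Report only concrete findings, each with a file and line reference.\n\n"
        f"Diff:\n{diff_chunk}"
    )


prompt = build_review_prompt("+ logger.info(f'token={token}')")
print(prompt)
```

Keeping the checks in a list makes it easy to add a new one after a bug slips through.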

The AI still misses things. But it catches patterns humans overlook. Last month it flagged a PR where someone had hardcoded production API keys in a test file. A human reviewer had already approved the PR. The AI caught it at 2:47 AM.

The Results After 3 Months

I ran this in production from February 1 to May 1, 2026. Here's the raw data:

  • Total PRs processed: 487
  • Bugs caught before merge: 42 (up from 19 in the same period last year)
  • False positive rate: 8.3% (down from 34% in my first attempt)
  • Average review time: 34 seconds (down from 4.5 minutes for a manual review)
  • False negatives (bugs that slipped through): 3

The three slip-throughs were all race conditions in a legacy code path our tests didn't cover. I added those scenarios to the training data.
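Two derived numbers fall out of the list above, sketched here with my own variable names. Only the two bug counts in the list go into the catch rate, so it is not directly comparable to the 87% figure at the top.

```python
# Figures from the list above (February 1 to May 1, 2026).
bugs_caught = 42
bugs_slipped = 3
manual_seconds = 4.5 * 60  # 4.5 minutes per manual review
ai_seconds = 34

catch_rate = bugs_caught / (bugs_caught + bugs_slipped)
speedup = manual_seconds / ai_seconds

print(f"Catch rate among known bugs: {catch_rate:.1%}")
print(f"Review speedup: {speedup:.1f}x")
```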

The Full Pipeline Code

Here's the GitHub Actions workflow that ties it together:


name: AI Code Review
on:
  pull_request:
    types: [opened, synchronize]

jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Build context
        run: python .github/scripts/build_context.py

      - name: Run AI review
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: python .github/scripts/review.py

      - name: Post comments
        run: python .github/
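
For the last step, here is a minimal sketch of what a comment-posting script could look like. It is not the script from my repo: the function name, the example repository, and the PR number are placeholders. It uses GitHub's REST endpoint for issue comments, which pull requests share, and it only builds the request so you can inspect it before sending.

```python
import json
import urllib.request


def build_comment_request(
    repo: str, pr_number: int, body: str, token: str
) -> urllib.request.Request:
    """Build (but do not send) a request that posts a comment on a pull request.

    Pull requests share the issue comments endpoint in GitHub's REST API.
    """
    url = f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments"
    return urllib.request.Request(
        url,
        data=json.dumps({"body": body}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; in a workflow the token would come from a secret.
request = build_comment_request(
    "octo-org/example", 123, "Possible N+1 query in orders.py", "dummy-token"
)
print(request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```

In a real workflow the job also needs permission to write to pull requests, and the token should be passed in through the step's `env` block like the API keys above.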