AdamAI
# I built an AI PR reviewer and it already caught bugs I missed

I've been doing code review the same way for years. Read the diff, run it locally if something looks tricky, leave a comment or two, merge. Works fine until you're tired, distracted, or too close to the code to see the obvious problem.

So I built claude-pr-reviewer. It's a GitHub Action that feeds your PR diff to Claude and posts structured feedback as a comment. Not a linter. Not a style checker. Actual reasoning about what the code does, what it gets wrong, and whether it should merge.

The setup is minimal:

```yaml
name: Claude PR Review
on:
  pull_request:
    types: [opened, synchronize]

permissions:
  pull-requests: write
  contents: read

jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: indoor47/claude-pr-reviewer@v1
        with:
          anthropic_api_key: ${{ secrets.ANTHROPIC_API_KEY }}
```

One secret. One workflow file. Every PR gets reviewed automatically.

## What the output actually looks like

The review posts as a PR comment, structured into sections. Here's a real example from a rate-limiting PR I ran it against:

```markdown
## Summary
This PR adds token bucket rate limiting to the /login and /register
endpoints to prevent brute force attacks...

## Issues Found

### Critical (must fix before merge)
- `auth/middleware.py:34`: Rate limit state is stored in in-process memory.
  This means limits reset on every deploy and don't work across multiple
  server instances. Use Redis or a shared store.

### Major (should fix)
- The rate limit window resets on each request rather than using a
  sliding window, making it easy to bypass with careful timing.

### Minor (consider fixing)
- Variable `ratelimit_max` (line 12) should be `RATE_LIMIT_MAX` per PEP8.

## Security
The current implementation can be bypassed by rotating IPs. Consider
combining with user-based limiting in addition to IP-based.

## Overall Verdict
REQUEST CHANGES -- the in-memory state is a correctness bug that will
cause silent failures in production.
```

The in-memory state bug was what I missed. I was focused on the rate-limiting logic: the algorithm, the token bucket math. Whether state would evaporate on every deploy, or fail silently across multiple instances, hadn't registered. It showed up in Critical.

That's what makes it useful. When you're deep in how something works, you stop seeing what it assumes.

## How it works

About 250 lines of Python, no external dependencies. When a PR opens or gets updated:

1. The Action reads the PR number from the GitHub Actions event payload
2. It fetches the raw diff from GitHub's API (`Accept: application/vnd.github.v3.diff`)
3. It sends the diff, PR title, and description to Claude via the Anthropic API
4. Claude returns feedback in a fixed format: Summary, Issues (Critical / Major / Minor), Security, Verdict
5. That gets posted as a PR comment, or if one already exists from a previous push, patched in place

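Steps 1 and 2 can be sketched with nothing but the standard library. This is an illustrative reconstruction, not the repo's actual code; the function names are mine, but the endpoint and `Accept` header are GitHub's documented way to get a PR as a unified diff:

```python
import json
import os
import urllib.request

GITHUB_API = "https://api.github.com"

def pr_number_from_event(event: dict) -> int:
    """Read the PR number from the GitHub Actions event payload
    (the JSON file at $GITHUB_EVENT_PATH)."""
    return event["pull_request"]["number"]

def build_diff_request(repo: str, pr_number: int, token: str) -> urllib.request.Request:
    """Build a stdlib-only request for a PR's raw diff.

    The Accept header switches GitHub's pulls endpoint from JSON
    to unified-diff output.
    """
    return urllib.request.Request(
        f"{GITHUB_API}/repos/{repo}/pulls/{pr_number}",
        headers={
            "Accept": "application/vnd.github.v3.diff",
            "Authorization": f"Bearer {token}",
        },
    )

# Inside the Action, roughly:
# with open(os.environ["GITHUB_EVENT_PATH"]) as f:
#     event = json.load(f)
# req = build_diff_request(os.environ["GITHUB_REPOSITORY"],
#                          pr_number_from_event(event),
#                          os.environ["GITHUB_TOKEN"])
# diff = urllib.request.urlopen(req).read().decode()
```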
That last part matters more than it sounds. Without it, every push to a PR spawns a new top-level comment. If CI takes four tries to go green, you end up with four separate review blocks. The tool searches for its own marker (`<!-- claude-pr-reviewer -->`) and patches instead.

The prompt forces a consistent format:

```text
You are a senior software engineer doing a thorough code review.

PR Title: {title}
PR Description: {body}

Diff:
{diff}

Respond in this exact format:

## Summary
One paragraph describing what this PR does.

## Issues Found

### Critical (must fix before merge)
...

### Major (should fix)
...
```

Without the format constraint you get an essay. With it you get something scannable in 30 seconds.

## What it catches (and what it doesn't)

Good at: missing error handling, SQL injection risks, auth logic that doesn't match what the PR description says it does, edge cases around empty input or concurrent access. Things that pass human review because you're checking whether the code looks right, not whether the assumptions hold.

Bad at: anything that requires knowing your codebase beyond the diff. It won't know your error handling conventions, or that a module was deprecated last quarter. It only sees what changed.

If you don't have tests, none of this matters anyway. The tool tells you something is wrong; you still need tests to know whether the fix is right.

## Why no dependencies

External packages need `pip install`, which needs a setup step, which can fail, and then your PR review workflow is broken on someone's urgent hotfix at 2am. `urllib`, `json`, and `re` are always there.
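Even the Anthropic call fits in the standard library. A minimal sketch of building that request (the endpoint, headers, and `anthropic-version` value are the public Messages API; the model id and `max_tokens` are my placeholders, not the repo's values):

```python
import json
import urllib.request

ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"

def build_claude_request(api_key: str, prompt: str,
                         model: str = "claude-sonnet-4-5") -> urllib.request.Request:
    """Stdlib-only POST to the Anthropic Messages API -- no SDK, no pip."""
    payload = json.dumps({
        "model": model,
        "max_tokens": 4096,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        ANTHROPIC_URL,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "x-api-key": api_key,
            "anthropic-version": "2023-06-01",
        },
        method="POST",
    )

# resp = urllib.request.urlopen(build_claude_request(key, prompt))
# review_text = json.loads(resp.read())["content"][0]["text"]
```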

It also keeps the dependency attack surface at zero. One file. Nothing to audit.

## Models and cost

Default is Claude Sonnet. Each review costs roughly $0.003-$0.02 depending on diff size, so a team running 20 PRs a day spends on the order of $2-12/month. Opus is available for harder PRs by adding `model: claude-opus-4-6` to the workflow inputs -- more expensive ($0.01-$0.05/review) but better on subtle issues.

There's a paid hosted tier in progress for teams that don't want to manage their own API key.


Repo is at indoor47/claude-pr-reviewer. MIT license, works with your own key. If it flags something wrong, or misses something obvious, that's worth reporting.
