
Claude Code for Code Review Automation: How I Replaced 80% of My Manual PR Reviews with AI

Originally published on Hashnode. Cross-posted for the DEV.to community.

For a long time my code review process was a bottleneck. PRs sat in queue for hours because I was the only senior dev who could review backend changes. By the time I got to a review, half the context was stale, the author had moved on to other work, and I rushed through the review just to unblock them. The result was reviews that were either too lenient or too pedantic. Neither was useful.

Then I wired Claude Code into my PR pipeline. Now every PR gets a structured review within 90 seconds of being opened. Security issues, race conditions, missing tests, and style violations get caught before I even see the PR. By the time I sit down to review, the obvious stuff is already flagged and I can focus on the architectural questions only a human can answer. This is the setup that made it work.


Why Manual PR Reviews Don't Scale

The fundamental problem with code review is attention economics. Every PR demands the same 30 minutes of focused review whether it's a typo fix or a database migration. You can't context switch into a 200-line PR and produce useful feedback in 5 minutes. You either spend the full 30 or you skim and miss things.

For a small team, this is fine. For a team shipping 20 PRs a day, it breaks. The reviewer becomes a bottleneck. PR queues grow. Authors lose context while waiting. Reviews get rushed. Quality drops.

The breaking point for me was a Monday where I had 14 PRs waiting. I spent the entire day on reviews, shipped nothing of my own, and still had 6 PRs in queue at 6pm. That was the day I decided to automate the parts of review that didn't need human judgment.

A code review is not one task. It's five tasks: catch obvious bugs, enforce style, verify tests exist, check security patterns, and evaluate architecture. Only the last one needs a human.


The Four Layers of Review

I split code review into four layers, each handled by a different mechanism. The goal is to push as much as possible to the cheapest layer.

Layer 1: Linters and Formatters

These catch syntax issues, formatting violations, and obvious mistakes. They run on commit hooks and in CI. They are not part of my Claude Code setup at all. If your linter isn't catching basic style issues before the PR is opened, fix that first.

Layer 2: Static Analysis

Tools like ESLint with custom rules, Semgrep, and Bandit catch issues that a basic lint pass misses: unused variables, dangerous patterns, security antipatterns. These also run in CI.

Layer 3: Claude Code Review

This is where the magic happens. Claude Code reads the diff and produces structured feedback on issues that require understanding context: race conditions, missing input validation, error handling gaps, performance regressions, missing test coverage for critical paths.

Layer 4: Human Review

I focus on architecture, business logic correctness, and questions the author should be asking. Anything Claude flagged in Layer 3 has either been fixed by the author or escalated to me with context.

The result: my human review time per PR dropped from 30 minutes to 8 minutes, and I catch more real issues because I'm not buried in stylistic noise.


The Code Review Skill

I built this as a Claude Code skill that gets invoked on every PR. The skill has a single job: review a diff and produce structured feedback.

---
name: code-review
description: Reviews a code diff and produces structured findings on bugs, security, performance, and test coverage.
---

# Code Review

You are a senior engineer reviewing a pull request. You have 15 years of experience and you push back on lazy patterns. You do not produce stylistic feedback (that's the linter's job).

## Your Task

Review the diff at the path provided. Produce findings in this exact format:

| File | Line | Severity | Category | Issue | Suggested Fix |
|------|------|----------|----------|-------|---------------|

Severity scale: Critical (data loss, security, production breakage) / High (bugs, missing validation, performance) / Medium (maintainability, edge cases) / Low (nits, suggestions).

Categories: Bug / Security / Performance / Tests / Architecture / DataIntegrity.

## Rules

- Skip findings the linter would catch
- Skip stylistic feedback unless it impacts correctness
- If a finding is uncertain, mark it explicitly: "Verify: ..."
- Limit to top 15 findings, sorted by severity descending
- If no issues found, say so explicitly. Don't fabricate findings.

## Output

Findings table first, then a one-paragraph summary at the bottom: overall risk assessment and recommendation (approve / request changes / block).

This skill produces consistent output every time. The format is parseable, the severity levels are clear, and the recommendation gives me a starting point.
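
To give a sense of what "parseable" buys you, here's a minimal Python sketch that pulls the findings table out of the review and fails a CI step when anything Critical is present. It reads the same /tmp/review.md path the workflow in the next section writes to; the gating logic itself is illustrative, not a fixed part of the skill.

# Minimal sketch: parse the skill's findings table and gate on Critical findings.
# Column order follows the table format defined in the skill.
import sys

def parse_findings(review_md: str) -> list[dict]:
    """Turn the markdown findings table into a list of dicts."""
    findings = []
    for line in review_md.splitlines():
        if not line.startswith("|") or set(line) <= {"|", "-", " "}:
            continue  # skip prose lines and the header separator row
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) < 6 or cells[0] == "File":
            continue  # skip the header row and malformed rows
        findings.append(dict(zip(
            ["file", "line", "severity", "category", "issue", "fix"], cells)))
    return findings

if __name__ == "__main__":
    review = open("/tmp/review.md", encoding="utf-8").read()
    criticals = [f for f in parse_findings(review) if f["severity"] == "Critical"]
    if criticals:
        print(f"Blocking: {len(criticals)} Critical finding(s) in the review.")
        sys.exit(1)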


Wiring It Into the PR Pipeline

The skill is just a prompt. The integration is what makes it useful. Here's the full flow.

Step 1: PR Opens, GitHub Action Fires

A GitHub Action listens for pull_request events. When a PR opens or updates, the action triggers.

on:
  pull_request:
    types: [opened, synchronize]

jobs:
  ai-review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Generate diff
        run: git diff origin/${{ github.base_ref }}...HEAD > /tmp/pr.diff
      - name: Run Claude Code review
        run: |
          claude-code --skill code-review --input /tmp/pr.diff > /tmp/review.md
      - name: Post review as PR comment
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const review = fs.readFileSync('/tmp/review.md', 'utf-8');
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: review
            });

Step 2: Review Posts as PR Comment

Within 60 to 90 seconds of the PR opening, a structured review appears as a comment on the PR. The author sees it before I do.

Step 3: Author Addresses Findings

The author reviews the findings, fixes what's actionable, and pushes again. The Action re-fires on the new commits and posts an updated review. Most authors clear all Critical and High findings before flagging me.

Step 4: I Review What's Left

When I get the PR, the bulk of the review is done. I focus on architecture, business logic, and anything Claude marked as "Verify."


The Findings That Actually Matter

After running this for three months, I have data on what Claude Code catches well and what it doesn't.

Catches Well

  • Race conditions in async code
  • Missing input validation on API endpoints
  • SQL injection patterns
  • Unhandled promise rejections
  • Off-by-one errors in pagination
  • Missing tests for new error paths
  • Performance regressions (N+1 queries, unbounded loops)
  • Stale comments that contradict the code

Catches Poorly

  • Architectural mismatches (Claude sees one PR, not the system)
  • Business logic errors (Claude doesn't know your domain)
  • Subtle concurrency bugs across services
  • Issues that require historical context
  • Decisions that depend on team conventions not documented anywhere

This is exactly the split I want. The mechanical stuff goes to AI. The judgment stuff comes to me.

The review skill is a force multiplier on PRs that need a basic safety pass. It is not a replacement for review on PRs that touch core architecture or business logic. Know which PRs are which.
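
If you want that triage to be mechanical rather than tribal knowledge, a small gate is enough: anything touching a sensitive path gets a mandatory human pass regardless of what the bot says. The path prefixes below are an assumption about repo layout, not a universal rule.

# Sketch of a path-based triage gate. The prefixes are illustrative; the point
# is that PRs touching core architecture never rely on the AI review alone.
SENSITIVE_PREFIXES = ("src/core/", "migrations/", "billing/", "api/public/")

def needs_human_review(changed_files: list[str]) -> bool:
    return any(path.startswith(SENSITIVE_PREFIXES) for path in changed_files)

print(needs_human_review(["src/core/orders.py"]))  # True: core path touched
print(needs_human_review(["docs/changelog.md"]))   # False: a safety pass is enough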


The Custom Categories That Made It Click

The default skill template is generic. The version that actually works for my codebase has custom categories specific to what I care about. I added these over time based on issues I kept seeing.

Tenant Isolation

Every database query in our app should be scoped to a tenant. Missing tenant filters are a critical security issue. I added a "TenantIsolation" category with this rule:

Flag any database query that touches a multi-tenant table 
without an explicit tenant_id filter in the WHERE clause.

This catches issues that linters can't, because the rule depends on knowing which tables are multi-tenant.
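
Here's a contrived example of the kind of query this rule flags, and the scoped version it expects. The table and column names are made up for illustration.

import sqlite3

def list_invoices_unscoped(db: sqlite3.Connection, status: str):
    # Flagged: multi-tenant table queried with no tenant_id in the WHERE clause,
    # so one tenant's request can return every tenant's rows.
    return db.execute(
        "SELECT id, amount FROM invoices WHERE status = ?", (status,)
    ).fetchall()

def list_invoices(db: sqlite3.Connection, tenant_id: int, status: str):
    # Passes: the query is explicitly scoped to the requesting tenant.
    return db.execute(
        "SELECT id, amount FROM invoices WHERE tenant_id = ? AND status = ?",
        (tenant_id, status),
    ).fetchall()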

Idempotency

API endpoints that mutate state should be idempotent. I added:

For any new POST/PUT/PATCH endpoint, verify idempotency. 
If the endpoint can be called twice and produce a different 
result the second time, flag it as an issue.
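
A contrived example of the difference this rule looks for; a dict stands in for a real datastore.

# Calling the naive handler twice double-charges; the idempotent version keys
# the mutation on a client-supplied idempotency key and replays the original result.
charges: dict[str, dict] = {}

def create_charge_naive(amount: int) -> dict:
    # Flagged: a retried POST creates a second charge.
    charge = {"id": len(charges) + 1, "amount": amount}
    charges[f"auto-{charge['id']}"] = charge
    return charge

def create_charge(idempotency_key: str, amount: int) -> dict:
    # Passes: a second call with the same key returns the first result unchanged.
    if idempotency_key in charges:
        return charges[idempotency_key]
    charge = {"id": len(charges) + 1, "amount": amount}
    charges[idempotency_key] = charge
    return charge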

Backwards Compatibility

We have public API consumers. Breaking changes are a critical issue. I added:

Flag any change to a public API contract: new required 
fields in requests, removed fields in responses, changed 
field types, changed status codes, changed error formats.
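
A contrived before/after showing the sort of change this rule flags even when every internal test still passes:

# Before: consumers rely on "total_cents" as an integer.
def serialize_order_before(order) -> dict:
    return {"id": order.id, "total_cents": order.total_cents, "status": order.status}

# After (flagged): the field was renamed and its unit changed, which silently
# breaks every existing consumer of the public API.
def serialize_order_after(order) -> dict:
    return {"id": order.id, "total": order.total_cents / 100, "status": order.status}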

These domain-specific categories made the skill 3x more valuable than the generic version. The lesson: start with a generic review skill, then add categories as you encounter issue patterns that keep recurring.

Want my full code review skill template plus the 12 domain-specific categories I've added over time? Grab the code review automation toolkit where I share the exact prompts I run in production.


Handling False Positives

Claude Code occasionally produces findings that are wrong. Either the finding is technically incorrect, or it's correct but irrelevant in this context. Here's how I handle that.

Inline Suppression

The author can reply to a finding with [suppress: <reason>] and the next review run will skip that finding. The suppression reason is logged.
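
Collecting those markers before the next run can be as simple as pulling the PR's conversation comments and extracting every match. A rough sketch of one way to do it, using the standard GitHub comments API:

import os
import re
import requests

def collect_suppressions(owner: str, repo: str, pr_number: int) -> list[str]:
    """Return every '[suppress: <reason>]' marker left on the PR conversation."""
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/issues/{pr_number}/comments",
        headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
        timeout=30,
    )
    resp.raise_for_status()
    return [m.group(1).strip()
            for comment in resp.json()
            for m in re.finditer(r"\[suppress:\s*([^\]]+)\]", comment["body"])]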

Pattern Suppression

If the same false positive shows up across many PRs, I add a pattern to a .claude-review-ignore file in the repo root. The skill reads this file before producing findings.

# .claude-review-ignore
- Pattern: "Missing await on logger.info"
  Reason: Logger is intentionally fire-and-forget

- Pattern: "Magic number 86400"
  Reason: Standard seconds-in-a-day constant

The skill ignores findings that match these patterns. False positive rates dropped to under 5% once I had a dozen patterns in the ignore file.
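
The file as shown happens to be valid YAML, so if you'd rather enforce the ignore list as a hard post-filter outside the prompt, a few lines do it. This assumes findings parsed into dicts like the parser sketched earlier.

import yaml  # PyYAML

def load_ignore_patterns(path: str = ".claude-review-ignore") -> list[str]:
    with open(path, encoding="utf-8") as f:
        return [entry["Pattern"] for entry in (yaml.safe_load(f) or [])]

def filter_findings(findings: list[dict], patterns: list[str]) -> list[dict]:
    # Drop any finding whose issue text contains an ignored pattern (case-insensitive).
    return [f for f in findings
            if not any(p.lower() in f["issue"].lower() for p in patterns)]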

Calibration Over Time

Every two weeks I review the suppressed findings as a group. If a pattern appears 5+ times, it goes into the ignore file. If a finding type is consistently wrong, I refine the skill prompt to be more specific.
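
The biweekly pass is mostly counting. Something like this is enough to surface the repeat offenders:

from collections import Counter

def ignore_candidates(suppression_reasons: list[str], threshold: int = 5) -> list[str]:
    # Group logged suppression reasons and return the ones that keep recurring,
    # which are candidates for .claude-review-ignore.
    counts = Counter(reason.strip().lower() for reason in suppression_reasons)
    return [reason for reason, n in counts.items() if n >= threshold]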


The Metrics That Prove It Works

I tracked these metrics for the first three months after rolling this out.

  • Time to first review: dropped from 4.2 hours to 87 seconds
  • My review time per PR: dropped from 30 minutes to 8 minutes
  • Critical issues caught before merge: up 42%
  • PR cycle time (open to merge): dropped from 2.1 days to 8 hours
  • PRs requiring more than one review round: dropped from 38% to 14%

What I didn't expect: developer satisfaction went up because PRs moved faster, and I had time to do real architectural reviews instead of mechanical safety passes.


What I'd Do Differently

Three things I'd tell my past self before starting this.

First: don't skip the suppression mechanism. The first version of my skill had no way to mark false positives. Within two weeks, developers were ignoring the bot entirely because it was too noisy. The suppression system is what made the bot trusted.

Second: invest in domain-specific categories early. The generic review skill catches generic issues. Your codebase has codebase-specific patterns that matter more. The first 5 categories I added moved the bot from "useful" to "critical."

Third: don't try to replace human review entirely. The temptation is to keep adding categories until the bot catches everything. That's the wrong target. The bot should handle what it does well, and humans should handle the rest. Trying to push the bot into architecture review just produces unreliable output.

The full code review automation pipeline plus my GitHub Action template is available in the automation toolkit. Drop it into your repo and you'll have AI reviews running within an hour.


What's Next

The code review skill is one piece of a larger automation push. The next layer I'm building is automatic regression analysis: when a PR introduces a behavior change, the skill detects it and asks the author to confirm the change is intentional. I'm also wiring Claude Code into the deploy pipeline so production incidents trigger an automatic post-mortem draft.

The pattern is the same one that made code review work: identify the parts that don't need human judgment, push them to AI, and free up your humans for the parts that do. Once you internalize that pattern, you start seeing it everywhere in your workflow.

If you take one thing from this: code review is the highest-leverage place to start with AI automation in a dev team. The work is structured, the output format is clear, the integration is straightforward, and the time savings are measurable from day one.
