DEV Community

Vuong Ngo

Tiered AI Code Review: A Framework for AI-Generated PRs

Something has shifted in the review queue. The diffs are bigger, they arrive faster, and a growing slice of them were generated by an AI tool rather than drafted by a human thinking through the change. That is not a complaint. But it is a problem if your review process has not adapted to it.

GitClear's longitudinal analysis of AI-assisted code output tracked millions of lines of code and found that AI tooling correlates with rising churn (lines rewritten or deleted shortly after being written), increasing copy-paste frequency, and declining code reuse. A peer-reviewed study from Stanford found that developers using AI assistants produced significantly less secure code and were overconfident about it. A Veracode analysis spanning 100+ LLMs sharpens that picture: 45 percent of AI-generated samples failed security checks, producing 2.74x more vulnerabilities than equivalent human-written code.

None of these findings means AI coding tools are a net negative. Teams that use them ship faster. The issue is that uniform review, treating every AI-generated PR with the same depth as every human-written one, creates a bottleneck. And skipping review because "the AI probably got it right" quietly accumulates the kind of defect debt that surfaces at the worst time.

What works is calibration: a tiered AI code review approach that matches review effort to the actual risk of each PR, rather than applying one rule to all.

Three signals for tiered code review

Before a reviewer opens a diff, three signals tell you roughly what you are dealing with:

Code origin is how much of this change came from an AI tool. A human who accepted a few Copilot completions is different from a Claude Code session that planned, drafted, and committed a full feature. The distinction matters because AI-generated code tends to be syntactically sound but logically shallow. It passes linters. It sometimes misses invariants, forgets to handle the error path, or silently ignores a business rule that only lives in the team's memory.

Change scope is how many lines changed. Not a perfect signal, but a useful one. Larger diffs mean more surface area for reviewers to miss something, more decisions that were made without explicit human intent, and less chance that any single reviewer holds the whole change in their head at once.

Blast radius is what the code touches. A PR that modifies a fixture file is recoverable if something is wrong. A PR that touches the auth flow, a payment processor integration, or a database migration schema is not. The cost of a missed defect scales with blast radius, so that is where review depth needs to be highest.
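To make the blast-radius signal concrete, here is a minimal Python sketch that classifies a list of changed paths. The directory prefixes are hypothetical placeholders, not a recommendation for your layout. The one design point it illustrates is that the most severe match wins, so a single critical file makes the whole PR critical.

```python
# Hypothetical path prefixes; replace them with your repository's layout.
CRITICAL_PREFIXES = ("src/auth/", "src/payments/", "infra/", "db/migrations/")
MODERATE_PREFIXES = ("src/api/", "src/core/", "src/services/")


def blast_radius(changed_paths):
    """Return 'critical', 'moderate', or 'low' for a list of changed file paths."""
    # str.startswith accepts a tuple, so each check covers every prefix at once.
    if any(path.startswith(CRITICAL_PREFIXES) for path in changed_paths):
        return "critical"
    if any(path.startswith(MODERATE_PREFIXES) for path in changed_paths):
        return "moderate"
    return "low"


print(blast_radius(["tests/fixtures/users.json"]))
print(blast_radius(["docs/readme.md", "src/auth/oauth_callback.py"]))
```

Checking the critical prefixes first is what keeps a large, mostly harmless PR from hiding one risky file.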

Figure: Decision flowchart in which the code origin, change scope, and blast radius signals branch into Tier 1 (skim), Tier 2 (scrutinize), or Tier 3 (mandatory sign-off). The three signals combine to assign a review tier, and blast radius is the override: a critical path always lands at Tier 3 regardless of origin or scope.

The tier decision matrix

| Code Origin | Change Scope | Blast Radius | Review Tier |
| --- | --- | --- | --- |
| Human only | Any | Tests, docs, scripts | Tier 1: Skim |
| AI-assisted | < 100 lines | Low (internal tooling, scripts) | Tier 1: Skim |
| AI-assisted | 100–500 lines | Moderate (API surface, business logic) | Tier 2: Scrutinize |
| AI-generated | ≤ 300 lines | Low to moderate | Tier 2: Scrutinize |
| AI-generated | > 300 lines | Any | Tier 3: Sign-off |
| Any origin | Any scope | Critical (auth, payments, migrations, public API) | Tier 3: Sign-off |

The last row is the one teams are most likely to under-enforce. A 40-line AI-generated change to the OAuth callback handler is still Tier 3. The blast radius overrides everything else.
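The matrix can be read as a small decision function. The sketch below is one possible Python encoding, not a definitive one: the function name and the string values are made up for illustration, and the matrix leaves some combinations open (for example, AI-assisted PRs over 500 lines, or human-only PRs with a moderate blast radius). Resolving those to the stricter neighbouring tier is an assumption of this sketch.

```python
def assign_tier(origin, lines_changed, blast):
    """Map the three signals to a review tier (1, 2, or 3).

    origin: 'human', 'ai-assisted', or 'ai-generated'
    blast:  'low', 'moderate', or 'critical'
    """
    # Blast radius is the override: critical paths are always Tier 3.
    if blast == "critical":
        return 3
    if origin == "ai-generated":
        return 3 if lines_changed > 300 else 2
    if origin == "ai-assisted":
        # Only small, low-risk AI-assisted changes qualify for a skim.
        # Everything else, including cases the matrix does not list,
        # is treated as Tier 2 (assumption).
        if lines_changed < 100 and blast == "low":
            return 1
        return 2
    # Human-only: the matrix covers low-risk paths (Tier 1); treating a
    # moderate blast radius as Tier 2 is an assumption.
    return 1 if blast == "low" else 2


# The 40-line OAuth callback change from the text still lands in Tier 3.
print(assign_tier("ai-generated", 40, "critical"))
```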

Figure: Bar chart of relative defect risk for AI-generated versus human PRs across Tier 1, Tier 2, and Tier 3, with the AI-generated Tier 3 bar highest. The values are illustrative, derived from GitClear's general 2024 findings rather than measured per-tier data. AI-generated Tier 3 PRs carry the highest uncaught-defect risk before review.

What each tier actually demands

Tier 1 (Skim): One reviewer. CI must pass. Read the entire diff end-to-end, including the parts that look fine. The only mandatory check beyond "CI green" is confirming there are no hardcoded credentials or API keys. Target turnaround: 4 hours.

Tier 2 (Scrutinize): One reviewer, more attention. The reviewer reads every changed function with the intent to understand the logic, not evaluate the formatting. Test coverage for new branches is required, not optional. Any new dependency added to the project gets audited: license, maintenance status, and whether it pulls in something unexpected. If the PR is AI-generated and crosses a service boundary, run a security scan. Target turnaround: 24 hours.

Tier 3 (Mandatory Sign-off): Two reviewers, one of them the tech lead. CI required. Security scan required. The PR description must include a rollback plan and evidence that the change was tested in staging. For teams building systems covered by the EU AI Act (high-risk AI obligations apply from August 2026), this tier is also where you flag and document regulatory touchpoints. Target turnaround: 48 hours.

Tier 3 is a gate, not a penalty. A PR that lands in it because it touches the payment flow is not a problem PR. It is a PR that deserves a different kind of attention.
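The mechanical parts of these requirements can be checked before merge. The following Python sketch is a hypothetical gate, with invented field names, that lists what is still missing for a given tier. It covers only what a machine can verify (approvals, CI, scan status, and declared rollback and staging evidence). The reading, understanding, and judgment each tier asks for stay with the reviewer, and Tier 2's conditional security scan is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class PullRequestState:
    # All fields are illustrative; populate them from your own tooling.
    approvals: int
    tech_lead_approved: bool
    ci_passed: bool
    security_scan_passed: bool
    has_rollback_plan: bool
    staging_verified: bool


def unmet_requirements(tier, pr):
    """Return a list of blockers; an empty list means the gate is satisfied."""
    blockers = []
    if not pr.ci_passed:
        blockers.append("CI must pass")
    required_approvals = 2 if tier == 3 else 1
    if pr.approvals < required_approvals:
        blockers.append(f"needs {required_approvals} approval(s)")
    if tier == 3:
        if not pr.tech_lead_approved:
            blockers.append("tech lead sign-off missing")
        if not pr.security_scan_passed:
            blockers.append("security scan required")
        if not pr.has_rollback_plan:
            blockers.append("rollback plan missing from PR description")
        if not pr.staging_verified:
            blockers.append("staging verification missing")
    return blockers


pr = PullRequestState(1, False, True, False, False, False)
print(unmet_requirements(3, pr))
```

Returning the full list of blockers, rather than stopping at the first one, lets the author fix everything in one pass.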

Automating tier assignment

Manual labeling works up to about five AI-generated PRs a day. Beyond that, a GitHub Actions workflow that reads diff size and changed paths can assign the right label automatically. Contributors add the ai-generated label to the PR; the workflow handles tier calculation from there.

# .github/workflows/pr-tier.yml
name: PR Review Tier

on:
  pull_request:
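    # Re-run on label changes so a late "ai-generated" label updates the tier.
    types: [opened, reopened, synchronize, labeled, unlabeled]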

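permissions:
  contents: read        # needed by actions/checkout once permissions are restricted
  pull-requests: write  # needed to add and remove labels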

jobs:
  assign-tier:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Compute tier inputs
        id: inputs
        run: |
          BASE=${{ github.event.pull_request.base.sha }}
          HEAD=${{ github.event.pull_request.head.sha }}

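          # Three-dot diff compares HEAD against the merge base, so commits that
          # landed on the base branch after the PR forked are not counted.
          # Only added lines are summed; binary files report "-" and count as 0.
          LINES=$(git diff --numstat "$BASE...$HEAD" \
            | awk '{added += $1} END {print added+0}')
          echo "lines=$LINES" >> $GITHUB_OUTPUT

          git diff --name-only "$BASE...$HEAD" > /tmp/changed_files.txt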
          if grep -qE '^(src/auth/|src/payments/|infra/|db/migrations/)' \
               /tmp/changed_files.txt; then
            echo "blast=critical" >> $GITHUB_OUTPUT
          elif grep -qE '^(src/api/|src/core/|src/services/)' \
               /tmp/changed_files.txt; then
            echo "blast=moderate" >> $GITHUB_OUTPUT
          else
            echo "blast=low" >> $GITHUB_OUTPUT
          fi

      - name: Determine tier
        id: tier
        run: |
          LINES=${{ steps.inputs.outputs.lines }}
          BLAST=${{ steps.inputs.outputs.blast }}
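          # Match the label name exactly; jq's contains() does substring
          # matching and would also accept a label like "not-ai-generated".
          HAS_AI=$(gh pr view ${{ github.event.pull_request.number }} \
            --json labels \
            -q 'any(.labels[].name; . == "ai-generated")' \
            2>/dev/null || echo false)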

          if [ "$BLAST" = "critical" ]; then
            echo "label=review/tier-3" >> $GITHUB_OUTPUT
          elif [ "$HAS_AI" = "true" ] && [ "$LINES" -gt 300 ]; then
            echo "label=review/tier-3" >> $GITHUB_OUTPUT
          elif [ "$LINES" -gt 300 ] || [ "$HAS_AI" = "true" ]; then
            echo "label=review/tier-2" >> $GITHUB_OUTPUT
          else
            echo "label=review/tier-1" >> $GITHUB_OUTPUT
          fi
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}

      - name: Apply label
        run: |
          for l in review/tier-1 review/tier-2 review/tier-3; do
            gh pr edit ${{ github.event.pull_request.number }} \
              --remove-label "$l" 2>/dev/null || true
          done
          gh pr edit ${{ github.event.pull_request.number }} \
            --add-label "${{ steps.tier.outputs.label }}"
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}

Before this runs, create the three labels (review/tier-1, review/tier-2, review/tier-3) in your repository, either on its Labels page or with gh label create. The path patterns passed to grep -qE are illustrative; tune them to your actual directory structure.

The ai-generated label is set manually by the contributor today. If your AI tool's GitHub App supports labeling on commit, you can automate that too.
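One way to approximate origin detection without a GitHub App is to scan commit messages for a Co-authored-by trailer that names an AI tool. The Python sketch below assumes your tool writes such a trailer, and the marker strings are illustrative guesses, so verify what your tools actually emit. Treat a match as a hint that the label may apply, not as proof of how the code was written.

```python
import re

# Hypothetical markers; check the trailers your tools really write.
AI_COAUTHOR_MARKERS = ("claude", "copilot")

# Matches a "Co-authored-by: Name <email>" trailer on its own line.
TRAILER = re.compile(r"^co-authored-by:\s*(.+)$", re.IGNORECASE | re.MULTILINE)


def looks_ai_generated(commit_messages):
    """True if any commit message has a co-author trailer naming a known AI tool."""
    for message in commit_messages:
        for coauthor in TRAILER.findall(message):
            if any(marker in coauthor.lower() for marker in AI_COAUTHOR_MARKERS):
                return True
    return False


messages = [
    "Add retry logic\n\nCo-Authored-By: Claude <noreply@anthropic.com>",
    "Fix typo",
]
print(looks_ai_generated(messages))
```

In a workflow, the messages could come from the commits between the PR's base and head, and a positive result could apply the ai-generated label before the tier step runs.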

A policy template your team can commit

The automation handles assignment. A written policy handles what the reviewer is expected to actually do. Commit this to .github/ and reference it from CONTRIBUTING.md:

# .github/ai-review-policy.yml
# Review policy for AI-assisted and AI-generated pull requests.
# Version this file; update it when your team's trust calibration changes.

version: "1.0"

# Origin labels (applied by contributor before requesting review):
#   "ai-generated"  - PR was authored primarily by an AI tool
#   "ai-assisted"   - human-led PR with AI filling in sections
# No label = human-authored

tiers:
  tier-1:
    name: Skim
    requirements:
      min_approvals: 1
      ci_required: true
      checklist:
        - CI green
        - Full diff read (no section-skipping)
        - No hardcoded credentials or API keys
    target_sla_hours: 4

  tier-2:
    name: Scrutinize
    requirements:
      min_approvals: 1
      ci_required: true
      security_scan: recommended
      checklist:
        - Every changed function read and understood
        - Logic verified (not just syntax and style)
        - Branch coverage checked for new code paths
        - Error handling and edge cases reviewed
        - New dependencies audited (license + maintenance)
    target_sla_hours: 24

  tier-3:
    name: Mandatory Sign-off
    requirements:
      min_approvals: 2        # includes tech lead
      ci_required: true
      security_scan: required
      architecture_review: required
      checklist:
        - Threat model reviewed (updated if changed)
        - Rollback plan documented in PR description
        - Staging deployment verified before merge
        - Tech lead sign-off recorded in a review comment
        - Regulatory obligations flagged (EU AI Act if applicable)
    target_sla_hours: 48

This is a starting point. The SLAs in particular need to match your team's actual capacity; 48 hours for Tier 3 is a ceiling, not a target.

What review tooling can and cannot do

In March 2026, Anthropic launched a dedicated code review tool for Claude, joining existing tools from CodeRabbit, Bito, and others. These tools surface obvious issues automatically and are worth running as part of CI on every PR. They do not replace tier assignment.

The tier determines who looks at the output, with how much attention, and with what checklist. An automated reviewer can flag a suspicious SQL interpolation. It cannot tell you whether the new auth middleware changes the behavior of a third-party SSO integration in a way that matters. That is still a human judgment call, and tiers are how you make sure the right humans are making it.

Making tiers stick

A policy that lives only in a file nobody reads will not hold. Two things that help more than documentation:

The label is visible. When a reviewer opens the PR list and sees review/tier-3 on four open PRs, they know before clicking what those reviews require. That visibility reduces the chance of a quick glance standing in for a real review.

Pattern tracking matters too. If your Tier 3 queue is consistently dominated by one service, one contributor pattern, or one type of AI-generated output, that is a signal worth addressing at the root rather than only at review time. Some teams use a shared task board to track which work items were completed by AI agents and which are still waiting for human sign-off. Agiflow's Claude Code integration, for example, gives engineering leads a view of AI-agent task handoffs alongside open work, so reviewers have context before they open the diff.
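Spotting that kind of concentration does not need special tooling. As a rough Python sketch, assuming each PR record carries a tier and a service field (both hypothetical names you would fill from your own PR export), a counter over the Tier 3 queue shows where the load comes from:

```python
from collections import Counter


def tier3_hotspots(prs, top_n=3):
    """Return the services with the most Tier 3 PRs as (service, count) pairs."""
    counts = Counter(pr["service"] for pr in prs if pr["tier"] == 3)
    return counts.most_common(top_n)


# Invented sample records, for illustration only.
queue = [
    {"tier": 3, "service": "payments"},
    {"tier": 3, "service": "payments"},
    {"tier": 3, "service": "auth"},
    {"tier": 2, "service": "auth"},
]
print(tier3_hotspots(queue))
```

The same grouping works for contributor or AI tool instead of service, depending on which pattern you suspect.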

Trust through evidence

The data on AI code quality is real, and the right response to it is not to treat AI-generated code as permanently suspect. It is to build the kind of evidence trail that tells you, over time, whether your calibration is right.

Start with the matrix. Add the labels. Ship the workflow. After a month, compare your Tier 2 defect rate for AI-generated PRs with the rate for human-written ones. If the two are converging, you have evidence for relaxing the thresholds. If they are not, you have found the signal that tells you where to spend more attention.
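The monthly comparison is simple arithmetic once each merged PR is recorded with its tier, its origin, and whether a defect was later traced back to it. A minimal Python sketch, assuming a hypothetical record shape and using invented sample data:

```python
def defect_rate(prs):
    """Fraction of PRs with a defect traced back to them; 0.0 for an empty list."""
    if not prs:
        return 0.0
    return sum(1 for pr in prs if pr["defect_found"]) / len(prs)


def tier2_defect_rates(prs):
    """Return (ai_generated_rate, human_rate) for Tier 2 PRs only."""
    tier2 = [pr for pr in prs if pr["tier"] == 2]
    ai = [pr for pr in tier2 if pr["ai_generated"]]
    human = [pr for pr in tier2 if not pr["ai_generated"]]
    return defect_rate(ai), defect_rate(human)


# Invented sample records, for illustration only.
sample = [
    {"tier": 2, "ai_generated": True, "defect_found": True},
    {"tier": 2, "ai_generated": True, "defect_found": False},
    {"tier": 2, "ai_generated": False, "defect_found": False},
    {"tier": 2, "ai_generated": False, "defect_found": False},
    {"tier": 3, "ai_generated": True, "defect_found": True},
]
ai_rate, human_rate = tier2_defect_rates(sample)
print(f"AI-generated: {ai_rate:.0%}, human: {human_rate:.0%}")
```

With only a month of data the counts will be small, so read the gap as a direction rather than a precise measurement.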

That is the version of trust that holds up in a tiered code review system. Earned, specific, adjusted as the tools improve. Not reflexive approval, and not permanent suspicion.
