Originally published at forge.closerhub.app

The QA and Code Review Checklist for AI-Generated PRs That Nobody Wrote

TL;DR: Want the complete playbook? This article covers the concepts. The full guide includes complete QA checklists, review templates, and a team workflow integration guide.

→ Get the AI PR Review Checklist guide: 12€, instant PDF · 30-day refund

My team ships roughly 40 PRs a week. About half of them are AI-assisted: Copilot, Cursor, Claude. The velocity is real. So is the chaos if you don't have a process to match it.

After six months of watching AI-generated code sneak subtle bugs, over-engineered abstractions, and hallucinated API calls into production, I built a review playbook that actually scales. Here's what works.


Why AI PRs Break Your Existing Review Process

Traditional code review assumes the author understood the problem deeply before writing code. AI-generated code breaks that assumption in three specific ways:

Confident incorrectness. The diff looks clean. Tests pass. The logic is plausible. But the AI misunderstood a subtle requirement: maybe it ignored a race condition in your async queue, or used a deprecated SDK method that still compiles.

Surface-level coherence. AI output is syntactically tidy and stylistically consistent, which tricks reviewers into approving it faster. The signal-to-noise ratio feels high, so you speed up exactly when you should slow down.

Missing context stitching. AI doesn't know that getUserById has a known N+1 issue in production, or that your auth middleware behaves differently behind the CDN. It writes code that's correct in isolation and broken in your system.
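To make "correct in isolation, broken in your system" concrete, here's a minimal TypeScript sketch of the getUserById case. The data-access functions and shapes are hypothetical stand-ins, not real project code:

```typescript
// Hypothetical data-access API, declared only so the sketch type-checks.
interface User { id: string; email: string; }
declare function getUserById(id: string): Promise<User>;        // known N+1 problem in prod
declare function getUsersByIds(ids: string[]): Promise<User[]>; // batched alternative

// What AI tends to produce: perfectly plausible, one query per user under load.
async function collectEmailsNaive(userIds: string[]): Promise<string[]> {
  const emails: string[] = [];
  for (const id of userIds) {
    const user = await getUserById(id); // N+1: one round trip per id
    emails.push(user.email);
  }
  return emails;
}

// What the system actually needs: a single batched lookup.
async function collectEmails(userIds: string[]): Promise<string[]> {
  const users = await getUsersByIds([...new Set(userIds)]);
  return users.map((u) => u.email);
}
```

Nothing in the naive version is wrong in isolation; the bug only exists in the context of your production access patterns.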

Your old checklist ("does it work, is it readable, are there tests") isn't enough.


The Triage Protocol: Before You Open the Diff

Don't review every AI PR the same way. Triage first, then allocate attention proportionally.

High-risk signals (slow down, full review):

  • Touches auth, permissions, or session handling
  • Modifies database schema or migration files
  • Introduces a new external dependency
  • Changes error handling in critical paths (payments, data exports)
  • Large surface area: 500+ lines across many files

Lower-risk signals (faster review, lighter checklist):

  • Isolated utility function with full unit test coverage
  • UI-only change with no state or API interaction
  • Copy/content update
  • Refactor with identical input/output contract

Tag PRs on open with a risk label. This alone cuts review time by ~30% because reviewers stop applying maximum scrutiny to a CSS padding fix.
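If you want the label applied automatically rather than from memory, a bot or CI step can run the same heuristic. A minimal sketch, assuming the changed file paths and line count are already available; the path patterns and thresholds are illustrative, not a definitive rule set:

```typescript
type Risk = "high" | "low";

// Illustrative high-risk markers: auth/session code, migrations, payment paths,
// and dependency manifest changes. Tune these to your own repo layout.
const HIGH_RISK_PATTERNS = [
  /auth|session|permission/i,
  /migrations?\//,
  /payments?|billing/i,
  /package\.json$|go\.mod$|requirements\.txt$/,
];

function classifyRisk(changedFiles: string[], changedLines: number): Risk {
  const touchesSensitivePath = changedFiles.some((file) =>
    HIGH_RISK_PATTERNS.some((pattern) => pattern.test(file))
  );
  const largeSurfaceArea = changedLines >= 500;
  return touchesSensitivePath || largeSurfaceArea ? "high" : "low";
}

// A schema migration plus a dependency bump gets the full review:
classifyRisk(["db/migrations/0042_add_index.sql", "package.json"], 120); // "high"
```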


The Core QA Checklist (Use This)

For every AI-generated PR, I run through these in order. The sequence matters: stop early if you hit a red flag.

1. Requirement alignment

  • [ ] Does the diff actually solve the stated problem, or a plausible-sounding adjacent problem?
  • [ ] Are edge cases from the ticket/spec covered, or just the happy path?

2. Integration correctness

  • [ ] Does it call internal APIs/services with the current contracts, not outdated signatures?
  • [ ] Are environment-specific configs (staging vs prod, feature flags) handled correctly?
  • [ ] Does it interact correctly with existing shared state (cache, queue, DB)?

3. Error handling

  • [ ] Are errors surfaced meaningfully, or swallowed silently?
  • [ ] Does it distinguish between retryable and non-retryable failures?
  • [ ] What happens when a dependency is unavailable?

4. Security

  • [ ] Are user inputs validated and sanitized at trust boundaries?
  • [ ] Is any sensitive data (tokens, PII) being logged or exposed in errors?
  • [ ] Does it respect existing auth/permission checks, not re-implement them?

5. Performance

  • [ ] Any N+1 query patterns or unbounded loops over user-controlled data?
  • [ ] Are there missing indexes implied by new query patterns?
  • [ ] Will this degrade under load (missing pagination, large payload responses)?

6. Test quality

  • [ ] Do tests assert behavior, or just that code runs without throwing?
  • [ ] Are mocks hiding real integration issues?
  • [ ] Is the unhappy path tested?

Print this out. Paste it as a PR template comment for AI-assisted work. It takes 8 minutes to run through on a medium-sized PR.
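To show what item 6 means in practice, here's a small vitest-style sketch; applyDiscount and its rules are made up for illustration:

```typescript
import { test, expect } from "vitest";

// Hypothetical function under review, generated alongside its own tests.
function applyDiscount(total: number, code: string): number {
  if (code === "SAVE10") return total - 10;
  throw new Error(`Unknown discount code: ${code}`);
}

// Weak: only proves the code runs without throwing. AI-generated tests often stop here.
test("applyDiscount runs", () => {
  expect(() => applyDiscount(100, "SAVE10")).not.toThrow();
});

// Better: asserts the actual behavior and covers the unhappy path.
test("applies a 10-unit discount for SAVE10", () => {
  expect(applyDiscount(100, "SAVE10")).toBe(90);
});

test("rejects unknown codes", () => {
  expect(() => applyDiscount(100, "NOPE")).toThrow("Unknown discount code");
});
```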


Red Flags That Should Pause Any Review

These are immediate "request changes" without reading further:

  • Placeholder error handling: catch (e) { console.log(e) } or empty catch blocks
  • Hardcoded credentials or URLs: AI loves writing http://localhost:3000 or test API keys in code
  • Unexplained new dependencies: AI sometimes pulls in a 200KB library to replace a 3-line function
  • Overridden type safety: as any, // @ts-ignore, or ! non-null assertions without comments
  • Database operations outside transactions: especially create + update patterns
  • Commented-out code blocks: AI frequently leaves old implementations commented out "just in case"

Any of these triggers a conversation before review continues.
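For the first red flag, that conversation usually ends with something like the second version below. pushInvoicesToBilling and logger are hypothetical stand-ins:

```typescript
// Hypothetical dependencies, declared only so the sketch type-checks.
declare function pushInvoicesToBilling(): Promise<void>;
declare const logger: { error(msg: string, meta: object): void };

// The red flag as it usually appears: the error is swallowed and the caller
// has no idea the sync failed.
async function syncInvoicesPlaceholder(): Promise<void> {
  try {
    await pushInvoicesToBilling();
  } catch (e) {
    console.log(e);
  }
}

// What "request changes" typically asks for: surface the error with context
// and let the caller (or the queue) decide whether to retry.
async function syncInvoices(): Promise<void> {
  try {
    await pushInvoicesToBilling();
  } catch (err) {
    logger.error("invoice sync failed", { err });
    throw err;
  }
}
```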


Scaling Reviews Across a Team

Individual checklists break down at team scale. Here's what holds up:

Automate the mechanical layer. Static analysis, linting, type checks, and secret scanning should fail CI before a human even opens the diff. Don't waste reviewer attention on things machines catch reliably. We use gitleaks for secrets, semgrep for basic security patterns, and strict TypeScript with no any escape hatch in CI.
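As a rough illustration of that mechanical layer, here's a sketch of a CI step that fails on a few of the red-flag patterns above before a reviewer opens the diff. The patterns and the invocation (changed file paths passed as arguments) are assumptions; in practice semgrep rules cover the same ground with less maintenance:

```typescript
import { readFileSync } from "node:fs";

// Illustrative patterns matching some of the red flags from earlier.
const RED_FLAGS: Array<{ name: string; pattern: RegExp }> = [
  { name: "empty or log-only catch block", pattern: /catch\s*\([^)]*\)\s*\{\s*(console\.log\([^)]*\);?)?\s*\}/ },
  { name: "type-safety escape hatch", pattern: /as any|@ts-ignore/ },
  { name: "hardcoded localhost URL", pattern: /https?:\/\/localhost(:\d+)?/ },
];

let failed = false;
for (const file of process.argv.slice(2)) {
  const source = readFileSync(file, "utf8");
  for (const { name, pattern } of RED_FLAGS) {
    if (pattern.test(source)) {
      console.error(`${file}: ${name}`);
      failed = true;
    }
  }
}
process.exit(failed ? 1 : 0);
```

In CI you would feed it the changed files, for example the output of git diff --name-only against the base branch.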

Require AI disclosure in PR descriptions. A simple template field: "AI assistance used: [none / Copilot autocomplete / full AI generation]." This lets reviewers calibrate. A Copilot-autocompleted utility function gets lighter treatment than a 300-line Cursor-generated feature.

Assign specialist reviewers for high-risk areas. Auth PRs always get someone who owns auth. DB schema changes always get someone who writes migrations. Don't let AI-generated code bypass domain ownership just because the diff looks clean.

Track AI-related bug escapes. We add a tag in our incident tracker when a post-deploy bug originated in AI-generated code. After two months, we had clear patterns: 70% of issues were in the "missing context stitching" category, where AI didn't know about our specific infra constraints. That insight sharpened our checklist questions.


Making This Stick

The checklist is worthless if it's optional. Three things that make it stick:

First, embed it in your PR template. GitHub's .github/pull_request_template.md means every PR opens with the checklist pre-filled. Reviewers check boxes, not recall from memory.

Second, do async review-of-reviews monthly. Pick 5 merged AI PRs at random, do a 20-minute retro: what did the reviewer catch, what did they miss, what would have been caught by an extra checklist item? This keeps the list calibrated to your actual codebase.

Third, make the reviewer's job feel different from the author's job. Reviewing AI code isn't proofreading โ€” it's adversarial testing. Reviewers should be asking "how does this break" not "does this look right." That mental shift alone catches more bugs than any checklist item.


AI-assisted development isn't going away, and neither is the responsibility for what ships. The teams that will win are the ones that build review infrastructure fast enough to match their generation velocity.

I compiled everything into a practical guide: AI PR Review Playbook: QA Checklists That Scale
