How to Review AI-generated Pull Requests in 6 Steps with Claude Code
When I started seeing three AI-written PRs land in my inbox every hour, my old checklist fell apart. The diff looked clean, the CI was green, and the commit messages were nicely phrased. Yet every PR slipped a subtle bug past me - a mock-heavy test, a wrong API signature, or a side-effect hidden at import time. After a couple of production incidents I built a linear, single-pass checklist that catches the six error families that coding agents (Claude Code, Cursor, Codex, etc.) tend to introduce. Below is the exact workflow I now run on every AI-authored PR.
Prerequisites
- At least a year of code-review experience (comfortable with git diff, GitHub PR UI, and local test runs).
- Have used a coding agent that can generate a diff (Claude Code is the reference).
- The repo must have a test runner you can invoke locally or via CI.
Step 1 - Write the Expected Scope before opening the diff
Before you click Files changed, open the PR metadata and write a one- or two-sentence scope that defines exactly what the PR is allowed to touch.
# Grab the title and body for quick copy-paste
gh pr view 1234 --json title,body,headRefName
{
"title": "refactor: extract auth helper",
"body": "Extract `verifyJwt` from `auth/handler.ts` into `auth/jwt.ts`. No behavior change.",
"headRefName": "feat/extract-auth-helper"
}
From that I write the explicit expectation:
Add `verifyJwt` in `auth/jwt.ts`, update import in `auth/handler.ts`, no other files change.
Why it matters - if you read the diff first you're prone to rationalising "oh, the extra files look okay". With a concrete anchor you can quickly filter the diff later and spot scope drift.
Step 2 - Detect Fake Tests and Over-mocking
Agents love to make the CI green by writing tests that mock everything. A quick way to expose a fake test is to flip one assertion and see if the test still passes.
# Invert one assertion in a representative test by hand, then re-run only that test
pytest tests/test_jwt.py::test_verify_valid_token -x --tb=short
If the test still passes after you inverted the assertion (for example, assertEqual to assertNotEqual, or == to !=), the test isn't exercising production code at all. Revert the change once you have your answer.
Next, audit the mock-to-assert ratio:
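# Count lines that mention mocks (grep -c counts matching lines, not occurrences)
grep -c "mock\|patch\|MagicMock" tests/test_jwt.py
# Count lines that contain assertions (this also matches mock.assert_called_with)
grep -c "assert" tests/test_jwt.py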
A ratio greater than 2:1 (mocks per assertion) is a red flag - the test is basically asserting that the mock was called with the arguments the mock itself supplied.
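The two grep counts can be combined into one helper. The following is a minimal sketch, not part of the original workflow: the function name `mock_assert_counts` and the sample test source are hypothetical, and the extra `value_asserts` figure is my own addition to separate assertions on real values from `assert_called_*` checks on mocks.

```python
import re

# Same patterns as the grep commands (case-sensitive, line-based).
MOCK_RE = re.compile(r"mock|patch|MagicMock")
ASSERT_RE = re.compile(r"assert")
# Assertions made on a mock object, e.g. client.verify.assert_called_once_with(...)
MOCK_ASSERT_RE = re.compile(r"\.assert_(called|any_call|has_calls|not_called|awaited)")


def mock_assert_counts(source: str) -> dict:
    """Count mock lines and assertion lines in a test file's source text."""
    lines = source.splitlines()
    mock_lines = sum(1 for line in lines if MOCK_RE.search(line))
    assert_lines = sum(1 for line in lines if ASSERT_RE.search(line))
    mock_asserts = sum(1 for line in lines if MOCK_ASSERT_RE.search(line))
    return {
        "mock_lines": mock_lines,
        "assert_lines": assert_lines,
        # Assertions that are not just "the mock was called" checks.
        "value_asserts": assert_lines - mock_asserts,
        # Mocks per assertion; None when there are no assertions at all.
        "ratio": mock_lines / assert_lines if assert_lines else None,
    }


# Hypothetical over-mocked test used only to exercise the helper.
SAMPLE = '''
from unittest.mock import MagicMock, patch

@patch("auth.jwt.decode")
def test_verify(mock_decode):
    mock_decode.return_value = {"sub": "alice"}
    client = MagicMock()
    client.verify("token")
    client.verify.assert_called_once_with("token")
'''

counts = mock_assert_counts(SAMPLE)
print(counts)
```

A file where `value_asserts` is zero deserves a closer look even if the raw ratio seems acceptable, since every assertion in it only checks a mock.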
Fix path - If you find a fake test, comment with a concrete example of a real assertion (e.g., "assert that verifyJwt returns a decoded payload for a real token") and ask the author to add it.
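To make the difference concrete, here is a small sketch contrasting a fake test with a real one. Everything in it is a hypothetical stand-in: `verify_token` is a toy HMAC check, not the `verifyJwt` from the PR, and `SECRET` is a made-up key.

```python
import hashlib
import hmac
from typing import Optional
from unittest.mock import MagicMock

SECRET = b"test-secret"  # hypothetical key, for illustration only


def sign(payload: str) -> str:
    """Build a toy token of the form '<payload>.<hex signature>'."""
    mac = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}.{mac}"


def verify_token(token: str) -> Optional[str]:
    """Return the payload if the signature matches, else None."""
    payload, _, mac = token.rpartition(".")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return payload if hmac.compare_digest(mac, expected) else None


def fake_test() -> None:
    # Never calls verify_token: it only checks that the mock returns
    # what the mock was told to return.
    verifier = MagicMock(return_value="alice")
    assert verifier("anything") == "alice"
    verifier.assert_called_once_with("anything")


def real_test() -> None:
    # Exercises production code with a valid and a tampered token.
    assert verify_token(sign("alice")) == "alice"
    assert verify_token("alice.deadbeef") is None


fake_test()
real_test()
```

Both tests pass here, but only `real_test` would fail if `verify_token` were broken. `fake_test` would stay green no matter what the production function does, which is the property the review comment should call out.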
Step 3 - Verify API Call Signatures
For each new or changed function call, jump to the definition and compare the signature with the official library docs, not just the local stub.
// Example of a subtle bug the type-checker missed
const result = await db.query(sql, params, callback);
The local stub shows db.query(sql: string, params: any[], callback: fn): void. Because it returns void rather than a Promise, the await resolves to undefined. In production this means the caller always receives undefined and may silently skip error handling.
How to check - Find the version pinned in package.json/requirements.txt, open the library's README or the docs for that exact version, and ensure the signature matches. Run a quick integration test with edge inputs (e.g., null query, empty param array) to confirm runtime behaviour.
If the signature is wrong, a single-line comment with a link to the correct docs and a suggested fix is usually enough.
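For Python code, part of this check can be done against the installed library rather than a stub. The sketch below uses the standard `inspect` module. The `query` function is a hypothetical stand-in for a library call, and `call_matches_signature` is a helper name I made up.

```python
import inspect


def call_matches_signature(func, *args, **kwargs) -> bool:
    """Return True if func could be called with these arguments.

    Only binds the arguments to the signature; func is never executed.
    """
    try:
        inspect.signature(func).bind(*args, **kwargs)
    except TypeError:
        return False
    return True


# Hypothetical library function: takes no callback and is not a coroutine.
def query(sql: str, params: list) -> list:
    return []


# The call the agent wrote (with a trailing callback) does not bind.
agent_call_ok = call_matches_signature(query, "SELECT 1", [], lambda rows: None)
correct_call_ok = call_matches_signature(query, "SELECT 1", [])

# Awaiting the result only makes sense if the function is a coroutine function.
is_awaitable = inspect.iscoroutinefunction(query)

print(agent_call_ok, correct_call_ok, is_awaitable)
```

This only proves that the arguments fit the real signature. It says nothing about argument semantics, and `inspect.signature` can raise `ValueError` for some built-in or C-implemented callables, so the docs check remains the primary step.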
Step 4 - Confirm the Diff stays inside the declared scope
Now that you have a concrete scope string, filter the diff stat and list any files that fall outside it.
git diff --stat origin/main...HEAD | grep -v '^ auth/'
Sample output:
src/utils/logger.ts | 12 ++++++------
src/api/users/handler.ts | 34 ++++++++++++++--------
2 files changed
Both files are outside the expected auth/ folder. Look at the commit messages for those files - if they say "while we're here, tidy logger", that's scope drift.
Action - Request a split of the PR. Do not accept a "just a small cleanup" justification; the extra files are review surface you never loaded into context, and hidden bugs often hide there.
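The same filter can be expressed against the exact scope written in Step 1 instead of a single folder. This is a minimal sketch: `out_of_scope` is a hypothetical helper, and the file list is assumed to come from something like `git diff --name-only origin/main...HEAD`.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def out_of_scope(changed_files: Iterable[str], allowed: Iterable[str]) -> List[str]:
    """Return changed files that are neither an allowed path nor inside one."""
    allowed_paths = [PurePosixPath(p) for p in allowed]

    def in_scope(path: str) -> bool:
        candidate = PurePosixPath(path)
        return any(
            candidate == allowed_path or allowed_path in candidate.parents
            for allowed_path in allowed_paths
        )

    return [path for path in changed_files if not in_scope(path)]


# Paths taken from the examples in this article.
changed = [
    "auth/jwt.ts",
    "auth/handler.ts",
    "src/utils/logger.ts",
    "src/api/users/handler.ts",
]
drift = out_of_scope(changed, allowed=["auth/jwt.ts", "auth/handler.ts"])
print(drift)
```

Passing a directory such as `"auth"` in `allowed` accepts everything beneath it, which matches the grep above. Listing individual files is stricter and matches the scope sentence from Step 1.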
Step 5 - Hunt for Hallucinated Imports and Hidden Side-effects
Two quick commands catch the most common issues.
# Verify the import resolves to the expected module
python -c "from utils.security import sanitize_html; print(sanitize_html.__module__)"
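If the command raises ImportError, the agent invented the module or the symbol. If it prints an unexpected module (e.g., a local utils/__init__.py re-exports bleach.sanitizer.sanitize_html), treat it as a hallucinated import as well - the code imports cleanly but may not do what the PR claims at runtime.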
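When a PR adds several imports, the one-liner can be wrapped in a small helper. This is a sketch under assumptions: `resolve_origin` is a hypothetical name, and the demonstration uses the standard library's `json` package because `utils.security` exists only in the example above.

```python
import importlib
from typing import Optional


def resolve_origin(module_name: str, attr: str) -> Optional[str]:
    """Return the module where module_name.attr is defined, or None if it is missing.

    Note: this imports the module, so its top-level code runs.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None  # the module itself does not exist
    obj = getattr(module, attr, None)
    if obj is None:
        return None  # the module exists but the symbol does not
    return getattr(obj, "__module__", None)


# json re-exports JSONDecoder from json.decoder, so the origin differs
# from the module named in the import statement.
reexported = resolve_origin("json", "JSONDecoder")
missing_symbol = resolve_origin("json", "no_such_symbol")
missing_module = resolve_origin("no_such_module_for_review_demo", "anything")

print(reexported, missing_symbol, missing_module)
```

A `None` result points to an invented name. A result that differs from the module in the import statement is a re-export, which is worth opening to confirm it is the function the PR description has in mind.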
Next, look for top-level side-effects that will run on import.
# Grep for network, file, environment, or subprocess usage, then check which hits sit at module level
grep -rnE --include='*.py' 'requests\.get|open\(|os\.environ|subprocess' .
If you find a call like requests.get(URL) at the top of a module, run the test suite with networking disabled (the --disable-socket flag comes from the pytest-socket plugin):
pytest --disable-socket tests/
If the suite still passes, either no test imports that module or the import-time network call is being swallowed by a mock - a classic hidden side-effect that will break in production.
Remediation - Ask the author to move the call into a function or guard it with if __name__ == "__main__": and ensure a real test exercises the code path.
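Grep cannot tell a module-level call from one inside a function, so the hits still need manual triage. The sketch below narrows them down with the standard `ast` module. It is an approximation I am assuming is good enough for review: it needs Python 3.9+ for `ast.unparse`, it skips function bodies entirely (so it misses decorators and default-argument expressions, which do run at import), and `import_time_calls` is a hypothetical name.

```python
import ast
from typing import List, Tuple


def _is_main_guard(node: ast.AST) -> bool:
    """True for an `if __name__ == "__main__":` statement."""
    if not isinstance(node, ast.If):
        return False
    test = node.test
    return (
        isinstance(test, ast.Compare)
        and isinstance(test.left, ast.Name)
        and test.left.id == "__name__"
        and len(test.comparators) == 1
        and isinstance(test.comparators[0], ast.Constant)
        and test.comparators[0].value == "__main__"
    )


def import_time_calls(source: str) -> List[Tuple[int, str]]:
    """List (line, callee) for calls that execute when the module is imported."""
    found: List[Tuple[int, str]] = []

    def visit(node: ast.AST) -> None:
        # Function and lambda bodies do not run at import time.
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.Lambda)):
            return
        # Code under the main guard does not run on import either.
        if _is_main_guard(node):
            return
        if isinstance(node, ast.Call):
            found.append((node.lineno, ast.unparse(node.func)))
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(source))
    return sorted(found)


# Hypothetical module source; it is parsed, never executed.
SAMPLE = '''
import requests
DATA = requests.get("https://example.invalid/config")

def fetch():
    return requests.get("https://example.invalid/config")

if __name__ == "__main__":
    fetch()
'''

hits = import_time_calls(SAMPLE)
print(hits)
```

Only the assignment to `DATA` is reported. The call inside `fetch()` and the call under the main guard are ignored, which mirrors the remediation suggested above.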
Step 6 - Align Commit Messages with the Diff
Pick a random commit and compare its message with the actual changes.
# Show concise log for the PR range
git log --oneline origin/main..HEAD
a1b2c3d fix: null check in verifyJwt
e4f5a6b refactor: extract auth helper
c7d8e9f test: add jwt verification cases
0a1b2c3 chore: update package.json
# Show the diff for the selected commit
git show a1b2c3d
If the diff touches a retry policy, timeout adjustments, and a null check, but the message only mentions the null check, the message covers less than 20% of the change - a commit-message mismatch.
Rule of thumb - The message should account for the diff to within roughly 20%: someone who reads only the message should not be surprised by more than about a fifth of the change. If it doesn't, request a rewrite of the commit (or a squash-and-rebase) so the history stays trustworthy.
Decision: Ship or Reject?
After the six checks, apply the following matrix:
| Failed step | Can a one-sentence nudge fix it? | Action |
|---|---|---|
| 1 - Scope not declared | ✅ | Comment: add concrete scope line. |
| 2 - Fake test | ❌ | Reject - attach the failing inverted-assertion example. |
| 3 - Wrong API signature | ✅ (if isolated) | Nudge with docs link; reject if pervasive. |
| 4 - Scope drift | ❌ | Reject - ask for split PR. |
| 5 - Hallucinated import / hidden side-effect | ❌ | Reject - include verification command output. |
| 6 - Misleading commit message | ❌ | Reject - require rewrite. |
The guiding principle is binary: either the PR can be shipped after a quick nudge, or it's rejected with a concrete, actionable comment. "Comment and forget" leads to half-reviewed PRs that linger forever.
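The matrix is small enough to encode directly, which helps if you want a script or bot to draft the verdict. This is a sketch of one possible encoding, not the packaged plugin: `decide`, `NUDGE_FIXABLE`, and the `signature_issue_isolated` flag are names I introduced for illustration.

```python
from typing import Set

# Steps whose failure can be fixed with a one-sentence nudge, per the table.
# Step 3 qualifies only when the wrong signature is an isolated case.
NUDGE_FIXABLE = {1, 3}


def decide(failed_steps: Set[int], signature_issue_isolated: bool = True) -> str:
    """Map the set of failed step numbers (1-6) to 'ship', 'nudge', or 'reject'."""
    if not failed_steps:
        return "ship"
    rejecting = set(failed_steps) - NUDGE_FIXABLE
    if 3 in failed_steps and not signature_issue_isolated:
        rejecting.add(3)  # pervasive signature errors are a reject
    return "reject" if rejecting else "nudge"


print(decide(set()))         # no failures
print(decide({1, 3}))        # only nudge-fixable failures
print(decide({1, 4}))        # scope drift forces a reject
```

Any single rejecting step decides the outcome, so a PR that fails steps 1 and 4 is rejected even though step 1 alone would only need a nudge.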
Key Takeaways
- Write a concrete scope first - it prevents confirmation bias when you later look at the diff.
- Flip one assertion - a simple fail-fast check for fake tests.
- Audit mocks vs. asserts - a high ratio signals over-mocking.
- Cross-check signatures against official docs, not just the local stub.
- Filter diff by the declared scope; any out-of-scope file should trigger a split request.
- Verify imports resolve to the intended module and that no top-level I/O runs on import.
- Commit messages must map to the diff within ~20%; otherwise reject.
Running this checklist takes about 10-12 minutes for a typical 4-commit PR and catches at least four of the six error families that a human-only checklist would miss. It's a small habit shift that saves hours of post-merge fire-fighting.
If you find this workflow useful, we packaged it into a Claude Code plugin that automates the repetitive parts (scope extraction, mock-audit, import verification). Feel free to check it out if you want the automated version.
Originally published at https://shipwithai.io/blog/vi/reviewing-ai-generated-pull-requests-2026-part1