137Foundry

Posted on Jun 10

5 Quick Checks I Run on Every AI-Generated Pull Request Before Approving It

#ai #productivity #codereview #programming

After a year of reviewing AI-generated pull requests across client projects, a short checklist has emerged. Five checks. About fifteen minutes total per PR. They catch a meaningful fraction of the bugs that would otherwise leak into production.

This post is that checklist, the reasoning behind each step, and the tooling that makes each check fast.


What this checklist is and is not

It is the high-yield review pass that runs on every AI-generated PR. It is not the only review pass. Domain knowledge, architectural fit, naming conventions, security review, and performance characteristics all still apply. This checklist sits at the front of the review process and catches the AI-specific failure modes that traditional review sometimes misses because they look plausible.

The five checks below are ordered roughly by payoff. The first two are quick and catch the bugs that show up most often, so they run before the slower logic checks.

Check 1: All imports are real and at the right version

Open every import statement. For each imported module:

Confirm the package exists. For Python that means PyPI; for Node that means the npm registry; for Go that means the module path resolves.
Confirm the version installed in your project supports the functions the code calls. AI assistants often produce code that is correct for a different major version of the library than the one you have installed.

This catches the hallucinated-package case (the package does not exist at all) and the version-mismatch case (the package exists but the API the code uses does not exist in your installed version).

Total time: under two minutes for a normal-sized PR. The fastest win on the list.
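Part of this check can be scripted. The sketch below is one way to do it, not a standard tool: it pulls the top-level imports out of a source file and reports any that do not resolve in the current environment. The function names are made up for illustration. It only checks the local environment, not the registry, and it assumes the import name matches what is installed, which is not always true (a distribution name can differ from its module name).

```python
from __future__ import annotations

import ast
import importlib.util
from importlib import metadata


def top_level_imports(source: str) -> set[str]:
    """Collect the top-level module names that a piece of source code imports."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports, which live inside the project itself
            names.add(node.module.split(".")[0])
    return names


def unresolved_imports(source: str) -> list[str]:
    """Return imported modules that cannot be found in the current environment."""
    return sorted(
        name
        for name in top_level_imports(source)
        if importlib.util.find_spec(name) is None
    )


def installed_version(distribution: str) -> str | None:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "totally_made_up_pkg_137" is a deliberately fake name for the demo.
    sample = "import json\nimport totally_made_up_pkg_137\nfrom os import path\n"
    print(unresolved_imports(sample))
```

The version half of the check still needs a human: `installed_version` tells you what you have, but you have to compare that against the changelog of the library the code was written for.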

Check 2: Every named function and method actually exists

Read through the code and identify every function call or method call with a name you do not immediately recognize. Look it up in the library's documentation. Confirm the signature matches what the code is using.

The reason this is a separate check from imports is that a successful import does not mean the methods you call on the imported object exist. The assistant can produce client.refresh_token_safe() where the real client has only client.refresh_token(). The import works; the call fails.

The Mozilla Developer Network is the canonical reference for JavaScript and Web Platform APIs. The Python standard library docs cover the built-in surface. For third-party packages, the package's own documentation site or its GitHub README is the source of truth.

Total time: three to five minutes for a normal PR.
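When the library is installed locally, a quick introspection pass can back up the documentation lookup. This is a minimal sketch: `Client` here is a hypothetical stand-in for whatever real client the PR uses, and `check_call` is an illustrative helper, not part of any library. It checks that the method exists and that the arguments the code passes would bind to its signature.

```python
import inspect


class Client:
    """Hypothetical stand-in for a real library client."""

    def refresh_token(self, force: bool = False) -> str:
        return "token"


def check_call(obj, method_name, *args, **kwargs):
    """Return None if obj.method_name exists and accepts these arguments,
    otherwise a short string describing the problem."""
    method = getattr(obj, method_name, None)
    if method is None or not callable(method):
        return f"{type(obj).__name__} has no callable {method_name!r}"
    try:
        # bind() raises TypeError if the arguments do not fit the signature
        inspect.signature(method).bind(*args, **kwargs)
    except TypeError as exc:
        return f"{method_name} signature mismatch: {exc}"
    except ValueError:
        # Some builtins expose no signature; existence is all we can confirm.
        return None
    return None


if __name__ == "__main__":
    client = Client()
    print(check_call(client, "refresh_token"))             # real method
    print(check_call(client, "refresh_token_safe"))        # hallucinated method
    print(check_call(client, "refresh_token", timeout=5))  # hallucinated parameter
```

Introspection only confirms the shape of the call. Whether the method does what the generated code assumes is still a documentation question.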

Check 3: Edge cases the prompt did not mention are handled

This is the slowest check but the highest-value one once you get past the structural problems. Read the code and ask: what inputs would break this?

Standard list:

Empty inputs (empty list, empty string, null/None)
Single-element collections
Very large inputs
Inputs containing the delimiter the code uses internally
Unicode in inputs the code assumes are ASCII
Time zone edge cases
Concurrent inputs the code assumes are sequential

Run each one in your head against the code. If a case is not handled and your real data includes that case, write a test that reproduces it. The test makes the bug concrete and gives you a clean repro for the fix.

This is also where you catch the silently-wrong cases. Date arithmetic that breaks at daylight saving boundaries. Floating-point comparisons that fail on exact equality. Off-by-one errors in pagination. AI-generated code is almost always correct on the happy path and, in our experience, quietly broken on edge cases more often than human-written code.

Total time: five to ten minutes for a normal PR. Most of the review time.
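To make "write a test that reproduces it" concrete, here is a small sketch using pagination, one of the silently-wrong cases above. Both functions are hypothetical examples, not code from a real PR. The naive version looks right and passes the happy path, but it goes through floating-point division and loses a page on a very large input. A table of edge cases exposes that.

```python
import math


def naive_page_count(total_items: int, page_size: int) -> int:
    """Plausible-looking version: correct for small inputs, wrong for huge ones."""
    return math.ceil(total_items / page_size)


def page_count(total_items: int, page_size: int) -> int:
    """Integer-only ceiling division, so no precision is lost on large inputs."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return -(-total_items // page_size)


# (total_items, page_size, expected pages), each expected value worked out by hand
EDGE_CASES = [
    (0, 10, 0),                    # empty input
    (1, 10, 1),                    # single element
    (10, 10, 1),                   # exact multiple (off-by-one risk)
    (11, 10, 2),                   # one past a page boundary
    (10**18 + 1, 10, 10**17 + 1),  # very large input
]


def failing_cases(fn):
    """Return the edge cases for which fn gives the wrong answer."""
    return [case for case in EDGE_CASES if fn(case[0], case[1]) != case[2]]


if __name__ == "__main__":
    print("naive failures:", failing_cases(naive_page_count))
    print("fixed failures:", failing_cases(page_count))
```

The table format is the useful part. Adding a new edge case is one line, and the failing row is the clean repro for the fix.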

Check 4: Tests actually test the behavior, not the assumption

If the PR includes tests (and it should), read them carefully. For each test, ask: if the implementation were silently wrong, would this test catch it?

The failure mode to watch for: tests that exercise the code but do not assert anything meaningful about the result. A test that calls a function and checks that no exception was thrown is not a test of correctness; it is a test of "this does not crash." Useful, but limited.

Another failure mode: tests that assert the implementation's current (wrong) behavior is correct. The assistant produced both the implementation and the tests with the same flawed assumption, and the tests "pass" because they agree with the buggy implementation. The test does not catch the bug because the test is also wrong in the same way.

The fix: at least one test per behavior should be written with a known correct answer computed independently of the implementation. If the test data and the expected output are both derived from the same code path the assistant generated, the test is tautological.

Total time: two to three minutes for normal test files.

Check 5: The change matches the PR description

The last check is the cheapest and the most often skipped. Read the PR description. Read the code. Confirm they describe the same change.

AI-generated PRs sometimes drift from the original intent. The assistant produced a 200-line change for a request that should have been a 50-line change. Or the change is the right size but touches different files than the description suggested.

For each file in the diff, ask: is this file mentioned in the description, and does the change to it match what the description says about it? If a file is changed and not mentioned, ask why. The answer is sometimes "it was a related cleanup," which is fine. Sometimes the answer is "the assistant decided to refactor something we did not ask it to," which is a flag.

Total time: under a minute for most PRs.
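The "is this file mentioned" question can be roughed out in a few lines. This is a sketch under simple assumptions: the list of changed files comes from something like `git diff --name-only`, and a file counts as mentioned if its base name appears anywhere in the description. The function name and sample data are invented for illustration. It will miss files described only in prose ("the invoice module"), so treat the output as a list of questions to ask, not a verdict.

```python
from pathlib import PurePosixPath


def unmentioned_files(changed_files: list[str], description: str) -> list[str]:
    """Return changed files whose base name never appears in the PR description."""
    text = description.lower()
    return [
        path
        for path in changed_files
        if PurePosixPath(path).name.lower() not in text
    ]


if __name__ == "__main__":
    # Hypothetical PR: the description mentions two files, the diff touches three.
    changed = ["src/auth/token.py", "src/billing/invoice.py", "README.md"]
    description = "Fix token refresh in token.py and update README.md"
    for path in unmentioned_files(changed, description):
        print("changed but not mentioned:", path)
```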

The total

Five checks. About fifteen minutes for an average-sized AI-generated PR. The first two catch most structural problems before you even need to think about the logic. The middle one catches most logic bugs. The last two catch the failures of judgment that show up when the assistant generates more than asked or generates tests that confirm its own assumptions.

For the full debugging workflow that runs after a PR has landed and a bug surfaced in production, see the longer 137Foundry guide on debugging AI-generated code. It covers the failure modes (hallucinated APIs, edge cases, version mismatches, silently wrong logic) and the order to check them in.

Tooling that makes these checks faster

A few specific tools that cut review time:

A package-existence check as a CI step. Run pip install -r requirements.txt (or the equivalent) on every PR. Catches hallucinated packages before review even starts.
A type checker. Mypy for Python, TypeScript for JavaScript, the compiler itself in Go and Rust. Type errors catch many wrong-signature problems automatically.
A linter configured with the team's style. Some hallucinated names trigger linter warnings (undefined names, unused imports), and a clean linter pass cuts the manual review burden.
A test runner that fails the build on tests with no assertions. Catches the "test calls the function but does not check the result" failure mode.

Each tool buys back a few minutes per PR. None replace the human review, but they make the human review focus on the parts that actually need judgment.

A note on team workflow

Adopting an AI coding assistant changes the shape of the review burden. Before, most review time was spent understanding the change. After, much of the review time is spent verifying that the change is what the description claims and that the structural pieces (imports, function calls, parameters) are all real.

That is a different muscle and it takes a few weeks to develop. The team at 137Foundry has been iterating on this checklist for client work for about a year now, and the shape has stabilized around the five checks above. The exact tooling varies by project; the discipline of running each check carefully does not.

For background on team practices for AI coding assistant adoption, see the GitHub guidance on AI in code review and the OpenAI guidance on developer productivity, which are the platforms' own published material. None of it is a substitute for a careful checklist; it is supporting material.

Closing

Five checks, fifteen minutes, most of the AI-generated bug surface caught before the code lands. Worth running. Worth iterating on as your team learns which failure modes show up most often in your specific stack.

The checklist works because it focuses on the AI-specific failure modes (hallucination, version drift, tautological tests) rather than treating AI-generated code like any other code. The shift is small. The result is significant.
