137Foundry

Posted on Jun 9

What I Learned From One Year of Reviewing AI-Generated Pull Requests

#ai #webdev #programming #productivity

A year ago, my team started using AI coding assistants heavily. Twelve months and roughly 800 pull requests later, I have a clearer picture of what AI-generated code looks like in review than I had any reason to expect at the start.

Here is the unvarnished view from inside the review queue. What works, what does not, what surprised me, what I wish I had known on day one.


The Pattern I Did Not Expect

The biggest surprise was that AI-generated bugs are not random. They cluster around three specific categories of code, and the categories are predictable enough that I now know which files to review with extra care before I even open them.

The three categories:

Code that calls external APIs. Roughly 60 percent of the AI-related bugs we shipped traced back to hallucinated method signatures, wrong parameter names, or imagined option keys in library calls.
Refactors of code older than six months. Roughly 25 percent traced back to AI removing what looked like dead code or unnecessary logic but was actually load-bearing for an edge case nobody had thought about in a while.
Tests that exercise the wrong path. Roughly 15 percent traced back to AI-generated tests that passed for the wrong reason and gave false confidence.

If you have not separated these categories in your own review flow, do it. It is the single most useful filter for where to spend review time.
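One way to apply that filter before opening a diff is to tag each changed file with the categories it falls into. The following is a minimal sketch, not our actual tooling: the `ChangedFile` fields, the `EXTERNAL_PACKAGES` list, and the 182-day cutoff for "older than six months" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

# Hypothetical list: packages this codebase uses to reach external systems.
EXTERNAL_PACKAGES = {"requests", "boto3", "stripe"}

# Assumed cutoff for "older than six months".
OLD_CODE_AGE = timedelta(days=182)


@dataclass
class ChangedFile:
    path: str
    imports: set[str] = field(default_factory=set)
    created: date = date.min  # unknown age is treated as old, to be safe
    lines_removed: int = 0


def review_flags(changed: ChangedFile, today: date) -> list[str]:
    """Return the high-risk review categories a changed file falls into."""
    flags = []
    if changed.imports & EXTERNAL_PACKAGES:
        flags.append("external-api")
    # Deleting lines from old code is where load-bearing logic tends to go missing.
    if changed.lines_removed > 0 and today - changed.created > OLD_CODE_AGE:
        flags.append("old-code-refactor")
    name = changed.path.rsplit("/", 1)[-1]
    if name.startswith("test_") or name.endswith("_test.py"):
        flags.append("test")
    return flags
```

A file with no flags gets a normal review. A file with any flag gets the extra care described in the sections that follow.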

The Test Problem Is the Most Insidious

Of the three categories, AI-generated tests are the failure mode that I underestimated most. The pattern:

Engineer asks AI to write a test for a function.
AI generates a test with a fixture and an assertion on a specific output value.
The test passes. CI is green. PR ships.
Six months later, a refactor changes the implementation in a way that produces a different but still valid output.
The test fails. Engineer "fixes" the test by updating the expected value.
Three months later, a real regression slips through because the test was never actually verifying the behavior its name claimed.

The fix is to require a comment on every AI-generated test that explains what behavior is being verified, not just what the assertion is. If the comment cannot be written without restating the assertion, the test is probably wrong.
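To make the comment rule concrete, here is an illustrative pair of tests in pytest style. The `slugify` function is a hypothetical stand-in written for this example, not code from our repository. The first test's comment can only restate its assertion. The second names a behavior that should survive a valid reimplementation.

```python
import re


def slugify(title: str) -> str:
    """Hypothetical function under test: lowercase, hyphen-separated slug."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def test_slugify_snapshot():
    # Checks that slugify("Hello, World!") returns "hello-world".
    # This comment only restates the assertion, which is the warning sign.
    assert slugify("Hello, World!") == "hello-world"


def test_slugify_output_is_url_safe():
    # Behavior verified: whatever punctuation or spacing the title has,
    # the slug contains only lowercase letters and digits joined by single
    # hyphens, so it can go into a URL path without escaping.
    for title in ["Hello, World!", "  a  b  ", "C++ & Rust"]:
        slug = slugify(title)
        assert re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", slug), slug
```

The snapshot test is not useless, but on its own it invites the "update the expected value" fix described above. The behavior comment gives the next engineer something to check the new value against.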

This was the single biggest change to our code review template, and it has caught more issues than any other rule. The deeper structural take on this and other review-related guidelines is in 137Foundry's AI coding guidelines guide.

The External API Problem Has a Cheaper Fix

For the external API category, the fix is mechanical: require the PR description to include a line like "Verified against [library] [version] docs at [URL]." If that line is missing for any code that calls an external system, the PR bounces.

This sounds tedious. In practice it adds about 90 seconds per PR. The cost is trivial compared to the cost of debugging a hallucinated method call after it ships.
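The bounce itself can be automated. Below is a rough sketch of a check that looks for the verification line in a PR description. The exact wording it matches and the `has_docs_verification` name are assumptions. How the description text reaches the function (CI variable, webhook payload, and so on) depends on your setup and is left out.

```python
import re

# Matches a line such as:
#   Verified against requests 2.32 docs at https://example.com/docs
# The phrasing is an assumed convention; adjust it to your own template.
VERIFIED_LINE = re.compile(
    r"^Verified against (?P<library>.+?) (?P<version>\S+) docs at (?P<url>https?://\S+)",
    re.MULTILINE,
)


def has_docs_verification(pr_description: str) -> bool:
    """Return True if the description contains a docs verification line."""
    return VERIFIED_LINE.search(pr_description) is not None
```

A check like this only proves that the line is present, not that anyone actually read the docs. It still makes the omission visible before review starts.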

Resources we found useful for staying anchored to real library behavior:

The Python docs for anything in the standard library
The Node.js API reference for server-side JavaScript
The MDN reference for browser APIs
The official docs of any third-party library, opened in another tab while reviewing

The verification step is fast when the docs are open already. It is slow when you have to find them after a hallucinated method has confused the diff.

The Refactor Problem Is the Hardest to Fix

Refactors are the hardest failure mode because the broken code is often in production for weeks before anyone notices. The pattern:

AI refactors a function. The new version is cleaner.
All existing tests pass. The PR ships.
Three weeks later, a customer hits an edge case the original version handled and the new version does not.
Engineering scrambles. The fix turns out to be re-adding the load-bearing code that the AI removed.

The most reliable fix we found: require the original author or a current owner of the file to approve any AI-generated refactor. They have context that no other reviewer has. If the original author is no longer on the team, the file's most recent significant contributor takes their place.

This is harder than it sounds because it requires knowing who owns what across the codebase. We use a CODEOWNERS file that is updated quarterly. Imperfect but workable.
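The fallback rule can be written down as a small function. This is a sketch under assumptions: `history` is a list of `(author, lines_changed)` pairs ordered most recent first, which you would have to build yourself from version control, and the `min_lines` threshold for a "significant" contribution is an invented number.

```python
from typing import Optional


def pick_refactor_reviewer(
    original_author: str,
    current_team: set[str],
    history: list[tuple[str, int]],
    min_lines: int = 20,  # assumed threshold for a "significant" contribution
) -> Optional[str]:
    """Choose who must approve an AI-generated refactor of a file.

    history holds (author, lines_changed) pairs, most recent change first.
    """
    if original_author in current_team:
        return original_author
    # Fall back to the most recent significant contributor still on the team.
    for author, lines_changed in history:
        if author in current_team and lines_changed >= min_lines:
            return author
    return None  # nobody with context is left; escalate to a human decision
```

Returning `None` instead of guessing is deliberate: a file with no remaining owner is exactly the kind of file where an AI refactor is riskiest.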

What Changed in Our Process

Six months in, our process changes were:

PR template added an "AI-generated code" checkbox. Author has to flag which files contain AI-generated code.

Review template added a "verified against docs" line. For any external-system code.

Test comments became mandatory for AI-generated tests. Explaining what behavior is being verified.

CODEOWNERS-based reviewer assignment for refactors. Original authors or current owners get pulled in automatically.

Quarterly retros include an AI failure-mode review. What bugs traced back to AI, what patterns we saw, what guidelines need to update.

After implementing these, the AI-related bug rate dropped by roughly 70 percent over the next quarter. Not zero, but manageable.
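The first four changes in the list above lend themselves to a single pre-merge gate. The following is a hypothetical sketch: the `PullRequest` fields are invented for illustration, and how each one gets populated (template parsing, bot checks, manual review) is up to your tooling.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    # All fields are illustrative; populate them however your tooling allows.
    ai_files_flagged: bool  # author filled in the "AI-generated code" checkbox
    touches_external_api: bool
    has_docs_verification: bool
    ai_tests_have_behavior_comments: bool
    is_refactor: bool
    approved_by_owner: bool


def unmet_rules(pr: PullRequest) -> list[str]:
    """List the process rules a PR still has to satisfy before merging."""
    problems = []
    if not pr.ai_files_flagged:
        problems.append("flag which files contain AI-generated code")
    if pr.touches_external_api and not pr.has_docs_verification:
        problems.append("add the 'verified against docs' line")
    if not pr.ai_tests_have_behavior_comments:
        problems.append("explain the behavior each AI-generated test verifies")
    if pr.is_refactor and not pr.approved_by_owner:
        problems.append("get approval from the original author or a current owner")
    return problems
```

An empty list means the PR can proceed to normal review. The quarterly retro is the one change that stays a human process.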

What I Stopped Worrying About

A few things I worried about that turned out not to matter:

AI-generated code style drift. I expected to see the codebase slowly drift toward an AI-default style. It happened a little, but not nearly as much as I feared. Linters and formatters caught most of it. The drift was real but cosmetic.

Junior engineers becoming dependent on AI. I expected to see juniors lose the ability to write code without AI assistance. The opposite happened: juniors who paired with AI heavily early on were actually better at reading code by month four, because they had seen so many variations of the same problem.

Senior engineers refusing to use AI. I expected pushback. Most seniors started using AI within three months, mostly because their workload required it. The holdouts were few and they self-selected into roles where AI assistance was less relevant.

What Surprised Me

A few things that were unexpected:

AI is very good at code I do not enjoy writing. Boilerplate, schema migrations, type definitions, test scaffolding. The volume gain on this category alone was worth the entire investment.

AI is bad at code I enjoy writing. Architecture decisions, abstractions, anything that requires holding the whole system in your head. I still do this work by hand and probably always will.

The review process became more collaborative. Reviewers and authors talk more about the code in PRs than they used to, partly because the AI-related questions ("did you verify this?") tend to surface broader design conversations.

Junior engineers ramp faster. A junior with AI assistance plus a clear review process ramps faster than a junior without. The AI handles the routine; the review process catches the mistakes. The junior learns from both.

What I Would Do Differently

If I were starting from scratch today, I would:

Set up the PR template and review macros before introducing the AI tool, not after.
Run a one-week intensive on AI failure modes with the senior engineers, focused on the three categories above.
Update CODEOWNERS before allowing AI-generated refactors to land.
Schedule the quarterly retro from day one. Without the retro, the team does not consolidate lessons.

For broader context on the engineering culture work that has to happen alongside the technical setup, the 137Foundry blog covers some of the patterns we see across client teams. The services page goes into specific engagements around AI workflow design for engineering organizations.

The one-year retrospective on this is shorter than I expected. AI assistance is a productivity tool. The team that builds the right review process and culture around it gets the productivity gains without the cost. The team that does not pays for it in subtle bugs and a slow decline in codebase quality.

Worth the year of figuring it out. Would do it again, but faster.
