What 12 Months of AI-Generated Pull Requests Taught My Engineering Team

#ai #productivity #learning #softwareengineering

When our platform team adopted AI coding assistants across every repository in early 2025, I expected productivity gains. What I did not expect was that the most valuable lessons would come from the failures, not the successes. After reviewing roughly 4,200 merged pull requests where AI played a meaningful role in authoring code, the picture that emerged contradicts most of the marketing material I had read. The economic momentum behind this technology is undeniable, with a $644 billion bet on AI infrastructure reshaping how capital flows through Silicon Valley and beyond. But the day-to-day reality of shipping production software with these tools is messier, more nuanced, and ultimately more interesting than any keynote presentation suggests. This is what we learned, what we changed, and what we wish someone had told us before we started.

The Productivity Numbers Are Real, But Misleading

Our internal telemetry showed individual developers shipping between 26 and 55 percent more code by line count once AI assistance became standard practice. That sounds like a clear win, and in narrow contexts it is. Boilerplate generation, test scaffolding, API client wrappers, and routine refactoring all collapsed from hours to minutes. A junior engineer on our team needed three days to rewrite a legacy ETL pipeline, a job that would have taken six weeks under our previous workflow.

But code volume is the wrong metric, and we knew it almost immediately. By month four, our incident rate had climbed 31 percent compared to the previous year. Reverts were up. Mean time to resolution stretched longer. When we dug into the postmortems, a pattern emerged: the regressions were rarely catastrophic bugs in the AI-generated code itself. They were subtle integration failures, edge cases the model could not have known about, and accumulated complexity from accepting suggestions that worked locally but violated unwritten conventions elsewhere in the system.

A widely circulated study by METR found that experienced developers working on familiar codebases were actually 19 percent slower when using AI assistants, even though they believed they were 20 percent faster. The findings of that randomized controlled trial match what we observed in our own data once we separated greenfield work from maintenance on mature services. The productivity story depends entirely on context, and the contexts where AI shines are not the contexts where most senior engineers spend their time.

The Review Bottleneck Nobody Warned Us About

The single largest operational change AI adoption forced on us was a complete restructuring of code review. When a developer can produce 800 lines of plausible-looking code in twenty minutes, the bottleneck shifts immediately and permanently to whoever has to review it. Our senior engineers started burning out within three months. Review queues grew. PRs sat for days. People started rubber-stamping changes because the volume made careful review impossible.

We eventually solved this by inverting our workflow. Authors are now required to walk reviewers through any AI-assisted change in a recorded video under five minutes, explaining what the code does, why this approach was chosen, and what they verified manually. The video requirement sounds bureaucratic, but it accomplished two things. It forced authors to actually understand code they had generated, which closed a dangerous knowledge gap. And it gave reviewers a starting point that respected their time. Review velocity recovered within six weeks, and the quality of merged code improved measurably.

What Actually Works

After a year of experimentation, a few practices separated the teams that benefited from AI tools from the teams that drowned in their output. None of these are revolutionary, but the discipline required to apply them consistently turned out to be the differentiator.

Tight specification before generation matters more than prompt engineering tricks. Engineers who wrote detailed acceptance criteria, type signatures, and example inputs before invoking the assistant got dramatically better results than engineers who described their intent in natural language. The model is a literal-minded collaborator. Give it ambiguous instructions and it will produce ambiguous code, often confidently.
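As a rough illustration of what "specification before generation" can look like, here is a minimal Python sketch. The GenerationSpec class, the to_prompt method, and the slugify example are all hypothetical names invented for this illustration; they are not our internal tooling. The point is only that the signature, the acceptance criteria, and the example inputs are written down before the assistant is asked for anything.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationSpec:
    """Hypothetical container for what an author pins down before prompting."""

    signature: str                          # exact type signature the output must match
    acceptance_criteria: tuple[str, ...]    # behaviours a reviewer can check
    examples: tuple[tuple[str, str], ...]   # (call, expected result) pairs

    def to_prompt(self) -> str:
        """Render the spec as plain text to place ahead of the request."""
        lines = [
            "Implement exactly this signature:",
            f"    {self.signature}",
            "",
            "Acceptance criteria:",
        ]
        lines += [f"  {i}. {c}" for i, c in enumerate(self.acceptance_criteria, start=1)]
        lines += ["", "Examples:"]
        lines += [f"  {call} -> {expected}" for call, expected in self.examples]
        return "\n".join(lines)


# Illustrative spec for an assumed slugify helper.
spec = GenerationSpec(
    signature="def slugify(title: str, max_length: int = 60) -> str",
    acceptance_criteria=(
        "Output contains only lowercase ASCII letters, digits, and hyphens.",
        "Runs of other characters collapse to a single hyphen.",
        "No leading or trailing hyphen; result never exceeds max_length.",
    ),
    examples=(
        ('slugify("Hello, World!")', '"hello-world"'),
        ('slugify("  AI   PRs  ")', '"ai-prs"'),
    ),
)

print(spec.to_prompt())
```

Compared with "write me a function that makes URL slugs", the rendered text leaves far less room for the model to guess, and it gives the reviewer the same checklist the author started from.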

Test-first workflows became non-negotiable for any nontrivial change. Writing the test before generating the implementation does the job we once left to type systems alone: it constrains the search space and provides an objective signal for whether the output is correct. Teams that skipped this step ended up debugging plausible-looking code that failed silently in production.
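A small Python sketch of the test-first order of operations follows. The parse_duration function and its duration format are assumptions made up for the example. The checks are written first and act as the acceptance gate; the implementation below them stands in for whatever the assistant produces and is only kept because the checks pass.

```python
import re
from typing import Callable


def check_parse_duration(parse_duration: Callable[[str], int]) -> None:
    """Written first: the objective signal any generated implementation must pass."""
    assert parse_duration("45s") == 45
    assert parse_duration("2m") == 120
    assert parse_duration("1h30m") == 5400
    assert parse_duration("1h0m5s") == 3605
    # Malformed input must be rejected loudly, not silently turned into 0.
    for bad in ("", "90", "5x", "m5", "1h 30m"):
        try:
            parse_duration(bad)
        except ValueError:
            continue
        raise AssertionError(f"expected ValueError for {bad!r}")


_TOKEN = re.compile(r"(\d+)([hms])")
_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def parse_duration(text: str) -> int:
    """Candidate implementation, accepted only because the checks above pass."""
    # The whole string must be one or more number+unit tokens, nothing else.
    if not re.fullmatch(r"(?:\d+[hms])+", text):
        raise ValueError(f"invalid duration: {text!r}")
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in _TOKEN.findall(text))


check_parse_duration(parse_duration)
```

The rejected-input loop is the part that tends to get skipped when tests are written after the fact, and it is exactly the kind of silent failure described above.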

Pairing AI assistance with mandatory human checkpoints at architecture boundaries prevented the slow drift toward incoherent system design. We require a human to write any code that crosses a service boundary, defines a new public API, or modifies authentication or authorization logic. The model is allowed to suggest, but not to author, in these zones. This rule alone caught several problems that would otherwise have shipped.
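One way a rule like this could be checked mechanically is sketched below in Python. The path patterns, the function names, and the idea of an ai_assisted flag on a change are all assumptions for illustration; a real repository would define its own zones and its own way of declaring AI involvement.

```python
from fnmatch import fnmatchcase

# Hypothetical path patterns for human-authored zones.
# Note that "*" in fnmatch patterns also matches "/" characters.
HUMAN_ONLY_PATTERNS = (
    "services/*/api/*",      # public API definitions
    "services/*/clients/*",  # code that crosses a service boundary
    "*/auth/*",              # authentication and authorization logic
)


def human_only_paths(changed_paths: list[str]) -> list[str]:
    """Return the changed paths that fall inside a human-authored zone."""
    return [
        path
        for path in changed_paths
        if any(fnmatchcase(path, pattern) for pattern in HUMAN_ONLY_PATTERNS)
    ]


def boundary_violations(changed_paths: list[str], ai_assisted: bool) -> list[str]:
    """Paths that should block the change: protected files in an AI-assisted change."""
    return human_only_paths(changed_paths) if ai_assisted else []


changed = [
    "services/billing/api/v1/routes.py",
    "services/billing/internal/report.py",
    "libs/auth/tokens.py",
    "README.md",
]

for path in boundary_violations(changed, ai_assisted=True):
    print(f"human authorship required: {path}")
```

A check like this only flags files for attention. Whether a human actually wrote and understood the code is still something a reviewer has to confirm.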

Investment in observability paid for itself many times over. When you cannot fully trust the provenance of every line of code in your repository, you need to be able to detect problems quickly in production. We doubled our spending on tracing, structured logging, and alerting in the second half of the year. The cost was significant. The cost of not doing it would have been catastrophic.

The Skills That Suddenly Got More Valuable

Watching our team adapt over twelve months revealed which engineering skills compound in an AI-assisted environment and which depreciate. The picture inverted some long-standing assumptions about what makes a strong developer.

Reading code carefully and quickly became the single most valuable skill on the team. Engineers who could scan a 300-line diff and identify the two suspicious blocks were ten times more productive than engineers who relied on tests alone to catch problems. Debugging skills also gained value, because AI-generated bugs often manifest in unfamiliar shapes that resist standard troubleshooting heuristics.

System design and architectural judgment became more important, not less. The model can produce any individual component you ask for, but it cannot tell you which components your system actually needs, how they should interact, or which trade-offs are worth making. The engineers who thrived were the ones who could hold an entire system in their head and direct the AI toward implementations that fit a coherent whole.

Conversely, the ability to remember API syntax, recite framework idioms, or rapidly type boilerplate became almost worthless overnight. Engineers who built their identity around these skills had the hardest adjustment. Engineers who had always treated these as incidental to the real work of building software barely noticed the change.

What I Would Tell My Past Self

If I could send one message back to the version of myself who was rolling out these tools in January 2025, it would be this: the technology is not the hard part. The hard part is rebuilding your team's review processes, quality bars, and skill development pathways around a fundamentally different production function. The teams that win this transition are not the ones with the best models or the cleverest prompts. They are the ones who treat AI assistance as a serious organizational change and invest accordingly. Everyone else is shipping more code that nobody fully understands, accumulating a kind of debt that traditional refactoring cannot pay down, and discovering the cost only when something important breaks at the worst possible moment.
