AI-generated code survives 16% longer in production than human code before anyone touches it.
That's the headline finding from a new study out of Concordia University's DAS Lab, led by Emad Shihab and accepted at EASE 2026.
The researchers tracked over 200,000 individual code units across 201 open-source projects using survival analysis, a method borrowed from medical research, to answer a simple question: how long does AI-generated code last in production?
The answer: agent-authored code has a 15.4% lower modification rate than human code. At any given moment, it faces roughly 16% less risk of being changed.
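To make the methodology concrete, here is a minimal, self-contained sketch of the kind of survival estimate the study relies on: a Kaplan-Meier curve over code units, where a unit "dies" when it is modified and is right-censored if it was still untouched when the observation window closed. This is an illustration of the technique, not the study's actual pipeline, and the data shapes are invented.

```python
from typing import List, Tuple

def kaplan_meier(durations: List[float], observed: List[bool]) -> List[Tuple[float, float]]:
    """Kaplan-Meier survival estimate for code units.

    durations: days each code unit survived before modification (or censoring)
    observed:  True if the unit was actually modified; False if it was still
               untouched when the study window closed (right-censored)
    Returns (time, survival probability) pairs at each modification time.
    Assumes distinct event times, for simplicity.
    """
    events = sorted(zip(durations, observed))
    n_at_risk = len(events)
    survival = 1.0
    curve = []
    for t, modified in events:
        if modified:
            # At each modification, survival drops by the fraction of
            # still-at-risk units that were modified at this instant.
            survival *= 1 - 1 / n_at_risk
            curve.append((t, survival))
        # Censored units leave the at-risk pool without an event.
        n_at_risk -= 1
    return curve

# Three code units: two modified at days 1 and 2, one still alive at day 3.
curve = kaplan_meier([1.0, 2.0, 3.0], [True, True, False])
```

Comparing two such curves (AI-authored vs. human-authored units) and fitting a hazard model over them is what yields statements like "16% less risk of being changed at any given time."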
Sounds like a win for AI coding agents, right?
Not necessarily.
The researchers wanted to understand why AI code was being left untouched for longer. Is it because it's higher quality, or something else?
The data suggests something else.
When they examined what happens when AI code is finally modified, they found that 26.3% of modifications to AI code are bug fixes, compared to 23% for human code. When someone finally touches AI-generated code, it's more likely to be because something was broken.
So AI code appears to contain more latent bugs, yet those bugs sit unaddressed for longer. Why?
The researchers point to a well-documented phenomenon in software engineering as one possible reason: the "Don't touch my code!" effect. Developers avoid modifying code they didn't write. AI-generated code has no human author, so nobody feels responsible for maintaining it.
Nuance in the per-tool data
The study tracked five AI coding tools, and the results varied dramatically between them.
Cursor, one of the most sophisticated tools tested, had the lowest corrective modification rate of any tool at just 13.8%. When someone touches Cursor-assisted code, it's rarely to fix a bug.
Yet Claude Code, also a powerful offering, had a corrective rate of 44.4%, nearly double the human baseline.
One possible explanation: Cursor tends to keep the code visible in the interface as you work, while Claude Code's interface abstracts the code further away from the developer's view.
It's a sensible theory: how much a developer sees, understands, and engages with the code during generation may matter as much as the quality of the tool itself.
But a stronger clue as to why AI-generated code survives longer comes from a separate study entirely.
The review burden
Amazon recently summoned a large group of engineers for a "deep dive" into a spate of outages, including incidents tied to AI coding tools. A briefing note cited "novel GenAI usage for which best practices and safeguards are not yet fully established" as a contributing factor.
Researchers at NAIST (Nara Institute of Science and Technology), in a paper accepted at MSR 2026, analyzed 1,664 merged agentic pull requests across 197 open-source projects. They found that 75% of agentic PRs pass through review with zero revisions. Three out of four AI-generated PRs sail through without a single change requested.
There is a growing chorus of developers complaining about the ballooning burden of code review. As AI coding agents improve, engineers who use them ship more code. As engineers ship more code, the volume of code that has to pass through review skyrockets.
This tidal wave of review work could be what's driving developers to rubber-stamp bugs into production at AWS, in the studies above, and beyond.
Bugs that would have been caught in a more thorough review slip through. And if nobody engaged deeply with the code during review (nor at time of generation), nobody understands it well enough to feel equipped, or responsible, to maintain it later.
Amazon's response is to require junior and mid-level engineers to get senior sign-off on all AI-assisted changes. But adding more human sign-off to a process that's already struggling to keep up cannot fix the core tension.
So how do we prevent orphaned, buggy code from filling up codebases, without drowning humans in an impossible mountain of manual review?
The end of human code review
The best articulation I've seen of what the answer should look like comes from Kayvon Beykpour, previously the CEO of Periscope / head of product at Twitter, and presently the cofounder of the AI code review tool Macroscope.
In a widely shared post, he predicted that "soon, human engineers will review close to zero pull requests," and that instead, "code review will become always-on and increasingly automatic as code is being written. A new orchestration layer will emerge where agents will decide when PRs are ready to merge and only (infrequently) escalate to humans."
Beykpour argues code review needs to be pulled closer to where code is being written, not delayed until a PR is opened. Specialized review agents should continuously analyze code as it's generated, verify correctness, and coordinate with coding agents to address issues in real time.
If the AI-generated code in the Concordia study had been continuously reviewed by a dedicated agent as it was written, the bugs would have already been caught, and it wouldn't have mattered that an engineer waved the code into production.
Then, at the PR stage, Beykpour says AI agents should orchestrate "merge readiness": assessing whether the code was sufficiently tested, evaluating blast radius, checking trust profiles, and deciding whether human escalation is actually required.
When low-risk PRs are taken off an engineer's plate, they have more time and bandwidth for the reviews that actually matter. And when those reviews reach them, they know they're important.
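The orchestration Beykpour describes can be sketched as a simple decision gate. Everything below is hypothetical: the `PullRequest` fields, the thresholds, and the three outcomes are invented for illustration; a real orchestration layer would derive its signals and cutoffs from a project's own history rather than hard-code them.

```python
from dataclasses import dataclass

@dataclass
class PullRequest:
    test_coverage: float  # fraction of changed lines exercised by tests
    blast_radius: int     # number of downstream modules importing the changed code
    author_trust: float   # historical acceptance rate for this author/agent, 0..1

def merge_decision(pr: PullRequest) -> str:
    """Toy merge-readiness gate: auto-merge low-risk PRs, escalate the rest.

    Thresholds are invented for illustration only.
    """
    # Well-tested, low-impact changes from trusted authors merge unattended.
    if pr.test_coverage >= 0.8 and pr.blast_radius <= 3 and pr.author_trust >= 0.9:
        return "auto-merge"
    # Poorly tested or wide-impact changes are the ones worth human attention.
    if pr.test_coverage < 0.5 or pr.blast_radius > 10:
        return "escalate-to-human"
    # Everything in between goes back to the coding agent for more tests.
    return "request-more-tests"
```

The point of the sketch is the triage itself: when the gate handles the easy approvals and rejections, the only PRs that reach a human are the ones flagged as genuinely risky.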
The first glimpse of this future is "Approvability," a feature rolled out by Beykpour's team last month that automatically evaluates every PR against two hurdles before deciding whether it can merge without a human reviewer.
Trusting an AI to decide whether code can merge without a human reviewer will seem reckless to some, but this is how every generation of programming has evolved.
When the compiler took over the task of writing machine code in the 1950s, programmers didn't trust it—so they inspected its binary output line by line, swapping writing for reviewing. Over time, a set of checks and balances was built around the compiler—assembly listings, error diagnostics, optimization passes, and so on—until manual verification became redundant.
The same pattern played out with operating systems, CI tools, and cloud platforms. Each initially added a burden of oversight, and each eventually earned enough trust that the oversight became unnecessary.
The research above helps diagnose the problems with LLM-written code: it contains more bugs, gets left untouched for longer, and mostly sails through review unchallenged. But the teams that solve these challenges won't be the ones who review harder. They'll be the ones who review smarter.