DEV Community: Artem Koltunov

The Assimilation Problem: Why AI-Generated Code Becomes Legacy Faster Than You Think

Artem Koltunov — Tue, 19 May 2026 14:13:23 +0000

What happens when you merge code you don't understand

A New Kind of Legacy

Every developer knows what legacy code is. Old, tangled, written by people who left the company long ago. No documentation, no context, nobody remembers why the code is structured the way it is. Legacy emerges over time: people leave, requirements change, knowledge dissipates.

Traditional legacy = knowledge lost over time.
AI-legacy = knowledge never created.

AI-legacy is different. It does not become legacy with age. It is born as legacy.

When AI generates code and a developer accepts it without deep understanding, there was no context to begin with. There was no decision-making process. There was no moment when an engineer consciously chose one architecture over another. The code appeared, passed tests, was accepted into the codebase — and from that point on, nobody on the team truly owns it.

AI-legacy is code that was accepted faster than it was understood.

Margaret-Anne Storey (University of Victoria) named the result of this process cognitive debt (Storey, 2026). In her model: technical debt lives in the code — it's a conscious trade-off the team makes, knowing they'll come back to fix it. Cognitive debt lives in people — it's the gap between the state of the system and the team's understanding of it. Martin Fowler developed the concept further in his Fragments series (Fowler, 2026), formulating the key question: do we need a discipline analogous to refactoring, but for team understanding? AI-legacy is an artifact in the codebase. Cognitive debt is the cost of not understanding it, paid by the developers.

In the first article of this series, we described the results of three experiments with AI tools on real production code. We measured a sustainable productivity gain of 25-40%, but only with thorough code review. That article showed what happens without review. This one explains why it happens and how AI-legacy forms.

The Story of One Image ID

In the third experiment, two developers integrated a JS SDK into a product using Cursor. A Redux environment was set up, Figma was connected, layouts were generated, business logic was integrated. Working code appeared in ~20 hours instead of the estimated ~40.

Everything compiled. The UI worked. Basic tests passed.

During code review, one of the developers noticed an anomaly. The image identifier was already contained in the object being passed through the system. Logically, the code should have simply used that ID — one line, one reference.

Instead, the AI-generated implementation took the long way around:

1. Retrieve the image ID
2. Download the blob by that ID
3. Create a new file from the blob
4. Upload that file back to the server
5. Receive a new identifier

Five steps instead of one. Additional network requests on every call. Data duplication. Growing resource consumption.

Nothing was broken on the outside. Inside, the system was doing several times more work than necessary.
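
The difference between the two paths can be made concrete with a small simulation. This is a minimal sketch, not the SDK code from the experiment: `FakeImageServer`, `reuse_image_id`, and `reupload_image` are hypothetical names, and the "server" is an in-memory stand-in whose only job is to count requests.

```python
from dataclasses import dataclass, field


@dataclass
class FakeImageServer:
    """In-memory stand-in for an image API (hypothetical); counts simulated requests."""
    blobs: dict = field(default_factory=dict)
    requests: int = 0
    next_number: int = 1

    def upload(self, data: bytes) -> str:
        self.requests += 1
        image_id = f"img-{self.next_number}"
        self.next_number += 1
        self.blobs[image_id] = data
        return image_id

    def download(self, image_id: str) -> bytes:
        self.requests += 1
        return self.blobs[image_id]


def reuse_image_id(message: dict) -> str:
    """The short path: the ID is already on the object being passed around."""
    return message["image_id"]


def reupload_image(server: FakeImageServer, message: dict) -> str:
    """The long path found in review: download, re-wrap, upload again."""
    image_id = message["image_id"]      # step 1: retrieve the image ID
    blob = server.download(image_id)    # step 2: download the blob
    new_file = bytes(blob)              # step 3: create a new file from the blob
    return server.upload(new_file)      # steps 4-5: upload, receive a new ID


server = FakeImageServer()
original_id = server.upload(b"pixels")
message = {"image_id": original_id}
baseline = server.requests

direct_id = reuse_image_id(message)
direct_cost = server.requests - baseline

copied_id = reupload_image(server, message)
copy_cost = server.requests - baseline

print("direct:", direct_id, "extra requests:", direct_cost)
print("copy:  ", copied_id, "extra requests:", copy_cost)
print("stored blobs:", len(server.blobs))
```

Both paths return an ID that resolves to the same bytes, which is why tests on visible behavior stay green. Only the request counter and the duplicated blob reveal the extra work.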

This problem was discovered by a person who understood the domain logic. Not a linter. Not tests. Not another AI model. A person who had a mental model of the system, and the code before their eyes didn't match that model.

And here's the key question: why could this code have passed review without a single objection?

Because it looked like code written by a competent developer. Clean formatting. Meaningful variable names. Logical structure. Green tests. The only thing missing was a human understanding of why the code was structured that way.

And this is where the central problem of AI-assisted development emerges — not code generation, but code assimilation. When a developer writes code themselves, understanding is built into the creation process: every architectural decision passes through their mental model of the system. With AI-generated code, this link is severed. The code appears ready-made, but the reasoning behind it exists in no human mind. If the developer doesn't reconstruct that reasoning during review, the code enters the system without cognitive ownership. This is the exact moment AI-generated code becomes AI-legacy — not because it's wrong, but because it was integrated into the system faster than it was understood.

The Assimilation Model: Why This Happens

The gap between generation and understanding is closed through a process we propose calling assimilation — by analogy with cognitive psychology and the Gestalt approach.

In psychology, assimilation is the active processing of experience, after which the result of contact with something external becomes part of a person's internal model. The opposite of assimilation is introjection: acceptance without processing, "swallowing whole" (Perls, Hefferline, Goodman, 1951). In cognitive psychology, this is confirmed by levels-of-processing theory: durable understanding requires active processing, not passive perception (Craik & Lockhart, 1972).

Let's map this onto engineering practice.

Assimilation of AI code:

The developer reads the generated code
Asks questions: why this way, not another?
Compares against the system's architecture
Rewrites parts that don't fit
Can explain the reasoning behind each decision to a colleague
Merge

Introjection of AI code:

The code looks functional
Tests pass
"Looks good"
Merge
Nobody understands why the code is structured that way

An important clarification: this is an analytical model, not a literal claim that therapeutic mechanisms operate in software engineering. But the parallel is useful — it explains why accepting code without understanding creates a specific and predictable kind of technical debt.

Model vs Evidence. The assimilation model is an explanatory framework, not an empirically proven mechanism. Existing research does not measure assimilation directly. It documents phenomena — rising review costs, the illusion of productivity, quality variance — that the assimilation model helps explain.

In Storey and Fowler's terms, every act of introjection — accepting code without processing — increases the team's cognitive debt. Assimilation is the process that prevents this debt from accumulating. Cognitive debt is not a side effect of AI-assisted development. It is a direct and predictable result of the absence of assimilation.

When It Matters: The Scale Boundary

The assimilation model raises a natural question: when is it even applicable? If the project is a 200-line script, a personal utility, a hackathon prototype — is assimilation necessary? Why understand something you can recreate in half an hour?

The Regeneration Paradigm

In February 2025, Andrej Karpathy gave this approach a name — vibe coding: "There's a new kind of coding I call 'vibe coding', where you fully give in to the vibes, embrace exponentials, and forget that the code even exists." Don't understand it — don't need to. It works — good enough. It broke — regenerate.

This is neither fantasy nor irony. a16z (Anish Acharya, August 2025) described an entirely new class of software — disposable software: "Building small, throwaway apps is starting to feel like doodling in a notebook and that shift changes why we build software in the first place." Personal scripts, event utilities, one-off tools for a specific task. For this category, assimilation is genuinely unnecessary. The code is meant to be discarded, not maintained.

The AI-legacy model doesn't apply to them. And we honestly acknowledge that.

But where does the boundary lie?

Three Thresholds

The boundary isn't binary (small/large). It's defined by three thresholds, each of which imposes its own constraint on the "just regenerate" paradigm.

The context threshold: does the project fit in one head?

Human working memory holds about four independent entities simultaneously (Cowan, 2001). AI has its own analogue: MindStudio and Chroma Research (2026) described context rot — predictable degradation of output quality as the context window fills. The effective limit is around 130,000 tokens, after which generation becomes locally correct but globally incoherent. A small project fits entirely in the context window — the AI reasons about all of it. A large one doesn't, and the AI begins creating code that contradicts parts of the system already pushed out of context.

When task complexity exceeds working memory capacity — both human and machine — understanding becomes impossible (Sweller, 1988). AI reduces routine load — it writes faster, removes mechanical work. But essential load — the connections within the domain — remains unchanged. Generation doesn't reduce the complexity of understanding.
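
A rough way to check the machine side of this threshold is to estimate whether a codebase fits in the quoted budget. The sketch below is illustrative only: the 130,000-token figure comes from the text above, while the four-characters-per-token ratio is an assumption (real tokenizers vary by model and language), and `estimate_tokens` and `fits_in_context` are hypothetical helper names.

```python
CONTEXT_BUDGET_TOKENS = 130_000  # effective limit quoted in the text
CHARS_PER_TOKEN = 4              # assumption: crude heuristic, not a real tokenizer


def estimate_tokens(sources: dict) -> int:
    """Estimate the token count for a mapping of file name -> file contents."""
    total_chars = sum(len(text) for text in sources.values())
    return total_chars // CHARS_PER_TOKEN


def fits_in_context(sources: dict, budget: int = CONTEXT_BUDGET_TOKENS) -> bool:
    """True if the estimated size stays within the context budget."""
    return estimate_tokens(sources) <= budget


small_script = {"tool.py": "x" * 8_000}
large_project = {"bundle.js": "x" * 600_000}

print(estimate_tokens(small_script), fits_in_context(small_script))
print(estimate_tokens(large_project), fits_in_context(large_project))
```

The estimate says nothing about essential load. A project can fit in the window and still be too densely connected for a reviewer to hold in mind.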

The specification threshold: can you recreate from a description?

In April 2026, METR and Epoch AI published MirrorCode results — a benchmark in which Claude Opus reassembled gotree, a bioinformatics CLI tool of ~16,000 lines of Go. The result is impressive — equivalent to 2-17 weeks of human work. But there's a critical nuance: the reassembly relied on hundreds of end-to-end tests as a machine-verifiable specification. In real projects, such a specification almost never exists.

Margaret-Anne Storey, in "From Technical Debt to Cognitive and Intent Debt: Rethinking Software Health in the Age of AI" (University of Victoria, arXiv:2603.22106, 2026), called this intent debt — the loss of externalized knowledge about intentions. Storey formalized three interacting types of debt: technical debt (lives in code — conscious trade-offs), cognitive debt (lives in people — erosion of shared team understanding), and intent debt (lives in artifacts — the absence of recorded rationale for decisions). Intent debt determines whether a system can be recreated: if intentions and rationale aren't documented, the specification exists only in people's heads. And people leave, forget, and Slack threads drown in archives.

Ward Cunningham (1992) created the technical debt metaphor. Martin Fowler popularized it. Storey expanded the model with two categories Cunningham didn't foresee — because in 1992, all code was written by humans, and knowledge had passed through someone's head at least once. With AI generation, intent debt arises at the moment of code creation: there was no person making the decision — there's nothing to externalize.

The state threshold: does the project have a life beyond the code?

Andreas Kirsch (Google DeepMind, March 2026), in his essay "The Flawed Ephemeral Software Hypothesis," articulated the strongest argument against the naive "just regenerate" paradigm: "Generation cost was never the dominant term for mature software systems; discovery, validation, integration, and coordination were." Generation is a small part of the cost. The bulk is discovering correct behavior, validation, integration, coordination.

And more precisely: "Continuous regeneration with minimal persisted code artifacts is unstable at scale, and systems that start there migrate toward persisted artifacts as they accumulate users, state, and integration complexity." A project that started small and regenerable eventually accumulates users, data, integrations — and regenerating it means losing all of that. You can regenerate code. You can't regenerate database migrations, user sessions, or external API contracts.

The formula: below ALL three thresholds — the "just regenerate" paradigm is rational, vibe coding works, AI-legacy doesn't matter. Above ANY threshold — assimilation is mandatory.

In practice, the question is simple: does the system now contain state, users, integrations, or behavior that cannot be safely recreated from tests? If yes, regeneration is no longer a strategy. Assimilation becomes part of engineering hygiene.
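
The rule above (below all three thresholds, regenerate; above any one, assimilate) can be written down as a small decision function. This is a sketch of the rule as stated in this article, not a validated metric; `ProjectProfile` and its field names are hypothetical, and each yes/no answer still requires human judgment.

```python
from dataclasses import dataclass


@dataclass
class ProjectProfile:
    fits_in_one_context: bool    # context threshold: one head, one context window
    has_executable_spec: bool    # specification threshold: behavior recreatable from tests
    has_persistent_state: bool   # state threshold: users, data, external integrations


def thresholds_crossed(profile: ProjectProfile) -> list:
    """Return the names of the thresholds this project is above."""
    crossed = []
    if not profile.fits_in_one_context:
        crossed.append("context")
    if not profile.has_executable_spec:
        crossed.append("specification")
    if profile.has_persistent_state:
        crossed.append("state")
    return crossed


def strategy(profile: ProjectProfile) -> str:
    """Above ANY threshold: assimilate. Below ALL three: regenerate."""
    return "assimilate" if thresholds_crossed(profile) else "regenerate"


hackathon_script = ProjectProfile(True, True, False)
product_with_users = ProjectProfile(True, True, True)

print(strategy(hackathon_script), thresholds_crossed(hackathon_script))
print(strategy(product_with_users), thresholds_crossed(product_with_users))
```

Note the asymmetry: one crossed threshold is enough to flip the answer, so a project that gains its first real users changes category even if the code itself stays small.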

Connections, Not Lines of Code

Where exactly does each threshold lie? The intuitive answer is "it depends on lines of code." But software engineering has long known that lines of code are a poor measure of complexity.

As early as 1972, David Parnas showed that two modules with identical LOC can have radically different complexity — everything depends on what they hide and how they're connected. A 100K-line project with clean interfaces can be more comprehensible than a chaotic 10K with dense coupling. Fred Brooks in The Mythical Man-Month (1975) formalized this: the number of potential dependencies between n components grows as n(n-1)/2 — with 10 modules that's 45 connections, with 50 it's 1,225. A large system isn't merely "bigger" than a small one — it has a fundamentally different structure of relationships. In No Silver Bullet (1986), Brooks took the argument to its conclusion: the complexity of software is an essential property, not an accidental one. AI generation removes accidental complexity — syntax, boilerplate, tooling. But essential complexity — domain connections, edge cases, state transitions — doesn't compress.
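
The n(n-1)/2 figures are easy to verify. A minimal sketch, with `potential_connections` as an illustrative helper name:

```python
def potential_connections(n: int) -> int:
    """Number of distinct pairs among n components: n(n-1)/2."""
    return n * (n - 1) // 2


for n in (10, 50, 100):
    print(n, "modules ->", potential_connections(n), "potential connections")
```

Going from 10 to 50 modules multiplies the module count by 5 but the potential connections by about 27 (45 to 1,225), which is why a large system is structurally different from a small one rather than merely bigger.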

Empirical research confirms this: metrics of connections between modules — dependency graphs, information flow (fan-in/fan-out) — predict defects and effort significantly better than lines of code (Zimmermann & Nagappan, Henry & Kafura, McConnell).

The practical conclusion: assimilation thresholds should be determined not by lines of code, but by connection density — dependencies, integrations, data flows between components.
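
As a hedged illustration of what "connection density" could look like in practice, the sketch below computes fan-in, fan-out, and a simple density ratio from a module dependency map. The module names and the map itself are hypothetical, and the plain fan-in x fan-out product is a simplification used here for illustration, not the full metric from the cited papers.

```python
from collections import defaultdict

# Hypothetical dependency map: module -> modules it depends on.
DEPENDENCIES = {
    "ui": ["state", "api"],
    "state": ["api", "models"],
    "api": ["models"],
    "upload": ["api", "models"],
    "models": [],
}


def fan_metrics(deps: dict) -> dict:
    """Return {module: (fan_in, fan_out)} for every module in the map."""
    fan_in = defaultdict(int)
    for targets in deps.values():
        for target in set(targets):
            fan_in[target] += 1
    return {module: (fan_in[module], len(set(targets)))
            for module, targets in deps.items()}


def connection_density(deps: dict) -> float:
    """Actual dependency edges divided by all possible directed pairs."""
    n = len(deps)
    edges = sum(len(set(targets)) for targets in deps.values())
    possible = n * (n - 1)
    return edges / possible if possible else 0.0


metrics = fan_metrics(DEPENDENCIES)
# Rank modules by fan_in * fan_out: high values mark hubs that both
# depend on others and are depended upon.
hotspots = sorted(metrics, key=lambda m: metrics[m][0] * metrics[m][1], reverse=True)

print(metrics)
print("density:", connection_density(DEPENDENCIES))
print("review first:", hotspots[0])
```

A number like this is only a starting point for deciding where review effort goes. It does not measure whether anyone understands the hub module.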

A Gradient, Not a Wall

Thresholds aren't a wall a project crashes into at a single moment. They're a gradient that projects ascend — sometimes without the team noticing.

BayTech Consulting (January 2026) gave this transition a name — Spaghetti Point: "While Vibe Coding may appear faster in the first week of a project, the 'crossing point' typically occurs around month 3," when a vibe-coded project "hits a wall of complexity where adding new features breaks existing ones." Three months is a rough empirical estimate. A project that started small enough to regenerate gradually accumulates complexity until one day, adding a new feature begins breaking old ones.

Lehman's Laws of Software Evolution (1974-1996) explain why: a system's complexity increases over time unless active effort is made to reduce it. This isn't a hypothesis — it's an empirical law, confirmed on Apache Tomcat, Apache Ant, and dozens of other projects. And AI generation accelerates this process: research by Zhu, Tsantalis, and Rigby (arXiv:2605.02741, May 2026) discovered the "Volume-Quality Inverse Law" — the volume of AI-generated code turned out to be an almost perfect predictor of structural degradation. The more code AI generates, the worse the architecture. AI doesn't merely fail to reduce essential complexity — it actively increases coupling.

Kirsch strengthens the argument: even projects conceived as small and regenerable migrate above the thresholds over time — "systems that start there migrate toward persisted artifacts as they accumulate users, state, and integration complexity." A user uploaded their data. An external service connected via API. Permanent URLs appeared. Regeneration was no longer possible.

Andrej Karpathy illustrated this with his own career. He coined vibe coding in February 2025, declared it obsolete and introduced "agentic engineering" in February 2026, and ultimately hand-coded his serious project Nanochat, because AI agents "didn't work well enough at all and net unhelpful, possibly the repo is too far off the data distribution." The person who defined the "forget that the code even exists" paradigm, when working on a real project, returned to human understanding.

And Addy Osmani (Google, March 2026) showed that comprehension debt accumulates even when the developer is alone. Osmani introduced the term comprehension debt — "the growing gap between how much code exists in your system and how much of it any human being genuinely understands." The danger is that existing metrics don't notice this gap: "The reason comprehension debt is so dangerous is that nothing in your current measurement system captures it. Velocity metrics look immaculate. DORA metrics hold steady. PR counts are up. Code coverage is green." A project can look perfectly healthy across all dashboards — while nobody understands it.

Research by Anthropic (arXiv:2601.20245, 52 engineers, randomized controlled trial) confirms this quantitatively: developers who used AI scored 17 percentage points lower on tests of understanding their own code (50% vs 67%), with the largest drop in debugging, precisely the area where understanding internal logic is critical.

The scale boundary is not static. Projects start below the thresholds and migrate above them. AI generation accelerates this migration. The question isn't "is my project small or large," but "when will it cross the threshold — and will I be ready?"

These thresholds aren't a theoretical construct. The data shows what happens when teams operate above them.

What the Data Shows

The illusion of understanding. A randomized controlled trial by METR (July 2025) showed: experienced developers with AI tools spent 19% more time while subjectively estimating they were 20% faster. The gap between perception and reality — 39 percentage points. This is consistent with the well-known illusion of competence: the developer thinks they've assimilated the result, but objectively — they haven't.

The cost of review. Sonar's State of AI in Code report (January 2026): 95% of developers spend significant effort reviewing AI code. 38% find reviewing AI code more labor-intensive than reviewing human code. Assimilating AI code is objectively harder because it has no "author" you can ask about intentions.

Quality. According to industry analysis by CodeRabbit (470 pull requests), AI code may contain significantly more issues than human code. Unassimilated code carries more defects because the reviewer lacks the context in which the code was created — that context doesn't exist.

Systemic effect. Google's DORA Report 2024 recorded: a 25% increase in AI usage correlates with a 7.2% decrease in delivery stability. This may indicate a systemic effect — when unassimilated code accumulates not in one developer's work, but across the entire team.

Technical debt. Research by HFS Research + Unqork (November 2025): 43% of Global 2000 organizations acknowledge that AI creates new technical debt. GitClear (2025) shows: with generation speed increasing 20-55%, the volume of "durable code" (not rewritten within weeks) grows by only ~10%. Most AI code gets rewritten — meaning it wasn't assimilated the first time.

Commit as an Act of Responsibility

git commit is not a technical action. It's an act of taking responsibility for code.

Responsibility presupposes understanding. You're responsible for code because you know what it does, why it's structured that way, and what consequences it has for the system.

When a developer commits AI code they haven't assimilated, a gap emerges:

Formally — they're responsible. Their name is in git blame.
Cognitively — they don't own this code. They can't explain why it's structured one way and not another.

Commit creates formal ownership. Assimilation creates cognitive ownership.
When the two diverge — AI-legacy begins.

This is responsibility in name only. The system has changed, but the team's mental model hasn't.

The gap between the system's state and the team's understanding — this is the definition of debt. But what kind exactly? Storey and Fowler draw the line: technical debt is a conscious trade-off in code; cognitive debt is an unconscious loss of understanding in people (Storey, 2026; Fowler, 2026). AI-legacy falls into the second category.

Ordinary tech debt grows when a team consciously makes a trade-off — "we know this needs refactoring, but there's no time right now." Cognitive debt grows when the team doesn't realize a trade-off was even made. And unlike technical debt, it accumulates not at the speed of development, but at the speed of generation.

What to Do About It

The problem isn't AI tools. The problem is the absence of assimilation. Here's a practical framework.

The assimilation test. Before merge, ask yourself one question: "Can I explain to a colleague why the code solves the problem this way, and not another?" If the answer is "no" — the code isn't assimilated. That doesn't mean it needs to be rewritten. It means it needs to be understood before it's accepted.

Review as a mechanism for distributed assimilation. When a reviewer asks questions and the author answers, assimilation distributes across the team. Knowledge stops being concentrated in one person, so the bus factor improves: more than one person can explain the code. But this only works if the questions are substantive, not perfunctory.

The bottleneck is not generation, but understanding. The real constraint on productivity isn't the speed at which AI generates code, but the speed at which the team understands it. What needs optimizing is not prompts, but the review process.

AI code requires deeper review than human code. When a colleague wrote code, you can ask them: "Why did you do it this way?" AI code has no author. The decision context doesn't exist. The only way to obtain that context is to reconstruct it yourself through review.

In the first article of this series, we showed that with this approach, our team achieved a sustainable 25-40% gain. Not because AI generated perfect code. But because the team assimilated every decision.

Conclusion

AI tools don't create legacy code by themselves. Legacy is created by the absence of assimilation — accepting code without understanding it.

This is not an argument against AI. This is an argument for a conscious integration process. Every merge is a moment of choice: understand the code or take it on faith. The first path is slower. The second creates debt faster — debt that will have to be paid.

From this follows a simple but important conclusion: the primary limit on productivity in AI-assisted development is no longer the speed of writing code, but the speed of assimilating it. Cognitive debt is the price a team pays for every skipped step in this process. Code generation keeps getting faster with each new model. Assimilation scales much more slowly, because it requires human understanding. Therefore, the central engineering challenge of the coming years is not learning to generate more code, but learning to understand it faster and more reliably.

This challenge has two dimensions. The first is cognitive: how a developer claims ownership of key decisions in AI code without reading every line, through directed learning and active engagement with the code. The second is engineering: how a team transforms understanding into verifiable artifacts — specifications, state machines, architectural tests, metrics — so that AI operates within formalized engineering constraints rather than as a free-form author. Both paths — in the next articles.

This is the second article in the series. First: "AI Coding Tools in Practice: What a 25-40% Productivity Gain Really Looks Like"

Source Overview

This article draws on 31 sources from five domains. Below is a grouping by role in the argument; full references follow in the next section.

Cognitive Foundations

Perls, Hefferline & Goodman (1951): the concept of assimilation and introjection in Gestalt psychology. The basis for the analogy: accepting code without processing = introjection.
Craik & Lockhart (1972): levels-of-processing theory. Durable understanding requires active processing, not passive perception.
Cowan (2001): revised "magic number". Real working memory capacity is about four chunks, which limits the number of components a developer can simultaneously hold during review.
Sweller (1988): cognitive load theory. AI reduces mechanical load (extraneous), but the essential domain load (intrinsic) remains unchanged.

Classical Software Engineering Complexity

Brooks (1975, 1986): quadratic growth of dependencies between components; the distinction between essential and accidental complexity. AI reduces accidental complexity (tooling, syntax). Essential complexity (domain connections, edge cases) doesn't compress.
Parnas (1972): information hiding. Complexity is determined not by LOC, but by what modules hide and how they're connected.
Zimmermann & Nagappan (ICSE 2008): empirical confirmation on the Windows Server 2003 codebase. Dependency graph metrics predict defects 10 percentage points better than code complexity metrics; network measures identified 60% of critical binaries (twice as many as complexity metrics).
Henry & Kafura (IEEE TSE, 1981): information flow between modules (FAN-IN x FAN-OUT) highly correlates with effort; LOC showed weak correlation.
McConnell (2004, Code Complete, Ch. 27): practical benchmark. Below 2,000 LOC individual skill determines everything; above that, effort grows faster than linearly (10K LOC = 13.5 person-months, 100K = 170 person-months instead of the linear 135).
Lehman (1974-1996): laws of software evolution. A system's complexity increases over time unless active effort is made to reduce it.

The New Debt of the AI Era

Storey (arXiv:2603.22106, 2026): Triple Debt Model with technical debt (in code), cognitive debt (in people), and intent debt (in artifacts). The key expansion of Cunningham's model for the era of AI generation.
Fowler (Fragments, 2026): development of the cognitive debt concept. Do we need a discipline analogous to refactoring, but for team understanding?
Osmani (March 2026): comprehension debt, the gap between the volume of code and what a human actually understands. The danger is that existing metrics (velocity, DORA, coverage) don't notice it.

Empirical Evidence on AI-Assisted Development

METR (arXiv:2507.09089, 2025): RCT. Experienced developers were 19% slower with AI, while subjectively estimating a 20% speedup. Perception gap: 39 pp.
Sonar (2026): 95% of developers spend significant effort reviewing AI code; 38% find it harder than human code.
Google DORA (2024): a 25% increase in AI usage correlates with a 7.2% decrease in delivery stability.
Anthropic (arXiv:2601.20245, 2026): RCT, 52 engineers. AI users scored 17 percentage points lower (50% vs 67%) on tests of understanding their own code.
GitClear (2025): with generation speed increasing 20-55%, the volume of "durable code" grows by only ~10%.
HFS Research + Unqork (2025): 43% of Global 2000 organizations acknowledge AI creates new tech debt.
CodeRabbit: analysis of 470 PRs. AI code may contain significantly more issues than human code.
Zhu, Tsantalis & Rigby (arXiv:2605.02741, 2026): Volume-Quality Inverse Law. AI code volume is an almost perfect predictor of structural degradation.

The Regeneration Paradigm and Its Limits

Karpathy (2025-2026): introduced vibe coding (February 2025), declared it obsolete and proposed agentic engineering (February 2026), hand-coded Nanochat. His biography illustrates the paradigm's boundary.
a16z / Acharya (2025): disposable software as a new class of software for which assimilation is unnecessary.
Kirsch (Google DeepMind, 2026): "generation cost was never the dominant term" for mature systems; projects that start as disposable migrate toward persisted artifacts.
METR / Epoch AI (MirrorCode, 2026): AI reassembled a 16K LOC Go project, but relied on hundreds of end-to-end tests as a specification; in real projects such specifications almost never exist.
BayTech Consulting (2026): Spaghetti Point, ~3 months until a vibe-coded project hits the complexity wall.
MindStudio / Chroma Research (2026): context rot, predictable degradation of AI output after ~130K tokens.

Sources

Perls, F., Hefferline, R., & Goodman, P. (1951). Gestalt Therapy: Excitement and Growth in the Human Personality
Craik, F. I. M., & Lockhart, R. S. (1972). Levels of processing: A framework for memory research. Journal of Verbal Learning and Verbal Behavior
Parnas, D. L. (1972). On the Criteria to Be Used in Decomposing Systems into Modules. Communications of the ACM — https://dl.acm.org/doi/10.1145/361598.361623
Brooks, F. P. (1975). The Mythical Man-Month: Essays on Software Engineering
Brooks, F. P. (1986). No Silver Bullet — Essence and Accident in Software Engineering — https://worrydream.com/refs/Brooks_1986_-_No_Silver_Bullet.pdf
Henry, S. & Kafura, D. (1981). Software Structure Metrics Based on Information Flow. IEEE TSE, SE-7(5) — https://ieeexplore.ieee.org/document/1702877/
Sweller, J. (1988). Cognitive Load During Problem Solving: Effects on Learning. Cognitive Science, 12(2)
Cunningham, W. (1992). The WyCash Portfolio Management System — http://c2.com/doc/oopsla92.html
Lehman, M. M. (1996). Laws of Software Evolution Revisited — https://en.wikipedia.org/wiki/Lehman%27s_laws_of_software_evolution
Cowan, N. (2001). The magical number 4 in short-term memory. Behavioral and Brain Sciences — https://www.cambridge.org/core/journals/behavioral-and-brain-sciences/article/magical-number-4-in-shortterm-memory-a-reconsideration-of-mental-storage-capacity/44023F1147D4A1D44BDC0AD226838496
McConnell, S. (2004). Code Complete, 2nd ed., Chapter 27: How Program Size Affects Construction — https://www.oreilly.com/library/view/code-complete-2nd/0735619670/ch27s04.html
Boehm, B. W. (1981). Software Engineering Economics (COCOMO model) — https://en.wikipedia.org/wiki/COCOMO
Zimmermann, T. & Nagappan, N. (2008). Predicting Defects using Network Analysis on Dependency Graphs. ICSE 2008 — https://thomas-zimmermann.com/publications/files/zimmermann-icse-2008.pdf
Karpathy, A. (2025). Vibe Coding. X/Twitter — https://x.com/karpathy/status/1886192184808149383
Karpathy, A. (2025-2026). Nanochat: inventor of vibe coding hand-codes his project. Futurism — https://futurism.com/artificial-intelligence/inventor-vibe-coding-doesnt-work
Acharya, A. / a16z (2025). Disposable Software — https://a16z.com/disposable-software/
METR (2025). Measuring the Impact of Early 2025 AI on Experienced Open-Source Developers — https://arxiv.org/abs/2507.09089
Sonar (2026). State of AI in Code Report — https://www.sonarsource.com/resources/developer-survey-report/
CodeRabbit. AI Code Quality Analysis (470 PRs)
Google DORA Report 2024 — https://dora.dev/research/2024/dora-report/
HFS Research + Unqork (2025). AI and Technical Debt — https://www.hfsresearch.com/press-release/ai-wont-save-enterprises-from-tech-debt-unless-they-change-the-architecture-first/
GitClear (2025). AI Assistant Code Quality Research (211M lines) — https://www.gitclear.com/ai_assistant_code_quality_2025_research
Storey, M.-A. D. (2026). From Technical Debt to Cognitive and Intent Debt: Rethinking Software Health in the Age of AI. arXiv:2603.22106 — https://arxiv.org/abs/2603.22106
Fowler, M. (2026). Fragments on Cognitive Debt. martinfowler.com — https://martinfowler.com/fragments/2026-02-13.html, https://martinfowler.com/fragments/2026-04-02.html
Kirsch, A. (2026). The Flawed Ephemeral Software Hypothesis. Google DeepMind — https://www.blackhc.net/essays/future_of_software/
Osmani, A. (2026). Comprehension Debt: The Hidden Cost of AI-Generated Code — https://addyosmani.com/blog/comprehension-debt/
Anthropic (2026). How AI Impacts Skill Formation. arXiv:2601.20245 — https://arxiv.org/abs/2601.20245
MindStudio / Chroma Research (2026). Context Rot — https://www.mindstudio.ai/blog/context-rot-ai-coding-agents-explained | https://research.trychroma.com/context-rot
METR / Epoch AI (2026). MirrorCode Preliminary Results — https://metr.org/blog/2026-04-10-mirrorcode-preliminary-results/
BayTech Consulting (2026). AI Technical Debt: How Vibe Coding Increases TCO — https://www.baytechconsulting.com/blog/ai-technical-debt-how-vibe-coding-increases-tco-and-how-to-fix-it
Zhu, Y., Tsantalis, N., & Rigby, P. C. (2026). AI-Generated Smells. arXiv:2605.02741 — https://arxiv.org/abs/2605.02741v1

AI Coding Tools in Practice: What a 25-40% Productivity Gain Really Looks Like

Artem Koltunov — Sat, 25 Apr 2026 17:29:40 +0000

Our JavaScript team tested AI-assisted development on production code. Here's what we measured, what surprised us, and why we think the real gain is 25-40% -- not the 10x you keep hearing about.

Over the past year, AI coding tools have been surrounded by bold claims: "Develop twice as fast." "10x developer productivity." "Code that practically writes itself."

We decided to test these claims on real work -- not demo projects, but production code. The kind of long-lived repositories that power SDKs and developer platforms, systems that must be maintained, reviewed, and understood years after the code is written.

What We Tested

Our JavaScript team works with AI models like GPT Codex, GPT-5.2, Opus 4.5, and Gemini 3.5 through IDE plugins -- specifically GitHub Copilot Chat in WebStorm and IntelliJ IDEA.

Recently, we also got access to Cursor, an IDE with deeply integrated AI that can operate across an entire project. Unlike traditional AI plugins where you manually select files and copy code into prompts, Cursor sees the whole codebase, creates files in the right locations, and applies changes directly.

The biggest immediate impact wasn't smarter code generation -- it was the disappearance of small mechanical tasks. Less time copying code, managing context, and stitching pieces together. That alone produced an early productivity improvement of roughly 20%.

To see where this advantage held up -- and where it didn't -- we ran three experiments on active codebases.

Three Experiments

Important note: The first two experiments used GitHub Copilot Chat inside WebStorm, our usual IDE. The third introduced Cursor, which gave us a chance to compare a traditional AI plugin approach with a full-project AI environment.

Experiment 1: Extending a Production SDK

We added new AI-related functionality to an existing JavaScript SDK: AI Summarize (generating summaries from ~1000 chat messages) and AI Gateway (recognizing text in images and generating descriptions). The task included API integration, SDK adaptation, tests, and usage examples.

For this task we used GitHub Copilot Chat inside WebStorm. The AI could generate useful code, but we still had to gather context manually -- selecting files, pasting snippets, and explaining how modules interact -- before integrating whatever came back.

Even with that overhead, AI assistance made a noticeable difference.

Result: ~18 hours with AI vs. 24+ hours without. A gain of 30-35%.

What sped things up wasn't deep architectural insight. It was the smaller tasks: generating scaffolding, following existing patterns, and wiring pieces together faster than a human would type them.
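The same pair of measurements can be expressed as two different percentages, and the article does not say which one the 30-35% refers to. The sketch below, with illustrative function names that are not from our tooling, computes both from the Experiment 1 hours, taking 24 as the lower bound of "24+".

```typescript
// Two common ways to express the same measurement.
// Hours come from Experiment 1; the function names are illustrative.

/** Fraction of the baseline time that was saved (0.25 means 25% less time). */
function timeSaved(baselineHours: number, withAiHours: number): number {
  return (baselineHours - withAiHours) / baselineHours;
}

/** Extra work that fits into the same time (0.33 means 33% more throughput). */
function throughputGain(baselineHours: number, withAiHours: number): number {
  return baselineHours / withAiHours - 1;
}

const withAi = 18;
const baseline = 24; // the text says "24+", so this is the lower bound

const saved = timeSaved(baseline, withAi);
const gain = throughputGain(baseline, withAi);

console.log("time saved: " + (saved * 100).toFixed(1) + "%");      // time saved: 25.0%
console.log("throughput gain: " + (gain * 100).toFixed(1) + "%");  // throughput gain: 33.3%
```

At exactly 24 hours the time saved is 25% and the throughput gain is about 33%. A 30-35% figure matches the throughput reading, or the time-saved reading if the baseline was closer to 26-27.5 hours.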

Experiment 2: Untangling Long-Lived Branches

Several parallel branches had been evolving separately since 2021. They contained overlapping logic, slightly different implementations, and subtle behavioral differences.

Normally, merging something like this is slow and mentally draining. It requires reading a lot of unfamiliar code and carefully comparing approaches.

Using Copilot Chat, we could feed sections of each branch to the model, ask it to highlight overlaps and divergences, and get explanations of unfamiliar code. That made it much easier to focus on the important part of the job -- deciding which implementation actually made sense.

Result: ~1.5 days with AI vs. ~1 week without. Tasks that involve analyzing and comparing large bodies of code ran several times faster.

The biggest advantage here wasn't generating code at all. It was simply making large amounts of existing code easier to understand.

Experiment 3: Integrating an SDK Into a Product (with Cursor)

This experiment used Cursor. Two developers worked in parallel using different AI models (GPT-5.2 Codex and Opus 4.5). We created a complete Redux environment, connected Figma, generated layouts, and integrated business logic.

At first, the results looked impressive.

Result: ~20 hours with Cursor vs. ~40 hours without. Getting to working code 2x faster.

But this experiment also exposed a limitation that didn't show up in the earlier tasks.

The Hidden Problem With AI-Generated Code

The AI-generated code from Experiment 3 compiled, the interface behaved correctly, and the basic tests passed. If we had stopped there, we would have considered the integration complete.

But during code review, one of the developers noticed something odd.

An image identifier already existed inside one of the objects being passed through the system. Logically, the code should have simply reused that ID. Instead, the generated implementation took a much longer route: it fetched the ID, downloaded the associated blob, created a new file from it, uploaded that file back to the server, and then returned a new identifier.

From the outside, nothing was broken. Internally, the process was doing far more work than necessary. Each time the logic ran, it duplicated data, added network calls, and quietly increased resource usage.
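A minimal sketch of the two routes, not the actual SDK code: every name here (FakeFileServer, Attachment, imageIdViaRoundTrip, imageIdViaReuse) is hypothetical, and the in-memory server exists only to count requests and stored files.

```typescript
// Hypothetical in-memory stand-in for a file server. It only stores bytes and
// counts requests; a real client would be asynchronous and go over the network.
class FakeFileServer {
  requests = 0;
  private files: Record<string, number[]> = {};
  private nextId = 1;

  download(id: string): number[] {
    this.requests++;
    const bytes = this.files[id];
    if (bytes === undefined) {
      throw new Error("unknown file id: " + id);
    }
    return bytes;
  }

  upload(bytes: number[]): string {
    this.requests++;
    const id = "file-" + this.nextId++;
    this.files[id] = bytes;
    return id;
  }

  storedFileCount(): number {
    return Object.keys(this.files).length;
  }
}

// Hypothetical shape of the object passed through the system.
interface Attachment {
  imageId: string;
}

// The long route described above: download the blob, build a new file from it,
// upload it again, and return the new identifier.
function imageIdViaRoundTrip(server: FakeFileServer, attachment: Attachment): string {
  const bytes = server.download(attachment.imageId);
  const newFile = bytes.slice(); // copy of the same content
  return server.upload(newFile);
}

// The short route: the identifier is already on the object, so reuse it.
function imageIdViaReuse(attachment: Attachment): string {
  return attachment.imageId;
}

const server = new FakeFileServer();
const attachment: Attachment = { imageId: server.upload([1, 2, 3]) };
server.requests = 0; // do not count the initial upload

const reusedId = imageIdViaReuse(attachment);
const requestsAfterReuse = server.requests; // 0 requests

const duplicatedId = imageIdViaRoundTrip(server, attachment);
const requestsAfterRoundTrip = server.requests; // 2 requests

console.log({ reusedId, requestsAfterReuse });
console.log({ duplicatedId, requestsAfterRoundTrip, stored: server.storedFileCount() });
```

Both functions return a usable identifier, which is why tests and the UI see nothing wrong. Only the request counter and the second stored copy reveal the difference.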

We discovered this only because we opened the code and read it carefully.

This turned out to be a pattern we started noticing more often with AI-generated code. The output usually works, but the logic behind it doesn't always match the architecture of the system it's being added to. In shared components like SDKs, such inefficiencies can spread quietly through every product that depends on them.

What Industry Research Shows

While we were running these experiments, we studied key industry research. Our experience aligned closely with what independent analysts are measuring.

Productivity and Code Quality

GitClear's 2025 analysis found that AI tools can increase development speed by 20-55%, but the amount of "sustainable code" -- code that stays in the codebase without being rewritten -- grows by only about 10%. Developers produce code faster, but a noticeable portion still ends up being revised or refactored later.

A randomized controlled study by METR (July 2025) produced a striking result: experienced developers working on their own mature projects actually spent 19% more time with AI tools, while subjectively estimating a 20% speedup. The key takeaway: perceived speed and actual speed are different things. The full data is available on arXiv and GitHub.

The Cost of Reviewing AI Code

Sonar's State of AI in Code report (January 2026) found that 95% of developers spend significant effort checking AI-generated code, and 38% consider it harder to review than human-written code. Developers read and verify code far more slowly than AI generates it, which creates a natural ceiling on productivity gains.

Architectural Limitations of AI-Generated Code

Ox Security's "Army of Juniors" report (October 2025) describes AI-generated code as "highly functional but systematically lacking architectural thinking." This explains why the code works but accumulates hidden problems.

Technical Debt

HFS Research + Unqork (November 2025) surveyed 123 respondents from Global 2000 organizations: while 84% expect AI to reduce costs, 43% admit that AI creates new technical debt. Opinions on long-term impact are split almost evenly -- 55% expect debt to decrease, 45% expect it to increase.

Forrester predicts that by 2026, 75% of tech leaders will face moderate or serious technical debt, with AI code generation without engineering discipline being a key factor.

Impact on Delivery Stability

Google DORA Report 2024 found a notable correlation: a 25% increase in AI adoption is associated with a 7.2% decrease in delivery stability and a 1.5% decrease in throughput, alongside a 2.1% productivity gain and a 2.6% increase in job satisfaction. The 2025 DORA Report confirms these findings.

Why the Real Gain Is 25-40%

Looking across both our experiments and the broader research, the same pattern keeps appearing.

AI tools clearly speed up certain parts of development: reducing boilerplate, navigating large codebases, scaffolding new functionality, and accelerating the path to a working implementation.

But those gains come with a counterweight. The code still needs to be understood, reviewed, and integrated into an existing system. Developers reason about code far more slowly than AI can generate it.

Without proper review, teams accumulate what we call "AI legacy code" -- code that works but nobody on the team truly understands. Over time, it becomes easier to regenerate than to modify. But regeneration means spending time and resources on problems that were already solved. In high-debt environments, losses reach 30-40% of the change budget and 10-20% of system operation costs.

This situation can develop within months when AI-generated code is adopted at scale without full developer involvement.

That's why the dramatic claims about "10x productivity" rarely hold up in real engineering environments. In practice, the gains stabilize in the 25-40% range -- meaningful enough to matter, but not so large that engineering judgment becomes unnecessary.
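One way to picture why the gain levels off is an Amdahl's-law-style estimate. This is an illustration added here, not a model we fitted to the experiments: it assumes that only the code-writing share of a task gets faster, while understanding, review, and integration stay at human speed. The 30/70 split and the function name are hypothetical.

```typescript
/**
 * Amdahl's-law-style estimate.
 * acceleratedShare: fraction of the original effort that AI speeds up (0..1)
 * factor: how many times faster that fraction becomes
 * Returns the overall speedup for the whole task.
 */
function overallSpeedup(acceleratedShare: number, factor: number): number {
  const unchanged = 1 - acceleratedShare;
  return 1 / (unchanged + acceleratedShare / factor);
}

// Hypothetical split: 30% of the effort is writing code, 70% is reading,
// reviewing, and integrating. Neither number comes from the experiments.
const writingShare = 0.3;

for (const factor of [2, 10, 1000]) {
  const speedup = overallSpeedup(writingShare, factor);
  console.log("writing " + factor + "x faster -> overall " + speedup.toFixed(2) + "x");
}
// writing 2x faster -> overall 1.18x
// writing 10x faster -> overall 1.37x
// writing 1000x faster -> overall 1.43x
```

Under this assumed split, even near-instant generation cannot push the overall speedup past 1/0.7, or about 1.43x. The numbers only show the shape of the ceiling; different teams and tasks will have different shares.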

Conclusion

AI coding tools are most useful when treated as assistants rather than replacements for engineering judgment.

They excel at analyzing and comparing large volumes of code -- tasks that take humans significant time but that AI handles very quickly. They reduce friction in everyday development and can meaningfully accelerate time-to-working-code.

At the same time, tasks requiring deep understanding of business logic and architectural optimization are often solved by AI in suboptimal ways. The resulting code works but is redundant. The system functions correctly on the surface, but hidden problems related to performance, resource usage, and maintainability can form inside.

Architectural decisions, quality control, and responsibility for results must stay with the team. With this discipline in place, AI tools deliver a real, measurable, and sustainable productivity boost.
