DEV Community

Alex Cloudstar

Posted on • Originally published at alexcloudstar.com

AI-Generated Code Is Creating a Technical Debt Crisis Nobody Is Auditing

Six months ago, I merged a pull request that Claude Code generated in about eight minutes. Feature worked perfectly. Tests passed. The implementation followed the existing patterns in the codebase. I reviewed the diff, approved it, and moved on.

Last week, I had to modify that feature. I opened the file and realized I had no idea how half of it worked.

Not because the code was bad. It was actually well-structured. But I had never written it. I had never thought through the design decisions. I had approved an output without building the mental model that normally comes from writing code yourself. And now, six months later, I was paying for that shortcut.

This is not a story about AI writing bad code. It is a story about a new kind of debt that accumulates silently in every codebase where AI does significant implementation work. And based on conversations with other developers, it is far more widespread than anyone is talking about.


The Numbers That Should Worry You

Forty-one percent of all new code written globally is now AI-generated. That number comes from multiple sources tracking the shift through 2025 and into 2026. GitHub's own data shows similar trends across their platform.

But here is the number that matters more: how much of that AI-generated code gets genuinely reviewed versus rubber-stamped?

A GitClear analysis of over 100 million lines of changed code found that code churn (the percentage of lines reverted or updated within two weeks of being written) increased by 39 percent in projects heavily using AI coding tools. That is not a small signal. That is code being written, shipped, and then immediately needing to be fixed or replaced.

Stack Overflow's 2026 developer survey found that 76 percent of developers using AI coding tools reported generating code they did not fully understand at least some of the time. Not beginners. Experienced developers. The tools produce plausible, functional code fast enough that the review step gets compressed or skipped entirely.

Gartner projected that by 2026, 75 percent of enterprise software engineers would use AI code assistants, up from less than 10 percent in early 2023. That adoption curve is steep, and the debt is compounding at the same rate.


Three Types of AI-Generated Debt

Traditional technical debt is well understood. You take a shortcut, you know you took it, and you plan to fix it later (or at least you tell yourself that). AI-generated debt is different because you often do not realize you are taking on debt at all.

After spending the last month deliberately auditing my own codebases and talking to other developers doing the same, I see three distinct categories.

Comprehension Debt

This is the one I described in my opening. Addy Osmani coined the term in early 2026, and it perfectly captures the problem.

Comprehension debt accumulates when you ship code you did not write and do not deeply understand. Every time you approve an AI-generated implementation without building a genuine mental model of how it works, you add to this debt. The code functions correctly today, but your ability to debug it, extend it, or refactor it tomorrow is compromised.

In traditional development, comprehension comes as a side effect of writing. You understand your code because you made every decision that produced it. With AI-generated code, the output appears fully formed. You can read it, but reading is not the same as understanding.

The cost shows up later: when you need to debug a production issue at 2 AM and you are staring at code that technically works but whose internal logic you never internalized. Or when a new requirement forces you to modify a module you "reviewed" six months ago but cannot actually explain.

Verification Debt

This is about testing gaps, but not the obvious kind.

When AI generates a feature with tests, the tests usually verify that the implementation works as written. But they rarely test edge cases the AI did not consider. They almost never test failure modes that require understanding the broader system context. And they tend to test the happy path thoroughly while leaving error handling undertested.

The problem is subtle. Your test suite looks healthy. Coverage numbers might even go up. But the tests are validating what the AI built, not what the system actually needs to handle. It is the difference between "does this function return the right value for these inputs" and "what happens when the database connection drops mid-transaction while this function is running."

I found three separate instances in my own code where AI-generated tests were essentially tautological. They tested that the function did what the function did, without testing whether what the function did was correct in the broader system context. The tests passed. The code had bugs.
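A tautological test is easiest to see in a concrete sketch. The function and business rules below are invented for illustration; the shape of the problem is the point:

```python
# Hypothetical illustration: a tautological test vs. a meaningful one.

def apply_discount(price: float, tier: str) -> float:
    """Apply a loyalty discount by tier. (Invented business logic.)"""
    rates = {"gold": 0.20, "silver": 0.10}
    return round(price * (1 - rates.get(tier, 0.0)), 2)

# Tautological: re-derives the expected value with the same logic the
# function uses, so it can never disagree with the implementation.
def test_discount_tautological():
    rates = {"gold": 0.20, "silver": 0.10}
    assert apply_discount(100.0, "gold") == round(100.0 * (1 - rates["gold"]), 2)

# Meaningful: pins the expected value independently, and checks a case
# the broader system cares about (an unknown tier must not discount).
def test_discount_meaningful():
    assert apply_discount(100.0, "gold") == 80.0
    assert apply_discount(100.0, "unknown") == 100.0
```

Both tests pass. Only the second one would catch a bug in the discount table, because only the second one encodes knowledge the implementation does not already contain.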

Architectural Drift

This is the most dangerous category because it is the hardest to spot in individual code reviews.

Each AI-generated implementation is locally reasonable. It looks at the files it has access to, follows the patterns it sees, and produces code that fits within its immediate context. But across dozens or hundreds of AI-generated changes, small inconsistencies accumulate.

One agent uses a repository pattern for database access in a new module because that is what it saw in the file it was reading. Another agent uses direct queries in a different module because it read a different file. Both are reasonable in isolation. Together, they represent architectural drift, where the codebase gradually becomes inconsistent in ways that no single code review would flag.

I found four different patterns for handling API errors across my codebase. Not because I decided to use four patterns, but because different AI sessions, with different context windows, made different reasonable choices. Each one was fine. The inconsistency across them was not.
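Drift like this is easiest to see side by side. A hypothetical sketch, with module names and logic invented for illustration:

```python
# Two modules, two "locally reasonable" failure conventions for the same concern.

# Module A (pattern one AI session saw): raise a domain exception.
class PaymentError(Exception):
    pass

def charge_card(amount: float) -> str:
    if amount <= 0:
        raise PaymentError("amount must be positive")
    return "charged"

# Module B (pattern a later session saw): return an error tuple instead.
def refund_card(amount: float):
    if amount <= 0:
        return (None, "amount must be positive")
    return ("refunded", None)

# Every caller now needs to know which failure style each module uses,
# which is exactly the inconsistency no single code review would flag.
```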


Why Traditional Tools Miss This

Your linter will not catch comprehension debt. SonarQube cannot detect that you do not understand the code it is scanning. Test coverage tools happily report 85 percent coverage without telling you that the tests are superficial.

The reason is straightforward: traditional code quality tools measure properties of the code itself. They check syntax, complexity, style, coverage, dependency health. AI-generated debt is not primarily about the code. It is about the relationship between the code and the team that maintains it.

A function with cyclomatic complexity of 3 and full test coverage can still represent massive comprehension debt if nobody on the team understands why it works. A codebase with consistent style and zero linting errors can still have severe architectural drift if the consistency is superficial and the underlying patterns are fragmented.

This is why the standard response of "just review the code more carefully" misses the point. The problem is not sloppy reviews. The problem is that reviewing AI-generated code is fundamentally different from reviewing human-written code, and most teams have not adjusted their review practices to account for that difference.


A Practical Audit Framework

After researching what teams are doing about this and experimenting with my own approach, here is the framework I have settled on. It is not perfect, but it has caught real problems in my codebases.

Step 1: Identify Your AI-Generated Code

This sounds obvious, but most teams cannot actually answer the question "which parts of our codebase were AI-generated?" with any precision.

If you use Claude Code, check your git history for patterns. AI-generated commits often have specific characteristics: they touch many files in a single commit, they add complete features rather than incremental changes, and they tend to follow patterns very precisely (because the agent was reading existing patterns and replicating them).

Some teams have started tagging AI-generated commits with a conventional commit prefix or a trailer. If you are not doing this, start now. Future you will thank present you when it is time to audit.

For existing codebases without tags, I grep for a rough proxy: commits that add more than 200 lines with fewer than 3 manual edits in the surrounding days. It is not perfect, but it catches the bulk of large AI-generated implementations.
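The large-commit half of that proxy can be sketched in a few lines. This assumes you have captured `git log --numstat --pretty=format:'COMMIT %H'` output to a string; the 200-line threshold is the same rough heuristic, not a precise tool:

```python
# Flag commits whose total added lines exceed a threshold, as a rough
# proxy for large AI-generated implementations.

def flag_large_commits(numstat_output: str, threshold: int = 200) -> list[str]:
    """Return commit hashes whose total added line count exceeds `threshold`."""
    flagged, current, added = [], None, 0
    for line in numstat_output.splitlines():
        if line.startswith("COMMIT "):
            if current is not None and added > threshold:
                flagged.append(current)
            current, added = line.split(" ", 1)[1], 0
        elif line.strip():
            adds = line.split("\t")[0]  # numstat lines: added<TAB>deleted<TAB>path
            if adds.isdigit():          # binary files show "-" instead of a count
                added += int(adds)
    if current is not None and added > threshold:
        flagged.append(current)
    return flagged

sample = "COMMIT abc123\n250\t4\tsrc/feature.py\nCOMMIT def456\n10\t2\tREADME.md"
print(flag_large_commits(sample))  # ['abc123']
```

Correlating the flagged commits with the "fewer than 3 manual edits in the surrounding days" condition still takes a second pass over the log, but this narrows the candidate set considerably.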

Step 2: Run the Comprehension Check

For each significant AI-generated module, ask yourself one question: can I explain, without reading the code, why this implementation works the way it does?

Not what it does. Why it does it that way. What tradeoffs were made. What alternatives were considered and rejected. What assumptions it relies on.

If you cannot answer that, you have comprehension debt on that module. The severity depends on how critical the module is and how likely you are to need to modify it.

I keep a simple spreadsheet for this. Module name, comprehension level (high, medium, low), criticality (high, medium, low), and last-reviewed date. Modules that are high-criticality and low-comprehension go to the top of the review queue.
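The prioritization logic behind that spreadsheet is simple enough to sketch. The field names and ranking scheme below are assumptions, not a prescribed tool; the point is that high-criticality, low-comprehension modules sort to the front of the queue:

```python
# Sketch of a comprehension-debt review queue. Module data is invented.

modules = [
    {"name": "billing/invoices", "comprehension": "low",  "criticality": "high"},
    {"name": "auth/session",     "comprehension": "high", "criticality": "high"},
    {"name": "reports/export",   "comprehension": "low",  "criticality": "low"},
]

RANK = {"low": 0, "medium": 1, "high": 2}

def review_queue(mods):
    """Sort lowest comprehension first, breaking ties by highest criticality."""
    return sorted(mods, key=lambda m: (RANK[m["comprehension"]], -RANK[m["criticality"]]))

for m in review_queue(modules):
    print(m["name"])
# billing/invoices (low comprehension, high criticality) lands at the top.
```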

Step 3: Audit Your Tests for Depth

Pull up the tests for your AI-generated code and categorize each test:

Surface tests verify that the function returns the expected output for a given input. These are the tests AI writes well. They confirm the code does what it was designed to do.

Integration tests verify that the module works correctly within the broader system. Does it handle the actual data shapes it receives in production? Does it behave correctly when upstream services are slow or unavailable?

Boundary tests verify behavior at the edges. Empty inputs. Null values. Concurrent access. Maximum load. The scenarios that AI tends to skip because they are not obvious from reading the existing code.

If your AI-generated module has 20 surface tests and zero boundary tests, you have verification debt. The coverage number looks great. The actual coverage is shallow.
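A minimal sketch of the surface-versus-boundary distinction, using an invented `parse_amounts` function (integration tests are omitted here because they depend on real system context):

```python
# Hypothetical function and tests illustrating test depth.

def parse_amounts(raw: str) -> list[float]:
    """Parse a comma-separated list of amounts, skipping blank entries."""
    return [float(p) for p in raw.split(",") if p.strip()]

# Surface test: expected output for a normal input. This is the kind
# of test AI-generated suites cover thoroughly.
def test_surface():
    assert parse_amounts("1.5,2.0") == [1.5, 2.0]

# Boundary tests: the edges that generated suites tend to skip.
def test_empty_input():
    assert parse_amounts("") == []

def test_stray_separators():
    assert parse_amounts(",,3,") == [3.0]
```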

Step 4: Map Your Patterns

This is the architectural drift check. Pick a common pattern in your codebase (error handling, database access, API response formatting, authentication checks) and search for every implementation of that pattern.

If you find three or more distinct approaches for the same pattern, that is drift. Not necessarily wrong, but it means your codebase is accumulating inconsistency that will slow down future development and make onboarding harder.

The fix is not to rewrite everything to match one pattern right now. The fix is to document which pattern is canonical, add it to your CLAUDE.md or rules file, and ensure future AI-generated code follows the standard. Then migrate the outliers incrementally.
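One rough way to automate the search, assuming a Python codebase and a regex per style you already know about (the patterns below are illustrative, not exhaustive):

```python
import re

# Count distinct error-handling styles across source files. Three or
# more styles in the result is the drift signal described above.

STYLES = {
    "raise":       re.compile(r"\braise\s+\w+"),
    "error_tuple": re.compile(r"return\s*\(\s*None\s*,"),
    "error_code":  re.compile(r"return\s*-1\b"),
}

def detect_styles(sources: dict[str, str]) -> dict[str, list[str]]:
    """Map each detected style to the files that use it."""
    found = {name: [] for name in STYLES}
    for path, text in sources.items():
        for name, pattern in STYLES.items():
            if pattern.search(text):
                found[name].append(path)
    return {name: paths for name, paths in found.items() if paths}

sources = {
    "a.py": "def f():\n    raise ValueError('bad')",
    "b.py": "def g():\n    return (None, 'bad')",
}
print(len(detect_styles(sources)))  # 2 distinct styles across two files
```

In a real run you would load `sources` by walking the repository; the hard part is writing patterns specific enough to distinguish your codebase's actual conventions.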

Step 5: Build the Feedback Loop

The audit is not a one-time event. The debt keeps accumulating because the practices that create it (fast AI-generated code with compressed review cycles) are not going away.

Build the audit into your regular workflow:

  • Tag AI-generated commits so you can find them later
  • Schedule monthly comprehension reviews for critical modules
  • Add boundary test requirements to your PR review checklist
  • Run the pattern consistency check quarterly
  • Track your debt metrics over time so you can see if the situation is improving or deteriorating

What Good AI Code Review Actually Looks Like

The standard code review checklist does not work well for AI-generated code. Here is what I have adjusted in my own review process.

Read the spec first, not the code. Before looking at the AI's implementation, make sure you can clearly articulate what the code should do, what constraints it should respect, and what patterns it should follow. If you wrote a good spec using spec-driven development, this is straightforward. If you did not, the review is already compromised.

Trace one complex path end-to-end. Pick the most complicated flow through the new code and trace it manually. Not skim it. Actually follow the data from entry point to exit. This is where comprehension happens. If you find yourself skipping parts because "it probably works," that is a signal to slow down.

Check what is not there. AI-generated code tends to handle the cases it was told about. Look for what is missing. Error states that are not handled. Edge cases that are not tested. Logging that is not present. The absences are often more important than the additions.

Ask why, not just what. For every non-obvious design decision in the AI-generated code, ask yourself: is this the approach I would have taken? If not, is there a good reason the AI chose this approach, or did it just pick the first viable option? Understanding the "why" behind AI decisions is how you prevent comprehension debt from accumulating.

Run the code, do not just read it. Reading AI-generated code is deceptively easy because it is usually clean and well-structured. Actually running it (stepping through with a debugger, throwing unexpected inputs at it) reveals things that reading alone misses.


The Cultural Shift That Matters Most

The hardest part of managing AI-generated debt is not technical. It is cultural.

When AI tools make it possible to ship features in minutes instead of hours, the pressure to move fast increases. Review becomes a bottleneck. Testing beyond the happy path feels like wasted time. Understanding the implementation deeply seems unnecessary when you can just ask the AI to explain it later.

Every one of those rationalizations is a debt deposit.

The teams I have talked to who manage this well share a common trait: they measure quality by what they understand, not by what they ship. A feature is not done when the tests pass. It is done when someone on the team can confidently modify it six months from now without the AI that wrote it.

This is the same principle I wrote about regarding AI brain fry. The temptation is always to maximize output. The sustainable approach is to optimize for long-term maintainability, even when the tools make short-term velocity feel free.


The Debt Is Not Going Away

I want to be clear: I am not saying stop using AI coding tools. I use them every day. They make me significantly more productive. The articles I have written about agentic coding and context engineering should make that obvious.

But productivity without understanding is a credit card, not a paycheck. You are spending capacity you have not earned. The bill comes due when you need to debug, extend, or refactor code that nobody on the team genuinely understands.

The developers and teams who will thrive in this era are not the ones who generate the most code. They are the ones who maintain genuine understanding of their codebases while using AI to work faster. That is a harder balance to strike than it sounds. But it is the only approach that does not end with a codebase nobody can maintain.

Run the audit. Tag your commits. Review with intention. The debt compounds fast, but so does the benefit of catching it early.

Top comments (1)

German Heller

The audit gap is real. Most teams have CI that checks if the code works (tests pass, types check, linter clean) but zero tooling that checks if the code makes sense architecturally. AI-generated code passes all the automated gates while quietly introducing patterns that no human on the team would have chosen.

The scariest version of this is when AI generates code that works but uses a completely different approach than the rest of the codebase. Now you have two implicit architectures coexisting, and nobody noticed because the PR looked clean and the tests passed.

Hooks that enforce architectural constraints (not just style) on every edit are the best defense I've found so far. Making the environment catch drift instead of relying on human reviewers to spot it in a 500-line diff.