Part 1 of “The AI Code Debt Tsunami” series
Six months ago, GitHub Copilot helped me write a user validation function in 30 seconds. Yesterday, it wrote the same function again. And again. Five different versions across my codebase, each slightly different, none aware of the others.
This isn’t a bug in the AI. This is the new normal: LLM coding agents are probabilistic by nature, and each interaction works from a limited slice of context.
We’re witnessing the fastest productivity boost in software development history. AI coding assistants have made us 5x-10x faster at writing individual functions. But there’s a dark side we’re only beginning to understand: AI-generated code creates tech debt at an unprecedented scale and speed.
Traditional tech debt accumulates gradually: messy code compounds over months into years. AI code debt accumulates exponentially, multiplied by the number of team members and the number of coding agents each of them uses. What used to take 18 months to become unmaintainable now happens in 6 weeks.
The tsunami is here. Most teams don’t even see the wave.
The Paradox: Going Fast While Falling Behind
Here’s what I observed while building receiptclaimer, a receipt processing SaaS:
Week 1–2: 🚀 Amazing! We’re shipping features daily. Copilot writes boilerplate, Claude helps with complex logic, ChatGPT generates tests. We’re moving 3x faster than any team I’ve been on.
Week 3–4: 🤔 Hmm. Our AI assistants keep suggesting we create utilities that… already exist? They’re also suggesting 3 different patterns for the same API endpoint. Which one is “right”?
Week 5–6: 😰 Wait. Our codebase has 23 nearly-identical handler functions. Our import chains are 8 levels deep. AI tools are now giving worse suggestions because they can’t fit our context into their windows. We’ve gone from 10x faster to 5x faster, and the slide will likely continue.
The math: 1 month of 10x productivity ≈ 10 months of traditional work. But we also accumulated what feels like 24 months of tech debt. Net result? We have to shift our focus toward maintaining that high multiplier.
This is the AI code debt paradox: The faster AI helps you write code, the faster you accumulate debt you can’t see. Eventually, you end up with a codebase only AI can navigate, not humans.
The Four Horsemen of AI Code Debt
After analyzing dozens of AI-assisted projects (including my own), I’ve identified four distinct problems that traditional metrics completely miss:
1. Knowledge Cutoff Gaps (The Outdated Pattern Problem)
AI models have training cutoffs. GPT-4’s knowledge ends in April 2023. Claude’s is a bit later. But your best practices evolved last month.
The result: AI confidently suggests patterns that were deprecated in your codebase months ago. It recommends libraries you’ve already migrated away from. It writes code that technically works but violates architectural decisions made after its training data was collected.
Real example from receiptclaimer:
// AI suggested this in November 2025:
app.get('/api/receipts', async (req, res) => {
  const { userId } = req.query;
  // ... validation logic
});

// But this is the pattern we standardized on months ago:
app.get('/api/receipts', withAuth(async (req, res) => {
  const userId = req.user.id; // From auth middleware
  // ... no validation needed, it's in middleware
}));
AI didn’t know about our withAuth middleware because it was created 3 months after training cutoff. Result? 18 endpoints using the old pattern, 12 using the new one. All written by AI. All technically correct. All inconsistent.
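One cheap guard is a script that flags routes still written in the pre-middleware style. Here’s a minimal sketch of the idea (withAuth comes from the example above; the directory layout and the regex heuristic are simplified assumptions, not a drop-in tool):

// check-auth-pattern.ts (hypothetical sketch): flag Express routes that skip withAuth
// Assumes route files live directly under src/api and use the app.get('/path', handler) style.
import { readFileSync, readdirSync } from 'fs';
import { join } from 'path';

const apiDir = join(process.cwd(), 'src', 'api');

for (const file of readdirSync(apiDir).filter((f) => f.endsWith('.ts'))) {
  const source = readFileSync(join(apiDir, file), 'utf8');
  // Rough heuristic: a route whose handler argument does not start with withAuth.
  const routePattern = /app\.(get|post|put|delete)\(\s*['"`][^'"`]+['"`]\s*,(?!\s*withAuth)/g;
  let match: RegExpExecArray | null;
  while ((match = routePattern.exec(source)) !== null) {
    const line = source.slice(0, match.index).split('\n').length;
    console.log(`${file}:${line} route declared without withAuth middleware`);
  }
}

A check like this can run in CI, so the deprecated pattern can’t quietly creep back in every time an assistant autocompletes a new endpoint.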
2. Model Preference Drift (The Team Chaos Problem)
Your frontend dev prefers Cursor. Your backend dev swears by GitHub Copilot. Your junior dev uses ChatGPT. Each AI has different preferences for how to solve problems.
The result: Your codebase becomes a Frankenstein of 3 different “AI styles,” each internally consistent, but totally incompatible with each other.
Real example:
Copilot likes this: const user = await db.users.findById(userId)
Claude prefers: const user = await getUserById(userId) (wrapped in helper)
ChatGPT suggests: const user = await User.findById(userId) (ORM style)
All three work. None are wrong. But when you have all three scattered across 100 files, your AI assistants get confused trying to help with refactoring. Which pattern should they follow?
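One way to make this drift visible is to tally how often each data-access style actually appears. Here’s a minimal sketch, assuming the three styles above and a src/ tree of TypeScript files (the regexes are illustrative, not exhaustive):

// count-access-styles.ts (hypothetical sketch): tally competing data-access styles
import { readFileSync, readdirSync, statSync } from 'fs';
import { join } from 'path';

const styles: Record<string, RegExp> = {
  'db helper (db.users.findById)': /\bdb\.\w+\.findById\(/g,
  'wrapper function (getUserById)': /\bget[A-Z]\w*ById\(/g,
  'ORM model (User.findById)': /\b[A-Z]\w*\.findById\(/g,
};

// Recursively collect .ts files under a directory.
function walk(dir: string): string[] {
  return readdirSync(dir).flatMap((name) => {
    const full = join(dir, name);
    if (statSync(full).isDirectory()) return walk(full);
    return full.endsWith('.ts') ? [full] : [];
  });
}

const counts: Record<string, number> = {};
for (const file of walk('src')) {
  const source = readFileSync(file, 'utf8');
  for (const [label, pattern] of Object.entries(styles)) {
    counts[label] = (counts[label] ?? 0) + (source.match(pattern)?.length ?? 0);
  }
}
console.table(counts);

Once the counts are visible, the team can pick one style on purpose and watch the other two trend toward zero, instead of letting three assistants vote with every autocomplete.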
3. Undetected Semantic Duplicates (The Invisible Repetition Problem)
This is the most insidious one. AI generates code that looks different but does the same thing.
Traditional duplicate detection tools (like jscpd) only catch copy-paste duplicates — exact text matches. But AI never copy-pastes. It generates fresh code every time, with different variable names, slightly different logic, but functionally identical.
Real example from receiptclaimer:
// File 1: src/api/receipts.ts
const validateReceipt = (data) => {
  if (!data.amount || data.amount <= 0) return false;
  if (!data.date || new Date(data.date) > new Date()) return false;
  if (!data.merchant || data.merchant.trim().length === 0) return false;
  return true;
}

// File 2: src/services/receipt-validator.ts
export function isValidReceipt(receipt) {
  const hasAmount = receipt.amount && receipt.amount > 0;
  const hasValidDate = receipt.date && new Date(receipt.date) <= new Date();
  const hasMerchant = receipt.merchant?.trim().length > 0;
  return hasAmount && hasValidDate && hasMerchant;
}

// File 3: src/utils/validation.ts
class ReceiptValidator {
  static validate(r) {
    return r.amount > 0 &&
      new Date(r.date) <= new Date() &&
      r.merchant.trim() !== '';
  }
}
Three different files. Three different names. Three different syntaxes. Same exact logic.
Traditional linters see zero duplication (0% text overlap). Meanwhile, these duplicates waste hundreds of AI tokens and confuse the models. When Copilot sees all three, it doesn’t know which pattern to follow, so it creates a fourth variant.
We found 23 of these in our codebase. That’s 8,450 tokens of wasted context every time AI tries to understand our validation logic.
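Detecting these means comparing what code does, not what it says. A real detector works at the AST or embedding level; the crude sketch below only neutralizes naming and formatting, but it shows why text-overlap tools come up empty (this is an illustration of the approach, not how @aiready/pattern-detect is implemented):

// near-duplicate-sketch.ts (hypothetical): compare two snippets after stripping names away
// Collapse every identifier to a placeholder and split into tokens.
function normalize(code: string): string[] {
  return (
    code
      .replace(/\/\/.*$/gm, '') // drop line comments
      .replace(/[A-Za-z_$][\w$]*/g, 'ID') // collapse identifiers and keywords
      .replace(/\s+/g, ' ')
      .match(/ID|\S/g) ?? []
  );
}

// Jaccard similarity over 3-token shingles of the normalized code.
function similarity(a: string, b: string): number {
  const shingles = (tokens: string[]) =>
    new Set(tokens.slice(0, -2).map((_, i) => tokens.slice(i, i + 3).join(' ')));
  const sa = shingles(normalize(a));
  const sb = shingles(normalize(b));
  const overlap = [...sa].filter((s) => sb.has(s)).length;
  return overlap / (sa.size + sb.size - overlap || 1);
}

// e.g. flag any pair of functions whose similarity score is high, say above 0.7

Because every identifier collapses to the same placeholder, renaming validateReceipt to isValidReceipt no longer hides anything. Structural differences, like File 1’s if/return style versus File 2’s boolean expressions, still need AST normalization or embeddings to catch, which is exactly where plain text matching gives up.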
4. Context Fragmentation (The Token Budget Problem)
AI models have limited context windows. GPT-4 Turbo has 128K tokens. Claude 3.5 has 200K. Sounds like a lot, right?
Wrong.
When your code is fragmented across dozens of files with deep import chains, AI needs to load massive amounts of context just to understand one function.
Real example:
// src/api/users.ts (850 tokens)
import { getUserById } from '../services/user-service'; // +2,100 tokens
import { validateUser } from '../utils/user-validation'; // +1,800 tokens
import { UserModel } from '../models/user'; // +2,100 tokens
import { logger } from '../lib/logger'; // +450 tokens
import { cache } from '../helpers/cache'; // +900 tokens
export const getUser = async (id) => {
  // 20 lines of actual code
}
To understand this 20-line function, AI needs to load:
The function itself: 850 tokens
All its imports: 7,350 tokens
Their transitive dependencies: ~4,000 more tokens
Total: 12,200 tokens for a 20-line function.
Now multiply this across your entire codebase. We discovered that some of our “simple” user management operations were costing 15,000+ tokens just for AI to understand the context. That’s more than 10% of GPT-4 Turbo’s context window for one feature domain.
The result? AI gives incomplete answers, misses important context, or suggests refactorings that break transitive dependencies it couldn’t fit in its window.
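You can get a rough feel for this cost without special tooling. Here’s a minimal sketch, assuming relative .ts imports and the common ~4-characters-per-token heuristic (real tokenizers and real module resolution are more involved):

// context-cost-sketch.ts (hypothetical): estimate tokens needed to load a file plus its local imports
import { readFileSync, existsSync } from 'fs';
import { dirname, resolve } from 'path';

const CHARS_PER_TOKEN = 4; // rough heuristic, not a real tokenizer

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

function contextCost(file: string, seen = new Set<string>()): number {
  if (seen.has(file) || !existsSync(file)) return 0;
  seen.add(file);
  const source = readFileSync(file, 'utf8');
  let total = estimateTokens(source);
  // Follow relative imports only; naive resolution assumes './foo' maps to './foo.ts'.
  for (const [, spec] of source.matchAll(/from\s+['"](\.[^'"]+)['"]/g)) {
    total += contextCost(resolve(dirname(file), spec) + '.ts', seen);
  }
  return total;
}

console.log(`~${contextCost('src/api/users.ts')} tokens to load this file and its import chain`);

Even a crude estimate like this makes the 12,200-token figure above feel less abstract: every extra level of import depth drags another file’s worth of context into the window.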
Why Traditional Metrics Miss This Entirely
If you’re running SonarQube, CodeClimate, or similar tools, you probably feel pretty confident about your code quality. You shouldn’t be.
Traditional metrics were designed for human code review, not AI code consumption:
Cyclomatic complexity: Measures branching logic (good for humans debugging). Useless for detecting semantic duplicates.
Code coverage: Measures test coverage (good for reliability). Doesn’t detect context fragmentation.
Duplication detection: Measures text similarity (catches copy-paste). Blind to AI-generated semantic duplicates.
Dependency graphs: Shows imports (good for architecture). Doesn’t measure token cost.
None of these tools answer the questions that matter in an AI-first world:
How much does it cost AI to understand this file?
Are there semantically similar patterns AI keeps recreating?
Is this code organized in a way AI can consume efficiently?
Will AI suggestions be consistent with our existing patterns?
We’re using 2015 metrics for 2025 problems.
The Real Cost (In Numbers You Can Measure)
Let me translate this into business impact, using real numbers from receiptclaimer:
Before AI-readiness optimization:
23 semantic duplicate patterns (undetected by traditional tools)
Average context budget per feature: 12,000 tokens
AI response quality: ~60% useful without additional clarification
Time to onboard new AI patterns: ~2 hours of prompt engineering per feature
Developer frustration: High (AI keeps suggesting “wrong” patterns)
Impact on velocity:
Week 1–4: 10x faster than baseline ✅
Week 5–12: 5x faster than baseline ⚠️
Week 13–20: pretty much baseline ❌
Week 21+: Velocity crisis — considering partial rewrite ❌❌
The hidden cost: We spent 4 months going fast in the wrong direction. The refactoring tax came due, and it was massive.
What Comes Next
Here’s the uncomfortable truth: Every team using AI coding assistants is accumulating this debt right now. The only difference is some realize it, most don’t.
The good news? This is measurable. Fixable. Preventable.
Over the next few weeks, I’m going to break down:
How to detect semantic duplicates AI creates (even traditional tools miss)
How to measure context costs and fragmentation
How to optimize your codebase so AI tools work with your patterns instead of against them
Real case study of how we refactored receiptclaimer and quantified the results
I built the aiready suite of tools to solve this problem for myself and my team. It’s open source, configurable, and designed for the AI-first development workflow.
@aiready/cli: Unified CLI for running the analysis tools below, together or individually
@aiready/pattern-detect: Detect semantic duplicate patterns that waste AI context window tokens
@aiready/context-analyzer: Analyze context window costs, import depth, cohesion, and fragmentation
@aiready/consistency: Check naming conventions and pattern consistency across your codebase
Because here’s what I learned: Making your codebase AI-ready doesn’t just make AI better. It makes your code better for humans too.
Clean, consistent, well-organized code has always been the ideal. AI just makes the cost of not doing it much more immediate and painful.
The tsunami is here. But we can learn to surf!
Next in this series: Part 2 — “Why Your Codebase is Invisible to AI (And What to Do About It)” — We’ll dive deep into semantic duplicates and context fragmentation, with concrete examples and detection strategies.
Have questions or war stories about AI-generated tech debt? Drop them in the comments. I read every one.


