klement Gunndu

AI-Generated Code Is Building Tech Debt You Can't See

Your team shipped more features last quarter than any quarter before. The AI coding tools are working. Everyone feels faster.

Then you look at the codebase six months later and nothing makes sense.

GitClear analyzed 211 million changed lines of code across repositories from Google, Microsoft, and Meta between 2020 and 2024. Their finding: copy-pasted code rose from 8.3% to 12.3% of all changes, while refactored code dropped from 25% to under 10%. Duplicated code blocks increased eightfold. The codebase is growing, but the architecture is rotting.

This is not traditional tech debt. Traditional debt comes from shortcuts under deadline pressure. AI-generated tech debt comes from code that works, passes tests, and reads fine — but lacks architectural judgment.

The Measurement Problem

Ox Security analyzed 300 repositories and found 10 recurring anti-patterns, each appearing in 80-100% of repos with AI-generated code. The top offenders: excessive commenting (90-100% of repos), avoidance of refactoring (80-90%), and duplicated bug patterns across files (80-90%). They called AI-generated code "highly functional but systematically lacking in architectural judgment."

The METR study made this concrete. Sixteen experienced open-source developers (maintainers of repositories averaging 22,000+ stars) were randomly assigned real tasks with and without AI tools. The result: developers using AI took 19% longer to complete tasks. Yet when surveyed afterward, those same developers estimated they had been 20% faster. The perception gap was 39 percentage points.

If your team cannot measure the debt, they cannot manage it. Here are five detection patterns that surface AI-generated tech debt before it compounds.

Pattern 1: Cyclomatic Complexity Drift Detection

AI-generated code tends to solve problems by adding conditions rather than abstracting patterns. A function that started at complexity 5 slowly grows to 15 as the AI adds edge case handling inline rather than extracting helper functions.

Track complexity over time, not just at a single point.

"""
complexity_tracker.py — Track cyclomatic complexity drift per function.
Requires: pip install radon
Radon docs: https://radon.readthedocs.io/
"""
import json
import subprocess
import sys
from datetime import date
from pathlib import Path


def get_complexity(source_dir: str) -> list[dict]:
    """Run radon cc and return per-function complexity scores."""
    # -j emits JSON. No rank filter here: adding e.g. "-n C" would drop
    # low-complexity functions from the baseline, hiding drift that starts small.
    result = subprocess.run(
        ["radon", "cc", source_dir, "-j"],
        capture_output=True, text=True, check=True,
    )
    raw = json.loads(result.stdout)
    functions = []
    for filepath, blocks in raw.items():
        for block in blocks:
            functions.append({
                "file": filepath,
                "name": block["name"],
                "complexity": block["complexity"],
                "lineno": block["lineno"],
            })
    return functions


def load_baseline(path: Path) -> dict:
    """Load previous complexity snapshot."""
    if path.exists():
        return json.loads(path.read_text())
    return {}


def detect_drift(baseline: dict, current: list[dict], threshold: int = 3) -> list[dict]:
    """Flag functions whose complexity increased beyond threshold."""
    alerts = []
    for func in current:
        key = f"{func['file']}::{func['name']}"
        prev = baseline.get(key, {}).get("complexity", func["complexity"])
        delta = func["complexity"] - prev
        if delta >= threshold:
            alerts.append({
                "function": key,
                "was": prev,
                "now": func["complexity"],
                "delta": delta,
                "line": func["lineno"],
            })
    return alerts


def save_snapshot(functions: list[dict], path: Path) -> None:
    """Save current complexity as the new baseline."""
    snapshot = {}
    for f in functions:
        key = f"{f['file']}::{f['name']}"
        snapshot[key] = {
            "complexity": f["complexity"],
            "date": str(date.today()),
        }
    path.write_text(json.dumps(snapshot, indent=2))


if __name__ == "__main__":
    source = sys.argv[1] if len(sys.argv) > 1 else "src"
    baseline_path = Path(".complexity-baseline.json")

    current = get_complexity(source)
    baseline = load_baseline(baseline_path)
    alerts = detect_drift(baseline, current)

    if alerts:
        print(f"Found {len(alerts)} complexity drift alerts:")
        for a in alerts:
            print(f"  {a['function']} line {a['line']}: "
                  f"{a['was']} -> {a['now']} (+{a['delta']})")
        sys.exit(1)  # fail the build; the baseline stays unchanged so the drift remains visible
    else:
        print(f"No drift detected across {len(current)} functions.")

    save_snapshot(current, baseline_path)

Run this in CI on every pull request, and commit .complexity-baseline.json to the repository so each run has a snapshot to compare against. When a function's complexity jumps by 3 or more since the last baseline, the build flags it. The developer must either justify the increase or refactor before merging.

The threshold of 3 is deliberate. A single if adds 1 point. Three conditional branches added to one function in a single PR almost always means inline logic that should be extracted.
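To make the counting concrete, here is a hypothetical function (not from any real codebase) annotated with how radon-style cyclomatic complexity accrues: the base score is 1, and each branch point (if, elif, for, while, plus each boolean operator inside a condition) adds a point.

```python
def shipping_cost(weight: float, express: bool, country: str) -> float:
    # Base complexity: 1
    cost = weight * 2.0
    if express:                       # +1 -> 2
        cost *= 1.5
    if country != "US":               # +1 -> 3
        cost += 10.0
    if weight > 20 and not express:   # +2 (the if plus the "and") -> 5
        cost += 5.0
    return cost
```

Three more inline edge cases like these would trip the drift threshold in a single PR; extracting them into helpers keeps each unit's score flat.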

Pattern 2: Clone Detection With Structural Matching

AI models generate code by statistical prediction. When similar problems appear in different parts of a codebase, the model generates similar — but not identical — solutions. These near-duplicates are harder to find than exact copies.

jscpd (copy-paste detector) catches both exact and near-duplicates across 150+ languages.

# Install: npm install -g jscpd
# Docs: https://github.com/kucherenko/jscpd

# Scan your source directory for duplicates
jscpd ./src --min-lines 5 --min-tokens 50 --reporters consoleFull

# Output shows duplicate blocks with file locations:
# Clone found (Python):
#   src/auth/login.py [10:25]
#   src/auth/register.py [15:30]
#   Lines: 15, Tokens: 89

# Set a duplication threshold for CI
# Configure in .jscpd.json: {"threshold": 5}
jscpd ./src --threshold 5 --reporters consoleFull

The --threshold flag turns this into a CI gate. GitClear's data shows the industry average crossed 12% duplication in 2024. Set your threshold at your current level and ratchet it down each quarter.
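A committed config file keeps the gate reproducible across machines and CI. A minimal .jscpd.json sketch, using jscpd's documented camelCase option names; the ignore globs are examples for a mixed Node/Python repo:

```json
{
  "threshold": 5,
  "minLines": 5,
  "minTokens": 50,
  "ignore": ["**/node_modules/**", "**/__pycache__/**"],
  "reporters": ["consoleFull"]
}
```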

For Python-specific detection, pylint has a built-in duplicate checker:

# Uses Pylint's similarity checker across your codebase
# Docs: https://pylint.readthedocs.io/
pylint --disable=all --enable=duplicate-code src/

Both approaches complement each other. jscpd catches structural similarity across languages. Pylint catches Python-specific patterns like duplicated class hierarchies and repeated decorator chains.
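The similarity checker is tuned in your Pylint config. A sketch of the [SIMILARITIES] section (option names from Pylint's docs; the thresholds here are suggestions, not defaults):

```ini
# .pylintrc
[SIMILARITIES]
# Minimum lines a block must span to count as a duplicate (Pylint's default is 4).
min-similarity-lines=6
# Ignore comments and docstrings so only real logic triggers the checker.
ignore-comments=yes
ignore-docstrings=yes
# Count duplicated import blocks too; AI tools often repeat them verbatim.
ignore-imports=no
```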

Pattern 3: Dead Code Accumulation Tracking

AI assistants frequently generate utility functions, helper classes, and imports that the final implementation never uses. Over weeks of AI-assisted development, dead code accumulates silently.

vulture detects unused Python code by analyzing ASTs:

# Install: pip install vulture
# Docs: https://github.com/jendrikseipp/vulture

# Scan for dead code with 80% confidence threshold
vulture src/ --min-confidence 80

# Output:
# src/utils/helpers.py:45: unused function 'format_response' (90% confidence)
# src/models/user.py:12: unused import 'Optional' (100% confidence)
# src/api/routes.py:89: unused variable 'temp_cache' (80% confidence)

The confidence scoring matters. At 100%, vulture is certain the code is unused within the analyzed files. At 60%, there might be dynamic usage (getattr calls, string-based dispatch) that the static analysis missed. Start at 80% for CI gates and 60% for manual review.
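When a finding is a verified false positive (plugin registration, dynamic dispatch), vulture's whitelist workflow beats lowering the confidence bar. A sketch; the whitelist filename is arbitrary:

```shell
# One-time: dump current findings, then hand-prune to the verified false positives.
vulture src/ --make-whitelist > vulture_whitelist.py

# Ongoing: pass the whitelist as an extra path so those names count as used.
vulture src/ vulture_whitelist.py --min-confidence 80
```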

Track dead code percentage over time:

"""
dead_code_tracker.py — Track dead code accumulation over time.
Requires: pip install vulture
"""
import subprocess
import json
from datetime import date
from pathlib import Path


def count_dead_code(source_dir: str, min_confidence: int = 80) -> dict:
    """Run vulture and count findings by type."""
    # vulture exits nonzero when it finds dead code, so check=True would raise;
    # parse stdout instead and let the caller decide how to react.
    result = subprocess.run(
        ["vulture", source_dir, f"--min-confidence={min_confidence}"],
        capture_output=True, text=True,
    )
    lines = result.stdout.strip().split("\n") if result.stdout.strip() else []
    counts = {"unused_function": 0, "unused_import": 0, "unused_variable": 0, "other": 0}
    for line in lines:
        if "unused function" in line:
            counts["unused_function"] += 1
        elif "unused import" in line:
            counts["unused_import"] += 1
        elif "unused variable" in line:
            counts["unused_variable"] += 1
        else:
            counts["other"] += 1
    counts["total"] = len(lines)
    counts["date"] = str(date.today())
    return counts


def append_history(counts: dict, history_path: Path) -> None:
    """Append today's count to the tracking history."""
    history = []
    if history_path.exists():
        history = json.loads(history_path.read_text())
    history.append(counts)
    history_path.write_text(json.dumps(history, indent=2))


if __name__ == "__main__":
    counts = count_dead_code("src")
    append_history(counts, Path(".dead-code-history.json"))
    print(f"Dead code: {counts['total']} findings "
          f"({counts['unused_function']} functions, "
          f"{counts['unused_import']} imports, "
          f"{counts['unused_variable']} variables)")

When dead code count climbs week over week, something is generating code nobody uses. That is the signal to review AI-assisted PRs more carefully.
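That week-over-week check can read the history file the tracker writes. A minimal sketch; the three-run window and the strictly-increasing rule are assumptions, so tune them to your cadence:

```python
import json
from pathlib import Path


def is_rising(history: list[dict], window: int = 3) -> bool:
    """True when the 'total' finding count rose at every step of the last window."""
    totals = [entry["total"] for entry in history[-window:]]
    if len(totals) < window:
        return False  # not enough data points yet
    return all(later > earlier for earlier, later in zip(totals, totals[1:]))


if __name__ == "__main__":
    path = Path(".dead-code-history.json")
    history = json.loads(path.read_text()) if path.exists() else []
    if is_rising(history):
        print("Dead code rose three runs in a row -- review recent AI-assisted PRs.")
```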

Pattern 4: Refactoring Ratio Gate

GitClear's most striking finding was not that duplication increased — it was that refactoring collapsed. From 25% of all code changes in 2021 to under 10% in 2024. AI tools generate new code. They rarely suggest consolidating existing code.

Measure the ratio of refactoring to new code in every sprint:

"""
refactor_ratio.py — Measure refactoring vs new code ratio from git history.
Uses git log to classify commits as refactoring or feature work.
"""
import subprocess
import re
import sys


def get_recent_commits(days: int = 14) -> list[str]:
    """Get commit messages from the last N days."""
    result = subprocess.run(
        ["git", "log", f"--since={days} days ago",
         "--pretty=format:%s", "--no-merges"],
        capture_output=True, text=True, check=True,
    )
    return [line.strip() for line in result.stdout.split("\n") if line.strip()]


def classify_commits(messages: list[str]) -> dict:
    """Classify commits as refactor, feature, fix, or other."""
    refactor_patterns = re.compile(
        r"refactor|extract|consolidate|simplify|rename|restructure|deduplicate|cleanup|clean up",
        re.IGNORECASE,
    )
    feature_patterns = re.compile(
        r"add|implement|create|build|introduce|new|feature",
        re.IGNORECASE,
    )
    fix_patterns = re.compile(r"fix|bug|patch|resolve|hotfix", re.IGNORECASE)

    counts = {"refactor": 0, "feature": 0, "fix": 0, "other": 0}
    for msg in messages:
        if refactor_patterns.search(msg):
            counts["refactor"] += 1
        elif feature_patterns.search(msg):
            counts["feature"] += 1
        elif fix_patterns.search(msg):
            counts["fix"] += 1
        else:
            counts["other"] += 1
    return counts


def compute_ratio(counts: dict) -> float:
    """Compute refactoring ratio as percentage of total commits."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts["refactor"] / total) * 100


if __name__ == "__main__":
    days = int(sys.argv[1]) if len(sys.argv) > 1 else 14
    commits = get_recent_commits(days)
    counts = classify_commits(commits)
    ratio = compute_ratio(counts)

    print(f"Last {days} days: {len(commits)} commits")
    print(f"  Refactoring: {counts['refactor']} ({ratio:.1f}%)")
    print(f"  Features:    {counts['feature']}")
    print(f"  Fixes:       {counts['fix']}")
    print(f"  Other:       {counts['other']}")

    if ratio < 15:
        print(f"\nRefactoring ratio ({ratio:.1f}%) is below 15% threshold.")
        print("Consider scheduling dedicated refactoring time.")

This is a proxy metric. Commit messages are noisy. But the trend matters more than any single measurement. If your refactoring ratio drops below 15% for three consecutive sprints, your codebase is accumulating structural debt regardless of the source.
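A quick worked example of the heuristic, using a trimmed-down version of the same regex idea (the commit messages are hypothetical):

```python
import re

# Simplified refactor pattern; the full script above matches more verbs.
REFACTOR = re.compile(r"refactor|extract|simplify", re.IGNORECASE)

messages = [
    "Refactor auth token parsing",
    "Add billing webhook endpoint",
    "Fix off-by-one in pagination",
    "Extract shared retry helper",
]
refactors = sum(1 for msg in messages if REFACTOR.search(msg))
ratio = refactors / len(messages) * 100
print(f"Refactoring ratio: {ratio:.1f}%")  # 2 of 4 commits -> 50.0%
```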

The fix is not to stop using AI tools. The fix is to schedule explicit refactoring time — separate from feature work, tracked separately in your sprint. AI tools generate. Humans consolidate. Both steps are necessary.

Pattern 5: Architectural Boundary Enforcement

AI-generated code does not respect module boundaries. A function in src/auth/ might import directly from src/billing/ because the model saw that pattern somewhere in its training data. Over time, the dependency graph degrades into a tangle.

Enforce boundaries with import rules:

"""
boundary_check.py — Enforce architectural boundaries via import analysis.
Uses Python's ast module (standard library) to parse imports.
"""
import ast
import sys
from pathlib import Path


# Define allowed imports between modules.
# Each key is a module, values are modules it MAY import from.
ALLOWED_IMPORTS = {
    "auth": {"models", "utils", "config"},
    "billing": {"models", "utils", "config"},
    "api": {"auth", "billing", "models", "utils", "config"},
    "models": {"utils", "config"},
    "utils": {"config"},
    "config": set(),
}


def get_module_name(filepath: Path, src_root: Path) -> str:
    """Extract the top-level module name from a file path."""
    relative = filepath.relative_to(src_root)
    return relative.parts[0] if len(relative.parts) > 1 else ""


def check_imports(filepath: Path, src_root: Path) -> list[dict]:
    """Parse a Python file and check imports against boundary rules."""
    module = get_module_name(filepath, src_root)
    if module not in ALLOWED_IMPORTS:
        return []

    violations = []
    source = filepath.read_text()
    tree = ast.parse(source, filename=str(filepath))

    for node in ast.walk(tree):
        targets = []
        if isinstance(node, ast.Import):
            # "import billing, auth" carries several names; check every one.
            for alias in node.names:
                parts = alias.name.split(".")
                if parts[0] in ALLOWED_IMPORTS and parts[0] != module:
                    targets.append(parts[0])
        elif isinstance(node, ast.ImportFrom) and node.module:
            parts = node.module.split(".")
            if parts[0] in ALLOWED_IMPORTS and parts[0] != module:
                targets.append(parts[0])

        for target in targets:
            if target not in ALLOWED_IMPORTS[module]:
                violations.append({
                    "file": str(filepath),
                    "line": node.lineno,
                    "module": module,
                    "imports": target,
                    "allowed": sorted(ALLOWED_IMPORTS[module]),
                })
    return violations


def scan_directory(src_root: Path) -> list[dict]:
    """Scan all Python files for boundary violations."""
    all_violations = []
    for pyfile in src_root.rglob("*.py"):
        all_violations.extend(check_imports(pyfile, src_root))
    return all_violations


if __name__ == "__main__":
    src = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("src")
    violations = scan_directory(src)

    if violations:
        print(f"Found {len(violations)} boundary violations:")
        for v in violations:
            print(f"  {v['file']}:{v['line']} "
                  f"'{v['module']}' imports '{v['imports']}' "
                  f"(allowed: {v['allowed']})")
        sys.exit(1)
    else:
        print("No boundary violations found.")

The ALLOWED_IMPORTS dictionary is your architecture. When the AI generates an import that crosses a boundary, the check fails. The developer must either fix the import or update the architecture — both of which force a deliberate decision.

This pattern scales. Start with top-level module boundaries. Add sub-module rules as the codebase grows. The AI does not know your architecture. This tool enforces it.
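Catching violations before they reach CI shortens the feedback loop. A pre-commit sketch, assuming boundary_check.py lives at the repo root (the hook id and name are arbitrary):

```yaml
# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: boundary-check
        name: architectural boundary check
        entry: python boundary_check.py src
        language: system
        pass_filenames: false
```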

Putting It Together: The CI Pipeline

Each pattern works independently. Together, they form a debt detection pipeline:

# .github/workflows/debt-detection.yml
name: Tech Debt Detection
on: [pull_request]

jobs:
  complexity-drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install radon
      - run: python complexity_tracker.py src

  clone-detection:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install -g jscpd
      - run: jscpd ./src --threshold 5

  dead-code:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install vulture
      - run: vulture src/ --min-confidence 80

  refactor-ratio:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - run: python refactor_ratio.py 14

  boundary-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python boundary_check.py src

None of these tools know whether code was written by a human or an AI. They measure structural quality. That is the point. The source does not matter. The architecture does.

What This Means for Your Team

The research is clear. AI coding tools increase output velocity. They also increase structural debt. The METR study showed experienced developers were 19% slower with AI tools while believing they were 20% faster — a 39 percentage point perception gap.

This does not mean AI tools are bad. It means teams need to pair generation speed with detection systems. The five patterns above give you concrete metrics: complexity drift, duplication percentage, dead code count, refactoring ratio, and boundary violations.

Track these metrics every sprint. Set thresholds. Ratchet them tighter over time. AI generates code faster than humans. Humans still need to maintain the architecture.

The teams that ship fast in 2026 will not be the ones that generate the most code. They will be the ones that detect and resolve structural debt before it compounds.


Follow @klement_gunndu for more AI engineering content. We're building in public.

Top comments

DevGab

Solid article — the GitClear and METR data really drive the point home. The perception gap (feeling 20% faster while being 19% slower) is probably the most dangerous finding here, because it means teams won't self-correct without measurement.

I agree with most of this, with one caveat: I think the framing slightly over-indexes on AI as the cause. The refactoring decline and duplication trends were already underway — AI just poured fuel on existing habits. Teams that were already rigorous about architecture tend to stay rigorous with AI tools. The ones that weren't are now generating debt faster.

The real challenge, in my experience, is that the mitigations (careful prompting, pushing back on AI's proposed approach, thorough code reviews, periodic refactoring) all require discipline that scales poorly with team size. On a small team where everyone knows the codebase, you can catch the subtle architectural drift. On a larger team? No-one wants to spend their entire day reviewing AI-generated PRs, and code reviews themselves become performative when you're staring at 100 changed files.

One practical tip I'd add to your detection patterns: aim for smaller, more frequent PRs. This is probably the single highest-leverage workflow change for teams using AI tools heavily. A 15-file PR gets a genuine review. A 100-file PR gets a rubber stamp. If AI is helping you write code faster, use that speed to ship smaller increments — not bigger ones. It makes every other mitigation (review quality, refactoring ratio, clone detection) actually feasible.

The "AI generates, humans consolidate" framing is exactly right. The problem is when teams treat the generation step as the finish line.

klement Gunndu

You're absolutely right that the refactoring decline predates AI — it accelerated trends that were already there. Your point about smaller, more frequent PRs is the practical lever I should have emphasized more. A 15-file PR gets genuine scrutiny; a 100-file one gets a rubber stamp regardless of whether AI wrote it. That's probably the single highest-leverage workflow change teams can make right now. The perception gap compounds exactly because the feedback loop is broken — teams feel faster, so they never measure, and the drift stays invisible until it's structural. Appreciate you adding that nuance.


klement Gunndu

The perception gap is exactly what makes this insidious — teams optimizing for velocity metrics look great on paper while the codebase quietly degrades. Measurement has to include churn rate and code half-life, not just output speed.


Andre Cytryn

The refactoring ratio finding is what stood out most: 25% to under 10% in three years, while teams keep measuring velocity by how fast new code ships. The architectural boundary enforcement pattern is underrated. Most teams have informal module rules that live in ADRs nobody reads, but codifying them as a CI check means they actually get enforced instead of gradually eroded. Have you seen teams resist adding these gates, or do they accept them once you frame it as quality metrics rather than productivity policing?


klement Gunndu

The resistance pattern is real, but it usually follows a predictable arc.

The initial pushback comes from senior engineers who see boundary checks as a vote of no-confidence in their judgment. Framing matters here — when you introduce it as "automated architecture documentation" rather than "enforcement," adoption goes up significantly. The CI check is just making implicit rules explicit.

What actually kills resistance fastest is the first time the gate catches a real issue in code review. Someone refactors a module boundary at 2 AM, the check flags it, and the team realizes the alternative was a silent regression that would have surfaced three sprints later. After that moment, the gates stop being "overhead" and start being "the thing that saved us."

The teams that struggle most are the ones where architectural boundaries were never defined clearly in the first place. You can't enforce what you haven't articulated — so the real work is getting alignment on what the boundaries are, not automating the enforcement.
