Snap just announced that AI agents generate over 65% of their new code. The same week, they laid off roughly 1,000 employees — 16% of their workforce. CEO Evan Spiegel shared the AI stat on April 15, 2026, and the stock actually went up about 7-8%.
Whatever you think about the human side of that story (and there's a lot to think about), it surfaces a very real engineering problem I've been wrestling with on my own teams: when the majority of your codebase is AI-generated, how do you stop it from becoming an unmaintainable mess?
I've been there. Not at Snap's scale, but I've watched a codebase go from "AI helps us move faster" to "nobody understands why this service exists" in about four months. Here's what actually goes wrong, and how to fix it.
The Root Cause: AI Code Has No Institutional Memory
Human developers write code with context. They know why the retry logic uses exponential backoff with a 30-second cap — because that one vendor's API starts dropping connections after 45 seconds. AI doesn't know that. It generates plausible code that works today and becomes a mystery tomorrow.
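To make that concrete, here's a minimal sketch of what that retry logic looks like when the "why" is written down next to the code. The vendor behavior and function names here are hypothetical, not from any real API:

```python
import random
import time


def backoff_delay(attempt: int, cap: float = 30.0) -> float:
    """Exponential backoff delay with jitter, capped at 30 seconds.

    The cap is exactly the kind of decision that needs a comment:
    in this (hypothetical) case, the vendor's API starts dropping
    connections after ~45 seconds, so waiting longer than 30s just
    guarantees a dead socket on the next attempt.
    """
    return min(2.0 ** attempt + random.random(), cap)


def call_with_retries(send, request, max_retries: int = 5):
    """Call `send` (any callable), retrying on ConnectionError."""
    for attempt in range(max_retries):
        try:
            return send(request)
        except ConnectionError:
            if attempt == max_retries - 1:
                raise
            time.sleep(backoff_delay(attempt))
```

An AI agent will happily generate the mechanics of this function. What it can't generate is the comment explaining the 30-second cap, because that knowledge lives in someone's head.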
The real danger isn't bad code. Modern AI agents write surprisingly decent code. The danger is context loss at scale. When 65% of your new code comes from AI, you're accumulating technical debt in the form of undocumented decisions, duplicated patterns, and subtle inconsistencies that no single person fully understands.
Step 1: Enforce Architecture Decision Records (ADRs) for AI-Generated Code
Before anything else, you need a system that captures why code exists, not just what it does. I started requiring lightweight ADRs for any AI-generated module that touches core business logic.
<!-- docs/adr/0042-order-validation-pipeline.md -->
# ADR-0042: Order Validation Pipeline
## Status: Accepted
## Date: 2026-04-10
## Context
AI agent generated the order validation pipeline based on prompt:
"Create validation middleware for incoming orders with fraud scoring."
## Decision
- Uses streaming validation (not batch) because order volume exceeds 10k/min
- Fraud score threshold set at 0.7 based on Q1 chargeback analysis
- Circuit breaker trips after 5 consecutive failures to fraud service
## Consequences
- Must update fraud threshold quarterly (see runbook RB-119)
- Circuit breaker fallback approves orders — acceptable per PM sign-off
The key insight: the ADR documents the human decisions that shaped the AI's output, not the AI's output itself. Without this, three months later someone asks "why does this approve orders when fraud detection is down?" and nobody knows.
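A rule like this only survives if CI enforces it. Here's a minimal sketch of such a check; the `ai-generated` marker and the `ADR-NNNN` reference convention are my assumptions, so adapt them to however your tooling tags generated modules:

```python
import re
from pathlib import Path

ADR_MARKER = re.compile(r"ADR-\d{4}")  # e.g. "ADR-0042"


def modules_missing_adr(service_dir: str) -> list[str]:
    """Return AI-generated modules whose header cites no ADR.

    Assumes (hypothetically) that generated modules carry an
    'ai-generated' comment somewhere in their first 20 lines.
    """
    missing = []
    for py_file in Path(service_dir).rglob("*.py"):
        head = "\n".join(py_file.read_text().splitlines()[:20])
        if "ai-generated" in head and not ADR_MARKER.search(head):
            missing.append(str(py_file))
    return missing
```

Wire it into CI so the job exits nonzero when the list is non-empty, and a generated module can't merge without a documented decision behind it.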
Step 2: Build Automated Consistency Checks
AI agents don't remember what they generated last week. Ask one to build a user service on Monday and an auth service on Friday, and you'll get two completely different error handling patterns. Multiply that across 65% of a codebase and you've got chaos.
I use a combination of custom linting rules and AST analysis to catch this drift:
# scripts/check_consistency.py
import ast
import sys
from pathlib import Path


def find_error_handling_patterns(filepath: str) -> list[str]:
    """Detect which exception types a file's handlers catch."""
    patterns = []
    tree = ast.parse(Path(filepath).read_text())
    for node in ast.walk(tree):
        if isinstance(node, ast.Try):
            for handler in node.handlers:
                if isinstance(handler.type, ast.Name):
                    patterns.append(handler.type.id)
    return patterns


def check_service_consistency(service_dir: str) -> int:
    """Flag services that deviate from the dominant error pattern.

    Returns the number of deviating files so CI can fail the build.
    """
    all_patterns = {}
    for py_file in Path(service_dir).rglob("*.py"):
        file_patterns = find_error_handling_patterns(str(py_file))
        all_patterns[str(py_file)] = set(file_patterns)

    if not all_patterns:
        return 0

    # Find the most common pattern set across files
    pattern_counts = {}
    for patterns in all_patterns.values():
        key = frozenset(patterns)
        pattern_counts[key] = pattern_counts.get(key, 0) + 1
    dominant = max(pattern_counts, key=pattern_counts.get)

    # Report deviations
    deviations = 0
    for filepath, patterns in all_patterns.items():
        if frozenset(patterns) != dominant:
            deviations += 1
            print(f"INCONSISTENCY: {filepath}")
            print(f"  Expected: {sorted(dominant)}")
            print(f"  Found: {sorted(patterns)}")
    return deviations


if __name__ == "__main__":
    sys.exit(1 if check_service_consistency(sys.argv[1]) else 0)
This is a simplified version, but the idea scales. Run it in CI and fail the build on deviations. Every PR that introduces a new pattern has to justify the deviation. It sounds heavy-handed, but when AI is churning out code at volume, you need automated guardrails for the things a human reviewer would instinctively catch, if only they had time to review everything.
Step 3: Implement Ownership Tags, Not Just CODEOWNERS
GitHub's CODEOWNERS file tells you who reviews code. But when AI generates most of your codebase, you need something stronger: semantic ownership that tracks who understands the intent behind a module.
# .service-owners.yml
services:
  order-validation:
    owner: platform-team
    intent-contact: jpark          # Person who wrote the original AI prompt
    last-human-review: 2026-04-01  # When a human last deeply reviewed this
    review-cadence: 30d            # Max days between thorough human reviews
    ai-generated: true
    context-doc: docs/adr/0042-order-validation-pipeline.md
  user-preferences:
    owner: growth-team
    intent-contact: mchen
    last-human-review: 2026-03-15
    review-cadence: 60d            # Lower risk, less frequent review
    ai-generated: true
    context-doc: docs/adr/0038-user-preferences.md
Then build a simple bot or CI check that flags modules overdue for human review. This is crucial. AI-generated code doesn't rot faster than human code, but it rots silently — nobody notices because nobody had the mental model in the first place.
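The overdue check itself is a few lines. A sketch, assuming the `.service-owners.yml` above has already been loaded into a dict (e.g. with PyYAML); the field names match the example file, but the function itself is my own invention:

```python
from datetime import date, timedelta


def overdue_services(owners: dict, today: date) -> list[str]:
    """Return services whose last human review is older than their
    review cadence, given the parsed .service-owners.yml structure
    (cadence like '30d', review date in ISO format)."""
    overdue = []
    for name, meta in owners.get("services", {}).items():
        cadence = timedelta(days=int(meta["review-cadence"].rstrip("d")))
        # str() handles both raw strings and YAML-parsed date objects
        last_review = date.fromisoformat(str(meta["last-human-review"]))
        if today - last_review > cadence:
            overdue.append(name)
    return overdue
```

Run it nightly and post the results somewhere your team will actually see them; a dashboard nobody reads is the same as no check at all.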
Step 4: Test Intent, Not Just Behavior
Here's the most counterintuitive lesson I've learned. When AI writes code, your tests need to change too. Standard unit tests verify that the code does what it does. But with AI-generated code, you need tests that verify the code does what you intended.
# tests/test_order_validation_intent.py
# OrderValidator, MockDownService, and sample_order come from the
# service under test and its test helpers.

def test_fraud_threshold_matches_business_requirement():
    """Intent: Orders above 0.7 fraud score should be rejected.

    Source: Q1 2026 chargeback analysis, see ADR-0042.
    """
    validator = OrderValidator()
    # Threshold is a business decision, not an implementation detail
    assert validator.fraud_threshold == 0.7, (
        "Fraud threshold changed from business-approved value. "
        "Requires PM sign-off before modifying. See ADR-0042."
    )


def test_circuit_breaker_fails_open_by_design():
    """Intent: When fraud service is down, approve orders.

    This is a deliberate business choice, not a bug.
    """
    validator = OrderValidator()
    validator.fraud_service = MockDownService()
    result = validator.validate(sample_order())
    # This SHOULD pass — failing open is intentional
    assert result.approved is True
    assert result.fraud_check_skipped is True
Notice how the test names and docstrings encode business intent. When an AI agent refactors this code six months from now and a test fails, the error message tells the agent (or the human reviewing its output) that this wasn't a bug — it was a deliberate choice.
Prevention: The 65% Rule Needs a 100% Understanding Rule
Here's my take after watching multiple teams scale up AI code generation: the percentage of AI-generated code should never exceed the percentage of code your team deeply understands.
If AI writes 65% of your code, at least 65% of your codebase needs to have clear ownership, documented intent, and recent human review. That sounds obvious, but almost nobody does it. Teams celebrate the productivity gains and skip the comprehension investment.
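If you track ownership metadata like the file in step 3, you can even turn this rule into a number CI can gate on. A rough sketch, using my own (hypothetical) definition of "understood" as having both a context doc and a recorded human review:

```python
def comprehension_gap(owners: dict) -> float:
    """Fraction of services that are AI-generated minus the fraction
    that are understood (context doc plus a recorded human review).
    A positive gap means you're generating code faster than you're
    understanding it."""
    services = owners.get("services", {})
    if not services:
        return 0.0
    ai = sum(1 for m in services.values() if m.get("ai-generated"))
    understood = sum(
        1 for m in services.values()
        if m.get("context-doc") and m.get("last-human-review")
    )
    return (ai - understood) / len(services)
```

The exact definition matters less than the trend: if the gap grows quarter over quarter, the comprehension debt is compounding.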
Whether Snap has these systems in place, I have no idea. But the broader trend is clear — companies are scaling AI code generation faster than they're scaling the practices needed to maintain it.
The tools I've described aren't complicated. ADRs, consistency checks, ownership tracking, intent-based tests. None of this is rocket science. The hard part is making it habitual before your codebase becomes a black box that technically works but nobody can safely change.
Start with the ADRs. That alone will save you more debugging hours than any AI agent will create.