I recently handed an entire feature to AI—zero review, full trust, "you got this" energy.
Three days later, I had 8,500 lines of code that technically worked, architecturally "made sense" when read file-by-file, and collectively formed a system I no longer understood.
Here's what went wrong.
The Scope Creep Was Silent
The task was straightforward: add an audit log for user permission changes. In my head, that's a controller, a service, a DB table, some frontend forms—maybe 2,000 lines total.
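The boring version I had in mind fits in a few dozen lines. Here's a minimal sketch of that shape; every name below is a hypothetical illustration, not the project's actual code, and the in-memory array stands in for the DB table:

```typescript
// Minimal audit-log sketch: one record type, one service, no factories.
// All names are hypothetical; an array stands in for the real DB table.
interface PermissionAuditEntry {
  actorId: number;      // who made the change
  targetUserId: number; // whose permissions changed
  oldRole: string;
  newRole: string;
  timestamp: Date;
}

class AuditLogService {
  private entries: PermissionAuditEntry[] = [];

  record(entry: PermissionAuditEntry): void {
    // In the real feature this would be a single INSERT.
    this.entries.push(entry);
  }

  forUser(targetUserId: number): PermissionAuditEntry[] {
    return this.entries.filter((e) => e.targetUserId === targetUserId);
  }
}

const audit = new AuditLogService();
audit.record({
  actorId: 1,
  targetUserId: 42,
  oldRole: "viewer",
  newRole: "admin",
  timestamp: new Date(),
});
console.log(audit.forUser(42).length); // 1
```

That's the whole feature, minus the controller and the form.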
I used Cursor in agent mode. I watched it scaffold. I watched it "think" about edge cases. I watched it create an AbstractPermissionAuditEventStrategyFactory and a PermissionAuditOrchestrator and something called a LogCompactionPipeline.
I didn't stop it because, individually, each file looked correct. Good separation of concerns. SOLID principles. Nice docstrings.
What I missed: No human was balancing completeness against necessity. AI doesn't feel cognitive load. It doesn't know that "you could abstract this" doesn't mean "you should."
By the time I checked wc -l, I had 4x the code with 10x the indirection.
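For contrast, here's the flavor of indirection the agent produced, boiled down to a sketch. The class names echo the real ones from the post, but the bodies are illustrative reconstructions, not the generated code:

```typescript
// Hypothetical reconstruction of the generated indirection.
// Class names echo the post; bodies are illustrative, not the real code.
const auditLog: string[] = [];

interface AuditEvent { type: string; userId: number }

interface AuditEventStrategy {
  handle(event: AuditEvent): void;
}

class PermissionAuditEventStrategy implements AuditEventStrategy {
  handle(event: AuditEvent): void {
    auditLog.push(`audited: ${event.type} for user ${event.userId}`);
  }
}

class AbstractPermissionAuditEventStrategyFactory {
  create(): AuditEventStrategy {
    // A factory for exactly one strategy that will ever exist.
    return new PermissionAuditEventStrategy();
  }
}

class PermissionAuditOrchestrator {
  constructor(private readonly factory: AbstractPermissionAuditEventStrategyFactory) {}

  orchestrate(event: AuditEvent): void {
    // Orchestrator -> factory -> strategy -> one push. Three hops per log line.
    this.factory.create().handle(event);
  }
}

new PermissionAuditOrchestrator(new AbstractPermissionAuditEventStrategyFactory())
  .orchestrate({ type: "PERMISSION_CHANGE", userId: 1 });

console.log(auditLog[0]); // "audited: PERMISSION_CHANGE for user 1"
```

Every class is individually defensible. Together they turn one array push into an architecture.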
The Testing Mirage
I asked AI to "ensure high test coverage." It delivered 90%+ coverage and a green CI pipeline.
The problem? It tested what passes, not what should pass.
I found tests like this:
```javascript
describe('AuditService', () => {
  it('should work as expected', () => {
    expect(true).toBe(true);
  });

  it('should handle audit events correctly', () => {
    const mockEvent = { type: 'PERMISSION_CHANGE', userId: 1 };
    // ... 20 lines of setup ...
    expect(service.process).toBeDefined();
  });
});
```
Coverage: ✅
Confidence: ❌
When I tried to refactor a field name, 40 tests broke—not because they were testing behavior, but because they were testing implementation details that AI copy-pasted from the source code. The tests weren't a safety net; they were a mirror.
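What I wanted instead was a test pinned to observable behavior through the public API, so renaming an internal field changes nothing. A sketch of that shape, using plain assertions so it's self-contained; `AuditService` here is a hypothetical stand-in, not the project's real class:

```typescript
// Behavior-focused test sketch: assert on observable output, not internals.
// "AuditService" is a hypothetical stand-in, not the project's real class.
class AuditService {
  private log: { action: string; userId: number }[] = [];

  process(event: { type: string; userId: number }): void {
    this.log.push({ action: event.type, userId: event.userId });
  }

  entriesFor(userId: number): { action: string; userId: number }[] {
    return this.log.filter((e) => e.userId === userId);
  }
}

const service = new AuditService();
service.process({ type: "PERMISSION_CHANGE", userId: 1 });

const entries = service.entriesFor(1);
console.assert(entries.length === 1, "one audit entry recorded");
console.assert(entries[0].action === "PERMISSION_CHANGE", "action preserved");
// Rename the private `log` field or reshape the internals: this stays green.
```

The difference is what breaks it: a behavior test fails when the system does something different, not when it's spelled differently.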
The Local Optimum Trap
Here's the strangest part: every single file, in isolation, looked good. Clean code. Proper types. Good naming.
But tracing a single user action through the system required jumping through 6 abstraction layers, 3 adapter patterns, and 2 "middleware pipelines." The AI had optimized each component locally while creating a global mess.
I realized I was reading the output of an entity that doesn't read—it generates. It doesn't understand the codebase; it predicts tokens based on context. When you let it architect, you get maximally "correct" code that minimizes comprehension.
The Reckoning
I spent a day trying to "fix" the 8.5K lines. Then I spent two hours reverting to a clean slate and rewriting it myself—2,100 lines, boring code, works perfectly.
What I Actually Learned
Tab completion was probably the sweet spot.
When AI acts as a pair programmer—suggesting the next line, filling boilerplate, offering alternatives—it augments my decision-making.
When AI acts as the primary author—making architectural choices I don't review—it creates debt at machine speed. Debt that looks deceptively professional.
My new rules:
- AI writes functions, not systems. I'll design the architecture; it can implement the utilities.
- Review everything. If I can't explain a line in code review, it doesn't ship—even if the AI wrote it.
- Write your own tests. Or at least, write the test cases and let AI fill the syntax. AI doesn't know what behavior matters; it knows what syntax compiles.
- Complexity budget. If AI generates more than 200 lines I didn't explicitly ask for, that's a red flag, not a feature.
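The "write your own tests" rule in practice: I write the test plan first, as named behaviors with expected outcomes, and only then let AI fill in the arrange/act plumbing. A sketch of what that human-authored part looks like (the case list is hypothetical):

```typescript
// The human-authored part: which behaviors matter, written down before any syntax.
// AI can fill the test bodies later, but it can't invent this list.
type TestCase = { name: string; expected: string };

const auditLogCases: TestCase[] = [
  { name: "records actor and target on every permission change", expected: "one entry per change" },
  { name: "rejects entries with an unknown actor", expected: "throws, writes nothing" },
  { name: "returns entries in chronological order", expected: "sorted by timestamp" },
];

// Sanity check that no case was silently dropped when AI fills in the bodies:
console.log(auditLogCases.length); // 3
```

Three deliberate cases beat forty generated mirrors.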
The 8,500-line revert commit is staying in my git history.
It's a reminder that "AI-generated" isn't a synonym for "well-engineered"—it's just fast. And fast technical debt is still debt, just with better syntax highlighting.
Have you hit similar walls with AI autonomy? Or am I just bad at prompting?