I recently handed an entire feature to AI—zero review, full trust, "you got this" energy.
Three days later, I had 8,500 lines of code that technically worked, architecturally "made sense" when read file-by-file, and collectively formed a system I no longer understood.
Here's what went wrong.
The Scope Creep Was Silent
The task was straightforward: add an audit log for user permission changes. In my head, that's a controller, a service, a DB table, some frontend forms—maybe 2,000 lines total.
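The boring version I had in mind fits in a few dozen lines. Here's a minimal sketch of that shape; every name below is a hypothetical illustration, not the project's actual code, and the in-memory array stands in for the DB table:

```typescript
// Minimal audit-log sketch: one record type, one service, no factories.
// All names are hypothetical; an array stands in for the real DB table.
interface PermissionAuditEntry {
  actorId: number;      // who made the change
  targetUserId: number; // whose permissions changed
  oldRole: string;
  newRole: string;
  timestamp: Date;
}

class AuditLogService {
  private entries: PermissionAuditEntry[] = [];

  record(entry: PermissionAuditEntry): void {
    // In the real feature this would be a single INSERT.
    this.entries.push(entry);
  }

  forUser(targetUserId: number): PermissionAuditEntry[] {
    return this.entries.filter((e) => e.targetUserId === targetUserId);
  }
}

const audit = new AuditLogService();
audit.record({
  actorId: 1,
  targetUserId: 42,
  oldRole: "viewer",
  newRole: "admin",
  timestamp: new Date(),
});
console.log(audit.forUser(42).length); // 1
```

That's the whole feature, minus the controller and the form.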
I used Cursor in agent mode. I watched it scaffold. I watched it "think" about edge cases. I watched it create an AbstractPermissionAuditEventStrategyFactory and a PermissionAuditOrchestrator and something called a LogCompactionPipeline.
I didn't stop it because, individually, each file looked correct. Good separation of concerns. SOLID principles. Nice docstrings.
What I missed: No human was balancing completeness against necessity. AI doesn't feel cognitive load. It doesn't know that "you could abstract this" doesn't mean "you should."
By the time I checked wc -l, I had 4x the code with 10x the indirection.
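For contrast, here's the flavor of indirection the agent produced, boiled down to a sketch. The class names echo the real ones from the post, but the bodies are illustrative reconstructions, not the generated code:

```typescript
// Hypothetical reconstruction of the generated indirection.
// Class names echo the post; bodies are illustrative, not the real code.
const auditLog: string[] = [];

interface AuditEvent { type: string; userId: number }

interface AuditEventStrategy {
  handle(event: AuditEvent): void;
}

class PermissionAuditEventStrategy implements AuditEventStrategy {
  handle(event: AuditEvent): void {
    auditLog.push(`audited: ${event.type} for user ${event.userId}`);
  }
}

class AbstractPermissionAuditEventStrategyFactory {
  create(): AuditEventStrategy {
    // A factory for exactly one strategy that will ever exist.
    return new PermissionAuditEventStrategy();
  }
}

class PermissionAuditOrchestrator {
  constructor(private readonly factory: AbstractPermissionAuditEventStrategyFactory) {}

  orchestrate(event: AuditEvent): void {
    // Orchestrator -> factory -> strategy -> one push. Three hops per log line.
    this.factory.create().handle(event);
  }
}

new PermissionAuditOrchestrator(new AbstractPermissionAuditEventStrategyFactory())
  .orchestrate({ type: "PERMISSION_CHANGE", userId: 1 });

console.log(auditLog[0]); // "audited: PERMISSION_CHANGE for user 1"
```

Every class is individually defensible. Together they turn one array push into an architecture.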
The Testing Mirage
I asked AI to "ensure high test coverage." It delivered 90%+ coverage and a green CI pipeline.
The problem? It tested what passes, not what should pass.
I found tests like this:
```javascript
describe('AuditService', () => {
  it('should work as expected', () => {
    expect(true).toBe(true);
  });

  it('should handle audit events correctly', () => {
    const mockEvent = { type: 'PERMISSION_CHANGE', userId: 1 };
    // ... 20 lines of setup ...
    expect(service.process).toBeDefined();
  });
});
```
Coverage: ✅
Confidence: ❌
When I tried to refactor a field name, 40 tests broke—not because they were testing behavior, but because they were testing implementation details that AI copy-pasted from the source code. The tests weren't a safety net; they were a mirror.
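What I wanted instead was a test pinned to observable behavior through the public API, so renaming an internal field changes nothing. A sketch of that shape, using plain assertions so it's self-contained; `AuditService` here is a hypothetical stand-in, not the project's real class:

```typescript
// Behavior-focused test sketch: assert on observable output, not internals.
// "AuditService" is a hypothetical stand-in, not the project's real class.
class AuditService {
  private log: { action: string; userId: number }[] = [];

  process(event: { type: string; userId: number }): void {
    this.log.push({ action: event.type, userId: event.userId });
  }

  entriesFor(userId: number): { action: string; userId: number }[] {
    return this.log.filter((e) => e.userId === userId);
  }
}

const service = new AuditService();
service.process({ type: "PERMISSION_CHANGE", userId: 1 });

const entries = service.entriesFor(1);
console.assert(entries.length === 1, "one audit entry recorded");
console.assert(entries[0].action === "PERMISSION_CHANGE", "action preserved");
// Rename the private `log` field or reshape the internals: this stays green.
```

The difference is what breaks it: a behavior test fails when the system does something different, not when it's spelled differently.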
The Local Optimum Trap
Here's the strangest part: every single file, in isolation, looked good. Clean code. Proper types. Good naming.
But tracing a single user action through the system required jumping through 6 abstraction layers, 3 adapter patterns, and 2 "middleware pipelines." The AI had optimized each component locally while creating a global mess.
I realized I was reading the output of an entity that doesn't read—it generates. It doesn't understand the codebase; it predicts tokens based on context. When you let it architect, you get maximally "correct" code that minimizes comprehension.
The Reckoning
I spent a day trying to "fix" the 8.5K lines. Then I spent two hours reverting to a clean slate and rewriting it myself—2,100 lines, boring code, works perfectly.
What I Actually Learned
Tab completion was probably the sweet spot.
When AI acts as a pair programmer—suggesting the next line, filling boilerplate, offering alternatives—it augments my decision-making.
When AI acts as the primary author—making architectural choices I don't review—it creates debt at machine speed. Debt that looks deceptively professional.
My new rules:
- AI writes functions, not systems. I'll design the architecture; it can implement the utilities.
- Review everything. If I can't explain a line in code review, it doesn't ship—even if the AI wrote it.
- Write your own tests. Or at least, write the test cases and let AI fill the syntax. AI doesn't know what behavior matters; it knows what syntax compiles.
- Complexity budget. If AI generates more than 200 lines I didn't explicitly ask for, that's a red flag, not a feature.
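The "write your own tests" rule in practice: I write the test plan first, as named behaviors with expected outcomes, and only then let AI fill in the arrange/act plumbing. A sketch of what that human-authored part looks like (the case list is hypothetical):

```typescript
// The human-authored part: which behaviors matter, written down before any syntax.
// AI can fill the test bodies later, but it can't invent this list.
type TestCase = { name: string; expected: string };

const auditLogCases: TestCase[] = [
  { name: "records actor and target on every permission change", expected: "one entry per change" },
  { name: "rejects entries with an unknown actor", expected: "throws, writes nothing" },
  { name: "returns entries in chronological order", expected: "sorted by timestamp" },
];

// Sanity check that no case was silently dropped when AI fills in the bodies:
console.log(auditLogCases.length); // 3
```

Three deliberate cases beat forty generated mirrors.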
The 8,500-line revert commit is staying in my git history.
It's a reminder that "AI-generated" isn't a synonym for "well-engineered"—it's just fast. And fast technical debt is still debt, just with better syntax highlighting.
Have you hit similar walls with AI autonomy? Or am I just bad at prompting?