TL;DR
Vibe coding—Andrej Karpathy's term for conversational AI code generation—creates massive technical debt in production systems. Despite promises of acceleration, research shows developers using conversational AI tools are 19% slower on complex codebases and write 47% more code per task. This article explains why unstructured AI coding fails and presents the structured alternative: context engineering and spec-driven development.
Key Statistics:
- 76% more code written with AI tools (GitClear, 2025)
- 19% slower on complex codebases (METR Study, 2025)
- 47% code bloat per task (METR Study, 2025)
Introduction: The Rise and Fall from Grace of Vibe Coding
Origin of the Term
In February 2025, Andrej Karpathy coined the term "vibe coding" in a viral tweet that garnered over 4.5 million views. He described it as "a new kind of coding where you fully give in to the vibes, embrace exponentials, and forget that the code even exists."
His original post painted a picture of frictionless development:
- "I 'Accept All' always, I don't read the diffs anymore"
- "When I get error messages I just copy paste them in with no comment, usually that fixes it"
- "I just talk to Composer with SuperWhisper so I barely even touch the keyboard"
Karpathy acknowledged this approach worked for his use case: "It's not too bad for throwaway weekend projects, but still quite amusing."
The term resonated so strongly that it became Collins Dictionary's Word of the Year for 2025. Karpathy later elaborated on the concept in his AI Startup School talk "Software Is Changing (Again)" at Y Combinator, where he discussed vibe coding as part of the shift to "Software 3.0"—where natural language becomes the programming interface. Y Combinator also dedicated an episode of the Lightcone Podcast to discussing this new paradigm, with YC president Garry Tan declaring: "This isn't a fad, this isn't going away, this is actually the dominant way to code."
The Key Difference: From Prototypes to Production
Karpathy was explicit about the scope: "throwaway weekend projects." The approach he described—accepting everything without review, not reading diffs, copy-pasting errors blindly—was intentionally reckless for code you don't need to maintain.
But here's what happened: developers saw the speed and excitement of vibe coding and applied it to production codebases.
The approach that works for throwaway prototypes becomes catastrophic when applied to systems that need to:
- Run in production
- Be maintained over time
- Evolve with changing requirements
- Be understood by team members
- Handle edge cases correctly
As a result, the term "vibe coding" now carries a stigma in professional software development circles. It's associated with developers who generate code they don't understand, can't maintain, and that accumulates technical debt at an alarming rate. We'll explore why this happens and what forces amplified the problem in the sections ahead.
The Evolution of the Term
Karpathy himself further narrowed the definition by introducing a contrasting approach. In a follow-up tweet, he described a completely different workflow for "code I actually and professionally care about." For that work he used the term AI-Assisted Coding (referencing an article in the same thread): the AI assists rather than generates blindly, guided by the principle of "stuffing everything relevant into context" before code generation.
Almost in parallel, other experts arrived at the same conclusion. Kent Beck, creator of Test-Driven Development and Extreme Programming, coined the term Augmented Coding to describe the alternative: "In vibe coding you don't care about the code, just the behavior. In augmented coding you care about the code, its complexity, the tests, & their coverage."
Beck's insight is particularly relevant because TDD itself becomes a form of context engineering—writing tests first creates executable specifications that guide AI code generation, just as they've always guided human developers. The tests are the spec. For Beck, who has over five decades of programming experience and is now re-energized working with AI agents, TDD is a "superpower" when combined with AI—maintaining the same value system as hand-coding (clean, tested, well-structured code), with AI handling the mechanical typing within constraints defined by your tests and architectural decisions.
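To make "the tests are the spec" concrete, here's a minimal sketch in Python. Everything in it is hypothetical (the `normalize_username` function and its rules are invented for illustration); the point is that the test file encodes behavior, edge cases, and error handling before any implementation exists:

```python
# test_normalize_username.py
# Written *before* the implementation: these tests are the executable spec
# that constrains whatever code (human- or AI-generated) comes next.
import pytest

from usernames import normalize_username  # hypothetical module under test


def test_lowercases_and_strips_whitespace():
    assert normalize_username("  Ada.Lovelace ") == "ada.lovelace"


def test_rejects_empty_input():
    with pytest.raises(ValueError):
        normalize_username("   ")


def test_rejects_disallowed_characters():
    with pytest.raises(ValueError):
        normalize_username("ada/lovelace")
```

Handed to an AI agent with the instruction "make these pass," this pins down the requirements far more precisely than any conversational prompt, and the constraints survive across sessions because they live in the repository.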
This philosophy—comprehensive context over improvised prompts—has emerged under various names across the industry, each progressively more specific:
AI-Assisted Coding - The most generic term, widely used across the industry. It is more a reaction to the stigma of vibe coding than a concrete methodology, and doesn't specify much beyond "AI helps but doesn't generate blindly."
Context Engineering - A more specific term, championed by industry leaders from OpenAI to Shopify. It focuses on the core principle: providing comprehensive context to AI before code generation.
Spec-Driven Development - A methodology that emerged in mid-2025 in which formal, detailed specifications serve as executable blueprints for AI code generation. It goes beyond "context" to emphasize "specifications"—though there's still ambiguity in the industry about what qualifies as a "spec." Multiple implementations exist: GitHub's Spec Kit (open-source CLI), Amazon's Kiro (agentic IDE), and others. A simplified sketch of a spec follows below.
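To give a flavor of what "specifications as executable blueprints" means, here is a deliberately simplified, hypothetical spec excerpt. It does not follow the exact format of Spec Kit or Kiro (each defines its own structure); it only illustrates the level of detail involved:

```markdown
# Spec: Password reset flow

## Behavior
- A reset link is emailed only when the address matches an existing account.
- The response is identical whether or not the account exists (no user enumeration).
- Links expire after 30 minutes and are single-use.

## Constraints
- Reuse the existing `EmailService` interface; do not introduce a new mail client.
- Follow the repository pattern already used in `accounts/` for all data access.

## Out of scope
- SMS-based reset and admin-initiated resets.
```

Notice how the spec fixes behavior, architectural constraints, and scope boundaries up front: exactly the information that vibe coding surfaces only after several rounds of correction.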
To understand why these structured approaches emerged and why they matter, let's examine the specific problems that make vibe coding fail for production systems.
The Problems of Improvised Coding
Problem 1: Why Vibe Coding Creates Technical Debt and Wastes Time
The chat-based interface of AI coding assistants creates a deceptive workflow. It feels natural to have a conversation, gradually explaining what you want. You type a message, the AI responds with code, you refine your request, and the cycle continues. This conversational approach seems intuitive—it's how we communicate with humans, after all.
But this approach is fundamentally flawed for software development.
When you start describing a feature conversationally, the AI makes architectural decisions based on incomplete information. You haven't yet explained all your constraints, your existing patterns, or your edge cases. The AI generates code based on what it knows right now—which is only a fraction of what it needs to know.
Then you see the generated code and realize: "that's not what I wanted." Now you're in correction mode. You explain more context, adjust requirements, re-explain constraints. The AI generates new code, but it's building on the foundation of those early incorrect decisions. You're not moving forward efficiently—you're constantly backtracking and adjusting.
Chat interfaces encourage this incremental information sharing by design. They're optimized for human conversation, not for providing comprehensive technical context. But AI coding agents need that comprehensive context upfront, not gradually built through dialogue. Every round of clarification generates code that may need to be partially or completely discarded. You're wasting time building up context that should have been provided initially.
This becomes nearly impossible with existing codebases. Trying to explain your architecture, patterns, constraints, and existing functionality through conversation is extraordinarily frustrating. The AI cannot effectively understand your system's organization through conversational prompts—it makes incorrect assumptions about how components interact, where functionality already exists, and what patterns you follow. You're essentially trying to reverse-engineer your own codebase through chat, explaining piece by piece what should be comprehensive context from the start.
The result is frustration, wasted time, and code that doesn't match your actual requirements—even after extensive back-and-forth.
Problem 2: How Token Limits Cause Inconsistent Code
Even if you eventually build up the right context through conversation, you hit another wall: token limits.
The back-and-forth conversation consumes tokens rapidly. Your initial prompts, the AI's responses, your corrections, the regenerated code—it all adds up. Eventually the context window fills up, often just as you've finally gotten the AI to understand what you want and start generating useful code.
With existing codebases, this problem becomes catastrophic. Without clear boundaries on what the AI should analyze, it loads irrelevant code into context. Working on a monolithic codebase with thousands of lines? The AI might analyze and load the entire frontend into context when you're actually modifying a database interface. Token limits fill up with code that has nothing to do with your actual task, leaving no room for the context that actually matters.
Now the AI needs to reset or summarize to continue. When context resets, the AI attempts to summarize the conversation to preserve the most important information. But summaries are inherently lossy. The nuance disappears. The architectural decisions and their rationale get compressed into bullet points. The "why" behind decisions—the context you spent so much time building—evaporates.
The AI's subsequent responses lack the full understanding you had established. Even though you were very specific during each individual session, you end up with inconsistent code across sessions because the AI can't maintain that context.
This frustration is what triggered the emergence of prompt engineering and context engineering. Developers who went through this experience started extracting the guidelines and instructions they were repeatedly giving to AI agents across different sessions and saving them as reusable prompts in files. Instead of rebuilding context through conversation every time, they could provide comprehensive context upfront. This simple insight—saving context in files rather than reconstructing it conversationally—became the foundation for more structured approaches to AI-assisted development.
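As an illustration of that insight, here is a hypothetical context file of the kind developers began committing alongside their code. The filename and conventions vary by tool (Cursor, Claude Code, and Copilot each read their own instruction files); the content below is invented for illustration:

```markdown
# Project context for AI agents (hypothetical example)

## Architecture
- Python 3.12 monorepo: services under `services/`, shared code under `lib/`.
- All database access goes through the repository classes in `lib/repos/`.

## Conventions
- Type hints everywhere; `ruff` and `mypy` must pass before commit.
- No new dependencies without discussion; prefer the standard library.

## Scope guardrails
- Only touch files inside the service named in the task.
- Never modify migration files or generated code.
```

Because the file lives in the repository, the same context is available at the start of every session instead of being rebuilt through conversation and lost at every context reset.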
Problem 3: Why Vibe Coding Generates Excessive Code and Causes Review Fatigue
AI models are verbose by default, and this compounds the previous problems dramatically.
Large language models generate extensive implementations. Where a human might write a concise solution, an LLM tends to be thorough—sometimes excessively so. The data supports this: GitClear's analysis shows the median developer checked in 76% more code in 2025 than in 2022 (with the average being 131% more). This phenomenon has been dubbed "code slop"—code that compiles and runs but is verbose, brittle, and flawed.
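A contrived Python example of the pattern (both versions are invented; the task itself doesn't matter):

```python
# What a human reviewer hopes to see: one focused, readable function.
def active_emails(users):
    """Return the lowercased emails of active users."""
    return [u.email.lower() for u in users if u.is_active]


# The kind of verbose "slop" an unconstrained LLM often produces instead:
# it runs, but buries the logic under speculative checks nobody asked for.
def get_active_user_email_addresses(users):
    result = []
    if users is None:  # defensive branch for a case the caller never passes
        return result
    for user in users:
        if user is not None:
            is_active = getattr(user, "is_active", False)
            if is_active is True:
                email = getattr(user, "email", None)
                if email is not None and isinstance(email, str):
                    result.append(email.lower())
    return result
```

Both do the same job; the second is simply several times as much code to read.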
More code means more to review and more to maintain. Worse, scope creep sneaks in: the AI helpfully implements features you didn't explicitly ask for, with developers writing 47% more lines of code relative to forecasted task size as they handle edge cases and bolt on extras. The METR study—which examined conversational AI coding tools like Cursor Pro on complex, real-world codebases—found that experienced developers were actually 19% slower when using these tools, despite expecting a 24% speedup. This stands in stark contrast to Phase 2 autocomplete tools (see the 4 Phases of AI Adoption in the full article), which show genuine productivity gains.
Now you're facing a perfect storm of exhaustion. You're already tired from the conversational back-and-forth of Problem 1. You're frustrated by the inconsistencies caused by context loss in Problem 2. And now you're staring at walls of verbose code that need careful review.
With existing codebases, your threshold drops even further. When there's already significant code—potentially legacy code of varying quality—your mental fatigue intensifies. The AI adds more verbose code on top of existing complexity. The scope creep is harder to spot because you're less familiar with every corner of the system. Your review becomes even more superficial because the cognitive load is overwhelming.
Review fatigue sets in. You know you should carefully examine every line, but you're mentally drained. The code looks reasonable at first glance. AI code often follows similar patterns, causing "template blindness"—you skim rather than deeply analyze.
The result: you've just approved inconsistent code with unnecessary complexity that will be painful to maintain. But you won't realize this until later—when you're trying to debug it, extend it, fix a security vulnerability, or hand it off to another team member.
The Cumulative Effect: A Black Box Full of Spaghetti
All these problems converge into one critical consequence: you generate tons of code with AI agents that you don't understand and cannot maintain.
This matters because we're not (yet?) in a future where AI autonomously maintains complex production systems. Someone—a human—still needs to read and comprehend the code, maintain it when issues arise, evolve it as requirements change, debug it when things break, and onboard new team members to understand it. Until AI can reliably reason about complex systems and create novel solutions for chaotic real-world problems—something pattern matching fundamentally cannot do—humans remain critical to the software development lifecycle.
Structure and understanding aren't optional. Unless you're building throwaway prototypes, you need code you can understand, patterns you can follow, architecture you can reason about, and systems you can debug.
"But wait," you might be thinking, "I already have plenty of code I don't understand—written by humans." Fair point. Human developers absolutely generate incomprehensible code. But AI-generated spaghetti code is even worse: At least when a human writes messy code, they can explain why they made certain decisions, remember (or reconstruct) the context, maintain some internal consistency through personal coding style, and reason about the trade-offs they made. AI without structure has none of these advantages—no memory of why decisions were made, no consistent style or patterns, random mixing of paradigms and approaches, and zero ability to explain trade-offs. You get a codebase that nobody—not even the AI that generated it—can reliably maintain or evolve.
Read the full article including the 4 Phases of AI Adoption and practical next steps:
👉 The Vibe Coding Trap: Complete article
Are you a vibe coder, or do you follow a more structured approach to AI-assisted development?
