AI is getting scary good at writing code. Drop a prompt into Lovable or Claude Code, and minutes later you have a working prototype with a modern tech stack. It feels like we have solved all the big problems in AI-assisted software development.
Then production happens. And production does not care how clean your code looks. It cares whether it survives reality. Can it handle traffic spikes or recover when a dependency fails at 2 AM? Can a person who inherits it six months from now actually understand what it does?
The uncomfortable truth is that AI can generate code at scale, but it cannot own what happens after you deploy it.
The Illusion: "Perfect" Code from AI
What AI actually produces is impressive. Clean structure. Readable functions. Fast prototypes. Even decent architecture suggestions if you prompt it right. From the outside, it looks like senior-level output.
But here is the thing. AI does not understand business risk. It does not know what happens when your database connection pool runs out under load. It cannot tell you which of two equally valid solutions will survive a 10x traffic spike. It generates what looks right based on what it has seen before, not what is right for your specific system with its specific constraints.
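To make the connection pool point concrete, here is a minimal sketch of what "running out" looks like. Everything in it is an assumption for illustration: `BoundedPool` and `simulateBurst` are hypothetical names, not the API of any real database driver, and the pool fails fast instead of queueing the way most real pools do.

```typescript
// Hypothetical fixed-size pool, for illustration only.
class BoundedPool<T> {
  private readonly idle: T[];

  constructor(resources: T[]) {
    this.idle = [...resources];
  }

  // Returns a resource, or undefined when every resource is checked out.
  tryAcquire(): T | undefined {
    return this.idle.pop();
  }

  // Hands a resource back so another caller can use it.
  release(resource: T): void {
    this.idle.push(resource);
  }
}

// Simulates a burst in which every request holds its connection
// for the whole burst, so nothing is released in time.
function simulateBurst(
  poolSize: number,
  concurrentRequests: number
): { served: number; rejected: number } {
  const pool = new BoundedPool<string>(
    Array.from({ length: poolSize }, (_, i) => `conn-${i}`)
  );
  let served = 0;
  let rejected = 0;
  for (let i = 0; i < concurrentRequests; i++) {
    if (pool.tryAcquire() !== undefined) {
      served += 1;
    } else {
      rejected += 1;
    }
  }
  return { served, rejected };
}
```

With these made-up numbers, `simulateBurst(10, 100)` serves 10 requests and rejects 90. The code is trivial. Deciding the pool size, the wait policy, and what the other 90 users see is the part that needs someone who knows the system.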
Andy Anderson, a researcher at IBM, spent four months building a Kubernetes dashboard from scratch using Claude Code. No team. Just him and the AI. The first two weeks were exhilarating. Code poured out at a pace he had never experienced. Features that would normally take days appeared in hours. It felt like having a tireless junior developer who typed at the speed of thought.
Then the limitations hit, all at once: broken builds, wrong architectural patterns, scope creep, and the AI trying to modify files Andy had not asked it to touch. The problems cascaded: fix one thing, and three others break. He ended up spending more time reviewing and reverting than he would have spent writing the code himself.
We saw the same pattern with a scheduling app we built recently. Lovable generated a working prototype in hours, with a React frontend, a Node.js backend, and a PostgreSQL database. The calendar looked fine in the demo. But when we tested it, the AI-generated scheduler could not handle recurring events or drag-and-drop reliably. It was a concept car made of clay. So we swapped that component for DHTMLX Scheduler, a production-grade library, and rebuilt the backend for real-world scale. The prototype was useful, but the production system needed actual engineering.
This pattern shows up everywhere. The tools are great at generating code but terrible at understanding what the code is for.
Where Things Break: Production Reality
In production, systems behave differently. Traffic is unpredictable. Dependencies fail. Data grows. Users do things you never expected. Integrations become fragile. And suddenly, that “perfect” AI-generated code becomes just one layer in a very messy environment. The problem is not the code but everything around it.
A large-scale study analyzing 302,600 verified AI-authored commits across 6,299 GitHub repositories found something sobering. AI-generated code introduces real issues, such as code smells, correctness problems, or security vulnerabilities, and 22.7% of those issues survive long-term, silently accumulating as technical debt in production codebases. More than 15% of commits from every AI coding assistant introduce at least one issue. Code that looks fine in a PR can quietly rot in production for years.
Stéphane Maes, who writes about software engineering, calls this the “Great Toil Shift.” The time you save generating code gets entirely consumed by the downstream work. This includes architectural review, security auditing, code understanding, documentation, and ongoing maintenance. The efficiency gains are not what they seem.
This is a common pattern for teams that go all in on AI-generated code. The prototypes ship fast. Everyone is excited. Then the first production incident hits. The AI-generated code does something unexpected under load. Nobody fully understands the system because nobody wrote it. The team spends days debugging something that would have taken hours if they had built it themselves.
The Missing Piece: Engineering Judgment
Here is where the conversation needs to shift. What AI does is options generation. It gives you suggestions, often good ones. But someone still has to decide which suggestion to take. Someone still has to understand the trade-offs.
And those trade-offs are the hard part:
Architecture trade-offs: Do we use a message queue or direct calls?
Failure handling: What happens when this service goes down?
Performance constraints: How many requests per second can this handle?
Security decisions: Is this endpoint properly authenticated?
Long-term maintainability: Will someone understand this in two years?
This is where experience matters more than output. Two solutions can both "work" in AI terms. But only one survives production load.
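Take the failure-handling question from the list. One common answer is a circuit breaker: after repeated failures, stop calling the broken service for a while and return a fallback. The sketch below is a simplified, synchronous version written for this post. The `CircuitBreaker` class, its parameters, and the thresholds are all assumptions for illustration. Real service calls in Node.js would be asynchronous, and a real breaker would need more care around concurrency.

```typescript
type BreakerState = "closed" | "open";

// Hypothetical, simplified circuit breaker, for illustration only.
class CircuitBreaker {
  private state: BreakerState = "closed";
  private consecutiveFailures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly resetAfterMs: number;
  private readonly now: () => number;

  constructor(
    failureThreshold: number,
    resetAfterMs: number,
    now: () => number = Date.now // injectable clock, handy for tests
  ) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.now = now;
  }

  getState(): BreakerState {
    return this.state;
  }

  call<T>(operation: () => T, fallback: () => T): T {
    let isTrialCall = false;
    if (this.state === "open") {
      if (this.now() - this.openedAt < this.resetAfterMs) {
        return fallback(); // fail fast, do not touch the dependency
      }
      isTrialCall = true; // cool-down elapsed, allow one trial call
    }
    try {
      const result = operation();
      this.consecutiveFailures = 0;
      this.state = "closed";
      return result;
    } catch {
      this.consecutiveFailures += 1;
      // A failed trial call reopens the breaker immediately.
      if (isTrialCall || this.consecutiveFailures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = this.now();
      }
      return fallback();
    }
  }
}
```

A call site might look like `breaker.call(() => fetchPrice(), () => cachedPrice)`, where both functions are placeholders. The class is the easy part. How many failures are too many, how long to wait, and whether a stale fallback is acceptable for this feature are decisions about your system, and they are the kind of judgment the code cannot make for you.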
Andy Anderson's central finding from his experience report is worth quoting directly: "The intelligence of an AI-driven development system resides not in the AI model itself, but in the infrastructure of instructions, tests, metrics, and feedback loops that surround it."
In other words, AI is just a tool. The real system is everything you build around it. The tests. The review process. The deployment pipeline. The monitoring. The on-call rotation. The institutional knowledge about why things are the way they are.
In production environments, organizations fundamentally need what they call a "throat to choke," a human engineer who holds ultimate accountability, who can assess blame, who possesses the systemic intuition to troubleshoot complex, real-world failures. A machine cannot sign off on a Service Level Agreement (SLA). It cannot face a compliance audit. It cannot absorb liability for downtime.
This is not a philosophical point. Researchers who analyzed the terms of service for nine AI coding assistants found a consistent pattern: while users own the generated output, full liability, responsibility for correctness, and downstream production risk all rest with the human user. The providers explicitly disclaim warranties and allocate responsibility to the developer.
If your AI-generated code causes a production outage, the AI company is not getting the pager duty alert. You are.
What This Means for How We Build Software
None of this means the tools are useless. They are incredibly useful. They can generate boilerplate, write tests, refactor code, and help you explore solutions faster than ever before. But the tools do not replace judgment. They do not eliminate the need for testing. They do not make architecture decisions for you. And they definitely do not own the systems they help you build.
The real skill in modern AI-assisted development is learning how to integrate generated code into a system that you fully understand, can debug, and evolve over time. It is knowing when to accept a suggestion and when to rewrite it.
Some teams are figuring this out. They use AI for rapid prototyping, as we did with the Lovable-generated scheduler: it validated the concept, and then we replaced the AI-generated components with a production-ready solution. The prototype was useful for testing and validation. But the production system required real engineering: understanding the domain, handling edge cases, and ensuring the component could actually survive real-world use.
AI changes how fast we build software. But it has not changed what it means to be responsible for it. And in production, responsibility is still the hardest part.