In Part 1 of this series, I enumerated a few obstacles for engineers taking vibe coding from side projects to production. Part 2 looked at AI usage from the manager's perspective: measuring adoption, understanding the gap, coaching to fill the gap. Both of those were "Day 1" problems: getting started, getting people on board, figuring out the tools.
This article focuses on what comes next: the vibe coding process problems that emerge once adoption takes hold. I'd call them "Day 2" problems. AI usage is up, code is shipping faster, and then things start breaking in places you didn't expect. My goal is to point to specific problems that you can observe and fix.
Engineers may feel these Day 2 problems as daily friction: PRs stuck in review, surprise token bills, coming back to AI-generated code that's unrecognizable.
Managers face these problems more from a team and process perspective: senior engineers stuck reviewing instead of building, budget surprises, and "quality" meaning different things depending on who's asking. I'll walk through what I've seen break and, for each problem, suggest an actionable starting point.
Let's start with the software development lifecycle. When engineers say "coding is only one part of shipping a product to customers", what do they mean?
Vibe coding code review: the first bottleneck you'll hit
Code review is often where AI's impact is immediately noticeable. It's the bottleneck sitting right after implementation (code generation), because most organizations require code reviews for quality and security reasons.
Code generation is much faster: output can easily rise to thousands of lines per day per engineer, if not orders of magnitude more. That means more commits, more PRs, and more lines of code landing in review queues. Code review was already a chore for many engineers; AI has compounded the problem.
I spoke with a tech lead at a large enterprise who said members of his team had started distrusting AI because of the quality of code coming through in PRs. Not because AI can't write decent code, but because engineers were submitting AI output without reviewing it themselves first. The PR became the first time anyone looked critically at what the agent produced.
Senior engineers face a practical question here: should they still comb over every line of code the way they used to? When a PR is 5,000 lines of AI-generated code, a line-by-line review is time-consuming. But skimming feels irresponsible.
So what can you do about it?
Think about your CI/CD pipeline and what parts of the review can be automated.
Luckily, AI code review tools like CodeRabbit, Greptile, Cursor's Bugbot, or Anthropic's Claude Code review catch a lot of the surface-level issues: style, obvious bugs, missing tests. These don't replace human review, but they reduce the surface area your senior engineers need to cover manually.
Engineers I've spoken to report good findings from these AI review tools, but also a lot of false positives. It can be helpful to coach early-career engineers to spot the false positives and explain why they're not a problem, or why they're an acceptable risk.
Another idea, more on the process side: ask authors to review their own PRs before sending a review request to someone else. In other words, a pre-review review. "Ease of review" and "quality of code" are still the author's responsibility: they reflect on the author's engineering skills, regardless of whether AI wrote the first pass. If your team doesn't have that norm yet, it's a good time to consider setting it.
The upstream AI coding bottleneck: issue tracking
Code review is the downstream constraint of generating more code. But there's an upstream one too in the planning phase: ticketing (e.g. in JIRA, Linear or GitHub Issues).
Upstream, we have the work that happens before anyone writes a line: ticket creation, design conversations, bug reproduction, requirements gathering, stakeholder alignment. None of that got faster when you adopted AI coding tools.
Vague tickets slow down development because engineers have to ask clarifying questions. Delays add up when clarifying takes multiple back-and-forths. Clear acceptance criteria, reproduction steps, and system context help engineers get work done faster.
And it's not just ticket creation. Think about communication and "paperwork" across the whole software development lifecycle. Status updates, stakeholder check-ins, handoff notes, design docs: all the connective tissue that keeps a team aligned. It's not what we traditionally picture when we talk about accelerating engineers who are vibe coding, but these are common time sinks.
What can be done about issue tracking and requirements gathering?
Here are a few things I've been experimenting with. First, you can build a Cursor or Claude skill that pulls a ticket from your issue tracker (JIRA, Linear, whatever you use) via an MCP server and runs it through a series of quality checks. Does the ticket have a clear objective? Clear requirements? Business impact? A named stakeholder? If it's a bug, does it have steps to reproduce? If it's incomplete, the tool can automatically flag the gaps and notify the stakeholder. This takes an afternoon to set up and pays for itself within the first sprint.
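To make that concrete, here's a minimal sketch of the quality-check half. The field names and the fetch_ticket()/notify() stubs are hypothetical stand-ins for your tracker's MCP server or REST API; the real version would live inside the skill.

```python
# Minimal sketch of a ticket quality gate. Field names and the
# fetch_ticket()/notify() stubs are hypothetical placeholders.
import re

def fetch_ticket(key: str) -> dict:
    # Stub: replace with a real call to your tracker (MCP or REST).
    return {"key": key, "type": "bug", "reporter": "alice",
            "description": "Login button does nothing on Safari."}

def notify(who: str, message: str) -> None:
    print(f"@{who}: {message}")  # stub: replace with a Slack/tracker comment

CHECKS = {
    "clear objective": lambda t: len(t.get("description", "").strip()) > 20,
    "named stakeholder": lambda t: bool(t.get("reporter")),
    "acceptance criteria": lambda t: "acceptance" in t.get("description", "").lower(),
}
BUG_CHECKS = {
    "steps to reproduce": lambda t: bool(
        re.search(r"steps to reproduce|repro steps", t.get("description", ""), re.I)
    ),
}

def audit(ticket: dict) -> list[str]:
    """Return the list of quality gaps found in the ticket."""
    checks = {**CHECKS, **(BUG_CHECKS if ticket.get("type") == "bug" else {})}
    return [name for name, ok in checks.items() if not ok(ticket)]

if __name__ == "__main__":
    ticket = fetch_ticket("PROJ-123")
    if gaps := audit(ticket):
        notify(ticket["reporter"], f"{ticket['key']} is missing: {', '.join(gaps)}")
```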
Second, before an engineer works on a ticket, you could take the problem description and run automated research on it: in the codebase, in the database, or in a browser to explore the UI. If the ticket describes a bug, the automation could verify that the described behavior is actually observable, and potentially capture screenshots.
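As a toy version of the codebase half of that research, here's a sketch that pulls likely code identifiers out of a ticket description and greps the repo for them. The term heuristic is purely an assumption; a real agent would do semantic search instead.

```python
# Sketch: naive "pre-work research" over the codebase. Extracts tokens that
# look like identifiers (snake_case / CamelCase) from the ticket text and
# reports which files mention them. Heuristics are illustrative only.
import pathlib
import re

def research(description: str, repo: str = ".") -> dict[str, list[str]]:
    terms = {t for t in re.findall(r"\b\w{5,}\b", description)
             if "_" in t or re.search(r"[a-z][A-Z]", t)}
    hits: dict[str, list[str]] = {}
    for path in pathlib.Path(repo).rglob("*.py"):  # extend to your languages
        text = path.read_text(errors="ignore")
        for term in terms:
            if term in text:
                hits.setdefault(term, []).append(str(path))
    return hits

print(research("Crash in parse_order_date when OrderService returns null"))
```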
But beyond initial ticket creation, how can you speed up feedback cycles by helping engineers act on tickets and then reduce the paperwork?
You can create CLI tools / desktop applications that help engineers package up their progress (git commits), findings (command line output, screenshots, summaries) and attach them back to the ticket. It sounds small, but reducing the friction of sharing blockers and getting feedback keeps the pipeline moving. The gains from AI coding don't fully materialize if the non-coding parts of your process stay manual.
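A minimal sketch of such a packager, assuming an attach_to_ticket() stub in place of your tracker's real API:

```python
# Sketch of a "progress packager": collects recent commits and a command's
# output, then posts them back to the ticket. attach_to_ticket() is a
# hypothetical stand-in for your tracker's API.
import subprocess
import sys

def recent_commits(n: int = 5) -> str:
    return subprocess.run(
        ["git", "log", f"-{n}", "--oneline"],
        capture_output=True, text=True, check=True,
    ).stdout

def run_and_capture(cmd: list[str]) -> str:
    result = subprocess.run(cmd, capture_output=True, text=True)
    return f"$ {' '.join(cmd)}\n{result.stdout}{result.stderr}"

def attach_to_ticket(ticket_id: str, body: str) -> None:
    print(f"[would post to {ticket_id}]:\n{body}")  # replace with tracker call

if __name__ == "__main__":
    ticket_id = sys.argv[1]                           # e.g. PROJ-123
    body = "Progress update\n\n" + recent_commits()
    if len(sys.argv) > 2:
        body += "\n" + run_and_capture(sys.argv[2:])  # e.g. pytest -x
    attach_to_ticket(ticket_id, body)
```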
Vibe coding code quality: duplication and maintainability
AI ships duplicate code constantly. Instead of reusing existing modules or reaching for community packages, agents tend to reimplement. I've watched Claude Code write a date parsing utility from scratch in a codebase that already had three date parsing utilities (all also written by Claude Code in previous sessions). The agent didn't know they existed, because the context window didn't include them, and nobody had documented the pattern.
You need awareness and diligence to notice the duplicates and circle back to clean them up. And even then, I forget half the time.
This matters more than it might seem. Code duplication hurts runtime performance and build times. When there are duplicates, it's harder to fix bugs: you patch one copy and the other three still have the vulnerability. Security patches need to be applied in multiple places. The codebase quietly gets worse while the velocity numbers look great.
GitClear's 2025 study analyzed 211 million changed lines across repos from Google, Microsoft, Meta, and enterprises across 2020–2024. This covers the early AI adoption era. Code churn (new code revised within two weeks) roughly doubled from 3.1% to 5.7%. Copy/pasted code rose from 8.3% to 12.3%. Refactoring dropped from about 25% to under 10% of changed lines. The code ships faster, but doesn't age well.
Sidebar: I don't see code churn discussed much, but I'd love to see more research on potential impacts on maintainability. For folks vibe coding, seeing a "+2k / -2k lines of code" change is pretty common. What worries me is the impact of continuous churning of code (and tests) over time. Subtle bugfixes and "matured" code don't survive that kind of constant rewriting.
A few ideas on what to do for code quality:
In the code review section above, I mentioned CI/CD improvements for review. For maintainability specifically, look for tools that measure test coverage, code duplication, and code complexity at the repository level, not just at the PR level. PR reviews catch incremental issues, but as changes accumulate, you want a broader snapshot.
But it's not just third-party tools. Can you create hooks that run as part of a code review check, helping engineers detect duplicate code? They're incredibly easy to build. For example: a Skill or Subagent that scans for existing implementations before the agent writes a new one. The question is when engineers run this so they don't forget. A git hook, or preprocessing before a PR is submitted, works; the mechanism matters less than making it automatic.
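Here's a sketch of the git-hook flavor. It uses a crude name-collision heuristic rather than the Skill/Subagent approach described above; you'd wire it into .git/hooks/pre-commit or your PR pipeline.

```python
# Sketch of a pre-commit duplicate check: flags function names in newly
# added files that already exist elsewhere in the repo. A heuristic only;
# a Skill or Subagent doing semantic search would catch far more.
import ast
import pathlib
import subprocess
import sys

def function_names(source: str) -> set[str]:
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return set()
    return {n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)}

def staged_new_python_files() -> list[str]:
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=A"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    return [f for f in out if f.endswith(".py")]

# Index existing function names across the repo (exclude vendored dirs in practice).
existing: dict[str, str] = {}
for path in pathlib.Path(".").rglob("*.py"):
    for name in function_names(path.read_text(errors="ignore")):
        existing.setdefault(name, str(path))

clashes = []
for f in staged_new_python_files():
    for name in function_names(pathlib.Path(f).read_text(errors="ignore")):
        if name in existing and existing[name] != f:
            clashes.append(f"{name} in {f} already exists in {existing[name]}")

if clashes:
    print("Possible duplicates:\n" + "\n".join(clashes))
    sys.exit(1)  # nonzero exit blocks the commit
```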
OK, let's switch out of the "development" dimension and talk about the "operational" dimensions of software.
Vibe coding quality: code-level vs customer-facing
Code maintainability is one kind of quality engineers care about. The other kind is customer-facing quality, and that's what keeps all of us employed.
A manager I interviewed at a Fortune 500 company distilled their AI adoption objectives into two themes: "velocity" and "quality." When I pressed on what "quality" meant, it was clear they meant product uptime and customer-facing incidents. Not code complexity. Not test coverage.
This is typically what executives mean by "quality." If your engineering dashboards show code metrics and leadership means production stability, you're measuring two different quality layers. Clarify what quality means: the disconnect is more common than you'd think.
The DORA 2024 report found that for every 25% increase in AI adoption, delivery stability decreased 7.2%. Their 2025 follow-up added nuance: "[AI] shines a light on what's working, accelerating what's already in motion, but it also surfaces what needs to change." Strong teams with good practices benefited from AI. Struggling teams faced greater challenges. If your delivery pipeline had cracks before AI, AI adoption widened them.
What you can do:
Use your issue tracking system alongside git to track the quality of AI-assisted versus non-AI code. Git commits are increasingly labeled with AI tool footers (e.g., Co-Authored-By: Claude Opus 4.5). You could create a CI check requiring all commits to carry such a footer; even manually-written ones should be explicitly marked "human." It's a small discipline, but it makes the data traceable.
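A sketch of that CI check, assuming a team convention of either an AI co-author footer or an explicit human trailer (the trailer names here are a convention you'd pick, not a standard):

```python
# Sketch of a CI check: every commit on the branch must declare authorship
# via a trailer -- an AI footer ("Co-Authored-By: ...") or an explicit
# "Authored-By: human" line. Trailer names are team conventions.
import subprocess
import sys

def commits_in_range(rev_range: str = "origin/main..HEAD") -> list[str]:
    out = subprocess.run(
        ["git", "rev-list", rev_range],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.split()

def commit_message(sha: str) -> str:
    return subprocess.run(
        ["git", "log", "-1", "--format=%B", sha],
        capture_output=True, text=True, check=True,
    ).stdout

missing = [
    sha for sha in commits_in_range()
    if "Co-Authored-By:" not in commit_message(sha)
    and "Authored-By: human" not in commit_message(sha)
]
if missing:
    print("Commits missing an authorship trailer:", *missing, sep="\n")
    sys.exit(1)
```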
In the issue tracker, find ways to link customer-reported issues to the candidate commits that had problems. Remember the blameless post-mortem: you're linking to the problematic code change, not to a specific person.
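One low-tech way to make that link queryable is to index the ticket keys mentioned in commit messages. A sketch, assuming Jira-style keys like PROJ-123:

```python
# Sketch: build an index from ticket keys mentioned in commit subjects to
# commit SHAs, so a customer-reported issue can be traced to candidate
# changes. The key pattern is an assumption about your tracker.
import collections
import re
import subprocess

TICKET_KEY = re.compile(r"\b[A-Z]{2,10}-\d+\b")

log = subprocess.run(
    ["git", "log", "--format=%H%x09%s"],  # tab-separated SHA and subject
    capture_output=True, text=True, check=True,
).stdout

index = collections.defaultdict(list)
for line in log.splitlines():
    sha, subject = line.split("\t", 1)
    for key in TICKET_KEY.findall(subject):
        index[key].append(sha)

# index["PROJ-123"] -> candidate commits to examine in a blameless post-mortem
```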
And add labelling or other categorization that differentiates customer-reported issues from internally-found ones. You'll catch many more internal issues that customers may not care about, so it helps to keep explicitly customer-impacting issues as the priority.
Security: more code, more surface, smarter adversaries
Security shares some DNA with code quality, but it's a different domain and has much higher stakes.
Here are some things I think about from an engineer's perspective. More lines of code with less human understanding means the attack surface is growing. AI agents act on your behalf with your permissions and credentials. Hallucinations in development environments can cause real damage, not just in production. And vulnerabilities ship faster than before.
Research confirms that LLM-generated code includes vulnerabilities. Tihanyi et al. (2024) analyzed 331,000 C programs generated by 9 LLMs (GPT-4o-mini, Gemini Pro, Code Llama, and others) and found that over 62% contained vulnerabilities, with minimal differences between models. The problem isn't a bad model; it's that code generation at scale produces vulnerabilities at scale. AI might do better than humans per line, but if code generation is accelerating, the vulnerability count scales right along with it.
And from the other side, the window from vulnerability disclosure to active exploitation has compressed from 8.5 days to 5 days on average, and AI-assisted cyber attacks rose 72% in 2025. More code, faster attackers, cheaper discovery: it's a scaling problem, not a skill problem.
What can you do along the security front?
I'd start by adding security monitoring to your CI/CD: linting, SAST, supply chain scanning, and secrets scanning, using tools like Semgrep or Snyk, or open source alternatives. Code review bots include security checks as well. And the standard practices still apply: periodic auditing, security considerations early in project planning, security checks woven into the review process. Defense in depth, with the "depth" updated for a world where agents generate code faster than humans can review it.
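A sketch of a CI gate that chains a few of those scanners. It assumes semgrep, gitleaks, and pip-audit are installed; swap in whatever your organization has approved.

```python
# Sketch of a CI security gate chaining several scanners. Assumes the
# tools are installed on the CI runner; substitute your approved stack.
import subprocess
import sys

SCANS = [
    ["semgrep", "scan", "--config", "auto", "--error"],  # SAST
    ["gitleaks", "detect", "--no-banner"],               # secrets scanning
    ["pip-audit"],                                       # dependency / supply chain
]

failed = False
for cmd in SCANS:
    print(f"--- running: {' '.join(cmd)}")
    if subprocess.run(cmd).returncode != 0:
        failed = True  # keep running so the report covers every scanner

sys.exit(1 if failed else 0)
```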
I'd also update your "least privilege" access controls for the agentic world. To get work done, I have to grant agents control of certain tools and infrastructure, and I always worry about how much unintended damage that could cause.
Sidebar: "isolation" is a theme I think about a lot when it comes to improving AI security. How do you isolate your AI agent from your secrets (while still giving it some access)? From destroying files in your filesystem? From other computers on your network? I think techniques like containerization (Docker), jails, firewalling, and splitting identities/credential access into more granular chunks will be fruitful here.
Humans are better trained than agents at knowing which lines not to cross, so it makes sense to scope agents more tightly than human developers. Agents are also more numerous and shorter-lived. Think about how to generate lightweight, temporary permissions rather than sharing your personal credentials.
A concrete example of agents doing something sketchy: your .env file getting read by an agent and shipped up in an AI-generated bug report, or used in an unintended API call. The kind of thing you only laugh about months later.
Another example: an agent inheriting your admin role, hallucinating, and taking a destructive action with permissions it should never have had.
Use vaults and password managers to reduce what agents can access. Add degrees of isolation between "write access" and "read-only access." Isolate production from development environments. Wrap binaries and filesystem folders in containers, jails, or VMs to constrain the blast radius.
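As a starting point for the container idea, here's a sketch that launches an agent session with no network, a read-only root, and only the project directory writable. The image name and entrypoint are placeholders, and in practice you'd loosen the network policy to exactly the API endpoints your agent needs.

```python
# Sketch: run an agent session inside a locked-down container. The image
# and entrypoint are placeholders; tune mounts and network to your needs.
import os
import subprocess

project = os.getcwd()
subprocess.run([
    "docker", "run", "--rm", "-it",
    "--network", "none",            # no outbound access from the sandbox
    "--read-only",                  # immutable root filesystem
    "--tmpfs", "/tmp",              # scratch space the agent may need
    "-v", f"{project}:/workspace",  # only the project dir is writable
    "-w", "/workspace",
    "agent-sandbox:latest",         # placeholder image with your tooling
    "bash",                         # or your agent CLI's entrypoint
])
```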
This is by no means a complete list. It merely highlights some of the security risks AI is introducing, as discussed in engineering communities, and suggests starting points for using and extending existing tooling.
What I didn't cover (and why it matters)
There are a few topics I can't go deep on here, but they're worth flagging.
Design documents are evolving. To write more code thoughtfully, teams are producing more design docs. But the tech lead I mentioned earlier noticed they're becoming more generic: the same structure, the same level of detail, the same boilerplate that suggests an AI wrote the first draft and nobody pushed it further. ("Slop" is a useful description here: not to disparage the authors, but to describe the "averaging" effect of LLM-generated prose.) Design docs are supposed to force you to think through the hard parts before coding. If AI is writing them and humans are rubber-stamping them, we've lost the thinking and intentionality of designing solutions that actually fit the problem.
Operating code in production is another one, but I've covered it in Part 1 of this series. As you develop more code, you have to maintain it: deploy it, monitor it, troubleshoot it, patch it. How to enable your repositories and infrastructure to let AI help with operations safely is a separate conversation, and it's one I'm spending a lot of time on.
Where this leaves us
I've been thinking a lot about what a real measurement layer for AI coding should look like, and what kinds of insights it should surface. More on that soon.
Underneath all of these practices, there's a thread that keeps surfacing in every conversation I have: continuous learning. It's a classic discipline: Agile retrospectives and Toyota's production system have embodied it for decades. But it feels newly urgent when the tools and practices are changing this fast. Engineers and managers can't keep up with the rate at which new research appears, and intentional practice helps.
I'm collecting stories from engineers and managers working through the post-adoption phase. If you've hit the review bottleneck, had concerns about code quality and security, or if you've found something that works, I'd like to hear from you. You can find me on LinkedIn or X.
Originally published at ashu.co.