Vibe coding techniques need to be adapted when you work on production applications with AI. I walk through some challenges and solutions that I've found helpful on real projects.
I'm going to share some experiences from a few months ago, when I expanded the scope of my agent use from vibe-coded apps to real-world problems in production.
I'd been coding with AI agents for a while: greenfield scripts, prototypes, and features I could build and throw away. Early in that experimentation, I set my sights on building tools and practices for safely using AI in production. I knew I would have to maintain and operate the code I developed. So as I explored AI by building isolated, greenfield code, I made mental notes of which techniques wouldn't work in production and which I could bring to production infrastructure.
There are numerous articles and posts describing techniques for vibe coding well. But there isn't enough documentation of the practices for customizing your repository and tooling to take full advantage of AI agents on real infrastructure and workloads.
Daghan Altas, a former Cisco Meraki colleague, phrased it well: what's the point of a 10x productivity boost if you can't operate and maintain the thing you built any faster? That reframed the question for me. Not "is AI fast?"; obviously it's fast. But: what specifically is holding me back?
Here's what I ran into when I applied vibe coding techniques against production infrastructure and workloads. And this is how I've updated my configurations and techniques to address these issues.
AI codes quickly, but what about troubleshooting and testing?
AI is great at implementation, but there's so much surrounding the act of writing code.
Here's a concrete example: I was building a prototype that needed to normalize messy data from multiple API and database sources. The numbers kept being wrong. I pointed Claude Code at the problem, and it churned for an hour, trying different parsing strategies, refactoring the aggregation logic, adding fallback handlers.
The fix turned out to be surprisingly simple: the logging wasn't capturing everything it needed to. The agent was trusting the logs at face value and never questioned whether the data was complete. An hour of sophisticated troubleshooting on a problem that needed five minutes of "wait, do we have enough logs to capture the symptom of the problem?"
The same dynamic plays out with writing unit tests: the agent produces them quickly, but verifying that they actually prove anything takes real effort. So I needed to think more broadly than implementation speed.
Implementation is often not the bottleneck. The bottleneck is everything around the implementation. After implementation, that would include: verifying the code does what you think it does, troubleshooting when it doesn't, understanding what already exists so you don't reinvent it, and making sure the architecture holds up next month.
What I've started doing: I built subagents for the two patterns that burned the most time when I was catching and fixing AI-written bugs.
- First, a "Skeptical Testing Subagent" that scrutinizes test suites: checking for duplicates, judging whether each test is meaningful, and flagging assertions that don't actually prove anything.
- Second, a "Skeptical Troubleshooting Subagent" that examines production logs and data integrity before jumping into code changes. Both are early, but they've already caught things I would have missed.
When I say "skeptical," you can translate it to "adversarial," the term folks in the AI community use more frequently. People have talked about using "adversarial agents" to review code, and how these agents "think differently" than an agent told to "write code". My testing and troubleshooting subagents apply that idea to the specific code review and production log review problems I've encountered, in a narrower and more specific context.
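In Claude Code, a subagent like this lives as a markdown file with YAML frontmatter under .claude/agents/. Here's a minimal sketch of a skeptical testing subagent; the prompt wording and tool list are illustrative, not my exact file:

```markdown
---
name: skeptical-tester
description: Reviews test suites for duplicate, tautological, or meaningless tests. Use after writing or modifying tests.
tools: Read, Grep, Glob
---

You are a skeptical reviewer of test suites. For each test, ask:
1. Does the assertion actually prove the behavior, or would it pass against a stub?
2. Is this test a duplicate of an existing one?
3. Does it test implementation details instead of observable behavior?

Report only the tests that fail these checks, with file and line references.
```

The troubleshooting subagent follows the same shape, with a prompt that insists on checking log completeness and data integrity before proposing code changes.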
Fear slows us down when we vibe code production apps
One of the things that accelerates vibe coding is accepting AI suggestions quickly (specifically, auto-approving the shell commands the agent wants to run). But many of those suggestions are commands run inside a shell with superadmin privileges and access to the internet. Even when the AI isn't doing anything malicious, I worry about a stray `rm -rf` or a `DROP DATABASE` or a `terraform apply` that destroys a folder, a Google Drive, an RDS instance, a DNS record. Nightmares abound.
This isn't hypothetical: Alexey Grigorev accidentally dropped his production RDS database while using AI tools and wrote up the full post-mortem. Amazon has called for new safeguards and review processes after AI-assisted errors in production. Research from Snyk has documented AI coding tools hallucinating entire package names that don't exist, and attackers registering those packages to exploit the gap.
Hallucinations in a local sandbox are an inconvenience. In production, they're a late night page and an embarrassing post-mortem.
What I've started doing: here are some examples of ways I've improved the safety of my work.
- Instead of executing unsafe bash commands, ask AI to write a script you can review. I find AI helpful for sysadmin/SRE work, but I have to monitor it closely (no background agents here). Watching commands scroll by is risky, so I ask the agent to write them into a script I can review first. As a bonus, I get a script that is reusable.
- Extract repeated database or log queries into a script. When I was troubleshooting customer issues, I often ran a few Postgres database queries or fetched logs related to some Lambdas. This was tedious, but I also didn't want AI running PG or AWS Lambda commands by itself. So I wrote scripts like `fetch_customer_events_pg.ts <customer-alias> <event-type>` and `fetch_customer_logs.ts <customer-alias> --start <start_time> --end <end_time>`.
- In addition to scripts, you can codify repetitive tasks as Skills and Subagents, both of which are available in Codex, Cursor, and Claude Code.
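A sketch of what such a reviewable script can look like. The table name, columns, and query shape here are assumptions for illustration, not from my actual codebase; the point is that the query is built once, parameterized, and reviewable before anything runs:

```typescript
// Hypothetical sketch of fetch_customer_events_pg.ts.
// Builds a parameterized Postgres query from CLI arguments so a human
// (or a reviewing agent) can inspect exactly what will run.

interface EventsQuery {
  text: string;
  params: string[];
}

function buildEventsQuery(customerAlias: string, eventType: string): EventsQuery {
  // $1/$2 placeholders keep the script safe from injection and make the
  // query text identical on every run, which simplifies review.
  return {
    text:
      "SELECT id, created_at, payload FROM events " +
      "WHERE customer_alias = $1 AND event_type = $2 " +
      "ORDER BY created_at DESC LIMIT 100",
    params: [customerAlias, eventType],
  };
}

// Usage: fetch_customer_events_pg.ts <customer-alias> <event-type>
const [alias, eventType] = process.argv.slice(2);
if (alias && eventType) {
  const q = buildEventsQuery(alias, eventType);
  console.log(q.text, q.params);
  // In the real script, a pg client would execute q here.
}
```

Once the query lives in a script like this, I can hand the agent read access to its output instead of letting it improvise raw SQL.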
These are some of the techniques I've used. I might elaborate on this topic in a future post. If you're interested in chatting about how I do this, DM me on LinkedIn or on X.
I ❤️ docs, but AI might just love it more (as "context")
I've hopped onto screen shares with other engineers where one of us will spot a hallucination go by mid-session. There's a brief moment of annoyance or concern, and then we keep going. Most of the time, we just let the agent continue. I've done it myself; we have more pressing tasks to finish. You see the wrong thing, you wince, and you move on because you're in flow.
This is a bigger source of inefficiency than it appears. Not everyone realizes that many hallucination patterns can be fixed with better context. (By context, I'm basically talking about code, docs and additional markdown files.) And those who do know often haven't had the time to pay attention to how context is actually structured across their tools. There's a growing stack of context layers: AGENTS.md files, skills, subagents, team-shared vs. individual context, memory architecture, code indexers, connections to databases and wikis and issue trackers.
But here's the pattern I see most often: someone sets up an `AGENTS.md` or a `.cursorrules` file when they first adopt a tool, and then never touches it again. Six weeks later the agent is hallucinating patterns you deprecated a month ago, suggesting libraries you've already replaced. Or maybe your auto-generated memory.md is outdated and no longer reflects your code's reality. The agent's context drifts from reality a little more every week, and the hallucinations compound.
This is the source of a lot of churn. When the agent doesn't know what already exists in the codebase, it reinvents. When it doesn't know your architectural patterns, it improvises. When it doesn't know what you deprecated last sprint, it resurrects it.
What I've started doing: I treat documentation as infrastructure. When agents hallucinate or reinvent something that already exists, I update the docs so it doesn't happen again. I use MCP servers to push context to my knowledge base. I run Claude Code's /context command mid-session to see how the 200K token window is being consumed, and it often exposes wasteful allocation I wouldn't have caught otherwise. It's a small amount of effort that compounds over time. If you're going to obsess over something, context hygiene has the best return on neuroticism I've found so far.
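As an illustration of what "docs as infrastructure" looks like in practice, here's the shape of an AGENTS.md section I keep current. The module names, dates, and library names are hypothetical:

```markdown
## Conventions the agent keeps getting wrong

- Deprecated (2026-02): the `legacy-http` client. Use `src/lib/api-client.ts` instead.
- All database access goes through `src/db/queries/`; never inline SQL in handlers.
- Logging: use the structured logger in `src/lib/log.ts`, not `console.log`.

Last reviewed: 2026-03-01. If this file contradicts the code, say so instead of guessing.
```

Every time an agent resurrects something deprecated, the fix is a one-line edit here rather than another correction mid-session.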
Another technique I use is to keep a plans/ folder and a docs/ folder for architecture decisions and system patterns that agents should know before generating code. Markdown plan files are still the single best thing I've done for my workflow, and the docs folder is a great supplement. Recently, Andrej Karpathy posted about the importance of "LLM Knowledge Bases". I also use Obsidian in a similar way, but I find in-repo docs more pragmatic for keeping context close to the code.
You can also layer your own custom instructions onto Codex, Cursor, and Claude Code to tune the harness beyond what your team has already configured.
AI Token anxiety is real: what if I run out of budget?
To borrow a friend's metaphor: inference tokens are like Spice in Dune. They're a substance that augments your abilities and that, once taken, you can't live without. They're also a scarce resource that takes extraordinary effort to accumulate.
I heard this from a tech lead at a large enterprise: there are technically token budgets per engineer, but they aren't being enforced yet. The current objective is to increase adoption, so for now that's not an issue. But this engineer worried about what happens when the budgets do get enforced.
The anxiety comes from multiple directions. There's the worry about rationing: how do you make sure you have enough tokens to hit your deadlines? And if an inference provider goes down mid-sprint, you're stuck without tokens or scrambling to switch to an unconfigured, unfamiliar tool.
Then we get to the opacity of pricing. I measured my Claude Code sessions over a week and found that 2 out of 5 sessions burned tokens at 2x the normal rate, with no obvious change in my behavior. In Theo's YouTube video on Claude Code's recent (March 2026) capacity reduction, his conclusion is the same as mine.
Beyond capacity allowances, a feature change from the AI labs can silently double your token costs. There's a pattern emerging across providers in early 2026: models getting more verbose, agents spawning sub-agents for simple tasks, and nobody having a baseline to tell whether their effective monthly allowance of agent-hours dropped by 10% or 50%.
I've written about this a lot, but I don't think we need to over-rotate on token reduction. I've found it helpful just to learn how token limits are enforced and how my tokens are actually being used; that alone keeps me mindful of costs. The first thing I'd recommend is understanding each tool's pricing structure and how it enforces token limits, so you can make the most of the tokens it provides (and subsidizes). There are also open source tools like ccusage that track token usage. You can also try vibe coding your own!
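If you do want to sketch your own tracker, the core of it is a small aggregation. Here's a toy version assuming a simplified usage-record shape; this is not Claude Code's actual log schema, just the idea of building a per-day baseline:

```typescript
// Toy token-usage aggregator in the spirit of ccusage.
// The record shape below is an assumption for illustration.

interface UsageRecord {
  date: string; // e.g. "2026-03-01"
  inputTokens: number;
  outputTokens: number;
}

// Sum input + output tokens per day to establish a baseline.
function tokensPerDay(records: UsageRecord[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const r of records) {
    const prev = totals.get(r.date) ?? 0;
    totals.set(r.date, prev + r.inputTokens + r.outputTokens);
  }
  return totals;
}
```

Once you have a per-day baseline, a session burning at 2x the normal rate stands out immediately instead of being a vague feeling.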
Questions I ask myself to improve my use of AI agents on production
I've found a number of techniques that have improved my workflow, but I still have so many open questions! Thinking through these questions helps me find where real improvements can be made, and they don't require adopting any new tools. I'll share them with you; hopefully they help you reflect on your own engineering work.
Are my productivity gains real or not? What did I do in the last week with AI? This is a question I ask myself regularly, because productivity gains may feel great but are actually an illusion. METR ran a randomized controlled trial where experienced developers were 19% slower with AI tools, while believing they were 20% faster. This study ran in July 2025, before the major model improvements in November 2025. Nonetheless, that perception gap is a reminder that intuition alone isn't enough.
Where am I losing time? How do I increase AI's autonomy? This goes back to my work on adversarial agents for scrutinizing code, tests, and logs. I've found that AI often churns out meaningless work or takes shortcuts. These are signs the agent isn't truly autonomous, so I troubleshoot how to increase its autonomy.
How much does fear cost me? I monitor what my agents are doing more than I probably need to. Are my colleagues even familiar with which bash commands are risky? How much collective time gets lost to hovering, second-guessing, or just not knowing whether it's safe to let the agent run? Reducing risk here feels like unlocked velocity.
What is my process to reflect on and improve the effectiveness of my agents? Right now, most of us are vibe coding and vibe evaluating. We finish a session, we feel like it went well or it didn't, and we move on. I think there's value in building a habit of structured reflection: what worked, what didn't, what would I change, and how do I share what I learned with colleagues? And those reflections belong to the team, not just your own head. There's something from Toyota's production system and from Agile retrospectives that applies here: the discipline of continuous reflection and improvement.
Vibe coding deserves more than vibe evaluation
The FOMO around AI coding skills is real. There are new tools every week, new techniques, new claims about what's possible. Most of us are figuring it out on hunches: not fully able to keep up, not clicking into AI news articles to read them in full, not totally understanding the tradeoffs, but feeling the paradigm shift happening underneath us.
I think that's fine because we're early in the technology adoption. But I also think we can do better than vibes. The engineers I see getting the most value aren't the ones with the most expensive tools or the most aggressive token spend. They're the ones building habits of honest reflection: what did I ship versus what did I generate? Where did I invest versus where did I waste time? What would I do differently next session?
Everything I've talked about here is from the engineer's perspective: what I can see, what I can measure, what I can control. But I've been having conversations with engineering managers too, and they're wrestling with a different version of the same question: how do you know your team's AI investment is paying off when you can't see inside any of these tools? That's a different problem with different constraints. More on that soon.
What are you working through? What's the question you keep asking yourself about your AI workflow? I'd genuinely like to hear: if you're wrestling with the same things, reach out on LinkedIn.
Originally published at ashu.co.



