<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Andrew Shu</title>
    <description>The latest articles on DEV Community by Andrew Shu (@0xandrewshu).</description>
    <link>https://dev.to/0xandrewshu</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3795591%2Fc6636bb4-a665-4faa-affc-56f2e4c9adce.jpg</url>
      <title>DEV Community: Andrew Shu</title>
      <link>https://dev.to/0xandrewshu</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/0xandrewshu"/>
    <language>en</language>
    <item>
      <title>Day 2 of vibe coding in production: what breaks when adoption scales</title>
      <dc:creator>Andrew Shu</dc:creator>
      <pubDate>Tue, 05 May 2026 16:08:12 +0000</pubDate>
      <link>https://dev.to/0xandrewshu/day-2-of-vibe-coding-in-production-what-breaks-when-adoption-scales-1h0g</link>
      <guid>https://dev.to/0xandrewshu/day-2-of-vibe-coding-in-production-what-breaks-when-adoption-scales-1h0g</guid>
      <description>&lt;p&gt;In &lt;a href="https://www.ashu.co/taking-vibe-coded-into-production/?utm_source=devto&amp;amp;utm_medium=syndication" rel="noopener noreferrer"&gt;Part 1&lt;/a&gt; of this series, I enumerated a few obstacles for engineers taking vibe coding from side projects to production. &lt;a href="https://www.ashu.co/ai-coding-adoption-engineering-manager/?utm_source=devto&amp;amp;utm_medium=syndication" rel="noopener noreferrer"&gt;Part 2&lt;/a&gt; looked at AI usage from the manager's perspective: measuring adoption, understanding the gap, coaching to fill the gap. Both of those were "Day 1" problems: getting started, getting people on board, figuring out the tools.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This article focuses on what comes next: the vibe coding process problems that emerge after adoption is up. I'd call them "Day 2 problems". Let's say that AI adoption is up, and code is shipping faster. Then, things start breaking in places you didn't expect. My goal is to point to specific problems that you can observe and fix.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Engineers may feel these Day 2 problems as daily friction: PRs stuck in review, surprise token bills, coming back to AI-generated code that's unrecognizable.&lt;/p&gt;

&lt;p&gt;Managers face problems more from a team and process perspective: senior engineers stuck reviewing instead of building, budget surprises, "quality" meaning different things depending on who's asking. I'll walk through what I've seen break and, for each problem, suggest an actionable starting point.&lt;/p&gt;

&lt;p&gt;Let's start with the software development lifecycle. When engineers say "coding is only one part of shipping a product to customers", what do they mean?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy4z73jr3coi77mpil5uy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy4z73jr3coi77mpil5uy.png" alt="High level visualization of the next few sections, showing AI process challenges " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Vibe coding code review: the first bottleneck you'll hit
&lt;/h2&gt;

&lt;p&gt;Code review is often where AI's impact is immediately noticeable. It's a bottleneck on implementation (code generation), because most organizations require code reviews for quality and security reasons.&lt;/p&gt;

&lt;p&gt;Code generation is much faster: output can easily reach thousands of lines per day per engineer, if not orders of magnitude more. That means more commits, more PRs, more lines of code landing in review queues. Code review was already a chore for many engineers, and AI has compounded the problem.&lt;/p&gt;

&lt;p&gt;I spoke with a tech lead at a large enterprise who said members of his team had started distrusting AI because of the quality of code coming through in PRs. Not because AI can't write decent code, but because engineers were submitting AI output without reviewing it themselves first. The PR became the first time anyone looked critically at what the agent produced.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Senior engineers face a practical question here: how much should they comb over each line of code the way they used to? When a PR is 5,000 lines of AI-generated code, a line-by-line review is time consuming. But skimming feels irresponsible.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frxan34oh79k3o1arajzq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frxan34oh79k3o1arajzq.png" alt="Visualization of challenges with reviewing a higher volume of AI-generated code." width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So what can you do about it?&lt;/p&gt;

&lt;p&gt;Think about your CI/CD pipeline and what parts of the review can be automated.&lt;/p&gt;

&lt;p&gt;Luckily, AI code review tools like CodeRabbit, Greptile, Cursor's Bugbot, or Anthropic's Claude Code review catch a lot of the surface-level issues: style, obvious bugs, missing tests. These don't replace human review, but they reduce the surface area your senior engineers need to cover manually.&lt;/p&gt;

&lt;p&gt;When using AI code review tools, engineers I've spoken to have reported good findings, but also a lot of false positives. It can be helpful to coach early-career engineers to spot the false positives and explain why they're not a problem, or why they're an acceptable risk.&lt;/p&gt;

&lt;p&gt;Another idea, more from the process side: ask authors to review &lt;strong&gt;their own&lt;/strong&gt; PRs &lt;strong&gt;before&lt;/strong&gt; sending a review request to someone else. In other words, a pre-review review. "Ease of review" and "quality of code" are still the author's responsibility: they reflect on the author's engineering skills, regardless of whether AI wrote the first pass. If your team doesn't have that norm yet, it's a good time to consider setting it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The upstream AI coding bottleneck: issue tracking
&lt;/h2&gt;

&lt;p&gt;Code review is the downstream constraint of generating more code. But there's an upstream one too in the planning phase: ticketing (e.g. in JIRA, Linear or GitHub Issues).&lt;/p&gt;

&lt;p&gt;Upstream, we have the work that happens before anyone writes a line: ticket creation, design conversations, bug reproduction, requirements gathering, stakeholder alignment. None of that got faster when you adopted AI coding tools.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Vague tickets slow down development because engineers have to ask clarifying questions. Delays add up when clarifying takes multiple back-and-forths. Clear acceptance criteria, reproduction steps, and system context help engineers get work done faster.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And it's not just ticket creation. Think about communication and "paperwork" across the whole software development lifecycle. Status updates, stakeholder check-ins, handoff notes, design docs: all the connective tissue that keeps a team aligned. They're not what we traditionally picture when we talk about accelerating engineers who are vibe coding, but they are common time sinks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.ghost.io%2Fc%2F88%2F32%2F88324540-e2fb-4090-9a02-e7ad52675f91%2Fcontent%2Fimages%2Fsize%2Fw1000%2F2026%2F05%2FAI-Utility-Productivity---Issue-Tracking-Challenges-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstorage.ghost.io%2Fc%2F88%2F32%2F88324540-e2fb-4090-9a02-e7ad52675f91%2Fcontent%2Fimages%2Fsize%2Fw1000%2F2026%2F05%2FAI-Utility-Productivity---Issue-Tracking-Challenges-1.png" alt="Visualization of bottlenecks caused by issue tracking, and how to use AI to unblock them." width="800" height="261"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What can be done about issue tracking and requirements gathering?&lt;/p&gt;

&lt;p&gt;Here are a few things I've been experimenting with: You can build a Cursor or Claude skill that pulls a ticket from your issue tracker (JIRA, Linear, whatever you use) via an MCP server and runs it through a series of quality checks. Does the ticket have a clear objective? Clear requirements? Business impact? Stakeholder named? If it's a bug, does it have steps to reproduce? If it's incomplete, the tool can automatically flag the gaps and notify the stakeholder. This takes an afternoon to set up and it pays for itself within the first sprint.&lt;/p&gt;
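
&lt;p&gt;As a rough illustration, the quality checks could look like the sketch below. It assumes the ticket has already been fetched from the tracker (the MCP wiring is omitted), and the field names and check rules are hypothetical placeholders you'd adapt to your own tracker:&lt;/p&gt;

```python
# Sketch of a ticket quality gate. Assumes the ticket dict was already
# pulled from the tracker (e.g. via an MCP server); field names are
# hypothetical placeholders, not any tracker's real schema.

REQUIRED_CHECKS = {
    "objective": lambda t: bool(t.get("description", "").strip()),
    "acceptance criteria": lambda t: "acceptance" in t.get("description", "").lower(),
    "stakeholder": lambda t: bool(t.get("reporter")),
}

BUG_CHECKS = {
    "steps to reproduce": lambda t: "reproduce" in t.get("description", "").lower(),
}

def audit_ticket(ticket):
    """Return the list of quality gaps to flag back to the stakeholder."""
    checks = dict(REQUIRED_CHECKS)
    if ticket.get("type") == "bug":
        checks.update(BUG_CHECKS)
    return [name for name, passes in checks.items() if not passes(ticket)]

ticket = {"type": "bug", "description": "Login fails sometimes", "reporter": "PM"}
print(audit_ticket(ticket))  # prints ['acceptance criteria', 'steps to reproduce']
```

&lt;p&gt;The flagged gaps can then be handed to an agent to draft a clarifying comment for the stakeholder, or used to block a "ready for work" status transition.&lt;/p&gt;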

&lt;p&gt;Before an engineer works on a ticket, you could take the description of a problem, and perform automated research on that ticket: in the codebase, in the database, or in a browser to explore the UI. If there is a description of a bug, the automation could verify that the description can be observed easily, and potentially take screenshots.&lt;/p&gt;

&lt;p&gt;But beyond initial ticket creation, how can you speed up feedback cycles: helping engineers act on tickets, and reducing the paperwork afterwards?&lt;/p&gt;

&lt;p&gt;You can create CLI tools / desktop applications that help engineers package up their progress (git commits), findings (command line output, screenshots, summaries) and attach them back to the ticket. It sounds small, but reducing the friction of sharing blockers and getting feedback keeps the pipeline moving. The gains from AI coding don't fully materialize if the non-coding parts of your process stay manual.&lt;/p&gt;
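
&lt;p&gt;A minimal sketch of that packaging step, assuming a plain git checkout; the call that would post the comment back to the tracker is hypothetical and depends on your tracker's API:&lt;/p&gt;

```python
import subprocess

def run_git(args):
    """Run a git command and return its trimmed stdout."""
    out = subprocess.run(["git"] + args, capture_output=True, text=True)
    return out.stdout.strip()

def format_update(commits, stats):
    """Assemble a tracker comment from a commit log and a diff summary."""
    return "Progress update:\n" + commits + "\n\nChanged files:\n" + stats

def collect_progress(base_branch="main"):
    """Package up branch progress relative to base_branch."""
    commits = run_git(["log", "--oneline", base_branch + "..HEAD"])
    stats = run_git(["diff", "--stat", base_branch])
    return format_update(commits, stats)

# post_comment(ticket_id, collect_progress()) would push this to the
# tracker; that function is hypothetical and tracker-specific.
```

&lt;p&gt;The same wrapper can attach command output or screenshot paths; the point is to make "share where I'm at" a one-command action.&lt;/p&gt;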

&lt;h2&gt;
  
  
  Vibe coding code quality: duplication and maintainability
&lt;/h2&gt;

&lt;p&gt;AI ships duplicate code constantly. Instead of reusing existing modules or reaching for community packages, agents tend to reimplement. I've watched Claude Code write a date parsing utility from scratch in a codebase that already had three date parsing utilities (all also written by Claude Code in previous sessions). The agent didn't know they existed, because the context window didn't include them, and nobody had documented the pattern.&lt;/p&gt;

&lt;p&gt;You need awareness and diligence to notice the duplicates and circle back to clean them up. And even then, I forget half the time.&lt;/p&gt;

&lt;p&gt;This matters more than it might seem. Code duplication bloats the codebase and slows builds. When there are duplicates, it's harder to fix bugs: you patch one copy and the other three still have the vulnerability. Security patches need to be applied in multiple places. The codebase quietly gets worse while the velocity numbers look great.&lt;/p&gt;

&lt;p&gt;GitClear's &lt;a href="https://www.gitclear.com/ai_assistant_code_quality_2025_research" rel="noopener noreferrer"&gt;2025 study&lt;/a&gt; analyzed 211 million changed lines across repos from Google, Microsoft, Meta, and enterprises across 2020–2024. This covers the early AI adoption era. Code churn (new code revised within two weeks) roughly doubled from 3.1% to 5.7%. Copy/pasted code rose from 8.3% to 12.3%. Refactoring dropped from about 25% to under 10% of changed lines. The code ships faster, but doesn't age well.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Sidebar: I don't see code churn discussed much, but I'd love to see more research on potential impacts on maintainability. For folks vibe coding, seeing a "+2k / -2k lines of code" change is pretty common. What worries me is the &lt;strong&gt;impact of continuous churning&lt;/strong&gt; of code (and tests) over time. Subtle bugfixes and "matured" code don't survive that kind of constant rewriting.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ynib96ku8fqmns3lki0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ynib96ku8fqmns3lki0.png" alt="Code quality has always been challenging, but AI-generated code imposes new ones." width="800" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A few ideas on what to do for code quality:&lt;/p&gt;

&lt;p&gt;In the code review section above, I mentioned CI/CD improvements for review. For maintainability specifically, look for tools that measure test coverage, code duplication, and code complexity at the repository level; not just at the PR level. PR reviews catch incremental issues, but as changes accumulate, you want a broader snapshot.&lt;/p&gt;

&lt;p&gt;But it's not just 3rd party tools. Can you create hooks that run as part of a code review check, helping engineers detect duplicate code? They're straightforward to build. For example: a Skill or Subagent that scans for existing implementations before the agent writes a new one. The question is when engineers run this so they don't forget. A git hook, or a preprocessing step before the PR is submitted, works; the mechanism matters less than making it automatic.&lt;/p&gt;
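
&lt;p&gt;As one possible shape for such a check, here's a small sketch using only Python's standard library to compare new code against existing functions. A real setup would more likely use a dedicated duplicate detector, and the similarity threshold here is an arbitrary starting point:&lt;/p&gt;

```python
import ast
import difflib
import operator
from pathlib import Path

def function_sources(path):
    """Map function names to their source text in one Python file."""
    src = path.read_text()
    return {
        node.name: ast.get_source_segment(src, node)
        for node in ast.walk(ast.parse(src))
        if isinstance(node, ast.FunctionDef)
    }

def find_near_duplicates(repo_dir, new_code, threshold=0.8):
    """Flag existing functions whose bodies closely match new_code."""
    hits = []
    for path in Path(repo_dir).rglob("*.py"):
        for name, src in function_sources(path).items():
            ratio = difflib.SequenceMatcher(None, src, new_code).ratio()
            if operator.ge(ratio, threshold):  # ratio at or above threshold
                hits.append((str(path), name, round(ratio, 2)))
    return hits
```

&lt;p&gt;Wired into a pre-commit hook or a PR check, any hit above the threshold becomes a prompt to reuse or consolidate rather than re-implement.&lt;/p&gt;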

&lt;p&gt;OK, let's switch out of the "development" dimension of software and talk about its "operational" dimensions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vevtc4zpiox2e2rvqmx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vevtc4zpiox2e2rvqmx.png" alt="High level visualization describing software " width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Vibe coding quality: code-level vs customer-facing
&lt;/h2&gt;

&lt;p&gt;Code maintainability is one kind of quality engineers care about. The other kind is customer-facing quality, and that's what keeps all of us employed.&lt;/p&gt;

&lt;p&gt;A manager I interviewed at a Fortune 500 company distilled their AI adoption objectives into two themes: "velocity" and "quality." When I pressed on what "quality" meant, it was clear they meant product uptime and customer-facing incidents. Not code complexity. Not test coverage.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This is typically what executives mean by "quality." If your engineering dashboards show code metrics and leadership means production stability, you're measuring two different quality layers. Clarify what quality means: the disconnect is more common than you'd think.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The &lt;a href="https://dora.dev/research/2024/dora-report/" rel="noopener noreferrer"&gt;DORA 2024 report&lt;/a&gt; found that for every 25% increase in AI adoption, delivery stability decreased 7.2%. Their &lt;a href="https://dora.dev/research/2025/dora-report/" rel="noopener noreferrer"&gt;2025 follow-up&lt;/a&gt; added nuance: "[AI] shines a light on what's working, accelerating what's already in motion, but it also surfaces what needs to change." Strong teams with good practices benefited from AI. Struggling teams faced greater challenges. If your delivery pipeline had cracks before AI, AI adoption widened them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxvniog8swihhdmclyg3x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxvniog8swihhdmclyg3x.png" alt="Above we talk about code quality. That's typically internal. Here we visualize " width="800" height="615"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What you can do:&lt;/p&gt;

&lt;p&gt;Use your issue tracking system alongside git to track quality of AI-assisted versus non-AI code. Git commits are increasingly labeled with AI tool footers (e.g., &lt;code&gt;Co-Authored-By: Claude Opus 4.5&lt;/code&gt;). You could create a CI check requiring all commits to carry this footer; even manually-written ones should be explicitly "human." It's a small discipline, but it makes the data traceable.&lt;/p&gt;
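
&lt;p&gt;A sketch of what that CI check could look like; the exact trailer strings are a team convention you'd define, not a standard:&lt;/p&gt;

```python
import subprocess

def missing_attribution(messages):
    """Return commit messages lacking an authorship trailer.

    Assumes a team convention: AI commits carry the tool's
    Co-Authored-By footer, and manual commits add 'Authored-By: human'.
    The trailer names are a team choice, not a git standard.
    """
    trailers = ("Co-Authored-By:", "Authored-By: human")
    return [m for m in messages if not any(t in m for t in trailers)]

def check_branch(base="main"):
    """Fail CI if any commit on the branch is missing attribution."""
    log = subprocess.run(
        ["git", "log", "--format=%B%x00", base + "..HEAD"],
        capture_output=True, text=True,
    ).stdout
    bad = missing_attribution([m for m in log.split("\x00") if m.strip()])
    if bad:
        raise SystemExit(f"{len(bad)} commit(s) missing AI/human attribution")
```

&lt;p&gt;Once the trailer is reliable, you can join commit data against incident data to compare AI-assisted and human-written changes.&lt;/p&gt;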

&lt;p&gt;In the issue tracker, find ways to link customer-reported issues to the candidate commits that had problems. Remember the blameless post-mortem: you're linking to the problematic code change, not to a specific person.&lt;/p&gt;

&lt;p&gt;And add labels, or another categorization scheme, that can differentiate customer-reported issues from internally-found ones. You'll catch many internal issues that customers may never notice, so it helps to keep explicitly customer-impacting issues as the priority.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security: more code, more surface, smarter adversaries
&lt;/h2&gt;

&lt;p&gt;Security shares some DNA with code quality, but it's a different domain and has much higher stakes.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Here are some things I think about from an engineer's perspective: More lines of code with less human understanding means the attack surface is evolving. AI agents act on your behalf with your permissions and credentials. Hallucinations in development environments can cause real damage, not just in production. And vulnerabilities ship faster than before.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Research confirms that LLM-generated code will include vulnerabilities. &lt;a href="https://arxiv.org/abs/2404.18353" rel="noopener noreferrer"&gt;Tihanyi et al. (2024)&lt;/a&gt; analyzed 331,000 C programs across 9 LLMs (e.g. GPT-4o-mini, Gemini Pro, Code Llama, and others) and found 62%+ contained vulnerabilities, with minimal differences between models. The problem isn't a bad model. It's that code generation at scale produces vulnerabilities at scale. It might be &lt;a href="https://arxiv.org/abs/2204.04741" rel="noopener noreferrer"&gt;better than humans&lt;/a&gt;, but if code gen is accelerating, then vulnerabilities will scale linearly too.&lt;/p&gt;

&lt;p&gt;And from the other side, the window from vulnerability disclosure to active exploitation is &lt;a href="https://www.infosecurity-magazine.com/news/exploitation-accelerates-in-2025/" rel="noopener noreferrer"&gt;compressing from 8.5 to 5 days on average&lt;/a&gt;. AI-assisted cyber attacks &lt;a href="https://deepstrike.io/blog/ai-cyber-attack-statistics-2025" rel="noopener noreferrer"&gt;rose 72% in 2025&lt;/a&gt;. More code, faster attackers, cheaper discovery. It's a scaling problem, not a skill problem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fehhq5hwhmzphs3c0og2n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fehhq5hwhmzphs3c0og2n.png" alt="Highlighting a few security implications of vibe coding." width="800" height="524"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What can you do along the security front?&lt;/p&gt;

&lt;p&gt;I'd start by adding security monitoring to your CI/CD. Linting, SAST, supply chain scanning, secrets scanning: tools like Semgrep or Snyk, or open source alternatives. Code review bots include security checks as well. And the standard practices still apply: periodic auditing, security considerations early in project planning, security checks woven into the review process. Defense in depth, with the "depth" updated for a world where agents generate code faster than humans can review it.&lt;/p&gt;
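
&lt;p&gt;To make the secrets-scanning idea concrete, here's a toy sketch of the mechanism. Real scanners ship hundreds of tuned rules, so treat these two patterns as purely illustrative:&lt;/p&gt;

```python
import re

# Illustrative patterns only; dedicated scanners cover far more rule
# types (entropy checks, provider-specific formats, history scanning).
SECRET_PATTERNS = {
    "AWS access key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "generic API key": re.compile(
        r"api[_-]?key\s*[:=]\s*['\"][A-Za-z0-9]{20,}['\"]", re.I
    ),
}

def scan_text(text):
    """Return the secret types detected in a blob of text."""
    return [name for name, pat in SECRET_PATTERNS.items() if pat.search(text)]
```

&lt;p&gt;Run over a diff in CI, a non-empty result fails the build before the secret ever lands in history.&lt;/p&gt;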

&lt;p&gt;I would also update your "least privilege" access controls for the agentic world. To get work done, I have to grant agents control of certain tools and infrastructure, and I always worry about how much unintended damage that could cause.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Sidebar: I find that "isolation" is a theme I think a lot about when it comes to improving AI security. How do you isolate your AI agent from your secrets (but give it some access)? From destroying files in your filesystem? From other computers in your network? I think that techniques like containerization (docker), jails, firewalling, splitting identities/credential access into more granular chunks, will be fruitful here.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Humans are better trained than agents at knowing which lines not to cross, so it makes sense to scope agents more tightly than human developers. Agents are also more numerous and shorter-lived. Think about how to generate lightweight, temporary permissions rather than sharing your personal credentials.&lt;/p&gt;

&lt;p&gt;A concrete example of agents doing something sketchy: your &lt;code&gt;.env&lt;/code&gt; file getting read by an agent and shipped up in an AI-generated bug report, or used in an unintended API call. The kind of thing you only laugh about months later.&lt;/p&gt;

&lt;p&gt;Another example: an agent inheriting your admin role, hallucinating, and taking a destructive action with permissions it should never have had.&lt;/p&gt;

&lt;p&gt;Use vaults and password managers to reduce agent access. Add degrees of isolation between "write access" and "read-only access." Isolate production from development environments. Wrap binaries and filesystem folders in containers, jails, or VMs to constrain blast radius.&lt;/p&gt;

&lt;p&gt;This is by no means a complete list. It merely highlights some of the security risks AI is introducing, as they're being discussed in engineering communities, and suggests some starting points for using and extending existing tooling.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I didn't cover (and why it matters)
&lt;/h2&gt;

&lt;p&gt;There are a few topics I can't go deep on here, but they're worth flagging.&lt;/p&gt;

&lt;p&gt;Design documents are evolving. In order to write more code thoughtfully, teams are producing more design docs. But the tech lead I mentioned earlier noticed they're becoming more generic: the same structure, the same level of detail, the same boilerplate that suggests an AI wrote the first draft and nobody pushed it further. ("Slop" is a useful description here, not to disparage the authors, but to describe the "averaging" effect of LLM-generated prose.) Design docs are supposed to force you to think through the hard parts before coding. If AI is writing them and humans are rubber-stamping them, we've lost the "thinking" and "intentionality" of designing solutions that actually fit the problem.&lt;/p&gt;

&lt;p&gt;Operating code in production is another one, but I've covered it in &lt;a href="https://www.ashu.co/taking-vibe-coded-into-production/?utm_source=devto&amp;amp;utm_medium=syndication" rel="noopener noreferrer"&gt;Part 1 of this series&lt;/a&gt;. As you develop more code, you have to maintain it: deploy it, monitor it, troubleshoot it, patch it. How to enable your repositories and infrastructure to let AI help with operations safely is a separate conversation, and it's one I'm spending a lot of time on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this leaves us
&lt;/h2&gt;

&lt;p&gt;I've been thinking a lot about what a real measurement layer for AI coding should look like, and what kinds of insights it should surface. More on that soon.&lt;/p&gt;

&lt;p&gt;Underneath all of these practices, there's a thread that keeps surfacing in every conversation I have: continuous learning. It's a classic idea: Agile retrospectives and Toyota's production system have embodied it for decades. But it feels newly urgent when the tools and practices are changing this fast. Engineers and managers can't keep up with the rate at which new research appears, and intentional practice helps.&lt;/p&gt;

&lt;p&gt;I'm collecting stories from engineers and managers working through the post-adoption phase. If you've hit the review bottleneck, had concerns about code quality and security, or if you've found something that works, I'd like to hear from you. You can find me on &lt;a href="https://www.linkedin.com/in/0xandrewshu/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or &lt;a href="https://x.com/0xAndrewShu" rel="noopener noreferrer"&gt;X&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.ashu.co/vibe-coding-process-problems/?utm_source=devto&amp;amp;utm_medium=syndication" rel="noopener noreferrer"&gt;ashu.co&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>vibecoding</category>
      <category>ai</category>
      <category>security</category>
      <category>devops</category>
    </item>
    <item>
      <title>Measuring AI coding adoption: What I learned as a manager</title>
      <dc:creator>Andrew Shu</dc:creator>
      <pubDate>Mon, 27 Apr 2026 19:38:16 +0000</pubDate>
      <link>https://dev.to/0xandrewshu/measuring-ai-coding-adoption-what-i-learned-as-a-manager-51kk</link>
      <guid>https://dev.to/0xandrewshu/measuring-ai-coding-adoption-what-i-learned-as-a-manager-51kk</guid>
      <description>&lt;p&gt;Let's say you're an engineering manager, and you're participating in an organization-wide campaign to increase the adoption of AI. The initial goal is to get people to use the AI tools.&lt;/p&gt;

&lt;p&gt;That's easy: hand out licenses to your team for Cursor, Claude, or GPT. Congratulations! What's next?&lt;/p&gt;

&lt;p&gt;Maybe you're among the early adopters in your company, or maybe the initiative originated elsewhere. Senior leaders across companies are investing in AI because they see the potential of greater velocity and more expansive creativity. But they're also responding to pressure from &lt;em&gt;their&lt;/em&gt; stakeholders: board members asking for slides about AI strategy (both internal and customer adoption), investors who expect adoption to keep up with current startup pace, competitors making the same bets. They need to show these investments are paying off.&lt;/p&gt;

&lt;p&gt;Meanwhile, these AI tools cost real money. &lt;strong&gt;I've spoken to startups that are spending $1,500 / month / engineer&lt;/strong&gt; as they seek to understand the new paradigms of coding and insights for building leaner. This is a major step above startups that were previously spending $300 - $500 / month / engineer. &lt;strong&gt;Even for enterprises that spend $5,000 / month / engineer, adding $1,500 / month would be a big leap in investment.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In March 2026, Jensen Huang (CEO of Nvidia) said that senior Nvidia engineers earning &amp;gt; $500k in salary &lt;a href="https://www.businessinsider.com/jensen-huang-500k-engineers-250k-ai-tokens-nvidia-compute-2026-3" rel="noopener noreferrer"&gt;should be consuming well over $250k of tokens per year&lt;/a&gt;. This sort of paradigm shift hasn't rolled out broadly in the industry. But talking to folks on different teams and seeing my own usage, I don't think it would be difficult for engineers to spend $1k / month ($12k / year).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For that kind of cost, it's important to know what value your team is getting and to optimize the organization's usage to make the most out of that money. But what's the outcome? Should engineers &lt;a href="https://www.nytimes.com/2026/03/20/technology/tokenmaxxing-ai-agents.html" rel="noopener noreferrer"&gt;max out tokens&lt;/a&gt;? Or max out lines of code? It's a new paradigm, and you're trying to make sense of it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In &lt;a href="https://www.ashu.co/taking-vibe-coded-into-production/?utm_source=devto&amp;amp;utm_medium=syndication" rel="noopener noreferrer"&gt;Part 1 of this series&lt;/a&gt;, I wrote about the rough edges of vibe coding from the engineer's perspective: things in production that slow down engineers when you move AI coding from side projects to production.&lt;/p&gt;

&lt;p&gt;This article tells the story from the managers' perspective, based on conversations with engineering managers and my own experience. What does it actually take to drive real AI adoption on an engineering team?&lt;/p&gt;

&lt;h2&gt;
  
  
  How do you actually measure adoption of AI coding?
&lt;/h2&gt;

&lt;p&gt;Managers I've spoken to say that they're being measured by senior leadership based on the number of licenses they've distributed (or not distributed) to their team. They're counting PRs and lines of code, and doing qualitative surveys of team members to gauge AI adoption. But these don't tell you whether adoption is meaningful to the business and customer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Research suggests that bulk license distribution won't lead to actual usage.&lt;/strong&gt; &lt;a href="https://visualstudiomagazine.com/articles/2025/09/17/report-github-tops-ai-coding-assistants-with-microsoft-related-cautions.aspx" rel="noopener noreferrer"&gt;Gartner&lt;/a&gt; found that often fewer than a third of purchased licenses see active use after several months. The &lt;a href="https://survey.stackoverflow.co/2025/ai/" rel="noopener noreferrer"&gt;2025 Stack Overflow Developer Survey&lt;/a&gt; tells a similar story: 81% of professional developers surveyed are using (or are planning to use) AI, but 41.4% of professional developers believed that AI struggled with complex tasks. (Note: that 41.4% level dropped since the previous year, but is still high.)&lt;/p&gt;

&lt;p&gt;Even with usage, I've found that the volume of AI tool utilization and the variety of techniques in regular use are uneven across organizations, and even within a single team. So even with licenses distributed, uneven training is a distinct challenge.&lt;/p&gt;

&lt;p&gt;So how can we assess adoption?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4oh8xzftgcp6yzcsg3vj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4oh8xzftgcp6yzcsg3vj.png" alt="Observe team AI adoption" width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a practical experiment, try speaking with a sample of the team in your 1:1s about how they use AI, and perhaps collaborate with them on a project. &lt;strong&gt;It's more effective to see how your team uses AI (e.g., during a demo, presentation, or pair coding) than to verbally poll them.&lt;/strong&gt; That way you observe how they're actually using AI, rather than getting a simple yes/no answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Are there more concrete metrics of adoption?
&lt;/h2&gt;

&lt;p&gt;I also find it helpful to use a quick but highly flawed (and highly contentious) metric: &lt;strong&gt;token spend&lt;/strong&gt;. This is the theoretical cost of the tokens a person consumes, often subsidized under a monthly license or an enterprise agreement.&lt;/p&gt;

&lt;p&gt;Here's a rough rule of thumb for token spend: state-of-the-art models like Claude Opus or GPT-5 can easily rack up $100/day of tracked token cost under heavy use (not necessarily real dollars spent). Folks who aren't past the $20/month base subscription tier are likely not yet "vibe coding". That's not bad; it's simply a metric to ballpark usage volume.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Maximizing token usage is not an end in and of itself.&lt;/strong&gt; Token spend is a cheap, weak signal that can be gamed, and optimizing for high-cost workloads is itself wasteful. But it's available right now without additional tooling, and I've found it helpful for ballparking my own style of adoption. When I'm in a totally different ballpark from someone else, that's a signal to ask why.&lt;/p&gt;
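&lt;p&gt;To make that ballpark concrete, here's a minimal sketch of the arithmetic, using hypothetical per-million-token prices (real rates vary by model and provider, and change often):&lt;/p&gt;

```python
# Ballpark daily token spend. The per-million-token prices below are
# hypothetical placeholders, not any provider's actual rates.
PRICE_PER_MTOK = {"input": 15.0, "output": 75.0}  # USD per million tokens

def daily_token_spend(input_tokens: int, output_tokens: int) -> float:
    """Theoretical cost of one day's tracked usage, in USD."""
    return (
        input_tokens / 1_000_000 * PRICE_PER_MTOK["input"]
        + output_tokens / 1_000_000 * PRICE_PER_MTOK["output"]
    )

# A heavy day: 4M input tokens, 600K output tokens.
print(daily_token_spend(4_000_000, 600_000))  # prints 105.0
```

&lt;p&gt;Swap in your own provider's published rates; the point is only that a few minutes of arithmetic tells you whether someone's usage is in the "$20/month subscription" range or the "$100/day heavy use" range.&lt;/p&gt;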

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5d6lo7fwg88g0uib7cqo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5d6lo7fwg88g0uib7cqo.png" alt="Concrete observations of AI usage" width="800" height="459"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are a few other metrics worth considering, other than token costs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lines of Code.&lt;/strong&gt; This is another deeply flawed metric (and famously so), but it's worth watching because it has real implications for code reviews. When PRs grow from hundreds of lines to thousands or tens of thousands, that hints at changing AI adoption styles, and it should also prompt questions about the quality of code reviews.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Maturity of AI configs, context and tooling (e.g. &lt;code&gt;agents.md&lt;/code&gt;).&lt;/strong&gt; These are typically markdown files shared between engineers in the repository, or instructions and documentation in your team's wiki or knowledge base (to find relevant docs, search for "Claude", "Cursor", or "Codex").&lt;/p&gt;

&lt;p&gt;Maturity of AI configs is perhaps the most interesting sign of usage, because it shows AI being used and customized. Is your team using Skills, Subagents, Agent Teams, Automation like Routines? How often are these configurations being customized? These configurations fall out of date, so teams regularly using AI are likely to be measuring and tuning their AI configurations and documentation.&lt;/p&gt;
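&lt;p&gt;As a purely hypothetical illustration (the file sections, commands, and deprecations below are invented), a maturing &lt;code&gt;agents.md&lt;/code&gt; accumulates team-specific detail that a stock install never has:&lt;/p&gt;

```markdown
# agents.md (illustrative excerpt)

## Build & test
- Run `make test` before proposing a commit; never push directly to main.

## Conventions
- Use the repo's `db/queries/` helpers; do not write raw SQL in handlers.

## Deprecated (do not suggest)
- `legacy_client.py` was replaced by `api_client_v2.py` in Q3.
```

&lt;p&gt;A file like this with recent commit history is a stronger adoption signal than license counts: someone hit a real problem, fixed it, and shared the fix.&lt;/p&gt;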

&lt;h2&gt;
  
  
  Understand your team's perspectives on AI adoption before acting
&lt;/h2&gt;

&lt;p&gt;To return to the topic of polling and observing your team in action: your team may have legitimate concerns blocking adoption, including security, code quality, and the fact that AI's benefits aren't evenly distributed across experience levels (more on all of these below).&lt;/p&gt;

&lt;p&gt;Some senior or SRE engineers have told me that their work involves precision, complexity, or high risk, and that AI is an unacceptable risk there. Or maybe the team is too busy to try out that new tool, and they need someone to be brave and test it out first in their environment.&lt;/p&gt;

&lt;p&gt;Before pushing to increase adoption or spend, talk to the team.&lt;/p&gt;

&lt;p&gt;This is the step that is easiest to skip over. It's tempting to see low utilization and lean into the instinct to push harder: more training, more encouragement, more tooling. But the question is about workflow, mindset, and preferences, and the answer will differ per team.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Sidebar: It's also valuable for you to try out different AI workflows. Some of the managers I've spoken to are themselves skeptical about AI. It's valuable to suspend your disbelief for a few days and experiment. Have some fun with it; play a bit like you got some shiny new gear and you can build whatever silly thing you've been meaning to for a while. Greenfield projects, CLI scripts, or small bugs are great starting points. Try out a &lt;a href="https://www.ashu.co/markdown-plan-files-vibe-coding/?utm_source=devto&amp;amp;utm_medium=syndication" rel="noopener noreferrer"&gt;markdown plan&lt;/a&gt;, and play around with easy &lt;a href="https://www.ashu.co/parallel-claude-code-agents/?utm_source=devto&amp;amp;utm_medium=syndication" rel="noopener noreferrer"&gt;parallelism&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;How you engage with your team depends on where you sit. As a line manager, you're close enough for 1:1s and small team discussions. Ask to pair code directly. Ask people what's working and not working in their &lt;code&gt;CLAUDE.md&lt;/code&gt; or &lt;code&gt;.cursor/rules/*&lt;/code&gt;. Ask what they reverted last week due to AI. The goal is to identify specific concerns, pushback, and knowledge gaps: not to audit or blame.&lt;/p&gt;

&lt;p&gt;If you're a senior manager, you may need to shift organizational momentum more broadly: clear communication about organizational objectives, setting metrics to measure progress, funding training (both money and time), or setting explicit norms that AI tool usage is supported and expected, and reflecting on which metrics helped and didn't. This is a different communication problem than a 1:1.&lt;/p&gt;

&lt;p&gt;Listen for objections and disagreements: they're valuable signals. Engineers who say "AI is inconsistent" aren't wrong. I've measured token consumption that &lt;a href="https://www.linkedin.com/posts/0xandrewshu_fascinating-saturday-i-measured-that-activity-7439405612087635968-AseS" rel="noopener noreferrer"&gt;varied by 2x session to session&lt;/a&gt; for identical work. Harnesses regress. (By harness, I mean the tool that wraps the model, such as Claude Code wrapping the Claude models.) Prompts that worked last week hallucinate this week. Take these concerns seriously.&lt;/p&gt;

&lt;h2&gt;
  
  
  The part nobody budgets for: coaching for AI coding
&lt;/h2&gt;

&lt;p&gt;After I recognized these signals, I started testing these ideas on my teams, among friends, and with strangers I met. I realized that the gap often wasn't tools or licenses: it was listening, persuasion, and coaching.&lt;/p&gt;

&lt;p&gt;I spoke with a number of skeptics, but there were also a lot of folks who wanted to vibe code more. Quite often, they didn't have the time to keep up with the firehose of new information. Another frequent concern was &lt;a href="https://www.ashu.co/taking-vibe-coded-into-production/?utm_source=devto&amp;amp;utm_medium=syndication" rel="noopener noreferrer"&gt;vibe coding in production&lt;/a&gt; environments, or in local environments with permission systems (e.g. credentials for AWS, SSH keys to machines set up for thoughtful humans). Specifically, I'm referring to the adoption of "hands-free" vibe coding, not AI coding where engineers are manipulating code or commands themselves.&lt;/p&gt;

&lt;p&gt;Moreover, adoption wasn't uniform. Some engineers had been happily vibe coding, and I spoke to them to see what worked. There was a spectrum of skeptics and aficionados, and the knowledge needed help spreading faster. Research published in &lt;em&gt;Science&lt;/em&gt; (&lt;a href="https://www.science.org/doi/10.1126/science.adz9311" rel="noopener noreferrer"&gt;Daniotti et al., 2026&lt;/a&gt;) found that AI productivity gains (more commits, broader library use, exploration of new functionality) accrued mostly to experienced developers, with early-career engineers showing no statistically significant benefit.&lt;/p&gt;

&lt;p&gt;Other studies, like &lt;a href="https://pubsonline.informs.org/doi/10.1287/mnsc.2025.00535" rel="noopener noreferrer"&gt;Cui et al. (2026)&lt;/a&gt;, found the opposite in controlled corporate settings: less experienced developers benefited more. The takeaway for managers isn't that one study is right and the other wrong: it's that the gains aren't uniform, and that deserves special consideration when you're planning training, setting expectations, and measuring progress.&lt;/p&gt;

&lt;p&gt;At the time, I was specifically interested in how to do DevOps / operational / maintenance work safely. So I thought through what kinds of tedious tasks people would like to do less of, filtered out risky operations, and then built starter configuration files, subagents, and shell scripts. (I'll elaborate in my next post.)&lt;/p&gt;

&lt;p&gt;After posting about it, and sharing it in team meetings, I realized it required more active persuasion (as opposed to passive announcement). So, as one does, I switched from optional knowledge-sharing sessions to more proactive 1:1s, and team calls.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fazy04qv29asepup7o8ee.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fazy04qv29asepup7o8ee.png" alt="AI playbook: investigate production bugs" width="800" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I also tried another playbook: to build confidence among skeptics, I troubleshot production issues in parallel with other engineers troubleshooting those same issues, constraining myself to mostly-autonomous AI agents (equipped with a context system). I accumulated 2-3 concrete examples where AI can help engineers step into unfamiliar parts of the codebase to troubleshoot an on-call situation. This helped spark ideas for methods and techniques, and overcame some mental blocks.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Two things helped most when conducting knowledge-sharing sessions in my most recent campaign: hinting at how AI could be used more safely in production, and teasing out specific concerns that engineers had but hadn't voiced yet.&lt;/p&gt;

&lt;p&gt;When you see lightbulbs go off, it's incredibly rewarding. But the work to get there is often invisible in your current metrics. And it requires changes beyond conversation: to the codebase, the tooling environment, and how your team works day-to-day. As mentioned above, I'll cover the concrete technical stuff: agent configuration files, sandboxed execution, CI pipelines, and workflow changes in my next post.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd do this week to improve adoption
&lt;/h2&gt;

&lt;p&gt;To distill the ideas and anecdotes above, here are three small projects and experiments you could run in a week:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hunt for adoption through quantitative and qualitative signals.&lt;/strong&gt; Look at the numbers you already have: token spend per engineer, AI-assisted PR rates if your platform tracks them, git commit footers that show AI assistance. Then pair those with qualitative input from team retros, 1:1s, or a short survey. Neither signal type is sufficient alone. Token spend sheds light on how deeply your team is using the tools. Conversations tell you &lt;em&gt;how&lt;/em&gt; and &lt;em&gt;why&lt;/em&gt; (or why not). The combination replaces guesswork with a baseline you can act on.&lt;/p&gt;
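&lt;p&gt;If your team's tools write co-author trailers into commit footers, a quick sketch like this can ballpark AI-assisted commit counts (the trailer key and the tool names are assumptions; check what your tools actually emit):&lt;/p&gt;

```shell
# count_ai_commits: count commits from the last N days (default 30) whose
# Co-Authored-By trailer mentions an AI tool. Tool names are examples only.
count_ai_commits() {
  git log --since="${1:-30 days ago}" \
      --format='%(trailers:key=Co-Authored-By,valueonly)' \
    | grep -ci -e claude -e copilot -e cursor || true
}
```

&lt;p&gt;Run it as &lt;code&gt;count_ai_commits "7 days ago"&lt;/code&gt; inside a repo. It's a crude count, but it gives you a per-repo trend line without any new tooling.&lt;/p&gt;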

&lt;p&gt;&lt;strong&gt;Tease out the obstacles getting in the way.&lt;/strong&gt; You're not going to get far with "are you using AI?" Instead, surface specifics through whatever channel fits your team: 1:1s, retrospectives, Slack threads, brown bag sessions, pair coding sessions. What tasks are they using AI for? Where did it break down? What would make it more useful? The goal is to map the gap between where your team is and where productive AI usage actually lives, then address the top blockers, whether those are configuration, training, trust, or tooling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pick one process to automate. Show concrete examples of its benefits.&lt;/strong&gt; Don't try to overhaul everything. And remember, it's not just code generation. I gave the example of troubleshooting production errors. It could also be: a planning template, a test generation step, a deployment checklist, an observability alert summary, or updating JIRA tickets. Isolated wins build confidence, both yours and the team's. They also give you concrete stories to share with leadership when they ask for evidence that the investment is working, and can be cross-pollinated across the wider organization.&lt;/p&gt;

&lt;h2&gt;
  
  
  The gains from AI are real, but there will be new problems
&lt;/h2&gt;

&lt;p&gt;When coaching resonates and the adoption picks up, the individuals on your team will be equipped with new career skills and your team will ship faster. Engineers tackle problems they would have avoided before.&lt;/p&gt;

&lt;p&gt;But adoption was only the first thing you needed to measure. Once your team is using AI coding tools for real, a new set of problems surfaces: bottlenecks that shift in unexpected directions, quality concerns that span multiple dimensions, and a measurement layer that hasn't caught up yet. I'll dive into some of those technical areas in the next part of this series.&lt;/p&gt;

&lt;p&gt;In the meantime, I'm collecting stories from engineering managers working through AI adoption. If you're in the middle of it, I'd like to hear from you. What metrics are you using? What pushback surprised you? Reach out &lt;a href="https://www.linkedin.com/in/0xandrewshu/" rel="noopener noreferrer"&gt;on LinkedIn&lt;/a&gt; — these conversations are the most valuable part of this work. I'm happy to swap tips and ideas!&lt;/p&gt;

</description>
      <category>vibecoding</category>
      <category>ai</category>
      <category>management</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Vibe Coding in Production: What's Holding Us Back?</title>
      <dc:creator>Andrew Shu</dc:creator>
      <pubDate>Tue, 07 Apr 2026 16:07:05 +0000</pubDate>
      <link>https://dev.to/0xandrewshu/vibe-coding-in-production-whats-holding-us-back-5kh</link>
      <guid>https://dev.to/0xandrewshu/vibe-coding-in-production-whats-holding-us-back-5kh</guid>
      <description>&lt;p&gt;Vibe coding techniques need to be adapted when you work on production applications with AI. I walk through some challenges and solutions that I've found helpful on real projects.&lt;/p&gt;

&lt;p&gt;I'm going to share some experiences from a few months ago, about how I expanded the scope of my use of agents from vibe coded apps to working on real world problems in production.&lt;/p&gt;

&lt;p&gt;I had been coding with AI agents for a while: greenfield scripts, prototypes, and features I could build and throw away. Early on in this experimentation, I set my sights on building tools and practices for safely using AI in production. I knew I had to maintain and operate the code I developed. So as I explored AI by building isolated and greenfield code, I made mental notes of the techniques that wouldn't work and those that I could bring to production infrastructure.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;There are numerous articles and posts describing techniques for vibe coding well. But there isn't enough documentation of the practices for customizing your repository and environment to take full advantage of AI agents on real infrastructure and workloads.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Daghan Altas, a former Cisco Meraki colleague, &lt;a href="https://www.linkedin.com/posts/daghanaltas_we-were-promised-a-10x-ai-productivity-boost-share-7438596222082453504-dMf1" rel="noopener noreferrer"&gt;phrased it well&lt;/a&gt;: what's the point of a 10x productivity boost if you can't operate and maintain the thing you built any faster? That reframed the question for me. Not "is AI fast?"; obviously it's fast. But: &lt;strong&gt;what specifically is holding me back?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's what I ran into when I applied vibe coding techniques against production infrastructure and workloads. And this is how I've updated my configurations and techniques to address these issues.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI codes quickly, but what about troubleshooting and testing?
&lt;/h2&gt;

&lt;p&gt;AI is great at implementation, but there's so much surrounding the act of writing code.&lt;/p&gt;

&lt;p&gt;Here's a concrete example: I was building a prototype that needed to normalize messy data from multiple API and database sources. The numbers kept being wrong. I pointed Claude Code at the problem, and it churned for an hour, trying different parsing strategies, refactoring the aggregation logic, adding fallback handlers.&lt;/p&gt;

&lt;p&gt;The fix turned out to be surprisingly simple: the logging wasn't capturing everything it needed to. The agent was trusting the logs at face value and never questioned whether the data was complete. An hour of sophisticated troubleshooting on a problem that needed five minutes of "wait, do we have enough logs to capture the symptom of the problem?"&lt;/p&gt;

&lt;p&gt;The same dynamic plays out when writing unit tests, so I needed to think more broadly about the problem.&lt;/p&gt;

&lt;p&gt;Implementation is often not the bottleneck. The bottleneck is everything around the implementation. After implementation, that would include: verifying the code does what you think it does, troubleshooting when it doesn't, understanding what already exists so you don't reinvent it, and making sure the architecture holds up next month.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjufvt96eae9lkare93dh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjufvt96eae9lkare93dh.png" alt="I found that AI wrote low value unit tests, and had difficulty troubleshooting in production." width="800" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I've started doing:&lt;/strong&gt; I built subagents for the two patterns that burned the most time when I was catching and fixing AI-written bugs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Firstly, a "Skeptical Testing Subagent" that scrutinizes test suites: checking for duplicates, testing meaningfulness, flagging assertions that don't actually prove anything.&lt;/li&gt;
&lt;li&gt;And secondly, a "Skeptical Troubleshooting Subagent" that focuses on production logs and data integrity before jumping into code changes. Both are early, but they've already caught things I would have missed.&lt;/li&gt;
&lt;/ul&gt;
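&lt;p&gt;For a sense of shape, here's a sketch of what such a subagent might look like as a Claude Code subagent file (the name, tool list, and prompt are my own invention; adapt them to your stack):&lt;/p&gt;

```markdown
---
name: skeptical-tester
description: Adversarial reviewer for test suites; use after tests are added or changed.
tools: Read, Grep, Glob
---

You are a skeptical reviewer of test code. For each test file you inspect:

1. Flag tests that duplicate existing coverage.
2. Flag assertions that cannot fail (tautologies, assertions on mocks).
3. Ask: if the implementation were subtly wrong, would this test catch it?

Report findings as a list. Do not rewrite the tests yourself.
```

&lt;p&gt;Note the read-only tool list: a reviewer agent that can't edit files is forced to report rather than quietly "fix" things.&lt;/p&gt;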

&lt;p&gt;When I say "skeptical" you can translate it to the term "adversarial", which is what folks in the AI community use more frequently. People have talked about using &lt;a href="https://asdlc.io/patterns/adversarial-code-review/" rel="noopener noreferrer"&gt;"adversarial agents" to review code&lt;/a&gt;, and how these agents "&lt;a href="https://dev.to/marcosomma/adversarial-planning-for-spec-driven-development-4c3n"&gt;think differently&lt;/a&gt;" than an agent told to "write code". My testing and troubleshooting subagents solve specific code review and production log review problems that I've encountered, in a more narrow and specific context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fear slows us down when we vibe code production apps
&lt;/h2&gt;

&lt;p&gt;One of the things that accelerates vibe coding is accepting AI suggestions quickly (specifically, auto-approving the shell commands the agent wants to run). But many of those suggestions are commands run inside a shell with superadmin privileges and access to the internet. Even when the AI isn't doing anything malicious, I worry about a stray &lt;code&gt;rm -rf&lt;/code&gt; or a &lt;code&gt;drop database&lt;/code&gt; or a &lt;code&gt;terraform apply&lt;/code&gt; that destroys a folder, a Google Drive, an RDS instance, a DNS record. Nightmares abound.&lt;/p&gt;

&lt;p&gt;This isn't hypothetical: Alexey Grigorev &lt;a href="https://alexeyondata.substack.com/p/how-i-dropped-our-production-database" rel="noopener noreferrer"&gt;accidentally dropped his production RDS database&lt;/a&gt; while using AI tools and wrote up the full post-mortem. Amazon has called for &lt;a href="https://thenewstack.io/amazon-ai-assisted-errors/" rel="noopener noreferrer"&gt;new safeguards and review processes&lt;/a&gt; after AI-assisted errors in production. Research from Snyk has documented AI coding tools &lt;a href="https://snyk.io/articles/package-hallucinations/" rel="noopener noreferrer"&gt;hallucinating entire package names&lt;/a&gt; that don't exist, and attackers registering those packages to exploit the gap.&lt;/p&gt;

&lt;p&gt;Hallucinations in a local sandbox are an inconvenience. In production, they're a late night page and an embarrassing post-mortem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fndtopid54lu6q01mjhdq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fndtopid54lu6q01mjhdq.png" alt="Common engineering metaphor: adding safety rails helps increase confidence to move faster. Generated with Gemini." width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I've started doing:&lt;/strong&gt; here are some examples of ways I've improved the safety of my work.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Instead of executing unsafe &lt;code&gt;bash&lt;/code&gt; commands, ask AI to write a script you can review.&lt;/strong&gt; I find AI helpful for sysadmin/SRE work, but I have to monitor it closely — no background agents here. Watching commands scroll by is risky, so I ask the agent to write them into a script I can review first. Then as a bonus, I get a script that is reusable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extracting repeat database or log queries into a script.&lt;/strong&gt; When I was troubleshooting customer issues, I often ran a few Postgres database queries or fetched a few logs related to some Lambdas. This was tedious, but I also didn't want AI to be running PG or AWS Lambda commands by itself. So I wrote scripts like &lt;code&gt;fetch_customer_events_pg.ts &amp;lt;customer-alias&amp;gt; &amp;lt;event-type&amp;gt;&lt;/code&gt; or &lt;code&gt;fetch_customer_logs.ts &amp;lt;customer-alias&amp;gt; --start &amp;lt;start_time&amp;gt; --end &amp;lt;end_time&amp;gt;&lt;/code&gt;.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;In addition to scripts, you can do similar things by codifying repetitive tasks as Skills and Subagents.&lt;/strong&gt; Skills are available in &lt;a href="https://developers.openai.com/codex/skills" rel="noopener noreferrer"&gt;Codex&lt;/a&gt;, &lt;a href="https://cursor.com/docs/skills" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;, and &lt;a href="https://code.claude.com/docs/en/skills" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;. Subagents are also available in &lt;a href="https://developers.openai.com/codex/subagents" rel="noopener noreferrer"&gt;Codex&lt;/a&gt;, &lt;a href="https://cursor.com/docs/subagents" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;, and &lt;a href="https://code.claude.com/docs/en/sub-agents" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
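&lt;p&gt;As an illustration of the second bullet, here's a minimal TypeScript sketch of such a wrapper (the table, columns, and script shape are invented; the point is that the SQL is fixed and parameterized, so the agent chooses only the inputs, never the query itself):&lt;/p&gt;

```typescript
// Sketch of a reviewable, read-only query wrapper in the spirit of the
// fetch_customer_events_pg.ts example. Table and column names are invented.
type EventQuery = { text: string; values: string[] };

function buildCustomerEventsQuery(customerAlias: string, eventType: string): EventQuery {
  return {
    text:
      "SELECT id, event_type, payload, created_at " +
      "FROM customer_events " +
      "WHERE customer_alias = $1 AND event_type = $2 " +
      "ORDER BY created_at DESC LIMIT 100",
    values: [customerAlias, eventType],
  };
}

// In the real script this object would be passed to a pg client's query();
// here we print it so a human (or an agent) can review exactly what runs.
console.log(buildCustomerEventsQuery("acme-corp", "login_failed"));
```

&lt;p&gt;Because the query text never varies, reviewing the script once is enough: afterwards, the agent can run it freely without you watching every invocation.&lt;/p&gt;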

&lt;p&gt;These are some of the techniques I've used. I might elaborate on this topic in a future post. If you're interested in chatting about how I do this, DM me &lt;a href="https://www.linkedin.com/in/0xandrewshu/" rel="noopener noreferrer"&gt;on LinkedIn&lt;/a&gt; or &lt;a href="https://x.com/0xAndrewShu" rel="noopener noreferrer"&gt;on X&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  I ❤️ docs, but AI might just love them more (as "context")
&lt;/h2&gt;

&lt;p&gt;I've hopped onto screen shares with other engineers where one of us will spot a hallucination going by mid-session. There's a brief moment of annoyance or concern, and then we keep going. Most of the time, we just let the agent continue. I've done it myself; we have more pressing tasks to finish. You see the wrong thing, you wince, and you move on because you're in flow.&lt;/p&gt;

&lt;p&gt;This is a bigger source of inefficiency than it appears. Not everyone realizes that many hallucination patterns can be fixed with better context. (By context, I'm basically talking about code, docs and additional markdown files.) And those who do know often haven't had the time to pay attention to how context is actually structured across their tools. There's a growing stack of context layers: &lt;code&gt;AGENTS.md&lt;/code&gt; files, skills, subagents, team-shared vs. individual context, memory architecture, code indexers, connections to databases and wikis and issue trackers.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;But here's the pattern I see most often: someone sets up an &lt;code&gt;AGENTS.md&lt;/code&gt; or a &lt;code&gt;.cursorrules&lt;/code&gt; file when they first adopt a tool, and then never touches it again. Six weeks later the agent is hallucinating patterns you deprecated a month ago, suggesting libraries you've already replaced. Or maybe your automatic &lt;a href="https://code.claude.com/docs/en/memory" rel="noopener noreferrer"&gt;memory.md&lt;/a&gt; is outdated and no longer reflects your code's reality. The agent's context drifts from reality a little more every week, and the hallucinations compound.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the source of a lot of churn. When the agent doesn't know what already exists in the codebase, it reinvents. When it doesn't know your architectural patterns, it improvises. When it doesn't know what you deprecated last sprint, it resurrects it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5kcy22nd9o8ilvjma6o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5kcy22nd9o8ilvjma6o.png" alt="Visualization of the numerous sources of context you can feed your AI." width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I've started doing:&lt;/strong&gt; I treat documentation as infrastructure. When agents hallucinate or reinvent something that already exists, I update the docs so it doesn't happen again. I use MCP servers to push context to my knowledge base. I run Claude Code's &lt;code&gt;/context&lt;/code&gt; command mid-session to see how the 200K token window is being consumed, and it often exposes wasteful allocation I wouldn't have caught otherwise. It's a small amount of effort that compounds over time. If you're going to obsess over something, context hygiene has the best return on neuroticism I've found so far.&lt;/p&gt;

&lt;p&gt;Another technique I use is to keep a &lt;code&gt;plans/&lt;/code&gt; folder and a &lt;code&gt;docs/&lt;/code&gt; folder for architecture decisions and system patterns that agents should know before generating. &lt;a href="https://www.ashu.co/markdown-plan-files-vibe-coding/?utm_source=devto&amp;amp;utm_medium=syndication" rel="noopener noreferrer"&gt;Markdown plan files&lt;/a&gt; are still the single best thing I've done for my workflow, and the docs folder is a great supplement. Recently, Andrej Karpathy posted the importance of "&lt;a href="https://x.com/karpathy/status/2039805659525644595" rel="noopener noreferrer"&gt;LLM Knowledge Bases&lt;/a&gt;". I also use Obsidian in a similar way, but I find in-repo docs more pragmatic for keeping context closer to the code.&lt;/p&gt;

&lt;p&gt;You can also layer on custom instructions to &lt;a href="https://developers.openai.com/codex/guides/agents-md" rel="noopener noreferrer"&gt;Codex&lt;/a&gt;, &lt;a href="https://cursor.com/docs/rules#user-rules" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;, and &lt;a href="https://code.claude.com/docs/en/memory#choose-where-to-put-claude-md-files" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt; to customize your harness to behave differently beyond what your team has done.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI token anxiety is real: what if I run out of budget?
&lt;/h2&gt;

&lt;p&gt;To borrow a friend's metaphor: inference tokens are like Spice in &lt;em&gt;Dune&lt;/em&gt;. They're a substance that augments your abilities, that once taken, you can't live without. And a scarce resource that requires extraordinary effort to accumulate.&lt;/p&gt;

&lt;p&gt;I heard this worry from a tech lead at a large enterprise. There are technically token budgets per engineer, but they're not being enforced; the current objective is to increase adoption, so enforcement isn't an issue yet. This engineer worried about what happens when the budgets do get enforced.&lt;/p&gt;

&lt;p&gt;The anxiety comes from multiple directions. There's the worry about rationing: how do you make sure you have enough tokens to hit your deadlines? And if an inference provider goes down mid-sprint, you're stuck without tokens or scrambling to switch to an unconfigured, unfamiliar tool.&lt;/p&gt;

&lt;p&gt;Then we get to the opacity of pricing. I &lt;a href="https://www.linkedin.com/posts/0xandrewshu_fascinating-saturday-i-measured-that-activity-7439405612087635968-AseS" rel="noopener noreferrer"&gt;measured my Claude Code sessions&lt;/a&gt; over a week and found that 2 out of 5 sessions burned tokens at 2x the normal rate, with no obvious change in my behavior. In Theo's &lt;a href="https://youtu.be/j_kJNYLI6Tw?si=WI1b4l7SbO7Ondlt" rel="noopener noreferrer"&gt;YouTube video&lt;/a&gt; on Claude Code's recent (March 2026) capacity reduction, his conclusion is the same as mine.&lt;/p&gt;

&lt;p&gt;Beyond capacity allowances, a feature change from the AI labs can silently double your token costs. There's a pattern emerging across providers in early 2026: models getting more verbose, spawning sub-agents for simple tasks, and nobody has a baseline to tell whether the amount of &lt;a href="https://www.ashu.co/claude-code-vs-cursor-pricing/?utm_source=devto&amp;amp;utm_medium=syndication" rel="noopener noreferrer"&gt;AI agent-hours&lt;/a&gt; they can use per month was reduced 10% or 50%.&lt;/p&gt;

&lt;p&gt;I've written about this a lot, but I don't think we need to over-rotate on token reduction. I've found it helpful just to learn how token limits are enforced and how tokens are being used; this helps me be mindful of costs. The first thing I'd recommend is to &lt;a href="https://www.ashu.co/claude-code-vs-cursor-pricing/?utm_source=devto&amp;amp;utm_medium=syndication" rel="noopener noreferrer"&gt;understand the tools' pricing structure&lt;/a&gt; and &lt;a href="https://www.ashu.co/cursor-to-claude-code-stuck-at-16-percent-utilization/?utm_source=devto&amp;amp;utm_medium=syndication" rel="noopener noreferrer"&gt;learn how they enforce token limits&lt;/a&gt; so I can make the most of the tokens they provide (and subsidize). There are also open source tools like &lt;a href="https://ccusage.com/" rel="noopener noreferrer"&gt;ccusage&lt;/a&gt; that track token usage. You can also try vibe coding your own!&lt;/p&gt;
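&lt;p&gt;A tiny starting point for rolling your own, assuming your tool writes a JSONL usage log where each record carries a &lt;code&gt;usage&lt;/code&gt; object (the path and field names vary by tool; inspect your own logs and adjust):&lt;/p&gt;

```python
# Sum input/output tokens from a JSONL usage log. The "usage",
# "input_tokens", and "output_tokens" field names are assumptions:
# check what your tool actually records before relying on this.
import json
from collections import Counter

def tally_tokens(path: str) -> Counter:
    totals: Counter = Counter()
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            usage = json.loads(line).get("usage", {})
            totals["input"] += usage.get("input_tokens", 0)
            totals["output"] += usage.get("output_tokens", 0)
    return totals
```

&lt;p&gt;Pair the totals with your provider's published per-token prices to turn them into a dollar ballpark, and you have a baseline to notice when a harness update silently doubles your burn rate.&lt;/p&gt;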

&lt;h2&gt;
  
  
  Questions I ask myself to improve my use of AI agents on production
&lt;/h2&gt;

&lt;p&gt;I've found a number of techniques that have improved my workflow, but I still have so many open questions! I find that thinking about these questions helps me spot where real improvements can be made, and they don't require adding any tools before they start paying off. I'll share them with you, and hopefully they help you reflect on your own engineering work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffn3jis3afxg6bbfc9e1n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffn3jis3afxg6bbfc9e1n.png" alt="Summary of how I think about using AI on real production workloads; problems, symptoms and solutions." width="800" height="473"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Are my productivity gains real or not? What did I do in the last week with AI?&lt;/strong&gt; This is a question I ask myself regularly, because productivity gains may feel great while actually being an illusion. METR ran a &lt;a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/" rel="noopener noreferrer"&gt;randomized controlled trial&lt;/a&gt; where experienced developers were 19% slower with AI tools while believing they were 20% faster. The study ran in July 2025, before the major model improvements in November 2025; nonetheless, that perception gap is a reminder that intuition alone isn't enough.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where am I losing time? How do I increase AI's autonomy?&lt;/strong&gt; This goes to my work around adversarial agents for scrutinizing code, tests, and logs. I've found that AI often churns out meaningless work, or takes shortcuts. These are signs the agent isn't truly autonomous, so I need to troubleshoot how to increase the autonomy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How much does fear cost me?&lt;/strong&gt; I monitor what my agents are doing more than I probably need to. Are my colleagues even familiar with which bash commands are risky? How much collective time gets lost to hovering, second-guessing, or just not knowing whether it's safe to let the agent run? Reducing risk here feels like unlocked velocity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is my process to reflect on and improve the effectiveness of my agents?&lt;/strong&gt; Right now, most of us are vibe coding &lt;em&gt;and&lt;/em&gt; vibe evaluating. We finish a session, we feel like it went well or it didn't, and we move on. I think there's value in building a habit of structured reflection: what worked, what didn't, what would I change? And in sharing those reflections with colleagues across the team, not just keeping them in your own head. There's something from Toyota's production system and from Agile retrospectives that applies here: the discipline of continuous reflection and improvement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vibe coding deserves more than vibe evaluation
&lt;/h2&gt;

&lt;p&gt;The FOMO around AI coding skills is real. There are new tools every week, new techniques, new claims about what's possible. Most of us are figuring it out on hunches: not fully able to keep up, not clicking into AI news articles to read them in full, not totally understanding the tradeoffs, but feeling the paradigm shift happening underneath us.&lt;/p&gt;

&lt;p&gt;I think that's fine because we're early in the technology adoption. But I also think we can do better than vibes. The engineers I see getting the most value aren't the ones with the most expensive tools or the most aggressive token spend. They're the ones building habits of honest reflection: what did I ship versus what did I generate? Where did I invest versus where did I waste time? What would I do differently next session?&lt;/p&gt;

&lt;p&gt;Everything I've talked about here is from the engineer's perspective: what I can see, what I can measure, what I can control. But I've been having conversations with engineering managers too, and they're wrestling with a different version of the same question: how do you know your &lt;em&gt;team's&lt;/em&gt; AI investment is paying off when you can't see inside any of these tools? That's a different problem with different constraints. More on that soon.&lt;/p&gt;

&lt;p&gt;What are you working through? What's the question you keep asking yourself about your AI workflow? I'd genuinely like to hear: if you're wrestling with the same things, &lt;a href="https://www.linkedin.com/in/0xandrewshu/" rel="noopener noreferrer"&gt;reach out on LinkedIn&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.ashu.co/taking-vibe-coded-into-production/?utm_source=devto&amp;amp;utm_medium=syndication" rel="noopener noreferrer"&gt;ashu.co&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>vibecoding</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Is Claude Code 5x Cheaper Than Cursor? I Ran 12 Experiments to Find Out</title>
      <dc:creator>Andrew Shu</dc:creator>
      <pubDate>Tue, 31 Mar 2026 16:52:02 +0000</pubDate>
      <link>https://dev.to/0xandrewshu/is-claude-code-5x-cheaper-than-cursor-i-ran-12-experiments-to-find-out-315m</link>
      <guid>https://dev.to/0xandrewshu/is-claude-code-5x-cheaper-than-cursor-i-ran-12-experiments-to-find-out-315m</guid>
      <description>&lt;p&gt;In &lt;a href="https://www.ashu.co/cursor-to-claude-code-stuck-at-16-percent-utilization/?utm_source=devto&amp;amp;utm_medium=syndication" rel="noopener noreferrer"&gt;Part 1 of this series&lt;/a&gt;, I noticed something strange while using Claude Code's Max 20x plan: it was the same $200/month as Cursor Ultra, doing the same work, but my Claude Code utilization was stuck at 16% while I had been burning through Cursor's token budget. In &lt;a href="https://www.ashu.co/parallel-claude-code-agents/?utm_source=devto&amp;amp;utm_medium=syndication" rel="noopener noreferrer"&gt;Part 2&lt;/a&gt;, I figured out how to push past 50% utilization with parallel Claude Code agents.&lt;/p&gt;

&lt;p&gt;Given that I could use so many more Sonnet/Opus tokens on Claude Code, my first instinct was to ask: "is Claude Code actually 5x cheaper than Cursor?"&lt;/p&gt;

&lt;p&gt;And then I realized you can't compare them apples to apples. There's no direct answer to the question &lt;em&gt;at the same price, how much token capacity does each tool actually give you?&lt;/em&gt; Their pricing models are enforced in very different ways (see &lt;a href="https://www.ashu.co/cursor-to-claude-code-stuck-at-16-percent-utilization/?utm_source=devto&amp;amp;utm_medium=syndication" rel="noopener noreferrer"&gt;Part 1&lt;/a&gt;), and Cursor has 2 pools of tokens (API, and "Auto + Composer").&lt;/p&gt;

&lt;p&gt;So instead, I came up with a metric — "agent-hours" — to serve as a proxy: &lt;em&gt;given each plan's token capacity, how many hours of agents can I run per month?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I had some hunches, but I couldn't be sure they would hold up. So, I did what any engineer with too much curiosity would do: I designed an experiment to find out.&lt;/p&gt;

&lt;p&gt;A few key caveats before we dive in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This is a loosely controlled experiment, not a rigorous benchmark. The findings are directional: order of magnitude, not precise. Readings fluctuated significantly day by day, and the product/capacity changed. But this reflects real life.&lt;/li&gt;
&lt;li&gt;I'm using Individual, not Team plans, focusing on $200/month tiers.&lt;/li&gt;
&lt;li&gt;Things change rapidly in the world of vibe coding token use, models, and costs. The 1M context window for Opus 4.6 dropped for Claude Code and then Cursor. Cursor dropped Composer 2.0, an upgrade from Composer 1.5. &lt;a href="https://x.com/trq212/status/2037254607001559305" rel="noopener noreferrer"&gt;Claude session limits were updated&lt;/a&gt; in between experiments. I normalized for differing "2x limits" promotions in &lt;a href="https://support.claude.com/en/articles/14063676-claude-march-2026-usage-promotion" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt; and &lt;a href="https://github.com/openai/codex/discussions/11406" rel="noopener noreferrer"&gt;Codex&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To return to the article: my intuition suggested there was a notable difference in price, and I wanted to quantify it. I learned a considerable amount digging into pricing, and it helps me understand how to make the most of the different models.&lt;/p&gt;

&lt;p&gt;I hope this token and tool pricing analysis helps (and interests) you as much as it did me. It's a long article, but given the volatility of the experiment, I figured it would help for me to show you all the messy details and how I think about it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The headline: Claude Code delivers ~5x more capacity per dollar
&lt;/h2&gt;

&lt;p&gt;Here's the summary. At $200/month on individual plans:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fspe4iowdywoj3qew3sxh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fspe4iowdywoj3qew3sxh.png" alt="Graph comparing " width="800" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool + Plan&lt;/th&gt;
&lt;th&gt;Agent-Hours / Month&lt;/th&gt;
&lt;th&gt;vs. Cursor Ultra&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cursor Ultra ($200)&lt;/td&gt;
&lt;td&gt;~138 hours&lt;/td&gt;
&lt;td&gt;1x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codex Pro ($200)&lt;/td&gt;
&lt;td&gt;~220 hours&lt;/td&gt;
&lt;td&gt;~1.6x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code Max 20x ($200)&lt;/td&gt;
&lt;td&gt;~678 hours&lt;/td&gt;
&lt;td&gt;~4.9x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;So at the same $200/month, Claude Code gives you ~5x more room to work than Cursor.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Important context before we get further.&lt;/strong&gt; This measures &lt;em&gt;capacity per month&lt;/em&gt; (for my workload + codebase): how many agent-hours your subscription delivers if you use it fully. It does not measure work quality, code correctness, or features completed. You shouldn't read it as "5x cheaper" because that assumes you can actually use all that capacity.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But this is too simplistic a view, because there are greater nuances to the pricing. We should next look at how Cursor's pricing works, because it makes the story considerably more interesting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cursor Ultra's pricing structure: two pools of different tokens
&lt;/h2&gt;

&lt;p&gt;Before we go deeper into the comparison, we need to understand &lt;a href="https://cursor.com/docs/models-and-pricing" rel="noopener noreferrer"&gt;Cursor's pricing structure&lt;/a&gt;. Cursor Ultra doesn't give you one big pool of tokens. It gives you two, and they're dramatically different in size and model characteristics.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The first pool is API credits&lt;/strong&gt;, which cover SOTA models: "state of the art" frontier models like Opus 4.6, Sonnet 4.6, and GPT-5.4 (at the time of publishing). These are usually the models scoring highest on benchmarks, and also the most expensive models available.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The second pool is "Auto+Composer" credits&lt;/strong&gt;, which cover Cursor's proprietary Composer models — faster, cheaper models that Cursor has built and optimized for code generation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you upgrade to Ultra expecting unlimited access to the best models available, what you actually get is a small allocation of frontier model credits and a much larger allocation of Composer credits. Here's how the two pools break down:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cursor Ultra Usage Pool&lt;/th&gt;
&lt;th&gt;Estimated Agent-Hours&lt;/th&gt;
&lt;th&gt;% of total&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;API credits (We use Opus 4.6, both 200k and 1M)&lt;/td&gt;
&lt;td&gt;~18 hours&lt;/td&gt;
&lt;td&gt;13%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auto + Composer credits&lt;/td&gt;
&lt;td&gt;~120 hours&lt;/td&gt;
&lt;td&gt;87%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~138 hours&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Note: API agent-hours depend on &lt;a href="https://cursor.com/docs/models-and-pricing" rel="noopener noreferrer"&gt;the price of the model&lt;/a&gt; you choose. Opus 4.6 is one of the most expensive options; a cheaper SOTA model would stretch further.&lt;/p&gt;

&lt;p&gt;That ~18 agent-hours of frontier model is a key factor to consider when you use Cursor. When I ran experiments using only Opus 4.6 on Cursor, the API pool burned through fast. When I ran experiments using Composer models, the Composer pool lasted roughly 7–8x longer.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;And this is a key finding: Cursor incentivizes you to spend most of your time using the faster Composer 2 model. This seems to be a deliberate design choice, and it's a reasonable one. The combined 5x headline reflects what happens when you use Composer for most of your work, which is how Cursor intends for you to use it. If you default to frontier models, the gap is far wider.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This explains a frustration I've seen across forums and from other engineers: you upgrade to Cursor Ultra and exclusively use SOTA models, only to find that you burn through your API credits faster than expected.&lt;/p&gt;

&lt;p&gt;Let's see what this looks like in numbers. We strip out the generous "Auto + Composer" tier and exclusively use SOTA models. (Again, not the optimal use of Cursor.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fibgi5beij5z6v1vywuj5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fibgi5beij5z6v1vywuj5.png" alt="Graph focusing on " width="800" height="486"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool + Plan&lt;/th&gt;
&lt;th&gt;Agent-Hours / Month (SOTA only)&lt;/th&gt;
&lt;th&gt;vs. Cursor (SOTA)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cursor Ultra — API only ($200)&lt;/td&gt;
&lt;td&gt;~18 hours&lt;/td&gt;
&lt;td&gt;1x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codex Pro ($200)&lt;/td&gt;
&lt;td&gt;~220 hours&lt;/td&gt;
&lt;td&gt;~12x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Code Max 20x ($200)&lt;/td&gt;
&lt;td&gt;~678 hours&lt;/td&gt;
&lt;td&gt;~38x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;That's a 38x difference in agent-hours (ignoring the vast amount of Composer 2 tokens that Cursor provides). For engineers exclusively focused on frontier model access for complex reasoning (Opus, GPT, Gemini) and comparing Claude Code to Cursor, this is the source of their surprise.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Even this is too simplistic; I think we need to dive deeper.&lt;/p&gt;

&lt;h2&gt;
  
  
  But capacity isn't velocity: Composer 2 is genuinely fast
&lt;/h2&gt;

&lt;p&gt;Here's where the story gets more interesting than "Tool A gives you more." I tracked project completions across all 12 experiments, and the velocity data tells a different story than the capacity data.&lt;/p&gt;

&lt;p&gt;Here's how long the models took to complete Project 1, which involved a bulk rename across the project:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi9vj59t42i2wsvei9sp0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi9vj59t42i2wsvei9sp0.png" alt="Average duration (minutes) to complete Project 1, a large refactor" width="800" height="485"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, let's look at Project 2, which involved cutting out a set of features:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmzd8v0r9ihm41jfv52mo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmzd8v0r9ihm41jfv52mo.png" alt="Average duration (minutes) to complete Project 2, another large refactor. Codex did not reach the end of Project 2 so is not present." width="800" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I think the first 2 charts provide a much better signal, because they compare 2 larger, more complex refactor projects on the same scope.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Caveat: I'm going to share the following chart, even though it's flawed. After the first 2 large projects, I queued up many small projects like "research X and then build a small full stack feature".&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But nonetheless, I wanted to share the different feeling of speed as I worked with different models:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fybts3xsqnx4bwd2giumy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fybts3xsqnx4bwd2giumy.png" alt="Graph showing average overall projects completed. Don't read numbers too literally — projects were unevenly sized. But it illustrates the feeling of velocity when using Composer 2." width="800" height="497"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In all charts, the Composer models were at least 2x faster than the other models. Because Composer finished the first 2 larger projects soonest, it was able to race ahead and clear all the small projects at the end. If your workload mixes small and large projects, that head start compounds.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You might notice that the Opus 4.6 200k/1M models don't show a clear trend. The sample size was small, so the numbers are noisy.&lt;/p&gt;

&lt;p&gt;So, speed is another tradeoff when choosing tools. Claude Code may give you more capacity per dollar. But using Cursor Composer can dramatically increase throughput. If the work is clearly defined and implementation-focused, you may get more done in fewer agent-hours.&lt;/p&gt;

&lt;h2&gt;
  
  
  A quick aside about Codex + GPT 5.4
&lt;/h2&gt;

&lt;p&gt;If you're looking at Codex + GPT 5.4's velocity, you might notice that it didn't move as quickly. I wouldn't read too much into it. Each metric gives you a part of the picture; each tool has different strengths and weaknesses.&lt;/p&gt;

&lt;p&gt;Firstly, I'm not as proficient with Codex's quirks as I am with Claude Code's, so I don't know how to squeeze the most juice out of it. I noticed during the experimental runs that GPT was much more cautious and spent more time slicing the work into different groups.&lt;/p&gt;

&lt;p&gt;And qualitatively, consider the &lt;a href="https://www.youtube.com/watch?v=HD5TWE8xD7o" rel="noopener noreferrer"&gt;multiple&lt;/a&gt; pieces of &lt;a href="https://x.com/mitchellh/status/2029348087538565612" rel="noopener noreferrer"&gt;anecdotal&lt;/a&gt; &lt;a href="https://developers.openai.com/community#:~:text=My%20new%20Sunday%20morning%20routine,%40youyuxi" rel="noopener noreferrer"&gt;evidence&lt;/a&gt; that Codex and GPT 5.4 can solve complex issues and that people are loving it. I've been hearing similar things in my conversations with colleagues. It's a potent tool and you should definitely give it a shot.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I tested and how
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The setup
&lt;/h3&gt;

&lt;p&gt;I ran all 12 experiments on the same codebase: a monorepo with Elixir/Phoenix, React, and Terraform infrastructure, roughly 80k lines of code. Every experiment started from the same git commit. I used 4 parallel agents per tool, each on a separate git worktree (the same setup I described in &lt;a href="https://www.ashu.co/parallel-claude-code-agents/?utm_source=devto&amp;amp;utm_medium=syndication" rel="noopener noreferrer"&gt;Part 2&lt;/a&gt;). Each agent worked through the same sequence of self-contained refactoring projects: rename all instances of X, extract a module, add an API integration.&lt;/p&gt;

&lt;p&gt;Each experiment ran roughly 60 minutes. I played a lightweight manager role — confirming "done" claims, assigning the next project. My controls tightened over the week as I learned what to watch for.&lt;/p&gt;

&lt;p&gt;If you're interested in the raw data, reach out &lt;a href="https://www.linkedin.com/in/0xandrewshu/" rel="noopener noreferrer"&gt;via LinkedIn&lt;/a&gt; or &lt;a href="https://x.com/0xAndrewShu" rel="noopener noreferrer"&gt;on X&lt;/a&gt;. If there's enough interest, I'd be happy to publish it on my GitHub.&lt;/p&gt;

&lt;h3&gt;
  
  
  The tool configurations
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Detail&lt;/th&gt;
&lt;th&gt;Claude Code&lt;/th&gt;
&lt;th&gt;Cursor&lt;/th&gt;
&lt;th&gt;Codex&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Interface&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CLI&lt;/td&gt;
&lt;td&gt;CLI / Agent mode&lt;/td&gt;
&lt;td&gt;CLI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Opus 4.6 (200k, 1M context)&lt;/td&gt;
&lt;td&gt;Opus 4.6 / Composer 1.5 / Composer 2 (varied)&lt;/td&gt;
&lt;td&gt;GPT-5.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Plan tested&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Max 5x ($100)&lt;/td&gt;
&lt;td&gt;Pro+ ($60) → Ultra ($200)&lt;/td&gt;
&lt;td&gt;Pro ($200)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Autonomy mode&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Accept edits on&lt;/td&gt;
&lt;td&gt;CLI with allow-listing (not YOLO)&lt;/td&gt;
&lt;td&gt;Runs commands without asking&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Parallel instances&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A few notes. I tested Claude Code on Max 5x ($100), not Max 20x ($200). The 20x projection uses Anthropic's published 4x multiplier — more on this in the calculations section. All three tools ran in semi-autonomous mode with different allow-listing behavior, which affects velocity asymmetrically and is unavoidable. Both Claude Code and Codex had active 2x capacity promotions during this period. Codex's promo applied 24/7. Claude Code's applied during specific off-peak hours.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I measured
&lt;/h3&gt;

&lt;p&gt;For Claude Code, I tracked the percentage of the 5-hour session consumed and the percentage of the weekly limit consumed. For Cursor, I tracked dollar amounts of API usage and Auto/Composer usage consumed, plus the combined total percentage. For Codex, I tracked the same session and weekly percentages as Claude Code.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I calculated capacity
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Defining "agent-hours"
&lt;/h3&gt;

&lt;p&gt;An agent-hour equals one agent running for one hour. If 4 agents run for 1 hour, that's 4 agent-hours. The key question: how many agent-hours does each plan sustain in a month?&lt;/p&gt;

&lt;h3&gt;
  
  
  Session-based tools (Claude Code, Codex)
&lt;/h3&gt;

&lt;p&gt;Technically, there are 2 limits: the 5-hour session limit and the weekly limit. The weekly limit is always more constraining than the sum of all the 5-hour session limits.&lt;/p&gt;

&lt;p&gt;For each experiment, I measured the usage % at the start and end of the session, and calculated the difference. Since I know how many minutes the experiment ran, I calculate the "percentage consumed per minute" of both the 5-hour session capacity and the weekly limit. Monthly projection: weekly capacity × ~4 weeks × 4 agents.&lt;/p&gt;

&lt;p&gt;To normalize "5-hour session capacity" to "weekly capacity", I noted that a week has 168 hours, so 168h / 5h = 33.6 sessions. If I can reach 100% of a session's capacity in 70 minutes, I multiply 70 minutes by 33.6 sessions and get 2,352 minutes per week.&lt;/p&gt;
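&lt;p&gt;The normalization above can be sketched in a few lines (the 70-minute reading is the worked example from the text):&lt;/p&gt;

```python
# Normalize 5-hour session capacity to weekly capacity:
# a 168-hour week holds 33.6 five-hour sessions.
HOURS_PER_WEEK = 7 * 24                              # 168
SESSION_HOURS = 5
sessions_per_week = HOURS_PER_WEEK / SESSION_HOURS   # 33.6

minutes_per_session = 70   # observed: 100% of a session burned in 70 min
weekly_minutes = minutes_per_session * HOURS_PER_WEEK / SESSION_HOURS
print(weekly_minutes)      # 2352.0 minutes of usage per week
```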

&lt;h3&gt;
  
  
  Cursor's two-pool system
&lt;/h3&gt;

&lt;p&gt;This is where the SOTA vs Composer insight emerges naturally from the math.&lt;/p&gt;

&lt;p&gt;I measured the percentage consumed per minute of the monthly API pool (from the Opus-on-Cursor experiments) and separately the monthly Auto+Composer pool (from the Composer experiments). The API pool yielded roughly 1,065 agent-minutes per month, or about 18 agent-hours. The Auto+Composer pool yielded roughly 7,200 agent-minutes, or about 120 agent-hours. Combined: ~138 agent-hours.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Max 5x → Max 20x projection
&lt;/h3&gt;

&lt;p&gt;All of my Claude Code experiments ran on the Max 5x plan ($100/month). To estimate Max 20x ($200/month), I used Anthropic's published multiplier.&lt;/p&gt;

&lt;p&gt;Anthropic's &lt;a href="https://support.claude.com/en/articles/11049741-what-is-the-max-plan" rel="noopener noreferrer"&gt;support documentation&lt;/a&gt; states that Max 5x provides 5x Pro usage and Max 20x provides 20x Pro usage — so Max 20x = 4x Max 5x capacity. This is a projection, not a measurement.&lt;/p&gt;

&lt;h3&gt;
  
  
  Off-peak and promo normalization
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://support.claude.com/en/articles/14063676-claude-march-2026-usage-promotion" rel="noopener noreferrer"&gt;Anthropic's 2x off-peak discount&lt;/a&gt; applied to several experiments. I normalized by halving observed capacity during off-peak hours: conservative but approximate. I also ran experiments during peak, off-peak, and on the threshold of both.&lt;/p&gt;

&lt;p&gt;When a session straddled the threshold, I removed those values from the calculation, though I ran them anyway out of curiosity about how the limits behave across the boundary.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/openai/codex/discussions/11406" rel="noopener noreferrer"&gt;Codex's 24/7 2x promo&lt;/a&gt; (through April 2) was similarly halved. Both the promo and normalized figures are shown throughout for transparency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Walking through the math: Experiment 7
&lt;/h3&gt;

&lt;p&gt;Let me show the math for Experiment 7 comparing Claude Code vs Cursor Ultra, both using Opus 4.6.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Code (Opus 4.6 1M window):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The weekly limit went from 37% to 42% over 60 minutes with 4 agents — that's 5% of weekly capacity consumed.&lt;/li&gt;
&lt;li&gt;Weekly capacity = 100% / 5% × 60 min ≈ 1,200 minutes of 4-agent usage.&lt;/li&gt;
&lt;li&gt;That's with the 2x off-peak discount.&lt;/li&gt;
&lt;li&gt;Normalize to 1x: 1,200 minutes / 2 = 600 minutes.&lt;/li&gt;
&lt;li&gt;Monthly: 600 × 4 weeks ≈ 2,400 minutes.&lt;/li&gt;
&lt;li&gt;Convert to agent-hours: 2,400 / 60 × 4 concurrent agents ≈ 160 agent-hours on Max 5x.&lt;/li&gt;
&lt;li&gt;Apply the 4x multiplier for Max 20x: ~640 agent-hours.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cursor Ultra (API Pool, Opus 4.6 200K window):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API credits went from 0% to 26% over 60 minutes.&lt;/li&gt;
&lt;li&gt;Monthly API pool capacity: 100% / 26% × 60 minutes = ~231 minutes&lt;/li&gt;
&lt;li&gt;Normalize to agent-hours: ~231 mins × 4 agents / 60 mins/hour ≈ 15.4 agent-hours.&lt;/li&gt;
&lt;li&gt;Since this experiment used only Opus (a frontier model), only the API pool was consumed. We borrow the estimated ~138 total agent-hours for Cursor's 2 pools for the combined estimates.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this single experiment, Claude Code Max 20x delivers roughly 41x more than Cursor's API pool (640 / 15.4), or roughly 4.6x more than Cursor's combined capacity (640 / 138). Other experiments produced different ratios depending on the model, discounting, and control tightness. The ~5x headline is the central estimate across all experiments.&lt;/p&gt;

&lt;p&gt;This is back-of-the-spreadsheet math, not a precise benchmark. But for an order-of-magnitude comparison, it's enough.&lt;/p&gt;
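&lt;p&gt;For the curious, here's that spreadsheet math as a quick sketch, using the Experiment 7 readings above. The 4x multiplier is Anthropic's published Max 5x to Max 20x ratio; exact ratios wobble slightly with rounding:&lt;/p&gt;

```python
# Experiment 7 arithmetic. All readings are the observed values from
# this experiment; Max 20x is a projection, not a measurement.
AGENTS = 4
WEEKS_PER_MONTH = 4

# Claude Code (Max 5x, Opus 4.6 1M, measured during the 2x off-peak promo)
weekly_share = 0.42 - 0.37                 # 5% of the weekly limit in 60 min
weekly_minutes = 60 / weekly_share         # ~1200 min of 4-agent usage
weekly_minutes /= 2                        # normalize away the 2x promo
monthly_minutes = weekly_minutes * WEEKS_PER_MONTH   # ~2400
max5x_hours = monthly_minutes / 60 * AGENTS          # ~160 agent-hours
max20x_hours = max5x_hours * 4                       # ~640 (projected)

# Cursor Ultra API pool (Opus 4.6 200k)
api_share = 0.26                           # 26% of the monthly pool in 60 min
api_minutes = 60 / api_share               # ~231 min
api_hours = api_minutes * AGENTS / 60      # ~15.4 agent-hours

print(round(max20x_hours))                 # 640
print(max20x_hours / api_hours)            # ~41.6x vs the API pool alone
print(round(max20x_hours / 138, 1))        # ~4.6x vs Cursor's combined pools
```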

&lt;h2&gt;
  
  
  Qualitative observations
&lt;/h2&gt;

&lt;p&gt;A few things that don't show up in the numbers but matter for choosing a tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Composer 2 velocity was great.&lt;/strong&gt; Some of the projects were eye-opening: Composer 2 raced through an average of 7.1 projects to Opus 4.6's 2.3. Experiencing it in real time was striking. Whether that speed holds up on complex, ambiguous tasks is an open question.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Opus 4.6 performed consistently across both platforms.&lt;/strong&gt; Same model, same velocity on Claude Code and Cursor. The capacity difference between these tools is pricing architecture, not model quality. If you're choosing based on model capability, both platforms give you access to the same thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token consumption is volatile day to day.&lt;/strong&gt; Model updates, features, regressions, and discounting all hit during the same period. This may have caused noise in the experimental data, but it's also representative of daily life at a particularly active time in the technology and business of AI coding tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. The capacity gap is real: ~5x combined, ~38x on frontier models.&lt;/strong&gt; If you use Claude Code with Opus (its default), you get substantially more runway per dollar than Cursor. If you only compare frontier model access, it's not close.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. To make the most out of Cursor, you should be using Composer a lot.&lt;/strong&gt; Most of your Ultra budget buys Composer credits, not SOTA access. If Composer fits your workflow, you get ~138 agent-hours and strong velocity. If you want frontier models full-time, Cursor becomes extremely expensive per agent-hour. A common pattern is to use SOTA models for initial planning and research, then Composer models to implement the plan much more rapidly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Velocity matters — Composer 2 is much faster at completing projects.&lt;/strong&gt; More capacity doesn't automatically mean more output. An engineer running Composer 2 on tasks may complete more work in 138 hours than another running Opus.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The pricing model shapes your workflow.&lt;/strong&gt; Claude Code's speed-limit model rewards consistent daily usage with parallel agents. Cursor's monthly budget is more forgiving for bursty schedules. The "best" plan depends on how you work, not just the capacity math. (I covered this difference in &lt;a href="https://www.ashu.co/cursor-to-claude-code-stuck-at-16-percent-utilization/?utm_source=devto&amp;amp;utm_medium=syndication" rel="noopener noreferrer"&gt;Part 1&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Codex is a real contender&lt;/strong&gt; at ~1.6x Cursor's capacity. A number of engineers I know and follow online have been enjoying Codex for its knack for solving harder problems that Opus 4.6 may struggle with. And you get the SOTA model for all of your agent-hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. After Anthropic's "capacity reductions" for 7% of users, I ran out of 5-hour sessions more often, but not necessarily the weekly allotment.&lt;/strong&gt; I'm not 100% sure yet, because the measurements keep fluctuating, but the weekly allotment seems similar to what it was before. And since the weekly limit is the constraining factor, running out of 5-hour sessions more often doesn't necessarily mean I have fewer tokens per month overall.&lt;/p&gt;

&lt;h2&gt;
  
  
  Caveats and open questions
&lt;/h2&gt;

&lt;p&gt;This section is long on purpose. The caveats are as important as the findings.&lt;/p&gt;

&lt;h3&gt;
  
  
  Experimental design limitations
&lt;/h3&gt;

&lt;p&gt;No two experiments were identical. Models changed, plans changed, promotions came and went. Each experiment is a snapshot of a specific configuration on a specific day.&lt;/p&gt;

&lt;p&gt;I was the human bottleneck. Confirming "done" claims, assigning projects, occasional breaks — all of this introduces noise. Semi-autonomous mode created asymmetry across tools: each tool pauses at different moments for permission, which affects velocity differently and is unavoidable.&lt;/p&gt;

&lt;p&gt;Also, velocity was not the primary objective, since I was interested in token capacity (or agent-hours). In particular, code quality was probably decent, but it went unaudited. In my experience, the AI agents usually get most of the way to the finish line.&lt;/p&gt;

&lt;p&gt;Codex and Claude Code also both offer lighter, faster models (e.g. GPT 5.4 mini, Sonnet) with different speed and token-usage profiles.&lt;/p&gt;

&lt;p&gt;There are many more interesting variables and questions; for the sake of time, I didn't test them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Limitations in measurement and extrapolation
&lt;/h3&gt;

&lt;p&gt;The whole purpose of this experiment is to normalize across tools that report usage in fundamentally different units, and that's also the main source of imprecision. Claude Code reports percentages of session and week. Cursor reports dollars for API plus a separate pool for Composer, with a combined total. Converting between these systems requires assumptions.&lt;/p&gt;

&lt;p&gt;The resolution of the measurements is often low. If a reading jumps from 0% to 3% in an hour, the true value could be anywhere from 3.00% to 3.99% — a roughly 33% range of uncertainty. For that reason, I ran multiple experiments to get a sense of averages and ranges, and used 4 agents to accelerate the burn so I could see more numerical change in less time.&lt;/p&gt;
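
&lt;p&gt;The rounding math above can be sanity-checked in a few lines of shell (the "3%" reading is the example from this section, not a measured value):&lt;/p&gt;

```shell
# A whole-percent reading of "3" means the true value lies in [3.00, 3.99].
reported=3
low=$reported
high=$(awk "BEGIN { print $reported + 0.99 }")
# Relative spread: (3.99 - 3.00) / 3.00 = 0.33, i.e. roughly 33%
uncertainty=$(awk "BEGIN { printf \"%.0f\", ($high - $low) / $low * 100 }")
echo "true value in [$low, $high]: ~${uncertainty}% relative uncertainty"
```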

&lt;p&gt;I simplified the monthly extrapolation by multiplying the estimated weekly agent-hours by 4, i.e. 28 days. Since the average month is slightly over 30 days, this undercounts monthly capacity by roughly 9%.&lt;/p&gt;
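
&lt;p&gt;Concretely, the shortcut looks like this (the weekly agent-hours figure is an illustrative placeholder, not a measurement from the experiments):&lt;/p&gt;

```shell
# Monthly extrapolation: 4 weeks = 28 days, vs. an average month of ~30.44 days.
weekly_hours=35
monthly_hours=$(( $weekly_hours * 4 ))
# The x4 shortcut undercounts by (30.44 - 28) / 28, about 8.7%:
undercount=$(awk 'BEGIN { printf "%.1f", (30.44 - 28) / 28 * 100 }')
echo "~$monthly_hours agent-hours/month, undercounting by ~$undercount%"
```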

&lt;h3&gt;
  
  
  The chaotic experimental window
&lt;/h3&gt;

&lt;p&gt;I get the sense that something changed around March 13 or March 14 that accelerated Claude Code's token burn.&lt;/p&gt;

&lt;p&gt;Moreover, the 2x off-peak discount launched March 14 and ended March 28. I normalized by halving, but that normalization is an approximation. Composer 2 shipped March 19, so Experiment 7 may not represent steady state, though Experiment 8 (March 20, no discount) confirms the pattern. Codex's 2x promo was active through April 2, so normal-rate Codex may come in at 0.8x Cursor's capacity rather than ~1.6x (or, focusing on frontier models, 6.2x instead of 12.4x).&lt;/p&gt;

&lt;p&gt;I could have waited for a quiet week. But there hasn't been a quiet week in AI coding tools in months. This chaos &lt;em&gt;is&lt;/em&gt; normal usage — the launches, the promotions, the regressions. A perfectly controlled experiment would be more precise but less representative of what you'd actually experience.&lt;/p&gt;

&lt;h3&gt;
  
  
  Open questions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Does the capacity gap change for different work types: greenfield vs refactoring vs debugging?&lt;/li&gt;
&lt;li&gt;What about for tech stack? I was doing full-stack engineering in Elixir/React/Terraform. How does that change for Python/Svelte/Pulumi? Firmware? Mobile? SRE? Database internals?&lt;/li&gt;
&lt;li&gt;What's the quality gap? If any of the models' speed comes at a quality cost, the velocity advantage shrinks.&lt;/li&gt;
&lt;li&gt;How does this look on team and enterprise plans, particularly Claude Code Premium Seats in Teams?&lt;/li&gt;
&lt;li&gt;Will these numbers hold as all the companies adjust pricing and models adjust velocity?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Tips for reducing token usage
&lt;/h3&gt;

&lt;p&gt;I wanted to share a few resources I found online or heard while discussing this with friends and colleagues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you're using Opus, consider switching to Sonnet as your default model.&lt;/strong&gt; A few of my friends report that Sonnet is similarly effective, but faster and more token efficient. I've been mostly focused on Opus, so I can't speak to this directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Read the &lt;a href="https://code.claude.com/docs/en/costs#reduce-token-usage" rel="noopener noreferrer"&gt;Claude Code best practices&lt;/a&gt;.&lt;/strong&gt; Regardless of which tool you're using, some of the concepts in the guide may help.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clearing context more frequently is an easy change.&lt;/strong&gt; My experiments ran on models with 1M context, and I just let them run and auto-compact over the course of the hour. I believe the whole conversation gets sent up (minus caching effects), so clearing might be impactful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Set up a cron job 2–3 hours before your work day to send Claude/Codex a trivial message.&lt;/strong&gt; Given that the 5-hour session limit is a constraining factor, you typically get 2 sessions in an 8-hour work day. You can squeeze in a 3rd window by triggering a session before you start the bulk of your work. Note that you'll still hit the weekly constraints in the end.&lt;/p&gt;
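
&lt;p&gt;For example, a hypothetical crontab entry. The schedule, binary path and prompt are assumptions to adapt to your setup; to my knowledge, &lt;code&gt;claude -p&lt;/code&gt; (print mode) runs a single non-interactive prompt and exits:&lt;/p&gt;

```shell
# Fire a trivial prompt at 6am on weekdays so a fresh 5-hour session window
# opens ~3 hours before a typical 9am start. Adjust times and paths to taste.
0 6 * * 1-5 /usr/local/bin/claude -p "hi" >/dev/null 2>&1
```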

&lt;p&gt;&lt;strong&gt;Use "token-reducing" libraries like &lt;a href="https://www.rtk-ai.app/" rel="noopener noreferrer"&gt;RTK&lt;/a&gt;.&lt;/strong&gt; The premise is that a lot of CLI binaries that the AI coding agents call generate noisy output that is bad for LLMs. It creates a proxy to optimize the tokens. Consider looking for more, since this is a class of tooling. In the CLI, there is &lt;a href="https://github.com/mpecan/tokf" rel="noopener noreferrer"&gt;tokf&lt;/a&gt;. There are also prompt compressors like &lt;a href="https://github.com/microsoft/LLMLingua" rel="noopener noreferrer"&gt;Microsoft's LLMLingua&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Current events: recent news about token costs
&lt;/h2&gt;

&lt;p&gt;At the risk of extending this article further, I wanted to highlight a few recent news items as they pertain to this analysis.&lt;/p&gt;

&lt;p&gt;On March 5, 2026, &lt;a href="https://www.forbes.com/sites/annatong/2026/03/05/cursor-goes-to-war-for-ai-coding-dominance/" rel="noopener noreferrer"&gt;Forbes reported&lt;/a&gt; that Cursor's internal analysis showed the $200/mo Claude Code subscription bought about $2,000 of tokens at the end of last year, and about $5,000 of tokens by early March 2026. Compare that to Cursor's $200/mo plan offering &lt;a href="https://cursor.com/docs/models-and-pricing" rel="noopener noreferrer"&gt;$400/mo of API usage&lt;/a&gt; plus generous Auto+Composer. The reason I ran this experiment, though, was to translate those dollar figures into a more concrete question: how many hours of engineering work can I do with this?&lt;/p&gt;

&lt;p&gt;Also on March 5, 2026, investor-entrepreneur &lt;a href="https://x.com/chamath/status/2029634071966666964" rel="noopener noreferrer"&gt;Chamath Palihapitiya tweeted&lt;/a&gt; that his company 8090 chose to migrate off of Cursor because its AI costs had tripled since November 2025; they are "now spending many millions per year", trending toward $10m per year. He notes that part of it may be how the engineers are using the tooling, e.g. running runaway loops ("Ralph loops") without regard to cost. Either way, the main point stands: token costs are a topic of interest and an area worth thinking about.&lt;/p&gt;

&lt;p&gt;Around the weeks of March 14–26, users were reporting increased token burn rates. (See my LinkedIn posts: my initial observation on &lt;a href="https://www.linkedin.com/posts/0xandrewshu_fascinating-saturday-i-measured-that-activity-7439405612087635968-AseS/" rel="noopener noreferrer"&gt;March 14&lt;/a&gt;, and &lt;a href="https://www.linkedin.com/posts/0xandrewshu_in-the-last-2-days-folks-on-x-and-reddit-activity-7442678814041632769-XBr5/" rel="noopener noreferrer"&gt;my follow-up&lt;/a&gt; when the topic trended on X on March 25.) Anthropic announced &lt;a href="https://x.com/trq212/status/2037254607001559305" rel="noopener noreferrer"&gt;a capacity change&lt;/a&gt; on March 26, estimated to affect ~7% of users. But as of publishing this article (Mar 30), it seems &lt;a href="https://x.com/lydiahallie/status/2038686571676008625" rel="noopener noreferrer"&gt;they're still working on it&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I speculate that Anthropic tightened the 5-hour token limits, which helps them with scale, but that the weekly token limits didn't change much. If that's true, overall monthly token capacity doesn't change much either; you just run into the limits more often per day. (You might try the cron job I mention above.)&lt;/p&gt;
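
&lt;p&gt;That speculation reduces to a simple min() model. All numbers here are invented for illustration; they are not Anthropic's actual limits:&lt;/p&gt;

```shell
# If the weekly cap binds, shrinking the 5-hour session cap changes how often
# you stall during the day, not your total weekly tokens.
session_cap=100          # tokens per 5-hour session (arbitrary units)
sessions_per_week=14     # ~2 sessions per work day across a week
weekly_cap=800           # weekly token cap

effective=$(( $sessions_per_week * $session_cap ))   # 1400 if only sessions bound you
if [ "$effective" -gt "$weekly_cap" ]; then
  effective=$weekly_cap                              # ...but the weekly cap binds
fi
echo "effective weekly tokens: $effective"
```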

&lt;p&gt;Anyway, this article represents a moment in time as our use of the tools and the pricing models around them change. In June/July 2025, Cursor &lt;a href="https://techcrunch.com/2025/07/07/cursor-apologizes-for-unclear-pricing-changes-that-upset-users/" rel="noopener noreferrer"&gt;changed its pricing models&lt;/a&gt; in a way that upset users. I wouldn't be surprised if this continues to change.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;This started with a pricing question and ended up capturing a lot more. While the technology, pricing and business will continue to evolve, I wanted to do this deep dive to understand a snapshot of the ecosystem today. As things evolve further, I can have an anchoring mental model to reason about future changes.&lt;/p&gt;

&lt;p&gt;The choice isn't just "cheaper vs more expensive." It's what kind of capacity you need. Frontier model capacity for complex reasoning? Reach for Claude Code or Codex. Fast implementation throughput on well-scoped tasks, or you prefer an IDE? Cursor Composer has a real speed advantage when you combine frontier models for planning and troubleshooting with fast, lightweight models. Most engineers probably need some of both — the question is which default fits your workflow.&lt;/p&gt;

&lt;p&gt;I plan to keep running experiments as both tools evolve. If you're interested in discussing the findings, seeing the raw data, or talking about token math — I'd like to hear about it: &lt;a href="https://www.linkedin.com/in/0xandrewshu/" rel="noopener noreferrer"&gt;connect on LinkedIn&lt;/a&gt; or find me &lt;a href="https://x.com/0xAndrewShu" rel="noopener noreferrer"&gt;on X&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is Part 3 of my series on transitioning from Cursor to Claude Code. Catch up: &lt;a href="https://www.ashu.co/cursor-to-claude-code-stuck-at-16-percent-utilization/?utm_source=devto&amp;amp;utm_medium=syndication" rel="noopener noreferrer"&gt;Part 1: Stuck at 16%&lt;/a&gt;, &lt;a href="https://www.ashu.co/parallel-claude-code-agents/?utm_source=devto&amp;amp;utm_medium=syndication" rel="noopener noreferrer"&gt;Part 2: Parallel Agents&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>vibecoding</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>From 1 to 3 Parallel Claude Code Agents: How I Broke Past 16% Utilization</title>
      <dc:creator>Andrew Shu</dc:creator>
      <pubDate>Tue, 17 Mar 2026 16:41:24 +0000</pubDate>
      <link>https://dev.to/0xandrewshu/from-1-to-3-parallel-claude-code-agents-how-i-broke-past-16-utilization-mee</link>
      <guid>https://dev.to/0xandrewshu/from-1-to-3-parallel-claude-code-agents-how-i-broke-past-16-utilization-mee</guid>
      <description>&lt;p&gt;At the end of my last post, I was: &lt;a href="https://www.ashu.co/cursor-to-claude-code-stuck-at-16-percent-utilization/?utm_source=devto&amp;amp;utm_medium=syndication" rel="noopener noreferrer"&gt;stuck at 16% Claude Code utilization&lt;/a&gt; on the Max 20x plan, and had just figured out that parallel agents could help me break past that limit, explore git worktrees, and make better use of the $200 plan I had paid for. &lt;/p&gt;

&lt;p&gt;But I knew that git commits and conflicts would become a problem once 2 agents were committing to the same repository. So how do I coordinate and isolate them?&lt;/p&gt;

&lt;p&gt;Spinning up a second (or third) agent in another terminal is easy. But keeping them productive and increasing velocity was the new challenge. I had been reading many posts about people orchestrating tens or hundreds of background agents, but I hadn't read many tutorials covering the evolution from 1 to 3 agents.&lt;/p&gt;

&lt;p&gt;Anthropic recently shipped &lt;a href="https://code.claude.com/docs/en/agent-teams" rel="noopener noreferrer"&gt;Claude Code Agent Teams&lt;/a&gt;, which automates this: a lead agent coordinates teammates, assigns tasks, and synthesizes results across multiple sessions. But this is more about automated delegation of &lt;em&gt;a single existing&lt;/em&gt; project rather than adding the ability to parallelize arbitrary new projects. &lt;/p&gt;

&lt;p&gt;This post covers the observations, changes in my local development environment and the reasoning that took me from 16% to 50%+ utilization.&lt;/p&gt;

&lt;h2&gt;
  
  
  Token FOMO: how were people using 3+ Claude Code accounts?
&lt;/h2&gt;

&lt;p&gt;I'll confess: I was feeling token FOMO, watching engineers post articles about their agent squads and running 100 agents in parallel. Meanwhile, I was stuck at 16% on a $200/month plan. I didn't feel like I needed 100 agents, but I wanted to understand how to break past that limit and get on the path to higher output.&lt;/p&gt;

&lt;p&gt;After the initial experiment with a second agent to consume more tokens, I realized that extra agents would be chaos: merge conflicts, inconsistent databases, agents pulling the rug out from under each other. &lt;/p&gt;

&lt;p&gt;So, I decided to investigate how to coordinate them. This eventually took me on a journey that improved my workflow. But before I took the first step, I realized I needed to ask: would it actually increase my output?&lt;/p&gt;

&lt;h2&gt;
  
  
  Before adding a second agent, make sure the first is busy
&lt;/h2&gt;

&lt;p&gt;It's a bit counterintuitive. Spinning up a second agent is so easy that it's also easy to miss that you only get value if you can keep both agents mostly busy. Here are a few hints to figure out where you are.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fko0dhfooe2svc6w7xli2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fko0dhfooe2svc6w7xli2.png" alt="Visualization showing 1 busy agent beats 3 low-utilization agents." width="800" height="525"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;If your agent is often waiting on you:&lt;/strong&gt; waiting for an answer to its question, or for clarification on ambiguous tasks, it's not doing real work. You probably need a better way to keep it busy, and this is where &lt;a href="https://www.ashu.co/markdown-plan-files-vibe-coding/?utm_source=devto&amp;amp;utm_medium=syndication" rel="noopener noreferrer"&gt;structured markdown plan files&lt;/a&gt; pay off. If the agent is instead waiting on you to execute queries or a deployment, the problem may be tooling: you need to automate something and put it in the hands of the agents (if it's safe to).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If you keep interrupting the agent to micromanage,&lt;/strong&gt; it's effectively the same problem. It may be helpful to review the markdown plan files and get them into a more agreeable state before you let the agents run.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If your agents keep churning out work that you end up disagreeing with&lt;/strong&gt; (despite having reasonable plans), then the workload may not be good for parallelizing. I see this often when troubleshooting complex systems or complex projects.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If you prefer to hand-code, or you don't like context-switching&lt;/strong&gt; between multiple agents and tasks, you have a totally valid reason not to parallelize. Not every workflow benefits from more agents.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;But if you're sitting idle while the agent runs&lt;/strong&gt;—working on the next task, responding to Slack, browsing Reddit—this is the signal you can juggle another agent. Basically: if you're waiting on the agent regularly, committing and shipping regularly, then add more agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two agents crammed in a repo: isolation is a problem
&lt;/h2&gt;

&lt;p&gt;When you're ready for Agent 2, you become an engineering manager and face the problem of assigning useful work to your team. You need to source the work: come up with ideas, talk to people. You need to scope it so it's parallelizable and pragmatic. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzovyamgnsh40jrvd71g8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzovyamgnsh40jrvd71g8.png" alt="Visualization of file editing collisions from 2 agents working in the same repo." width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But there's a technical problem you'll face first. If two agents are touching the same files, you'll get merge conflicts, overwritten work and outdated understandings of the code. In my first few hours working with parallel coding agents, I tried to keep them productive and focused on separate concerns in the same repository.&lt;/p&gt;

&lt;p&gt;Here are a few methods I used to separate the agents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Frontend / backend split: these are often separate concerns in separate files&lt;/li&gt;
&lt;li&gt;Application and infrastructure code: e.g. one agent writes TypeScript, and another, Terraform&lt;/li&gt;
&lt;li&gt;Feature pipelining: first ship feature 1 behind a feature flag, and validate it / work on corner cases while another agent starts feature 2&lt;/li&gt;
&lt;li&gt;Async refactoring, hardening, polishing, documentation: sometimes if I have a bit of extra bandwidth, I'll spin up an extra agent to do maintenance that avoids my main work. It's useful to accumulate maintenance tasks in a backlog for the agent to pull from.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After a few coding sessions, I realized "separation" wasn't enough: I was trying to keep agents apart through convention rather than configuration. Reducing collisions wasn't enough; I needed to properly isolate the agents to eliminate collisions, so they could be more autonomous and faster.&lt;/p&gt;

&lt;p&gt;And note: these agent "separation" methods may not fully solve the "isolation" problem. But they remain useful scoping and delegation approaches even with fully isolated agents.&lt;/p&gt;

&lt;p&gt;I also began to be aware of the kinds of workloads that required more active attention, and some kinds of workloads that ran longer. I knew that I could only handle 1 workload that required active attention, which meant the other agents needed longer projects. And having longer projects meant a bit of planning ahead. &lt;/p&gt;

&lt;h2&gt;
  
  
  Git worktrees: multiple coding agents in the same repo
&lt;/h2&gt;

&lt;p&gt;Even with isolated, right-sized projects, I was still running the risk of git conflicts: the parallel coding agents were still editing files in the same directory.&lt;/p&gt;

&lt;p&gt;Git worktrees solve this problem. Each worktree is a separate checkout of the same repository: a different folder, a different branch, but linked to the same git history and object store. They're lightweight to create, and you can have 3 agents working in the same sandbox, all contributing back to the same repo. Learn more about &lt;a href="https://git-scm.com/docs/git-worktree" rel="noopener noreferrer"&gt;git worktrees&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Why git worktrees instead of separate clones? Worktrees are more lightweight because they share one object store and branch history. For example, a single &lt;code&gt;git fetch&lt;/code&gt; is visible in all worktree directories, and commits made in one worktree are known to the others.&lt;/p&gt;

&lt;p&gt;There are a few ways to set this up:&lt;/p&gt;

&lt;h3&gt;
  
  
  Worktrees with git
&lt;/h3&gt;

&lt;p&gt;A git worktree pairs a folder with a branch, so here are a few commands you can use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# List worktrees&lt;/span&gt;
git worktree list

&lt;span class="c"&gt;# Create a new worktree AND a new branch in 1 command&lt;/span&gt;
git worktree add &lt;span class="nt"&gt;-b&lt;/span&gt; &amp;lt;new-branch-name&amp;gt; &amp;lt;path/to/new/directory&amp;gt;

&lt;span class="c"&gt;# Create a new worktree with an existing branch&lt;/span&gt;
git worktree add &amp;lt;path&amp;gt; &lt;span class="o"&gt;[&lt;/span&gt;&amp;lt;branch&amp;gt;]

&lt;span class="c"&gt;# Remove the worktree (i.e. the folder) but the branch remains&lt;/span&gt;
git worktree remove &amp;lt;path&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Built-in support in vibe coding tools
&lt;/h3&gt;

&lt;p&gt;I won't elaborate here, but I'll link to the documentation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cursor: &lt;a href="https://cursor.com/docs/configuration/worktrees" rel="noopener noreferrer"&gt;Parallel Agents&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Claude Code: &lt;a href="https://code.claude.com/docs/en/common-workflows#run-parallel-claude-code-sessions-with-git-worktrees" rel="noopener noreferrer"&gt;Run parallel Claude Code sessions with Git worktrees&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Codex: &lt;a href="https://developers.openai.com/codex/app/worktrees/" rel="noopener noreferrer"&gt;Worktrees&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note: If you want automated multi-agent coordination, check out &lt;a href="https://code.claude.com/docs/en/agent-teams" rel="noopener noreferrer"&gt;Claude Code Agent Teams&lt;/a&gt;. This is useful to parallelize tasks within a single project. The worktree-based approach I describe is slightly different. You can create and control your own system of parallel agents to launch multiple, arbitrary projects. Claude Code Agent Teams lets you burn down a project's list faster, and the worktree-based approach lets you branch out to work on multiple projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  3 parallel API servers, 3 frontend servers, 3 databases
&lt;/h2&gt;

&lt;p&gt;Git worktrees went a long way, but I noticed that verifying my code was tedious. Stateless unit tests were easy; I could run them to my heart's content in each worktree. But integration tests that touched the database ran against mismatched Postgres schemas, and my local servers obviously collided on ports.&lt;/p&gt;

&lt;p&gt;Here's what a sample web server might look like, with an API server and UI server that each have environment variables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3q4otvknx3gyoi983d8r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3q4otvknx3gyoi983d8r.png" alt="Baseline visualization of a sample web server. We want to replicate this, 1 per agent." width="800" height="473"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I found myself spinning up and tearing down servers, running migrations back and forth. I started thinking about how to isolate them a bit better by extracting configs into environment files, then parameterizing different ports and databases. &lt;a href="https://12factor.net/" rel="noopener noreferrer"&gt;Classic DevOps practices&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj5xuwhw6h9nq12z5ezqi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj5xuwhw6h9nq12z5ezqi.png" alt="Visualization of worktree creation / teardown so we can run multiple, isolated services locally." width="800" height="516"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, I wrote a few Claude Skills that wrap the underlying git worktree command: create a worktree, allocate ports, provision databases, generate &lt;code&gt;.env&lt;/code&gt; files, install packages and register the ports/database in a central JSON file. &lt;/p&gt;

&lt;p&gt;I've open-sourced a generic version that you can customize for your application: you can find it &lt;a href="https://github.com/0xandrewshu/ai-utils/tree/main/skill-worktree" rel="noopener noreferrer"&gt;on Github&lt;/a&gt;. You'll need to customize the setup/teardown to the specifics of your environment. While I can't make it turnkey for every solution, I wanted to share the structural elements: where it runs and how it runs.&lt;/p&gt;
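
&lt;p&gt;To make the structure concrete, here's a minimal dry-run sketch of what such a bootstrap computes. The names, port bases and naming scheme are illustrative assumptions; the actual Skill on GitHub differs in detail and runs the real commands:&lt;/p&gt;

```shell
# Derive a port pair and database name per agent slot, then print the
# commands a real bootstrap would run (dry run: nothing is modified).
name="feature-x"                          # project/branch name
index=2                                   # agent slot: 1, 2, 3, ...

api_port=$(( 4000 + $index ))             # one port range per agent slot
ui_port=$(( 3000 + $index ))
db_name="app_dev_$(echo "$name" | tr '-' '_')"

echo "git worktree add -b $name ../wt-$name"
echo "createdb $db_name"
printf 'PORT=%s\nUI_PORT=%s\nDATABASE_URL=postgres://localhost/%s\n' \
  "$api_port" "$ui_port" "$db_name"
```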

&lt;h3&gt;
  
  
  Results: from 16% to 50% with 3 parallel agents
&lt;/h3&gt;

&lt;p&gt;Parallelizing coding agents was fast and straightforward: for a small application, I reconfigured my tooling in about a day. With the agent Skills I shared in Github, I was able to create a worktree and provision the environment in such a way that I could scale up the number of parallel agents to 3 and beyond.&lt;/p&gt;

&lt;p&gt;Whereas I was stuck at 16% before, I was quickly hitting 45+% consistently. I got to the point where I started bumping up against the weekly rate limits. And I could finally see the pathway to 10 or more agents, and the need for 2 or more Max 20x subscriptions.&lt;/p&gt;

&lt;p&gt;But in the end, it wasn't about getting to the top of a token leaderboard. Boosting my Claude Code utilization from 16% was an exercise in grounding my use of AI coding agents in getting useful work done.&lt;/p&gt;

&lt;p&gt;It was helpful to work on a small project to exercise my software development lifecycle: planning, implementing, testing, and running on simple cloud infrastructure. What's the use of fast coding if you can't operate and troubleshoot it?&lt;/p&gt;

&lt;h2&gt;
  
  
  Don't feel FOMO about orchestrating 10+ parallel agents
&lt;/h2&gt;

&lt;p&gt;It's important to keep our eyes on the goal: to build things people use, enjoy and get value from.&lt;/p&gt;

&lt;p&gt;There are folks who are pushing limits and are aiming to build fleets of 10's or 100's of parallel agents. That's awesome, and I can't wait to see what abstractions and tools they create to make it useful for the rest of us.&lt;/p&gt;

&lt;p&gt;But, I wanted to figure out the pathway to parallel agents in a grounded, lightweight way. I wanted to figure out &lt;em&gt;when&lt;/em&gt; to parallelize, &lt;em&gt;how&lt;/em&gt; to split up the work, and &lt;em&gt;what infrastructure to set up&lt;/em&gt;. AI coding agents are clearly accelerating our work, but I wanted to feel the rough edges so I know how specific tools solve specific problems.&lt;/p&gt;

&lt;p&gt;If you take one thing from this: don't start with the tooling. Start by getting your first agent fully occupied, then find an isolated task for a second. The changes you make should follow the problems you encounter.&lt;/p&gt;

&lt;p&gt;If you want to skip the manual setup, &lt;a href="https://github.com/0xandrewshu/ai-utils/tree/main/skill-worktree" rel="noopener noreferrer"&gt;grab the worktree bootstrap script on GitHub&lt;/a&gt; and customize it for your project. It handles port allocation, database creation, and env config for Rails, Phoenix, Django, and similar stacks. Check out the &lt;code&gt;readme.md&lt;/code&gt; for instructions.&lt;/p&gt;

&lt;p&gt;Now that I'm running 3 agents and burning tokens 3x faster, the cost comparison between Claude Code and Cursor becomes interesting again. Next up: I'm going to dig into the pricing math to answer my original question about why switching from Cursor to Claude Code seemed to drop my token usage by 64%.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is Part 2 of my 3-part series on my experience transitioning from Cursor to Claude Code. Catch up: &lt;a href="https://www.ashu.co/cursor-to-claude-code-stuck-at-16-percent-utilization/?utm_source=devto&amp;amp;utm_medium=syndication" rel="noopener noreferrer"&gt;Part 1: Stuck at 16%&lt;/a&gt;. Part 3 next week.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>vibecoding</category>
      <category>tutorial</category>
      <category>programming</category>
    </item>
    <item>
      <title>I Switched from Cursor to Claude Code and Got Stuck at 16% Utilization</title>
      <dc:creator>Andrew Shu</dc:creator>
      <pubDate>Fri, 13 Mar 2026 19:16:47 +0000</pubDate>
      <link>https://dev.to/0xandrewshu/i-switched-from-cursor-to-claude-code-and-got-stuck-at-16-utilization-4ca6</link>
      <guid>https://dev.to/0xandrewshu/i-switched-from-cursor-to-claude-code-and-got-stuck-at-16-utilization-4ca6</guid>
      <description>&lt;p&gt;While tinkering over the holidays, I remember thinking: "This is so strange! I was easily reaching $350 of Claude tokens in Cursor usage for the month. After switching to Claude Code, I was barely making it past 16% in Claude Code's 5-hour sessions. Comparing Claude vs Cursor's $200 plans, they both cost $200 / month. It's the same work, same velocity, yet I'm experiencing totally different limits."&lt;/p&gt;

&lt;p&gt;Given my ops and scaling experience, I'm mindful of how much it costs to operate software, so I obviously couldn't leave this alone. What started as worry that I had overpaid for a $200 plan ended up significantly accelerating my workflows, as I tried to make full use of my Claude Code allotment.&lt;/p&gt;

&lt;h2&gt;
  
  
  From Cursor to Claude Code: monthly token counter vs 5-hour speed limits
&lt;/h2&gt;

&lt;p&gt;Within 15 minutes of using Claude Code, I realized I was going to need much more than the Pro plan ($20/month); I had started with the smallest paid plan deliberately, to feel out where the limits were. Hitting that threshold so quickly was a shock at first, since my mental model of "token limits" was still based on Cursor's monthly window. &lt;/p&gt;

&lt;p&gt;At the time, it took me a few days to use up Cursor's monthly tokens. With Claude's 5-hour cycle, you get fast feedback that the plan you've chosen is too small. So to reframe my observation: it wasn't that I had "used up all the tokens for the month", but that I was consuming tokens within a single 5-hour session faster than the plan supported.&lt;/p&gt;

&lt;p&gt;Given how fast I had hit the $20 Pro plan's limit, I skipped the middle Max 5x plan and jumped straight to the $200/mo Max 20x plan. (Anthropic only charged me the prorated difference, so upgrading was easy.) &lt;/p&gt;

&lt;p&gt;I assumed I was going to hit 80-100% utilization, like I did in Cursor. Imagine my surprise when, after a day or two of coding, I realized I never came anywhere close on the Max 20x plan!&lt;/p&gt;

&lt;h2&gt;
  
  
  Did I overpay for Claude's Max 20x Plan? No, but I needed to learn how to use it.
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqm2rd18hb0jtktvc5u2h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqm2rd18hb0jtktvc5u2h.png" alt="Claude Code tool usage dashboard showing low utilization per 5-hour session" width="800" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After my first 2-3 Claude Code sessions, I noticed I was only using 6-12% of each 5-hour usage window -- a fraction of the $200 I had spent. This was a surprise: I was running the same coding workloads on Cursor and Claude Code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmdsg1rt27l36y8012am8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmdsg1rt27l36y8012am8.png" alt="Cursor tool usage dashboard showing high monthly token consumption" width="800" height="254"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Having such low Claude Code utilization was great, because it meant I could code more and spend less money! But it bothered me on two levels: first, could I have gotten away with paying less? And second, how were people hitting 100%? Not just 100% -- I was reading online about people maxing out 2-3 Claude Code accounts.&lt;/p&gt;

&lt;p&gt;My goal wasn't to maximize token spend or get to the top of the leaderboard. I was puzzled and bothered by this underutilization. So, the first thing I did was to &lt;a href="https://www.ashu.co/markdown-plan-files-vibe-coding/" rel="noopener noreferrer"&gt;set up structured markdown plans&lt;/a&gt; to launch longer-running agents that made full use of the 5-hour session. This let me confirm that tasks were reasonable, and I was unlikely to need to pause Claude's work to answer questions and troubleshoot.&lt;/p&gt;

&lt;p&gt;After a few focused high-usage sessions, I managed to push my utilization to 14-16%. And that seemed to be the ceiling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Claude Code's 5-hour sessions are a "use it or lose it" rate limit that spreads out usage
&lt;/h2&gt;

&lt;p&gt;Let's dive into Claude Code's rate limiting system. The 5-hour usage window functions as a "speed limit": in practice, you figure out your speed of token usage and calibrate your plan accordingly. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpyhxbp5vlilk11jat8me.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpyhxbp5vlilk11jat8me.png" alt="Cursor pricing model diagram showing monthly token accumulation" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So when we compare Cursor's and Claude Code's pricing models, Cursor bills by total tokens consumed per month. That means I could leave Cursor untouched for 29 days, then use up all my tokens on the 30th. (I assume there is a rate limit for extremely bursty token usage in Cursor, but I've never hit it.) It also means that comparing Cursor to Claude Code licensing is an apples-to-oranges comparison.&lt;/p&gt;

&lt;p&gt;Claude Code also has "weekly limits", a second layer on top of the 5-hour limit. Imagine maxing out the 5-hour limit 24/7; that could get extremely expensive for Anthropic. So Anthropic sets an upper limit on sustained usage over the week. From a pricing-design perspective, they could have made it a monthly limit. But by making it weekly, you either use each week's allotment or lose it, because it doesn't roll over to the next week.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flico1xpgo2n7tgnlg67r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flico1xpgo2n7tgnlg67r.png" alt="Claude Code pricing model diagram showing 5-hour burst limit layered with weekly ceiling" width="800" height="541"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So the 5-hour limit is a "burst speed limit", and the weekly limit is a "sustained usage limit". The 5-hour window smooths out utilization across the day, and the weekly window smooths it out over the month. Since tokens don't roll over from one week to the next, you use them or lose them: you can miss a few 5-hour windows and make up the usage later in the week, but whatever you haven't used by the end of the week is gone. &lt;/p&gt;
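&lt;p&gt;The two layers can be modeled as a pair of sliding windows over the same usage log. Here's a toy sketch of that structure -- the caps, window lengths, and accounting are my illustrative assumptions, not Anthropic's actual implementation or published numbers:&lt;/p&gt;

```python
import time
from collections import deque


class DualWindowLimiter:
    """Toy model of a "use it or lose it" dual-window rate limit.

    The caps and windows are hypothetical; this only illustrates the
    burst-vs-sustained structure, not Anthropic's real system.
    """

    def __init__(self, burst_cap, weekly_cap,
                 burst_window=5 * 3600, weekly_window=7 * 24 * 3600):
        self.burst_cap = burst_cap          # tokens per 5-hour window
        self.weekly_cap = weekly_cap        # tokens per 7-day window
        self.burst_window = burst_window
        self.weekly_window = weekly_window
        self.events = deque()               # (timestamp, tokens) usage log

    def _used(self, window, now):
        # Tokens spent within the trailing window.
        return sum(t for ts, t in self.events if window >= now - ts)

    def try_spend(self, tokens, now=None):
        now = time.time() if now is None else now
        if self._used(self.burst_window, now) + tokens > self.burst_cap:
            return False  # hit the 5-hour "speed limit"
        if self._used(self.weekly_window, now) + tokens > self.weekly_cap:
            return False  # hit the sustained weekly ceiling
        self.events.append((now, tokens))
        return True
```

Because both checks read the same trailing log, a quiet 5-hour window frees up burst capacity later, but tokens never carry past the weekly window -- the "use it or lose it" behavior described above.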

&lt;p&gt;This isn't a bad thing. Most engineers are probably doing work spread out over days and weeks, so Claude Code's system is a fair arrangement for typical engineering work, one that makes better use of compute resources and of the effort you put into writing code. If you want higher usage than either plan allows, you're an advanced user, and you'll need to pay (higher) API token costs.&lt;/p&gt;

&lt;p&gt;Another way to look at it: the 5-hour session roughly maps onto a workday, and the weekly limit assumes something like a 40-hour work week. Some of the windowing and upper limits make sense through this lens, and stop making sense if you're trying to utilize your license on a 24/7 schedule. I haven't explored the math and logic behind this framing; I'm sharing it as a thought experiment, so take it with a grain of salt.&lt;/p&gt;
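&lt;p&gt;The back-of-envelope arithmetic for that framing looks like this -- again, these are my assumptions, not published limits:&lt;/p&gt;

```python
# Back-of-envelope numbers for the workday / work-week framing above.
# These are illustrative assumptions, not Anthropic's published limits.
hours_in_week = 24 * 7                 # 168 hours of round-the-clock capacity
work_week = 40                         # a typical work week

# A 40-hour week is only about a quarter of 24/7 capacity:
print(f"work week share of 24/7: {work_week / hours_in_week:.0%}")

# If the weekly cap assumed saturated 24/7 usage, 16% utilization would
# correspond to roughly this many hours of fully saturated usage:
print(f"16% of {hours_in_week}h: {0.16 * hours_in_week:.0f} hours")
```
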

&lt;h2&gt;
  
  
  So did I get past 16% utilization in Claude Code?
&lt;/h2&gt;

&lt;p&gt;After all this, I was still stuck at 16% utilization. I understood why: the speed-limit system means that a single coding session with a single agent has a natural upper limit. No matter how focused I was, one human directing one AI agent can only consume tokens so fast.&lt;/p&gt;

&lt;p&gt;And that raised the obvious question: if one agent tops out at ~16%, and people online are hitting 100%+ across multiple accounts... they must be running agents in parallel. This meant I needed to figure out how to coordinate multiple AI agents working on the same codebase without them stepping on each other's toes.&lt;/p&gt;

&lt;p&gt;To coordinate parallel agents, I had to rethink how I broke down projects. It also led me to git worktrees and additional changes in my local development environments. I'll cover that in my next post, and describe how it took my utilization from 16% to 50%+.&lt;/p&gt;




&lt;p&gt;If you're tracking your own Claude Code or Cursor utilization, or you've figured out the parallel agent workflow, I'd love to hear about it — DM me &lt;a href="https://www.linkedin.com/in/0xandrewshu/" rel="noopener noreferrer"&gt;on LinkedIn&lt;/a&gt; or &lt;a href="https://x.com/0xAndrewShu" rel="noopener noreferrer"&gt;on X&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I write weekly about vibe coding workflows, costs, and tools. Follow me here on Dev.to, or subscribe at &lt;a href="https://www.ashu.co/?utm_source=devto&amp;amp;utm_medium=syndication" rel="noopener noreferrer"&gt;ashu.co&lt;/a&gt; for email updates.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>vibecoding</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Why I use markdown plan files instead of Cursor and Claude's built-in planning</title>
      <dc:creator>Andrew Shu</dc:creator>
      <pubDate>Mon, 02 Mar 2026 17:49:53 +0000</pubDate>
      <link>https://dev.to/0xandrewshu/why-i-use-markdown-plan-files-instead-of-cursor-and-claudes-built-in-planning-35co</link>
      <guid>https://dev.to/0xandrewshu/why-i-use-markdown-plan-files-instead-of-cursor-and-claudes-built-in-planning-35co</guid>
      <description>&lt;p&gt;The technique that helped me jump over the threshold from "coding with AI" to actually "vibe coding" was the use of a plain markdown file. Interestingly, it wasn't Cursor's built-in planning mode, nor Claude's in-memory task lists. It was a plain markdown file, with numbered subtasks, and a breakdown of the work that needed to be done. &lt;/p&gt;

&lt;p&gt;Early on, I stayed "hands-on" with the coding agent because I was concerned about multi-part problems, about the agent "jumping in too soon", and about not knowing how the code worked if the AI hallucinated or ran out of context (tokens). But I found that vibe coding with markdown plans gave me an artifact that (quickly) helped me be more intentional with design.&lt;/p&gt;

&lt;p&gt;In this article, I'll share the prompt I put into &lt;code&gt;AGENTS.md&lt;/code&gt; (which I've shared &lt;a href="https://github.com/0xandrewshu/ai-utils/tree/main/rule-markdown-plan" rel="noopener noreferrer"&gt;on GitHub&lt;/a&gt;), what worked, and what didn't. But first, some context about how I got there, because I think the path illustrates a pattern a lot of engineers get stuck in.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I went from 'coding with AI' to actually vibe coding
&lt;/h2&gt;

&lt;p&gt;I used Cursor in my work through most of 2025. I found it useful for inline prompt editing, one-off chat questions, and greenfield scripts for reporting and maintenance. But I was mostly doing single-file work. Multi-file, multi-step projects with real complexity? I'd start in Cursor, hit a hallucination or a design decision I disagreed with, and fall back to writing it by hand.&lt;/p&gt;

&lt;p&gt;For me, it came down to a conversation with Michael Stahnke (leading engineering at Flox) at GitHub Universe last year. We were comparing notes on how we were using AI for coding, and he mentioned something that resonated instantly: he was structuring his work in markdown plan files. This gave me a lever of control -- it let me audit a multi-part project breakdown before implementation.&lt;/p&gt;

&lt;p&gt;From there, I was able to go from prompting the agent for each change I wanted, to taking a step back and creating a plan that I could let the AI coding agents run on for hours. This was when I truly went "hands off" and let the AI steer for itself.&lt;/p&gt;

&lt;p&gt;Here's what I've found makes this approach work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2nltfj4ct9h4xqhhwhx9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2nltfj4ct9h4xqhhwhx9.png" alt="Markdown plans are a great way to breakdown complex tasks and vibe code better with Claude and Cursor (Generated with GPT-4o)&amp;lt;br&amp;gt;
" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Benefits of a markdown plan file, for me and the AI coding agents
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;It gives me an artifact I can control.&lt;/strong&gt; With a file artifact, I can find it easily in my codebase, edit it if I want to, commit it for future reference, append notes and learnings, or feed it into another system or automation. This is a subtle point, but it's the key reason why I prefer a "markdown plan" that I own, over the "planning modes" in Cursor / Claude.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It makes me and the coding agents more intentional.&lt;/strong&gt; AI is great, but still not perfect. A markdown plan forces a quick up-front research phase where I can identify gaps in understanding and areas I may disagree with. It also forces me to roughly understand what's being built. Even as things become more autonomous, it's still important to understand what you own -- even if at a higher level and across many more systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I can control the pace better.&lt;/strong&gt; I can pause the AI to ask questions, rewind to a previous state (in git and in the plan file), skip around tactically, adapt the plan as we discover new information, or go fully hands-free. The goal is to increase autonomy and parallelism, but having a file with clearly numbered and grouped tasks lets me communicate about and manage chunks of work. This can be extended by pointing multiple agents at different phases of the same plan.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It gives me more options for managing my conversation's context window.&lt;/strong&gt; Since agents &lt;a href="https://www.youtube.com/watch?v=rmvDxxNubIg" rel="noopener noreferrer"&gt;lose efficacy&lt;/a&gt; as their context windows fill up, different people have different preferences for how often they "reset" the agent: some reset at 50% usage, some at 90%, and some are OK with "infinite-ish" conversation compaction. In any case, having a markdown plan with task status, a work log, and git commits gives you the option to clear the conversation at any time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A plan file gives me a place to deposit learnings, error messages and TODOs.&lt;/strong&gt; This is more of a documentation step. As I interact with the agent, there are times when it encounters an error message, or makes a design decision that I want to remember or revisit, and I log those for later. Since I always archive my plan files instead of deleting them, I plan to use them as a work journal where I can come back and ask questions like "what tech debt have I accumulated?". This covers a gap in knowledge, because it's not captured in code or commit messages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I can automate post-implementation steps as a Cursor / Claude Skill.&lt;/strong&gt; Cursor Skills and Claude Skills both support reusable "prompt actions" - you can think of them as natural language scripts. When everything in the plan is done, I need to delete the plan or archive it to a different folder. I've noticed that this is a natural point to run a Skill that reviews the code and looks for opportunities to improve: security, testing, documentation, and refactoring.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example Markdown plan to create a local Next.js frontend / backend for the coding agent&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Markdown plan file best practices for Claude Code, Cursor, etc.
&lt;/h2&gt;

&lt;p&gt;I've put my plan prompt on Github, and you can &lt;a href="https://github.com/0xandrewshu/ai-utils/blob/main/rule-markdown-plan/examples/2026-03-01-nextjs-hello-world.md" rel="noopener noreferrer"&gt;read an example&lt;/a&gt; of a markdown plan for creating a simple "hello world" Next.js app.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/0xandrewshu" rel="noopener noreferrer"&gt;
        0xandrewshu
      &lt;/a&gt; / &lt;a href="https://github.com/0xandrewshu/ai-utils" rel="noopener noreferrer"&gt;
        ai-utils
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      A collection of scripts, prompts and docs for use with AI and vibe coding
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;AI Utilities: collection of scripts, prompts and docs&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;As I use AI for vibe coding or other types of work, I find it helpful to collect reusable prompts, skills, subagent files, configs, etc. I'm creating this repository to deposit artifacts that I've found useful.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Compatibility&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;The intent is for these snippets to be reusable across AI coding tools (e.g. Claude Code, Cursor, Codex, Gemini / Antigravity, Copilot, etc.). There are occasionally differences in capabilities, but the tools have typically "caught up" with one another pretty quickly.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Repository organization&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;Initially, I plan to organize these as a flat directory until more organization is necessary.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;rule-$NAME/&lt;/code&gt; - e.g. CLAUDE.md, AGENT.md, AGENTS.md&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;skill-$NAME/&lt;/code&gt; - e.g. Claude Skills, Cursor Skills&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;subagent-$NAME/&lt;/code&gt; - e.g. Claude Subagents, Cursor Subagents&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;prompt-$NAME/&lt;/code&gt; - e.g. reusable prompts to copy/paste into vibe coding tools (Claude Code, Cursor) or chat AI tools (Claude.ai, ChatGPT)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In each directory, I'll aim to…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/0xandrewshu/ai-utils" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;h3&gt;
  
  
  Header: a quick summary of the plan
&lt;/h3&gt;

&lt;p&gt;I like to have a few lines at the top that summarize what the plan file contains: title, date, objective, and references to any child or related plans. I often spin off and split big plans into child plans, and I find it useful for the child plans to reference the parent plan, and vice versa.&lt;/p&gt;

&lt;h3&gt;
  
  
  Task List: the focal point for the implementation
&lt;/h3&gt;

&lt;p&gt;Near the top of the plan, I like to have a consistently structured markdown table of tasks. This is the focal point of the plan: a backlog that sequences and organizes the work. Since AI is often inconsistent about formatting, and the structure of the task list is important to my workflow, I've made it a point to specify the structure concretely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I like this markdown table to have these 5 columns: #, task, status, priority, comments&lt;/li&gt;
&lt;li&gt;All tasks are numbered, so I can tell the AI things like "Do 1.1 - 1.3 but skip 1.4"&lt;/li&gt;
&lt;li&gt;Tasks are grouped into "phases", so I can tell the AI things like "do phase 3 first"&lt;/li&gt;
&lt;li&gt;I find that using emojis like "✅ Completed" helps me visualize status better for larger lists&lt;/li&gt;
&lt;/ul&gt;
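
&lt;p&gt;To make that structure concrete, here's a minimal skeleton of the header and task list sections. The plan content itself is a hypothetical example; the columns and emoji statuses follow the conventions above:&lt;/p&gt;

```markdown
# Plan: Add CSV export to reports page
Date: 2026-03-01
Objective: Let users download any report as CSV
Related plans: none

## Task List

| #   | Task                          | Status         | Priority | Comments       |
| --- | ----------------------------- | -------------- | -------- | -------------- |
| 1.1 | Research existing export code | ✅ Completed   | High     |                |
| 1.2 | Add CSV serializer            | 🔄 In progress | High     |                |
| 2.1 | Wire up download button       | ⬜ Not started | Medium   | Blocked on 1.2 |
```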

&lt;h3&gt;
  
  
  Task Breakdown: a design doc to review
&lt;/h3&gt;

&lt;p&gt;This functions like a design doc - I like to audit this BEFORE implementation. It's usually accurate, so it's mostly to catch the occasional issue and to improve my understanding.&lt;/p&gt;

&lt;h3&gt;
  
  
  Work log: a journal of errors, problems and learnings
&lt;/h3&gt;

&lt;p&gt;This is a dropoff location where I ask the AI to deposit error messages, design decisions and tradeoffs, so I can reference them later. When I run a Claude Skill to "close up my plan", I have hooks that reflect on problems described in the work log. I want to be able to query for exact error messages so I can document them later.&lt;/p&gt;

&lt;h3&gt;
  
  
  TODO section: accumulating ideas that don't impact scope
&lt;/h3&gt;

&lt;p&gt;Regularly in my work, I have to make tradeoffs and tell the AI "this is outside scope, but log it for later". So I say "save a todo to do XYZ", and the TODOs are stored here. Since I archive my plans (in git, or Obsidian), I can later query old plans to extract TODOs relating to "GitHub Actions", for example.&lt;/p&gt;

&lt;h2&gt;
  
  
  What didn't work: hand-written plans, long AGENTS.md files
&lt;/h2&gt;

&lt;p&gt;I often find it helpful to read about what people tried that didn't work. So here are a few things I tried:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Writing the plan files by hand.&lt;/strong&gt; When I started out, I would hand-write a plan. Very quickly, I realized that this time-consuming process could and should be done by the AI. I see and talk to engineers still doing this, and I think it's a common misconception: start from a short prompt, and let the AI coding agent flesh out the plan.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Estimating and documenting "effort" and "risk" to influence the agent's behavior.&lt;/strong&gt; I thought this would lead the agent to scale its rigor and safety up or down. In the end, I saw no evidence of that, and its estimates of effort and risk were wildly inaccurate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A long planning guideline in &lt;code&gt;AGENTS.md&lt;/code&gt;.&lt;/strong&gt; The naive version of my planning guideline grew long: because I had the AI add instructions every time something annoyed me, it swelled to 130 lines, including a nearly complete example. I've since condensed it to around 35 lines, and it works just as well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Having no planning guideline, and just telling the agent to "create a md plan".&lt;/strong&gt; The result was usually correct, but inconsistent. I depend on the task list being near the top and structured in a particular way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Telling the agent to "create a plan", and assuming it would follow the &lt;code&gt;AGENTS.md&lt;/code&gt;.&lt;/strong&gt; I have to explicitly say "create a markdown plan", and sometimes "create a md plan according to guidelines".&lt;/p&gt;

&lt;h2&gt;
  
  
  Counterpoint: why markdown plans may not be for everyone
&lt;/h2&gt;

&lt;p&gt;Before I close, I should note that this technique may not be for everyone. &lt;/p&gt;

&lt;p&gt;For starters, Claude and Cursor's planning modes are actually pretty solid. I think they work for most cases and are simpler to use. Waiting for a plan, reviewing it, and then implementing it takes time. And if you find yourself agreeing with most plans as written, you may as well have the AI jump straight into implementation and review the output at the end.&lt;/p&gt;

&lt;p&gt;And as engineers move toward more parallel agents and longer-running autonomous agents, they may have to re-evaluate markdown plans: maybe they aren't scalable enough, or are too freeform. For example, Steve Yegge's &lt;a href="https://github.com/steveyegge/beads" rel="noopener noreferrer"&gt;beads project&lt;/a&gt; approaches this as an issue tracker. (I haven't tried it, but I'd like to.) I believe the idea is that a more structured workflow will boost clarity, performance and understanding, especially for "totally autonomous agent teams" like his &lt;a href="https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16dd04" rel="noopener noreferrer"&gt;Gas Town&lt;/a&gt; project. &lt;/p&gt;

&lt;p&gt;There are other philosophies worth exploring. The &lt;a href="https://ghuntley.com/ralph/" rel="noopener noreferrer"&gt;Ralph Wiggum loop&lt;/a&gt;, created by Geoffrey Huntley, takes a different approach entirely: instead of planning across a long session, it runs the agent in a bash loop with fresh context each iteration. Progress lives in files and git, not in the agent's memory, so it avoids "context rot". &lt;/p&gt;

&lt;p&gt;Another philosophy: &lt;a href="https://martinfowler.com/articles/exploring-gen-ai/sdd-3-tools.html" rel="noopener noreferrer"&gt;Spec-driven development&lt;/a&gt; (with tools like GitHub &lt;a href="https://github.com/github/spec-kit" rel="noopener noreferrer"&gt;Spec Kit&lt;/a&gt; and AWS's &lt;a href="https://kiro.dev/docs/specs/" rel="noopener noreferrer"&gt;Kiro&lt;/a&gt;) goes further on the planning axis — writing detailed specifications with acceptance criteria before any code generation, so the spec itself becomes the source of truth. &lt;/p&gt;

&lt;p&gt;My markdown plans sit somewhere in between: more structured than a Ralph loop prompt, lighter than a full SDD spec. &lt;/p&gt;

&lt;h2&gt;
  
  
  Closing thoughts: markdown plans are medium weight, and that's the point
&lt;/h2&gt;

&lt;p&gt;This may seem like a heavyweight process, but in reality it's pretty quick. It's not meant for truly lightweight, one-shot prompts; I primarily use it for larger, multi-hour runs where I want to reduce the likelihood of poor-quality work. &lt;/p&gt;

&lt;p&gt;To reference the now-common saying that "with vibe coding, all engineers become managers": a markdown plan is basically a manager or tech lead asking a team member to do a bit of research, project planning and design. The complexity and time spent should scale up or down with the complexity and urgency of the work.&lt;/p&gt;

&lt;p&gt;I should also add that I'm open to moving beyond markdown plans; I don't think this is necessarily the end state of project planning. Specifically, what I care about is: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Taking a small amount of time to do some planning that I can iterate on as things change&lt;/li&gt;
&lt;li&gt;Having good task management: task identification, descriptions, explanations, groupings&lt;/li&gt;
&lt;li&gt;Having an artifact that I can control, integrate with and automate around&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're using markdown plans or have a different approach to keeping agents on track for longer projects, I'd like to hear about it — DM me &lt;a href="https://www.linkedin.com/in/0xandrewshu/" rel="noopener noreferrer"&gt;on LinkedIn&lt;/a&gt; or &lt;a href="https://x.com/0xAndrewShu" rel="noopener noreferrer"&gt;on X&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I write weekly about vibe coding workflows, costs, and tools. Follow me here on Dev.to, or subscribe at &lt;a href="https://www.ashu.co" rel="noopener noreferrer"&gt;ashu.co&lt;/a&gt; for email updates.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>vibecoding</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
