DEV Community: Andrew Shu

How I use Obsidian across Claude Code and other AI tools

Andrew Shu — Sun, 21 Jun 2026 12:34:29 +0000

A few months ago as I worked to reduce my AI coding agents' hallucinations, I had a growing realization that I needed to improve my context management. Not just within a single session, but across all of them. I was learning things in one tool that would've been useful in another, and there was no mechanism to transfer that knowledge besides my own memory and copy-paste.

I eventually landed on using Obsidian as a central hub of long-term memories across my AI coding tools. This article describes the usability problems in the tools I was using, so you can understand how I think about it, and how I think about the non-technical problems that surround the software development lifecycle.

There has been wide discussion about this over the past few months. Some people call this a "memory architecture". Andrej Karpathy described how he uses Obsidian to create his LLM Knowledge Bases.

Since going deeper into vibe coding, I had been doing a blend of coding, researching tools, and general planning. Some of my AI conversations were incredibly useful, and I wanted to keep those insights somewhere I could actually use them again. I realized this was the "manual slinging of prompts" that Gene Kim and Steve Yegge suggest optimizing in the world of AI, except I wasn't just slinging prompts. I was slinging context.

I use multiple AI tools, across multiple interfaces (CLI, web, mobile, desktop), for widely different things: engineering, research, business planning, writing, daily life. It's important to use the best tool for the job. But they don't talk to one another. I was accumulating knowledge and didn't have a lightweight way to reuse it.

Sidebar: when I refer to context management in this article, I'm also thinking about practices for "personal note-taking" and "engineering documentation". Each has its function, but there's a lot of overlap. You can learn more about context engineering in the article I linked above about reducing hallucinations.

I wanted to feed AI output as input into the next chat session

Stated plainly: I wanted to pull key findings out of AI conversations, export long chats into clean summaries, enrich and transform artifacts across tools, and make all of it easily referenceable later.

Sometimes I want tools to share context. Other times I want them isolated so they don't go deleting random things. I also don't want my coding agent mixing notes from my daily life and my engineering so that unrelated context seeps into a code review. But I need a mechanism for me to control when they interact.

And I wanted whatever this mechanism was to be lightweight: drag-and-drop into a browser, easy editing on mobile, programmatic processing in the terminal.

Usability challenges when sharing context across AI tools

AI coding tools already solve this for code. Cursor and Claude Code read and write files. If the thing I want to remember is code, I'm set. But a lot of what I learn from AI sessions isn't code. It's research into new tooling or feature ideas, decisions about architecture, patterns I want to remember, preferences for how I want things structured. None of that belongs in a codebase.

Google Gemini's Deep Research export to Drive. This is genuinely cool. You run a deep research session and it lands in your Google Drive as a doc. But I needed something that wasn't locked to one AI's ecosystem. If I export to Drive, I still can't easily feed that into Claude Code or Cursor without copying it out, reformatting it, and pasting it somewhere else. Extra steps kill adoption.

The problem with chat search and history. Every AI tool has a chat history. None of them make it easy to find anything three weeks later. Search helps, but it doesn't solve curation. I don't want to search through 200 conversations. I want the 6 things worth keeping.

"Just use Claude Code or Cursor for everything." There's a camp of folks who are using AI coding tools for non-technical work like sales and marketing. So it is technically possible to use vibe coding tools for general purpose AI. But some tasks want a visual preview, or drag-and-drop, or a mobile screen while I'm running errands. Not everything is a terminal task.

MCP servers for Notion and JIRA. This is actually powerful. I've used MCP for read/write operations against team knowledge bases in work environments. For shared systems with structured data, it's great. But for quick personal notes, the interaction latency really feels heavyweight. I don't want to wait for an API round-trip to repeatedly export terminal output. It really adds up when you're moving quickly and iterating across multiple documents.

Why Obsidian specifically: old school files are lightweight

At some point I realized I was overcomplicating this. Files on a filesystem are easy for AI coding tools to read, easily shareable, and drag-and-drop compatible. I didn't have to wait for new plugins — or build them myself.

I remembered that a friend had been trying to get me to use Obsidian for years. I'd brushed it off as "another note-taking app." But when I actually looked at what it was (a local-first markdown editor with sync), I realized it was exactly the missing piece.

Local files mean universal compatibility. Every AI tool I use can read a markdown file. I can drag files into chat interfaces, point CLI tools at directories, or copy-paste from the editor. No export step, no API, no conversion. The file is the interface.

Sync across devices. Obsidian syncs markdown files between my laptop and my phone. I can review notes on mobile, jot something down while walking, and it's there when I sit down at my desk. This matters more than I expected, because I do a lot of thinking away from my computer, and being able to capture that in the same system I use for everything else let me replace Evernote as my note-taking tool. (Yes, I was still using Evernote in 2025!)

Isolation by default, sharing by choice. I was uncomfortable giving AI read/write access to my entire Google Drive. A filesystem-based approach lets me allowlist specific folders. Claude Desktop and Claude Code both support this. I control the boundary. My personal notes stay personal unless I deliberately share a specific folder.

Markdown is the lingua franca of AI tools. AI coding assistants already think in markdown. Plans, docs, summaries, rules files — it's all markdown. My notes are in the same format as the output. No translation layer. When I feed a note into an AI session, it doesn't need to parse anything special. When the AI produces something worth keeping, it's already in a format I can save directly.

Different aspects of my memory management workflow

As a long-term memory store alongside code and git history. There are things I learn that don't belong in a codebase or a commit message. Research about open source software and vendors. Interesting articles I read about new AI techniques that I'd like to try out. After working through something in ChatGPT or Gemini's web interface, I'll distill the key parts and save them where Claude Code, Cursor, or Codex can find them later.

Exporting good conversations. AI chats regularly surface something worth keeping — a novel approach, a well-structured analysis, a decision framework. When this happens, I often want to pull the key parts into a markdown note. This is the curation step that chat history search can't help with, especially when I want to mix and match the data with other sources (see "input/output hub" below).

Reusable prompts. Some prompts are too nuanced to keep in my head and too context-dependent to turn into a script. They live as markdown files I can grab when I need them. I have prompts for code review, for research synthesis, for writing in a specific style. They're not one-size-fits-all — they're the prompts that work for my workflow, refined over time.

As an input/output hub. I write notes, feed them to an AI tool, the tool enriches or transforms them, and I save the output back as a new version. The vault becomes a working memory that accumulates over time. My blogging workflow runs almost entirely this way: outlines become drafts, drafts get reviewed and iterated, and the whole history lives in the vault. I want to mix AI chat exports, research call notes, screenshots, and PDFs of research I want to use as input.

Full circle: shifting some gravity back to the local filesystem

After 15+ years of using cloud web apps for everything, I've come back to files on a local filesystem for my note-taking. Why did I want to write about this workflow as I post about coding in the modern world, especially considering I'm talking about a lot of non-technical work? There is a lot of non-technical work in a typical software engineer's week, and automating the work that surrounds engineering will help avoid bottlenecks from slowing down AI-driven work.

I'm still evolving how I use this. If you're juggling multiple AI tools and feel like you're losing good work to chat history graveyards, I'd be curious how you're handling it. What's your system for keeping the good stuff from AI conversations? Let me know; DM me on LinkedIn.

Originally published at ashu.co.

Why AI Coding Agents Hallucinate and How to Fix It

Andrew Shu — Sat, 23 May 2026 15:55:43 +0000

Most engineers I pair with can spot an AI coding agent hallucination when it happens. The AI invents an API that doesn't exist, reaches for a deprecated library, or rewrites a function you explicitly said to leave alone. They notice it, they're annoyed by it, and they move on.

What I've noticed is that fewer engineers have developed the instinct to do something about it. Not just fix the output, but fix the input. When I hallucinate the same way twice, something in my environment is wrong, and I should go find it.

I won't rehash the anatomy of a conversation or why context limits matter. Dex Horthy described the "dumb zone" concept well in his talk on solving hard problems in complex codebases: the gist being that LLM performance degrades as the context window fills, roughly past the 40% mark. Most of those concepts still hold. What I want to share instead is a feedback cycle that I've been trying to teach to anyone who'll listen, along with the specific things I've learned applying it.

Most of us react to AI hallucinations. Few investigate.

I've seen a range of reactions when the agent goes off the rails: some people are amused, some get annoyed. In an extreme case, an engineer I know will ctrl+c out of the conversation the moment it goes sideways, which is fair. If the agent is heading in the wrong direction, stopping it saves tokens and time.

But the reaction I see less often is the one that actually improves the agent for the next run: pausing to figure out why it happened, and making a change so it doesn't happen again.

This matters beyond just your own workflow. Hallucinations are one of the biggest reasons skeptical engineers stay skeptical. Every time a teammate watches the AI confidently invent a nonexistent function, it reinforces the "these tools aren't ready" narrative. When you can help someone understand that there are actual knobs and levers to reduce hallucinations (not eliminate, but reduce) you're shoring up someone's confidence in the whole approach.

And to be clear: hallucinations will never go away entirely. There's a strong theoretical argument that they're a fundamental property of how these models work, rooted in the way these models compute. The goal isn't zero hallucinations. It's fewer, and faster recovery when they happen.

My tip is simple: reflect on what actually happened

When I hit a hallucination that annoys me, I've trained myself to pause and run through a short checklist. It's not sophisticated, but the habit of doing it consistently is what makes it work.

What did the AI coding agent actually hallucinate?

This sounds obvious, but I find that most people skip the diagnosis. They see the wrong output and immediately re-prompt or restart the conversation.

Instead, I look through the conversation history and try to trace the reasoning. What was the agent trying to do? What did I want it to do? Where did those two things diverge? Sometimes it's a genuine misunderstanding of my instructions. Sometimes the agent pulled in context from a file I forgot was in scope. Sometimes it's just the model being confidently wrong about a library API, which tells me something different about what needs fixing.

That last case is worth understanding. Researchers at OpenAI published a paper showing that LLMs are trained in a way that rewards guessing over admitting uncertainty, like a student who fills in an answer on every exam question rather than leaving any blank. Knowing this changes how I investigate. I'm not looking for a bug. I'm looking for missing context that would have steered the guess in the right direction.

Where did the context come from?

This is the core of it. When you've identified a hallucination worth fixing, the question becomes: what information did the agent use to make that bad decision?

First, you need to familiarize yourself with the sources of context your agent is consuming. In Claude Code, that's CLAUDE.md files, project-level memory in ~/.claude/projects/, any files it's read in the current session, and the conversation history itself. In Cursor, it's your rules files, your hierarchy of AGENTS.md files scattered throughout your repository, indexed docs, and whatever files are open or referenced. Each tool has its own anatomy, and it's worth spending an afternoon just mapping out where your agent gets its information.

Once you know the anatomy, you can trace backwards from the hallucination. The agent thought it should use library X? Check if there's a stale reference in your rules file. It kept trying to write tests in a style you've moved away from? Maybe there's an old convention documented somewhere that you forgot to update. It invented an API endpoint? Maybe your schema docs are out of date, or maybe there are no schema docs and the agent is guessing.

When I can't figure it out on my own, I'll confront the agent directly. Something like: "I think you hallucinated here. I wanted you to do $X, but you did $Y. Why did this happen? What context did you use to make this decision, and what changes should I make to the rules files to avoid this in the future?"

Be direct, concrete, specific. The agent is often surprisingly good at self-diagnosing: it'll point to a rule it interpreted broadly, or a file it read that contained misleading information. Try out its suggested changes. They often work. Not always, but often enough that it's worth the 30 seconds to ask.

Is the "hallucination" a big enough issue to fix?

Before spending more time troubleshooting, I do a gut check. Maybe I disagreed with the approach, but is it actually wrong? Will the agent eventually get there?

Sidebar: when talking with a friend, he pointed out that "hallucination" isn't always the right term here. I'm actually referring to "all things you disagreed with an agent on". Hallucination is a distinct problem, but this advice applies to the broader set of problems. I use hallucination as a proxy term.

This is one of those things that's hard to learn from AI and easier to learn from serving as a manager. As a manager, you figure out that everyone does things differently. Sometimes an engineer takes a route you wouldn't have chosen, and the result is fine; it's often better than what you had in mind. I've started applying the same mentality to AI agents. If the approach is merely different from what I expected, I'll let it run for a while. The increased autonomy might be worth the extra time, even if the path isn't what I originally intended.

But if it's hallucinating facts, such as inventing APIs, fabricating function signatures, pulling in libraries that don't exist, that's a different category. That's when I stop and investigate.

Is your context too bloated? Smaller is better.

This one is counterintuitive, and I had to learn it the hard way. You'd think that giving the agent more instructions would make it smarter. More context about your database schemas, your API conventions, your testing preferences, your deployment pipeline... surely more information leads to better decisions?

In practice, there's a point where it flips. I went through a phase where I kept adding to my CLAUDE.md files every time something went wrong. Database schemas, model designs, interaction patterns, naming conventions. The file grew and grew (by month two it read less like documentation and more like a manifesto), and at some point the agent started getting worse. It was pulling in conflicting instructions, getting confused about which convention applied where, and occasionally just ignoring sections entirely.

Do you need this context every time? Or just some of the time? Can you extract this context, package it up as a Skill or Subagent and then trigger it when you need it?

This is where "context isolation" comes in. Instead of stuffing everything into the main rules file, you extract niche or complex use cases into separate mechanisms. The key insight is that subagents in particular have a stronger degree of context isolation, because the instructions don't load into the parent thread's context. You trigger them explicitly when you need them, and the rest of the time they're not competing for attention in the agent's working memory.

I ended up cutting my main CLAUDE.md from around 500 lines down to about 280 (it should probably be shorter), and moved the specialized stuff into skills that get invoked when relevant. The improvement was noticeable within a session or two.

Is the automatic memory working against you?

Most coding and chat tools manage a memory thread for you now. Claude Code keeps memories in ~/.claude/projects/$yourProject/memory/. Cursor has its own system. ChatGPT has its memory panel. The idea is good: the agent learns from your interactions and adapts over time.

The problem is that the agent doesn't always learn the right things. Andrej Karpathy flagged this a while ago on X: the model can obsess over a minor detail from 2 months ago that's completely irrelevant to what you're doing now. I've seen it happen in my own work: the agent latched onto a specific testing pattern I used once for a particular edge case, and kept invoking it on unrelated projects, long after it stopped applying.

In Claude Code, you can hunt for distracting memories in the memory directory and remove them, or ask Claude itself to clean up. In Claude's chat interface, you'll have to use the UI to inspect what it's stored and tell it what to change. (It doesn't let you edit memories directly, which is a minor annoyance.)

The broader point is that automatic memory is another source of context, and it can drift just like any other source. If your agent starts doing something odd that you can't trace to your rules files or the current conversation, check the memory. It might be holding onto something you'd rather it forgot.

Continuous learning with your AI, but you have to trigger it

Adding this trigger into my vibe coding repertoire has helped me improve the accuracy of my agents repeatedly. Even though AI work often feels a bit random, I keep a running list of problems. When something repeats itself, that's my signal that I might want to reflect and try to fix it. I'll pause and ask the agent to suggest changes to its own context that would prevent the issue in future runs. Then I test it on a controlled task to see if it sticks.

It's not a complicated habit, but it's a valuable one. Hallucinations cost you time, and that cost multiplies if you work on a team. If your teammates haven't run into a particular issue yet, you can save them time. And maybe they haven't developed this mental trigger yet. Sharing what you've fixed in your context files is one of the force-multiplying things you can do for a team adopting AI coding tools.

Most engineers I talk to treat their agent context like a set-and-forget thing. Write the AGENTS.md once, maybe tweak it when something goes really wrong. But context is a living system. It drifts, it bloats, it accumulates stale assumptions. The engineers who get the most out of these tools are the ones who treat context like code: something that needs regular review, refactoring, and testing.

This tweaking of the AGENTS.md files and other context is what I was referring to when I suggested looking at "concrete metrics of adoption" in my article measuring AI adoption as a manager.

If you've developed your own system for managing agent context, or if you've found patterns in how you debug hallucinations, I'd like to hear about it. DM me on LinkedIn.

Originally published at ashu.co.

Day 2 of vibe coding in production: what breaks when adoption scales

Andrew Shu — Tue, 05 May 2026 16:08:12 +0000

In Part 1 of this series, I enumerated a few obstacles for engineers taking vibe coding from side projects to production. Part 2 looked at AI usage from the manager's perspective: measuring adoption, understanding the gap, coaching to fill the gap. Both of those were "Day 1" problems: getting started, getting people on board, figuring out the tools.

This article focuses on what comes next: the vibe coding process problems that emerge after adoption is up. I'd call them "Day 2 problems". Let's say that AI adoption is up, and code is shipping faster. Then, things start breaking in places you didn't expect. My goal is to point to specific problems that you can observe and fix.

Engineers may feel these Day 2 problems as daily friction: PRs stuck in review, surprise token bills, coming back to AI-generated code that's unrecognizable.

Managers face problems more from team and process perspective: senior engineers stuck reviewing instead of building, budget surprises, "quality" meaning different things depending on who's asking. I'll walk through what I've seen break, and for each one, suggest an actionable starting point.

Let's start with the software development lifecycle. When engineers say "coding is only one part of shipping a product to customers", what do they mean?

Vibe coding code review: the first bottleneck you'll hit

Code review is often where AI's impact is immediately noticeable. It's a bottleneck on implementation (code generation), because most organizations require code reviews for quality and security reasons.

Code generation is much faster: this can easily rise to thousands of lines per day per engineer, if not orders of magnitude more. This means more commits, more PRs, more lines of code landing in review queues. Code review was already a chore for many engineers, so AI has compounded this problem.

I spoke with a tech lead at a large enterprise who said members of his team had started distrusting AI because of the quality of code coming through in PRs. Not because AI can't write decent code, but because engineers were submitting AI output without reviewing it themselves first. The PR became the first time anyone looked critically at what the agent produced.

Senior engineers face a practical question here: how much should they comb over each line of code the way they used to? When a PR is 5,000 lines of AI-generated code, a line-by-line review is time consuming. But skimming feels irresponsible.

So what can you do about it?

Think about your CI/CD pipeline and what parts of the review can be automated.

Luckily, AI code review tools like CodeRabbit, Greptile, Cursor's Bugbot, or Anthropic's Claude Code review catch a lot of the surface-level issues: style, obvious bugs, missing tests. These don't replace human review, but they reduce the surface area your senior engineers need to cover manually.

When using AI code review tools, engineers I've spoken to have reported good findings, but also a lot of false positives. It can be helpful to coach early career engineers to spot the false positives and explain why they're not a problem, or they're an acceptable risk.

Another idea, more from a process side: ask authors to review their own PRs before sending a review request to someone else. In other words, a pre-review review. "Ease of review" and "quality of code" is still a responsibility of the author: it reflects on their engineering skills, regardless of whether AI wrote the first pass. If your team doesn't have that norm yet, it's a good time to consider setting it.

The upstream AI coding bottleneck: issue tracking

Code review is the downstream constraint of generating more code. But there's an upstream one too in the planning phase: ticketing (e.g. in JIRA, Linear or GitHub Issues).

Upstream, we have the work that happens before anyone writes a line: ticket creation, design conversations, bug reproduction, requirements gathering, stakeholder alignment. None of that got faster when you adopted AI coding tools.

Vague tickets slow down development because engineers have to ask clarifying questions. Delays add up when clarifying takes multiple back-and-forths. Clear acceptance criteria, reproduction steps, and system context help engineers get work done faster.

And it's not just ticket creation. Think about communication and "paperwork" across the whole software development lifecycle. Status updates, stakeholder check-ins, handoff notes, design docs: all the connective tissue that keeps a team aligned. It's not what we traditionally think of when we think of accelerating software engineers who are vibe coding, but these are common sinks.

" width="800" height="261">

What can be done about issue tracking and requirements gathering?

Here are a few things I've been experimenting with: You can build a Cursor or Claude skill that pulls a ticket from your issue tracker (JIRA, Linear, whatever you use) via an MCP server and runs it through a series of quality checks. Does the ticket have a clear objective? Clear requirements? Business impact? Stakeholder named? If it's a bug, does it have steps to reproduce? If it's incomplete, the tool can automatically flag the gaps and notify the stakeholder. This takes an afternoon to set up and it pays for itself within the first sprint.

Before an engineer works on a ticket, you could take the description of a problem, and perform automated research on that ticket: in the codebase, in the database, or in a browser to explore the UI. If there is a description of a bug, the automation could verify that the description can be observed easily, and potentially take screenshots.

But beyond the initial ticket creation, how can you speed up feedback cycles by helping engineers to act on tickets, then reduce the paperwork?

You can create CLI tools / desktop applications that help engineers package up their progress (git commits), findings (command line output, screenshots, summaries) and attach them back to the ticket. It sounds small, but reducing the friction of sharing blockers and getting feedback keeps the pipeline moving. The gains from AI coding don't fully materialize if the non-coding parts of your process stay manual.

Vibe coding code quality: duplication and maintainability

AI ships duplicate code constantly. Instead of reusing existing modules or reaching for community packages, agents tend to reimplement. I've watched Claude Code write a date parsing utility from scratch in a codebase that already had three date parsing utilities (all also written by Claude Code in previous sessions). The agent didn't know they existed, because the context window didn't include them, and nobody had documented the pattern.

You need awareness and diligence to notice the duplicates and circle back to clean them up. And even then, I forget half the time.

This matters more than it might seem. Code duplication hurts runtime performance and build times. When there are duplicates, it's harder to fix bugs: you patch one copy and the other three still have the vulnerability. Security patches need to be applied in multiple places. The codebase quietly gets worse while the velocity numbers look great.

GitClear's 2025 study analyzed 211 million changed lines across repos from Google, Microsoft, Meta, and enterprises across 2020–2024. This covers the early AI adoption era. Code churn (new code revised within two weeks) roughly doubled from 3.1% to 5.7%. Copy/pasted code rose from 8.3% to 12.3%. Refactoring dropped from about 25% to under 10% of changed lines. The code ships faster, but doesn't age well.

Sidebar: I don't see code churn discussed much, but I'd love to see more research on potential impacts on maintainability. For folks vibe coding, seeing a "+2k / -2k lines of code" change is pretty common. What worries me is the impact of continuous churning of code (and tests) over time. Subtle bugfixes and "matured" code don't survive that kind of constant rewriting.

A few ideas on what to do for code quality:

In the code review section above, I mentioned CI/CD improvements for review. For maintainability specifically, look for tools that measure test coverage, code duplication, and code complexity at the repository level; not just at the PR level. PR reviews catch incremental issues, but as changes accumulate, you want a broader snapshot.

But it's not just 3rd party tools. Can you create hooks that run as a part of a code review check, helping engineers detect duplicate code? They're incredibly easy to build. For example: a Skill or Subagent that scans for existing implementations before the agent writes a new one. The question is when engineers run this so they don't forget. A git hook, or preprocessing before a PR is submitted works; the mechanism matters less than making it automatic.

OK, let's switch out of the "development" dimension of software, and talk about the "operational" dimensions of software.

Vibe coding quality: code-level vs customer-facing

Code maintainability is one kind of quality engineers care about. The other kind is customer-facing quality, and that's what keeps all of us employed.

A manager I interviewed at a Fortune 500 company distilled their AI adoption objectives into two themes: "velocity" and "quality." When I pressed on what "quality" meant, it was clear they meant product uptime and customer-facing incidents. Not code complexity. Not test coverage.

This is typically what executives mean by "quality." If your engineering dashboards show code metrics and leadership means production stability, you're measuring two different quality layers. Clarify what quality means: the disconnect is more common than you'd think.

The DORA 2024 report found that for every 25% increase in AI adoption, delivery stability decreased 7.2%. Their 2025 follow-up added nuance: "[AI] shines a light on what's working, accelerating what's already in motion, but it also surfaces what needs to change." Strong teams with good practices benefited from AI. Struggling teams faced greater challenges. If your delivery pipeline had cracks before AI, AI adoption widened them.

What you can do:

Use your issue tracking system alongside git to track quality of AI-assisted versus non-AI code. Git commits are increasingly labeled with AI tool footers (e.g., Co-Authored-By: Claude Opus 4.5). You could create a CI check requiring all commits to carry this footer; even manually-written ones should be explicitly "human." It's a small discipline, but it makes the data traceable.

In the issue tracker, find ways to link customer-reported issues to the candidate commits that had problems. Remember the blameless post-mortem: you're linking to the problematic code change, not to a specific person.

And have labelling, or other categorization that can differentiate between customer-reported issues and internally-found issues. You'll catch many more internal issues that customers may not care about, so it helps to keep explicitly customer-impacting issues as the priority.

Security: more code, more surface, smarter adversaries

Security shares some DNA with code quality, but it's a different domain and has much higher stakes.

Here are some things I think about from an engineer's perspective: More lines of code with less human understanding means the attack surface is evolving. AI agents act on your behalf with your permissions and credentials. Hallucinations in development environments can cause real damage, not just in production. And vulnerabilities ship faster than before.

Research confirms that LLM-generated code will include vulnerabilities. Tihanyi et al. (2024) analyzed 331,000 C programs across 9 LLMs (e.g. GPT-4o-mini, Gemini Pro, Code Llama, and others) and found 62%+ contained vulnerabilities, with minimal differences between models. The problem isn't a bad model. It's that code generation at scale produces vulnerabilities at scale. It might be better than humans, but if code gen is accelerating, then vulnerabilities will scale linearly too.

And from the other side, the window from vulnerability disclosure to active exploitation is compressing from 8.5 to 5 days on average. AI-assisted cyber attacks rose 72% in 2025. More code, faster attackers, cheaper discovery. It's a scaling problem, not a skill problem.

What can you do along the security front?

I'd start by adding security monitoring to your CI/CD. Linting, SAST, supply chain scanning, secrets scanning: tools like Semgrep or Snyk, or open source alternatives. Code review bots include security checks as well. And the standard practices still apply: periodic auditing, security considerations early in project planning, security checks woven into the review process. Defense in depth, with the "depth" updated for a world where agents generate code faster than humans can review it.

I would also start with updating your "least privilege" access controls for the agentic world. To get work done, I have to grant agents control of certain tools and infrastructure, and I always worry about how much unintended damage that could cause.

Sidebar: I find that "isolation" is a theme I think a lot about when it comes to improving AI security. How do you isolate your AI agent from your secrets (but give it some access)? From destroying files in your filesystem? From other computers in your network? I think that techniques like containerization (docker), jails, firewalling, splitting identities/credential access into more granular chunks, will be fruitful here.

Humans are better trained than agents at knowing which lines not to cross, so it makes sense to scope agents more tightly than human developers. Agents are also more numerous and shorter-lived. Think about how to generate lightweight, temporary permissions rather than sharing your personal credentials.

A concrete example of agents doing something sketchy: your .env file getting read by an agent and shipped up in an AI-generated bug report, or used in an unintended API call. The kind of thing you only laugh about months later.

Another example: an agent inheriting your admin role, hallucinating, and taking a destructive action with permissions it should never have had.

Use vaults and password managers to reduce agent access. Add degrees of isolation between "write access" and "read-only access." Isolate production from development environments. Wrap binaries and filesystem folders in containers, jails, or VMs to constrain blast radius.

This by no means is a complete list. It's merely highlighting some of the security risks that AI is introducing and being discussed in engineering communities. And suggesting some starting points for how we can use and extend existing tooling.

What I didn't cover (and why it matters)

There are a few topics I can't go deep on here, but they're worth flagging.

Design documents are evolving. In order to write more code thoughtfully, teams are producing more design docs. But the tech lead I mentioned earlier noticed they're appearing more generic: the same structure, the same level of detail, the same boilerplate that suggests an AI wrote the first draft and nobody pushed it further. ("Slop" is a useful description here, not to be disparaging to the authors. But to describe the "averaging" effect of LLM generation of prose.) Design docs are supposed to force you to think through the hard parts before coding. If AI is writing them, and humans are rubber-stamping them, we've lost the "thinking" and "intentionality" of designing solutions that actually fit the problem.

Operating code in production is another one, but I've covered it in Part 1 of this series. As you develop more code, you have to maintain it: deploy it, monitor it, troubleshoot it, patch it. How to enable your repositories and infrastructure to let AI help with operations safely is a separate conversation, and it's one I'm spending a lot of time on.

Where this leaves us

I've been thinking a lot about what a real measurement layer for AI coding should look like, and what kinds of insights it should surface. More on that soon.

Underneath all of these practices, there's a thread that keeps surfacing in every conversation I have: continuous learning. It's a classic practice: Agile retrospectives, Toyota's production system, are always good practice. But it feels newly urgent when the tools and practices are changing this fast. Engineers and managers can't keep up with the rate at which research keeps appearing, and intentional practice helps.

I'm collecting stories from engineers and managers working through the post-adoption phase. If you've hit the review bottleneck, had concerns about code quality and security, or if you've found something that works, I'd like to hear from you. You can find me on LinkedIn or X.

Originally published at ashu.co.

Measuring AI coding adoption: What I learned as a manager

Andrew Shu — Mon, 27 Apr 2026 19:38:16 +0000

Let's say you're an engineering manager, and you're participating in an organization-wide campaign to increase the adoption of AI. The initial goal is to get people to use the AI tools.

That's easy: hand out licenses to your team for Cursor, Claude, or GPT. Congratulations! What's next?

Maybe you're among the early adopters in your company, or maybe the initiative originated elsewhere. Senior leaders across companies are investing in AI because they see the potential of greater velocity and more expansive creativity. But they're also responding to pressure from their stakeholders: board members asking for slides about AI strategy (both internal and customer adoption), investors who expect adoption to keep up with current startup pace, competitors making the same bets. They need to show these investments are paying off.

Meanwhile, these AI tools cost real money. I've spoken to startups who are spending $1,500 / month / engineer as they seek to understand the new paradigms of coding and insights for building leaner. This is a major step above startups who were previously spending $300 - $500 / month / engineer. Even for enterprises who spend $5,000 / month / engineer, adding $1,500 / month would be a big leap in investment.

In March 2026, Jensen Huang (CEO of Nvidia) said that senior Nvidia engineers earning > $500k in salary should be consuming well over $250k of tokens per year. This sort of paradigm shift hasn't rolled out broadly in the industry. But talking to folks on different teams and seeing my own usage, I don't think it would be difficult for engineers to spend $1k / month ($12k / year).

For that kind of cost, it's important to know what value your team is getting and to optimize the organization's usage to make the most out of that money. But what's the outcome? Should engineers max out tokens? Or max out lines of code? It's a new paradigm, and you're trying to make sense of it.

In Part 1 of this series, I wrote about the rough edges of vibe coding from the engineer's perspective: things in production that slow down engineers when you move AI coding from side projects to production.

This article tells the story from the managers' perspective, based on conversations with engineering managers and my own experience. What does it actually take to drive real AI adoption on an engineering team?

How do you actually measure adoption of AI coding?

Managers I've spoken to say that they're being measured by senior leadership based on the number of licenses they've distributed (or not distributed) to their team. They're counting PRs and lines of code, and doing qualitative surveys of team members to gauge AI adoption. But these don't tell you whether adoption is meaningful to the business and customer.

Research suggests that bulk license distribution won't lead to actual usage. Gartner found that often fewer than a third of purchased licenses see active use after several months. The 2025 Stack Overflow Developer Survey tells a similar story: 81% of professional developers surveyed are using (or are planning to use) AI, but 41.4% of professional developers believed that AI struggled with complex tasks. (Note: that 41.4% level dropped since the previous year, but is still high.)

Even with usage, I've found that volume of AI tool utilization and the variety of techniques in regular use is uneven across organizations and even within a single team. So even with licenses distributed, uneven training is a distinct challenge.

So how can we assess adoption?

As a practical experiment, try speaking with a sample of the team in your 1:1s about how they use AI, and perhaps collaborate with them on a project. It's more effective to see how your team uses AI (e.g., during a demo, presentation, or pair coding) than to verbally poll them. That way you can see how they're using AI, rather than a simple yes/no poll.

Are there more concrete metrics of adoption?

I also find it helpful to use a quick but highly flawed (and highly contentious) metric: token spend. This is the theoretical cost of the tokens a person uses. It's often subsidized under a monthly license, or an enterprise agreement.

For token cost, here's a rough rule of thumb for token spend: state-of-the-art models like Claude Opus or GPT-5 can easily cost $100/day of heavy use (of tracked token costs, not necessarily real dollar spent). For folks that aren't past the $20/month base subscription tier, they're likely not yet "vibe coding". That's not bad, it's simply a metric to ballpark usage volume.

Maximizing token usage is not an end in and of itself. Token spend is a cheap, weak signal, and can be gamed. But it's available right now without additional tooling. (Note: it's also wasteful to optimize for high-cost workloads). But I've found it helpful to ballpark my own style of adoption from token costs. When I'm in a totally different ballpark from someone else, it's a helpful signal to trigger me to ask why.

Here are a few other metrics worth considering, other than token costs:

Lines of Code. This is another deeply flawed metric (and famously so), but it's a useful factor to consider because there are real implications for code reviews as well. When PRs are changing from hundreds of lines to thousands or tens of thousands, this hints at changing AI adoption styles. But you should also think about the quality of code reviews.

Maturity of AI configs, context and tooling (e.g. agents.md). These are typically markdown files shared between engineers in the repository. Or maybe it exists as instructions / documentation in your team's wiki / knowledge base (to find relevant docs, search for "Claude" / "Cursor" / "Codex").

Maturity of AI configs is perhaps the most interesting sign of usage, because it shows AI being used and customized. Is your team using Skills, Subagents, Agent Teams, Automation like Routines? How often are these configurations being customized? These configurations fall out of date, so teams regularly using AI are likely to be measuring and tuning their AI configurations and documentation.

Understand your team's perspectives on AI adoption before acting

To return to the topic of polling and observing your team in action: your team may have legitimate concerns blocking adoption. There are legitimate concerns about security and code quality, and the benefits aren't evenly distributed across experience levels (more on both below).

Some senior or SRE engineers have told me that their work involves precision, complexity, or high risk, and AI is an unacceptable risk. Or maybe the team is too busy to try out that new tool, and they need someone to be brave and test it out first in your environment.

Before pushing to increase adoption or spend, talk to the team.

This is the step that is easiest to skip over. It's tempting to see low utilization and lean into the instinct to push harder: more training, more encouragement, more tooling. But the question is about workflow, mindset, and preferences, and the answer will differ per team.

Sidebar: It's also valuable for you to try out different AI workflows. Some of the managers I've spoken to are themselves skeptical about AI. It's valuable to suspend your disbelief for a few days and experiment. Have some fun with it; play a bit like you got some shiny new gear and you can build whatever silly thing you've been meaning to for a while. Greenfield projects, CLI scripts, or small bugs are great starting points. Try out a markdown plan, and play around with easy parallelism.

How you engage with your team depends on where you sit. As a line manager, you're close enough for 1:1s and small team discussions. Ask to pair code directly. Ask people what's working and not working in their CLAUDE.md or .cursor/rules/*. Ask what they reverted last week due to AI. The goal is to identify specific concerns, pushback, and knowledge gaps: not to audit or blame.

If you're a senior manager, you may need to shift organizational momentum more broadly: clear communication about organizational objectives, setting metrics to measure progress, funding training (both money and time), or setting explicit norms that AI tool usage is supported and expected, and reflecting on which metrics helped and didn't. This is a different communication problem than a 1:1.

Listen for objections and disagreements: they're valuable signals. Engineers who say "AI is inconsistent" aren't wrong. I've measured token consumption that varied by 2x session to session for identical work. Harnesses regress. (By harness, I mean the Claude Code harness that wraps the Claude models.) Prompts that worked last week hallucinate this week. Take these concerns seriously.

The part nobody budgets for: coaching for AI coding

After I recognized these signals, I started testing them out on my teams, among friends, and with strangers I met. I realized that the gap often wasn't tools or licenses: it was listening, persuasion and coaching.

I spoke with a number of skeptics, but there were also a lot of folks who wanted to vibe code more. Quite often, they didn't have the time to keep up with the firehose of new information. And another frequent concern was that they were worried about vibe coding in production environments, or in local environments with permission systems (e.g. credentials for AWS, SSH keys to machines set up for thoughtful humans). Specifically, I'm referring to the adoption of "hands-free" vibe coding and not AI coding where engineers are manipulating code or commands.

Moreover, adoption wasn't uniform. Some engineers had been happily vibe coding, and I spoke to them to see what worked. There was a spectrum of skeptics and aficionados, and the information needed help to spread faster. Research published in Science (Daniotti et al., 2026) found that AI productivity gains (more commits, broader library use, exploration of new functionality) accrued mostly to experienced developers, with early-career engineers showing no statistically significant benefit.

Other studies, like Cui et al. (2026), found the opposite in controlled corporate settings: less experienced developers benefited more. The takeaway for managers isn't that one study is right and the other wrong: it's that the gains aren't uniform, and that deserves special consideration when you're planning training, setting expectations, and measuring progress.

At the time, I was specifically interested in how to do DevOps / operational / maintenance work safely. So I thought through what kinds of tedious tasks people would like to do less of, filtered out risky operations, and then built starter configuration files, subagents, and shell scripts. (I'll elaborate in my next post.)

After posting about it, and sharing it in team meetings, I realized it required more active persuasion (as opposed to passive announcement). So, as one does, I switched from optional knowledge-sharing sessions to more proactive 1:1s, and team calls.

I also tried another playbook: to build confidence among skeptics, I also troubleshot production issues in parallel with other engineers troubleshooting those same issues. I constrained myself to use mostly-autonomous AI agents (equipped with a context system). I accumulated 2-3 concrete examples where AI can help engineers step into unfamiliar parts of the codebase to troubleshoot an on-call situation. This helped spark ideas for methods and techniques, and overcame some mental blocks.

Two key things that helped out when conducting knowledge sharing sessions in my most recent campaign: hinting at how it could be used more safely in production, and teasing out specific concerns that engineers had but hadn't voiced yet.

When you see lightbulbs go off, it's incredibly rewarding. But the work to get there is often invisible in your current metrics. And it requires changes beyond conversation: to the codebase, the tooling environment, and how your team works day-to-day. As mentioned above, I'll cover the concrete technical stuff: agent configuration files, sandboxed execution, CI pipelines, and workflow changes in my next post.

What I'd do this week to improve adoption

To distill my ideas and anecdotes above, here are 3 things I'd suggest that could be done as small projects and experiments you could do in a week:

Hunt for adoption through quantitative and qualitative signals. Look at the numbers you already have: token spend per engineer, AI-assisted PR rates if your platform tracks them, git commit footers that show AI assistance. Then pair those with qualitative input from team retros, 1:1s, or a short survey. Neither signal type is sufficient alone. Token spend sheds light on how deeply your team is using the tools. Conversations tell you how and why (or why not). The combination replaces guesswork with a baseline you can act on.

Tease out the obstacles getting in the way. You're not going to get far with "are you using AI?" Instead, surface specifics through whatever channel fits your team: 1:1s, retrospectives, Slack threads, brown bag sessions, pair coding sessions. What tasks are they using AI for? Where did it break down? What would make it more useful? The goal is to map the gap between where your team is and where productive AI usage actually lives, then address the top blockers, whether those are configuration, training, trust, or tooling.

Pick one process to automate. Show concrete examples of its benefits. Don't try to overhaul everything. And remember, it's not just code generation. I gave the example of troubleshooting production errors. It could also be: a planning template, a test generation step, a deployment checklist, an observability alert summary, or updating JIRA tickets. Isolated wins build confidence, both yours and the team's. They also give you concrete stories to share with leadership when they ask for evidence that the investment is working, and can be cross-pollinated across the wider organization.

The gains from AI are real, but there will be new problems

When coaching resonates and the adoption picks up, the individuals on your team will be equipped with new career skills and your team will ship faster. Engineers tackle problems they would have avoided before.

But adoption was only the first thing you needed to measure. Once your team is using AI coding tools for real, a new set of problems surfaces: bottlenecks that shift in unexpected directions, quality concerns that span multiple dimensions, and a measurement layer that hasn't caught up yet. I'll dive into some of the technical areas there in Part 3.

In the meantime, I'm collecting stories from engineering managers working through AI adoption. If you're in the middle of it, I'd like to hear from you. What metrics are you using? What pushback surprised you? Reach out on LinkedIn — these conversations are the most valuable part of this work. I'm happy to swap tips and ideas!

Vibe Coding in Production: What's Holding Us Back?

Andrew Shu — Tue, 07 Apr 2026 16:07:05 +0000

Vibe coding techniques need to be adapted when you work on production applications with AI. I walk through some challenges and solutions that I've found helpful on real projects.

I'm going to share some experiences from a few months ago, about how I expanded the scope of my use of agents from vibe coded apps to working on real world problems in production.

I had been coding with AI agents for a while now: greenfield scripts, prototypes, and features I could build and throw away. Early on in this experimentation, I set my sights on building tools and practices for safely using AI in production. I knew I had to maintain and operate the code I developed. So as I explored AI by building isolated and greenfield code, I made mental notes of techniques that wouldn't work and those that I could bring to a production infrastructure.

There are numerous articles and posts describing techniques for vibe coding well. But there isn't enough documentation describing the practices around customizing your repository and infrastructure to take advantage of your AI agents on real infrastructure and workloads.

Daghan Altas, a former Cisco Meraki colleague, phrased it well: what's the point of a 10x productivity boost if you can't operate and maintain the thing you built any faster? That reframed the question for me. Not "is AI fast?"; obviously it's fast. But: what specifically is holding me back?

Here's what I ran into when I applied vibe coding techniques against production infrastructure and workloads. And this is how I've updated my configurations and techniques to address these issues.

AI codes quickly, but what about troubleshooting and testing?

AI is great at implementation, but there's so much surrounding the act of writing code.

Here's a concrete example: I was building a prototype that needed to normalize messy data from multiple API and database sources. The numbers kept being wrong. I pointed Claude Code at the problem, and it churned for an hour, trying different parsing strategies, refactoring the aggregation logic, adding fallback handlers.

The fix turned out to be surprisingly simple: the logging wasn't capturing everything it needed to. The agent was trusting the logs at face value and never questioned whether the data was complete. An hour of sophisticated troubleshooting on a problem that needed five minutes of "wait, do we have enough logs to capture the symptom of the problem?"

The same dynamic plays out with writing unit tests. So I needed to think more broadly about this.

Implementation is often not the bottleneck. The bottleneck is everything around the implementation. After implementation, that would include: verifying the code does what you think it does, troubleshooting when it doesn't, understanding what already exists so you don't reinvent it, and making sure the architecture holds up next month.

What I've started doing: I built subagents for the two patterns that burned the most time when I was catching and fixing AI-written bugs.

Firstly, a "Skeptical Testing Subagent" that scrutinizes test suites: checking for duplicates, testing meaningfulness, flagging assertions that don't actually prove anything.
And secondly, a "Skeptical Troubleshooting Subagent" that focuses on production logs and data integrity before jumping into code changes. Both are early, but they've already caught things I would have missed.

When I say "skeptical" you can translate it to the term "adversarial", which is what folks in the AI community use more frequently. People have talked about using "adversarial agents" to review code, and how these agents "think differently" than an agent told to "write code". My testing and troubleshooting subagents solve specific code review and production log review problems that I've encountered, in a more narrow and specific context.

Fear slows us down when we vibe code production apps

One of the things that accelerates vibe coding is accepting AI suggestions quickly (specifically, auto-approving the shell commands the agent wants to run). But many of those suggestions are commands run inside a shell with superadmin privileges and access to the internet. Even when the AI isn't doing anything malicious, I worry about a stray rm -rf or a drop database or a terraform apply that destroys a folder, a Google Drive, an RDS instance, a DNS record. Nightmares abound.

This isn't hypothetical: Alexey Grigorev accidentally dropped his production RDS database while using AI tools and wrote up the full post-mortem. Amazon has called for new safeguards and review processes after AI-assisted errors in production. Research from Snyk has documented AI coding tools hallucinating entire package names that don't exist, and attackers registering those packages to exploit the gap.

Hallucinations in a local sandbox are an inconvenience. In production, they're a late night page and an embarrassing post-mortem.

What I've started doing: here are some examples of ways I've improved the safety of my work.

Instead of executing unsafe bash commands, ask AI to write a script you can review. I find AI helpful for sysadmin/SRE work, but I have to monitor it closely — no background agents here. Watching commands scroll by is risky, so I ask the agent to write them into a script I can review first. Then as a bonus, I get a script that is reusable.
Extracting repeat database or log queries into a script. When I was troubleshooting customer issues, I often ran a few Postgres database queries or fetched a few logs related to some Lambdas. This was tedious, but I also didn't want AI to be running PG or AWS Lambda commands by itself. So I wrote scripts like fetch_customer_events_pg.ts <customer-alias> <event-type>, or fetch_customer_logs.ts <customer-alias> --start <start_time> --end <end_time>
In addition to scripts, you can do similar things by codifying repetitive tasks as Skills and Subagents. Skills are available in Codex, Cursor, and Claude Code. Subagents are also available in Codex, Cursor, and Claude Code.

These are some of the techniques I've used. I might elaborate on this topic in a future post. If you're interested in chatting about how I do this, DM me on LinkedIn or on X.

I ❤️ docs, but AI might just love it more (as "context")

I've hopped onto screen shares with other engineers where one of us will spot a hallucination go by mid-session. There's a brief moment of annoyance or concern, and then we keep going. Most of the time, we just let the agent continue. I've done it myself; we have more pressing tasks to finish. You see the wrong thing, you wince, and you move on because you're in flow.

This is a bigger source of inefficiency than it appears. Not everyone realizes that many hallucination patterns can be fixed with better context. (By context, I'm basically talking about code, docs and additional markdown files.) And those who do know often haven't had the time to pay attention to how context is actually structured across their tools. There's a growing stack of context layers: AGENTS.md files, skills, subagents, team-shared vs. individual context, memory architecture, code indexers, connections to databases and wikis and issue trackers.

But here's the pattern I see most often: someone sets up an AGENTS.md or a .cursorrules file when they first adopt a tool, and then never touches it again. Six weeks later the agent is hallucinating patterns you deprecated a month ago, suggesting libraries you've already replaced. Or maybe your automatic memory.md is outdated and no longer reflects your code's reality. The agent's context drifts from reality a little more every week, and the hallucinations compound.

This is the source of a lot of churn. When the agent doesn't know what already exists in the codebase, it reinvents. When it doesn't know your architectural patterns, it improvises. When it doesn't know what you deprecated last sprint, it resurrects it.

What I've started doing: I treat documentation as infrastructure. When agents hallucinate or reinvent something that already exists, I update the docs so it doesn't happen again. I use MCP servers to push context to my knowledge base. I run Claude Code's /context command mid-session to see how the 200K token window is being consumed, and it often exposes wasteful allocation I wouldn't have caught otherwise. It's a small amount of effort that compounds over time. If you're going to obsess over something, context hygiene has the best return on neuroticism I've found so far.

Another technique I use is to keep a plans/ folder and a docs/ folder for architecture decisions and system patterns that agents should know before generating. Markdown plan files are still the single best thing I've done for my workflow, and the docs folder is a great supplement. Recently, Andrej Karpathy posted the importance of "LLM Knowledge Bases". I also use Obsidian in a similar way, but I find in-repo docs more pragmatic for keeping context closer to the code.

You can also layer on custom instructions to Codex, Cursor, and Claude Code to customize your harness to behave differently beyond what your team has done.

AI Token anxiety is real: what if I run out of budget?

To borrow a friend's metaphor: inference tokens are like Spice in Dune. They're a substance that augments your abilities, that once taken, you can't live without. And a scarce resource that requires extraordinary effort to accumulate.

I heard this from a tech lead at a large enterprise. There are technically token budgets per engineer, but they're not being enforced. The current objective is to increase adoption, so that's not an issue. But this engineer worried about what happens when it does get enforced.

The anxiety comes from multiple directions. There's the worry about rationing: how do you make sure you have enough tokens to hit your deadlines? And if an inference provider goes down mid-sprint, you're stuck without tokens or scrambling to switch to an unconfigured, unfamiliar tool.

Then we get to the opacity of pricing. I measured my Claude Code sessions over a week and found that 2 out of 5 sessions burned tokens at 2x the normal rate, with no obvious change in my behavior. In Theo's YouTube video on Claude Code's recent (March 2026) capacity reduction, his conclusion is the same as mine.

Beyond capacity allowances, a feature change from the AI labs can silently double your token costs. There's a pattern emerging across providers in early 2026: models getting more verbose, spawning sub-agents for simple tasks, and nobody has a baseline to tell whether the amount of AI agent-hours they can use per month was reduced 10% or 50%.

I've written about this a lot, but I don't think we need to over-rotate on token reduction. I've found it helpful just to learn how token limits are enforced, how they're being used. This helps me be mindful of costs. The first thing I'd recommend is to understand the tools' pricing structure and learn how they enforce token limits so I can make the most out of the tokens they provide (and subsidize). There are also open source tools like ccusage that track token usage. You can also try vibe coding your own!

Questions I ask myself to improve my use of AI agents on production

I've found a number of techniques that have improved my workflow, but I have so many open questions still! I find that thinking about these questions help me find where real improvements can be made. These questions don't require adding tools to immediately seek gains. I'll share them with you, and hopefully they help you reflect on your own engineering work.

Are my productivity gains real or not? What did I do in the last week with AI? This is a question I ask myself regularly, because productivity gains may feel great but are actually an illusion. METR ran a randomized controlled trial where experienced developers were 19% slower with AI tools, while believing they were 20% faster. This study ran in July 2025, before the major model improvements in November 2025. Nonetheless, that perception gap is a reminder that intuition alone isn't enough.

Where am I losing time? How do I increase AI's autonomy? This goes to my work around adversarial agents for scrutinizing code, tests, and logs. I've found that AI often churns out meaningless work, or takes shortcuts. These are signs the agent isn't truly autonomous, so I need to troubleshoot how to increase the autonomy.

How much does fear cost me? I monitor what my agents are doing more than I probably need to. Are my colleagues even familiar with which bash commands are risky? How much collective time gets lost to hovering, second-guessing, or just not knowing whether it's safe to let the agent run? Reducing risk here feels like unlocked velocity.

What is my process to reflect on and improve the effectiveness of my agents? Right now, most of us are vibe coding and vibe evaluating. We finish a session, we feel like it went well or it didn't, and we move on. I think there's value in building a habit of structured reflection: what worked, what didn't, what would I change? And sharing those reflections across a team, not just keeping them in your own head. And how do I share what I learn with colleagues? There's something from Toyota's production system and from Agile retrospectives that applies here: the discipline of continuous reflection and improvement.

Vibe coding deserves more than vibe evaluation

The FOMO around AI coding skills is real. There are new tools every week, new techniques, new claims about what's possible. Most of us are figuring it out on hunches: not fully able to keep up, not clicking into AI news articles to read them in full, not totally understanding the tradeoffs, but feeling the paradigm shift happening underneath us.

I think that's fine because we're early in the technology adoption. But I also think we can do better than vibes. The engineers I see getting the most value aren't the ones with the most expensive tools or the most aggressive token spend. They're the ones building habits of honest reflection: what did I ship versus what did I generate? Where did I invest versus where did I waste time? What would I do differently next session?

Everything I've talked about here is from the engineer's perspective: what I can see, what I can measure, what I can control. But I've been having conversations with engineering managers too, and they're wrestling with a different version of the same question: how do you know your team's AI investment is paying off when you can't see inside any of these tools? That's a different problem with different constraints. More on that soon.

What are you working through? What's the question you keep asking yourself about your AI workflow? I'd genuinely like to hear: if you're wrestling with the same things, reach out on LinkedIn.

Originally published at ashu.co.

Is Claude Code 5x Cheaper Than Cursor? I Ran 12 Experiments to Find Out

Andrew Shu — Tue, 31 Mar 2026 16:52:02 +0000

In Part 1 of this series, I noticed something strange while using Claude Code's Max 20x plan: it was the same $200/month as Cursor Ultra, doing the same work, but my Claude Code utilization was stuck at 16% while I had been burning through Cursor's token budget. In Part 2, I figured out how to push past 50% utilization with parallel Claude Code agents.

Given that I could use so many more Sonnet/Opus tokens on Claude Code, my first instinct was: "is Claude Code actually 5x cheaper than Cursor?"

And then I realized you can't compare them apples to apples. I couldn't ask: at the same price, how much token capacity does each tool actually give you? Their pricing models are enforced incredibly differently (see Part 1), and Cursor has 2 pools of tokens (API, and "Auto + Composer").

So instead, I came up with a metric — "agent-hours" — to serve as a proxy: given each plan's token capacity, how many hours of agents can I run per month?

I had some hunches, but I couldn't be sure they would hold up. So, I did what any engineer with too much curiosity would do: I designed an experiment to find out.

A few key caveats before we dive in:

This is a loosely controlled experiment, not a rigorous benchmark. The findings are directional: order of magnitude, not precise. Readings fluctuated significantly day by day, and the product/capacity changed. But this reflects real life.
I'm using Individual, not Team plans, focusing on $200/month tiers.
Things change rapidly in the world of vibe coding token use, models, and costs. The 1M context window for Opus 4.6 dropped for Claude Code and then Cursor. Cursor dropped Composer 2.0, an upgrade from Composer 1.5. Claude session limits were updated in between experiments. I normalized for differing "2x limits" promotions in Claude Code and Codex.

To return to the article: my intuition suggested there was a notable difference in price, and I wanted to quantify. I learned a considerable amount digging into pricing, and this helps me understand how to make the most out of the different models.

I hope this token and tool pricing analysis helps (and interests) you as much as it did me. It's a long article, but given the volatility of the experiment, I figured it would help for me to show you all the messy details and how I think about it.

The headline: Claude Code delivers ~5x more capacity per dollar

Here's the summary. At $200/month on individual plans:

Tool + Plan	Agent-Hours / Month	vs. Cursor Ultra
Cursor Ultra ($200)	~138 hours	1x
Codex Pro ($200)	~220 hours	~1.6x
Claude Code Max 20x ($200)	~678 hours	~4.9x

So at the same $200/month, Claude Code gives you ~5x more room to work than Cursor.

Important context before we get further. This measures capacity per month (for my workload + codebase): how many agent-hours your subscription delivers if you use it fully. It does not measure work quality, code correctness, or features completed. You shouldn't read it as "5x cheaper" because that assumes you can actually use all that capacity.

But this is too simplistic of a view, because there are greater nuances to the pricing. We should next look at how Cursor's pricing works, because this makes the story considerably more interesting.

Cursor Ultra's pricing structure: two pools of different tokens

Before we go deeper into the comparison, we need to understand Cursor's pricing structure. Cursor Ultra doesn't give you one big pool of tokens. It gives you two, and they're dramatically different in size and model characteristics.

The first pool is API credits, which cover SOTA models: "state of the art" frontier models like Opus 4.6, Sonnet 4.6, and GPT-5.4 (at the time of publishing). These are usually the models scoring highest on benchmarks, and also the most expensive models available.
The second pool is "Auto+Composer" credits, which cover Cursor's proprietary Composer models — faster, cheaper models that Cursor has built and optimized for code generation.

When you upgrade to Ultra expecting unlimited access to the best models available, what you actually get is a small allocation of frontier model credits and a much larger allocation of Composer credits. Here's how the two pools break down:

Cursor Ultra Usage Pool	Estimated Agent-Hours	% of total
API credits (We use Opus 4.6, both 200k and 1M)	~18 hours	13%
Auto + Composer credits	~120 hours	87%
Total	~138 hours	100%

Note: API agent-hours depend on the price of the model you choose. Opus 4.6 is one of the most expensive options; a cheaper SOTA model would stretch further.

That ~18 agent-hours of frontier model is a key factor to consider when you use Cursor. When I ran experiments using only Opus 4.6 on Cursor, the API pool burned through fast. When I ran experiments using Composer models, the Composer pool lasted roughly 7–8x longer.

And this is a key finding: Cursor incentivizes you to spend most of your time using the faster Composer 2 model. This seems to be a deliberate design choice, and it's a reasonable one. The combined 5x headline reflects what happens when you use Composer for most of your work, which is how Cursor intends for you to use it. If you default to frontier models, the gap is far wider.

This explains a frustration I've seen across forums and from other engineers: you upgrade to Cursor Ultra and exclusively use SOTA models, only to find out that you hit API credits faster than expected.

Let's see what this looks like in numbers. We strip out the generous "Auto + Composer" tier and exclusively use SOTA models. (Again, not the optimal use of Cursor.)

Tool + Plan	Agent-Hours / Month (SOTA only)	vs. Cursor (SOTA)
Cursor Ultra — API only ($200)	~18 hours	1x
Codex Pro ($200)	~220 hours	~12x
Claude Code Max 20x ($200)	~678 hours	~38x

That's a 38x difference in agent-hours (ignoring the vast amount of Composer 2 tokens that Cursor provides). For engineers exclusively focused on frontier model access for complex reasoning (Opus, GPT, Gemini) and comparing Claude Code to Cursor, this is the source of their surprise.

Even this is too simplistic; I think we need to dive deeper.

But capacity isn't velocity: Composer 2 is genuinely fast

Here's where the story gets more interesting than "Tool A gives you more." I tracked project completions across all 12 experiments, and the velocity data tells a different story than the capacity data.

Here's how long the models took to complete Project 1, which involved a bulk rename across the project:

Next, let's look at Project 2, which involved cutting out a set of features:

I think the first 2 charts provide a much better signal, because they compare 2 larger, more complex refactor projects on the same scope.

Caveat: I'm going to share the following chart, even though it's flawed. After the first 2 large projects, I queued up many small projects like "research X and then build a small full stack feature".

But nonetheless, I wanted to share the different feeling of speed as I worked with different models:

In all charts, the Composer models were at least 2x faster than the other models. Since it finished the first 2 larger projects, it was able to race ahead to do all the small projects at the end. If you have a mix of small/large projects, Composer's lead may pull it ahead.

You might notice the Opus 4.6 200k/1M models not showing a clear trend. The sample size was small, so the fluctuation is a bit noisy.

So, speed is another tradeoff when choosing tools. Claude Code may give you more capacity per dollar. But using Cursor Composer can dramatically increase throughput. If the work is clearly defined and implementation-focused, you may get more done in fewer agent-hours.

A quick aside about Codex + GPT 5.4

If you're looking at Codex + GPT 5.4's velocity, you might notice that it didn't move as quickly. I wouldn't read too much into it. Each metric gives you a part of the picture; each tool has different strengths and weaknesses.

Firstly, I'm not as proficient at Codex's quirks as I am with Claude Code, so I don't know how to squeeze the most juice out of it. I noticed that during the experimental runs, GPT was much more cautious and spent more time slicing up the work into different groups.

And qualitatively, consider the multiple pieces of anecdotal evidence that Codex and GPT 5.4 can solve complex issues and that people are loving it. I've been hearing similar things in my conversations with colleagues. It's a potent tool and you should definitely give it a shot.

What I tested and how

The setup

I ran all 12 experiments on the same codebase: a monorepo with Elixir/Phoenix, React, and Terraform infrastructure, roughly 80k lines of code. Every experiment started from the same git commit. I used 4 parallel agents per tool, each on a separate git worktree (the same setup I described in Part 2). Each agent worked through the same sequence of self-contained refactoring projects: rename all instances of X, extract a module, add an API integration.

Each experiment ran roughly 60 minutes. I played a lightweight manager role — confirming "done" claims, assigning the next project. My controls tightened over the week as I learned what to watch for.

If you're interested in the raw data, reach out via LinkedIn or on X. If there's enough interest, I'd be happy to publish it on my Github.

The tool configurations

Detail	Claude Code	Cursor	Codex
Interface	CLI	CLI / Agent mode	CLI
Model	Opus 4.6 (200k, 1M context)	Opus 4.6 / Composer 1.5 / Composer 2 (varied)	GPT-5.4
Plan tested	Max 5x ($100)	Pro+ ($60) → Ultra ($200)	Pro ($200)
Autonomy mode	Accept edits on	CLI with allow-listing (not YOLO)	Runs commands without asking
Parallel instances	4	4	4

A few notes. I tested Claude Code on Max 5x ($100), not Max 20x ($200). The 20x projection uses Anthropic's published 4x multiplier — more on this in the calculations section. All three tools ran in semi-autonomous mode with different allow-listing behavior, which affects velocity asymmetrically and is unavoidable. Both Claude Code and Codex had active 2x capacity promotions during this period. Codex's promo applied 24/7. Claude Code's applied during specific off-peak hours.

What I measured

For Claude Code, I tracked the percentage of the 5-hour session consumed and the percentage of the weekly limit consumed. For Cursor, I tracked dollar amounts of API usage and Auto/Composer usage consumed, plus the combined total percentage. For Codex, I tracked the same session and weekly percentages as Claude Code.

How I calculated capacity

Defining "agent-hours"

An agent-hour equals one agent running for one hour. If 4 agents run for 1 hour, that's 4 agent-hours. The key question: how many agent-hours does each plan sustain in a month?

Session-based tools (Claude Code, Codex)

Technically, there are 2 limits: the 5-hour session limit and the weekly limit. The weekly limit is always more constricting than the sum of all the 5-hour session limits.

For each experiment, I measured the usage % at the start and end of the session, and calculated the difference. Since I know how many minutes the experiment ran, I calculate the "percentage consumed per minute" of both the 5-hour session capacity and the weekly limit. Monthly projection: weekly capacity × ~4 weeks × 4 agents.

To normalize the "5-hour session capacity" to "weekly capacity", I figured that 7 days has 168 hours. Thus, 168h / 5h = 33.6 sessions. If I can reach 100% capacity in 70 minutes, then I can multiply that by 33.6 sessions, and get 2,352 minutes.

Cursor's two-pool system

This is where the SOTA vs Composer insight emerges naturally from the math.

I measured the percentage consumed per minute of the monthly API pool (from the Opus-on-Cursor experiments) and separately the monthly Auto+Composer pool (from the Composer experiments). The API pool yielded roughly 1,065 agent-minutes per month, or about 18 agent-hours. The Auto+Composer pool yielded roughly 7,200 agent-minutes, or about 120 agent-hours. Combined: ~138 agent-hours.

The Max 5x → Max 20x projection

All of my Claude Code experiments ran on the Max 5x plan ($100/month). To estimate Max 20x ($200/month), I used Anthropic's published multiplier.

Anthropic's support documentation states that Max 5x provides 5x Pro usage and Max 20x provides 20x Pro usage — so Max 20x = 4x Max 5x capacity. This is a projection, not a measurement.

Off-peak and promo normalization

Anthropic's 2x off-peak discount applied to several experiments. I normalized by halving observed capacity during off-peak hours: conservative but approximate. I also ran experiments during peak, off-peak, and on the threshold of both.

When it was on the threshold of both, I just removed the values from the calculation. I was curious how the code would behave.

Codex's 24/7 2x promo (through April 2) was similarly halved. Both the promo and normalized figures are shown throughout for transparency.

Walking through the math: Experiment 7

Let me show the math for Experiment 7 comparing Claude Code vs Cursor Ultra, both using Opus 4.6.

Claude Code (Opus 4.6 1M window):

The weekly limit went from 37% to 42% over 60 minutes with 4 agents — that's 5% of weekly capacity consumed.
Weekly capacity = 100% / 5% × 60 min ≈ 1,200 minutes of 4-agent usage.
That's with the 2x off-peak discount.
Normalize to 1x: 1,200 minutes / 2 = 600 minutes.
Monthly: 600 × 4 weeks ≈ 2,400 minutes.
Convert to agent-hours: 2,400 / 60 × 4 concurrent agents ≈ 160 agent-hours on Max 5x.
Apply the 4x multiplier for Max 20x: ~640 agent-hours.

Cursor Ultra (API Pool, Opus 4.6 200K window):

API credits went from 0% to 26% over 60 minutes.
Monthly API pool capacity: 100% / 26% × 60 minutes = ~231 minutes
Normalize to agent-hours: ~231 mins × 4 agents / 60 mins/hour ≈ 15.4 agent-hours.
Since this experiment used only Opus (a frontier model), only the API pool was consumed. We borrow the estimated ~138 total agent-hours for Cursor's 2 pools for the combined estimates.

In this single experiment, Claude Code Max 20x delivers roughly 41x more than Cursor's API pool (640 / 15.4), or roughly 4.6x more than Cursor's combined capacity (640 / 138). Other experiments produced different ratios depending on the model, discounting, and control tightness. The ~5x headline is the central estimate across all experiments.

This is back-of-the-spreadsheet math, not a precise benchmark. But for an order-of-magnitude comparison, it's enough.

Qualitative observations

A few things that don't show up in the numbers but matter for choosing a tool.

Composer 2 velocity was great. Some of the projects were eye-opening: Composer 2 raced through an average of 7.1 projects to Opus 4.6's 2.3. Experiencing it in real time was striking. Whether that speed holds up on complex, ambiguous tasks is an open question.

Opus 4.6 performed consistently across both platforms. Same model, same velocity on Claude Code and Cursor. The capacity difference between these tools is pricing architecture, not model quality. If you're choosing based on model capability, both platforms give you access to the same thing.

Token consumption is volatile day to day. Model updates, features, regressions, and discounting all hit during the same period. This may have caused noise in the experimental data, but it's also representative of daily life at a particularly active time in the technology and business of AI coding tools.

Key takeaways

1. The capacity gap is real: ~5x combined, ~38x on frontier models. If you use Claude Code with Opus (its default), you get substantially more runway per dollar than Cursor. If you only compare frontier model access, it's not close.

2. To make the most out of Cursor, you should be using Composer a lot. Most of your Ultra budget buys Composer credits, not SOTA access. If Composer fits your workflow, you get ~138 agent-hours and strong velocity. If you want frontier models full-time, Cursor becomes extremely expensive per agent-hour. A common pattern is to use SOTA models for initial planning and research, then Composer models to implement the plan much more rapidly.

3. Velocity matters — Composer 2 is much faster at completing projects. More capacity doesn't automatically mean more output. An engineer running Composer 2 on tasks may complete more work in 138 hours than another running Opus.

4. The pricing model shapes your workflow. Claude Code's speed-limit model rewards consistent daily usage with parallel agents. Cursor's monthly budget is more forgiving for bursty schedules. The "best" plan depends on how you work, not just the capacity math. (I covered this difference in Part 1.)

5. Codex is a real contender at ~1.6x Cursor's capacity, and a number of engineers I know and follow online have been enjoying Codex for its knack at solving harder problems that Opus 4.6 may have challenges with. And you get the SOTA model for all the agent-hours.

6. With Anthropic's "capacity reductions" for 7% of users, I ran out more often in the 5-hour session, but not necessarily the weekly session. I'm not 100% sure yet, because the measurements keep fluctuating. But the weekly session seems to be similar to what it was before. And since it is the constraining factor, running out of 5-hour sessions may not necessarily mean that I have overall fewer tokens per month.

Caveats and open questions

This section is long on purpose. The caveats are as important as the findings.

Experimental design limitations

No two experiments were identical. Models changed, plans changed, promotions came and went. Each experiment is a snapshot of a specific configuration on a specific day.

I was the human bottleneck. Confirming "done" claims, assigning projects, occasional breaks — all of this introduces noise. Semi-autonomous mode created asymmetry across tools: each tool pauses at different moments for permission, which affects velocity differently and is unavoidable.

Also, velocity was not the primary objective, since I was interested in token capacity (or agent-hours). In particular, code quality was probably decent but not audited. From my experience, the AI agents usually get most of the way to the finish line.

And, Codex and Claude Code both have lighter, faster models (e.g. GPT 5.4 mini, Sonnet) for varying speed and token usage.

There are many interesting variables and questions, and I didn't test them out for the sake of time.

Limitations in measurement and extrapolation

The whole purpose of this experiment is to normalize across tools that report usage in fundamentally different units, and that's also the main source of imprecision. Claude Code reports percentages of session and week. Cursor reports dollars for API plus a separate pool for Composer, with a combined total. Converting between these systems requires assumptions.

Resolution of the measurements is often low. If your measurement jumps from 0% to 3% in an hour, the true value could be anywhere from 3.0% to 3.99% — a roughly 33% range of uncertainty. For that reason, I ran multiple experiments to get a sense of averages and ranges. Using 4 agents helped me accelerate burn to see more numerical change in less time.

I simplified my extrapolation for agent-hours by multiplying the weekly estimated agent-hours by 4, totaling 28 days. Technically, the average number of days is slightly over 30.

The chaotic experimental window

I get the sense that something around March 13 or March 14 may have changed Claude Code's token burn to accelerate.

Moreover, the 2x off-peak discount launched March 14 and ended March 28. I normalized by halving, but the normalization is an approximation. Composer 2 shipped March 19, Experiment 7 may not represent steady state, though Experiment 8 (March 20, no discount) confirms the pattern. Codex's 2x promo was active through April 2, normal-rate Codex may go to 0.8x rather than ~1.6x Cursor. Or, focusing on frontier models, 6.2x instead of 12.4x.

I could have waited for a quiet week. But there hasn't been a quiet week in AI coding tools in months. This chaos is normal usage — the launches, the promotions, the regressions. A perfectly controlled experiment would be more precise but less representative of what you'd actually experience.

Open questions

Does the capacity gap change for different work types: greenfield vs refactoring vs debugging?
What about for tech stack? I was doing full-stack engineering in Elixir/React/Terraform. How does that change for Python/Svelte/Pulumi? Firmware? Mobile? SRE? Database internals?
What's the quality gap? If any of the models' speed comes at a quality cost, the velocity advantage shrinks.
How does this look on team and enterprise plans, particularly Claude Code Premium Seats in Teams?
Will these numbers hold as all the companies adjust pricing and models adjust velocity?

Tips for reducing token usage

I wanted to share a few resources I found online or heard while discussing this with friends and colleagues.

If you're using Opus, consider switching to Sonnet as your default model. A few of my friends report that Sonnet is similarly effective, but faster and more token efficient. I've been mostly focused on Opus, so I can't speak to this directly.

Reading Claude Code best practices. Regardless which tool you're using, some of the concepts in the guide may help.

Clearing context more frequently is an easy change. My experiments ran on models with 1M context, and I just let them run and auto-compact over the course of the hour. I believe the whole conversation gets sent up (minus caching effects), so clearing might be impactful.

Cron job 2–3 hours before you start your work day to send Claude/Codex a trivial message. Given that the 5-hour session limit is a constraining factor, consider that you typically have 2 sessions in an 8-hour work day. You can get a 3rd window if you trigger it before you start the bulk of your work. Note that in the end, you'll still hit the weekly constraints.

Use "token-reducing" libraries like RTK. The premise is that a lot of CLI binaries that the AI coding agents call generate noisy output that is bad for LLMs. It creates a proxy to optimize the tokens. Consider looking for more, since this is a class of tooling. In the CLI, there is tokf. There are also prompt compressors like Microsoft's LLMLingua.

Current events: recent news about token costs

At the cost of extending this article further, I wanted to highlight a few recent items of news as it pertains to this analysis.

On March 5, 2026, Forbes reported that Cursor's internal analysis showed that the $200/mo Claude Code subscription could get $2,000 of tokens at the end of last year, and in early March 2026 was getting $5,000 of tokens. On the other hand, compare that to Cursor's $200/mo plan offering $400/mo of API usage plus generous Auto+Composer. But the reason I was interested in this experiment is to begin to translate it to "how many hours of engineering work can I do with this?" and begin to quantify this.

Also on March 5, 2026, investor-entrepreneur Chamath Palihapitiya tweeted that his company 8090 chose to migrate off of Cursor because AI costs have tripled since November 2025, and are "now spending many millions per year", trending to $10m per year. He mentions that it may be how the engineers are using the tooling as well, e.g. running runaway loops ("Ralph loops") without regard to cost. But the main point is that it's a topic of interest and an area worth thinking about.

Around the weeks of March 14–26, users were reporting increased token burn rates. (See my LinkedIn posts noting my initial observation on March 14, then my LinkedIn post when it trended on X on March 25). It looks like Anthropic announced a capacity change on March 26, estimated to affect a minor ~7% of users. But as of publishing this article (Mar 30), it seems like they're still working on it.

I speculate that Anthropic tweaked the 5-hour token limits which helps them with scale, but the weekly token limits didn't change that much. If that's true, then overall monthly token capacity doesn't change much. It just means you run into the limits a lot more per day. (You might try that cron job I mention above.)

Anyways, this article represents a moment in time as our use of the tool and the pricing models around it change. Last June/July (2025), Cursor changed its pricing models in a way that upset users. I wouldn't be surprised if this continues to change.

Wrapping up

This started with a pricing question and ended up capturing a lot more. While the technology, pricing and business will continue to evolve, I wanted to do this deep dive to understand a snapshot of the ecosystem today. As things evolve further, I can have an anchoring mental model to reason about future changes.

The choice isn't just "cheaper vs more expensive." It's what kind of capacity you need. Frontier model capacity for complex reasoning? Reach for Claude Code or Codex. Fast implementation throughput on well-scoped tasks, or you prefer an IDE? Cursor Composer has a real speed advantage when you combine frontier models for planning and troubleshooting with fast, lightweight models. Most engineers probably need some of both — the question is which default fits your workflow.

I plan to keep running experiments as both tools evolve. If you're interested in discussing the findings, seeing the raw data, or talking about token math — I'd like to hear about it: connect on LinkedIn or find me on X.

This is Part 3 of my series on transitioning from Cursor to Claude Code. Catch up: Part 1: Stuck at 16%, Part 2: Parallel Agents.

From 1 to 3 Parallel Claude Code Agents: How I Broke Past 16% Utilization

Andrew Shu — Tue, 17 Mar 2026 16:41:24 +0000

At the end of my last post, I was: stuck at 16% Claude Code utilization on the Max 20x plan, and had just figured out that parallel agents could help me break past that limit, explore git worktrees, and make better use of the $200 plan I had paid for.

But I knew that git commits and conflicts would be a problem as 2 agents make the same commits in the same repository. So how do I coordinate and isolate them?

Spinning up a second (or third) agent in another terminal is easy. But keeping them productive and increasing velocity was the new challenge. I had been reading many posts about people orchestrating 10's or 100's of background agents, but I haven't read many tutorials covering the evolution from 1 to 3 agents.

Anthropic recently shipped Claude Code Agent Teams, which automates this: a lead agent coordinates teammates, assigns tasks, and synthesizes results across multiple sessions. But this is more about automated delegation of a single existing project rather than adding the ability to parallelize arbitrary new projects.

This post covers the observations, changes in my local development environment and the reasoning that took me from 16% to 50%+ utilization.

Token FOMO: how were people using 3+ Claude Code accounts?

I'll confess: I was feeling token FOMO, watching engineers post articles about their agent squads and getting 100 agents run in parallel. Meanwhile, I was stuck at 16% on a $200 / month. I didn't feel like I needed 100 agents, but I wanted to understand how to break past that to get on the path to higher output.

After the initial experiment with a second agent to consume more tokens, I realized that extra agents would be chaos: merge conflicts, inconsistent databases, agents pulling the rug out from under each other.

So, I decided to investigate how to coordinate them. This eventually took me on a journey that improved my workflow. But before I took the first step, I realized I needed to ask: would it actually increase my output?

Before adding a second agent, make sure the first is busy

It's a bit counterintuitive. It's so easy to spin up a second agent that it's also easy to miss that you're only getting value if you're able to keep both agents mostly busy. Here are a few hints to figure out where you are.

If your agent is often waiting on you: waiting for an answer to its question or needing clarification on ambiguous tasks, it's not doing real work. You probably need a way to keep them busy better, and this is where structured markdown plan files pay off. Or the agent is waiting on you to execute some queries or deployment, maybe the problem is a tooling problem - you need to automate something and put it in the hands of the agents (if it's safe to).
If you keep interrupting the agent to micromanage, it's effectively the same problem. It may be helpful to review the markdown plan files and get them into more agreeable state before you let the agents run.
If your agents keep churning out work that you end up disagreeing with (despite having reasonable plans), then the workload may not be good for parallelizing. I see this often when troubleshooting complex systems or complex projects.
If you prefer to hand-code, or you don't like context-switching between multiple agents and tasks, you have a totally valid reason not parallelize. Not every workflow benefits from more agents.

But if you're sitting idle while the agent runs—working on the next task, responding to Slack, browsing Reddit—this is the signal you can juggle another agent. Basically: if you're waiting on the agent regularly, committing and shipping regularly, then add more agents.

Two agents crammed in a repo: isolation is a problem

When you're ready for Agent 2, you become an engineering manager and face the problem of assigning useful work to your team. You need to source the work: come up with ideas, talk to people. You need to scope it so it's parallelizable and pragmatic.

But there's a technical problem you'll face first. If two agents are touching the same files, you'll get merge conflicts, overwritten work and outdated understanding of the code. In the first few hours working with parallel coding agents, I tried to keep them productive and focusing on separate concerns in the same repository.

Here are a few methods I used to separate the agents:

Frontend / backend split: these are often separate concerns in separate files
Application and infrastructure code: e.g. one agent writes Typescript, and another, Terraform
Feature pipelining: first ship feature 1 behind a feature flag, and validate it / work on corner cases while another agent starts feature 2
Async refactoring, hardening, polishing, documentation: sometimes if I have a bit of extra bandwidth, I'll spin up an extra agent to do maintenance that avoids my main work. It's useful to accumulate maintenance tasks in a backlog for the agent to pull from.

After a few coding sessions, I realized "separation" wasn't enough: I was trying to keep agents separate through convention rather than configuration. Collision reduction wasn't enough; I needed to properly isolate them to eliminate collisions so they can be more autonomous and be faster.

And note: these agent "separation" methods may not have fully solved the "isolation" problem. But they're useful scoping / delegation approaches to fully isolated agents, too.

I also began to be aware of the kinds of workloads that required more active attention, and some kinds of workloads that ran longer. I knew that I could only handle 1 workload that required active attention, which meant the other agents needed longer projects. And having longer projects meant a bit of planning ahead.

Git worktrees: multiple coding agents in the same repo

Once you have isolated, right-sized projects, I was still running into the risk of git conflicts. Parallel coding agents are still editing files in the same directory.

Git worktrees solve this problem. Each worktree is a separate checkout of the same repository: a different folder, a different branch, but linked to the same git history and object store. They're lightweight to create, and you can have 3 agents working in the same sandbox, all contributing back to the same repo. Learn more about git worktrees.

Why Git worktrees instead of separate clones? They're a bit more lightweight because they're linked to a history of the various branches and commits. So, for example, a single git fetch makes it into all worktree directories. And commits in each worktree are known by the others.

There are a few ways to set this up:

Worktrees with git

Git worktrees are both branches and folders, so here are a few commands you can use:

# List worktrees
git worktree list

# Create a new worktree AND a new branch in 1 command
git worktree add -b <new-branch-name> <path/to/new/directory>

# Create a new worktree with an existing branch
git worktree add <path> [<branch>]

# Remove the worktree (i.e. the folder) but the branch remains
git worktree remove <path>

Built-in support in vibe coding tools

I won't elaborate here, but I'll link to the documentation:

Cursor: Parallel Agents
Claude Code: Run parallel Claude Code sessions with Git worktrees
Codex: Worktrees

Note: If you want automated multi-agent coordination, check out Claude Code Agent Teams. This is useful to parallelize tasks within a single project. The worktree-based approach I describe is slightly different. You can create and control your own system of parallel agents to launch multiple, arbitrary projects. Claude Code Agent Teams lets you burn down a project's list faster, and the worktree-based approach lets you branch out to work on multiple projects.

3 parallel API servers, 3 frontend servers, 3 databases

Git worktrees went a long way, but I noticed that verifying my code was tedious. Stateless unit tests were easy, I could run them to my heart's content in each worktree. But integration tests that touched the database encountered different postgres schemas. And my local servers obviously collided on ports.

Here's what a sample web server might look like, with an API server and UI server that each have environment variables.

I found myself spinning up and tearing down servers, running migrations back and forth. I started thinking about how to isolate them a bit better by extracting configs into environment files, then parameterizing different ports and databases. Classic DevOps practices.

So, I wrote a few Claude Skills that wrap the underlying git worktree command: create a worktree, allocate ports, provision databases, generate .env files, install packages and register the ports/database in a central JSON file.

I've open-sourced a generic version that you can customize for your application: you can find it on Github. You'll need to customize the setup/teardown to the specifics of your environment. While I can't make it turnkey for every solution, I wanted to share the structural elements: where it runs and how it runs.

Results: from 16% to 50% with 3 parallel agents

Parallelizing coding agents was fast and straightforward: for a small application, I reconfigured my tooling in about a day. With the agent Skills I shared in Github, I was able to create a worktree and provision the environment in such a way that I could scale up the number of parallel agents to 3 and beyond.

Whereas I was stuck at 16% before, I was quickly hitting 45+% consistently. I got to the point where I started bumping up against the weekly rate limits. And I could finally see the pathway to 10 or more agents, and the need for 2 or more Max 20x subscriptions.

But in the end, it wasn't about getting to the top of a token leaderboard. Boosting my Claude Code utilization from 16% up was an exercise to ground my use of AI coding agents towards getting useful work.

It was helpful to work on a small project to exercise my software development lifecycle: planning, implementing, testing, and running on simple cloud infrastructure. What's the use of fast coding if you can't operate and troubleshoot it?

Don't feel FOMO about orchestrating 10+ parallel agents

It's important to keep our eyes on the goal: to build things people use, enjoy and get value from.

There are folks who are pushing limits and are aiming to build fleets of 10's or 100's of parallel agents. That's awesome, and I can't wait to see what abstractions and tools they create to make it useful for the rest of us.

But, I wanted to figure out the pathway to parallel agents in a grounded, lightweight way. I wanted to figure out when to parallelize, how to split up the work, and what infrastructure to set up. AI coding agents are clearly accelerating our work, but I wanted to feel the rough edges so I know how specific tools solve specific problems.

If you take one thing from this: don't start with the tooling. Start by getting your first agent fully occupied, then find an isolated task for a second. The changes you make should follow the problems you encounter.

If you want to skip the manual setup, grab the worktree bootstrap script on GitHub and customize it for your project. It handles port allocation, database creation, and env config for Rails, Phoenix, Django, and similar stacks. Check out the readme.md for instructions.

Now that I'm running 3 agents and burning tokens 3x faster, the cost comparison between Claude Code and Cursor becomes interesting again. Next up: I'm going to dig into the pricing math to answer my original question about why switching from Cursor to Claude Code seemed to drop my token usage by 64%.

This is Part 2 of my 3-part series on my experience transitioning from Cursor to Claude Code. Catch up: Part 1: Stuck at 16%. Part 3 next week.

I Switched from Cursor to Claude Code and Got Stuck at 16% Utilization

Andrew Shu — Fri, 13 Mar 2026 19:16:47 +0000

While tinkering over the holidays, I remember thinking: "This is so strange! I was easily reaching $350 of Claude tokens in Cursor usage for the month. After switching to Claude Code, I was barely making it past 16% in Claude Code's 5-hour sessions. Comparing Claude vs Cursor's $200 plans, they both cost $200 / month. It's the same work, same velocity, yet I'm experiencing totally different limits."

Given my ops and scaling experience, I'm mindful of how much it costs to operate software. So I obviously couldn't leave this alone! This journey started out with me worried I had overpaid for a $200 plan but ended up leading to a significant acceleration in my workflows as I tried to make full use of my Claude Code allotment.

From Cursor to Claude Code: monthly token counter vs 5-hour speed limits

Within 15 minutes of using Claude Code, I realized that I was going to need much more than the Pro plan ($20/month). I started with the smallest paid plan to feel where the limits are. This threshold was initially a shock to me, since my mental model of "token limits" was still based on Cursor's monthly window.

At the time, it would take me a few days to use up Cursor's tokens. But with Claude's 5-hour cycle, you get fast feedback that the Claude plan you've chosen is too small. So to reframe my observation: it was not that I had "used up all the tokens for the month", but that I was using tokens at a much faster speed in this 5-hour session than was supported by the plan.

Given how fast I had hit my "$20 Pro Plan" limit, I assumed that I wouldn't need to try out the middle Max 5x plan and just jump up to the $200/mo Max 20x plan. (Anthropic only charged me the prorated difference, so it was easy to upgrade).

I assumed I was going to hit 80-100+% utilization like I was in Cursor, but I was wrong. Imagine my surprise when, after a day or two of coding, I realized that I never hit anywhere close to the Max 20x plans' 80-100% utilization!

Did I overpay for Claude's Max 20x Plan? No, but I needed to learn how to use it.

After using Claude Code for the first 2-3 sessions, I noticed I was only using 6-12% of the 5-hour usage window in each session. Thus I was only using a fraction of the $200 I spent. This was a surprise! I was doing the same coding workloads on Cursor and Claude Code.

Having such low Claude Code utilization was great, because it meant I could code more and spend less money! But it bothered me on two levels: firstly, could I have gotten away with paying less. Secondly, how were people hitting 100%? Not just 100% -- I was reading online about people using 2-3 Claude Code accounts.

My goal wasn't to maximize token spend or get to the top of the leaderboard. I was puzzled and bothered by this underutilization. So, the first thing I did was to set up structured markdown plans to launch longer-running agents that made full use of the 5-hour session. This let me confirm that tasks were reasonable, and I was unlikely to need to pause Claude's work to answer questions and troubleshoot.

After a focused few high-usage sessions, I realized I could push my utilization to 14 - 16%. And that seemed to be the ceiling.

Claude Code's 5-hour sessions are a "use it or lose it" rate limit that spreads out usage

Let's dive into Claude Code's rate limiting system. The "5-hour usage" windows functions as a "speed limit". What this practically means is – you'll figure out your "speed of token usage", and calibrate your plan accordingly.

So when we compare Cursor vs Claude Code's pricing models, Cursor is billed by monthly total tokens consumed. That means I could leave Cursor untouched for 29 days, then use up all my tokens in the 30th day. (I assume there is a rate limit for extremely bursty token usage in Cursor, but I've never hit it.) It also means that comparing Cursor to Claude Code licensing is an apples to oranges comparison.

Claude Code also has "weekly limits" that are a second layer on top of the 5-hour limit. Imagine if you maximized the 5-hour limit, 24 / 7; that could get extremely expensive for Anthropic. So, Anthropic sets an upper limit for sustained usage over the week. If we reflect on pricing design, they could have set it at a 1 month limit. But by setting it to be weekly, you must utilize your weekly limits, because it doesn't roll over to the next week.

So the 5-hour limit is a "burst speed limit", and the weekly limit is a "sustained usage limit". The 5-hour window smooths out utilization across the day, and the weekly window smooths it out over the month. Since tokens don't roll over from week to the next, you use it or you lose it. Technically you can miss a few 5-hour windows, and you can make it up later in the week. And if you don't use it for the week, you don't get to make it back up.

This isn't a bad thing. Most engineers are probably doing work spread out over days and weeks. Claude Code's system is a fair agreement for typical engineering work, that makes better use of compute resources and the effort you put in writing code. If you want higher levels of usage on either plan, you're an advanced user and you need to pay API usage (i.e. higher) token costs.

Another way to look at it could be: the 5 hour session roughly maps out to a 10 hour workday, and then utilization assumes a ~40-hour work week of utilization. So some of the windowing and upper limits may make sense from this lens, and may not make sense if you're trying to utilize your license with a 24 / 7 weekly schedule. I haven't explored the math and logic behind this framing, but wanted to share it as a thought experiment. So take it with a grain of salt.

So did I get past 16% utilization in Claude Code?

After all this, I was still stuck at 16% utilization. I understood why: the speed-limit system means that a single coding session with a single agent has a natural upper limit. No matter how focused I was, one human directing one AI agent can only consume tokens so fast.

And that raised the obvious question: if one agent tops out at ~16%, and people online are hitting 100%+ across multiple accounts... they must be running agents in parallel. This meant I needed to figure out how to coordinate multiple AI agents working on the same codebase without them stepping on each other's toes.

To coordinate parallel agents, I had to rethink how I broke down projects. It also led me to git worktrees and additional changes in my local development environments. I'll cover that in my next post, and describe how it took my utilization from 16% to 50%+.

If you're tracking your own Claude Code or Cursor utilization, or you've figured out the parallel agent workflow, I'd love to hear about it — DM me on LinkedIn or on X.

I write weekly about vibe coding workflows, costs, and tools. Follow me here on Dev.to, or subscribe at ashu.co for email updates.

Why I use markdown plan files instead of Cursor and Claude's built-in planning

Andrew Shu — Mon, 02 Mar 2026 17:49:53 +0000

The technique that helped me jump over the threshold from "coding with AI" to actually "vibe coding" was the use of a plain markdown file. Interestingly, it wasn't Cursor's built-in planning mode, nor Claude's in-memory task lists. It was a plain markdown file, with numbered subtasks, and a breakdown of the work that needed to be done.

Early on, I stayed "hands-on" with the coding agent, because I was concerned about multi-part problems, about the AI coding agent "jumping in too soon", and that I wouldn't know how the code worked if the AI hallucinated or ran out of context (tokens). But I found that vibe coding with markdown plans meant that I had an artifact that (quickly) helped me be more intentional with design.

In this article, I'll share the prompt I put into AGENTS.md (which I've shared on Github), what worked, and what didn't. But first, I'll share some context about how I got there, because I think the path is relevant to a pattern I hear a lot of engineers get stuck in.

How I went from 'coding with AI' to actually vibe coding

I used Cursor in my work through most of 2025. I found it useful for inline prompt editing, one-off chat questions, and greenfield scripts for reporting and maintenance. But I was mostly doing single-file work. Multi-file, multi-step projects with real complexity? I'd start in Cursor, hit a hallucination or a design decision I disagreed with, and fall back to writing it by hand.

For me, it came down to a conversation with Michael Stahnke (leading engineering at Flox) at Github Universe last year. We were comparing notes on how we were using AI for coding, and he mentioned something that resonated instantly: he was structuring his work in markdown plan files. This gave me a lever of control -- it let me audit a multi-part project breakdown before implementation.

From there, I was able to go from prompting the agent for each change I wanted, to taking a step back and creating a plan that I could let the AI coding agents run for hours on. This was when I truly went "hands off" and let the AI steer for itself.

Here's what I've found makes this approach work.

" width="800" height="533">

Benefits of a markdown plan file, for me and the AI coding agents

It gives me an artifact I can control. With a file artifact, I can find it easily in my codebase, edit it if I want to, commit it for future reference, append notes and learnings, or feed it into another system or automation. This is a subtle point, but it's the key reason why I prefer a "markdown plan" that I own, over the "planning modes" in Cursor / Claude.

It makes me and the coding agents more intentional. AI is great, but still not perfect. A markdown plan forces a quick up front research phase where I can identify gaps in understanding, and areas I may disagree with. It also forces me to roughly understand what's being built. Even as things become more autonomous, it's still important to understand what you own -- even if it's at a higher level and across many more systems.

I can control the pace better. I can pause the AI to ask questions, I can rewind to a previous state (in git and the plan file), skip around tactically, I can adapt the plan as we discover new information, or go fully hands-free. The goal is to increase autonomy and parallelism, but having a file with clearly numbered and grouped tasks lets me communicate about and manage chunks of work. This can be extended by pointing multiple agents at different phases of the same plan.

It gives me more options in managing my conversation's context windows. Since agents lose efficacy as their context windows fill up, different people have different preferences in how often they "reset" the agent. Some prefer 50% usage, 90% usage, or some people OK with "infinite-ish conversation" compaction. In any case, having a markdown plan with task status, a work log, and git commits gives you the option to clear the conversation any time.

A plan file gives me a place to deposit learnings, error messages and TODO's. This is more of a documentation step. But as I interact with the agent, there are times where it encounters an error message, or a design decision that I want to remember or revisit later, I like to log it for later. Since I always archive my plan files instead of deleting, I plan to use this as a work journal where I can come back and ask questions like "what tech debt have I accumulated"? This covers a gap in knowledge, because it's not contained in code or commit messages.

I can automate post-implementation steps as a Cursor / Claude Skill. Cursor Skills and Claude Skills both support reusable "prompt actions" - you can think of them as natural language scripts. When everything in the plan is done, I need to delete or archive the plan to a different folder. I've noticed that this is a natural point to run a Skill to review the code and look for opportunities to improve: security, testing, documentation, and code factoring.

Example Markdown plan to create a local Next.js frontend / backend for the coding agent

Markdown plan file best practices for Claude Code, Cursor, etc.

I've put my plan prompt on Github, and you can read an example of a markdown plan for creating a simple "hello world" Next.js app.

0xandrewshu / ai-utils

A collection of scripts, prompts and docs for use with AI and vibe coding

AI Utilities: collection of scripts, prompts and docs

As I use AI for vibe coding or other types of work, I find it helpful to collect reusable prompts, skills, subagent files, configs, etc. I'm creating this repository to deposit artifacts that I've found useful.

Compatibility

The intent is for these snippets to be reusable across AI coding tools (e.g. Claude Code, Cursor, Codex, Gemini / Antigravity, Copilot, etc.). There are occasionally differences in capabilities, but they have typically "caught up" with one another pretty quickly after that.

Repository organization

Initially, I plan to organize these as a flat directory until more organization is necessary.

rule-$NAME/ - e.g. CLAUDE.md, AGENT.md, AGENTS.md
skill-$NAME/ - e.g. Claude Skills, Cursor Skills
subagent-$NAME/ - e.g. Claude Subagents, Cursor Subagents
prompt-$NAME/ - e.g. reusable prompts to copy/paste into vibe coding tools (Claude Code, Cursor) or chat AI tools (Claude.ai, ChatGPT)

In each directory, I'll aim to…

View on GitHub

Header: a quick summary of the plan

I like to have a few lines at the top that summarize what this plan file contains: title, date, objective, and references to any "child" or "related" plans. I "spin off" and "split" big plans into child plans, and I find it useful for the "child plans" to reference the "parent plan", and vice versa.

Task List: the focal point for the implementation

Near the top of the plan, I like to have a consistently structured, markdown table of tasks. This is the focal point of the plan that serves as a backlog that sequences and organizes work. Since AI is often inconsistent about formatting, and the structure of the task list is important to my workflow, I've made it a point to specify structure concretely:

I like this markdown table to have these 5 columns: #, task, status, priority, comments
All tasks are numbered, so I can tell the AI things like "Do 1.1 - 1.3 but skip 1.4"
Tasks are grouped into "phases", so I can tell the AI things like "do phase 3 first"
I find that using emojis like "✅ Completed" helps me visualize status better for larger lists

Task Breakdown: a design doc to review

This functions like a design doc - I like to audit this BEFORE implementation. It's usually accurate, so it's mostly to catch the occasional issue and to improve my understanding.

Work log: a journal of errors, problems and learnings

This is a dropoff location where I ask the AI to deposit error messages, design decisions and tradeoffs, so I can reference them later. When I run a Claude skill to "close up my plan", I have hooks that will reflect on problems described in the work log. I want to be able to query for exact error messages so I can document it later.

TODO section: accumulating ideas that don't impact scope

Regularly in my work, I have to make tradeoffs and tell the AI that "this is outside scope", but "log it for later". So I tell the AI "save a todo to do XYZ", and the TODO's are stored here. Since I like to archive my plans (in git, or Obsidian), you can query old plans belatedly to extract TODO's relating to "Github actions", for example.

What didn't work: hand-written plans, long AGENTS.md files

I often find it helpful to read what people tried, but didn't work. So here's a few things that I tried:

Write the plan files by hand. When I started out, I would hand-write a plan. Very quickly, I realized that this time-consuming process could and should be done by the AI. I see and talk to engineers doing this, and I think this is a common misconception: use a short plan, and let the AI coding agent flesh out the plan.

Estimate and document "effort" and "risk", to influence the agent to behave differently. I thought this would change its behavior so it would scale up rigor / safety up or down. But in the end, I've seen no evidence of this. And its estimates of effort and risk were incredibly inaccurate.

A long planning guideline for AGENTS.md. The naive version of this planning guidelines accumulated to be long -- because I had the AI add instructions every time I was annoyed at something, it grew to 130 lines with a nearly full example. I've since condensed it to one that's around 35 lines, and it's productively good.

Having no planning guideline, and just telling the agent to "create a md plan". It was usually correct, but it was inconsistent. I depend on the task list being near the top, and having it structured in a certain way.

Telling the agent to "create a plan", and assuming it would follow the AGENTS.md. I have to explicitly say "create a markdown plan", and sometimes "create a md plan according to guidelines".

Counterpoint: why markdown plans may not be for everyone

Before I close, I should note that this technique may not be for everyone.

For starters, Claude and Cursor's Planning Mode is actually pretty solid. I think it works for many / most cases, and is simpler to use. It's time consuming to wait for a plan, review it, and simply implement it again. Also, you may get a lot of plans that you simply agree with and are OK to get going. If this is the case, you may as well have the AI jump straight into implementation and review the output at the end.

And as engineers seek more parallel agents and longer running autonomous agents, they may have to re-evaluate markdown plans. Maybe markdown plans aren't scalable enough, or are too freeform. For example, you can look at Steve Yegge's beads project, as products in issue tracking. (I haven't tried it, but I'd like to.) I believe the idea is that a more structured workflow will boost clarity, performance and understanding, especially for "totally autonomous agent teams" like his Gastown project.

There are other philosophies worth exploring. The Ralph Wiggum loop, created by Geoffrey Huntley, takes a different approach entirely: instead of planning across a long session, it runs the agent in a bash loop with fresh context each iteration. Progress lives in files and git, not in the agent's memory, so it avoids "context rot".

Another philosophy: Spec-driven development (with tools like GitHub Spec Kit and AWS's Kiro) goes further on the planning axis — writing detailed specifications with acceptance criteria before any code generation, so the spec itself becomes the source of truth.

My markdown plans sit somewhere in between: more structured than a Ralph loop prompt, lighter than a full SDD spec.

Closing thoughts: markdown plans are medium weight, and that's the point

This may seem like a heavyweight process, but in reality it's pretty quick. For starters, it's not meant for truly lightweight, one-shot prompts. I primarily use it for larger, multi-hour runs where I want to reduce the likelihood of poor quality work.

To reference a recently common saying that "with vibe coding, all engineers become managers", I think a markdown plan is basically like a manager or a tech lead asking a teammember to do a bit of researching, project planning and designing things. The complexity and time spent should be scaled up or down depending on complexity and urgency.

I should also add that I'm open to moving beyond markdown plans, and I don't think this is necessarily the end state of project planning. Specifically what I care about it is:

Taking a small amount of time to do some planning that I can iterate on as things change
Having good task management: task identification, descriptions, explanations, groupings
Having an artifact that I can control, integrate with and automate around

If you're using markdown plans or have a different approach to keeping agents on track for longer projects, I'd like to hear about it — DM me on LinkedIn or on X.

I write weekly about vibe coding workflows, costs, and tools. Follow me here on Dev.to, or subscribe at ashu.co for email updates.