Aleksandr Sakov

Originally published at sundr.dev

From Vibe Coding to Shipping: My Spec-Driven Workflow with Claude Code

On February 2, 2025, Andrej Karpathy coined the term "vibe coding": "fully give in to the vibes, embrace exponentials, and forget that the code even exists." A year later, on February 4, 2026, he retracted it. The new word was "agentic engineering" — because, as he put it, the new default is that you are not writing the code directly 99% of the time. You are orchestrating agents.

Twelve months. From a meme to a discipline. Most teams I talk to have not noticed the difference.

In between those two tweets, a 2025 METR study ran a controlled experiment on senior open-source developers using Cursor and Claude on mature codebases — the kind of codebase where you know all the corners. The developers forecasted AI would make them 24% faster. After completing the tasks, they estimated they had been 20% faster. Measurement showed they were actually 19% slower. Senior people, on code they had been writing for years.

In an earlier post I said AI made me 30-40% faster on my own client work. Both numbers can be true at the same time. The difference is not the tool. It is the workflow around the tool.

This post is that workflow.

It is not theoretical. It is what I run every day on production code at sundr. Claude Code is the engine; a small set of open-source tools are the guardrails. Together they turn AI from an enthusiastic intern who ships 70% of a feature into a senior pair who ships the other 30%.

If you have ever watched Claude generate a clean-looking pull request, hit merge with a smile, and then watched the same code break three days later under real traffic — this post is for you.

AI is an amplifier

The single most useful sentence I read about AI in 2025 came from Google's 2025 DORA report, which surveyed nearly five thousand developers in September: "AI doesn't fix a team; it amplifies what's already there."

Same report: 90% of developers are now using AI at work, and over 80% report productivity gains. About 30% report little or no trust in the code AI produces. That last number is what a spec-driven workflow exists to address.

Stack Overflow's 2025 developer survey adds a sharper data point. Developer trust in AI accuracy fell from 40% in 2024 to 29% in 2025. Forty-six percent now actively distrust it. Sixty-six percent report being frustrated by output that is "almost right but not quite". That is exactly the failure mode that destroys you in code review six months later, when the bug is yours to debug and the original prompt is long gone.

Adoption is up. Trust is down. The gap between those two lines is the discipline gap. AI is going to amplify whatever workflow you already have. If your workflow is "type until it compiles," that is what you are amplifying.

The rest of this post is the workflow I amplify instead.

The 70% problem and the math of compounding

Addy Osmani named the trap clearly in December 2024: "AI can rapidly produce 70% of a solution, but that final 30% — edge cases, security, production integration — remains as challenging as ever."

I see this in every client review. The demo works. The pull request looks clean. The tests pass — because Claude wrote the tests too, which is its own problem. And then the bug surfaces three days into production traffic. A null check that was never wired up. A timezone offset that nobody noticed. An auth header that worked in development because the dev environment was permissive. Plausible-looking code that does not actually do the thing.

Why does this happen? It is not Claude being dumb. It is compounding probability.

Imagine each AI decision in a non-trivial task as a step that succeeds 99% of the time. Picking the right library. Understanding the data shape. Handling the empty case. Naming the variable. Calling the API correctly. Catching the right error. A real feature has dozens of these decisions in sequence.

One hundred decisions at 99% success: 0.99 to the power of 100. That is about 37%. A 37% chance the whole chain holds together end to end.

This is not pessimism. It is multiplication. You can see it in your own pull requests. Most of the line-by-line code is fine. The aggregate is broken because it has to be — even at 99% per decision, a hundred-step chain fails the majority of the time.

You cannot out-prompt compounding probability. You can only insert review. A workflow that catches errors at three or four checkpoints during the work — instead of one big review at the end — bends the math back in your favor. That is not bureaucracy. That is arithmetic.
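
A back-of-the-envelope version of that arithmetic, assuming the 99%-per-decision model above and that a review checkpoint catches a segment's errors before they can compound:

// Probability an unreviewed 100-decision chain holds end to end,
// versus each 25-decision segment holding up to its next review gate.
const perDecision = 0.99;

const unreviewedChain = Math.pow(perDecision, 100); // ≈ 0.37 — the 37% above
const gatedSegment = Math.pow(perDecision, 25);     // ≈ 0.78 per segment

// With a review after every ~25 decisions, a failure costs you one segment's
// worth of rework instead of the whole feature.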

CLAUDE.md — where Claude's memory lives

Every project I run with Claude Code has a file at the root called CLAUDE.md. It is the first thing Claude reads on every session. It is the closest thing the AI has to memory. It is also the smallest, cheapest investment that gives you the biggest return.

Per Anthropic's documented memory hierarchy, CLAUDE.md files cascade — there is one at ~/.claude/CLAUDE.md for your global preferences, one at the project root that everyone working on the repo sees, and optional per-directory files for module-specific rules. I commit the project root one. I do not commit my personal one.
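
Roughly, the cascade looks like this; the per-directory path is an illustrative example, not a requirement:

~/.claude/CLAUDE.md           # personal preferences, follows you everywhere, not committed
<repo>/CLAUDE.md              # project conventions, committed, read by everyone
<repo>/src/domain/CLAUDE.md   # optional module-level rules (illustrative path)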

I aim for twenty to fifty lines, structured into five short sections:

# {Project Name}

## What this is
{2-3 sentences: what the product does, what stack it runs on}

## Key directories
- src/domain/  — pure business logic, no framework imports
- src/infrastructure/  — Express, DB, external services
- tests/  — unit + integration + characterization

## Code standards
- TypeScript strict mode; explicit types required
- Test framework: vitest
- Imports: external, then internal, then relative

## Common commands
- pnpm test  — run all tests
- pnpm dev   — start dev server
- pnpm build — production build

## Anti-patterns
- Do NOT use console.log — use logger from app/utils
- Do NOT mutate objects — spread {...obj, key: value}
- Never use a double-hyphen in user-facing text; use em dash with spaces
- Do NOT import infrastructure/ from domain/

The single most important section is the last one. Anti-patterns with concrete // DO NOT examples are read as constraints. Abstract rules are read as suggestions. Claude responds to the difference. So do humans, but they have the option to forget — Claude does not. That is the leverage.

Keep the file short. Long files dilute attention. If a rule does not earn its line, cut it. I have started deleting items from CLAUDE.md when I notice myself violating them — not because the rule was wrong, but because if it could not survive my own habits, Claude was never going to internalize it either.

Boris Cherny — who created Claude Code — publicly says his own setup is "surprisingly vanilla" and that he does not customize Claude Code much (see his Threads post on how he uses Claude Code). That tracks with what I see in practice. Most projects need twenty to fifty lines, not two hundred. The file does its job by being noticed, not by being elaborate.

The compound effect is bigger than it looks. Claude starts every session with this file in its head. That is what lets it follow conventions on the first try, write commands the way you actually run them, and refuse anti-patterns without being asked. Without CLAUDE.md, the AI is a brilliant amnesiac who reintroduces himself every morning. With it, you have continuity. And continuity is what turns "works in this prompt" into "works in this codebase."

The four phases: Specify, Plan, Tasks, Implement

Anthropic's recommended workflow for Claude Code is four phases: Explore, Plan, Code, Commit. That is the foundation. On top of it I run a four-phase artifact discipline that turns ambiguous intent into shippable code: Specify, Plan, Tasks, Implement — with the Commit step folded into the TDD-driven commits inside each task. Same shape as EPCC, but with explicit gates between phases that mean a human reads the artifact before AI starts on the next one.

The workflow pairs two open-source tools, both of which I would point a new client at on day one.

GitHub's Spec Kit is the artifact half. It defines the file structure: a constitution.md that captures operating principles, a spec.md that captures intent and acceptance criteria, a plan.md that captures the technical approach, a tasks.md that breaks the work into bounded units. Each artifact has its own slash command — /speckit.specify, /speckit.plan, /speckit.tasks, /speckit.implement — that asks Claude the right questions in the right order. Open source, framework-agnostic, works with Claude Code, Copilot, Cursor, and Gemini.
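
The artifacts end up in the repository as plain markdown. The exact paths depend on the Spec Kit version you install, but the shape is roughly this (directory and feature names here are illustrative):

memory/
  constitution.md              # operating principles, written once per project
specs/
  001-tournament-staking/      # one folder per feature
    spec.md                    # intent + acceptance criteria  (/speckit.specify)
    plan.md                    # technical approach            (/speckit.plan)
    tasks.md                   # bounded, ordered units        (/speckit.tasks)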

Anthropic's official superpowers plugin is the discipline half. It bundles a set of skills that enforce the rituals around the artifacts: brainstorming insists on a design conversation before any code, writing-plans decomposes work into 2-to-5-minute tasks before implementation, test-driven-development enforces RED → GREEN → REFACTOR, verification-before-completion demands evidence — actual command output — before I am allowed to claim something is fixed.

Spec Kit is the what. Superpowers is the how. Either one alone leaks. Plenty of teams use Spec Kit and still vibe-code their way through implementation because nothing is enforcing the test-first habit. Plenty of others run superpowers and skip the spec because nobody made writing one a precondition. Together, they close the gaps where AI usually wanders off.

Here is what each phase actually does in practice.

Specify answers what we are building and why. Five components, in order: functional requirements, acceptance criteria, input/output examples, constraints, and — the one most teams skip — out of scope. AI expands scope by default. The line you do not draw is the line AI will cross. I write the explicit "do not implement X, Y, Z in this feature" list before I write anything else, even if Z feels obvious.

Acceptance criteria use EARS notation — short, machine-parsable lines like "WHEN the user submits a checkout form without a payment method, the system SHALL reject the order AND display a recoverable error." EARS was published in a 2009 IEEE paper and has nothing to do with AI; it just happens to be unambiguous, and unambiguous is what AI needs.
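
A hypothetical spec.md fragment for that checkout example; the feature, criteria, and exclusions are all invented for illustration:

## Acceptance criteria (EARS)
- WHEN the user submits a checkout form without a payment method,
  the system SHALL reject the order AND display a recoverable error
- WHEN the payment provider times out, the system SHALL retry once
  AND surface a retriable failure, not a silent success

## Out of scope
- Saved payment methods
- Multi-currency checkout
- Refund flows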

Plan answers what technical shape it should take. The output is a plan.md with the architectural sketch, the data model, the integration points, the risks, and an explicit "if this fails, here is the rollback." This is the highest-leverage gate in the workflow — the last cheap moment to change direction before tasks get cut.
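
A hypothetical plan.md skeleton following that structure; the table, endpoint, and flag names are invented for illustration:

## Approach
Extract payment validation into a pure domain service; infrastructure code calls it.

## Data model
New payment_attempts table: order_id, provider, status, created_at.

## Integration points
- /api/checkout handler
- Monitoring: alert on validation error rate

## Risks
- Legacy orders have no payment_attempts rows; backfill or tolerate nulls.

## Rollback
Feature flag checkout_v2; flip it off and the old path is still deployed.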

Tasks breaks the plan into 30-to-60-minute units of AI work, each independently testable. Markers like [P] indicate parallelizable tasks. The list is ordered. Nothing is "TBD" — if I do not know how to bound a task, the spec is wrong, not the task list.

Implement is the only phase where AI is in the driver's seat — and only inside one task at a time, with the test running before and after every change. Specify, Plan, and Tasks all stay in human hands. The artifact is mine; the implementation is Claude's. That flip is what makes the workflow work. Every gate between phases is a place where I can stop, edit, restart, or reverse course without losing more than the last task's worth of work.

Two more pieces of this discipline are worth their own section: the Claude Code modes that compound the most, and the specific class of bug that only a fresh-context reviewer can see.

Two free tactics that compound

Most of what I have described costs effort. The two tactics in this section are nearly free, and they are the ones I use the most.

The first is Plan Mode. Claude Code has a read-only mode where the AI can see every file and run every search, but cannot edit anything, run shell commands, or write to disk. Toggle it on, paste your task, and Claude reads, plans, and writes a markdown plan back at you — without touching the codebase. Then you toggle it off and let Claude execute the plan.

I run Plan Mode before every non-trivial change. It costs me sixty seconds and saves me — conservatively — twenty minutes of cleanup per week. The plan that comes back is also a great input to Spec Kit's /speckit.plan, since it surfaces files and risks I had not thought through.

The second tactic is Writer-Reviewer with /clear. The setup: ask Claude to do a thing. When it claims to be done, type /clear to wipe the context, then in the same session ask Claude to review the diff. The reviewer is technically the same model, but with no memory of the writer's reasoning or assumptions.
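
The exact wording matters less than the reset; a sketch of the sequence, with illustrative prompts:

> Implement T4 from tasks.md: extract the validation helper from /api/order.
  ... Claude writes the code, tests pass, it claims to be done ...

> /clear

> Review the uncommitted diff (git diff) as if you have never seen this
  codebase. List anything that looks wrong, risky, or inconsistent with
  CLAUDE.md before I commit.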

This catches a specific class of bugs that pure in-session review misses — cases where the writer's mental model was wrong from step one and every subsequent decision compounded the error. The writer cannot see those bugs because the writer's mental model is the bug. A fresh-context reviewer comes at the diff cold and asks "wait, why is this casting userId to a number when the schema says it is a UUID?"

In my experience, the post-/clear review catches a meaningful share of issues that in-session review misses — roughly the lift you would expect from a second pair of eyes, except the second pair never gets tired and never has Friday-afternoon brain. It is not magic. It is just a different head reading the code. If you only adopt one tactic from this post, adopt this one.

Both Plan Mode and the /clear trick are also free of any plugin or extra tool. They ship with Claude Code as it stands.

Tasks sized for thirty minutes of AI work, not three hours

The single biggest lever inside the Implement phase is task size. Too small, and the overhead of starting and committing each one drowns the value. Too large, and Claude drifts: the context fills up, the test runs become unreliable, and you find yourself reviewing a sprawling diff at the end.

The size that works for me is thirty to sixty minutes of AI work per task. Each task ships its own commit and is independently verifiable. A real tasks.md looks like this:

- [ ] T1: Add /api/health endpoint  [P]
- [ ] T2: Wire health check to monitoring  [P]
- [ ] T3: Write characterization test for legacy /api/order  [blocks T4]
- [ ] T4: Refactor /api/order — extract validation helper
- [ ] T5: Add e2e test for /api/order happy path

The [P] markers say which tasks can run in parallel. [blocks Tx] markers express dependencies. T3 has to ship before T4 because you do not refactor what you cannot characterize.

Inside each task I run a tight loop: write the failing test, watch it fail, write the minimal code to pass, watch it pass, commit. This is just TDD; the superpowers test-driven-development skill enforces the cycle when I am moving fast and tempted to skip steps. The loop matters because every committed step is a known-good rollback point. If a task goes off the rails, I lose minutes, not hours.
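
Here is what the RED step of that loop might look like in this stack; stakedShare and its module path are hypothetical, invented for illustration:

// RED: the test is written and run before the helper exists, so it fails first.
import { describe, it, expect } from "vitest";
import { stakedShare } from "../src/domain/staking"; // hypothetical module, does not exist yet

describe("stakedShare", () => {
  it("applies markup to the sold percentage", () => {
    // Selling 10% of a $100 buy-in at 1.2 markup costs the backer $12.
    expect(stakedShare({ buyIn: 100, soldPct: 0.1, markup: 1.2 })).toBe(12);
  });
});

GREEN is the smallest stakedShare that makes this pass; only after it is green do I refactor, and only then commit.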

One operational rule: I watch context utilization. Once Claude's context is over about 70%, output quality falls off — the reasoning gets sloppy, the code starts repeating earlier patterns wrong. When I see that bar climb, I either run /compact, start a new session, or stop. Pushing past it is wasted tokens and wasted time.

What this looks like on a real project

To make this concrete: last month I shipped a tournament-staking system for MTT Tracker, the poker analytics product I run on the side. Players sell shares of their tournament action to backers; the tracker now models the full economy — markups, settlements, swaps, staking-adjusted P&L.

One proposal.md defined motivation and scope. The Out of Scope section explicitly excluded a backer-facing portal, payment integration, a staking marketplace, and tax-reporting features — none of those were "obvious" exclusions, and every one of them is something AI would happily have expanded into. Drawing those lines up front saved hours of "let me add this while we are here" detours.

The tasks.md grew to 195 ordered, bounded items, staged across three phases: feature flag and gating, schema and calculations, then UI and analytics. Each task carried a TDD micro-loop — write the failing test, watch it fail, write the minimal code, watch it pass, commit. Six locale files updated in lockstep with every UI change because the rule was in CLAUDE.md.

The interesting part is not how fast it shipped. It is that nothing leaked outside the spec. (For more on the product itself, I wrote a longer MTT Tracker case study here.)

What I would skip — and what changed in 2026

Three things this post does not recommend, even though they show up in plenty of agentic-engineering write-ups.

Heavyweight multi-agent teams for sub-day features. Spinning up a "team" of specialized agents — backend, frontend, devil's advocate, reviewer — is genuinely useful for cross-cutting work that takes a couple of days. For a one-hour change, single-agent EPCC plus the Writer-Reviewer trick described above is faster and produces less noise.

Magic-word thinking budgets. Throughout 2025, prompts laced with ultrathink or think harder were a real lever — they bumped Claude's reasoning depth. As of January 16, 2026, Anthropic deprecated those keywords. Current models manage thinking budgets adaptively. If you see them in someone's CLAUDE.md, that file has not been updated this year.

The "AI coding makes you faster" headline as a single number. Even METR — the team that ran the original 2025 −19% studyupdated their estimate in early 2026 to roughly −18% with wide confidence intervals and an explicit caveat that the data is "very weak evidence." The honest framing is not that AI is universally slower. It is that AI without discipline is a coin flip on mature codebases. With discipline, it is a multiplier. The headline number depends entirely on the workflow around the tool.

That coin-flip-versus-multiplier framing is what ThoughtWorks calls cognitive debt in their April 2026 Tech Radar. Spec-driven development is on their short list of habits that stop cognitive debt from piling up. They are right.

That is also the answer to the puzzle I opened with. The same Claude can make a senior engineer 30-40% faster on one team and 19% slower on another. The model is identical. The workflow around it picks the outcome.

If you want this discipline applied to your codebase

Reading about a workflow is one thing. Running it on your actual repository — with your actual stack, your actual deadlines, and your actual technical debt — is another. If you are building a real product and would like the workflow described here applied directly to your codebase, not in theory, that is what I do at sundr.

The two easiest first steps: try the project calculator for a quick sense of timeline and budget, or book a free thirty-minute call and tell me what you are working on. I will give you a straight answer about whether this approach fits — and if it does not, I will tell you that too. (If you are still deciding between hiring solo and an agency, my honest take on that question is in another post.)

No hard sell. Just an experienced engineer giving you a real opinion.
