Andrej Karpathy recently gave a talk called "From Vibe Coding to Agentic Engineering." One line stuck with me: "People have to be in charge of this spec, this plan. Work with your agent to design a spec that is very detailed."
He is describing what happens when you stop treating AI as a magic text box and start treating it as something that builds from structured input. The casual approach works for small things. A quick script, a throwaway page, a one-off function. The moment the problem gets real, you need a spec, not just prompts.
I have been building this way for a long time now, and I want to share what I have learned from spec-driven development: what it actually looks like in practice, and why it produces better work with less correction.
Four layers
It is common for projects using Claude Code or similar tools to have only one layer of configuration: a CLAUDE.md file with some instructions and context. Maybe some rules. That is the behavioral layer. It tells the agent how to act, what conventions to follow, what to avoid.
The behavioral layer is not the spec. And on its own, it is not enough.
The spec is a separate artifact. It describes what you are building: the architecture, the API contracts, the data models, the user flows. When Karpathy suggests it is "basically the docs," he means a document detailed enough that an agent can build from it without asking you a bunch of clarifying questions (or even worse, making generalized guesses).
I built a tool called Forge that generates these specs either from scratch, from existing product documentation, or as a retrofit from existing codebases. It gathers information, or reads from the provided sources, analyzes the structure, and produces detailed planning documents: product specs, feature scoping matrices, API mappings, test plans. Other tools do similar things. Spec-kit, Superpowers, GSD. The ecosystem is growing because the need is real.
Even though I built one of these tools, I still recommend that teams build their own. I will explain why below.
The four layers in practice
Layer 1: The spec. Generated or hand-written, this is the detailed plan. Architecture, contracts, data models. The agent builds from this. If the spec is vague, the output is vague.
Layer 2: The workflows. A spec sitting in your repo is just a document. What makes it useful is the skills built around it. Skills that generate the spec, reference it when building new features, and check for drift when the architecture changes. The spec is the artifact; the workflows are the muscle.
Layer 3: The behavioral config. CLAUDE.md and rules. This tells the agent how to behave while building. Code conventions, testing requirements, commit message format, what to avoid. I have rules that enforce things like "always use Tailwind design tokens" and "each file has a single responsibility." These are not the spec. They are the guardrails.
Layer 4: The mechanical enforcement. Hooks are the nervous system. They fire automatically on events: before a commit, after a file edit, when a session starts. A rule says "run tests before committing." A hook actually prevents the commit if the tests fail. The difference between a suggestion and a gate.
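Layer 4 can start as something as plain as a git pre-commit hook. Here is a minimal sketch in Python; the pytest command is an assumption, so substitute whatever your project's test runner actually is:

```python
#!/usr/bin/env python3
# Sketch of layer 4 as a plain git pre-commit hook: save as
# .git/hooks/pre-commit and mark it executable. The pytest command is
# an assumption -- substitute your project's test runner.
import subprocess
import sys

def run_checks(cmd):
    """Run a check command; True means the gate is open."""
    return subprocess.run(cmd).returncode == 0

# When installed as the hook, finish the file with the actual gate --
# a non-zero exit status is what blocks the commit:
#   sys.exit(0 if run_checks(["pytest", "-q"]) else 1)
```

The point is the exit status: a rule only asks the agent to run tests, while the hook's non-zero exit mechanically refuses the commit.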
Layers 3 and 4 are the ones that exist in most setups. Layers 1 and 2 are where the leverage is. The agents that produce consistently good work have all four.
Why "just prompting" breaks down
Before I realized the importance of a thorough spec planning phase, I started building one of my projects by jumping right into prompting. I would describe what I wanted, the agent would produce code, I would correct what it got wrong, and we would iterate. For the first few features, it worked. The codebase was small enough that the agent could hold the full picture in context.
Then the project grew. More services, more state, more API surface. The agent started making assumptions that conflicted with decisions from three sessions ago. It would invent data models that did not match the ones we already had. I was spending more time correcting than building. My view layer was a complete mess. If you do not establish good component and styling patterns early, the chaos compounds.
That is when I built Forge. Not because I wanted to build a tool, but because I needed a better spec. I needed a structured document that captured the architecture, the contracts, the data models, and the decisions I had already made. Once I had that and retrofitted the project with it, using it to create guardrails and standards, the difference was immediate. The agent stopped guessing. The output aligned with the real architecture. The correction loop that was eating my time nearly disappeared.
The difference is not just quality. It is speed. Once you have a spec and establish rules to give guidance, the agent moves fast and stays on track. Without one, on a complex project you will likely spend more time correcting the agent than you would have spent writing the thing yourself. Even worse, you will not have a very consistent or orderly codebase.
"Build your own" is the real lesson
When I retrofitted my project with Forge, I did not just generate a spec and call it done. I built skills around it. Skills that regenerate planning documents when the codebase changes. Skills that cross-reference the spec when building new features. The spec became the center of a workflow, not a one-time artifact.
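A drift-checking skill can start out very small. Here is a toy sketch: it assumes the spec's promised API endpoints and the code's actual endpoints have already been extracted into plain lists (the extraction step itself is not shown), and it just reports the difference in both directions:

```python
# Toy sketch of a spec-drift check. Both inputs are assumed to be
# pre-extracted lists of endpoint paths: one from the spec document,
# one from scanning the codebase's route definitions.
def spec_drift(spec_endpoints, code_endpoints):
    """Report what the spec promises but the code lacks, and vice versa."""
    spec, code = set(spec_endpoints), set(code_endpoints)
    return {
        "missing_in_code": sorted(spec - code),
        "undocumented_in_spec": sorted(code - spec),
    }
```

Running this on every session start, or after each feature lands, is what turns the spec from a one-time artifact into a feedback loop.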
That is the part I think matters most. An adopted spec tool gives you structure. A spec tool you built yourself gives you structure that reflects your judgment. Your opinions are the value.
When I first started configuring agent behavior, I front-loaded everything. I had dozens of rules, all active at all times, and many of them were written for situations that had not come up yet. It felt productive and thorough.
It was not. Every token in a rule pays rent on every API call. A rule that fires once per session but loads on every turn is wasting context that could hold the actual work. The discipline is: write rules when the need is demonstrated, not when you imagine it might be. Start with a problem you observed, then write the rule that prevents it from happening again.
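For illustration, a behavioral fragment that grew out of observed problems might look like the following. The specific rules and the conventional-commit choice are examples of the shape, not a template to copy:

```markdown
## Code conventions
- Always use Tailwind design tokens; never hard-code colors or spacing.
- Each file has a single responsibility; split files that grow past one concern.

## Testing
- Run the test suite before proposing a commit.

## Commits
- Use conventional commit prefixes (feat:, fix:, chore:).
```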
The spec is a conversation
Karpathy says "work with your agent to design a spec." That word "with" is important. The spec is not something you hand down from above; it is something you build collaboratively. You start with a rough shape, and the agent fills in the detail. You iterate, and the spec gets sharper with each pass.
This is how I work every day. I do not write specs from scratch. I use tools to generate a first pass from the existing code, then I iterate with the agent. My corrections become part of the spec's history. The spec is a living document that keeps getting better as more information becomes available.
That collaborative loop is what separates spec-driven development from just writing a really detailed prompt. A prompt is static. A spec evolves.
What this means for your workflow
If you are building with AI agents and you do not have a spec layer, start with one thing: generate a structured analysis of the codebase you are working on. Not a summary. A full inventory: modules, dependencies, API surface, data models. Then give your agent that document as context before you ask it to build anything else.
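As a starting point, the inventory does not need special tooling. A short script, sketched here in Python for a Python codebase, can list each module's top-level classes and functions; a real analysis would go further and map dependencies, API surface, and data models:

```python
# Minimal sketch of a codebase inventory: walk a source tree and list,
# per module, its top-level classes and functions. This is only the
# shape of the idea -- it ignores async defs, nesting, and dependencies.
import ast
from pathlib import Path

def inventory(root):
    """Return {relative_module_path: {"classes": [...], "functions": [...]}}."""
    report = {}
    for path in sorted(Path(root).rglob("*.py")):
        tree = ast.parse(path.read_text())
        report[str(path.relative_to(root))] = {
            "classes": [n.name for n in tree.body if isinstance(n, ast.ClassDef)],
            "functions": [n.name for n in tree.body if isinstance(n, ast.FunctionDef)],
        }
    return report
```

Dump the result to a document, hand that document to the agent as context, and you have the crudest possible version of layer 1.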
You will notice the difference immediately. The agent stops asking clarifying questions. The output aligns with the real architecture instead of inventing its own. And when the agent does get something wrong, you can point to the spec and say "this is what we agreed on," which makes the correction precise instead of vague.
The spec is not extra work. It is the work that makes all the other work faster.