Tisha

Posted on Jun 1 • Edited on Jun 6

Spec-Driven Development: When Structure Helps and When It Becomes Tax

#ai #llm #softwaredevelopment #dev

Disclosure: I work at Microsoft. The views here are my own, and I've kept the tool comparisons evidence-based.

1. The Ambiguity Tax

Every vague requirement you hand an AI coding agent gets paid for later: in rework, in drift, in three files that each solved a slightly different version of the problem you never fully stated. I call this the ambiguity tax, the compounding cost of letting an automated loop run on under-specified intent. A human engineer fills gaps with judgment and a quick Slack message; an agent fills them with confident guesses and then builds on those guesses at machine speed. By the time you read the diff, the misunderstanding is load-bearing.

Spec-driven development (SDD) is, at its core, a strategy for paying this tax up front when it's cheap, instead of at review time when it's expensive. But there's a second tax most SDD advocates never mention, and it's the more interesting one.

2. First, Define the Artifact

Before the philosophy, the noun. A spec, in this context, is not a Word document handed down from a product manager. It's a versioned, reviewable artifact that carries engineering intent into the agent's context: a file (or set of files) that lives in the repo, moves through code review, and constrains what the agent generates. That's the whole shift. Intent moves out of ephemeral chat history and into something you can diff, comment on, and roll back.

3. What SDD Actually Means

Spec-driven development is the practice of making the spec, not the conversation, the primary unit of engineering work when collaborating with an AI agent. Instead of "prompt, code, fix, prompt again," you get "spec, plan, tasks, code, verify against spec." The artifact is the source of truth and the chat is just how you edit it. This sounds like a pure win. It isn't, which brings us to the tradeoff.

4. The Core Tradeoff

SDD lives between two failure modes. Too little structure produces the ambiguity tax: the agent guesses, drifts, and fragments. Too much structure produces a second tax, described by what I'll call the Law of Surplus Structure: every extra rule consumes the agent's finite reasoning budget, whether or not it reduces uncertainty. Plot total cost against the amount of structure and you get a U-shaped curve. The entire craft of SDD is finding the floor of that curve: enough structure to kill ambiguity, not so much that you're burning tokens to enforce ceremony. Hold that U-shape in your head; everything below is about locating its bottom.

Picture the two cost curves and you have the whole argument. Ambiguity cost falls fast as you add the first bits of structure, then flattens. Surplus-structure cost starts near zero and climbs as ceremony piles up. Total cost is their sum, and it bottoms out well before "maximum structure." Everything past that minimum is you paying to make the agent dumber.
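To make the shape concrete, here is a toy model of the two curves. The functional forms (exponential decay for ambiguity cost, linear growth for surplus-structure cost) and every constant are assumptions chosen for illustration, not measurements; the function names are hypothetical.

```python
import math

def ambiguity_cost(s: float, a: float = 100.0, k: float = 1.0) -> float:
    """Assumed form: the cost of guessing decays exponentially as structure s grows."""
    return a * math.exp(-k * s)

def surplus_cost(s: float, c: float = 5.0) -> float:
    """Assumed form: the cost of ceremony grows linearly with structure s."""
    return c * s

def total_cost(s: float, a: float = 100.0, k: float = 1.0, c: float = 5.0) -> float:
    """Total cost is the sum of the two taxes."""
    return ambiguity_cost(s, a, k) + surplus_cost(s, c)

def optimal_structure(a: float = 100.0, k: float = 1.0, c: float = 5.0) -> float:
    """Minimum of the toy model: solve -a*k*exp(-k*s) + c = 0 for s.

    If ceremony is already more expensive than ambiguity at s = 0,
    the best amount of structure is none at all.
    """
    if a * k <= c:
        return 0.0
    return math.log(a * k / c) / k

if __name__ == "__main__":
    best = optimal_structure()
    for s in (0.0, best, 10.0):
        print(f"structure={s:5.2f}  total cost={total_cost(s):7.2f}")
```

Under these assumed constants the minimum sits at a modest amount of structure, and the cost at ten units of structure is well above the cost at the minimum. The exact numbers mean nothing; the point is that the sum of a falling curve and a rising one has a floor, and the floor is not at the right edge.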

5. The Taxonomy: Three Levels of SDD

Birgitta Böckeler's framing is the cleanest I've found: SDD isn't one thing, it's three levels of commitment.

Level	What persists	Who edits what	The spec is…
Spec-first	Code. Spec is scaffolding.	You edit code after generation.	A starting prompt you discard.
Spec-anchored	Spec and code, kept in sync.	You edit both; spec is reviewed.	A durable contract.
Spec-as-source	Spec only. Code is a build output.	You edit only the spec.	The source of truth; code is compiled from it.

Most teams think they're doing spec-anchored. Most are actually doing spec-first with extra steps: they write a spec, generate from it, then never touch it again. That's fine, as long as you're honest that the spec was a prompt, not a contract.

6. The Canonical Lifecycle Loop

Strip away the tool branding and nearly every SDD workflow is the same six-stage loop.

Stage	Question it answers	Output
Explore	What exists? What's the terrain?	Shared understanding
Specify	What should be true when we're done?	The spec
Plan	How will we get there?	Technical approach
Tasks	What are the discrete steps?	Ordered work items
Implement	Build it.	Code
Verify	Does it match the spec?	Pass/fail + evidence

Tools differ mostly in which stages they automate, which they force you to do explicitly, and how much each artifact weighs.
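The loop can be written down as a small state machine. This is a minimal sketch, not any tool's implementation: the `Stage` enum mirrors the table above, and the choice to send a failed verification back to Specify (rather than to Implement) is an assumption made for illustration.

```python
from enum import Enum
from typing import Optional

class Stage(Enum):
    # Each value is the output that stage produces, per the table above.
    EXPLORE = "shared understanding"
    SPECIFY = "the spec"
    PLAN = "technical approach"
    TASKS = "ordered work items"
    IMPLEMENT = "code"
    VERIFY = "pass/fail + evidence"

# Enum members iterate in definition order, which is the lifecycle order.
ORDER = list(Stage)

def next_stage(current: Stage, verify_passed: bool = True) -> Optional[Stage]:
    """Return the stage that follows `current`, or None when the loop is done."""
    if current is Stage.VERIFY:
        # Assumption: a failed verification reopens the spec, not just the code.
        return None if verify_passed else Stage.SPECIFY
    return ORDER[ORDER.index(current) + 1]

if __name__ == "__main__":
    stage: Optional[Stage] = Stage.EXPLORE
    while stage is not None:
        print(f"{stage.name:<10} -> {stage.value}")
        stage = next_stage(stage)
```

A tool that "automates" a stage is simply one that calls `next_stage` for you without asking; a tool that "forces" a stage is one that refuses to advance until the artifact exists.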

7. The Ecosystem, Reframed by Architecture

Most SDD tool round-ups list features. It is more useful to sort tools by the architectural layer they operate on, because that determines whether two tools compete or compose.

7.1 Intent Layer: "What should be true?"

These tools turn fuzzy requirements into reviewable artifacts.

Tool	Maintainer	Shape	Best for
Spec Kit	GitHub	Comprehensive, multi-file (spec/plan/tasks/contracts/constitution)	Greenfield, large teams, strict specs
OpenSpec	Fission AI	Lightweight, change-centric (~4 artifacts)	Brownfield, fast iteration
Kiro	AWS	Agentic IDE, multimodal input	AWS/Claude users
BMAD-METHOD	Community	Multi-agent, role-simulating	Enterprise-scale complexity

The headline contrast: Spec Kit optimizes for completeness, OpenSpec optimizes for review cost. Spec Kit generates roughly 800 lines of spec artifacts where OpenSpec generates roughly 250 for the same change. Whether that completeness is an asset or a tax depends entirely on your codebase, which is the whole point of this post.

7.2 Execution Layer: "Build it, and check yourself."

These don't replace the spec; they govern how the agent acts on it. Superpowers uses guided Q&A to clarify intent, then runs sub-agents behind a verification-before-completion gate. GSD manages context in waves for solo developers. HVE Core runs an RPI loop (Research, Plan, Implement) that ends in a Review step.

7.3 Orchestration Layer: "Coordinate many agents."

Squad coordinates parallel agents. BMAD-METHOD simulates a full agile team of specialized agents.

The takeaway: Intent, Execution, and Orchestration tools compose. You can pair OpenSpec (intent) with Superpowers (execution). Picking "the best SDD tool" is the wrong question; picking one tool per layer is the right one.

8. The Decision Filter

Here's the part the methodology evangelists skip: you should not always write a spec. The signal isn't team size or "best practice," it's the cost of ambiguity for this specific change.

Signal	Spec earns its keep	Spec is just ceremony
Blast radius	Touches many modules / public APIs	One file, contained
Reversibility	Hard to undo (migrations, schemas)	Trivial to revert
Ambiguity	Requirements genuinely unclear	You already know the exact diff
Audience	Others must review/maintain	Throwaway or solo-spike
Repetition	Pattern you'll repeat 10×	One-off

If most of your signals sit in the right column, the spec is the tax. Write the code.
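The filter is simple enough to encode. The sketch below assumes a plain majority vote across the five signals, which is one reading of "most of your signals"; a team could reasonably weight reversibility or blast radius more heavily. The class and field names are hypothetical.

```python
from dataclasses import astuple, dataclass

@dataclass(frozen=True)
class ChangeSignals:
    """One flag per row of the table; True means the left column applies."""
    wide_blast_radius: bool      # touches many modules or public APIs
    hard_to_reverse: bool        # migrations, schemas
    unclear_requirements: bool   # requirements genuinely unclear
    others_will_maintain: bool   # others must review or maintain it
    repeated_pattern: bool       # a pattern you will repeat

def should_write_spec(signals: ChangeSignals) -> bool:
    """Assumed rule: write a spec when a majority of signals sit in the left column."""
    votes = astuple(signals)
    return sum(votes) > len(votes) / 2

if __name__ == "__main__":
    schema_migration = ChangeSignals(True, True, True, True, False)
    config_flag = ChangeSignals(False, False, False, False, False)
    print("schema migration:", should_write_spec(schema_migration))
    print("config flag:     ", should_write_spec(config_flag))
```

The value of writing it down is less the function than the argument it forces: someone has to say out loud which column each signal falls in.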

A composite from the kind of work this filter is built for (details anonymized; treat it as illustrative, not a case study): a payments service had a settlement module nobody wanted to touch, the original authors long gone, behavior documented only by the tests that happened to pass. The task was to add a new payout currency. Every signal sat in the left column: blast radius across a dozen call sites, an irreversible ledger migration, requirements that turned out to mean three different things depending on who you asked, and a change the on-call team would own for years.

The first instinct was to let the agent loose on it. The right move was the opposite. An hour spent writing down what "settled" actually meant, in EARS form, surfaced two contradictions between the rounding rules and the reconciliation job before a single line changed. The spec didn't slow the work down; it caught the bug that would have shipped. That is the left column earning its keep. The same hour of spec-writing, spent on a one-line config flag the week before, would have produced nothing but a longer paper trail.

9. The Law of Surplus Structure

The claim, stated plainly: every artifact you add to an agent's context consumes reasoning budget, and if it doesn't reduce uncertainty, it's not governance, it's tax. This isn't a vibe; it's measurable from two independent directions.

Direction one, token cost. Jamie Telin ran OpenSpec against Spec Kit on the same task (streaming + session support for a chat app), twice, using GPT-5.2. The leaner framework won both times, and the gap was not small.

Measurement	OpenSpec	Spec Kit	Spec Kit overhead
Test 1, total tokens	~57,740	~120,947	+109%
Test 2, planning	38,117	96,298	+152%
Test 2, implementation	53,612	84,742	+58%
Test 2, total	91,729	181,040	+97%

More upfront structure nearly doubled total token usage without improving outcomes. OpenSpec also hit a higher success rate with roughly 20% fewer assistant turns and 25% fewer tool calls. (Source: Jamie Telin, "Spec Driven Development Is Wasting Tokens," Mar 2026.)
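The overhead column is plain arithmetic on the token counts, and it is worth being able to reproduce. A short sketch, using only the figures quoted in the table; the helper name `overhead_pct` is mine.

```python
def overhead_pct(lean: int, heavy: int) -> float:
    """Extra tokens used by the heavier framework, as a percentage of the leaner one."""
    return (heavy - lean) / lean * 100

# (OpenSpec tokens, Spec Kit tokens), copied from the table above.
RUNS = {
    "Test 1, total": (57_740, 120_947),
    "Test 2, planning": (38_117, 96_298),
    "Test 2, implementation": (53_612, 84_742),
    "Test 2, total": (91_729, 181_040),
}

if __name__ == "__main__":
    for label, (openspec, spec_kit) in RUNS.items():
        print(f"{label:<24} +{overhead_pct(openspec, spec_kit):.1f}%")
```

Truncated to whole percentages, the results match the table. Note where the gap concentrates: the planning phase carries far more overhead than implementation, which is what you would expect if the extra cost comes from generating and re-reading artifacts rather than from writing code.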

Direction two, a controlled study. A 2026 paper from ETH Zurich, Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents? (Gloaguen, Mündler, Müller, Raychev, Vechev; arXiv, Feb 2026), tested the intuitive belief that handing an agent a structured repository overview helps it. They evaluated two settings: established SWE-bench tasks paired with LLM-generated context files written to the agent vendors' own recommendations, and a fresh collection of real-world issues drawn from repositories that already ship developer-written context files. The result cut against the intuition. Across multiple agents and models, context files reduced task success rates compared with giving the agent no repository context at all, while raising inference cost by over 20%.

Read that twice. Both the machine-written and the human-written files made outcomes worse on balance, not better, and they did it while costing more. The agents didn't ignore the files; they obeyed them, explored more broadly, ran more tests, traversed more files, and "thought" harder without producing better final patches. I call this failure mode the compliance loop trap: the agent spends its cognitive budget satisfying the structural guardrails instead of solving the problem, and the diligence is real but misdirected. The authors' own conclusion is the thesis of this entire post: unnecessary requirements from context files make tasks harder, and human-written context should describe only minimal requirements. Everything beyond that is surplus. This is the second tax I promised in Section 1: ambiguity is expensive, and so is its overcorrection.

10. Token Economics Is Architecture

If structure has a token price, then context budget is an architectural resource to be allocated, not spent reflexively. Treat it like memory in an embedded system.

Cost driver	Mitigation
Verbose, always-loaded specs	Load specs lazily, scoped to the task
Redundant restatement across artifacts	Single source of truth per fact; reference, don't repeat
Sub-agents rebuilding context	Pass distilled state, not full history
Multi-file divergence	State checkpoints: snapshot agreed truth before fan-out

The discipline: spend tokens where they reduce uncertainty, starve everything else.
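As one way to picture "load specs lazily, scoped to the task," the sketch below selects spec sections by tag overlap under a fixed token budget. Everything here is hypothetical: the `SpecSection` type, the tagging scheme, the pre-computed token counts, and the greedy selection order are assumptions for illustration, not how any particular tool works.

```python
from __future__ import annotations

from dataclasses import dataclass

@dataclass(frozen=True)
class SpecSection:
    name: str
    tags: frozenset[str]
    tokens: int  # hypothetical pre-computed token count for this section

def select_context(
    sections: list[SpecSection], task_tags: set[str], budget: int
) -> list[SpecSection]:
    """Pick only the sections relevant to the task, within a token budget."""
    # Sections with no tag in common with the task are never loaded.
    relevant = [s for s in sections if s.tags & task_tags]
    # Greedy order: strongest tag overlap first, cheaper sections break ties.
    relevant.sort(key=lambda s: (-len(s.tags & task_tags), s.tokens))
    chosen: list[SpecSection] = []
    spent = 0
    for section in relevant:
        if spent + section.tokens <= budget:
            chosen.append(section)
            spent += section.tokens
    return chosen

if __name__ == "__main__":
    sections = [
        SpecSection("auth", frozenset({"auth", "sessions"}), 400),
        SpecSection("billing", frozenset({"billing"}), 900),
        SpecSection("style-guide", frozenset({"style"}), 1200),
        SpecSection("sessions", frozenset({"sessions"}), 300),
    ]
    picked = select_context(sections, {"sessions"}, budget=700)
    print([s.name for s in picked])
```

The design choice that matters is the default: a section that does not match the task stays out of context, instead of staying in until someone proves it is harmful.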

11. EARS: Making Natural Language Less Ambiguous

If you're going to write requirements, write them in a form that resists misreading. EARS (Easy Approach to Requirements Syntax), developed by Mavin et al. at Rolls-Royce and presented at the IEEE Requirements Engineering conference (RE'09), constrains prose into a small set of patterns, and it's been adopted at Airbus, Bosch, Dyson, Honeywell, Intel, NASA, Rolls-Royce, and Siemens. The template:

While <optional pre-condition>, when <optional trigger>, the <system name> shall <system response>.

Before, the kind of requirement an agent will happily misinterpret:

The system should handle expired tokens gracefully and clean up sessions,
making sure not to leak any sensitive data.

What's "gracefully"? Clean up when? Leak to where? Each gap is a guess waiting to happen.

After, EARS-structured and unambiguous:

WHEN an identity token expires,
THE SYSTEM SHALL invalidate the active session cache within 500ms.

IF cache eviction fails,
THEN THE SYSTEM SHALL retry up to 3 times,
log a structured JSON error with a correlation ID,
and SHALL NOT persist plain-text PII in telemetry.

Same intent, zero room for creative interpretation. Note that EARS adds words but removes uncertainty, which is exactly the trade the Law of Surplus Structure says is worth making. Structure that reduces ambiguity isn't tax; structure that merely decorates is.
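A crude lint can catch the worst offenders before a requirement reaches the agent. The sketch below is an approximation, not a full EARS parser: the regular expression only checks for the keyword clauses followed by "the <system> shall", and the list of vague words is an illustrative assumption you would tune for your own team.

```python
import re

# Zero or more leading clauses (While/When/Where ..., or If ..., then),
# followed by "the <system name> shall <response>".
EARS_PATTERN = re.compile(
    r"^\s*(?:(?:while|when|where)\b.+?,\s*|if\b.+?,\s*then\s+)*"
    r"the\s+.+?\s+shall\b.+",
    re.IGNORECASE | re.DOTALL,
)

# Illustrative list only; these words tend to hide an unstated decision.
VAGUE_WORDS = {"gracefully", "should", "appropriately", "properly"}

def lint_requirement(text: str) -> list[str]:
    """Return a list of problems; an empty list means the lint found nothing."""
    problems = []
    if not EARS_PATTERN.match(text):
        problems.append("does not follow an EARS clause pattern")
    words = set(re.findall(r"[a-z\-]+", text.lower()))
    for word in sorted(VAGUE_WORDS & words):
        problems.append(f"vague term: {word!r}")
    return problems

if __name__ == "__main__":
    before = (
        "The system should handle expired tokens gracefully and clean up "
        "sessions, making sure not to leak any sensitive data."
    )
    after = (
        "WHEN an identity token expires,\n"
        "THE SYSTEM SHALL invalidate the active session cache within 500ms."
    )
    print(lint_requirement(before))
    print(lint_requirement(after))
```

A passing lint says only that the sentence has the right shape. It cannot tell you whether 500ms is the right number, which is the "precision is not correctness" problem in miniature.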

12. The Reality Check

Six failure modes I've watched SDD run into. None is a reason to abandon it; each is a reason to apply the decision filter.

Review overload. A spec that generates 800 lines of artifacts moves the bottleneck from writing code to reviewing specs. You haven't removed work, you've relocated it. If spec review is slower than the code review it replaced, the spec is tax.

False control. A detailed spec feels like control, but the agent can satisfy every line and still produce something wrong, because the spec encoded your misunderstanding faithfully. Precision is not correctness.

Spec/code drift. In spec-anchored workflows, the spec and code diverge the moment someone edits code directly and skips the spec. Now you have two sources of truth and no way to know which is right. Drift turns a contract back into a stale comment.

The multi-file divergence trap. When an agent fans out across many files, each can drift toward a different interpretation. State checkpoints, snapshotting agreed truth before parallel work, are the only reliable defense.
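One minimal way to implement a state checkpoint is to fingerprint the agreed spec before fan-out and have every sub-agent result carry the fingerprint it worked from. This is a sketch under assumptions: the `AgentResult` shape and the choice of a SHA-256 content hash as the checkpoint ID are hypothetical, not taken from any of the tools above.

```python
import hashlib
from dataclasses import dataclass

def checkpoint(spec_text: str) -> str:
    """Fingerprint the agreed spec; any edit to the text changes the ID."""
    return hashlib.sha256(spec_text.encode("utf-8")).hexdigest()

@dataclass(frozen=True)
class AgentResult:
    file_path: str
    based_on: str  # checkpoint ID the sub-agent was handed before it started
    patch: str

def split_results(
    current_spec: str, results: list[AgentResult]
) -> tuple[list[AgentResult], list[AgentResult]]:
    """Separate results built on the current spec from results built on a stale one."""
    current = checkpoint(current_spec)
    accepted = [r for r in results if r.based_on == current]
    stale = [r for r in results if r.based_on != current]
    return accepted, stale

if __name__ == "__main__":
    spec_v1 = "WHEN a token expires, THE SYSTEM SHALL invalidate the session cache."
    spec_v2 = spec_v1 + " THE SYSTEM SHALL log the eviction."
    results = [
        AgentResult("cache.py", checkpoint(spec_v2), "..."),
        AgentResult("auth.py", checkpoint(spec_v1), "..."),  # started before the edit
    ]
    accepted, stale = split_results(spec_v2, results)
    print("accepted:", [r.file_path for r in accepted])
    print("stale:   ", [r.file_path for r in stale])
```

This catches only one kind of divergence, namely work done against an outdated spec. Two agents reading the same checkpoint differently still need a human or a verification step to notice.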

Natural language bottoms out. Even EARS can't make "intuitive UX" machine-precise. Some intent is irreducibly fuzzy, and pretending otherwise just produces confident wrong answers.

Spec-as-source repeats old risks. "Edit only the spec, regenerate the code" is the dream, but it reinvents the problems of code generation: opaque output, debugging a thing you didn't write, and trusting a compiler you can't fully inspect.

13. Adoption Strategy

Don't roll out SDD as a mandate. Roll it out where the ambiguity tax is highest, prove it, then expand.

Phase	Focus	Goal
Weeks 1 to 2	Pick one high-blast-radius, high-ambiguity workstream	Feel where specs earn their keep
Weeks 3 to 4	Add EARS for the requirements that bite	Reduce misinterpretation, measure review time
Month 2	Introduce one execution-layer tool (e.g., a verification gate)	Catch spec/code drift automatically
Month 3	Codify your own decision filter	Make "spec or skip?" a team reflex, not a ritual

The goal isn't "we do SDD now." It's "we know exactly when SDD pays, and we skip it when it doesn't."

Closing

Spec-driven development is not a methodology you adopt wholesale. It's a cost-management strategy for the two taxes that bracket every AI-assisted change: the ambiguity tax on the left, the surplus-structure tax on the right. Good engineering is finding the bottom of that curve, per change, not per team. So the rule is simple, and it's the whole post in one line:

Spec it when ambiguity is expensive. Skip it when the code is cheaper than the ceremony.
