DEV Community

Tisha Chawla


Spec-Driven Development: When Structure Helps and When It Becomes Tax

Disclosure: I work at Microsoft. The views here are my own, and I've kept the tool comparisons evidence-based.


1. The Ambiguity Tax

Every vague requirement you hand an AI coding agent gets paid for later: in rework, in drift, in three files that each solved a slightly different version of the problem you never fully stated. I call this the ambiguity tax, the compounding cost of letting an automated loop run on under-specified intent. A human engineer fills gaps with judgment and a quick Slack message; an agent fills them with confident guesses and then builds on those guesses at machine speed. By the time you read the diff, the misunderstanding is load-bearing.

Spec-driven development (SDD) is, at its core, a strategy for paying this tax up front when it's cheap, instead of at review time when it's expensive. But there's a second tax most SDD advocates never mention, and it's the more interesting one.


2. First, Define the Artifact

Before the philosophy, the noun. A spec, in this context, is not a Word document handed down from a product manager. It's a versioned, reviewable artifact that carries engineering intent into the agent's context: a file (or set of files) that lives in the repo, moves through code review, and constrains what the agent generates. That's the whole shift. Intent moves out of ephemeral chat history and into something you can diff, comment on, and roll back.


3. What SDD Actually Means

Spec-driven development is the practice of making the spec, not the conversation, the primary unit of engineering work when collaborating with an AI agent. Instead of "prompt, code, fix, prompt again," you get "spec, plan, tasks, code, verify against spec." The artifact is the source of truth and the chat is just how you edit it. This sounds like a pure win. It isn't, which brings us to the tradeoff.


4. The Core Tradeoff

SDD lives between two failure modes. Too little structure produces the ambiguity tax: the agent guesses, drifts, and fragments. Too much structure runs into what I'll call the Law of Surplus Structure: every extra rule consumes the agent's finite reasoning budget, whether or not it reduces uncertainty. The entire craft of SDD is finding the floor of the resulting cost curve: enough structure to kill ambiguity, not so much that you're burning tokens to enforce ceremony. Hold that U-shape in your head; everything below is about locating its bottom.

Figure: The cost of structure is U-shaped. Ambiguity cost falls as you add structure, surplus-structure cost rises, and total cost bottoms out at a sweet spot in between.

The picture is the whole argument. Ambiguity cost falls fast as you add the first bits of structure, then flattens. Surplus-structure cost starts near zero and climbs as ceremony piles up. Total cost is their sum, and it bottoms out well before "maximum structure." Everything past that minimum is you paying to make the agent dumber.
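A minimal numerical sketch of that picture, in Python. The functional forms are assumptions chosen only to match the description (an exponential decay for ambiguity cost, a quadratic for surplus-structure cost), and the constants are arbitrary; nothing here is measured data.

```python
import math


def ambiguity_cost(s: float, scale: float = 10.0, decay: float = 1.5) -> float:
    """Falls fast with the first bits of structure, then flattens (assumed form)."""
    return scale * math.exp(-decay * s)


def surplus_cost(s: float, weight: float = 0.4) -> float:
    """Starts at zero and climbs as ceremony piles up (assumed form)."""
    return weight * s ** 2


def total_cost(s: float) -> float:
    """Total cost is the sum of the two curves."""
    return ambiguity_cost(s) + surplus_cost(s)


# Scan hypothetical "structure levels" from 0.0 to 6.0 in steps of 0.1.
levels = [i / 10 for i in range(61)]
best = min(levels, key=total_cost)

print(f"sweet spot at structure level {best:.1f}")
print(f"cost at zero structure:    {total_cost(0.0):.2f}")
print(f"cost at sweet spot:        {total_cost(best):.2f}")
print(f"cost at maximum structure: {total_cost(6.0):.2f}")
```

With these made-up constants the minimum lands well short of the maximum structure level, which is the only property the sketch is meant to show.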


5. The Taxonomy: Three Levels of SDD

Birgitta Böckeler's framing is the cleanest I've found: SDD isn't one thing, it's three levels of commitment.

| Level | What persists | Who edits what | The spec is… |
| --- | --- | --- | --- |
| Spec-first | Code. Spec is scaffolding. | You edit code after generation. | A starting prompt you discard. |
| Spec-anchored | Spec and code, kept in sync. | You edit both; spec is reviewed. | A durable contract. |
| Spec-as-source | Spec only. Code is a build output. | You edit only the spec. | The source of truth; code is compiled from it. |

Most teams think they're doing spec-anchored. Most are actually doing spec-first with extra steps: they write a spec, generate from it, then never touch it again. That's fine, as long as you're honest that the spec was a prompt, not a contract.


6. The Canonical Lifecycle Loop

Strip away the tool branding and nearly every SDD workflow is the same six-stage loop.

| Stage | Question it answers | Output |
| --- | --- | --- |
| Explore | What exists? What's the terrain? | Shared understanding |
| Specify | What should be true when we're done? | The spec |
| Plan | How will we get there? | Technical approach |
| Tasks | What are the discrete steps? | Ordered work items |
| Implement | Build it. | Code |
| Verify | Does it match the spec? | Pass/fail + evidence |

Tools differ mostly in which stages they automate, which they force you to do explicitly, and how much each artifact weighs.


7. The Ecosystem, Reframed by Architecture

Most SDD tool round-ups list features. More useful is to sort tools by which architectural layer they operate on, because that's what determines whether two tools compete or compose.

7.1 Intent Layer: "What should be true?"

These tools turn fuzzy requirements into reviewable artifacts.

| Tool | Maintainer | Shape | Best for |
| --- | --- | --- | --- |
| Spec Kit | GitHub | Comprehensive, multi-file (spec/plan/tasks/contracts/constitution) | Greenfield, large teams, strict specs |
| OpenSpec | Fission AI | Lightweight, change-centric (~4 artifacts) | Brownfield, fast iteration |
| Kiro | AWS | Agentic IDE, multimodal input | AWS/Claude users |
| BMAD-METHOD | Community | Multi-agent, role-simulating | Enterprise-scale complexity |

The headline contrast: Spec Kit optimizes for completeness, OpenSpec optimizes for review cost. For the same change, Spec Kit generates roughly 800 lines of artifacts where OpenSpec generates roughly 250. Whether that completeness is an asset or a tax depends entirely on your codebase, which is the whole point of this post.

7.2 Execution Layer: "Build it, and check yourself."

These don't replace the spec; they govern how the agent acts on it. Superpowers uses guided Q&A to clarify intent, then runs sub-agents behind a verification-before-completion gate. GSD manages context in waves for solo developers. HVE Core runs an RPI loop (Research, Plan, Implement) that closes with a Review step.

7.3 Orchestration Layer: "Coordinate many agents."

Squad coordinates parallel agents. BMAD-METHOD simulates a full agile team of specialized agents.

The takeaway: Intent, Execution, and Orchestration tools compose. You can pair OpenSpec (intent) with Superpowers (execution). Picking "the best SDD tool" is the wrong question; picking one tool per layer is the right one.


8. The Decision Filter

Here's the part the methodology evangelists skip: you should not always write a spec. The signal isn't team size or "best practice," it's the cost of ambiguity for this specific change.

| Signal | Spec earns its keep | Spec is just ceremony |
| --- | --- | --- |
| Blast radius | Touches many modules / public APIs | One file, contained |
| Reversibility | Hard to undo (migrations, schemas) | Trivial to revert |
| Ambiguity | Requirements genuinely unclear | You already know the exact diff |
| Audience | Others must review/maintain | Throwaway or solo-spike |
| Repetition | Pattern you'll repeat 10× | One-off |

If most of your signals sit in the right column, the spec is the tax. Write the code.

A composite from the kind of work this filter is built for (details anonymized; treat it as illustrative, not a case study): a payments service had a settlement module nobody wanted to touch, the original authors long gone, behavior documented only by the tests that happened to pass. The task was to add a new payout currency. Every signal sat in the left column: blast radius across a dozen call sites, an irreversible ledger migration, requirements that turned out to mean three different things depending on who you asked, and a change the on-call team would own for years. The first instinct was to let the agent loose on it. The right move was the opposite. An hour spent writing down what "settled" actually meant, in EARS form, surfaced two contradictions between the rounding rules and the reconciliation job before a single line changed. The spec didn't slow the work down; it caught the bug that would have shipped. That is the left column earning its keep. The same agent, pointed at a one-line config flag the week before, would have produced nothing but a longer paper trail.
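The filter can be sketched as a simple majority vote over the five signals. This is one possible encoding, not a prescribed algorithm: the class name, field names, and the equal weighting of signals are all assumptions made for illustration.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class ChangeSignals:
    """True means the signal sits in the 'spec earns its keep' column."""
    wide_blast_radius: bool     # touches many modules or public APIs
    hard_to_reverse: bool       # migrations, schemas
    unclear_requirements: bool  # requirements genuinely unclear
    others_maintain: bool       # others must review or maintain it
    repeated_pattern: bool      # a pattern you will repeat


def should_write_spec(signals: ChangeSignals) -> bool:
    """Majority vote: skip the spec when most signals say 'ceremony'."""
    votes = astuple(signals)
    return sum(votes) > len(votes) / 2


# The two changes from the composite example above.
payout_currency = ChangeSignals(True, True, True, True, True)
config_flag = ChangeSignals(False, False, False, False, False)

print(should_write_spec(payout_currency))  # True
print(should_write_spec(config_flag))      # False
```

Equal weights are the crudest choice; a team codifying its own filter might reasonably let a single signal such as irreversibility override the vote.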


9. The Law of Surplus Structure

The claim, stated plainly: every artifact you add to an agent's context consumes reasoning budget, and if it doesn't reduce uncertainty, it's not governance, it's tax. This isn't a vibe; it's measurable from two independent directions.

Direction one, token cost. Jamie Telin ran OpenSpec against Spec Kit on the same task (streaming + session support for a chat app), twice, using GPT-5.2. The leaner framework won both times, and the gap was not small.

| Measurement | OpenSpec | Spec Kit | Delta |
| --- | --- | --- | --- |
| Test 1, total tokens | ~57,740 | ~120,947 | +109% |
| Test 2, planning | 38,117 | 96,298 | +152% |
| Test 2, implementation | 53,612 | 84,742 | +58% |
| Test 2, total | 91,729 | 181,040 | +97% |

More upfront structure nearly doubled total token usage without improving outcomes. OpenSpec also hit a higher success rate with roughly 20% fewer assistant turns and 25% fewer tool calls. (Source: Jamie Telin, "Spec Driven Development Is Wasting Tokens," Mar 2026.)
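For readers who want to check the Delta column, here is a short sketch that recomputes it from the token counts in the table. The only assumption is the rounding rule: the published percentages match the raw counts when the increase is truncated to a whole percent, so that is what the helper does. The Test 1 counts are approximate in the source.

```python
def percent_increase(baseline: int, other: int) -> int:
    """Whole-percent increase of `other` over `baseline`.

    Uses integer floor division, which truncates for other >= baseline.
    """
    return (other - baseline) * 100 // baseline


# (OpenSpec tokens, Spec Kit tokens), copied from the table above.
runs = {
    "Test 1, total": (57_740, 120_947),
    "Test 2, planning": (38_117, 96_298),
    "Test 2, implementation": (53_612, 84_742),
    "Test 2, total": (91_729, 181_040),
}

for label, (openspec, spec_kit) in runs.items():
    print(f"{label}: +{percent_increase(openspec, spec_kit)}%")

# Sanity check: Test 2 planning + implementation equals the Test 2 total.
assert 38_117 + 53_612 == 91_729
assert 96_298 + 84_742 == 181_040
```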

Direction two, a controlled study. A 2026 paper from ETH Zurich, Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents? (Gloaguen, Mündler, Müller, Raychev, Vechev; arXiv, Feb 2026), tested the intuitive belief that handing an agent a structured repository overview helps it. They evaluated two settings: established SWE-bench tasks paired with LLM-generated context files written to the agent vendors' own recommendations, and a fresh collection of real-world issues drawn from repositories that already ship developer-written context files. The result cut against the intuition. Across multiple agents and models, context files reduced task success rates compared with giving the agent no repository context at all, while raising inference cost by over 20%.

Read that twice. Both the machine-written and the human-written files made outcomes worse on balance, not better, and they did it while costing more. The agents didn't ignore the files; they obeyed them, explored more broadly, ran more tests, traversed more files, and "thought" harder without producing better final patches. I call this failure mode the compliance loop trap: the agent spends its cognitive budget satisfying the structural guardrails instead of solving the problem, and the diligence is real but misdirected. The authors' own conclusion is the thesis of this entire post: unnecessary requirements from context files make tasks harder, and human-written context should describe only minimal requirements. Everything beyond that is surplus. This is the second tax I promised in Section 1: ambiguity is expensive, and so is its overcorrection.


10. Token Economics Is Architecture

If structure has a token price, then context budget is an architectural resource to be allocated, not spent reflexively. Treat it like memory in an embedded system.

| Cost driver | Mitigation |
| --- | --- |
| Verbose, always-loaded specs | Load specs lazily, scoped to the task |
| Redundant restatement across artifacts | Single source of truth per fact; reference, don't repeat |
| Sub-agents rebuilding context | Pass distilled state, not full history |
| Multi-file divergence | State checkpoints: snapshot agreed truth before fan-out |

The discipline: spend tokens where they reduce uncertainty, starve everything else.
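As a rough illustration of "load specs lazily, scoped to the task," the sketch below picks only the spec sections whose scope overlaps the modules a task touches, and stops when a token budget is spent. Everything in it is hypothetical: the `SpecSection` shape, the scope-tagging scheme, the sample sections, and the word-count stand-in for a real tokenizer.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class SpecSection:
    name: str
    scope: FrozenSet[str]  # modules this section constrains (hypothetical tags)
    text: str


def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def select_context(
    sections: List[SpecSection], touched: FrozenSet[str], budget: int
) -> List[SpecSection]:
    """Keep sections relevant to the task, cheapest first, within the budget."""
    relevant = [s for s in sections if s.scope & touched]
    relevant.sort(key=lambda s: estimate_tokens(s.text))
    chosen, used = [], 0
    for section in relevant:
        cost = estimate_tokens(section.text)
        if used + cost > budget:
            break  # sorted ascending, so nothing later fits either
        chosen.append(section)
        used += cost
    return chosen


# Hypothetical sample sections.
sections = [
    SpecSection("auth-session", frozenset({"auth"}),
                "WHEN an identity token expires, THE SYSTEM SHALL invalidate "
                "the active session cache within 500ms."),
    SpecSection("billing-rounding", frozenset({"billing"}),
                "Placeholder text for a billing-only requirement."),
    SpecSection("telemetry", frozenset({"auth", "billing"}),
                "THE SYSTEM SHALL NOT persist plain-text PII in telemetry."),
]

picked = select_context(sections, frozenset({"auth"}), budget=30)
print([s.name for s in picked])  # ['telemetry', 'auth-session']
```

Cheapest-first is just one allocation policy; ranking by how much uncertainty a section removes would be closer to the principle, but that needs a signal this sketch does not have.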


11. EARS: Making Natural Language Less Ambiguous

If you're going to write requirements, write them in a form that resists misreading. EARS (Easy Approach to Requirements Syntax), developed by Mavin et al. at Rolls-Royce and presented at the IEEE Requirements Engineering conference (RE'09), constrains prose into a small set of patterns, and it's been adopted at Airbus, Bosch, Dyson, Honeywell, Intel, NASA, Rolls-Royce, and Siemens. The template:

While <optional pre-condition>, when <optional trigger>, the <system name> shall <system response>.

Before, the kind of requirement an agent will happily misinterpret:

The system should handle expired tokens gracefully and clean up sessions,
making sure not to leak any sensitive data.

What's "gracefully"? Clean up when? Leak to where? Each gap is a guess waiting to happen.

After, EARS-structured and unambiguous:

WHEN an identity token expires,
THE SYSTEM SHALL invalidate the active session cache within 500ms.

IF cache eviction fails,
THEN THE SYSTEM SHALL retry up to 3 times,
log a structured JSON error with a correlation ID,
and SHALL NOT persist plain-text PII in telemetry.

Same intent, zero room for creative interpretation. Note that EARS adds words but removes uncertainty, which is exactly the trade the Law of Surplus Structure says is worth making. Structure that reduces ambiguity isn't tax; structure that merely decorates is.
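A small checker makes the before/after difference mechanical. This is a sketch under loud assumptions: the two regexes cover only the "when" and "if/then" shapes used in the example above (real EARS has more patterns), and the vague-word list is illustrative, seeded with "gracefully" from the example plus "should" as a stand-in for non-binding phrasing.

```python
import re
from typing import List, Optional

# Simplified patterns, matched against whitespace-normalized text.
EVENT_DRIVEN = re.compile(r"when .+?, the .+? shall .+", re.IGNORECASE)
UNWANTED = re.compile(r"if .+?, then the .+? shall .+", re.IGNORECASE)

# Illustrative only; a real team would curate its own list.
VAGUE_WORDS = {"gracefully", "should"}


def classify(requirement: str) -> Optional[str]:
    """Return the matched pattern name, or None if neither pattern fits."""
    text = " ".join(requirement.split())
    if UNWANTED.match(text):
        return "unwanted-behavior"
    if EVENT_DRIVEN.match(text):
        return "event-driven"
    return None


def vague_terms(requirement: str) -> List[str]:
    """List the flagged vague words that appear in the requirement."""
    words = set(re.findall(r"[a-z]+", requirement.lower()))
    return sorted(words & VAGUE_WORDS)


before = ("The system should handle expired tokens gracefully and clean up "
          "sessions, making sure not to leak any sensitive data.")
after_when = ("WHEN an identity token expires,\n"
              "THE SYSTEM SHALL invalidate the active session cache within 500ms.")
after_if = ("IF cache eviction fails,\n"
            "THEN THE SYSTEM SHALL retry up to 3 times,\n"
            "log a structured JSON error with a correlation ID,\n"
            "and SHALL NOT persist plain-text PII in telemetry.")

print(classify(before), vague_terms(before))          # None ['gracefully', 'should']
print(classify(after_when), vague_terms(after_when))  # event-driven []
print(classify(after_if), vague_terms(after_if))      # unwanted-behavior []
```

A pattern match only says the sentence has the right shape; it cannot tell whether 500ms is the right number.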


12. The Reality Check

Six failure modes I've watched SDD run into. None is a reason to abandon it; each is a reason to apply the decision filter.

Review overload. A spec that generates 800 lines of artifacts moves the bottleneck from writing code to reviewing specs. You haven't removed work, you've relocated it. If spec review is slower than the code review it replaced, the spec is tax.

False control. A detailed spec feels like control, but the agent can satisfy every line and still produce something wrong, because the spec encoded your misunderstanding faithfully. Precision is not correctness.

Spec/code drift. In spec-anchored workflows, the spec and code diverge the moment someone edits code directly and skips the spec. Now you have two sources of truth and no way to know which is right. Drift turns a contract back into a stale comment.

The multi-file divergence trap. When an agent fans out across many files, each can drift toward a different interpretation. State checkpoints, snapshotting agreed truth before parallel work, are the only reliable defense.

Natural language bottoms out. Even EARS can't make "intuitive UX" machine-precise. Some intent is irreducibly fuzzy, and pretending otherwise just produces confident wrong answers.

Spec-as-source repeats old risks. "Edit only the spec, regenerate the code" is the dream, but it reinvents the problems of code generation: opaque output, debugging a thing you didn't write, and trusting a compiler you can't fully inspect.
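For the multi-file divergence trap, one very small form a state checkpoint could take is a content hash of the agreed spec, recorded before fan-out and checked by each parallel worker. This is an assumed mechanism, not something any of the tools above is claimed to implement, and the function names are hypothetical. It only catches workers running against a different spec text; it cannot detect two workers reading the same text differently.

```python
import hashlib


def checkpoint(spec_text: str) -> str:
    """Fingerprint the agreed spec text before parallel work fans out."""
    return hashlib.sha256(spec_text.encode("utf-8")).hexdigest()


def is_current(spec_text: str, agreed: str) -> bool:
    """True if a worker's copy of the spec still matches the checkpoint."""
    return checkpoint(spec_text) == agreed


agreed_spec = (
    "WHEN an identity token expires, THE SYSTEM SHALL invalidate "
    "the active session cache within 500ms."
)
agreed = checkpoint(agreed_spec)

# A worker whose copy was edited mid-flight no longer matches.
edited_copy = agreed_spec.replace("500ms", "5s")

print(is_current(agreed_spec, agreed))  # True
print(is_current(edited_copy, agreed))  # False
```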


13. Adoption Strategy

Don't roll out SDD as a mandate. Roll it out where the ambiguity tax is highest, prove it, then expand.

| Phase | Focus | Goal |
| --- | --- | --- |
| Weeks 1 to 2 | Pick one high-blast-radius, high-ambiguity workstream | Feel where specs earn their keep |
| Weeks 3 to 4 | Add EARS for the requirements that bite | Reduce misinterpretation, measure review time |
| Month 2 | Introduce one execution-layer tool (e.g., a verification gate) | Catch spec/code drift automatically |
| Month 3 | Codify your own decision filter | Make "spec or skip?" a team reflex, not a ritual |

The goal isn't "we do SDD now." It's "we know exactly when SDD pays, and we skip it when it doesn't."


Closing

Spec-driven development is not a methodology you adopt wholesale. It's a cost-management strategy for the two taxes that bracket every AI-assisted change: the ambiguity tax on the left, the surplus-structure tax on the right. Good engineering is finding the bottom of that curve, per change, not per team. So the rule is simple, and it's the whole post in one line:

Spec it when ambiguity is expensive. Skip it when the code is cheaper than the ceremony.


