Bringing ideas I've been thinking about for months to life has never been easier, thanks to AI agents. The basic loop is simple: give it a prompt, it builds the whole feature, the result looks good. Done. It takes only minutes to build what would otherwise have taken hours.
Yes, I know, everyone's doing that. Right?
The reason I'm opening like this is to point out what happened afterwards.
I tried to use the search bar, and it fired a request on every keystroke.
Wait, what? I didn't do that. Of course I'd add a debounce here. But the agent didn't. Why? I didn't ask it to. I said—build me a search bar, and it built me one that works; but I didn't say exactly what I wanted.
Also, I noticed that the search button changes color on hover, even though I'd already told it not to do that. The agent forgot; it hallucinated. So what was missing?
What was missing was that I hadn't given the agent the exact decisions for the feature, or a proper reference point to fall back on when it drifted. In other words, I hadn't given it a proper spec. So it made the hidden decisions itself, even though it pulled the feature off. This is the core problem that Spec-Driven Development (SDD) addresses.
The Hidden Product Decisions Your AI Agent Is Making For You
Here's what happens when you describe something to an AI agent and it generates code: lots of decisions get made. Let's take the search bar implementation as an example. Does the filtering happen on the client or the server? Does the URL update so results are shareable? What does an empty query show? Everything, or nothing?
I tend to miss nitty-gritty details when reviewing tons of AI-generated code in a short amount of time. The code works, the UI looks right, I move on. But every one of those questions is a decision that belongs to my product. If I don't make those decisions consciously, the agent makes them based on whatever pattern shows up most often in its training data.
Take that search filter. Left to its own devices, the AI gives you something like this:
```jsx
onChange={(e) => fetchResults(e.target.value)}

const filtered = results.filter(item =>
  item.name.includes(query) && activeCategory === item.category
)

if (!query) return <div />
```
Looks fine. But AND logic might be wrong for my product; maybe users expect OR. Returning nothing on an empty query might be backwards from what my users need. And that onChange is silently hitting my backend on every keystroke, firing off requests and abandoning them mid-flight.
The agent did nothing except follow the conventions in its training data. Convention just wasn't the answer for my specific situation, and I never got the chance to weigh in.
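To make those buried choices visible, here is a minimal sketch in which each one becomes a named option rather than a silent default. The function `filterResults` and its options `matchMode` and `emptyQuery` are hypothetical names chosen for illustration, and the case-insensitive matching is an assumption too.

```javascript
// Hypothetical sketch: each decision the snippet above hid is now a named option.
function filterResults(items, { query, category, matchMode, emptyQuery }) {
  // Assumption: matching ignores case and surrounding whitespace.
  const text = query.trim().toLowerCase();

  // Decision 1: with no query and no filter, show everything or nothing?
  if (text === "" && category === null) {
    return emptyQuery === "all" ? items : [];
  }

  return items.filter((item) => {
    const checks = [];
    if (text !== "") checks.push(item.name.toLowerCase().includes(text));
    if (category !== null) checks.push(item.category === category);
    // Decision 2: must an item match all active criteria (AND) or any one (OR)?
    return matchMode === "all" ? checks.every(Boolean) : checks.some(Boolean);
  });
}

const items = [
  { name: "Red Mug", category: "kitchen" },
  { name: "Red Scarf", category: "clothing" },
  { name: "Blue Pan", category: "kitchen" },
];

const andMatches = filterResults(items, {
  query: "red", category: "kitchen", matchMode: "all", emptyQuery: "all",
});
const orMatches = filterResults(items, {
  query: "red", category: "kitchen", matchMode: "any", emptyQuery: "all",
});

console.log(andMatches.map((item) => item.name)); // [ 'Red Mug' ]
console.log(orMatches.map((item) => item.name)); // all three items
```

Same data, same query, two different products. Neither result is wrong in general; the point is that someone has to choose.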
I used to think the fix was reviewing AI output carefully. But it doesn't always work. The problem was that by the time I was reviewing anything the decisions were already buried in code.
What a spec is, and what it isn't
"Write a spec first" can mean anything from a Post-it note to a 30-page PRD. Let me be more concrete.
It's a markdown file. One file, written before I open my editor, that answers every behavioral question I can think of. What does the component render in each state? What interactions trigger what behavior? What happens at the edges? Not aspirational, not "we'll figure this out later." Decided.
If you've done Behavior-Driven Development (BDD), a practice that defines feature behavior in plain language before writing any implementation, this is the same instinct with the tooling stripped away. BDD got the philosophy right, but the problem was it required maintaining a whole parallel infrastructure that most teams abandoned quietly after six months. A markdown file that an AI can read gets you the same thing without any of that overhead.
It's not a new idea either. GitHub shipped a tool called spec-kit built specifically for this workflow in 2025. There's a heavier version too—machine-readable specs in OpenAPI or JSON Schema, where the spec itself generates tests and mocks automatically, but that's a real tooling investment. A markdown table gets you most of the benefit in twenty minutes, which is where I'd start.
But the tooling isn't the interesting part. The interesting part is what happens when you sit down and try to write the spec.
You'll get maybe two rows into your behavior table and hit a question you've never consciously asked yourself. What does "no results" look like? A message, an illustration, or nothing? If the user types while a request is in flight, do I cancel the old one or wait? Can users share a filtered URL with a teammate, or do filters live only in component state? Who owns this file when requirements change?
That said, I must warn that a spec is only valuable if it stays current. If it drifts from what the code actually does, it becomes something worse than no spec—it's documentation that confidently lies.
How SDD Differs From Test-Driven Development (TDD)
If you write tests first, this fits naturally. Test-Driven Development (TDD), the practice of writing failing tests before writing the code that makes them pass, makes you consistent. It doesn't make you right. You can write a test that encodes the wrong behavior, and that test will defend that wrong behavior forever. Tests verify implementation, but what verifies the tests?
The spec. That's the decision source above the tests. If a test fails, fix the code. If the spec changes, you update the spec first, then the test, then the code. In that order, always. For me it doesn't feel like adding a layer on top of TDD—it is a source of truth that validates my tests.
The SDD Workflow in Practice: Spec, Tests, Implementation
Let's proceed with the same component—a search and filter UI for a product listing. The kind of thing I could prompt for without thinking.
Step 1: Defining the spec.
I opened a blank SPEC.md and started filling in scenarios. Just this forced me to make decisions I'd have otherwise skipped entirely.
The debounce question came up immediately—once I wrote "user types query" as a row, I had to fill in the behavior column. And that pulled in the next question: what if they type faster than the debounce window? What if a request is already in flight when the debounce fires?
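One possible answer to both questions, sketched under assumptions: restart the debounce window on every keystroke, and cancel any in-flight request as soon as the user types again. `createDebouncedSearch`, its `timers` parameter, and the 300 ms default are illustrative choices rather than a prescribed implementation; cancellation relies on the standard `AbortController`.

```javascript
// Hypothetical sketch: debounce keystrokes and cancel stale requests.
function createDebouncedSearch(
  fetchResults,
  delayMs = 300,
  // Timers are injectable so the policy can be tested without real waiting.
  timers = {
    set: (fn, ms) => setTimeout(fn, ms),
    clear: (id) => clearTimeout(id),
  }
) {
  let timerId = null; // pending debounce timer, if any
  let controller = null; // AbortController for the in-flight request, if any

  return function search(query) {
    // Typing again restarts the debounce window...
    if (timerId !== null) timers.clear(timerId);

    // ...and cancels whatever request is still in flight.
    if (controller !== null) {
      controller.abort();
      controller = null;
    }

    timerId = timers.set(() => {
      timerId = null;
      controller = new AbortController();
      // fetchResults is expected to forward the signal,
      // for example fetch(url, { signal }).
      fetchResults(query, controller.signal);
    }, delayMs);
  };
}
```

Wiring it up would look like `const search = createDebouncedSearch(fetchResults)` followed by `onChange={(e) => search(e.target.value)}`, assuming `fetchResults` accepts the signal as its second argument.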
The URL update requirement showed up because I asked myself: can a user share a filtered search with a friend? A product question, not a technical one. It wouldn't have occurred to me in a prompt.
Here's what the spec looked like when I was done:
SearchFilter component
| State / Interaction | Behavior |
|---|---|
| Page loads, no query or filters | Show all results |
| User types query | Debounce 300ms, then fetch |
| User types while a request is in flight | Cancel previous, start new debounce |
| Results returned | Replace list, update URL params |
| Query returns no results | Show "No results for [query]" |
| User clears query | Reset to all results immediately |
| User applies category filter | AND logic with active query, fetch now |
| Multiple filters active | AND logic—must match all |
| Request takes > 200ms | Show loading skeleton |
| Request fails | Inline error, preserve previous results |
- Case-insensitive matching
- URL updates on every search (q= and category= params)—results are shareable
- Sort order persists across filter changes
It's more rows than you'd expect for a "simple" component. That's kind of the point.
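To show how concrete a single line of the spec can get, here is a sketch of the shareable-URL note. The `q` and `category` parameter names come from the spec; the helpers `toSearchParams` and `fromSearchParams` are hypothetical, and omitting empty values from the URL is my assumption.

```javascript
// Hypothetical helpers: round-trip search state through URL params.
function toSearchParams({ query, category }) {
  const params = new URLSearchParams();
  // Assumption: empty values are left out of the URL entirely.
  if (query) params.set("q", query);
  if (category) params.set("category", category);
  return params.toString();
}

function fromSearchParams(search) {
  const params = new URLSearchParams(search);
  return {
    query: params.get("q") ?? "",
    category: params.get("category"), // null when no filter is active
  };
}

const shared = toSearchParams({ query: "red mug", category: "kitchen" });
console.log(shared); // q=red+mug&category=kitchen
console.log(fromSearchParams(shared)); // { query: 'red mug', category: 'kitchen' }
```

In a browser, the resulting string could be applied with something like `history.replaceState(null, "", "?" + shared)`. Whether each search should replace the history entry or push a new one is one more decision for the spec.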
Step 2: Writing the tests.
Then I handed the spec to the agent: "Read SPEC.md and write contract tests derived from it. Do not write any implementation yet."
The first pass came back testing things I hadn't specified. I pushed back. The second pass was better but still had invented behavior. The third pass was finally just the spec, nothing else.
This is worth mentioning: "read the spec and write tests" is not a magic prompt. The agent will add its own ideas. Treat the generated tests like a pull request—read every line.
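When reading those tests, what I'm looking for is roughly this shape: every case traceable to one row or note in the spec, and nothing else. The sketch below is an illustration, not the agent's actual output. The `filterFn(items, { query, category })` signature, the sample `catalog`, and the `runContract` helper are all hypothetical.

```javascript
// Hypothetical sketch: contract cases that each cite the spec line they come from.
const catalog = [
  { id: 1, name: "Red Mug", category: "kitchen" },
  { id: 2, name: "Red Scarf", category: "clothing" },
  { id: 3, name: "Blue Pan", category: "kitchen" },
];

const contract = [
  {
    spec: "Page loads, no query or filters: show all results",
    input: { query: "", category: null },
    expectedIds: [1, 2, 3],
  },
  {
    spec: "Case-insensitive matching",
    input: { query: "RED", category: null },
    expectedIds: [1, 2],
  },
  {
    spec: "User applies category filter: AND logic with active query",
    input: { query: "red", category: "kitchen" },
    expectedIds: [1],
  },
];

// Runs every case against an implementation and returns the spec lines it violates.
// Assumption: results keep catalog order, so the comparison is order-sensitive.
function runContract(filterFn) {
  return contract
    .filter(({ input, expectedIds }) => {
      const actualIds = filterFn(catalog, input).map((item) => item.id);
      return JSON.stringify(actualIds) !== JSON.stringify(expectedIds);
    })
    .map(({ spec }) => spec);
}

// Before any implementation exists, a stub should violate every row.
const notImplemented = () => [];
console.log(runContract(notImplemented)); // lists all three spec lines
```

A case that can't point to a spec line is invented behavior, which makes it easy to spot and push back on.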
The thing that breaks this workflow most often is asking the agent to write both the spec and the tests from scratch. If you do, they'll agree with each other, and they might both be wrong. The AI bakes its assumptions into the spec, writes tests that enforce those assumptions, and you end up with a green test suite that proves nothing. This is why the spec has to come from you. The AI can help you think through edge cases, but the decisions have to be yours before the tests are written.
Step 3: Implementation.
Once the tests were genuinely spec-derived and failing, I prompted for implementation: read the spec, make the tests pass, add nothing that isn't in the spec. Green. Done. No guessing about debounce timing, no creative interpretation of empty state. The AI moved fast because it had something specific to work against.
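As an illustration of how little room the spec leaves, here is a sketch of the state transitions behind the "Results returned" and "Request fails" rows. The reducer shape, action names, and `initialState` are my assumptions, not the code the agent produced, and the 200 ms loading-skeleton threshold is left out.

```javascript
// Hypothetical sketch: request state transitions constrained by the spec.
const initialState = { status: "idle", results: [], error: null };

function searchReducer(state, action) {
  switch (action.type) {
    case "request_started":
      // Keep the current results on screen while the new request runs.
      return { ...state, status: "loading", error: null };
    case "request_succeeded":
      // Spec row "Results returned": replace the list.
      return { status: "idle", results: action.results, error: null };
    case "request_failed":
      // Spec row "Request fails": inline error, preserve previous results.
      return { ...state, status: "error", error: action.message };
    default:
      return state;
  }
}

let state = searchReducer(initialState, {
  type: "request_succeeded",
  results: ["Red Mug"],
});
state = searchReducer(state, { type: "request_started" });
state = searchReducer(state, { type: "request_failed", message: "Network error" });
console.log(state); // results still contain "Red Mug", error is set
```

Without the spec row, clearing the list on failure would have been an equally plausible default.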
One rule I kept the whole time: if a test fails after implementation, I fix the code. Never the test. The moment you change a test to make code pass, you've quietly changed the spec—and the whole point was that the spec is yours, not the AI's.
What SDD doesn't solve
SDD doesn't help if the spec is wrong. If I specify the wrong behavior, wrong status codes, wrong business logic, or wrong security model, the implementation will faithfully build exactly that. A good spec process catches a lot of mistakes, but it's not magic. I still have to think carefully, and thinking carefully is a skill that takes practice.
Writing a good spec is also harder than it looks the first few times. The tendency is to write the happy path and stop. The edge cases, the error scenarios, the security implications—those take effort to surface.
And finally, SDD doesn't replace domain knowledge. If you don't understand the problem well enough, the spec will reflect that. Talking to users, understanding the system's context, knowing what actually matters—none of that gets skipped just because you wrote a spec first.
What SDD does is make your thinking visible before it becomes code. If the thinking is wrong, you find out faster and cheaper. If it's right, the rest of the process gets dramatically smoother.
Why AI Agents Make Spec-Driven Development More Critical Than Ever
SDD has always been good practice, but skipping the thinking used to be recoverable. You'd catch the missing case in code review, or QA, or production, where you'd patch fast and learn something.
AI collapsed that timeline. The gap between "vague idea" and "running code" used to be hours of manual work, and somewhere in those hours you'd catch the questions you hadn't answered. Now it's seconds. The questions don't surface. They get buried under code that already passes tests.
SDD puts the thinking back at the front—before the prompts, before the code, before anything runs. Not because AI is bad at generating code. Because deciding what to build is still your job, and always will be.
Spec first. Prompt later.