How to execute Spec-Driven Development in a contemporary product team

#aiops #claude #agile #softwareengineering

I've been doing Agile in one form or another for most of my career. Standups, retrospectives, sprint planning — the whole apparatus.

And like most people who've been doing it that long, I've developed fairly strong opinions about which bits actually matter and which bits are ceremony for ceremony's sake.

A lot of us have used Jira throughout our careers. Love it or not, it's driven a lot of how we've done our work for 20+ years. We create tickets, we write content into them, we do work, we close tickets. We go again.

The bit that always mattered though, in my experience, was the acceptance criteria. Not the ticket title. Not the epic it was attached to. The specific, testable statements of what done looks like.

We've always had specifications of one type or another, whether we realised it or not. A spec is simply a blueprint for achieving something; that's it. It can be as simple or complex as you want, but it tells you, and whoever comes after you, what the intent was.

And in this new world where our LLMs work best with clear, well-defined, scope-limited instructions, specs matter more than ever. This led to the creation of Penling.

We were using AI agents (Claude Code, mostly) to do implementation work, and the output was good. Genuinely good, not "good for AI" good. But it was inconsistent in a way I couldn't immediately put my finger on. We were working harder than we needed to, just to keep the LLM within the bounds of the task we were focusing on at the time.

Some tasks came back exactly right. Others came back technically correct but somehow not quite what I'd had in mind, or they included bits I wasn't ready to build. The code worked. The tests passed. But there'd be decisions baked into the implementation that I hadn't thought about, and sometimes they were fine and sometimes they created problems down the track.

The variable, I eventually worked out, was how precisely I'd defined the work before handing it over.

When I'd been specific — here's what it needs to do, here's what it explicitly doesn't need to do, here are the constraints — the output was reliable. When I'd been vague, I was essentially asking the model to make decisions I hadn't made myself. And it would. Confidently.

This is obvious in retrospect. It's basically the same thing I'd been saying for years: the quality of our output is directly related to how well we understood the problem before we started. The AI hasn't changed that. It's just made the consequence of skipping that step arrive faster and less visibly.

A human engineer who gets a vague brief will ask a question, or make an assumption that's at least informed by being in the room for the conversation. An AI will make an assumption that's informed by its training data, which is not the same thing. It won't flag uncertainty unless you've given it a reason to. It'll just decide, and keep going.

The fix isn't better prompting. I've been down that road. You end up with increasingly elaborate instructions that are really just a specification written badly, and you spend your time negotiating with the LLM to keep it on task. Take this prompt for example:

Add a CSV export to the projects page. When I click it downloads a file named with today's date. The button shows a loading state while the file generates.

What actually works is writing a proper spec before the build starts. You may already have a PRD. The spec is what sits between that and the AI — translating intent into something executable. Four things, clearly stated:

A definition of what the work is. Specific enough that there's only one reasonable interpretation of it.

Expected results — the observable outcomes that mean it's done. Written as things you can check, not things you have to feel.

Conditions — the constraints. Performance, design system, compatibility, whatever applies. The stuff that lives in everyone's heads and doesn't usually survive the handover to a ticket.

Boundaries — what's explicitly out of scope. This one is underrated. An AI without clear boundaries will do adjacent work because adjacent work looks like being helpful. It's not misbehaving. You just didn't tell it to stop.
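One way to stop the four sections drifting back into a loose prompt is to give them a fixed shape. The sketch below is only an illustration of that idea in TypeScript: the `Spec` type and the `renderSpec` function are hypothetical names, not part of Penling or of any agent's API, and the plain-text layout it produces is an assumption.

```typescript
// Hypothetical shape for a four-section spec; the field names mirror the
// sections described above and are not taken from any real tool.
interface Spec {
  title: string;
  definition: string;
  expectedResults: string[];
  conditions: string[];
  boundaries: string[];
}

// Render one list section as a heading followed by "- " bullets.
function renderSection(heading: string, items: string[]): string {
  return [`${heading}:`, ...items.map((item) => `- ${item}`)].join("\n");
}

// Turn a Spec into the plain-text instructions handed to the agent.
// Throws if a list section is empty, so gaps surface before the build starts.
function renderSpec(spec: Spec): string {
  const sections: Array<[string, string[]]> = [
    ["Expected results", spec.expectedResults],
    ["Conditions", spec.conditions],
    ["Boundaries", spec.boundaries],
  ];
  for (const [heading, items] of sections) {
    if (items.length === 0) {
      throw new Error(`Spec section "${heading}" is empty`);
    }
  }
  return [
    spec.title,
    `Definition: ${spec.definition}`,
    ...sections.map(([heading, items]) => renderSection(heading, items)),
  ].join("\n\n");
}
```

The design choice worth noting is the refusal to render a spec with an empty section. An empty Boundaries list is usually not a statement that nothing is out of scope; it is a decision nobody has made yet.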

So, what does this look like for our CSV export example from before?

Build a CSV export feature for the projects list.

Definition: Add an "Export" button to the projects list page that downloads the current user's projects as a CSV file, containing project name, status, created date, and owner.

Expected results:
- Clicking "Export" downloads a file named projects-[YYYY-MM-DD].csv
- The CSV includes column headers: Name, Status, Created, Owner
- Only projects the current user can see are included
- The button shows a loading state while the file generates

Conditions:
- Use the existing GET /projects endpoint — do not create a new one
- Follow existing button and loading state patterns from the design system
- File generation is client-side; no new backend endpoints needed

Boundaries:
- No filtering or column selection — export all visible projects as-is
- Do not modify the projects list layout or any existing components
- If the user has more than 1000 projects, export the first 1000 only
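To show how little room the spec leaves for interpretation, here is a minimal sketch of the client-side logic it pins down. The `Project` shape, the function names, and the quoting rule are assumptions made for illustration; the real response from GET /projects and the design-system button aren't shown in this post, so fetching and the button itself are left out.

```typescript
// Assumed shape of a project as returned by GET /projects (hypothetical).
interface Project {
  name: string;
  status: string;
  created: string; // assumed to be an already formatted date string
  owner: string;
}

const MAX_EXPORTED_PROJECTS = 1000; // boundary: export the first 1000 only
const HEADERS = ["Name", "Status", "Created", "Owner"];

// Quote a field when it contains a comma, quote, or line break,
// doubling any embedded quotes (the convention described in RFC 4180).
function escapeCsvField(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

// Build the CSV text: header row first, then at most 1000 projects as-is.
function projectsToCsv(projects: Project[]): string {
  const rows = projects
    .slice(0, MAX_EXPORTED_PROJECTS)
    .map((p) => [p.name, p.status, p.created, p.owner]);
  return [HEADERS, ...rows]
    .map((row) => row.map(escapeCsvField).join(","))
    .join("\r\n");
}

// File name in the form projects-YYYY-MM-DD.csv, using the local date.
function exportFileName(today: Date): string {
  const yyyy = today.getFullYear();
  const mm = String(today.getMonth() + 1).padStart(2, "0");
  const dd = String(today.getDate()).padStart(2, "0");
  return `projects-${yyyy}-${mm}-${dd}.csv`;
}
```

In a browser, the string from `projectsToCsv` would typically be wrapped in a `Blob` and downloaded under the name from `exportFileName(new Date())`. Each expected result and boundary maps onto a line you can point at in review, which is the property the vague prompt couldn't give you.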

This is what spec-driven development is, at its core. You drive the build from the spec. The spec is the contract. The AI executes against it.

There's no arguing that there's more text in the spec than in the prompt in these examples. But that's the point really, isn't it? We've been more specific about what we want to achieve, told the LLM how to do it, where to stop and what we expect to see at the end.

The outcome is pretty much what you'd hope for if you had asked a teammate to execute your clearly defined instructions: focused output, no surprises, and clean PRs that are easier to review because you know what the PR was supposed to accomplish.

The less obvious outcome is that writing the spec forces the conversation that usually happens in code review to happen before anyone's written anything. Which is where it should have been all along.

Penling is the tool we built around this workflow — a shared workspace where the product thinking happens. Goals, requirements, the spec itself. In practice that means your PRD and your AI's build instructions aren't two separate things maintained in two separate tools. Penling derives the spec from the broader product intent, hands it directly to an AI agent via MCP, and keeps the full chain of reasoning attached to the PR at the end. One place, from first thought to merged code.

But you don't need a fully featured toolkit like Penling to try this; a shared doc with the four sections we described earlier is enough to see whether it changes anything in the way you interact with your LLM.

In my experience it does. Quite significantly.
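If you go the shared-doc route, a small check can catch a spec that's missing a section before it reaches the agent. This is a rough sketch under one assumption: that each section in the doc starts a line with its heading followed by a colon, as in the CSV export example. `missingSections` is a hypothetical helper, not something Penling provides.

```typescript
// The four section headings used in this post.
const REQUIRED_SECTIONS = ["Definition", "Expected results", "Conditions", "Boundaries"];

// Return the headings that never appear at the start of a line followed by a colon.
function missingSections(docText: string): string[] {
  const lines = docText.split(/\r?\n/).map((line) => line.trim().toLowerCase());
  return REQUIRED_SECTIONS.filter(
    (heading) => !lines.some((line) => line.startsWith(`${heading.toLowerCase()}:`)),
  );
}
```

Run against the original one-line CSV prompt, this reports all four headings as missing. That won't tell you whether a spec is any good, only whether the questions were asked.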

More on the methodology: What is spec driven development?