DEV Community: Paul Penling

The loop doesn't know what done looks like.

Paul Penling — Wed, 24 Jun 2026 09:39:32 +0000

Loop engineering has been making the rounds. Engineers at Anthropic, Google and OpenAI, people worth listening to, are saying they've stopped writing prompts. They write loops now. The idea is straightforward: instead of manually prompting an agent for each task, you design a system that prompts, runs, evaluates, and iterates until a goal is reached or a stopping condition fires. Act, observe, refine, repeat, without you in between every step.
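As a rough illustration of that shape, here is a minimal sketch in Python. The names `run_loop`, `act`, `is_done` and `max_iterations` are hypothetical and not taken from any particular agent framework; `act` stands in for whatever prompts and runs the agent, and `toy_agent` is a stand-in with no model behind it.

```python
from typing import Callable, Optional, Tuple


def run_loop(
    act: Callable[[Optional[str]], str],
    is_done: Callable[[str], bool],
    max_iterations: int = 5,
) -> Tuple[Optional[str], int, bool]:
    """Act, observe, refine, repeat until the goal is met or the budget runs out.

    Returns (last_output, iterations_used, goal_reached).
    """
    output: Optional[str] = None
    for iteration in range(1, max_iterations + 1):
        # Act: produce a new attempt, given the previous one (None on the first pass).
        output = act(output)
        # Observe and evaluate: stop as soon as the goal check passes.
        if is_done(output):
            return output, iteration, True
    # The stopping condition fired without the goal being reached.
    return output, max_iterations, False


# Toy stand-in for an agent (hypothetical, for illustration only).
def toy_agent(previous: Optional[str]) -> str:
    return "draft" if previous is None else previous + "+fix"


result = run_loop(toy_agent, is_done=lambda out: out.count("+fix") >= 2)
```

Here `result` is `("draft+fix+fix", 3, True)`: the loop stopped on the third pass because the check passed, not because the budget ran out.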

It's a real shift. Loops produce genuinely different results from one-shot prompts, in the same way a program with error handling and retry logic produces different results from a program that crashes on first failure. The pattern is worth taking seriously.

But I've noticed something in how it's being described that I think is worth examining. A lot of the framing is: tell the agent what you want, set it loose, let it keep going until it's done. And that particular instruction — "keep going until it's done" — is where I think loop engineering can go wrong in exactly the same way vibe coding does.

Vibe coding's failure mode is well-understood by now. You describe roughly what you want, the model builds something, you look at it and say "more like that but better," and you iterate until it feels right. It works for prototypes. It breaks down when you need something specific, when multiple people need to agree on what was built, or when the thing you're building has to keep being right six months from now when nobody remembers the conversation.

The failure isn't the model. The failure is the absence of a fixed definition of what success looks like. The model has nothing to converge toward except your next prompt.

A loop without a clear target has the same problem, just running autonomously. The agent makes progress, hits an evaluation signal — tests pass, lint is clean, CI goes green — and declares itself done. But "tests pass" is not the same as "we built the right thing." If the original goal was ambiguous, the loop will find a way to satisfy its stopping condition that may or may not match what was actually wanted. It can do this quickly, confidently, and at considerable cost in tokens.

This is the part that doesn't get enough attention in the loop engineering conversation: the evaluation layer is only as good as what you're evaluating against.

Tests tell you the code does what the tests expect. CI tells you the build is clean. Neither tells you whether the behaviour that shipped is the behaviour that was specified. If you haven't written down what the feature is supposed to do — concretely, with expected results and explicit constraints — the loop is optimising toward a signal that doesn't fully represent the goal.

The result tends to be expensive drift. The agent iterates, token costs accumulate, and what emerges is something that passes the available checks without actually satisfying the original intent. Not because the model failed, but because the model was never given a precise enough target to hit.

A spec fixes this. Not a prompt — a spec. A written, agreed statement of what done actually looks like: the expected behaviour, the constraints that apply, the things explicitly out of scope. An artifact the loop can validate against, not a description it interprets differently on each pass.

When a loop has a spec as its anchor, several things change. The evaluation layer has something real to measure against, not just the proxy signals of passing tests and green builds. The stopping condition becomes meaningful — the loop terminates when the output matches the spec, not when the agent runs out of ideas or token budget. And when something drifts between iterations, the spec is there to catch it: not "the previous output was better" but "this no longer satisfies condition four."
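A minimal sketch of what validating against a spec could look like, assuming each spec condition can be written as a named, numbered check over the loop's output. The `Condition` type, the two example conditions and the dictionary shape of the output are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Condition:
    """One numbered, checkable statement from the spec."""
    number: int
    description: str
    check: Callable[[Dict[str, object]], bool]


def unmet_conditions(spec: List[Condition], output: Dict[str, object]) -> List[str]:
    """Return a message for every spec condition the output fails."""
    return [
        f"condition {c.number} not satisfied: {c.description}"
        for c in spec
        if not c.check(output)
    ]


# Hypothetical spec and outputs, purely for illustration.
spec = [
    Condition(1, "tests pass", lambda o: o.get("tests_pass") is True),
    Condition(2, "no new endpoints", lambda o: o.get("new_endpoints") == 0),
]

first_pass = {"tests_pass": True, "new_endpoints": 0}
later_pass = {"tests_pass": True, "new_endpoints": 1}  # drifted between iterations

done = not unmet_conditions(spec, first_pass)
drift = unmet_conditions(spec, later_pass)
```

The loop can stop when the list of unmet conditions is empty. In the drifted pass the tests still pass, yet `drift` names the exact condition that no longer holds.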

It also contains costs in a way that "keep going until done" cannot. A loop without a target will explore. A loop with a spec will converge.

The more interesting framing for loop engineering, I think, isn't "how do we keep the agent running longer" — it's "how do we give the agent a precise enough mission that the loop terminates correctly." That's a spec problem, not a loop problem. The mechanism for running autonomously is the easy part. Knowing what done looks like, with enough precision that a model can verify it without ambiguity, is the hard part. It was always the hard part.

Penling is built around writing that spec before the loop starts — a shared workspace where product, engineering, and design align on what done looks like, producing a spec that lives in the repository and gets handed directly to the agent or loop as its input. The loop still runs. It just knows where it's going.

This is part of a series on spec-driven development. Earlier pieces cover why briefs and specs are different things and where specs tend to get written — and why that placement matters.

A brief and a spec are not the same thing.

Paul Penling — Tue, 16 Jun 2026 12:42:00 +0000

We had a moment, early on, that I think a lot of teams are having right now.

An engineer picked up a task on a Monday morning. By Monday afternoon the code was written. By Tuesday it was reviewed and merged. By Wednesday the PM was in standup asking what had actually shipped, because the PR description didn't quite match what she'd understood the task to be.

The feature worked. The tests passed. It was just... not entirely what the designer had specified, and not entirely what the PM had meant when she'd written the brief, and the engineer — reasonably — had bridged the gap the only way available to them, which was to make a call.

Nobody had done anything wrong. The brief was ambiguous in the usual ways briefs are ambiguous. In the old timeline, that ambiguity would have surfaced during implementation — an engineer would have got stuck, or asked a question, or pushed the PR back because something didn't add up. The human in the loop was slow enough that there was time to catch it.

The AI wasn't slow. It made the call and kept going.

I've come to think that speed is doing something specific to how teams break down.

The coordination mechanisms we've built up — design reviews, PM sign-offs, spec walkthroughs, even just the informal "do you have five minutes" conversation — exist because implementation is slow. Slow enough that you can run a checkpoint before the code is committed. Slow enough that changing course is annoying but not catastrophic.

When implementation compresses to hours, those checkpoints become either blockers or skipped steps. If you run them, you slow the AI down to human speed and lose most of the gain. If you skip them, you're shipping fast but without anyone having actually agreed on what was built.

Most teams I've talked to have landed somewhere in the middle, which usually means: the checkpoints happen, but they happen after the fact, when everyone's looking at something already built and merged.

The thing I keep noticing is that "we discussed this" and "we had a spec" feel like the same thing but they're not.

A brief is intent. It's the what and the why, usually written for a product audience, often under-specified in exactly the ways that matter for implementation — the conditions, the edge cases, the explicit exclusions that tell an AI agent where to stop.

A spec is executable intent. It's what you get when someone has translated the brief into something concrete enough that there's only one reasonable interpretation of it. What does done look like? What can it not do? What constraints apply? Who decided what?

The gap between them is where the coordination failure lives. The PM thinks the spec conversation happened because there was a brief. The engineer thinks it happened because the brief seemed clear. The AI doesn't know there was a gap at all — it builds against whatever it was given, confidently.

The pattern we landed on was simple enough in principle and harder in practice: the spec gets written before the build starts. Not by the engineer alone, not by the PM alone — together, in the same document, before anyone opens a terminal.

The designer writes expected results. The PM sets conditions. The lead draws the boundaries. And then — only then — the AI agent gets the spec as its input.

What that does, structurally, is move the alignment conversation to before the build rather than after it. The review of the spec becomes the design review, the PM sign-off, and the scoping conversation all at once. When the AI finishes, the PR reflects what everyone agreed to — because everyone agreed to the spec, and the spec is what the AI built against.

The speed doesn't go away. If anything, it improves. The AI isn't slower because the spec is better-defined — it's faster, because it's not making guesses you'll spend two days unwinding.

There's an organisational shift in here too, and it's the part that takes the most adjustment.

The spec has to be a team artifact. Not a prompt someone writes by themselves and hands to an agent. Not a document that lives in one person's head or one person's editor. A shared document that everyone who has a stake in the outcome has actually read and signed off on — before the code exists.

That changes the role of the PM. Instead of writing a brief and waiting for a PR, they're in the spec, setting conditions. It changes the role of the designer. Instead of reviewing output, they're shaping expected results upfront. It changes the role of the lead. The architectural constraints aren't discovered in review — they're written into the spec as boundaries before the build starts.

The AI doesn't change what good coordination looks like. It just surfaces very quickly what happens when you skip it.

Penling is the tool we built around this — a shared workspace where the spec gets written before anyone touches a terminal, handed directly to your coding agent via MCP, and committed to the repository with the PR when the work ships. Everyone who touches the work is in the same artifact, at the same time, before the AI starts building.

The standup conversation changed. It's a lot less "what did we actually ship" and a lot more "that went in exactly as planned."

This is part of a series on adopting spec-driven development. Part one is on why specs make AI output reliable. Part two covers what happens to the reasoning after the code merges.

The spec is in the wrong place

Paul Penling — Sun, 14 Jun 2026 20:33:51 +0000

My day job is at a large tech company. Hundreds of engineering teams, and every one of them is somewhere different on AI adoption. Some are still treating coding agents as a curiosity. Some have quietly rebuilt their whole workflow around them. Most are in the messy middle, and watching that middle is where the idea for Penling came from.

Spec-driven development has arrived, and I think it's the right idea. Tools like SpecKit and Kiro put the specification front and center in the coding workflow, and the results show it: an agent working from a written spec produces dramatically better output than an agent working from a vibe and a prompt.

But look at where the spec gets written. In the engineer's IDE. In the terminal. At the moment the task gets picked up.

That's a long way from where the product decisions were actually made.

Think about the journey a piece of intent takes before it reaches that terminal. A team decides something — in a planning session, in a strategy doc, in a passing conversation that should have been a meeting. That decision gets compressed into a ticket. Maybe that ticket sits in a backlog for a few weeks. Eventually an engineer picks it up, reads whatever context has been given, and writes a spec for the agent.

That spec is probably not the team's full intent, at least not in my experience of how tickets get written. It's the engineer's reconstruction of what they think the team meant, written under deadline, weeks after the actual decision. A translation of a translation. And then we hand it to an agent and let it build, fast and confidently, exactly what the spec says.

The agents are doing their job. The spec is just in the wrong place.

I've believed this for a long time, well before AI coding tools existed: the quality of work is mostly determined before the work begins. Teams consistently start executing before they've aligned on purpose and scope, and almost every painful thing that happens later — rework, drift, the feature nobody asked for — traces back to that.

Coding agents didn't create this problem. They made it urgent. Execution used to be expensive enough that misalignment surfaced slowly; you'd catch the misunderstanding somewhere in week two of building. Now execution is nearly free and nearly instant. The agent will build the wrong thing completely, beautifully, before anyone notices the brief was ambiguous.

So the gap between "we can build it" and "we agreed what it should be" is now most of the risk in a software project. Possibly all of it.

Which means the spec — the written, agreed statement of what we're building and what success looks like — shouldn't be a just-in-time artefact produced at the keyboard. It should be where the project starts. The whole team involved in creating it: product, engineering, whoever owns the outcome. Arguing about it while arguing is still cheap. Agreeing on what done means before anything gets built.

Do that, and two things happen at once. You decide the project's success where it's actually decided anyway — at the start. And you constrain the agent to the bounds the team set, not the bounds one engineer inferred from what they were given at some point down the track.

The obvious objection: don't we already have tools for this? We have project trackers. We have wikis. We have spec-driven dev tools. Between Jira, Confluence and SpecKit, surely this is covered.

I don't think it is, and the reason is that each one holds the right thing in the wrong place.

Confluence holds the team's intent, far from the code. Nothing connects the document to what gets built. It doesn't version with the code, agents don't read it, and it starts rotting the day after it's written. Everyone has opened a design doc and wondered whether it describes the system or a system that almost got built.

Jira holds the breakdown, but a ticket is intent with the reasoning stripped out. It tells you what to do this sprint and nothing about why, or what success was supposed to look like.

SpecKit holds the right artefact — a real spec, next to the code, readable by the agent. But it lives with the engineer at implementation time, which is exactly the problem. It's spec-driven development scoped to a terminal session, when the decisions it should encode were made weeks earlier, by more people than one.

The intent exists in all three places, partially, and lives durably in none of them. And there's a quieter problem underneath all of it: where the spec currently lives — in the repo — most of the team can't even reach it. If your company is anything like mine, the product managers don't have GitHub licences. So even when a real spec exists, the people who should have a say in it never see it.

If you take the placement argument seriously, the product design mostly falls out of it.

The spec has to be durable, so it lives as a markdown file committed to your repo, versioned alongside the code it describes. But it has to be authored where the whole team can reach it — in a shared environment, in a format everyone can read, not behind a licence half the team doesn't have.

And it has to come first, so the workflow starts at the brief, not the backlog. A rough idea gets shaped into a spec by the team, the spec gets broken down into buildable work, and the agent builds against it. Planning, breakdown and build all hang off the one artefact — because in this workflow they were never really separate things to begin with.

That's the whole of it. The rest is detail.

Vibe-coding a prototype is genuinely great. No spec needed, and anyone telling you otherwise is selling process. But the moment that prototype has users, a team, a roadmap — the moment it has to keep doing the right thing after the person who built it has moved on — somebody has to write down what it's meant to do. Somewhere durable. Somewhere the next person, and the next agent, will actually look.

If your team is building with agents today, it's worth knowing where your spec gets written — and who writes it. Or whether it gets written at all.

Paul is a software engineer in Sydney and the founder of Penling, a spec-driven delivery platform for teams building with LLMs.

I can't tell you why this code is the way it is.

Paul Penling — Fri, 12 Jun 2026 14:00:00 +0000

Six months ago one of the engineers on my team shipped a Microsoft SSO integration. Worked first time, tests passed, PR was clean. It went in on a Thursday afternoon.

Last week someone raised a bug. Different flow, adjacent bit of auth. I pulled up the file to understand the shape of what was there before touching it.

The PR was spotless. Fourteen files changed, good commit message, all checks green. So I did what you do — git log auth.ts. Three commits. Dates, hashes, one-liners. Nothing I couldn't have inferred from reading the code.

What I actually wanted to know was: why Entra ID and not Okta? Were other providers considered? Was the session timeout deliberate or just the default? These aren't things you can read from a diff. They're the decisions that happened before the code was written — the ones that would have been in the ticket, or the design doc, or the Slack thread, or the engineer's head.

None of that survived the merge.

This is a new failure mode, and I don't think we've fully named it yet.

With human-written code, the reasoning is imperfect but it's at least somewhere. The engineer who wrote it had to understand the problem well enough to solve it. Even if they didn't document anything, they can usually reconstruct the thinking — why this approach, why not that one, what they were optimising for. The reasoning lives in their head. It's retrievable, if they're still around.

With AI-written code, that's not how it works. The model doesn't remember. The context that drove the implementation — the spec, the clarifications, the constraints — existed for the duration of the build and then evaporated. What you're left with is the output. The code and the tests, both technically correct, neither of which tells you anything about the decisions that shaped them.

The diff shows you what changed. It doesn't show you what was considered and rejected. It doesn't show you what's explicitly out of scope. It doesn't show you who decided, or why.

The review problem is downstream of this.

When I review a PR for AI-written code, I'm approving the implementation without ever seeing the decision tree behind it. I can check whether the code is correct. I can check whether the tests cover what they claim to. What I can't check is whether the right thing was built — because the spec that defined "right" isn't in the PR.

In the old model, a developer wrote the code, so they implicitly validated the approach as they went. They'd push back if the requirements were contradictory. They'd surface tradeoffs. The review was the last check in a chain that started with someone thinking the problem through.

With AI agents, that chain can be shorter than it looks. The code can be correct and the approach can be wrong at the same time, and the PR won't necessarily tell you which it is.

The fix is obvious once you've seen the problem clearly enough. You need the spec — the thing the AI built against — committed to the repository alongside the code it produced.

Not as a comment. Not in a Notion page that'll drift out of sync. In the repo, versioned with everything else, updated when the code is updated, reviewed in the same PR.

When you do that, the question "why is it built this way?" has a one-file answer. The decision to use Entra ID is there. The explicit exclusion of Okta is there. The session timeout is documented as deliberate, or as a default that was left pending a product decision. The context survives the merge because it was committed with the merge.

This is what we mean by spec-driven development. The spec isn't just a prompt. It's the permanent record of what the work was supposed to be — why this, not that, and what's out of scope.
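One way to make "reviewed in the same PR" a habit is a small check that flags a change set where code moved but no spec moved with it. The sketch below is only an assumption about how that might look: the `specs/` directory, the `.md` extension and the list of source extensions are hypothetical conventions, and the changed paths would come from your own tooling.

```python
from pathlib import PurePosixPath
from typing import Iterable

SPEC_DIR = "specs"              # hypothetical location for committed specs
CODE_SUFFIXES = {".ts", ".py"}  # hypothetical set of source file extensions


def is_spec(path: str) -> bool:
    """True for markdown files under the top-level spec directory."""
    p = PurePosixPath(path)
    return p.parts[:1] == (SPEC_DIR,) and p.suffix == ".md"


def is_code(path: str) -> bool:
    """True for files with one of the source extensions."""
    return PurePosixPath(path).suffix in CODE_SUFFIXES


def spec_travels_with_code(changed_paths: Iterable[str]) -> bool:
    """True if the change set has no code changes, or at least one spec change."""
    paths = list(changed_paths)
    touches_code = any(is_code(p) for p in paths)
    touches_spec = any(is_spec(p) for p in paths)
    return touches_spec or not touches_code
```

For example, `spec_travels_with_code(["src/auth.ts", "specs/microsoft-sso.md"])` is `True`, while `spec_travels_with_code(["src/auth.ts"])` is `False`. A check like this only proves a spec file was touched; whether it still matches the code is a question for the reviewer.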

The thing that surprised me, when we started doing this consistently, is how much it changes code review. Not because reviews get easier, exactly, but because they get more honest. You're not just checking whether the code works. You're checking whether it matches the intent. And because the intent is written down and in the PR, you can actually do that check.

The other thing it changes is the debugging conversation six months later. Instead of git log and archaeology, you open one file. The answer is there.

It's a small shift in how the work gets organised. The impact is disproportionate.

Penling is the tool we built to make this the default rather than the exception — a shared workspace where the spec is written before the build starts, handed to the AI agent via MCP, and committed to your repository when the PR opens. The reasoning travels with the code, permanently.

But the principle works without the tooling. If you're shipping AI code today, the question worth asking is: if someone opens this file in six months, what will they find? The diff, or the reasoning?

This is part of a series on adopting spec-driven development. Part one covers why specs make AI output reliable in the first place. Part three is on what goes wrong when the team skips the spec entirely.

How to execute Spec-Driven Development in a contemporary product team

Paul Penling — Thu, 11 Jun 2026 09:23:34 +0000

I've been doing Agile in one form or another for most of my career. Standups, retrospectives, sprint planning — the whole apparatus.

And like most people who've been doing it that long, I've developed fairly strong opinions about which bits actually matter and which bits are ceremony for ceremony's sake.

A lot of us have used Jira throughout our careers. Love it or not, it's driven a lot of how we've done our work for 20+ years. We create tickets, we write content into them, we do work, we close tickets. We go again.

The bit that always mattered though, in my experience, was the acceptance criteria. Not the ticket title. Not the epic it was attached to. The specific, testable statements of what done looks like.

We've always had specifications of one type or another, whether we realised it or not. A spec is simply a blueprint for achieving something, and that's it. It can be as simple or complex as you want, but it tells you, and whoever comes after you, what the intent was.

And in this new world, where our LLMs work best with clear, well-defined, scope-limited instructions, specs matter more than ever. That is what led to the creation of Penling.

We were using AI agents to do implementation work (Claude Code, mostly) and the output was good. Genuinely good, not "good for AI" good. But it was inconsistent in a way I couldn't immediately put my finger on. We were working harder than we should have needed to, just to keep the LLM within the bounds of the task we were focusing on at the time.

Some tasks came back exactly right. Others came back technically correct but somehow not quite what I'd had in mind, or they included bits I wasn't ready to build. The code worked. The tests passed. But there'd be decisions baked into the implementation that I hadn't thought about, and sometimes they were fine and sometimes they created problems down the track.

The variable, I eventually worked out, was how precisely I'd defined the work before handing it over.

When I'd been specific — here's what it needs to do, here's what it explicitly doesn't need to do, here are the constraints — the output was reliable. When I'd been vague, I was essentially asking the model to make decisions I hadn't made myself. And it would. Confidently.

This is obvious in retrospect. It's basically the same thing I'd been saying for years: the quality of our output is directly related to how well we understood the problem before we started. The AI hasn't changed that. It's just made the consequence of not doing it arrive faster and less visibly.

A human engineer who gets a vague brief will ask a question, or make an assumption that's at least informed by being in the room for the conversation. An AI will make an assumption that's informed by its training data, which is not the same thing. It won't flag uncertainty unless you've given it a reason to. It'll just decide, and keep going.

The fix isn't better prompting. I've been down that road. You end up negotiating with the LLM through increasingly elaborate instructions, all to keep it on task, and those instructions are really just a specification written badly. Take this prompt, for example:

Add a CSV export to the projects page. When I click it downloads a file named with today's date. The button shows a loading state while the file generates.

What actually works is writing a proper spec before the build starts. You may already have a PRD. The spec is what sits between that and the AI — translating intent into something executable. Four things, clearly stated:

A definition of what the work is. Specific enough that there's only one reasonable interpretation of it.

Expected results — the observable outcomes that mean it's done. Written as things you can check, not things you have to feel.

Conditions — the constraints. Performance, design system, compatibility, whatever applies. The stuff that lives in everyone's heads and doesn't usually survive the handover to a ticket.

Boundaries — what's explicitly out of scope. This one is underrated. An AI without clear boundaries will do adjacent work because adjacent work looks like being helpful. It's not misbehaving. You just didn't tell it to stop.
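To make the four parts concrete, here is a minimal sketch of a spec as a small data structure that can report which parts are still empty and render itself as markdown. The `Spec` class, its field names and the heading layout are assumptions for illustration, not Penling's actual format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Spec:
    title: str
    definition: str
    expected_results: List[str] = field(default_factory=list)
    conditions: List[str] = field(default_factory=list)
    boundaries: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Name any of the four parts that are still empty."""
        sections = {
            "definition": self.definition.strip(),
            "expected results": self.expected_results,
            "conditions": self.conditions,
            "boundaries": self.boundaries,
        }
        return [name for name, value in sections.items() if not value]

    def to_markdown(self) -> str:
        """Render the spec as plain markdown, one section per part."""
        lines = [f"# {self.title}", "", f"Definition: {self.definition}"]
        for heading, items in (
            ("Expected results", self.expected_results),
            ("Conditions", self.conditions),
            ("Boundaries", self.boundaries),
        ):
            lines += ["", f"{heading}:"] + [f"- {item}" for item in items]
        return "\n".join(lines) + "\n"


# A draft with only the definition filled in (hypothetical content).
draft = Spec(title="CSV export", definition="Add an Export button to the projects list page.")
gaps = draft.missing_sections()
```

For this draft, `gaps` is `["expected results", "conditions", "boundaries"]`. Those are the three parts a bare prompt tends to leave out.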

So, what does this look like for our CSV export example from before? Something like this:

Build a CSV export feature for the projects list.

Definition: Add an "Export" button to the projects list page that downloads the current user's projects as a CSV file, containing project name, status, created date, and owner.

Expected results:
- Clicking "Export" downloads a file named projects-[YYYY-MM-DD].csv
- The CSV includes column headers: Name, Status, Created, Owner
- Only projects the current user can see are included
- The button shows a loading state while the file generates

Conditions:
- Use the existing GET /projects endpoint — do not create a new one
- Follow existing button and loading state patterns from the design system
- File generation is client-side; no new backend endpoints needed

Boundaries:
- No filtering or column selection — export all visible projects as-is
- Do not modify the projects list layout or any existing components
- If the user has more than 1000 projects, export the first 1000 only

This is what spec-driven development is, at its core. You drive the build from the spec. The spec is the contract. The AI executes against it.
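Because the expected results in the spec above are written as things you can check, some of them translate directly into code. The sketch below is one hypothetical way to verify an exported file after the fact; `check_export` and its parameters are assumptions, and it covers only the file name, the headers and the 1000-project cap. Visibility rules and the loading state can't be judged from the file alone.

```python
import csv
import io
from datetime import date
from typing import List

EXPECTED_HEADERS = ["Name", "Status", "Created", "Owner"]
MAX_ROWS = 1000


def check_export(filename: str, csv_text: str, today: date) -> List[str]:
    """Return the expected results from the spec that this export fails."""
    failures: List[str] = []

    # Expected result: file is named projects-[YYYY-MM-DD].csv for today's date.
    if filename != f"projects-{today.isoformat()}.csv":
        failures.append("file name is not projects-[YYYY-MM-DD].csv for today")

    # Expected result: the first row holds the agreed column headers.
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows or rows[0] != EXPECTED_HEADERS:
        failures.append("column headers are not Name, Status, Created, Owner")

    # Boundary: never more than the first 1000 projects.
    if len(rows) - 1 > MAX_ROWS:
        failures.append("more than 1000 projects exported")

    return failures


good = "Name,Status,Created,Owner\nApollo,Active,2026-01-05,Sam\n"
problems = check_export("projects-2026-06-11.csv", good, date(2026, 6, 11))
```

An empty `problems` list means the export satisfies the parts of the spec this check covers. A non-empty list names what was missed in the spec's own terms.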

There's no arguing that there's more text in the spec than in the prompt in these examples. But that's the point, really, isn't it? We've been more specific about what we want to achieve, told the LLM how to do it and where to stop, and said what we expect to see at the end.

The outcome is pretty much what you'd hope for if you had asked a teammate to execute your clearly defined instructions: focused output, no surprises, and clean PRs that are easier to review because you know what the PR was supposed to accomplish.

The less obvious outcome is that writing the spec forces the conversation that usually happens in code review to happen before anyone's written anything. Which is where it should have been all along.

Penling is the tool we built around this workflow — a shared workspace where the product thinking happens. Goals, requirements, the spec itself. In practice that means your PRD and your AI's build instructions aren't two separate things maintained in two separate tools. Penling derives the spec from the broader product intent, hands it directly to an AI agent via MCP, and keeps the full chain of reasoning attached to the PR at the end. One place, from first thought to merged code.

But you don't need a fully featured toolkit like Penling to try this. A shared doc with the four sections described earlier is enough to see whether it changes anything in the way you interact with your LLM.

In my experience it does. Quite significantly.

More on the methodology: What is spec driven development?