The five-step loop: spec, plan, implement, verify, consolidate

*Fifth article in the **Grounded Code** series. The previous article introduced the spec. This one shows the workflow that turns the spec into a working feature without drift.*

I. The setup

The spec is the artifact. The loop is the workflow that runs on top of it.

A reasonable objection at this point: "every engineer already does some version of this. Think before you code, plan the approach, write the code, run the tests, clean up. Why does this need a five-step ceremony with a name?"

The honest answer is that nothing in the loop is new in concept. What is new is that every step becomes a persistent artifact, because the primary reader of the codebase no longer sustains implicit habits across turns. The agent has no memory of what you decided yesterday. It has no internal narrative of "I planned it this way because." The "plan" the human carries in their head has to become a written plan that the agent can re-read. The "decision to test it manually" has to become a verification block in the spec. The whole loop is engineering discipline rendered as artifacts, because the new reader cannot inherit them otherwise.

This article has two non-obvious claims I want to defend explicitly:

  1. Step 2 (plan mode) is the step that saves the most time. A bad plan is cheap to throw away. A bad implementation is expensive.
  2. Skipping step 1 (the spec) is the most expensive failure mode I keep seeing. The drift it causes compounds across turns and is often invisible until verification passes for the wrong feature.

Both claims came out of watching the loop run, and watching it fail when steps were skipped. Neither was obvious to me when I started.


II. The five steps

```
1. Spec        Write or update <feature>.spec.md before any code.
2. Plan        Enter plan mode. Agent reads, proposes, does not write.
3. Implement   Exit plan mode. Execute the approved plan.
4. Verify      Run the spec's verification block exactly as written.
5. Consolidate Commit, reconcile spec with reality, clean the scratchpad.
```

Every non-trivial change goes through this loop. The point of the loop is that the spec stays the anchor across all five steps. At any moment, the agent can re-read it and re-ground itself in seconds. Without an anchor, the agent's context drifts. With one, drift is cheap to recover from.

I want to walk through each step, but spend the most time on the two where I think the case isn't obvious.


III. Step 1, the step most people want to skip

Before any code changes, write or update <feature>.spec.md. For a new feature, start from a blank spec. For a change to an existing feature, update the frontmatter (invariants, in_scope, out_of_scope) and the body sections that move.

This is the step humans most want to skip. Every time. The instinct is "I know what I'm doing, let me just code." I had it. Every engineer I've watched try this loop had it.

Don't skip it. This is the claim I want to defend.

The spec does not have to be perfect. It has to be honest about what you know now, and specific about the contract. A useful prompt to bootstrap from a vague idea:

"Update src/payment/payment.spec.md to reflect that discounts now stack multiplicatively, not additively. Update the invariants field and the relevant edge case in the body. Do not change the implementation yet."

The "do not change the implementation yet" matters. It keeps step 1 small and reviewable.

Here is why skipping step 1 is the most expensive failure mode I keep seeing.

Without a spec, the agent has no contract to hold onto. The first prompt establishes some implicit intent in the agent's working memory, but that intent is not durable. Three turns later, when the context is partially compacted and the agent is reasoning about an edge case, the original intent is no longer cleanly available. The agent guesses, filling in plausible defaults that almost-but-not-quite match what you wanted.

The compounding effect is what makes this expensive. A wrong decision in turn three influences turn four, which influences turn five. By turn ten you have a feature that "works" (tests pass, code compiles) but isn't what you asked for. Sometimes you catch it. Sometimes the PR ships.

I've seen this fail in two specific ways that are worth naming. The first: the agent extends the feature with something out of scope (payment authorization snuck into a pricing function) because nothing told it not to. The second: the agent solves the right problem with the wrong contract (a function returns the right value but with a signature that breaks two downstream callers) because the public API was never declared.

Both failures cost more to fix than the spec would have cost to write. The spec is ten to fifteen minutes of work that prevents the kind of drift that takes hours to diagnose and clean up.

If you genuinely cannot write a spec because the feature is research, write a research spec. The public_api field is empty. The purpose is "explore X." The body describes what you're trying to learn. This is still an anchor; it just anchors a learning loop instead of an implementation loop.
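A research spec can be just as small. A sketch, with an invented topic, might look like this:

```
---
purpose: Explore whether rate limiting can move from each service into the gateway.
public_api: []        # intentionally empty: nothing is promised yet
invariants: []
in_scope:
  - Prototype and findings only.
out_of_scope:
  - Production rollout.
---

## What we are trying to learn
- Can the gateway see enough request context to enforce per-tenant limits?
```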


IV. Step 2, the step that saves the most time

Once the spec is updated, enter plan mode. In Claude Code this is /permission plan. Other harnesses expose it differently. The mechanic is the same everywhere: the agent can read and propose, but cannot write to disk.

Ask the agent to propose a plan against the spec. Something like:

"Here is the updated spec. Propose a plan for the implementation. Do not write any code yet. The plan should be a numbered list of changes, each small enough to verify independently."

The plan that comes back is your contract for step 3. Read it. Push back on anything that doesn't match the spec. If the plan reveals that the spec was unclear or wrong, go back to step 1 and refine.
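What comes back is a small, readable artifact. For the discount example, a plan might look roughly like this (the file paths and steps are invented, not a prescribed format):

```
1. Update the discount calculation in src/payment/ to multiply factors
   instead of summing them.
2. Adjust the rounding edge case test named in the spec's verification block.
3. Update the exported function's docstring to match the spec's public_api.
```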

This is the step I want to make the strongest case for, because the value isn't obvious until you've seen it pay off.

The economic argument is simple. A plan is cheap to read, cheap to revise, cheap to throw away. A bad plan costs maybe two thousand tokens to produce and another five hundred to discuss. If you discover the plan is wrong, you've lost less than a coffee.

A bad implementation is expensive. It's hundreds or thousands of lines of code. It dirties your git state. It introduces test failures that have to be cleaned up. It rebuilds your mental model into something that doesn't match what you actually wanted. Throwing it away is hard, both technically and psychologically (the sunk cost is real).

Plan mode catches the wrong direction before it costs anything substantial. That alone would justify the step.

There's a second effect that I think matters even more. The agent reads more carefully in plan mode. The mental frame is "I am proposing," not "I am executing." It surveys the codebase, identifies the relevant files, considers alternatives. In implementation mode, the agent is reaching for tools and writing code. The reading-to-writing ratio inverts.

The plan that emerges from plan mode is usually better than the plan that would have emerged from "just write the code." Not because the agent is smarter in plan mode, but because the constraints of the mode make it think differently.

The third effect is alignment with you. Plan mode produces an artifact that's small enough to actually read. You can object to step three of seven without having to read a thousand lines of code. That objection is the cheapest possible alignment, and it happens before any code exists.

When I audited the three coding-agent codebases for the series, all three had explicit plan-mode-like affordances. Heavily used. Not optional. That convergence is the strongest evidence I have that this step is load-bearing.

The cost of plan mode is real, mostly in turns and in waiting for the agent to think. Use it for the survey and the first-cut plan, then exit plan mode and execute. Don't try to perfect every detail in plan mode; iteration there has diminishing returns.


V. Step 3, implement

Exit plan mode. Have the agent execute the approved plan. Watch the diffs as they arrive. Two things to watch for.

The diff stays inside the feature directory. If the implementation is touching files in five other features, either the feature is too big (split it, go back to step 1) or the change is genuinely cross-cutting and the feature boundary is wrong (rethink the spec). Co-location is one of the rules the diff makes legible. When the diff sprawls, something is off architecturally.

Public API matches the public_api field of the spec. If the agent renamed an exported symbol or changed a signature without updating the spec, that's drift. Catch it now. Either the spec needs to update or the implementation needs to revert. Don't let the spec and the implementation diverge silently.

For non-trivial changes, do the implementation in micro-commits: one small, self-contained change per commit, each verifiable on its own. The spec stays the same across them. The diff is small. The rollback target is clear. If you're new to this pattern, the discipline feels slow at first. It pays back the first time you have to bisect a regression.
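As a sketch of what that looks like in practice, the commit messages alone can carry the sequence (the messages are invented for the discount example):

```
payment: update payment.spec.md for multiplicative stacking (1/3)
payment: implement multiplicative stacking behind the existing public API (2/3)
payment: adjust rounding edge case test; verification passes (3/3)
```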

If you find yourself wanting to fork an exploration mid-implementation ("let me try a different approach and see if it works better"), don't do it in the main thread. Either go back to plan mode for a quick alternative survey, or fork a separate session for the experiment. Mixing exploration and implementation in the same context is how diffs sprawl.


VI. Step 4, verify

Run the verification block from the spec, exactly as written. Don't substitute "run all tests" for "run the specific test file the spec named." The specificity is the point. The verification block is a contract that the feature is locally testable, which means changes to it don't require running the entire test suite.
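To make "exactly as written" concrete, a verification block in the spec can be a short, feature-scoped list of commands. This sketch assumes a Node project and the payment example; the commands are placeholders for whatever the spec actually names:

```
## Verification
- npm test -- src/payment/payment.test.ts
- npm run typecheck
- Manual: apply two 10% discounts to a 100.00 cart; expect 81.00, not 80.00.
```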

If verification fails, three things might be true:

The failure surfaces a real bug. Fix it. The fix is its own small loop (maybe update the spec to add the edge case, plan the fix, implement, re-verify). Treat it as a micro-loop, not as an emergency.

The failure surfaces a wrong test. Update the test. If the test was enforcing a rule that's no longer true, the spec also has to change. This is a hint that step 1 was incomplete. Usually the invariants field was missing a rule, or the rule was wrong.

The failure surfaces an environment issue (missing env var, broken dependency, port conflict). Fix the environment. But capture the fix somewhere durable, like CLAUDE.md or the project README. If the agent had to figure it out once, it'll have to figure it out again next session unless the lesson lands somewhere persistent.
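The durable capture can be two lines. Something like this in CLAUDE.md is enough (the specifics are invented):

```
## Environment notes
- Integration tests need PAYMENT_GATEWAY_URL set; use the sandbox value in .env.example.
- Port 5432 must be free; stop any local postgres container before running the suite.
```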

Don't suppress failures by removing the verification check or commenting out the failing assertion. That's the kind of move that looks fine in the moment and rots over months.


VII. Step 5, the step that prevents cruft

Once verification passes, three things happen in order.

Commit. A single commit per feature change is ideal. The commit message references the spec: "update payment.spec.md: stacked discounts; implement; tests pass."

Reconcile the spec with reality. If the implementation revealed something the spec didn't anticipate, update the spec to match. The spec should describe what the code actually does, not what it was originally intended to do. This is the move that prevents stale-spec rot. The discipline is small (usually one or two field edits) but it's the only thing that keeps the spec trustworthy session after session.
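Reconciliation is usually a one- or two-line edit to the frontmatter. An invented example, where the implementation surfaced a cap the spec never mentioned:

```
 invariants:
-  - Discounts stack multiplicatively, never additively.
+  - Discounts stack multiplicatively, never additively; at most three
+    discounts apply per cart.
```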

Clean the scratchpad. Anything in the working memory that was relevant to this loop and only this loop, drop it. Anything that was a load-bearing learning that should survive the loop, hoist it into the spec, into CLAUDE.md, or into a relevant .claude/rules/<topic>.md file. The agent's working memory is bounded. Treating it as long-term storage is one of the most reliable ways to compact a session into uselessness.
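A hoisted learning does not need to be long either. A sketch of a hypothetical .claude/rules/payments.md:

```
# Payments
- Discounts stack multiplicatively; the contract lives in src/payment/payment.spec.md.
- Never hit the real gateway in tests; use the sandbox configuration noted in CLAUDE.md.
```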

Step 5 is the step that prevents loops from accumulating cruft. Without consolidation, every subsequent loop starts in a worse state. With it, the codebase stays grounded in its own truth.

I see step 5 skipped almost as often as step 1. The cost of skipping it is slower, more diffuse, and harder to attribute to any one decision. But it's real. After ten unconsolidated loops, the spec is no longer trustworthy, the scratchpad is full of dead context, and the next session has to spend tokens rebuilding what the previous one knew.


VIII. When the loop breaks down

A few common failure modes and what to do about each.

The spec is too vague to write yet. You're in research mode, not implementation mode. Use a research spec, as described in section III.

The plan in step 2 is huge. The feature is too big. Split it. Each split feature gets its own spec and its own loop. The original spec might still exist as an umbrella spec that links to the smaller specs in its "Related specs" section.
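The umbrella spec's linking section can be minimal (file names invented):

```
## Related specs
- src/checkout/cart.spec.md (cart totals and discounts)
- src/checkout/tax.spec.md (tax calculation)
- src/checkout/receipt.spec.md (receipt rendering)
```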

The agent keeps drifting from the plan during step 3. Re-read the spec out loud (in chat). Ask the agent to confirm the next single change against the spec. If it can't, the plan was wrong; go back to step 2.

Verification passes but the feature is wrong. The spec is wrong. The verification block didn't capture what was actually important. Update the spec (usually the invariants field is missing a critical rule) and re-run.

The loop is overkill for tiny changes. It is. For a typo fix or a one-line patch, skip the loop and just do it. The loop is for non-trivial changes. The line is fuzzy. When you're unsure, err toward the loop. The cost (a few minutes on the spec) is small, and the benefit (catching misalignment before it compounds) is large.


IX. Honest scope

This loop is overhead. Five to fifteen minutes of upfront work before any code happens. For features that take an hour, the math is good. For features that take five minutes, it isn't.

The discipline of the loop assumes you'll do it more than once. If you do this loop once, you're paying the full overhead cost without the amortization benefit. If you do it ten times on the same codebase, the spec accumulates knowledge that compounds across loops, and the cost-per-loop drops as your skill with the format grows.

If you can't commit to running the loop consistently, run it on the changes where misalignment is most expensive (new features, public-API changes, anything touching billing or auth). Skip it on cosmetic fixes, dependency bumps, and config tweaks. A partial loop is fine. A loop applied where it matters most is most of the value.

I don't think the loop is the only way to do this. It's the version I've seen converge across three coding-agent codebases that built it independently. Other versions exist (one team I know uses a four-step variant that folds consolidate into commit). The five-step version is what I'd start with, but the principles travel.


X. What's next

We have the artifact (spec.md) and the workflow (the loop). The final article in the series is the principles that hold underneath both: ten rules grouped into three clusters (locality, contracts, quarantine), each with code as proof.

If you want to try the loop before the next article ships, the cheapest experiment is to apply it to one feature you'd extend tomorrow. Write the spec. Plan in plan mode. Implement. Verify. Consolidate. Count the time you spent, and compare it to the next ten minutes of context recovery you would have spent in a free-form session. The receipt is the argument, same as it has been all series.


Next in the series: *"The ten principles: locality, contracts, quarantine."*
