Diya Burman

Posted on Jun 22

Spec Debt Doesn't Disappear When You Fix It. It Migrates.

#ai #softwareengineering #agents #testing

Preface

I want to be upfront about something before we get into it. None of the frameworks in this article is mine. The ideas here come from two people who have been thinking about this stuff way harder and longer than I have — and they deserve full credit before I say another word.

Dan Shapiro — CEO of Glowforge, Wharton Research Fellow, and the person who gave this whole conversation a vocabulary. His blog post “The Five Levels: from Spicy Autocomplete to the Dark Factory” is the conceptual spine of everything I’m about to say. Read the original. It’s short, sharp, and will make you uncomfortable in the best way. danshapiro.com

Nate B. Jones — AI strategist, zero-hype practitioner, and the person whose YouTube channel made me realize I had been fooling myself about where I actually sat on this ladder. His video “The 5 Levels of AI Coding (Why Most of You Won’t Make It Past Level 2)” is what triggered this entire newsletter. natebjones.com — Watch the video

This newsletter, The Level 5 Engineer, is my public learning log. I’m a Senior Software Engineer and a Tech Lead, currently somewhere between Level 2 and Level 3 of the ladder this newsletter is named after, on a good day. The goal is Level 5. I’m documenting the climb in real time: the frameworks, the tools, the mindset shifts, and the moments where I realize I’ve been doing it wrong. If you’re on a similar journey, pull up a chair.

Issue #7 ended with seven spec debt items documented in a project that had been built carefully for seven issues. Every item was passing its tests. None of them announced themselves. They were found by asking a different question: not "does this pass?" but "what would a second agent build from this step?"

Issue #8 fixes all seven — and builds the tool that found them into something reusable.

The seven fixes

I worked through the items one at a time, running the test suite after every individual fix rather than batching them. The discipline matters: if a fix breaks something, you want to know which fix broke it.

Fix 1 — Timeout measurement ambiguity

# Before
And the response is returned within 12 seconds

# After
And the response is returned within 12 seconds of the order being submitted

"Of the order being submitted" anchors the clock to client-side HTTP request dispatch — the same moment time.time() is captured in the step definition. Without this anchor, a second implementation could measure from server receipt, from the last retry attempt, or from when the response body is fully read. All three produce different numbers under load.
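A minimal sketch of what anchoring the clock at client-side dispatch could look like. The helper names `timed_submit` and `fake_submit` are hypothetical and not taken from the repo; the only details from the text are the 12-second limit and the use of time.time().

```python
import time

TIMEOUT_SECONDS = 12  # limit taken from the spec step


def timed_submit(submit_order):
    """Call submit_order() and return (response, elapsed_seconds).

    The clock starts immediately before the client dispatches the request
    and stops once the call has returned.
    """
    started = time.time()
    response = submit_order()
    elapsed = time.time() - started
    return response, elapsed


def fake_submit():
    # Hypothetical stand-in for the real HTTP call that submits the order.
    time.sleep(0.01)
    return {"status": "accepted"}


response, elapsed = timed_submit(fake_submit)
assert elapsed < TIMEOUT_SECONDS
```

Measuring from any other starting point (server receipt, last retry) would require a different place to capture `started`, which is exactly the decision the rewritten step closes.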

Fix 2 — "Retried" vs "total attempts"

# Before
And the payment gateway is not retried more than 2 times

# After
And the payment gateway receives no more than 2 charge requests total

"Retried 2 times" has two valid English readings: 2 retries after the initial attempt (3 requests total), or 2 attempts altogether (2 requests total). "No more than 2 charge requests total" counts requests, not retries, and the word "total" makes clear the initial attempt is included. This also changed the assertion in the step definition: instead of trusting the response body's retry_count field, it now checks the actual call count at the mock server. Stronger assertion, same outcome.
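The difference between trusting a reported count and counting at the mock can be sketched like this. `FakeGateway` and `place_order` are hypothetical stand-ins, not the project's actual mock server or order service.

```python
class FakeGateway:
    """Hypothetical mock payment gateway that records every charge request."""

    def __init__(self):
        self.charge_calls = []

    def charge(self, amount):
        self.charge_calls.append(amount)
        raise TimeoutError("gateway unavailable")


def place_order(gateway, amount, max_attempts=2):
    """Try to charge up to max_attempts times in total, initial attempt included."""
    for _ in range(max_attempts):
        try:
            gateway.charge(amount)
            return {"status": "confirmed"}
        except TimeoutError:
            continue
    return {"status": "payment_failed"}


gateway = FakeGateway()
result = place_order(gateway, 134.97)

# Count the requests the mock actually observed, rather than a
# retry_count value reported by the service under test.
assert len(gateway.charge_calls) <= 2
```

If the service under test misreported its own retries, an assertion on the response body would still pass; the count at the mock would not.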

Fix 3 — "Released" without mechanism

# Before
And the inventory reservation is released

# After
And the inventory service receives a reservation release request for SHOE-RED-42 and BELT-BRN-M

"Released" says what happened but not how, and not for which items. The rewrite names the items and specifies that a request is sent to the inventory service. This fix also revealed a gap: the current implementation signals release via a response body field (inventory_released: true) rather than a separate API call to the inventory service. The spec now describes the intended behaviour. The implementation doesn't fully match it yet. That's a future issue — but the gap is now visible rather than hidden.
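A small sketch of why the rewritten step exposes the gap. It assumes a hypothetical `FakeInventoryService` mock; the SKUs and the inventory_released field come from the text above, everything else is illustrative.

```python
class FakeInventoryService:
    """Hypothetical mock that records every reservation release request."""

    def __init__(self):
        self.release_requests = []

    def release(self, skus):
        # Store a sorted copy so the check below ignores SKU ordering.
        self.release_requests.append(sorted(skus))


def release_was_requested(inventory, expected_skus):
    """True if the mock saw a release request for exactly these SKUs."""
    return sorted(expected_skus) in inventory.release_requests


expected = ["SHOE-RED-42", "BELT-BRN-M"]
inventory = FakeInventoryService()

# Current behaviour as described: a flag in the response body, no outbound call.
response_body = {"inventory_released": True}
flag_only = release_was_requested(inventory, expected)

# Intended behaviour: the order service sends a request to the inventory service.
inventory.release(["BELT-BRN-M", "SHOE-RED-42"])
with_call = release_was_requested(inventory, expected)
```

Here `flag_only` is False and `with_call` is True: a response-body flag alone cannot satisfy a step that asserts on what the inventory service received.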

Fix 4 — "Explicit user action" — removed entirely

# Before
And no order is confirmed without explicit user action

# After
(step removed)

This step implies a follow-up confirmation flow (POST /orders/{id}/confirm or equivalent) that does not exist anywhere in the codebase. It passes trivially because no order is confirmed in the partial availability scenario — not because the confirmation flow was implemented. A spec step that passes for the wrong reason is not a safety net. It is a false guarantee. If the confirmation flow is built in a future issue, a new scenario should specify it precisely. Leaving this step in place would invite an agent to invent an unspecced endpoint.
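To make "passes for the wrong reason" concrete, here is a sketch of a vacuous check. The field names `status` and `confirmed_by_user` are hypothetical; the point is only that a universal claim over an empty set is always true.

```python
def no_order_confirmed_without_user_action(orders):
    """Check that every confirmed order carries a user-confirmation flag."""
    confirmed = [o for o in orders if o["status"] == "confirmed"]
    # all() over an empty list is True, so this passes when nothing is confirmed.
    return all(o.get("confirmed_by_user") for o in confirmed)


# Partial availability scenario: no order is confirmed at all.
partial_availability_orders = [{"id": "o-1", "status": "pending"}]
vacuous_pass = no_order_confirmed_without_user_action(partial_availability_orders)

# The same check only has teeth once a confirmed order exists.
violating_orders = [{"id": "o-2", "status": "confirmed"}]
real_failure = no_order_confirmed_without_user_action(violating_orders)
```

`vacuous_pass` is True even though no confirmation flow was exercised, which is why the step was removed rather than reworded.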

Fix 5 — Presence without value assertions

The order_status_bad.feature timestamp step was asserting only that a field exists and is a non-empty string. I tightened it to assert the field name, the value, and the type explicitly. I kept the change conservative: order_status_bad.feature is a pedagogical artifact, and converting it into a good spec would defeat its purpose in the newsletter.
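The gap between a presence check and a value check can be sketched as follows. The field name `updated_at` and the timestamp value are assumptions for illustration; the article does not name the actual field.

```python
EXPECTED_TIMESTAMP = "2024-01-01T00:00:00Z"  # hypothetical expected value


def presence_check(body):
    """Weak form: the field exists and is a non-empty string."""
    value = body.get("updated_at")
    return isinstance(value, str) and value != ""


def value_check(body, expected):
    """Tighter form: named field, exact type, exact value."""
    return type(body.get("updated_at")) is str and body["updated_at"] == expected


garbage = {"updated_at": "not a timestamp"}
valid = {"updated_at": EXPECTED_TIMESTAMP}

weak_accepts_garbage = presence_check(garbage)
strict_accepts_garbage = value_check(garbage, EXPECTED_TIMESTAMP)
strict_accepts_valid = value_check(valid, EXPECTED_TIMESTAMP)
```

The presence check accepts any non-empty string, so it cannot distinguish a real timestamp from placeholder text.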

Fix 6 — "An order exists" without specifying how

# Before
Given an order was successfully placed and confirmed with order ID "aaa00000-..."

# After
Given an order was created via POST /orders and confirmed with order ID "aaa00000-..."

"Successfully placed and confirmed" describes the outcome but not the mechanism. "Created via POST /orders" makes explicit that a real creation flow is expected. The step definition currently seeds the order directly into the in-memory store, which is a shortcut. The rewrite therefore creates a documented gap between spec intent and step implementation. A visible gap, not a hidden one.

Fix 7 — "Correct" without definition

# Before
And the notification contains the correct order id and total

# After
And the notification request body contains order_id "order-abc-123" and total 134.97

"Correct" is relative to context that may not be available to the reader. The rewrite hardcodes the expected values established in the When clause. Agents reading the original step would all implement something that checks the notification body, but one might compare against the When-clause values, another might check against a computed total, and a third might only verify field presence. The rewrite leaves exactly one interpretation.

This fix also caught something the stub had been hiding: the notification mock was returning "mock-notif-001" as a notification id. Not a UUID. The format assertion caught it immediately. This is exactly the value of adding concrete assertions — it surfaces stub data that was never valid but was never checked.
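One way such a format assertion could be written, as a sketch: the helper name `is_canonical_uuid` is hypothetical, and the article does not say how the project's own check is implemented. It uses the standard library's uuid.UUID parser and additionally requires the canonical hyphenated form, since the parser alone also accepts variants such as strings without hyphens.

```python
import uuid


def is_canonical_uuid(value):
    """True if value is a UUID string in canonical 8-4-4-4-12 form."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        # Not parseable as a UUID (or not a string at all).
        return False


stub_id_ok = is_canonical_uuid("mock-notif-001")
generated_id_ok = is_canonical_uuid(str(uuid.uuid4()))
```

The stub value fails the check, while a freshly generated UUID passes.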

The audit framework

After fixing all seven items, I built the diagnostic tool into a standalone document: docs/spec-audit-framework.md. The full document is in the repo. Here's the core of it.

Five questions — ask them for every scenario in every feature file:

Q1: Who owns this scenario?
Can you name the team, service, or domain this scenario belongs to? If the answer includes "and also", the scenario is in the wrong file.

Q2: What decisions does this scenario leave open?
For every Given, When, and Then clause: could two agents build different implementations that both pass? If yes, the step is underspecified.

Q3: Are all terms defined within the file?
Every noun that is not a standard HTTP concept or a primitive type should be defined in the scenario or a Background clause. If understanding a term requires reading another file or asking a colleague, it is spec debt.

Q4: Does this scenario describe behaviour or implementation?
Steps should describe what the system does from the caller's perspective. Any step that references internal concepts — database field names, function names, internal status codes — is leaking implementation into the spec.

Q5: What does this scenario NOT say that it should?
List the edge cases, error states, and boundary conditions the scenario implies but does not specify. Each one is a silent assumption waiting to become a production incident.

Six debt classes:

Class	What it looks like
UNDERSPECIFIED	Step present but leaves a decision open
MIXED CONCERN	Scenario covers more than one service domain
UNDEFINED TERM	A noun used without being defined
AMBIGUOUS COUNT	A quantity with two valid interpretations
IMPLICIT FLOW	Implies a follow-up flow that isn't specced
LEAKY ABSTRACTION	References implementation details
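If you want to record audit findings in code rather than prose, the six classes map naturally onto an enum. This is a sketch only: the `Finding` structure, its field names, and the example file name are assumptions, not the format used in docs/spec-audit-framework.md.

```python
from dataclasses import dataclass
from enum import Enum


class DebtClass(Enum):
    UNDERSPECIFIED = "Step present but leaves a decision open"
    MIXED_CONCERN = "Scenario covers more than one service domain"
    UNDEFINED_TERM = "A noun used without being defined"
    AMBIGUOUS_COUNT = "A quantity with two valid interpretations"
    IMPLICIT_FLOW = "Implies a follow-up flow that isn't specced"
    LEAKY_ABSTRACTION = "References implementation details"


@dataclass(frozen=True)
class Finding:
    feature_file: str      # hypothetical field: where the step lives
    step_text: str         # the step as written in the feature file
    debt_class: DebtClass  # one of the six classes above
    question: str          # which of Q1..Q5 surfaced it


# Illustrative record only; the file name is a placeholder.
finding = Finding(
    feature_file="example.feature",
    step_text="And the payment gateway is not retried more than 2 times",
    debt_class=DebtClass.AMBIGUOUS_COUNT,
    question="Q2",
)
```

Keeping findings structured like this makes it straightforward to count items per class or per file later.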

What the framework found that the manual audit missed

Applying the five questions to all four fixed feature files surfaced one item the Issue #7 manual audit didn't catch.

In order_status_good.feature, the Given clause now reads "created via POST /orders" — the fixed version from this session. Q4 flagged it for a different reason than the original audit: the step definition still seeds the order directly into the in-memory store. The spec text is precise. The implementation of the spec takes a shortcut.

The manual audit looked at feature file text. The framework applies Q4 to step definitions as well — and a step definition that silently does something different from what the spec says is spec debt, even if the test passes.

This distinction matters: spec debt can migrate from the feature file into the step definition. You fix the scenario, tighten the language, run the tests — green. But the step definition now implements a shortcut that contradicts the precise step text. The debt moved, it didn't disappear.

The scorecard — after all fixes

I applied the framework to the three non-pedagogical feature files; the fourth file is listed for completeness:

order_creation.feature — 5 scenarios, 1 debt item remaining (LEAKY ABSTRACTION at step definition level — inventory release mechanism gap from Fix 3)

order_status_good.feature — 2 scenarios, 1 debt item remaining (LEAKY ABSTRACTION — step definition seeds order directly rather than via POST /orders)

notification_service.feature — 2 scenarios, 0 debt items

order_status_bad.feature — kept as pedagogical artifact, not audited for debt

Debt density after fixes: 0.22 items per scenario (2 items across 9 audited scenarios). Both remaining items are LEAKY ABSTRACTION at the step definition level. No AMBIGUOUS COUNT or IMPLICIT FLOW items remain; those are the two highest-risk classes.
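The density figure follows directly from the scorecard. A short sketch of the arithmetic, using only the scenario and debt counts listed above (the dictionary layout itself is just an illustration):

```python
scorecard = {
    "order_creation.feature": {"scenarios": 5, "debt_items": 1},
    "order_status_good.feature": {"scenarios": 2, "debt_items": 1},
    "notification_service.feature": {"scenarios": 2, "debt_items": 0},
}

total_scenarios = sum(f["scenarios"] for f in scorecard.values())
total_debt = sum(f["debt_items"] for f in scorecard.values())

# Debt density: remaining debt items divided by audited scenarios.
density = total_debt / total_scenarios
print(f"{density:.2f} items per scenario")
```

order_status_bad.feature is left out of the calculation because it was not audited for debt.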

The uncomfortable answer

After fixing seven spec debt items and applying a structured audit framework to a project that has been built carefully for eight issues, two debt items remain. Both were introduced by the same sessions that fixed other debt — a precise spec step was written, and the implementation of that step took a shortcut.

Spec debt is not eliminated by fixing debt. It migrates.

The practical conclusion: treat step definitions as part of the spec surface, not just as test harness code. A step definition that silently does something different from what the spec says is spec debt, even if the test passes. The audit framework catches both — but only if you apply Q4 to the step definitions as well as the feature text.

The other finding worth naming: notification_service.feature scored zero debt items. It was written after eight issues of accumulating lessons about what the previous files got wrong. The absence of debt is not accidental — it's the result of knowing what bad specs look like before writing the next one.

The best time to write a spec is after you've written a few bad ones. Auditing retroactively and fixing forward is the realistic path. Not "write it right the first time."

Next issue: Prompts Are Disposable. Skills Are Infrastructure — the conceptual shift from session-level prompts to versioned, reusable skill definitions. Layer 2 begins.

Sources & Further Reading