*Fourth article in the **Grounded Code** series. The previous articles established the cost, the mechanism, and the patterns to drop. This one introduces the artifact that holds the whole system together.*
I. The setup
The first three articles were diagnostic. The cost is 7.5x in tokens on the same feature. The mechanism is how agents read with grep instead of an IDE. The patterns that score badly are the ones that paid for themselves on the old axis and stopped paying on the new one.
This article goes positive. What does the agent actually need to stay anchored across turns, across sessions, across compactions, across the moment three weeks from now when a different developer asks the agent to extend the same feature?
The answer is a small markdown file that lives next to the code. It is the document the whole architecture turns around. It is where Grounded Code earns its name.
I want to make the article's three promises explicit, because they map directly to what I want you to walk away with:
- The full format of `<feature>.spec.md`.
- A complete example.
- The field tests-first never had, and why it matters more than I expected.
II. The full format
A spec is a short markdown file with frontmatter (the parseable contract) and a body (the prose rationale). It lives at `src/<feature>/<feature>.spec.md`, next to the implementation and the test.
The skeleton:
```markdown
---
purpose: One sentence, present tense, what this feature does.
public_api:
  - functionName(arg: T): U
  - OtherType: { field: T }
invariants:
  - Statement that must always hold.
  - Another statement.
in_scope:
  - What this feature is responsible for.
out_of_scope:
  - What this feature explicitly does NOT do.
verification: |
  pnpm test src/payment/payment.test.ts
  pnpm typecheck
---

# <Feature> spec

A short prose narrative explaining the why of this feature.

## Why this exists

What problem this feature solves, in one paragraph.

## Edge cases worth knowing

A list of edge cases the implementation handles.

## Related specs

Links to specs of features this one interacts with.
```
Six frontmatter fields, three body sections. The whole thing is usually 60 to 120 lines. Anything shorter is under-specifying. Anything longer means the feature is too big and should be split.
Let me walk through the frontmatter fields.
purpose
One sentence, present tense, what the feature does. Fifteen to twenty-five words. This is what the agent sees first when it opens the spec, so it has to convey the feature without marketing language.
"Computes the final price after discounts, taxes, and currency conversion."
Not:
"A robust, extensible system for handling our complex pricing logic with support for various edge cases."
public_api
The exported names and their signatures. This is the surface the rest of the codebase depends on. If you add or remove a name here, downstream code is going to break. So this field doubles as a "things that must not be renamed without a migration" list.
You don't list every internal helper. Just the surface.
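To make the exported-surface idea concrete, here's a hypothetical module (names invented for illustration) whose public_api entries would mirror its exports one-to-one, while the internal helper stays unlisted:

```typescript
// Hypothetical pricing module; only the exported names appear in public_api.
export type Money = { amount: number; currency: string };
export type Order = { lines: { price: number }[]; currency: string };

// Listed in public_api: renaming this without a migration breaks callers.
export function computeTotal(order: Order): Money {
  return {
    amount: order.lines.reduce((sum, line) => sum + line.price, 0),
    currency: order.currency,
  };
}

// Internal helper: not listed in public_api, free to change or remove.
function roundCents(amount: number): number {
  return Math.round(amount * 100) / 100;
}
```

The field is a rename-with-care list as much as it is documentation: everything above the internal helper is a contract, everything below it is private.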
invariants
Statements that must always hold, written as plain English. The implementation enforces them, the tests verify them, the spec declares them. When all three agree, the system is consistent.
- "The total returned by `computeTotal` is non-negative."
- "If `applyDiscount` is called twice with the same code, the second call is a no-op."
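Those sentences translate almost mechanically into assertions. A stand-in sketch (this applyDiscount is invented for illustration, not the article's code):

```typescript
// Invented stand-in types, just to show the invariant-to-assertion mapping.
type Order = { total: number; appliedCodes: string[] };

function applyDiscount(order: Order, code: string): Order {
  // Invariant: a second call with the same code is a no-op.
  if (order.appliedCodes.includes(code)) return order;
  // Invariant: the total never goes negative, even for oversized discounts.
  return {
    total: Math.max(0, order.total - 50),
    appliedCodes: [...order.appliedCodes, code],
  };
}

const once = applyDiscount({ total: 30, appliedCodes: [] }, "50OFF");
const twice = applyDiscount(once, "50OFF");
// once.total is 0 (floored, not -20); twice is identical to once.
```

The spec declares the sentence, the implementation enforces it with the floor and the idempotence check, and the test asserts it. Same statement, three places, one source of truth.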
in_scope and out_of_scope
In_scope is what the feature does. Out_of_scope is what it does not do, and crucially, what it should not be extended to do without an explicit decision.
I'm going to spend an entire section on out_of_scope below, because this is the field I keep finding to be more load-bearing than anything else in the spec. It's the one piece of the format that has no analog in tests-first development.
verification
The exact commands to run to verify the feature works. Copy-pasteable. This is what the agent runs as the gate before considering a change done.
```yaml
verification: |
  pnpm test src/payment/payment.test.ts
  pnpm typecheck
  pnpm lint src/payment
```
If verification involves running the app and checking something manually, say so explicitly. The agent will then know to surface the request rather than silently skip it.
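Because the commands are copy-pasteable, a harness can extract them mechanically before handing them to a shell. A minimal sketch (naive line-based parsing, assuming the literal block is indented with two spaces; a real harness would use a YAML parser):

```typescript
// Extract the commands from a spec's `verification: |` literal block.
function verificationCommands(frontmatter: string): string[] {
  const lines = frontmatter.split("\n");
  const start = lines.findIndex((line) => line.trim() === "verification: |");
  if (start === -1) return [];
  const commands: string[] = [];
  for (const line of lines.slice(start + 1)) {
    if (!line.startsWith("  ")) break; // end of the indented block
    commands.push(line.trim());
  }
  return commands;
}

const frontmatter = [
  "purpose: Computes the final order total.",
  "verification: |",
  "  pnpm test src/payment/payment.test.ts",
  "  pnpm typecheck",
].join("\n");

const commands = verificationCommands(frontmatter);
// Each command runs in order; any non-zero exit means the change is not done.
```

The point of the field is exactly this mechanical quality: no interpretation required, just execution.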
The body
Three sections, in order.
Why this exists. One paragraph. The motivation. What changes if this feature did not exist. If the feature exists because of an external constraint (a regulation, a third-party API quirk, an incident), name it.
Edge cases worth knowing. A list. Each bullet states an edge case and its handling, with the justification. This is where lessons from production live. The things that would otherwise be encoded as comments scattered through the implementation get hoisted into one readable place.
Related specs. Links to the specs of features this one talks to. Relative paths, so the agent can follow them with one read.
III. A complete example
Here's the spec for a pricing feature, fully filled out:
```markdown
---
purpose: Computes the final order total including discounts, taxes, and tip.
public_api:
  - computeTotal(order: Order, rates: Rates): Money
  - Money: { amount: PositiveInt; currency: Currency }
invariants:
  - The total is non-negative.
  - The total is in the currency of the order.
  - Discounts apply before taxes; tip applies last.
in_scope:
  - Discount application (percent and fixed-amount).
  - Sales tax based on shipping address.
  - Tip as percentage of pre-tax subtotal.
out_of_scope:
  - Payment authorization.
  - Currency conversion (rates are received as input).
  - Refund computation (see ../refund/refund.spec.md).
verification: |
  pnpm test src/total/total.test.ts
  pnpm typecheck
---

# Total spec

## Why this exists

Computing the total is non-trivial because of ordering: discounts before
taxes (legally required in our jurisdictions), tip after tax (customer
expectation), and rounding at each step (otherwise we accumulate cent-level
discrepancies that fail reconciliation).

## Edge cases worth knowing

- **100% discount:** total is zero, not negative. Tax on a zero subtotal
  is zero.
- **Tip on a fully-discounted order:** zero (tip is percentage of pre-tax
  subtotal, which is zero).
- **Mixed-currency line items:** rejected upstream; this function assumes
  single-currency orders. The type system enforces this via the `Order`
  type's `currency` field.

## Related specs

- `../order/order.spec.md` for the upstream caller.
- `../refund/refund.spec.md` for the inverse computation.
```
That's a sixty-line file. The implementation it describes is probably one hundred fifty lines. The tests are probably two hundred. The agent navigates all three with one glob (`src/total/*`) and rebuilds full understanding in two reads.
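For a sense of what that implementation enforces, here is a compressed sketch reconstructed from the spec's invariants. This is my reconstruction, not code from the article; the types and the rounding strategy are assumptions:

```typescript
// Hypothetical src/total/total.ts honoring the example spec:
// discounts before taxes, tip last, non-negative, single currency.
type Currency = "USD" | "EUR";
type Money = { amount: number; currency: Currency };
type Discount =
  | { kind: "percent"; value: number }
  | { kind: "fixed"; value: number };
type Order = {
  currency: Currency;
  subtotal: number;   // pre-discount, in minor units (cents)
  discount: Discount | null;
  tipPercent: number; // applied to the pre-tax, post-discount subtotal
};
type Rates = { salesTax: number }; // e.g. 0.08 for the shipping address

function computeTotal(order: Order, rates: Rates): Money {
  // 1. Discounts first (the legally required ordering), floored at zero.
  let subtotal = order.subtotal;
  if (order.discount?.kind === "percent") {
    subtotal = Math.round(subtotal * (1 - order.discount.value / 100));
  } else if (order.discount?.kind === "fixed") {
    subtotal = Math.max(0, subtotal - order.discount.value);
  }
  // 2. Tax on the discounted subtotal, rounded; a zero subtotal yields zero tax.
  const tax = Math.round(subtotal * rates.salesTax);
  // 3. Tip last, as a percentage of the pre-tax subtotal; zero when fully discounted.
  const tip = Math.round(subtotal * (order.tipPercent / 100));
  return { amount: subtotal + tax + tip, currency: order.currency };
}
```

Note how the 100%-discount and tip-on-fully-discounted edge cases from the spec fall out of steps 1 and 3 rather than needing special handling.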
Notice what isn't in the spec. There's no marketing language. There's no implementation. There's no example of usage. The examples live in the test file, where they execute. The spec is a contract, not a tutorial.
IV. The field tests-first never had
I want to make a careful argument here, because TDD has earned its place over twenty years and "tests aren't enough" is not a casual claim.
Tests-first does real work. It forces the developer to state what passing looks like before coding. It catches regressions across changes. It documents behavior in an executable form. None of that goes away with spec.md. The verification field of the spec literally points at the test suite. The tests are the gate. The spec is the anchor. They do different jobs.
But there are two things tests structurally cannot do, no matter how well you write them:
1. Tests don't carry rationale. A test says `expect(applyDiscount(order, '50OFF')).toEqual(...)`. It tells you what the system does in that case. It doesn't tell you why the discount applies before tax instead of after. It doesn't tell you whether the rounding was a deliberate decision or an accident of the implementation. When the agent has to decide whether a new requirement is a feature change or a bug fix, the tests cannot help. The rationale is somewhere else (a code comment, the head of the engineer who wrote it, a Slack thread from 2023). The spec's "Why this exists" section is where rationale belongs.
2. Tests don't carry the boundary. This is the one that surprised me. Tests state what is true about the feature. They do not state what is not part of the feature. Nothing in your test suite says "this function should not be extended to handle payment authorization." A reasonable agent, asked to extend the pricing function with a payment step, would do so. The test would not fail. The PR would even compile. And now a function that was supposed to be pure pricing computation is doing payment authorization, with all the failure modes that implies (latency, retries, error handling, observability).
The out_of_scope field is the explicit boundary. Listed, parseable, sitting at the top of the spec. When the agent reads the spec, it reads what the feature does NOT do, and the boundary is anchored. Asked to add payment authorization, the agent now responds with "that's listed as out_of_scope; should I create a new feature or expand the scope of this one?" That single question prevents a class of drift that no amount of test coverage can catch.
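The mechanics are simple enough to sketch. A hypothetical guard a harness might run before extending a feature (naive keyword overlap; a real harness would ask the model itself to judge whether the request crosses the boundary):

```typescript
// Flag a requested change that touches a declared out_of_scope entry,
// so the agent asks instead of silently implementing.
function boundaryCheck(outOfScope: string[], request: string): string | null {
  const req = request.toLowerCase();
  for (const entry of outOfScope) {
    // Naive overlap: any significant word of the entry appears in the request.
    const words = entry.toLowerCase().match(/[a-z]{5,}/g) ?? [];
    if (words.some((word) => req.includes(word))) {
      return (
        `"${entry}" is listed as out_of_scope. ` +
        "Create a new feature, or explicitly expand this spec's scope?"
      );
    }
  }
  return null; // inside the boundary; proceed
}

const scope = [
  "Payment authorization.",
  "Currency conversion (rates are received as input).",
];
const question = boundaryCheck(scope, "add a payment authorization step");
// question holds the clarifying prompt; an in-scope request returns null.
```

Whether the check is a regex or a model judgment, the precondition is the same: the boundary has to be written down somewhere parseable before anything can enforce it.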
In the audit of the three coding-agent codebases, this is the field I kept seeing carry weight. opencode's specs explicitly list things the feature won't do. Claude Code's AGENTS.md documents at the codebase level use the same pattern. The boundary, once stated, holds.
The format calls this field load-bearing. After watching it work in practice, I think load-bearing is an understatement. The out_of_scope field is what stops a feature from quietly becoming five features.
There's a secondary effect that's almost as valuable. When you sit down to write out_of_scope and find the field hard to fill, that's a signal. It usually means the feature itself has no real boundary. The discipline of filling in out_of_scope is the discipline of deciding what the feature actually is.
V. What does not go in the spec
A short list, because the spec is small on purpose.
- The implementation. The spec describes; it doesn't implement.
- Examples of usage. Those go in the test file, where they're executable. The test name is the spec sentence; the test body is the example.
- TODOs. Use the issue tracker. TODOs in a spec rot.
- Diagrams. If a diagram would help, prefer the type system, then a single ASCII or Mermaid block. Don't reach for a separate image.
- Marketing copy. The spec is for the agent and the engineer, not the website.
The spec is a contract document. Anything that doesn't help the agent or the engineer extend the feature correctly is noise.
VI. Honest limits
Two things I want to flag before someone tries this and is disappointed.
Specs cost time to write. The first spec for a feature takes maybe ten to fifteen minutes to fill in properly. That's real. The benefit shows up across all subsequent agent sessions, but the initial cost is real. For tiny features (a one-line utility, a single config constant), the cost outweighs the benefit. Skip the spec for those.
Specs need to be maintained. A stale spec is worse than no spec. If the implementation drifts away from the spec, the agent will trust the spec and produce wrong code. The discipline that makes this work is the discipline of updating the spec when the feature changes. The next article in the series (the loop) puts a step five at the end of every change for exactly this reason: consolidate the spec back to truth.
If you can't commit to updating the spec, don't write one. Stale documentation is the older, more familiar version of this problem, and the agent makes it worse, not better, by trusting confidently wrong content.
VII. What's next
The spec is the anchor. The next article is about the loop that runs on top of it: spec, plan, implement, verify, consolidate. Five steps, with the read-only plan mode being the one that saves the most time, and skipping step one being the most expensive failure mode I have seen in practice.
If you want to try writing a spec before the next article ships, pick a feature you'd extend tomorrow and fill in the frontmatter. Don't worry about the body yet. The frontmatter alone will tell you whether the feature has a coherent boundary. Half the value of the spec lands in the first ten minutes.
*Next in the series: "The five-step loop: spec, plan, implement, verify, consolidate."*