DEV Community: guo king

Your MCP Server's Client Is a Language Model. Write the Contract Like You Mean It

guo king — Mon, 20 Jul 2026 02:33:08 +0000

The Model Context Protocol quietly turned every internal script, database query, and admin action into an API.

The client for that API is not a developer reading your docs. It's a language model reading your tool descriptions at runtime and deciding — on its own — what to call and with what arguments.

When a human developer integrates your API, ambiguity gets absorbed by their judgment. They read the docs twice, test in staging, and email you when the 409 makes no sense. An agent does none of that. It reads the tool's name, description, and input schema, forms a belief about what the tool does, and acts on that belief immediately, in production, possibly hundreds of times an hour.

Every ambiguity in a human-facing API costs you a support thread. Every ambiguity in an agent-facing tool costs you whatever the agent decided it meant, multiplied by every session that reaches the same fork.

The fix isn't better prompting on the client side — client prompts are the one thing you don't control. The fix is a contract on the server side.

What changes when the consumer is an agent

Contract element	Human developer	AI agent
Documentation	Read once during integration	Re-read from the schema every session; the description is the docs
Validation errors	Fixed during development	Must name the failing field so the agent can self-correct
Retries	Written deliberately, once	Attempted whenever an error looks transient; idempotency is survival
Destructive actions	Guarded by developer caution	Guarded only by what the contract flags
Breaking changes	Announced in a changelog	Cause silent misfires until a human notices that agent behavior has degraded

None of this is exotic. It's the same rigor a good OpenAPI spec already demands — applied with the assumption that the reader has perfect patience, zero context outside the schema, and no ability to ask a clarifying question.

Input schemas: constrain, don't describe

JSON Schema is the enforcement layer of the contract, and agents respect it more reliably than they respect prose. Every constraint you can express in the schema, express in the schema: enums instead of free-text strings, patterns for identifiers, explicit defaults, required fields kept to the true minimum.

"amount_cents": {
  "type": "integer",
  "minimum": 1,
  "maximum": 500000,
  "description": "Refund amount in cents. Must not exceed the invoice's remaining refundable balance; the server rejects overdrafts with REFUND_EXCEEDS_BALANCE."
}

The anti-pattern is the string-typed field with a paragraph of prose explaining the accepted values. Agents will eventually send the one variant your parser didn't expect.
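
Here is a minimal sketch of what enforcing those constraints on the server could look like, in Python with no dependencies. The `validate_refund_args` helper, the `inv_` plus digits pattern, and the `reason` enum are assumptions for illustration, not part of MCP or any library. The point is only that every violation comes back attached to a named field.

```python
import re

# Hypothetical constraints for a refund tool. The pattern and the enum
# values are illustrative assumptions, not taken from a real schema.
INVOICE_ID = re.compile(r"^inv_[0-9]+$")
REASONS = ("customer_request", "duplicate", "fraud")


def validate_refund_args(args: dict) -> list:
    """Return one structured error per failing field (empty list = valid)."""
    errors = []

    invoice_id = args.get("invoice_id")
    if not isinstance(invoice_id, str) or not INVOICE_ID.match(invoice_id):
        errors.append({"code": "INVALID_FIELD", "field": "invoice_id",
                       "expected": "string matching ^inv_[0-9]+$"})

    amount = args.get("amount_cents")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if (not isinstance(amount, int) or isinstance(amount, bool)
            or not 1 <= amount <= 500000):
        errors.append({"code": "INVALID_FIELD", "field": "amount_cents",
                       "expected": "integer between 1 and 500000"})

    if args.get("reason") not in REASONS:
        errors.append({"code": "INVALID_FIELD", "field": "reason",
                       "expected": "one of " + ", ".join(REASONS)})

    return errors
```

A free-text `reason` would need a paragraph of prose to explain. The enum needs none, and a wrong value produces an error the agent can act on.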

Output contracts: agents parse, they don't skim

An MCP tool that returns a friendly paragraph of prose forces every consuming agent to parse English. That parse is a probabilistic operation that will go wrong somewhere.

Commit to structured output: stable field names, a machine-readable status, identifiers the agent can feed into follow-up calls. Prose belongs in one designated summary field, not in the load-bearing structure.

{
  "status": "refunded",
  "refund_id": "rf_8104",
  "amount_cents": 4200,
  "invoice_state": "closed",
  "summary": "Refunded $42.00 against invoice inv_2231."
}

And treat output fields with public-API backward-compatibility discipline. Renaming refund_id to refundId is a breaking change even though no compiler will catch it — the agents consuming it just start failing at whatever step needed the identifier.

An error taxonomy agents can act on

A human reads "Something went wrong, please try again later" and applies judgment. An agent needs the judgment encoded: every error should map to exactly one recovery action.

Class	Example code	Agent action
Transient	`UPSTREAM_TIMEOUT`	Retry with backoff, same idempotency key
Invalid input	`INVALID_FIELD: amount_cents`	Fix the named field, retry once
State conflict	`REFUND_EXCEEDS_BALANCE`	Re-read state, re-plan, don't blind-retry
Permission	`ROLE_FORBIDDEN`	Stop and surface to the user
Not found	`INVOICE_NOT_FOUND`	Verify the identifier with a read tool first

Two rules keep the taxonomy honest:

Codes are append-only. Agents and their prompts calcify around them; renames are breaking changes.
Every validation error names the failing field in structured detail. A bare "invalid request" sends an agent into a guess-and-retry loop that looks, in your logs, exactly like an attack.
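
One way to keep the "one error, one recovery action" rule honest is to encode the table as data. This is a sketch only: the codes and actions come from the table above, while the dictionary layout, the action labels, and the `error_response` helper are assumptions made for illustration.

```python
from typing import Optional

# Codes and classes mirror the taxonomy table; the layout is an assumption.
ERROR_CLASS = {
    "UPSTREAM_TIMEOUT": "transient",
    "INVALID_FIELD": "invalid_input",
    "REFUND_EXCEEDS_BALANCE": "state_conflict",
    "ROLE_FORBIDDEN": "permission",
    "INVOICE_NOT_FOUND": "not_found",
}

# Exactly one recovery action per class.
RECOVERY = {
    "transient": "retry_with_backoff_same_idempotency_key",
    "invalid_input": "fix_named_field_then_retry_once",
    "state_conflict": "reread_state_and_replan",
    "permission": "stop_and_surface_to_user",
    "not_found": "verify_identifier_with_read_tool",
}


def error_response(code: str, field: Optional[str] = None) -> dict:
    """Build a structured error; validation errors must name their field."""
    if code not in ERROR_CLASS:
        raise ValueError(f"unknown error code: {code}")
    if code == "INVALID_FIELD" and not field:
        raise ValueError("INVALID_FIELD must name the failing field")
    error_class = ERROR_CLASS[code]
    body = {"code": code, "class": error_class,
            "recovery": RECOVERY[error_class]}
    if field:
        body["field"] = field
    return body
```

Because new codes have to be added to `ERROR_CLASS` before they can be returned, an unclassified error fails loudly in development instead of reaching an agent.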

Side effects: flag them or agents will find them

Classify every tool three ways: read-only or mutating, idempotent or not, reversible or destructive. MCP has native annotations for exactly this — readOnlyHint, destructiveHint, idempotentHint — and your spec should decide their values deliberately rather than defaulting them. Client applications use these hints to decide when to ask a human for confirmation; a mislabeled tool inherits whatever trust the label promised.

Any mutating tool an agent may retry needs an idempotency_key input that makes duplicate calls return the original result.
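
A minimal in-memory sketch of that behavior follows. The `RefundTool` class and the `IDEMPOTENCY_KEY_REUSED` code are hypothetical, and a real server would persist keys in its database rather than in a dictionary.

```python
class RefundTool:
    """In-memory sketch: a replayed idempotency key returns the first result."""

    def __init__(self):
        self._seen = {}      # idempotency_key -> (request, result)
        self.executions = 0  # how many refunds were actually performed

    def refund(self, idempotency_key: str, invoice_id: str,
               amount_cents: int) -> dict:
        request = (invoice_id, amount_cents)
        if idempotency_key in self._seen:
            first_request, first_result = self._seen[idempotency_key]
            if first_request != request:
                # Same key, different arguments: refuse rather than guess.
                return {"status": "error", "code": "IDEMPOTENCY_KEY_REUSED"}
            return first_result
        self.executions += 1
        result = {
            "status": "refunded",
            "refund_id": f"rf_{self.executions}",
            "amount_cents": amount_cents,
        }
        self._seen[idempotency_key] = (request, result)
        return result
```

Rejecting a reused key with different arguments is a design choice in this sketch, not something the contract above requires. It keeps a confused agent from believing a second, different refund went through.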

For genuinely destructive operations — deletion, irreversible sends, money movement — the strongest contract clause is a two-step shape: a dry_run mode that returns what would happen plus a confirmation token, and an execute mode that requires that token. That turns "the agent deleted the wrong thing" from an incident into a rejected call.
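
The two-step shape can be sketched in a few lines. Everything named here is an assumption for illustration: the `ConfirmedDelete` class, the `CONFIRMATION_REQUIRED` and `CONFIRMATION_INVALID` codes, and the 600-second token lifetime. The clock is injectable so the expiry path can be tested.

```python
import secrets
import time


class ConfirmedDelete:
    """Dry run issues a single-use token; execution requires that token."""

    def __init__(self, ttl_seconds: float = 600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._tokens = {}   # token -> (target_id, issued_at)
        self.deleted = []

    def call(self, target_id: str, dry_run: bool = True,
             confirm_token=None) -> dict:
        if dry_run:
            token = secrets.token_urlsafe(16)
            self._tokens[token] = (target_id, self._clock())
            return {"status": "preview", "would_delete": target_id,
                    "confirm_token": token}
        # pop() makes every token single-use, even when the call is rejected.
        issued = self._tokens.pop(confirm_token, None)
        if issued is None:
            return {"status": "rejected", "code": "CONFIRMATION_REQUIRED"}
        token_target, issued_at = issued
        expired = self._clock() - issued_at > self._ttl
        if token_target != target_id or expired:
            return {"status": "rejected", "code": "CONFIRMATION_INVALID"}
        self.deleted.append(target_id)
        return {"status": "deleted", "target": target_id}
```

Binding the token to the target matters as much as the expiry: a token issued for one record cannot be spent on another.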

Putting it together: a refund tool contract

tool: refund_invoice
purpose: refund a failed or disputed invoice, once
inputs: invoice_id (pattern inv_*), amount_cents (1..500000),
        idempotency_key (uuid, required), dry_run (bool, default true)
output: status, refund_id, amount_cents, invoice_state, summary
errors: INVALID_FIELD, INVOICE_NOT_FOUND, REFUND_EXCEEDS_BALANCE,
        ROLE_FORBIDDEN, UPSTREAM_TIMEOUT
side effects: mutating, idempotent via key, destructive
              (readOnlyHint=false, destructiveHint=true,
               idempotentHint=true)
policy: dry_run=true returns preview + confirm_token;
        execution requires a confirm_token issued within 10 minutes
version: v2 (v1 accepted amount as dollars; removed 2026-06)

Every line answers a question an agent would otherwise answer by guessing. The version line is not decoration: the dollars-to-cents change is exactly the kind of silent semantic shift agent clients cannot detect and cannot survive.

The uncomfortable truth about agent-facing APIs: your real interface is not the code. It's the beliefs your contract induces in a language model. Write the contract first, and those beliefs are at least ones you chose.

The full version — with the versioning/review-gate process and a deeper error-taxonomy method — is on spec-coding.dev. There's also a free browser-based API spec generator that produces this contract structure from a plain-language description.

Your AI Agents Ship Code Faster Than You Can Review It. Here's the Workflow That Fixes That

guo king — Mon, 20 Jul 2026 02:33:06 +0000

Most teams didn't decide to become AI agent teams. One developer adopted a coding agent, another wired one into CI, and within a quarter half the pull requests were machine-written.

Then the pain starts, and it's a specific kind of pain:

PRs pile up faster than anyone can read them
Two agents solve overlapping problems in incompatible ways
A prompt that lived in one developer's chat session produced a product decision nobody approved

The failure isn't the agents. It's that the team's intent exists nowhere an agent — or a reviewer — can check. Chat transcripts aren't reviewable, aren't reusable, and can't be compared against the final diff.

Before agents, the slowest step in delivery was implementation, so every process optimized for developer throughput. With agents, implementation is nearly free, and two other steps become the constraint: deciding exactly what to build, and confirming the result matches that decision.

Here's the five-stage workflow that moves human effort to those two places. It's tool-agnostic — Claude Code, Cursor, Copilot agents, or a mix — because the contract lives in files, not in any tool's memory.

The workflow at a glance

Stage	Owner	Artifact	Gate to pass
1. Intake	Feature owner	`spec.md` packet	Goal, non-goals, acceptance criteria approved by a human
2. Task split	Tech lead	`tasks.md`	Every task has a write scope and a verifiable outcome
3. Implementation	One agent per task	Branch diff	Diff stays inside the task's write scope
4. Evidence	Same agent	`evidence.md`	Named tests exist and pass
5. Review	Human reviewer	PR with linked artifacts	Checklist review against criteria, not vibes

Stage 1: Turn the ticket into a spec packet

Everything downstream inherits the quality of this step — and it's the step teams most want to skip.

Input: a ticket like "let support admins retry failed exports."
Output: a one-page packet that fixes four things:

The goal, in observable terms
The non-goals that bound the change
Acceptance criteria a test can check
The evidence a reviewer will demand

The discipline that matters most is non-goals. A line like "no schema changes, no email template edits, no new dependencies" removes the three most common ways agents helpfully expand scope.

Write the packet before choosing which agent implements it. This is where product ambiguity gets resolved by a person instead of leaking into an agent's assumptions.

Stage 2: Split the spec into bounded tasks

A spec packet is still too big to hand to one agent as one prompt. The reliable unit of agent work is a task that a single agent can complete — and a single human can review — in one sitting.

Each task needs four fields beyond its description (write scope, dependencies, acceptance criteria, and evidence):

Task 2 of 3: add retry endpoint
Write scope:
- src/exports/retry.ts
- src/exports/retry.test.ts

Depends on: task 1 (retry state column) merged.

Acceptance:
- POST /api/exports/:id/retry on a failed export returns 202
  and the same retry_id on duplicate calls
- retrying a non-failed export returns 409

Evidence:
- test: export_retry_idempotent
- test: export_retry_wrong_state_conflict

One rule saves you from most multi-agent chaos: tasks that share files must not run in parallel. If two tasks both touch billing/index.ts, merge them or serialize them explicitly. Merge conflicts between agents are not a tooling problem; they're a task-splitting problem.
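
The shared-files rule is easy to check mechanically before any agent starts. A small sketch, assuming write scopes are listed as exact file paths (globs would need expanding first); the `conflicting_tasks` name and the task ids are hypothetical.

```python
from itertools import combinations


def conflicting_tasks(write_scopes: dict) -> list:
    """Return (task_a, task_b, shared_files) for every overlapping pair.

    write_scopes maps a task id to the set of file paths it may edit.
    """
    conflicts = []
    for (a, files_a), (b, files_b) in combinations(
            sorted(write_scopes.items()), 2):
        shared = files_a & files_b
        if shared:
            conflicts.append((a, b, shared))
    return conflicts
```

Any pair this returns should be merged into one task or given an explicit "depends on" line before work begins.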

Stage 3: Run agents in parallel inside write scopes

With bounded tasks, parallelism becomes safe and boring — which is the goal. Give each agent one task, its packet, and an isolated branch or git worktree.

The write scope is the enforcement line: an agent that edits a file outside its scope has its diff rejected before review, no discussion needed. Most modern agents respect an instruction like:

Modify only the files listed under write scope. If the task requires touching anything else, stop and report instead of editing.

That stop-and-report clause matters more than it looks. The most expensive agent failure is not a wrong implementation; it's a quiet decision — an extra column, a renamed function, a new dependency — buried in an otherwise plausible diff. Scope enforcement converts quiet decisions into loud questions a human answers in seconds.
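
The scope check itself can be a few lines run before review. This sketch assumes the list of changed files is already available (for example from `git diff --name-only` against the base branch) and that write scopes are paths or shell-style patterns; the `out_of_scope` helper is hypothetical.

```python
from fnmatch import fnmatchcase


def out_of_scope(changed_files: list, write_scope: list) -> list:
    """Return the changed files that match no entry in the write scope.

    Patterns use shell-style wildcards. Note that "*" in fnmatch also
    matches "/", so "console/exports/*.tsx" covers nested directories too.
    """
    return [
        path for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in write_scope)
    ]
```

A non-empty result rejects the diff automatically, which is what turns a quiet decision into a question for a human.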

Stage 4: Collect evidence before requesting review

An agent's claim that work is complete is worth nothing. Before a task's PR opens, the same agent fills in evidence.md: which named tests were added, their pass output, and which acceptance criterion each test covers.

Evidence is where hallucinated completeness dies. An agent that says "all tests pass" but can't paste the run output for export_retry_idempotent has not finished. The gate is mechanical: the named test either exists and passes, or it doesn't.
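
A sketch of that mechanical gate, assuming the test runner's output has already been parsed into a mapping of test name to outcome. The `evidence_gaps` helper and the outcome strings are illustrative assumptions.

```python
def evidence_gaps(required_tests: list, test_results: dict) -> dict:
    """Map each required test that is missing or not passing to the reason.

    test_results maps a test name to an outcome string such as "passed"
    or "failed". An empty return value means the gate is satisfied.
    """
    gaps = {}
    for name in required_tests:
        outcome = test_results.get(name)
        if outcome is None:
            gaps[name] = "missing"
        elif outcome != "passed":
            gaps[name] = outcome
    return gaps
```

The gate ignores everything the agent says about completeness and looks only at the named tests from the task.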

Stage 5: Review the PR against the checklist

Human review is the scarcest resource in the whole system, so spend it only on what earlier gates can't check: does the diff match the intent, and did anything unstated slip in?

The reviewer confirms the diff stays in scope, maps each change to a task, spot-checks the evidence, and asks the one question no automation can: what did nobody ask for?

A real run: one feature, three agents

Feature: "support admins can retry a failed data export."

spec: exports/retry-failed-export/spec.md
tasks:
  1. retry state column + migration    agent A  (db/migrations, models)
  2. retry endpoint + idempotency      agent B  (src/exports/retry.ts + test)
  3. admin console retry button        agent C  (console/exports/*.tsx + test)
serialization: task 1 merges first; tasks 2 and 3 run in parallel
review: one human, three PRs, ~40 minutes total

The instructive failure: agent B added a helpful retry_count column — outside its write scope. The scope check flagged it before review. The team decided the count was genuinely useful, wrote it into the packet as a follow-up task, and rejected the out-of-scope edit.

That's the workflow doing its job: the good idea survived, the unreviewed decision did not.

What to measure in the first 30 days

Adopt this on one feature stream, not the whole team, and track three numbers:

Review time per agent PR — should fall week over week as artifacts standardize
Out-of-scope edits caught before review — should be non-zero early (the gate works), declining later (the prompts learn)
Rework rate (PRs needing a second implementation pass) — this is the number that convinces skeptics

The steady state is unglamorous: specs are short, tasks are boring, agents are interchangeable, and review is fast. That's what it looks like when humans spend their time on intent and evidence, and machines spend theirs on everything in between.

The full version of this article — with the failure-mode table and links to copy-ready templates for spec packets, task plans, and evidence logs — is on spec-coding.dev. The spec packet generator is free and runs entirely in your browser.

Vibe Coding vs Spec Coding: Same Refund Feature, Built Twice

guo king — Tue, 16 Jun 2026 01:57:41 +0000

Vibe coding is intoxicating. You describe what you want in plain language, the AI writes the code, and ten minutes later you have a working endpoint.

I was sold — until I shipped a refund feature that way and spent the next two weeks patching bugs that a 90-minute spec would have prevented entirely.

This is the side-by-side, using the exact same requirement, so you can see where the gap opens up.

The requirement

An e-commerce platform needs an order refund feature. The PM's brief:

Support full and partial refunds
Call the payment gateway (Stripe-style) to reverse the charge
Track refund status: pending, processing, succeeded, failed
Support agents trigger refunds through an internal tool

Simple enough. Both paths start here.

Path A: vibe coding

The prompt:

"Build me an order refund API in Node.js. Support full and partial refunds. Call a payment gateway to reverse the charge. Track refund status. Use Express and Postgres."

Sixty seconds later: a clean RefundController with createRefund and getRefundStatus. It validates the order exists, checks the amount against the order total, calls paymentGateway.refund(), saves the result. The code looks professional. The happy path works.

Ship it.

Bug #1: the double refund

A support agent clicks refund, the page hangs for a second, they click again. Two refunds go through. No idempotency check.

Fix prompt: "Add a check to prevent duplicate refunds for the same order." The AI adds a query: if a refund for the same order and amount already exists, reject. It works, until it doesn't.

Bug #2: partial refund overflow

A $200 order. Support issues $50, then $80, then $100. Total refunded: $230. The duplicate check only catches exact duplicates, not cumulative amounts.

Fix prompt: "Track cumulative refund amounts and reject refunds that would exceed the order total." The AI adds a SUM(amount) query — but it's not in a transaction with the insert, so two concurrent partials can both pass the check.

Bug #3: gateway timeout

The gateway times out. The refund row sits at processing forever. Support can't retry — the duplicate check blocks them. Did the money actually leave? Nobody knows.

Fix prompt: "Add retry logic for gateway timeouts." The AI adds a retry loop: no exponential backoff, no idempotency key on the gateway call, no cap. The retry can now create a duplicate charge on the gateway side.

Bug #4: the race condition

Two support agents process refunds for the same order simultaneously. Both pass the cumulative check (neither refund is committed yet), both hit the gateway, both succeed. The customer is refunded twice.

Fix prompt: "Add locking…"

Four patches in, each reasonable in isolation, and the architecture is a patchwork: no state machine, no documented invariants, no tests for how the patches interact.

The real cost

The first version took 10 minutes. The four patches took two weeks — investigation, testing, support escalations, and one manual reconciliation against gateway records.

The "fast" approach wasn't fast. It front-loaded the dopamine and back-loaded the pain.

Path B: spec coding

Same requirement. Same AI. Different starting point — 90 minutes writing this before any code:

# Feature: Order Refund Processing

## Goal
Process refunds safely: no over-refund, no duplicate
processing, correct gateway reconciliation.

## Non-Goals
- Customer self-service refund portal (future phase)
- Refund reason analytics
- Automated approval rules

## State Machine
pending → processing → succeeded
pending → processing → failed → pending (retry)

Only ONE refund may be "processing" per order at any time.

## Acceptance Criteria

Given an order with total $200 and $0 previously refunded
When a support agent requests a $50 refund
Then a refund record is created with status "pending"
  And the gateway is called with an idempotency key
  And on gateway success, status moves to "succeeded"
  And the refundable balance is now $150.

Given an order with total $200 and $150 already refunded
When a support agent requests a $75 refund
Then the request is rejected with "exceeds refundable balance"
  And no gateway call is made.

Given a refund in "processing" state
When another refund request arrives for the same order
Then the request is rejected with "refund already in progress"
  And no gateway call is made.

Given a refund in "processing" state
When the gateway times out
Then the status remains "processing"
  And a background job retries with exponential backoff
  And the retry uses the SAME idempotency key
  And after 3 failures, status moves to "failed"
  And an alert goes to the payments team.

## Edge Cases
- Concurrency: SELECT FOR UPDATE on the order row before
  checking refundable balance
- Idempotency: each refund attempt gets a UUID, passed to
  the gateway as the idempotency key
- Precision: all amounts in cents (integer), no floats
- Reconciliation: nightly job compares local records
  against the gateway settlement report

## Rollback Plan
- Feature flag: refund_processing_v2
- Rollback disables new refunds; in-flight ones continue
  via the background job
- Additive schema only — no migration rollback needed

Then the prompt:

"Implement the refund feature described in this spec. Follow the state machine exactly. Use SELECT FOR UPDATE for concurrency control. Include the idempotency key in all gateway calls. All amounts in cents." [paste spec]

The output is structurally different

The AI generates, in the first version:

processRefund wrapped in a transaction with SELECT FOR UPDATE
Cumulative balance check inside the transaction — no race window
Idempotency UUID minted at creation, passed to every gateway call
Background retry with exponential backoff, capped at 3 attempts
State transitions that match the spec's machine exactly

Every bug from Path A is pre-handled. Double refund? The lock plus the idempotency key. Overflow? Balance check in the same transaction as the insert. Timeout? Same-key retry that the gateway treats as safe.
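
To make the "check and insert under one lock" idea concrete, here is an in-memory Python sketch. It is not the generated Node.js code: a `threading.Lock` stands in for SELECT FOR UPDATE on the order row, the `RefundLedger` class is hypothetical, and the one-processing-refund-per-order rule is left out for brevity.

```python
import threading
import uuid


class RefundLedger:
    """One order's refunds; balance check and insert share a single lock."""

    def __init__(self, order_total_cents: int):
        self.order_total_cents = order_total_cents
        self.refunds = []
        self._lock = threading.Lock()  # stand-in for SELECT ... FOR UPDATE

    def request_refund(self, amount_cents: int) -> dict:
        if amount_cents <= 0:
            raise ValueError("amount_cents must be a positive integer")
        with self._lock:
            # Everything not failed counts against the refundable balance.
            committed = sum(r["amount_cents"] for r in self.refunds
                            if r["status"] != "failed")
            if committed + amount_cents > self.order_total_cents:
                return {"status": "rejected",
                        "reason": "exceeds refundable balance"}
            refund = {
                "status": "pending",
                "amount_cents": amount_cents,
                # Minted once per refund and reused for every gateway call.
                "idempotency_key": str(uuid.uuid4()),
            }
            self.refunds.append(refund)
            return refund
```

Replaying the $200 order from Path A against this sketch, the $50 and $80 refunds are accepted and the $100 refund is rejected, because the check sees the cumulative total in cents before anything is inserted.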

Same AI. Same capability. Dramatically different output — because the input was dramatically different. The AI didn't get smarter; it got better constraints.

The honest math

Metric	Vibe coding	Spec coding
Time to first version	10 min	~2.5 hours
Production bugs	4 (one involving real money)	0 in this scenario
Total time to stable	~2 weeks	~half a day

Vibe coding is great for prototypes, internal tools, and anything where a bug costs you a shrug. The moment money, state machines, or concurrency enter the picture, the 90 minutes you "save" by skipping the spec gets repaid at loan-shark interest.

Adapted from the full case study on Spec Coding. The site maintains free spec templates and a browser-based spec packet generator for exactly this workflow.

Acceptance Criteria Your QA Can Run Without Asking You Anything (6 Copyable Examples)

guo king — Tue, 16 Jun 2026 01:57:13 +0000

The acceptance criterion I reach for most is not the happy path. It's the duplicate action: double-clicked submit buttons, repeated webhook deliveries, retried payment calls, imported rows seen twice.

That one scenario exposes more vague thinking than a dozen success cases:

Given a user submits the same request twice within 2 seconds
When both requests reach the server
Then exactly one state change is recorded
  And the second response returns the first operation id
  And the audit log marks it as a replay

If your team can't answer what happens in that scenario, the spec isn't done. Here's the format that forces the answer, plus six examples you can copy.

The anatomy of a criterion that works

Every Given/When/Then has three parts with specific jobs:

Given — the precondition. What must be true before the action occurs?
When — the trigger. What specific event causes the behavior?
Then — the observable outcome. How does someone testing this know it worked?

The quality bar: QA can turn it into a test without asking the author what they meant.

"The system responds quickly" fails the bar — quickly could be 200ms or 5 seconds depending on who's reading. "The API responds within 500ms at p95" passes.

Bad:

"The login should work correctly and handle errors gracefully."

Good:

Given a registered user is on the login page
When the user enters a valid email and correct password and clicks "Sign in"
Then the user is redirected to the dashboard within 2 seconds;
  the session cookie is set with HttpOnly and Secure flags;
  the last_login_at timestamp is updated in the users table.

Three observable outcomes. Zero interpretation needed.

Example 1: Account lockout

Given a registered user with email "jane@example.com"
  and the account is not currently locked
When the user submits an incorrect password 3 times consecutively
  within a 10-minute window
Then the account is locked for 15 minutes;
  subsequent login attempts return HTTP 429 with body
  {"error": "account_locked", "retry_after_seconds": 900};
  the form displays "Account locked. Try again in 15 minutes.";
  a login_lockout event is written to the audit_log table;
  after 15 minutes the account unlocks automatically.

Notice what's pinned down: the window (10 min), the count (3), the duration (15 min), the status code, the error body, the UI copy, and the audit trail.
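
Because every number is pinned down, the criterion translates almost directly into code. The sketch below is one possible reading, for a single account and with time passed in explicitly. Two details are assumptions the criterion does not settle: the third failed attempt itself returns 401 (only later attempts return 429), and `retry_after_seconds` reports the time remaining rather than a fixed 900.

```python
class LockoutTracker:
    """Lockout state for one account; 'now' is a timestamp in seconds."""

    MAX_FAILURES = 3
    WINDOW_SECONDS = 10 * 60
    LOCK_SECONDS = 15 * 60

    def __init__(self):
        self._failures = []        # timestamps of recent consecutive failures
        self._locked_until = None

    def attempt(self, password_ok: bool, now: float) -> dict:
        if self._locked_until is not None and now < self._locked_until:
            return {"http_status": 429, "error": "account_locked",
                    "retry_after_seconds": int(self._locked_until - now)}
        if password_ok:
            self._failures.clear()  # a success breaks the consecutive run
            return {"http_status": 200}
        # Keep only failures inside the 10-minute window, then add this one.
        self._failures = [t for t in self._failures
                          if now - t < self.WINDOW_SECONDS] + [now]
        if len(self._failures) >= self.MAX_FAILURES:
            self._locked_until = now + self.LOCK_SECONDS
            self._failures.clear()
        return {"http_status": 401}
```

Writing it out surfaces exactly the questions the criterion should answer next, such as what the locking attempt itself returns.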

Example 2: Session expiry

Given a user is logged in
  and the session timeout is configured to 30 minutes of inactivity
When the user performs no actions for 30 consecutive minutes
Then the next request returns HTTP 401;
  the session cookie is cleared server-side;
  the user is redirected to /login?reason=session_expired;
  the page displays "Your session expired due to inactivity.";
  unsaved client form data is NOT recoverable
  (known limitation, documented in the UI).

That last line is the underrated move: writing down what the system deliberately does NOT do. It converts a future bug report into a documented decision.

Example 3: Checkout with insufficient stock

Given a customer has 3 units of SKU-2087 in their cart
  and available_quantity for SKU-2087 is now 1
When the customer clicks "Place Order"
Then the order is NOT placed;
  the page shows "Some items are no longer available
  in the requested quantity.";
  the line item shows "Only 1 available — please update quantity.";
  the customer can update to 1 and retry;
  if available_quantity drops to 0 between page load and submit,
  the message reads "SKU-2087 is out of stock"
  and the line item shows a "Remove" button only;
  no payment is captured in ANY insufficient-stock scenario.

The two-tier degradation (low stock vs zero stock) and the final invariant are what make this executable instead of decorative.

Example 4: Rate limiting

Given the rate limit for the "standard" plan is 100 requests/minute
  and key "key_abc123" has made 100 requests in the current window
When the consumer sends request #101 in the same window
Then the response is 429 with body
  {"error": "rate_limit_exceeded", "retry_after_seconds": <remaining>};
  headers include X-RateLimit-Limit, X-RateLimit-Remaining,
  X-RateLimit-Reset, and Retry-After;
  successful responses ALSO carry the X-RateLimit-* headers;
  limits are scoped per API key, not per IP;
  429 responses are NOT counted against the next window.

The last two lines settle the arguments your team would otherwise have in the PR thread.
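
A fixed-window limiter that satisfies those lines fits in one small class. This is a sketch under assumptions: the `FixedWindowLimiter` name is hypothetical, time is passed in explicitly, and `X-RateLimit-Reset` is expressed as seconds until the window resets (some APIs use an epoch timestamp instead).

```python
import math


class FixedWindowLimiter:
    """Per-API-key fixed-window limiter; rejected calls are never counted."""

    def __init__(self, limit: int = 100, window_seconds: int = 60):
        self.limit = limit
        self.window = window_seconds
        self._windows = {}  # api_key -> (window_start, accepted_count)

    def check(self, api_key: str, now: float) -> dict:
        window_start = int(now // self.window) * self.window
        start, count = self._windows.get(api_key, (window_start, 0))
        if start != window_start:
            start, count = window_start, 0  # a new window starts clean
        reset_in = math.ceil(start + self.window - now)
        allowed = count < self.limit
        if allowed:
            count += 1
        self._windows[api_key] = (start, count)
        headers = {
            "X-RateLimit-Limit": str(self.limit),
            "X-RateLimit-Remaining": str(self.limit - count),
            "X-RateLimit-Reset": str(reset_in),
        }
        if allowed:
            return {"status": 200, "headers": headers}
        headers["Retry-After"] = str(reset_in)
        return {"status": 429, "headers": headers,
                "body": {"error": "rate_limit_exceeded",
                         "retry_after_seconds": reset_in}}
```

Keying the state on the API key rather than the caller's IP, and incrementing only on accepted requests, are the two criteria lines expressed as code.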

Example 5: CSV import with duplicates

Given an admin uploads a 10,000-row CSV to /admin/contacts/import
  and 200 rows have emails that already exist
When the import job processes the file
Then 9,800 contacts are created and 200 are skipped (not updated);
  the result page shows Total/Created/Skipped/Errors counts;
  a CSV of skipped rows is downloadable with row_number, email, reason;
  email comparison is case-insensitive;
  a malformed row counts under "Errors" and the rest continue;
  the import is atomic per-row, not per-file — an interruption
  at row 5,000 leaves the first 5,000 committed.

Example 6: Notification retry

Given the service sends an email via the SMTP provider
When the provider returns a transient error (timeout, 5xx, DNS)
Then the notification enters a retry queue with exponential backoff:
  1 min → 5 min → 15 min → 60 min;
  each attempt is logged with attempt_number and next_retry_at;
  after the 4th failure, status = "permanently_failed"
  and an alert fires with the full error history;
  a PERMANENT error (550 mailbox not found) gets NO retries —
  status fails immediately and the user's email_verified flag
  is set to false.

The transient/permanent split is the difference between a retry policy and a retry loop.
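
That split can be sketched as a single decision function. The delays come from the criterion above; the `next_step` helper and its field names are hypothetical, and the sketch reads "the 4th failure" as the failure of the fourth retry, which is an interpretation rather than something the criterion states.

```python
RETRY_DELAYS_MINUTES = [1, 5, 15, 60]


def next_step(error_kind: str, retries_done: int) -> dict:
    """Decide what happens after a failed send.

    error_kind is "transient" or "permanent"; retries_done counts the
    retries already attempted (0 right after the first send fails).
    """
    if error_kind == "permanent":
        # No retries: fail immediately and mark the address unverified.
        return {"status": "failed", "retry_in_minutes": None,
                "set_email_verified": False}
    if retries_done >= len(RETRY_DELAYS_MINUTES):
        return {"status": "permanently_failed", "retry_in_minutes": None,
                "alert": True}
    return {"status": "retrying",
            "attempt_number": retries_done + 1,
            "retry_in_minutes": RETRY_DELAYS_MINUTES[retries_done]}
```

The schedule has a fixed length, so the loop always terminates in either a delivered email or an alert.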

The 5 mistakes that make criteria useless

Too vague — "the system works correctly." Untestable.
Implementation instead of behavior — "uses Redis cache." The criterion should survive a tech-stack change.
Multiple behaviors in one criterion — if it has three Whens, it's three criteria.
Happy path only — the error paths are where the incidents live.
Untestable performance claims — "fast" is not a number. "p95 < 500ms" is.

Steal this blank template

## Acceptance Criteria

### Happy path
Given <precondition>
When <action>
Then <observable outcome>; <outcome>; <outcome>.

### Error handling
Given <precondition>
When <failure trigger>
Then <error response with exact status/message>;
  <state that must NOT change>;
  <what gets logged/alerted>.

### Edge cases
Given <boundary condition: duplicate, concurrent, empty, max>
When <action>
Then <deterministic outcome>.

### Performance
Given <load condition>
When <action>
Then <metric with number and percentile>.

This is a condensed cut of the full guide — all 20 examples across auth, e-commerce, APIs, data processing, and notifications — on Spec Coding. There's also a free Gherkin generator if you want the format scaffolded for you.

'Retry on Error' Is Not a Payment Spec — Write the Failure Matrix Instead

guo king — Tue, 16 Jun 2026 01:56:54 +0000

Most payment specs I've reviewed describe the happy path in three pages and the failure behavior in one sentence: "retry on error."

That sentence is where 80% of the production incidents come from.

A payment workflow spec earns its keep by naming — category by category — exactly what the system does when the card network, the issuer, or the customer refuses to cooperate. Here's the structure I use.

Start with a failure taxonomy, not a flowchart

Before drawing a single box, force the spec to answer one question: what are the categories of failure this workflow can produce? I insist on five, because collapsing them into "error" is what creates the mess:

Network timeout. The processor never answered. The charge may or may not exist. This is the only category where retrying with the same idempotency key is mandatory.
Soft decline. The issuer said no for a recoverable reason — insufficient funds, do-not-honor, expired card. Retry is allowed, but only with customer action or a later attempt window.
Hard decline. Stolen card, pickup card, fraud. Retrying is never correct, and on some networks it increases your risk score. The spec must forbid it.
Fraud review. The processor accepted the auth but punted on a decision. The response is async. The spec must describe the waiting state and the webhook that ends it.
3DS challenge. The issuer demanded Strong Customer Authentication. This is not a failure — it's a branch. The customer sees a redirect or iframe, and the workflow pauses until they finish.

Every downstream decision — retry policy, user messaging, observability — hangs off this five-row matrix. If the taxonomy is wrong, nothing below it will save you.

Retry rules that match the category, not the HTTP status

The rule I write into every payment spec, word for word: retry policy is a function of failure category, not of HTTP status.

A 402 from Stripe can be either a soft decline you should let the customer fix or a hard decline you should never touch again. The spec has to branch on the processor's decline code, not the transport code.

Concretely:

card_declined + decline_code: insufficient_funds → up to three retries spaced by the dunning schedule, each gated on a customer action or a scheduled job.
card_declined + decline_code: stolen_card → permanent flag on the payment method; any subsequent attempt fails closed before hitting the network.
Connection error or 5xx with no response body → immediate retry with the same Idempotency-Key, because the processor may have already charged the card and a fresh key would double-charge.
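
As a sketch, the branching looks like this. The two decline codes are the ones named above; the `retry_policy` function, its policy labels, and the choice to fail closed on unclassified decline codes are assumptions, and the sketch simplifies "5xx with no response body" to any 5xx.

```python
from typing import Optional

# Illustrative classification; a real spec would list every code it handles.
SOFT_DECLINES = {"insufficient_funds"}
HARD_DECLINES = {"stolen_card"}


def retry_policy(http_status: Optional[int],
                 error_code: Optional[str] = None,
                 decline_code: Optional[str] = None) -> str:
    """Pick a retry policy from the failure category, not the HTTP status.

    http_status is None when the connection failed before any response.
    """
    if http_status is None or http_status >= 500:
        # The charge may already exist, so the same key must be reused.
        return "retry_now_same_idempotency_key"
    if error_code == "card_declined":
        if decline_code in HARD_DECLINES:
            return "never_retry_flag_payment_method"
        if decline_code in SOFT_DECLINES:
            return "retry_on_dunning_schedule"
        return "no_retry_until_classified"  # fail closed on unknown codes
    return "no_retry"
```

The same 402 status yields two different policies depending on the decline code, which is the rule in one line of behavior.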

Idempotency keys belong to the attempt, not the request

The single most common mistake I see: idempotency keys scoped to the HTTP request instead of the logical attempt.

If a timeout triggers a retry and the retry generates a new key, the processor treats it as a new charge. The spec must say, in one sentence: the key is minted when the attempt begins and survives every retransmission inside that attempt. A new attempt — a new customer action, a new dunning cycle, a new order — gets a new key. Nothing in between does.

Also spec the key's lifetime. Processors typically honor keys for 24 hours. If a retry crosses that boundary, reconcile against the processor's ledger (list charges by metadata) instead of assuming the retry is safe.
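
A short sketch of attempt-scoped keys, using a fake processor that deduplicates by key. Both classes are hypothetical stand-ins, not any processor's SDK.

```python
import uuid


class FakeProcessor:
    """Stand-in processor that deduplicates charges by idempotency key."""

    def __init__(self):
        self.charges = {}  # idempotency_key -> charge record

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = {
                "charge_id": f"ch_{len(self.charges) + 1}",
                "amount_cents": amount_cents,
            }
        return self.charges[idempotency_key]


class PaymentAttempt:
    """One logical attempt; the key is minted once, when the attempt begins."""

    def __init__(self, amount_cents: int):
        self.amount_cents = amount_cents
        self.idempotency_key = str(uuid.uuid4())

    def send(self, processor: FakeProcessor) -> dict:
        # Every retransmission inside this attempt reuses the same key.
        return processor.charge(self.idempotency_key, self.amount_cents)
```

Calling `send` twice on one attempt yields one charge. Only constructing a new `PaymentAttempt`, which models a new customer action or dunning cycle, produces a new key.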

3DS is a first-class state, not an error

If the spec treats SCA as an error, the frontend will do something stupid like show a red banner while the issuer is mid-challenge.

The spec needs a state called requires_action (or whatever your processor calls it) with explicit transitions: entered when the auth returns a challenge URL, exited when the webhook confirms success or failure.

Spec the two flavors separately:

In-flow challenge: the client SDK mounts the iframe, blocks interaction, resolves.
Redirect challenge: the browser navigates to the issuer's ACS URL and comes back to a return URL you control.

Nail down the return URL, the query params you expect, and what happens if the customer closes the tab mid-challenge. That last case always gets forgotten, and it's the one that produces stuck subscriptions in production.

Auth, capture, and the seven-day cliff

If your workflow does auth-now / capture-later, the spec must call out auth expiry. Most processors auto-void an uncaptured auth at roughly seven days (Stripe is 7, Adyen varies by scheme, some schemes are shorter for debit).

The spec needs to answer: what happens if the fulfillment job runs on day 8? My answer is always the same — require a fresh auth before capture attempts beyond day 5, and treat any capture against an expired auth as a hard failure that opens a new authorization, not a retry.
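The day-5 and day-8 rule can be sketched as a single decision function. The 7-day expiry is an assumption that varies by processor and scheme, and the returned step names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

REAUTH_AFTER = timedelta(days=5)  # fresh auth required beyond this age
AUTH_EXPIRY = timedelta(days=7)   # assumed; varies by processor and scheme


def capture_plan(authorized_at: datetime, now: datetime) -> str:
    """Decide how a capture attempt should proceed given the auth's age."""
    age = now - authorized_at
    if age >= AUTH_EXPIRY:
        # Hard failure: the old auth is gone, so this is a new attempt.
        return "fail_and_open_new_authorization"
    if age > REAUTH_AFTER:
        return "fresh_auth_then_capture"
    return "capture"
```

Because an expired auth starts a new authorization rather than a retry, it also gets a new idempotency key.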

Multi-capture makes this worse. Spec the partial-capture order, whether over-capture is permitted (usually not), and how refund-before-final-capture interacts with the remaining authorized amount. I've watched a team discover at 2am that their "simple" refund reduced the capturable balance to zero and killed the next shipment.

Dunning is a state machine — write it down

For subscription failures, the spec should contain the dunning schedule verbatim, not a vague "we will retry":

Attempt	When	Side effect
1	Immediate, at renewal	—
2	+3 days	Silent retry
3	+7 days	Email sent 24h before the retry
4	+14 days	Final-notice email
Cancel	+21 days	Access revoked at next cycle boundary

Each transition is a row in a state table: previous state, trigger, new state, side effects. Without this table the team re-argues the schedule every quarter.
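Written as data, the schedule above stops being something to re-argue. A minimal sketch using the table's own offsets; the `DunningStep` type and step names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DunningStep:
    name: str
    day_offset: int  # days after the failed renewal
    side_effect: str


# One entry per row of the schedule, in order.
SCHEDULE = [
    DunningStep("attempt_1", 0, "none"),
    DunningStep("attempt_2", 3, "silent retry"),
    DunningStep("attempt_3", 7, "email sent 24h before the retry"),
    DunningStep("attempt_4", 14, "final-notice email"),
    DunningStep("cancel", 21, "access revoked at next cycle boundary"),
]


def next_step(failed_step: str) -> Optional[DunningStep]:
    """Return the step that follows a failed one, or None after cancel."""
    names = [step.name for step in SCHEDULE]
    index = names.index(failed_step)  # raises ValueError on unknown steps
    return SCHEDULE[index + 1] if index + 1 < len(SCHEDULE) else None
```

Changing the schedule then means changing one list in one reviewed file, which is roughly what the state table is for.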

The webhook is the source of truth

Non-negotiable clause: the synchronous response from the processor is advisory. The webhook is the ledger.

Forbid any state transition derived only from the API response — capture confirmed, refund settled, dispute opened, 3DS completed all wait for the corresponding event.

Concrete consequence: the spec needs an outbox or a reconciliation job. If the webhook is delayed, the UI shows "processing" longer than the customer expects. The spec owns that tradeoff and picks a timeout after which a job polls the processor directly. I pick 30 seconds for interactive flows, 15 minutes for background ones.
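The timeout rule is small enough to state as code. A sketch using the 30-second and 15-minute values from the paragraph above; the flow names and function name are hypothetical.

```python
# Seconds to wait for the webhook before a job polls the processor directly.
POLL_AFTER_SECONDS = {"interactive": 30, "background": 15 * 60}


def should_poll_processor(flow: str, seconds_waiting: float,
                          webhook_received: bool) -> bool:
    """True when the reconciliation job should stop waiting and poll."""
    if webhook_received:
        return False  # the webhook already settled the state
    return seconds_waiting >= POLL_AFTER_SECONDS[flow]
```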

Acceptance criteria with a real retry scenario

Given a customer with a Visa ending 4242 and a recurring $29 subscription
When the renewal charge returns card_declined / insufficient_funds
Then the payment is marked past_due
  And attempt 2 is scheduled for +3 days with the same payment method
  And no email is sent on this attempt
  And the customer retains access until the grace period expires

Given the client receives a connection timeout on charge creation
When the client retries within 24 hours
Then it reuses the original Idempotency-Key
  And the processor returns the original charge, not a duplicate

Three metrics the spec has to name

Authorization rate broken down by BIN range and card scheme.
Decline-reason distribution with the processor's raw decline_code preserved — not bucketed into "declined".
3DS drop-off: challenges initiated vs challenges completed.

Miss any of these and you're flying blind the first time a single issuer changes its risk model and tanks your approval rate overnight. I also require a dashboard for webhook lag — the gap between the processor's event timestamp and your ingestion timestamp. A growing lag is usually the earliest signal that the payment pipeline is about to page someone.
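Two of these numbers are simple enough to pin down in the spec itself. A minimal sketch; the function names are hypothetical, and both timestamps are assumed to be timezone-aware.

```python
from datetime import datetime, timezone


def webhook_lag_seconds(event_created: datetime, ingested_at: datetime) -> float:
    """Gap between the processor's event timestamp and our ingestion time."""
    return (ingested_at - event_created).total_seconds()


def three_ds_drop_off(initiated: int, completed: int) -> float:
    """Share of initiated 3DS challenges that were never completed."""
    if initiated == 0:
        return 0.0
    return (initiated - completed) / initiated
```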

The takeaway

A payment spec is not "describe the charge endpoint." It's a failure-handling document with a small happy path attached. Get the taxonomy right, attach a retry rule to each row, treat 3DS as a branch instead of an error, and let the webhook be the source of truth.

This is adapted from the full guide on Spec Coding, where we maintain spec templates and browser-based generators for API contracts, database migrations, and payment workflows.

Why Your API Contract Breaks in Production (And How to Fix It in the Spec)

guo king — Wed, 22 Apr 2026 08:11:48 +0000

The most expensive API bugs I've seen weren't implementation bugs. They were contract bugs — cases where the producer and consumer had different beliefs about what the API promised, and nobody wrote it down before the code was shipped.

Here's the pattern: a team builds an endpoint, the frontend and backend agree on the behavior in a Slack thread or a meeting, and the implementation proceeds. Six months later, a field gets renamed as part of a "minor cleanup," a consumer breaks silently, and the incident postmortem includes the sentence "we thought this was backward compatible."

The root cause is almost never malice or carelessness. It's that the contract was implicit — shared understanding that lived in people's heads rather than in a document that both sides could point to.

What an API contract actually is

An API contract is the explicit agreement between a producer and its consumers about:

What fields exist, what they're named, and what types they carry
What the valid inputs are (required vs. optional, validation rules)
What error codes mean and when each one is returned
What "success" looks like at the HTTP level and the payload level
What the retry behavior is — is this operation idempotent?
What the version policy is — when does a change become breaking, and how much notice will consumers get?

Most teams document some of these things. The ones that cause incidents are usually the last three.

The three contract decisions teams skip

1. Error taxonomy

When an operation fails, what does the consumer do next?

If your API returns 400 for validation errors and 400 for business-rule rejections, the consumer can't distinguish between "fix your input" and "this action is blocked for business reasons." They'll implement retry logic that retries unretryable errors, or they'll show the wrong message to the user.

Before implementation, the spec should define:

400 Bad Request    — invalid input; client should fix and retry
409 Conflict       — valid input, business rule blocks; client should not retry
422 Unprocessable  — valid input, data state prevents action; may resolve over time
503 Unavailable    — transient; client should retry with backoff

This is a design decision. Make it in the spec, not in the error handler.
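On the consumer side, that taxonomy turns into a lookup instead of a guess. A sketch under assumptions: the action names are hypothetical, and treating unlisted codes as "do not retry" is one possible default, not something the taxonomy above specifies.

```python
# Status code -> what the client does next, following the taxonomy above.
CLIENT_ACTION = {
    400: "fix_input_then_retry",
    409: "do_not_retry",
    422: "retry_later",  # data state may resolve over time
    503: "retry_with_backoff",
}


def client_action(status: int) -> str:
    """Map an error status to a client action; unlisted codes never retry."""
    return CLIENT_ACTION.get(status, "do_not_retry")  # assumed default
```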

2. Idempotency

Can the consumer safely retry this request if they don't get a response?

Payment endpoints, state-mutation endpoints, and any operation with side effects need an explicit answer. "The backend should handle it" is not an answer — it's a deferred argument.

Spec-first means writing this down before implementation:

POST /orders
Idempotency: Yes
Key: X-Idempotency-Key header (UUID, client-generated)
Window: 24 hours — duplicate requests within 24h return the original response
After window: treated as new request

Now the consumer knows what to send, the backend knows what to store, and QA knows what to test.
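The backend half of that spec might look like the following. It is a minimal in-memory sketch: `IdempotencyStore` is a hypothetical name, and a real service would need durable storage and protection against concurrent requests with the same key.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, Tuple

WINDOW = timedelta(hours=24)  # from the spec above


class IdempotencyStore:
    """Returns the original response for duplicate keys inside the window."""

    def __init__(self) -> None:
        self._seen: Dict[str, Tuple[datetime, dict]] = {}

    def handle(self, key: str, now: datetime,
               create: Callable[[], dict]) -> dict:
        cached = self._seen.get(key)
        if cached is not None and now - cached[0] < WINDOW:
            return cached[1]  # duplicate: replay the original response
        response = create()   # first use, or the window has passed
        self._seen[key] = (now, response)
        return response
```

QA's test cases fall straight out of it: same key inside 24 hours returns the original order, same key after 24 hours creates a new one.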

3. Breaking change definition

Before you ship your first external-facing version, agree on what counts as breaking:

Breaking (requires version bump, advance notice):

Removing a field
Renaming a field
Changing a field's type
Changing the meaning of an existing status code
Making an optional field required

Non-breaking (can ship without version bump):

Adding a new optional field
Adding a new endpoint
Adding a new enum value (if consumers are built to handle unknown values)
Bug fixes that bring behavior in line with documented behavior

Document this. When "minor cleanup" happens six months from now, the engineer making the change has a checklist, not a judgment call.
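Part of that checklist is mechanical. The sketch below compares two versions of a response schema and reports three of the breaking cases from the list. The `Schema` shape is an assumption made for illustration, and changed status-code semantics cannot be detected this way.

```python
from typing import Dict, List, Tuple

# field name -> (type name, required?)
Schema = Dict[str, Tuple[str, bool]]


def breaking_changes(old: Schema, new: Schema) -> List[str]:
    """List the breaking differences between two schema versions."""
    problems = []
    for name, (old_type, old_required) in old.items():
        if name not in new:
            # A rename looks identical to a removal from the consumer's side.
            problems.append(f"removed or renamed field: {name}")
            continue
        new_type, new_required = new[name]
        if new_type != old_type:
            problems.append(f"type changed: {name} {old_type} -> {new_type}")
        if new_required and not old_required:
            problems.append(f"optional field made required: {name}")
    return problems
```

Adding an optional field produces no findings, which matches the non-breaking list.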

The OpenAPI to CI pipeline

Once your contract is written, the best way to protect it is to make breaking changes fail in CI before they reach a reviewer.

The minimal setup:

Generate OpenAPI spec from your implementation (or write it first and validate against it)
Run openapi-diff or oasdiff in CI — fails the build if a breaking change is detected
Run contract tests against a real consumer — Pact or equivalent; the consumer's expectations are versioned alongside the producer's spec

This doesn't prevent all contract breaks. It prevents the ones that nobody caught because they were in a field rename that "obviously" wouldn't affect anything.

The spec-first version of this workflow

The difference between spec-first API development and regular API development isn't the tools. It's the order:

Regular: Build → Document → Ship → Break consumer → Fix

Spec-first: Define contract → Review contract → Build to contract → Validate in CI → Ship

The contract review is where most of the value lives. It's the moment when the consumer team can say "we rely on that field being optional" before the producer team has already built it as required.

That conversation is almost free before implementation. It's expensive after deployment.

A minimal API contract template

If you don't have a spec process yet, start with this for every new endpoint:

Endpoint: POST /v1/[resource]
Purpose: [one sentence]

Request:
  Required fields: [list]
  Optional fields: [list]
  Validation rules: [list]

Response (success):
  Status: 201
  Fields returned: [list]

Response (errors):
  400: [when, what to tell the consumer]
  409: [when, what to tell the consumer]
  503: [when, consumer should retry with backoff]

Idempotency: [yes/no] — if yes, key mechanism and window
Breaking change policy: [link to shared definition]
Version: v1 — changes that break the above require a new version

That's the minimum. It takes 15 minutes to write. The alternative is a 2am incident.

I write about spec-first delivery and API contracts at spec-coding.dev. The API Contract Checklist covers the full pre-release review process.

I Shipped Fewer Bugs After I Started Writing Specs. Here's the Framework

guo king — Wed, 22 Apr 2026 08:11:29 +0000

Last March, we had a billing incident that cost us three days. A rounding error in our invoice calculation was silently overcharging customers by fractions of a cent. Not enough for anyone to notice on a single invoice. But multiplied across thousands of transactions over two months, it added up. We had to issue refunds, write a post-mortem, and rebuild trust with a handful of enterprise customers who did the math.

The fix took 20 minutes. The actual bug was a Math.floor() where we needed banker's rounding (round half to even), which a bare Math.round() does not provide.

Here's what stung: if anyone had written down "use banker's rounding for currency calculations" before writing the code, this never would have shipped. The developer who wrote it didn't know about our rounding convention. The reviewer didn't think to check. Nobody was wrong. The requirement just lived in one person's head and never made it to the keyboard.
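For readers who haven't met the convention, here is what it looks like in practice. The sketch uses Python's `decimal` module rather than the JavaScript involved in the incident, and the helper names are hypothetical.

```python
from decimal import Decimal, ROUND_FLOOR, ROUND_HALF_EVEN

CENT = Decimal("0.01")


def round_currency(amount: Decimal) -> Decimal:
    """Banker's rounding: ties go to the nearest even cent."""
    return amount.quantize(CENT, rounding=ROUND_HALF_EVEN)


def floor_currency(amount: Decimal) -> Decimal:
    """Always rounds down, which biases every total in one direction."""
    return amount.quantize(CENT, rounding=ROUND_FLOOR)
```

With these, `round_currency(Decimal("2.345"))` gives 2.34 and `round_currency(Decimal("2.355"))` gives 2.36, so ties split evenly over many invoices, while `floor_currency(Decimal("2.355"))` gives 2.35 and always errs the same way.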

That incident changed how our team builds software.

What spec-first actually means

Spec-first development is simple: before you write code, you write a short document describing what you're building, what you're not building, and how you'll know it works.

It's not waterfall. It's not Big Design Up Front. It's 10 minutes of structured thinking in a text file before you open your editor. Think of it as a conversation with your future self and your reviewers, except you have it before you've sunk 8 hours into an implementation.

The spec travels with the PR. It's a living artifact, not a bureaucratic gate.

A real example: CRM contact deduplication

One of our team members picked up a ticket to build contact deduplication for our internal CRM. On the surface, it seemed straightforward: find duplicate contacts, merge them. Here's the abbreviated spec she wrote before starting:

## Goal
Detect and merge duplicate contacts in the CRM based on email
and phone number matching.

## Non-goals
- Fuzzy name matching (phase 2)
- Automated merging without user confirmation
- Deduplication of company records

## Acceptance Criteria
- Two contacts with the same email (case-insensitive) are flagged as duplicates
- Two contacts with the same phone number (normalized to E.164) are flagged
- User sees a side-by-side comparison and chooses which fields to keep
- Merge preserves all activity history from both records
- Merge is reversible for 30 days

## Edge Cases
- Contact A has email X, Contact B has email X and email Y.
  What happens to email Y after merge?
- Three-way duplicates: A matches B, B matches C, but A doesn't match C
- Contact with 500+ activities: does the merge UI choke?
- Merged contact is referenced in active automation workflows

That spec took about 12 minutes to write. But look at what it caught before a single line of code was written:

The three-way duplicate problem. Without the spec, the developer would have built a pairwise merge and discovered the transitive matching issue during QA. Or worse, in production when a user tried to merge a chain of duplicates and got inconsistent results.
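One possible way to handle the chain is to group contacts transitively instead of merging pair by pair. The sketch below uses a union-find structure over pairwise matches. Whether A and C should really land in the same merge group is a decision for the spec, and the function name is hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def duplicate_clusters(pairs: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Group contact IDs so that chained matches end up in one cluster."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)  # union the two groups

    clusters: Dict[str, Set[str]] = {}
    for contact in list(parent):
        clusters.setdefault(find(contact), set()).add(contact)
    return list(clusters.values())
```

Given the matches A-B and B-C, this returns a single cluster containing A, B, and C, which the merge UI can then present as one review instead of two conflicting ones.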

The automation workflow reference. Contacts linked to active automations needed special handling. That requirement wasn't in the ticket. It came out of the developer thinking through edge cases while writing the spec. Without it, merging a contact mid-automation would have broken running workflows.

The reversibility requirement. The original ticket said nothing about undo. But when you write "acceptance criteria," you start thinking about what happens when things go wrong. Thirty days of reversibility turned out to be a key requirement from the customer success team, and the developer discovered this during a quick spec review before starting.

The template

After six months of iteration, our team settled on a lightweight template. Here's the version we use for any feature that takes more than a couple of hours:

# Feature: [Name]
**Author:** [Your name]
**Date:** [Today]
**Status:** Draft / In Review / Approved

## Goal
One paragraph. What are we building and why?

## Non-goals
What are we explicitly NOT doing? (This prevents scope creep.)

## Acceptance Criteria
Numbered list. How do we know this is done?

## Edge Cases
Bullet list. What weird inputs, states, or timing issues could break this?

## Rollout
How does this get to production? Feature flag? Percentage rollout?
Who monitors it?

That's it. Five sections. One page. The non-goals section alone has saved us more rework than any other practice we've adopted. When someone suggests "hey, we should also handle X" during implementation, we check the non-goals list. If X is there, the answer is "not this PR." If it's not there and it should be, we update the spec and have that conversation before writing more code.

I've published the full template with examples if you want to grab it.

Six months of results

We started using specs consistently in October 2025 across a team of eight engineers. Here's what changed by April 2026:

P1 incidents dropped from ~3/month to ~1/month. Not all of that is attributable to specs. We also improved our CI pipeline. But the incidents we did have were operational (infrastructure, scaling) rather than "we built the wrong thing" or "we missed an edge case."

Code review cycles got shorter. Average time from PR open to merge went from 2.1 days to 1.4 days. Reviewers stopped asking "what is this supposed to do?" because the spec was right there. Reviews focused on implementation quality instead of requirements discovery.

Junior engineers ramped faster. Two new hires in January were writing production specs within their second week. The specs gave them a structured way to demonstrate understanding before writing code, and gave senior engineers a concrete artifact to review instead of trying to evaluate whether someone "gets it" from a Slack conversation.

Scope creep became visible. When a feature grew beyond its original spec, we could point to the document and say "this is new scope. Do we want to update the spec or save it for a follow-up?" Before specs, scope crept silently until the PR was twice the expected size.

The biggest surprise: specs made estimation better. When you force yourself to list acceptance criteria and edge cases upfront, you have a much clearer picture of the actual work. Our sprint velocity predictions got noticeably more accurate, not because we got better at estimating, but because we got better at understanding what we were estimating.

Getting started

If you've never written a spec before, don't try to change your whole team overnight. Start with yourself, on one feature. Pick something you're about to build that has at least a couple of moving parts. Spend 10 minutes filling in the template above. Share it with whoever would review your PR and ask: "Does this match your understanding?"

That one conversation will tell you whether spec-first development is worth adopting more broadly. In my experience, it always is.

I've written a complete guide to spec-first development that covers the philosophy, the process, and the common objections. And the template is free to download. Take it, modify it, make it yours.

Daniel Marsh is a software engineer and the author of spec-coding.dev, a resource for teams adopting spec-first development.

10 Software Spec Mistakes That Cause Production Incidents (With Fixes)

guo king — Wed, 22 Apr 2026 08:10:58 +0000

After 12 years in B2B SaaS — and too many postmortems — I've noticed that most production incidents trace back to a decision that wasn't made in the spec. Not a coding error. A specification gap.

Here are the 10 mistakes I see most often, what the symptom looks like, and how to fix each one before implementation starts.

1. Acceptance criteria that can't be tested

Symptom: QA closes the ticket as "passed" but the feature behaves differently than product expected.

The mistake:

The user should see a confirmation message after submitting.

The fix:

Given a valid form submission, when the user clicks Submit, then a green banner appears at the top of the page with the text "Your changes have been saved" and remains visible until the user navigates away or dismisses it.

Acceptance criteria are a contract between product and QA. If QA has to ask the author what "confirmation" means, the spec didn't do its job.

2. Non-goals that don't name anything

Symptom: Engineering builds something adjacent to the spec because the boundary wasn't explicit.

The mistake:

Out of scope: internationalization and advanced settings.

The fix:

Out of scope for this release: (1) translated UI strings — English only; (2) per-user notification preferences — all users receive the same defaults; (3) bulk operations — only single-record edits are supported. A reviewer can reject this change if any of these appear in the implementation.

The non-goal should be specific enough that a reviewer can use it to push back.

3. No named decision owner

Symptom: Scope creep during implementation because nobody was authorized to say "that's out."

The fix: Every spec should have one named person who can approve scope changes. Not a committee. One person. If that person is unavailable, a named backup.

4. "Handled elsewhere" for failure paths

Symptom: An edge case hits production and ops discovers there's no fallback, no log, and no rollback path.

The mistake:

Error handling will be managed by the existing error middleware.

The fix: Name the specific failure modes and what happens in each:

If the payment processor times out after 10 seconds: return HTTP 503, log payment.timeout with order ID, do NOT charge the card, surface "Payment unavailable — please try again" to the user. Do not retry automatically.

5. Missing rollback definition

Symptom: A deployment goes wrong and nobody agrees on what "rollback" means, so the incident runs 3x longer than it should.

The fix: Before implementation, answer: if this change is reverted, what exactly happens?

Code revert only?
Config flag flip?
Database migration that needs to be reversed?
Data that needs to be repaired?

If rollback requires data repair, that repair script should be written before the feature ships.

6. Acceptance criteria that only cover the happy path

Symptom: QA testing passes 100%, but 3 edge cases surface in production week one.

The fix: For every main-flow scenario, write at least one failure scenario and one boundary scenario.

Happy path: Given a valid user, when they submit, then success.
Failure path: Given an expired session, when they submit, then redirect to login with the form state preserved in session.
Boundary: Given a user with exactly 0 remaining credits, when they try to submit, then show the upgrade prompt before the form is processed.

7. Ambiguous authorization rules

Symptom: A user can access or modify something they shouldn't be able to. Or they can't do something they should be able to.

The mistake:

Only authorized users can edit this record.

The fix:

Edit access requires: (1) the user is the record owner, OR (2) the user has the admin role in the same organization. Users with viewer role can read but not edit. Requests from users outside the record's organization are rejected with 403, regardless of their role.
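A rule written that precisely translates almost line for line into code. In this sketch the `User` and `Record` types are hypothetical, and checking the organization first (so an owner who has left the organization is rejected) is an assumption the spec would need to confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    id: str
    org_id: str
    role: str  # for example "admin" or "viewer"


@dataclass(frozen=True)
class Record:
    owner_id: str
    org_id: str


def can_edit(user: User, record: Record) -> bool:
    """Owner or same-org admin may edit; anyone outside the org may not."""
    if user.org_id != record.org_id:
        return False  # the API layer answers 403, regardless of role
    return user.id == record.owner_id or user.role == "admin"
```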

8. No stop-loss threshold for the release

Symptom: A feature causes elevated error rates after deployment. Nobody knows whether to roll back or wait, so the on-call engineer makes a judgment call at 1am.

The fix: Name the threshold before the release:

Roll back automatically if: error rate on /checkout exceeds 2% over a 5-minute window, OR payment processor timeout rate exceeds 5%, OR any P0 alert fires within 2 hours of deployment. The on-call engineer does not need approval to roll back if these thresholds are hit.
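Thresholds like these are easy to encode so the 1am decision is already made. A sketch with hypothetical parameter names; it assumes the caller has already computed the rates over the 5-minute window.

```python
def should_roll_back(checkout_error_rate: float,
                     processor_timeout_rate: float,
                     p0_alert_fired: bool,
                     hours_since_deploy: float) -> bool:
    """Apply the stop-loss thresholds named in the spec. Rates are fractions."""
    return (
        checkout_error_rate > 0.02
        or processor_timeout_rate > 0.05
        or (p0_alert_fired and hours_since_deploy <= 2)
    )
```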

9. "Friendly" or "reasonable" as a spec term

Symptom: Engineering and design interpret "friendly error message" differently. Or "reasonable performance" means 200ms to one person and 2 seconds to another.

The fix: Replace every vague term with a testable one.

"friendly" -> specific message text or message category
"reasonable" -> explicit metric with threshold
"fast" -> p95 latency under X ms at Y concurrent users
"handled" -> specific behavior enumerated

10. Spec written after the code

Symptom: The spec reads like documentation of what was built, not a description of what should be built. Nobody reviewed it before implementation.

The fix: This one is structural, not a wording fix. The spec needs to exist — and be reviewed — before implementation starts. The review is where the expensive conversations happen cheaply.

The spec's job is not to describe decisions after they're made. It's to force decisions before they're expensive.

The common thread

Every mistake on this list has the same root cause: a decision that should have been made explicitly in the spec was left implicit, and the team discovered the gap somewhere more expensive — in review, in testing, or in production.

Spec-first development doesn't mean longer documents. It means making the right decisions earlier, in writing, where they can be challenged before they're coded.

I publish practical spec-first guides and free templates at spec-coding.dev. The Spec Review Checklist covers the pre-implementation review in detail.

I Shipped Fewer Bugs After I Started Writing Specs — Here's the Framework

guo king — Wed, 01 Apr 2026 09:48:53 +0000

Last March, we had a billing incident that cost us three days. A rounding error in our invoice calculation was silently overcharging customers by fractions of a cent. Not enough for anyone to notice on a single invoice — but multiplied across thousands of transactions over two months, it added up. We had to issue refunds, write a post-mortem, and rebuild trust with a handful of enterprise customers who did the math.

The fix took 20 minutes. The actual bug was a Math.floor() where we needed banker's rounding (round half to even), which a bare Math.round() does not provide.

Here's what stung: if anyone had written down "use banker's rounding for currency calculations" before writing the code, this never would have shipped. The developer who wrote it didn't know about our rounding convention. The reviewer didn't think to check. Nobody was wrong — the requirement just lived in one person's head and never made it to the keyboard.

That incident changed how our team builds software.

What spec-first actually means

Spec-first development is simple: before you write code, you write a short document describing what you're building, what you're not building, and how you'll know it works.

It's not waterfall. It's not Big Design Up Front. It's 10 minutes of structured thinking in a text file before you open your editor. Think of it as a conversation with your future self and your reviewers — except you have it before you've sunk 8 hours into an implementation.

The spec travels with the PR. It's a living artifact, not a bureaucratic gate.

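A real example: CRM contact deduplication

One of our team members picked up a ticket to build contact deduplication for our internal CRM. On the surface, it seemed straightforward: find duplicate contacts, merge them. Here's the abbreviated spec she wrote before starting: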

## Goal
Detect and merge duplicate contacts in the CRM based on email
and phone number matching.

## Non-goals
- Fuzzy name matching (phase 2)
- Automated merging without user confirmation
- Deduplication of company records

## Acceptance Criteria
- Two contacts with the same email (case-insensitive) are flagged as duplicates
- Two contacts with the same phone number (normalized to E.164) are flagged
- User sees a side-by-side comparison and chooses which fields to keep
- Merge preserves all activity history from both records
- Merge is reversible for 30 days

## Edge Cases
- Contact A has email X, Contact B has email X and email Y —
  what happens to email Y after merge?
- Three-way duplicates: A matches B, B matches C, but A doesn't match C
- Contact with 500+ activities — does the merge UI choke?
- Merged contact is referenced in active automation workflows

That spec took about 12 minutes to write. But look at what it caught before a single line of code was written:

The three-way duplicate problem. Without the spec, the developer would have built a pairwise merge and discovered the transitive matching issue during QA — or worse, in production when a user tried to merge a chain of duplicates and got inconsistent results.

The reversibility requirement. The original ticket said nothing about undo. But when you write "acceptance criteria," you start thinking about what happens when things go wrong. Thirty days of reversibility turned out to be a key requirement from the customer success team — the developer discovered this during a quick spec review before starting.

The template

After six months of iteration, our team settled on a lightweight template. Here's the version we use for any feature that takes more than a couple of hours:

# Feature: [Name]
**Author:** [Your name]
**Date:** [Today]
**Status:** Draft / In Review / Approved

## Goal
One paragraph. What are we building and why?

## Non-goals
What are we explicitly NOT doing? (This prevents scope creep.)

## Acceptance Criteria
Numbered list. How do we know this is done?

## Edge Cases
Bullet list. What weird inputs, states, or timing issues could break this?

## Rollout
How does this get to production? Feature flag? Percentage rollout?
Who monitors it?

I've published the full template with examples if you want to grab it.

Six months of results

We started using specs consistently in October 2025 across a team of eight engineers. Here's what changed by April 2026:

P1 incidents dropped from ~3/month to ~1/month. Not all of that is attributable to specs — we also improved our CI pipeline. But the incidents we did have were operational (infrastructure, scaling) rather than "we built the wrong thing" or "we missed an edge case."

Junior engineers ramped faster. Two new hires in January were writing production specs within their second week. The specs gave them a structured way to demonstrate understanding before writing code — and gave senior engineers a concrete artifact to review instead of trying to evaluate whether someone "gets it" from a Slack conversation.

Scope creep became visible. When a feature grew beyond its original spec, we could point to the document and say "this is new scope — do we want to update the spec or save it for a follow-up?" Before specs, scope crept silently until the PR was twice the expected size.

The biggest surprise: specs made estimation better. When you force yourself to list acceptance criteria and edge cases upfront, you have a much clearer picture of the actual work. Our sprint velocity predictions got noticeably more accurate — not because we got better at estimating, but because we got better at understanding what we were estimating.

Getting started

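If you've never written a spec before, don't try to change your whole team overnight. Start with yourself, on one feature. Pick something you're about to build that has at least a couple of moving parts. Spend 10 minutes filling in the template above. Share it with whoever would review your PR and ask: "Does this match your understanding?"

That one conversation will tell you whether spec-first development is worth adopting more broadly. In my experience, it always is.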

I've written a complete guide to spec-first development that covers the philosophy, the process, and the common objections. And the template is free to download — take it, modify it, make it yours.

Daniel Marsh is a software engineer and the author of spec-coding.dev, a resource for teams adopting spec-first development.

What Is Spec-First Development? (Complete Guide)

guo king — Sat, 14 Mar 2026 02:43:23 +0000

Originally published at spec-coding.dev

Spec-first development becomes clearer when the team makes the hidden
decisions visible before coding starts. This article focuses on those
decisions and why they matter in delivery.

## 1. Why Teams Usually Get This Wrong

The problem appears the moment more than one role depends on the answer.
Product wants speed, engineering wants boundaries, QA wants testability, and
operations wants something that can be rolled back without improvisation. If
the spec never resolves those tensions, the work moves downstream as rework.

The real decision behind spec-first is ownership. Who approves the
boundary? Who validates acceptance criteria? Who can stop a rollout? Without
those answers in writing, teams ship with blurred accountability.

## 2. A Concrete Delivery Situation

Test your spec against one realistic delivery path. Ask:

What can the reviewer reject?
What can the tester verify?
What can the operator roll back?

If the document cannot answer those three angles, it is not ready. Stop asking
whether the team "understands the idea" and start asking whether the written
spec would survive handoff.

## 3. What the Spec Needs to Say Out Loud

A stronger spec is not longer by accident — it is more explicit exactly where
projects tend to drift. At minimum:

The intended outcome, explicit non-goals, and the decision owner for scope changes
Acceptance criteria that can be judged pass or fail without tribal knowledge
Failure behavior, fallback paths, and which logs or metrics will surface them after release

## 4. Acceptance Criteria That Remove Guesswork

Given the approved scope and dependencies
When the team executes the primary flow
Then success can be verified with a concrete result and observable evidence

Given an exception, retry, or permission boundary is hit
When the system takes the fallback path
Then the user-facing behavior and operational response remain explicit

The goal is not to force every team into the same wording. The goal is to
force decisions about input, trigger, and expected behavior while there is
still time to change course cheaply.

## 5. Review Questions Before Implementation Starts

Which decision would still be ambiguous to a reviewer seeing the change for the first time?
Can QA derive the main-flow, failure-path, and regression cases without interviewing the author?
If the release fails, does the document tell operations what to watch and when to stop?

If you cannot answer these without a side conversation, the draft is still
carrying unpriced uncertainty.

## 6. Rollout, Monitoring, and Rollback

Good specs do not stop at merge readiness. They tell the team how to release
safely:

Stage to the smallest useful audience first — don't treat review approval as a substitute for runtime evidence
Name the stop-loss threshold before release: error rate, latency, data mismatch, or override volume
Record what rollback actually means: code revert, config switch, job pause, or data repair

## 7. Common Mistakes

Treating scope as obvious and leaving the real policy decision for implementation review
Hiding risky edge behavior behind vague phrases like reasonable, friendly, or handled elsewhere
Publishing a document that sounds complete but never states how success or rollback will be judged

## 8. When to Go Deep vs. Stay Lightweight

The right amount of detail is the amount that prevents expensive invention
later. If a missing sentence would force engineering, QA, or operations to
guess — that sentence belongs in the spec.

What matters is not document length. What matters is whether the document
prevents the team from rediscovering the same decision in implementation,
testing, and release review.

## 9. Final Takeaway

Spec-first development becomes easier once the spec names the risky decisions
early enough for someone to challenge them. That is the practical value:
less improvisation, fewer surprises in review, and cleaner evidence when it
is time to ship.

Daniel Marsh is a senior software engineer with 12 years of experience,
specializing in spec-first development. More articles at
spec-coding.dev.