
guo king

Posted on • Originally published at spec-coding.dev

10 Software Spec Mistakes That Cause Production Incidents (With Fixes)

After 12 years in B2B SaaS — and too many postmortems — I've noticed that most production incidents trace back to a decision that wasn't made in the spec. Not a coding error. A specification gap.

Here are the 10 mistakes I see most often, what the symptom looks like, and how to fix each one before implementation starts.


1. Acceptance criteria that can't be tested

Symptom: QA closes the ticket as "passed" but the feature behaves differently than product expected.

The mistake:

The user should see a confirmation message after submitting.

The fix:

Given a valid form submission, when the user clicks Submit, then a green banner appears at the top of the page with the text "Your changes have been saved" and remains visible until the user navigates away or dismisses it.

Acceptance criteria are a contract between product and QA. If QA has to ask the author what "confirmation" means, the spec didn't do its job.


2. Non-goals that don't name anything

Symptom: Engineering builds something adjacent to the spec because the boundary wasn't explicit.

The mistake:

Out of scope: internationalization and advanced settings.

The fix:

Out of scope for this release: (1) translated UI strings — English only; (2) per-user notification preferences — all users receive the same defaults; (3) bulk operations — only single-record edits are supported. A reviewer can reject this change if any of these appear in the implementation.

The non-goal should be specific enough that a reviewer can use it to push back.


3. No named decision owner

Symptom: Scope creep during implementation because nobody was authorized to say "that's out."

The fix: Every spec should have one named person who can approve scope changes. Not a committee. One person. If that person is unavailable, a named backup.


4. "Handled elsewhere" for failure paths

Symptom: An edge case hits production and ops discovers there's no fallback, no log, and no rollback path.

The mistake:

Error handling will be managed by the existing error middleware.

The fix: Name the specific failure modes and what happens in each:

If the payment processor times out after 10 seconds: return HTTP 503, log payment.timeout with order ID, do NOT charge the card, surface "Payment unavailable — please try again" to the user. Do not retry automatically.
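That failure mode translates almost directly into code. An illustrative sketch, where `call_processor` and `PaymentTimeout` are hypothetical stand-ins for your real payment client:

```python
import logging

logger = logging.getLogger("payments")

PAYMENT_TIMEOUT_SECONDS = 10  # from the spec

class PaymentTimeout(Exception):
    """Raised by the (hypothetical) processor client on timeout."""

def charge(order_id: str, call_processor) -> tuple[int, str]:
    """Charge flow implementing the spec'd failure mode.

    Spec: on timeout, return 503, log payment.timeout with the order ID,
    do NOT charge the card, do NOT retry automatically.
    """
    try:
        call_processor(order_id, timeout=PAYMENT_TIMEOUT_SECONDS)
    except PaymentTimeout:
        logger.error("payment.timeout order_id=%s", order_id)
        return 503, "Payment unavailable — please try again"
    return 200, "Payment accepted"
```

Note what the spec made unnecessary: nobody has to decide at review time whether a retry is safe, because the spec already said no.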


5. Missing rollback definition

Symptom: A deployment goes wrong and nobody agrees on what "rollback" means, so the incident runs 3x longer than it should.

The fix: Before implementation, answer: if this change is reverted, what exactly happens?

  • Code revert only?
  • Config flag flip?
  • Database migration that needs to be reversed?
  • Data that needs to be repaired?

If rollback requires data repair, that repair script should be written before the feature ships.


6. Acceptance criteria that only cover the happy path

Symptom: QA signs off with 100% of test cases passing, but three edge cases surface in production during the first week.

The fix: For every main-flow scenario, write at least one failure scenario and one boundary scenario.

Happy path: Given a valid user, when they submit, then success.
Failure path: Given an expired session, when they submit, then redirect to login with the form state preserved in session.
Boundary: Given a user with exactly 0 remaining credits, when they try to submit, then show the upgrade prompt before the form is processed.
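All three scenarios can live in one small model. A hypothetical sketch (names are illustrative):

```python
def handle_submit(session_valid: bool, credits: int) -> str:
    """Toy handler covering the happy, failure, and boundary scenarios."""
    if not session_valid:
        # Failure path: redirect to login, preserving form state in session
        return "redirect:/login?preserve_form=1"
    if credits == 0:
        # Boundary: exactly 0 credits shows the upgrade prompt first
        return "show:upgrade_prompt"
    return "success"  # Happy path

assert handle_submit(session_valid=True, credits=5) == "success"
assert handle_submit(session_valid=False, credits=5).startswith("redirect:/login")
assert handle_submit(session_valid=True, credits=0) == "show:upgrade_prompt"
```

The point is the ratio: one happy-path check alone would have passed while both production bugs shipped.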

7. Ambiguous authorization rules

Symptom: A user can access or modify something they shouldn't be able to. Or they can't do something they should be able to.

The mistake:

Only authorized users can edit this record.

The fix:

Edit access requires: (1) the user is the record owner, OR (2) the user has the admin role in the same organization. Users with viewer role can read but not edit. Requests from users outside the record's organization are rejected with 403, regardless of their role.
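A rule written that precisely maps one-to-one onto a function. A minimal sketch, with hypothetical `User` and `Record` shapes:

```python
from dataclasses import dataclass

@dataclass
class User:
    id: str
    org: str
    role: str  # e.g. "admin", "viewer", "member"

@dataclass
class Record:
    owner_id: str
    org: str

def can_edit(user: User, record: Record) -> tuple[bool, int]:
    """Returns (allowed, http_status), mirroring the spec'd rule."""
    if user.org != record.org:
        return False, 403   # outside the record's org: rejected regardless of role
    if user.id == record.owner_id:
        return True, 200    # rule (1): record owner
    if user.role == "admin":
        return True, 200    # rule (2): admin in the same organization
    return False, 403       # viewers and everyone else: read-only
```

Notice that the vague version ("only authorized users") gives you nothing to write here; the precise version writes the function for you.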


8. No stop-loss threshold for the release

Symptom: A feature causes elevated error rates after deployment. Nobody knows whether to roll back or wait, so the on-call engineer makes a judgment call at 1am.

The fix: Name the threshold before the release:

Roll back automatically if: error rate on /checkout exceeds 2% over a 5-minute window, OR payment processor timeout rate exceeds 5%, OR any P0 alert fires within 2 hours of deployment. The on-call engineer does not need approval to roll back if these thresholds are hit.
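Those thresholds can be encoded so a deploy pipeline (or the on-call engineer) evaluates them mechanically. A sketch using the numbers from the example above; the metric names are hypothetical:

```python
# Stop-loss thresholds, named before the release, not at 1am.
THRESHOLDS = {
    "checkout_error_rate_5m": 0.02,  # 2% over a 5-minute window
    "payment_timeout_rate": 0.05,    # 5%
}

def should_roll_back(metrics: dict[str, float], p0_alert_fired: bool) -> bool:
    """True when any stop-loss condition is met. No approval needed."""
    if p0_alert_fired:
        return True
    return any(metrics.get(name, 0.0) > limit
               for name, limit in THRESHOLDS.items())

assert should_roll_back({"checkout_error_rate_5m": 0.03}, p0_alert_fired=False)
assert not should_roll_back({"checkout_error_rate_5m": 0.01,
                             "payment_timeout_rate": 0.01},
                            p0_alert_fired=False)
```

The win isn't the code itself; it's that the judgment call was made in daylight and the 1am decision becomes a lookup.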


9. "Friendly" or "reasonable" as a spec term

Symptom: Engineering and design interpret "friendly error message" differently. Or "reasonable performance" means 200ms to one person and 2 seconds to another.

The fix: Replace every vague term with a testable one.

  • "friendly" -> specific message text or message category
  • "reasonable" -> explicit metric with threshold
  • "fast" -> p95 latency under X ms at Y concurrent users
  • "handled" -> specific behavior enumerated

10. Spec written after the code

Symptom: The spec reads like documentation of what was built, not a description of what should be built. Nobody reviewed it before implementation.

The fix: This one is structural, not a wording fix. The spec needs to exist — and be reviewed — before implementation starts. The review is where the expensive conversations happen cheaply.

The spec's job is not to describe decisions after they're made. It's to force decisions before they're expensive.


The common thread

Every mistake on this list has the same root cause: a decision that should have been made explicitly in the spec was left implicit, and the team discovered the gap somewhere more expensive — in review, in testing, or in production.

Spec-first development doesn't mean longer documents. It means making the right decisions earlier, in writing, where they can be challenged before they're coded.


I publish practical spec-first guides and free templates at spec-coding.dev. The Spec Review Checklist covers the pre-implementation review in detail.
