guo king

Posted on Apr 22 • Originally published at spec-coding.dev

Why Your API Contract Breaks in Production (And How to Fix It in the Spec)

#architecture #webdev #programming #api

The most expensive API bugs I've seen weren't implementation bugs. They were contract bugs — cases where the producer and consumer had different beliefs about what the API promised, and nobody wrote it down before the code was shipped.

Here's the pattern: a team builds an endpoint, the frontend and backend agree on the behavior in a Slack thread or a meeting, and the implementation proceeds. Six months later, a field gets renamed as part of a "minor cleanup," a consumer breaks silently, and the incident postmortem includes the sentence "we thought this was backward compatible."

The root cause is almost never malice or carelessness. It's that the contract was implicit — shared understanding that lived in people's heads rather than in a document that both sides could point to.

What an API contract actually is

An API contract is the explicit agreement between a producer and its consumers about:

What fields exist, what they're named, and what types they carry
What the valid inputs are (required vs. optional, validation rules)
What error codes mean and when each one is returned
What "success" looks like at the HTTP level and the payload level
What the retry behavior is — is this operation idempotent?
What the version policy is — when does a change become breaking, and how much notice will consumers get?

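Most teams document some of these things. The ones that cause incidents are usually the ones left unwritten: what each error code means, whether retries are safe, and what counts as a breaking change.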

The three contract decisions teams skip

1. Error taxonomy

When an operation fails, what does the consumer do next?

If your API returns 400 for validation errors and 400 for business-rule rejections, the consumer can't distinguish between "fix your input" and "this action is blocked for business reasons." They'll implement retry logic that retries unretryable errors, or they'll show the wrong message to the user.

Before implementation, the spec should define:

400 Bad Request    — invalid input; client should fix and retry
409 Conflict       — valid input, business rule blocks; client should not retry
422 Unprocessable  — valid input, data state prevents action; may resolve over time
503 Unavailable    — transient; client should retry with backoff

This is a design decision. Make it in the spec, not in the error handler.
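Once the taxonomy is in the spec, a consumer can encode it as data instead of guessing. Here's a minimal sketch in Python, assuming the four status codes above; the names `ClientAction`, `TAXONOMY`, `action_for`, and `should_retry_automatically` are made up for illustration and don't come from any library.

```python
from enum import Enum
from typing import Optional


class ClientAction(Enum):
    FIX_INPUT = "fix the request and resend"
    DO_NOT_RETRY = "surface the rejection; do not retry"
    RETRY_LATER = "retry once the blocking state may have changed"
    RETRY_WITH_BACKOFF = "retry with backoff"


# Mirrors the four rows of the taxonomy above. Any other status is treated
# as "the spec is silent" rather than guessed at.
TAXONOMY = {
    400: ClientAction.FIX_INPUT,
    409: ClientAction.DO_NOT_RETRY,
    422: ClientAction.RETRY_LATER,
    503: ClientAction.RETRY_WITH_BACKOFF,
}


def action_for(status: int) -> Optional[ClientAction]:
    """Return the documented client action, or None if the spec is silent."""
    return TAXONOMY.get(status)


def should_retry_automatically(status: int) -> bool:
    """Only the transient failure is safe to retry with no other change."""
    return action_for(status) is ClientAction.RETRY_WITH_BACKOFF
```

With this in place, `should_retry_automatically(409)` is false, so the retry loop never fires on a business-rule rejection.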

2. Idempotency

Can the consumer safely retry this request if they don't get a response?

Payment endpoints, state-mutation endpoints, and any operation with side effects need an explicit answer. "The backend should handle it" is not an answer — it's a deferred argument.

Spec-first means writing this down before implementation:

POST /orders
Idempotency: Yes
Key: X-Idempotency-Key header (UUID, client-generated)
Window: 24 hours — duplicate requests within 24h return the original response
After window: treated as new request

Now the consumer knows what to send, the backend knows what to store, and QA knows what to test.
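To make the backend half concrete, here's a rough in-memory sketch of that spec in Python. Everything in it is an assumption for illustration: the `IdempotentOrders` class, the `StoredResponse` record, the injectable `clock`, and the response shape. A real service would keep the keys in a shared, durable store.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

WINDOW_SECONDS = 24 * 60 * 60  # the 24-hour window from the spec


@dataclass
class StoredResponse:
    status: int
    body: dict
    stored_at: float


class IdempotentOrders:
    """In-memory sketch of POST /orders keyed by X-Idempotency-Key."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._seen: Dict[str, StoredResponse] = {}

    def create_order(self, idempotency_key: str, payload: dict) -> Tuple[int, dict]:
        now = self._clock()
        previous = self._seen.get(idempotency_key)
        if previous is not None and now - previous.stored_at < WINDOW_SECONDS:
            # Duplicate inside the window: replay the original response.
            return previous.status, previous.body
        # First request, or the window has expired: treat as a new request.
        body = {"order_id": str(uuid.uuid4()), "items": payload.get("items", [])}
        self._seen[idempotency_key] = StoredResponse(201, body, now)
        return 201, body
```

The clock is passed in so that QA can test the "after window" rule without waiting 24 hours.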

3. Breaking change definition

Before you ship your first external-facing version, agree on what counts as breaking:

Breaking (requires version bump, advance notice):

Removing a field
Renaming a field
Changing a field's type
Changing the meaning of an existing status code
Making an optional field required

Non-breaking (can ship without version bump):

Adding a new optional field
Adding a new endpoint
Adding a new enum value (if consumers are built to handle unknown values)
Bug fixes that bring behavior in line with documented behavior

Document this. When "minor cleanup" happens six months from now, the engineer making the change has a checklist, not a judgment call.
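The checklist can also be partly mechanized. The sketch below, in Python, compares two versions of a request schema and reports the breaking changes from the list above that can be detected from field definitions alone. The schema shape (`FieldSpec` with `type` and `required`) and the function name `breaking_changes` are assumptions for illustration; changes in the meaning of a status code still need a human.

```python
from typing import Dict, List, TypedDict


class FieldSpec(TypedDict):
    type: str
    required: bool


Schema = Dict[str, FieldSpec]


def breaking_changes(old: Schema, new: Schema) -> List[str]:
    """List the schema changes that the policy above calls breaking."""
    problems: List[str] = []
    for name, spec in old.items():
        if name not in new:
            # To a consumer, a rename looks the same as a removal.
            problems.append(f"removed or renamed field: {name}")
            continue
        if new[name]["type"] != spec["type"]:
            problems.append(f"type changed: {name}")
        if new[name]["required"] and not spec["required"]:
            problems.append(f"optional field made required: {name}")
    # Fields that exist only in `new` are not flagged here: the policy
    # above lists adding an optional field as non-breaking.
    return problems
```

An empty list means the change passes the checklist as far as field definitions go.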

The OpenAPI to CI pipeline

Once your contract is written, the best way to protect it is to make breaking changes fail in CI before they reach a reviewer.

The minimal setup:

Generate an OpenAPI spec from your implementation (or write it first and validate against it)
Run openapi-diff or oasdiff in CI and fail the build when it detects a breaking change
Run contract tests driven by a real consumer's expectations, using Pact or an equivalent; those expectations are versioned alongside the producer's spec

This doesn't prevent all contract breaks. It prevents the ones that nobody caught because they were in a field rename that "obviously" wouldn't affect anything.
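To show the idea behind the consumer-side check without tying it to a tool, here's a hand-rolled sketch in Python. It is not Pact's API. The `CONSUMER_EXPECTATIONS` mapping, the field names in it, and the `contract_violations` function are all hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical expectations a consumer team might publish: the fields it
# reads from the response and the types it expects after JSON decoding.
CONSUMER_EXPECTATIONS: Dict[str, type] = {
    "order_id": str,
    "status": str,
    "total_cents": int,
}


def contract_violations(response: Dict[str, Any], expected: Dict[str, type]) -> List[str]:
    """Compare a producer response against the fields a consumer relies on."""
    violations: List[str] = []
    for name, expected_type in expected.items():
        if name not in response:
            violations.append(f"missing field: {name}")
        elif not isinstance(response[name], expected_type):
            actual = type(response[name]).__name__
            violations.append(f"wrong type for {name}: {actual}")
    # Extra fields in the response are ignored, so additive changes pass.
    return violations
```

Run against a real response in the producer's CI, a non-empty result fails the build and names the consumer field that would have broken.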

The spec-first version of this workflow

The difference between spec-first API development and regular API development isn't the tools. It's the order:

Regular: Build → Document → Ship → Break consumer → Fix

Spec-first: Define contract → Review contract → Build to contract → Validate in CI → Ship

The contract review is where most of the value lives. It's the moment when the consumer team can say "we rely on that field being optional" before the producer team has already built it as required.

That conversation is almost free before implementation. It's expensive after deployment.

A minimal API contract template

If you don't have a spec process yet, start with this for every new endpoint:

Endpoint: POST /v1/[resource]
Purpose: [one sentence]

Request:
  Required fields: [list]
  Optional fields: [list]
  Validation rules: [list]

Response (success):
  Status: 201
  Fields returned: [list]

Response (errors):
  400: [when, what to tell the consumer]
  409: [when, what to tell the consumer]
  503: [when, consumer should retry with backoff]

Idempotency: [yes/no] — if yes, key mechanism and window
Breaking change policy: [link to shared definition]
Version: v1 — changes that break the above require a new version

That's the minimum. It takes 15 minutes to write. The alternative is a 2am incident.
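If you'd rather have a linter than a reviewer ask whether the template was filled in, the same structure fits in a small data class. This is a sketch in Python under my own assumptions: the `EndpointContract` fields and the `unanswered` helper are illustrative names, and the checks cover only a few of the template's sections.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class EndpointContract:
    endpoint: str
    purpose: str
    required_fields: List[str]
    optional_fields: List[str] = field(default_factory=list)
    success_status: int = 201
    errors: Dict[int, str] = field(default_factory=dict)
    idempotent: Optional[bool] = None  # None means nobody has decided yet
    idempotency_key: Optional[str] = None
    version: str = "v1"


def unanswered(contract: EndpointContract) -> List[str]:
    """Return the template sections that are still blank."""
    gaps: List[str] = []
    if not contract.purpose.strip():
        gaps.append("purpose")
    if not contract.errors:
        gaps.append("error responses")
    if contract.idempotent is None:
        gaps.append("idempotency")
    elif contract.idempotent and not contract.idempotency_key:
        gaps.append("idempotency key mechanism")
    return gaps
```

A draft that lists only the endpoint, purpose, and required fields comes back with "error responses" and "idempotency" still open.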

I write about spec-first delivery and API contracts at spec-coding.dev. The API Contract Checklist covers the full pre-release review process.
