Why Expand-and-Contract Works for Database Migrations (and What Teams Skip)

#database #productivity #webdev #programming

The expand-and-contract pattern for database migrations is one of those rare engineering ideas that sounds like an abstract framework and turns out to be the answer to a very specific operational problem. The problem: how to change the shape of a production database while the application is still serving traffic, without taking the system down, without leaving the database and the code in inconsistent states, and without losing the ability to roll back if something goes wrong.

The answer expand-and-contract gives is not subtle: stage the migration so that every intermediate state is one the application can survive. Add the new shape first. Make the application speak both shapes. Backfill the old data into the new shape. Cut over reads to the new shape. Drop the old shape last.

The reason it works is more interesting than the description. And the reason teams skip parts of it is where the production incidents come from.

The Core Property: Every Intermediate State Is Valid

The thing expand-and-contract gives up is the ability to deploy a schema change and a code change atomically. In return, it guarantees that every state the system passes through is a state the application can run in.

State 1: Old code, old schema. The current production state. Everything works.

State 2: Old code, new schema added. Application has not changed. New columns exist but are unused. Application still works because it ignores the new columns. Rollback: drop the new columns. Fast and safe.

State 3: New code (dual-write), new schema. Application writes to both shapes on every mutation. Reads still come from the old shape. Application still works. Rollback: deploy the previous code version, which stops the dual-write but does not require any schema change.

State 4: New code (read from new shape), new schema. Application reads from the new column. Dual-write continues for safety. Application still works because the new column has been backfilled. Rollback: feature-flag the read back to the old column.

State 5: New code (single-write), old schema removed. Final state. Application works. Rollback at this point requires restoring from backup, which is why this stage waits until you are confident.

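The key property: there is no point in the sequence where the application is in a state that the schema cannot serve. Every deploy is independently safe. Between the expand and the contract, every rollback is a deploy or a flag flip, not a schema operation; only the two ends of the sequence touch the schema.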

This is what zero-downtime actually means. Not that nothing changes, but that nothing breaks during the changes.
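A minimal sketch of the first transition, assuming a hypothetical `users` table whose `name` column is being replaced by a new `display_name` column, with SQLite standing in for the production database. It shows the old code running unchanged against the expanded schema, which is both State 2 and the rollback target from State 3. The property only holds if the old code names its columns explicitly; code that relies on column position can break when a column is added.

```python
import sqlite3

# Hypothetical schema: `name` is the old shape, `display_name` the new one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")


def old_code_create_user(conn, name):
    """Old application code: writes only the old column, by name."""
    cur = conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
    return cur.lastrowid


def old_code_get_name(conn, user_id):
    """Old application code: reads only the old column, by name."""
    row = conn.execute(
        "SELECT name FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return row[0]


# State 1: old code, old schema.
ada = old_code_create_user(conn, "Ada")

# State 2: expand. The new column is nullable, so existing rows and
# writers that never mention it remain valid.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# The unchanged old code still runs against the expanded schema.
grace = old_code_create_user(conn, "Grace")
names = [old_code_get_name(conn, ada), old_code_get_name(conn, grace)]

# Rows written by old code simply leave the new column NULL.
unfilled = conn.execute(
    "SELECT COUNT(*) FROM users WHERE display_name IS NULL"
).fetchone()[0]
```

On a production database the `ALTER TABLE` step has its own locking behavior, which this sketch does not model.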

Why the Pattern Works for Distributed Systems Too

The expand-and-contract pattern generalizes beyond databases. The same shape (add the new thing, get the system using both, migrate, cut over, remove the old thing) is how you change API contracts, message formats, event schemas, configuration shapes, and anything else where the producer and the consumer cannot be updated atomically.

The Confluent documentation on schema evolution covers this for Kafka message schemas. The Protocol Buffers backwards compatibility rules are essentially expand-and-contract baked into the format design. The OpenAPI versioning guidance is the same pattern applied to REST APIs.

The shared idea: a change is only safe if every reader and writer can handle every state the change passes through. Make that property non-negotiable, and the safe migration patterns fall out of it.
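The same property can be sketched for a message format. The event shape and field names here (`name` as the old field, `display_name` as the new one) are hypothetical and not taken from any of the systems mentioned above; the point is only that the transitional writer emits both fields and the new reader accepts either shape.

```python
def write_event(user_id: int, display_name: str) -> dict:
    """Transitional writer: emits both the old and the new field."""
    return {"user_id": user_id, "name": display_name, "display_name": display_name}


def old_reader(event: dict) -> str:
    """Consumer that has not been updated: knows only the old field."""
    return event["name"]


def new_reader(event: dict) -> str:
    """Updated consumer: prefers the new field, falls back to the old one."""
    if "display_name" in event:
        return event["display_name"]
    return event["name"]


legacy_event = {"user_id": 1, "name": "Ada"}   # produced before the change
transitional_event = write_event(2, "Grace")   # produced during the change

results = [
    old_reader(legacy_event),
    old_reader(transitional_event),
    new_reader(legacy_event),
    new_reader(transitional_event),
]
```

Every reader handles every event that can be in flight, so producers and consumers can be deployed in any order. The old field can be dropped from the writer only once no `old_reader` remains.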

What Teams Skip (and Why It Bites)

In practice, teams under deadline pressure compress two stages into one. The compressions are predictable, and the failures are predictable too.

Skipping the dual-write stage. A team adds a new column, backfills it directly with a single SQL statement, then deploys the application code that reads from the new column. Writes that land while the backfill is running, or between the backfill ending and the deploy starting, hit the old column only. After the deploy, the most recent writes are missing from the new column. The team finds out when a user reports that their last action did not "stick."

The fix is to dual-write first, then backfill, then cut over reads. The dual-write covers the window the single-SQL backfill cannot cover.
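A sketch of that ordering, again assuming the hypothetical `users.name` to `users.display_name` change and using SQLite in place of a production database. The backfill here runs in small batches and touches only rows whose new column is still NULL, so it never overwrites a value the dual-write has already put there. The batch size and the helper names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, display_name TEXT)"
)
# Rows that existed before the dual-write code was deployed.
conn.executemany(
    "INSERT INTO users (name) VALUES (?)", [("Ada",), ("Grace",), ("Edsger",)]
)


def dual_write_rename(conn, user_id, new_name):
    """Dual-write: every mutation updates both shapes in one statement."""
    conn.execute(
        "UPDATE users SET name = ?, display_name = ? WHERE id = ?",
        (new_name, new_name, user_id),
    )


def backfill(conn, batch_size=2):
    """Copy old -> new in batches, skipping rows that are already filled."""
    total = 0
    while True:
        cur = conn.execute(
            "UPDATE users SET display_name = name WHERE id IN ("
            "SELECT id FROM users WHERE display_name IS NULL LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # keep each batch in its own short transaction
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


# Dual-write is already live, so a write that lands before the backfill...
dual_write_rename(conn, 2, "Grace Hopper")
filled = backfill(conn)
# ...and one that lands after it both reach the new column.
dual_write_rename(conn, 1, "Ada Lovelace")

# `IS NOT` is SQLite's NULL-safe inequality.
mismatched = conn.execute(
    "SELECT COUNT(*) FROM users WHERE display_name IS NOT name"
).fetchone()[0]
```

With the stages in this order there is no window in which a write can reach the old column without also reaching the new one.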

Skipping the gradual cutover. A team deploys the new read path at 100 percent traffic on a single deploy. The new read path has a bug that only fires for a specific data pattern that the test data did not cover. Production sees a spike in errors within minutes.

The fix is the feature flag or percentage rollout. Start at 1 percent. Errors surface in the slow trickle before they become a flood. The Datadog feature flag observability writeup is one of the better explanations of how to instrument the rollout.
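One way to sketch a percentage rollout without any particular feature-flag product is deterministic bucketing: hash the user ID into one of 100 buckets and send a user down the new read path only if the bucket falls below the current percentage. The flag name, the function names, and the row shape are assumptions for illustration.

```python
import hashlib


def in_rollout(user_id: int, percent: int, flag: str = "read_display_name") -> bool:
    """Stable assignment: the same user always lands in the same bucket."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


def get_display_name(row: dict, user_id: int, percent: int) -> str:
    """Read from the new column for users in the rollout, else the old one."""
    if in_rollout(user_id, percent):
        return row["display_name"]
    return row["name"]


row = {"name": "Ada", "display_name": "Ada L."}
at_zero = get_display_name(row, user_id=42, percent=0)
at_full = get_display_name(row, user_id=42, percent=100)
```

Because the bucket depends only on the flag name and the user ID, raising the percentage only ever adds users to the new path, and setting it back to 0 is the rollback.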

Skipping the wait period before contracting. A team finishes the cutover and immediately drops the old columns. Three days later a bug surfaces that requires reverting to the old read path. The old columns are gone. The only rollback is restoring from a backup that is now three days old.

The fix is the wait period. A week is reasonable for most teams. Two weeks for systems where the cost of a rollback is high. The dual-write code stays in place for the entire wait period, so the old columns continue to be populated and a rollback is just a feature-flag flip.
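The wait can be turned into an explicit gate in front of the contract step. The sketch below is one possible reading of "confident", not a standard: it requires that the wait period has elapsed since cutover and that the old and new columns still agree, which is what keeps the flag-flip rollback valid. The table, column names, and `safe_to_contract` helper are hypothetical.

```python
import sqlite3
from datetime import date, timedelta


def safe_to_contract(conn, cutover_day, today, wait_days=7):
    """Return (ok, reason) for dropping the old column."""
    if today - cutover_day < timedelta(days=wait_days):
        return False, "wait period not over"
    # Dual-write should keep both columns identical until contraction.
    drift = conn.execute(
        "SELECT COUNT(*) FROM users WHERE display_name IS NOT name"
    ).fetchone()[0]
    if drift:
        return False, f"{drift} rows differ between old and new columns"
    return True, "ok"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, display_name TEXT)"
)
conn.execute("INSERT INTO users (name, display_name) VALUES ('Ada', 'Ada')")

cutover = date(2024, 1, 1)
too_early = safe_to_contract(conn, cutover, date(2024, 1, 4))
on_time = safe_to_contract(conn, cutover, date(2024, 1, 9))
```

A gate like this fits naturally into the migration plan document as the last item before the drop.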

The Engineering Cost of Doing It Right

The honest accounting: expand-and-contract takes five deploys instead of one. The team writes dual-write code, a backfill script, and rollout instrumentation. The total wall-clock time from "we want to add this column" to "the column is in production and used everywhere" is days, not hours.

For some systems, this cost is not justified. A small internal tool with a quiet maintenance window every Sunday is fine running a one-shot migration. Pretending otherwise is engineering theater that does not produce value.

For customer-facing systems and high-volume transactional systems, the cost is justified by the alternative. A single multi-hour outage from a botched migration costs more than the entire expand-and-contract overhead, often by an order of magnitude, even setting aside the reputational cost.

The judgment call about which path to take is one of the things 137Foundry helps engineering teams think through, especially for systems that have grown past the point where the old maintenance-window approach still fits their availability requirements.

How to Get the Pattern Routine

The biggest predictor of whether a team will successfully execute expand-and-contract is whether they have done it before. The first migration in this pattern is slow and feels bureaucratic. The second one is faster. By the fifth, the team has reusable scaffolding (a backfill template, a migration plan document, a feature-flag pattern) and the migration takes a fraction of the original time.

The way to make this happen: pick a low-stakes migration as the first one. Add a nullable column that the application does not actually need yet. Walk through all five stages. Have a teammate review the rollback at each stage. The whole exercise might take a couple of days, but at the end of it the team has done the pattern once, the scaffolding exists, and the next one takes half the time.

The full mechanics of how to put this into practice, with the specific lock-behavior caveats for Postgres and the operational disciplines that hold it together, are covered at https://137foundry.com/articles/how-to-plan-zero-downtime-database-schema-migration. For more on how 137Foundry approaches the broader engineering work that surrounds production database changes, https://137foundry.com/services covers the practice areas directly.

The Real Lesson

The reason expand-and-contract works is not that the pattern is clever. It is that the pattern refuses to compress states that the system cannot survive being in. Every other migration approach is, at the limit, a wager that the compression will be fine. Most of the time, the wager pays off. The cases where it does not are the cases that end up in postmortems.

The pattern is the discipline that turns the wager into a guarantee. It is more work. It is also why the migrations that ship without incident tend to come from teams that take it seriously.

For broader context on building this kind of discipline into engineering practice generally, Charity Majors's writing on observability and operability is worth reading, and the GitLab handbook on database migrations is one of the more detailed open documents of how a large team actually executes the pattern in practice.