Digital Unicon

Posted on Jun 21

Zero-Downtime Migrations: A Step-by-Step Playbook

#database #devops #postgressql #saas

If you've ever typed ALTER TABLE on a production database and held your breath, this article is for you.

Database migrations are one of the most dangerous routine operations in software engineering. Done wrong, a simple column rename can take down your entire platform — locking tables, timing out connections, and turning your Monday morning deploy into an all-hands incident.

The good news: zero-downtime migrations are entirely achievable, even on large, high-traffic databases. They just require a different mental model – one where your schema and your code evolve together but independently.

This is the playbook my team uses. It's boring, methodical, and it works.

Why Migrations Go Wrong

Before the playbook, let's understand the failure modes.

Problem 1: The Lock

Most databases acquire an ACCESS EXCLUSIVE lock when you alter a table. This blocks every read and write until the migration finishes. On a large table, "finishes" can mean minutes or hours.

-- This looks innocent. It is not.
ALTER TABLE users ADD COLUMN last_seen_at TIMESTAMP;

-- On a 50M row table with active traffic?
-- You just blocked every query touching `users`.

Problem 2: Deploying Code and Schema Together

The classic mistake: you rename a column and deploy your app at the same time. For one brief window, your new code is looking for email_address while the database still has email. Errors spike. Users see failures. You roll back both and start over.

Problem 3: The Irreversible Change

You drop a column. The deploy succeeds. Then you realize the old version of your app (still running on 2 of your 10 servers during a rolling deploy) needed that column. Now you have 500 errors per second and no way to un-drop.

The Mental Model: Expand and Contract

The solution to all three problems is a pattern called Expand and Contract (sometimes called parallel change).

The idea is simple:

Never make a breaking schema change in a single step. Instead, expand the schema to support both old and new states, update your application code, then contract the schema by removing what's no longer needed.

Every migration becomes a 3-phase process spread across multiple deploys:

Phase 1: EXPAND   → Add new structure, keep old structure intact
Phase 2: MIGRATE  → Move data, update app code to use new structure
Phase 3: CONTRACT → Remove old structure once it's no longer referenced

Each phase is independently deployable. Each phase is independently rollback-safe.

Let's walk through it concretely.

Phase 1: Expand

Goal: Make the database compatible with both the old app and the new app simultaneously.

Say you want to rename users.name to users.full_name.

In the Expand phase, you add the new column without removing the old one:

-- Migration: 001_add_full_name_column.sql
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);

Why this is safe: Adding a nullable column (or one with a default) doesn't require a full table rewrite in modern databases. PostgreSQL 11+ handles this instantly via a catalog change. MySQL 8.0+ with ALGORITHM=INSTANT does the same.

At this point, your old app still reads and writes name. The new column exists but is empty. Nothing breaks.

Pro tip for large tables — avoid long locks entirely:

-- PostgreSQL: Use concurrent index builds, not blocking ones
CREATE INDEX CONCURRENTLY idx_users_full_name ON users(full_name);

-- MySQL: Use pt-online-schema-change or gh-ost for large table alters

Phase 2: Migrate (Data + Code)

This is the meaty phase. It has two parts that happen together.

Part A: Backfill the Data

Copy data from the old column to the new one. Don't do this in a single UPDATE — it locks the table and may run for hours.

Instead, backfill in batches:

import psycopg2
import time

conn = psycopg2.connect(DATABASE_URL)
cursor = conn.cursor()

BATCH_SIZE = 1000
last_id = 0

while True:
    cursor.execute("""
        UPDATE users
        SET full_name = name
        WHERE id > %s
          AND full_name IS NULL
        ORDER BY id
        LIMIT %s
        RETURNING id
    """, (last_id, BATCH_SIZE))

    rows = cursor.fetchall()
    conn.commit()

    if not rows:
        break

    last_id = rows[-1][0]
    print(f"Backfilled up to id={last_id}")
    time.sleep(0.05)  # Breathing room for other queries

print("Backfill complete.")

This approach:

Never locks the table for more than a few milliseconds per batch
Can be paused and resumed
Can run in production without impacting traffic

Part B: Update Your Application Code

Now deploy the application code that writes to both columns and reads from the new column:

# Before (Phase 1 app code)
def update_user_name(user_id, name):
    db.execute("UPDATE users SET name = %s WHERE id = %s", (name, user_id))

def get_user(user_id):
    return db.fetchone("SELECT name FROM users WHERE id = %s", (user_id,))


# After (Phase 2 app code) — dual-write, read from new column
def update_user_name(user_id, name):
    db.execute("""
        UPDATE users
        SET name = %s, full_name = %s
        WHERE id = %s
    """, (name, name, user_id))

def get_user(user_id):
    return db.fetchone("SELECT full_name FROM users WHERE id = %s", (user_id,))

Why dual-write matters: During a rolling deploy, some servers run the old code, some run the new. Dual-writing ensures both versions stay in sync with no data loss regardless of which server handles a request.

Verify Before Moving On

Before Phase 3, confirm that:

The backfill is 100% complete (no NULL values in full_name for rows that should have data)
All application servers are on the Phase 2 code
No old-code servers are still running (check your deployment dashboard)

-- Verify backfill completeness
SELECT COUNT(*)
FROM users
WHERE name IS NOT NULL
  AND full_name IS NULL;

-- Should return 0

Phase 3: Contract

Only when Phase 2 has been fully deployed and verified do you remove the old column.

-- Migration: 003_drop_old_name_column.sql
ALTER TABLE users DROP COLUMN name;

And clean up your application code:

# Phase 3 app code — old column gone, no more dual-write
def update_user_name(user_id, name):
    db.execute("UPDATE users SET full_name = %s WHERE id = %s", (name, user_id))

def get_user(user_id):
    return db.fetchone("SELECT full_name FROM users WHERE id = %s", (user_id,))

The key insight: At this point, nothing in your codebase references name anymore. Dropping it is completely safe. If something breaks, it was already broken before you ran this migration.

The Rollback Plan

Every phase should have an explicit rollback strategy.

Phase	If something goes wrong...	Rollback action
Phase 1 (Expand)	App misbehaves after new column added	Drop the new column. Old code never touched it.
Phase 2 (Migrate)	New code has bugs, reads fail	Re-deploy old code. Old column is still intact. Data is still there.
Phase 3 (Contract)	Realise old column was still needed	This is hard. Don't rush Phase 3.

The most important rule: never skip Phase 2. The temptation to combine "add new column, migrate data, drop old column" into a single migration script is real — and it's how incidents happen.

Real-World Patterns Cheat Sheet

Here are common migration scenarios mapped to the Expand/Contract approach:

Rename a Column

Expand: Add new column
Migrate: Backfill data, dual-write in code
Contract: Drop old column

Add a NOT NULL Column

-- Phase 1: Add as nullable
ALTER TABLE orders ADD COLUMN shipped_at TIMESTAMP NULL;

-- Phase 2: Backfill, then add constraint
UPDATE orders SET shipped_at = created_at WHERE status = 'shipped';

-- Phase 3: Enforce the constraint
ALTER TABLE orders ALTER COLUMN shipped_at SET NOT NULL;

Split a Column into Two

Expand: Add both new columns
Migrate: Parse and populate from old column in batches
Contract: Drop old column

Change a Column Type (e.g. INT → BIGINT)

Expand: Add new column with new type
Migrate: Copy and cast data, dual-write in code
Contract: Drop old column, rename new one (or keep new name in code)

Tools Worth Knowing

Tool	Use case
gh-ost (GitHub)	Online schema changes for MySQL — copies table in background, applies binlog changes
pglogical	Logical replication for PostgreSQL migrations between major versions
Flyway / Liquibase	Schema version control — tracks which migrations have run per environment
pt-online-schema-change	Percona's alternative to gh-ost for MySQL
strong_migrations (Ruby)	Catches unsafe migrations in Rails at dev time before they hit production

The 3 AM Test

Before you run any migration in production, ask yourself:

If this migration goes wrong at 3 AM, can I roll it back in under 5 minutes without data loss?

If the answer is no — split it into smaller phases until it is.

Zero-downtime migrations aren't magic. They're discipline. Each phase is small, safe, and independently reversible. The whole process takes longer than a single scary ALTER TABLE, but it's the difference between a boring Tuesday deploy and an incident retrospective.

Your on-call rotation will thank you.

Summary

Never change schema and code atomically. Decouple them across phases.
Expand first — add new structure, don't remove old structure.
Migrate safely — batch backfills, dual-write during transition.
Contract last — only drop old structure after all code has moved on.
Always have a rollback plan for each phase independently.
Use the right tools — gh-ost, CREATE INDEX CONCURRENTLY, and batch scripts are your friends.

DEV Community