Every backend engineer has a migration horror story. Maybe it was the ALTER TABLE that locked a 200-million-row table for 47 minutes. Maybe it was the column rename that brought down every API server simultaneously. Or maybe it was the "quick fix" migration that corrupted data in a way that took three days to untangle.
Database migrations are the most dangerous routine operation in backend engineering. Your code deploys are (hopefully) stateless and reversible. Your database changes are neither. They mutate persistent state that your entire system depends on, and getting them wrong can mean downtime, data loss, or both.
This article covers the patterns, tooling, and discipline required to run migrations in production without breaking things.
Why Migrations Are Scary
The fundamental problem is simple: schema changes and code changes don't deploy atomically. There is always a window where your running application code and your database schema are out of sync. During a typical rolling deployment:
1. Migration runs, changing the schema
2. Old application code is still running against the new schema
3. New application code starts rolling out
4. Both old and new code run simultaneously against the new schema
5. Old code finishes draining
Steps 2-4 are where things break. If your migration renames a column, old code looking for the old name throws errors. If it drops a column, same problem. If it adds a NOT NULL column without a default, old code inserting rows fails.
The second problem is locking. Most ALTER TABLE operations in traditional databases acquire locks that block reads, writes, or both. On a table with millions of rows, a lock held for even a few seconds can cascade into connection pool exhaustion, request timeouts, and a full outage.
The third problem is irreversibility. You can roll back a code deploy in seconds. Rolling back a migration that dropped a column means restoring from backup — if you even have a recent one that's consistent.
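The locking problem has a practical mitigation: set a lock_timeout before running DDL so the ALTER gives up quickly instead of queueing (and blocking every query queued behind it), then retry. Here is a minimal sketch, assuming a DB-API-style connection (psycopg2 or similar); `alter_with_lock_timeout` is a hypothetical helper, not part of any migration tool:

```python
import time

def alter_with_lock_timeout(conn, ddl, timeout_ms=2000, retries=5,
                            initial_delay=1.0, backoff=2.0):
    """Run a DDL statement with a lock_timeout so it fails fast instead of
    queueing behind long transactions, then retry with exponential backoff.
    Returns the attempt number that succeeded."""
    delay = initial_delay
    for attempt in range(1, retries + 1):
        cur = conn.cursor()
        try:
            # Give up if the lock is not acquired within timeout_ms
            cur.execute(f"SET lock_timeout = '{timeout_ms}ms'")
            cur.execute(ddl)
            conn.commit()
            return attempt
        except Exception:
            conn.rollback()
            if attempt == retries:
                raise
            time.sleep(delay)
            delay *= backoff
        finally:
            cur.close()
```

When the timeout fires, the ALTER fails cleanly and the queries stacked behind it proceed; the retry gets a fresh chance at a quieter moment.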
Zero-Downtime Migration: The Core Principle
The rule is straightforward: at every point during the migration and deployment process, all running code must be compatible with the current database schema.
This means:
- Never rename a column in a single step
- Never drop a column that running code still references
- Never add a NOT NULL constraint without a default
- Never assume the migration and code deploy happen simultaneously
Every migration must be backward compatible with the currently deployed code and forward compatible with the code about to be deployed.
The Expand-Contract Pattern
This is the most important pattern in zero-downtime migrations. Instead of making a breaking change in one step, you split it into three phases:
Phase 1: Expand
Add the new structure alongside the old one. Both old and new code work.
-- Migration: Add new column (nullable, so old code can still INSERT)
ALTER TABLE orders ADD COLUMN status_v2 VARCHAR(50);
Phase 2: Migrate
Deploy code that writes to both old and new structures. Backfill existing data.
-- Backfill in batches (more on this later)
UPDATE orders SET status_v2 = status WHERE status_v2 IS NULL
AND id BETWEEN $start AND $end;
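The Phase 2 write path can be sketched like this — a hypothetical `update_order_status` helper, assuming a DB-API connection, that writes both columns together so readers of either column see consistent data:

```python
def update_order_status(conn, order_id, new_status):
    """Phase 2 dual-write: keep the old and new columns in sync so code
    reading either column sees consistent data during the migration."""
    cur = conn.cursor()
    cur.execute(
        "UPDATE orders SET status = %s, status_v2 = %s WHERE id = %s",
        (new_status, new_status, order_id),
    )
    conn.commit()
```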
Phase 3: Contract
Once all code uses the new structure and all data is migrated, remove the old one.
-- Only after all application code stops reading `status`
ALTER TABLE orders DROP COLUMN status;
ALTER TABLE orders RENAME COLUMN status_v2 TO status;
Each phase is a separate migration tied to a separate code deploy. Phase 1 goes out with or before the code that starts using the new column. Phase 3 goes out only after you've verified all services have been updated and the old column is truly unused.
Real-world example: renaming a column
Renaming user_name to display_name on a users table with 10M rows:
Deploy 1: Migration adds `display_name` column
Deploy 2: Code writes to both columns, reads from `display_name` with fallback to `user_name`
Deploy 3: Backfill script copies remaining data
Deploy 4: Code reads only from `display_name`
Deploy 5: Migration drops `user_name`
Yes, that's five deploys for a column rename. That's the cost of zero downtime. In practice, deploys 1-2 often ship together, and 4-5 ship together, so it's usually three deploy cycles.
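The Deploy 2 read path is often just a COALESCE in SQL, so rows the backfill hasn't reached transparently fall back to the old column. A sketch with hypothetical names, assuming a DB-API connection:

```python
# Deploy 2 read path: prefer the new column, fall back to the old one
# for rows the backfill has not reached yet.
SELECT_DISPLAY_NAME = """
    SELECT id, COALESCE(display_name, user_name) AS display_name
    FROM users
    WHERE id = %s
"""

def get_display_name(conn, user_id):
    cur = conn.cursor()
    cur.execute(SELECT_DISPLAY_NAME, (user_id,))
    row = cur.fetchone()
    return row[1] if row else None
```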
Handling Large Table Alterations
On PostgreSQL, many ALTER TABLE operations are fast because they only update catalog metadata:
-- These are ~instant in PostgreSQL, regardless of table size
ALTER TABLE orders ADD COLUMN notes TEXT;
ALTER TABLE orders ALTER COLUMN notes SET DEFAULT '';
ALTER TABLE orders DROP COLUMN old_field;
But some operations are not:
-- These rewrite the entire table or scan all rows
ALTER TABLE orders ALTER COLUMN amount TYPE BIGINT; -- full rewrite
ALTER TABLE orders ADD COLUMN verified BOOLEAN NOT NULL DEFAULT true;
-- (instant in PG 11+ but rewrites in older versions)
CREATE INDEX ON orders (customer_id); -- full table scan
For type changes on large tables, use the expand-contract pattern: add a new column with the desired type, backfill, switch reads, drop the old column.
For index creation, always use CONCURRENTLY:
-- Blocks writes:
CREATE INDEX idx_orders_customer ON orders (customer_id);
-- Does NOT block writes (takes longer, but safe):
CREATE INDEX CONCURRENTLY idx_orders_customer ON orders (customer_id);
Important caveat: CREATE INDEX CONCURRENTLY cannot run inside a transaction. Most migration tools wrap each migration in a transaction by default. You need to either disable that behavior for this specific migration or use a tool that handles it.
In golang-migrate, there is no per-migration directive for this; the practical workaround is to put the statement in a migration file by itself, since a file containing multiple statements is executed as a single implicit transaction:
-- 000003_add_customer_index.up.sql (nothing else in this file)
CREATE INDEX CONCURRENTLY idx_orders_customer ON orders (customer_id);
In Flyway, you would set executeInTransaction=false on the migration.
Advisory Locks and Migration Safety
When running migrations in a horizontally scaled environment, you need to ensure only one instance runs migrations at a time. Most tools handle this with advisory locks:
-- golang-migrate uses PostgreSQL advisory locks
SELECT pg_advisory_lock(12345);
-- ... run migrations ...
SELECT pg_advisory_unlock(12345);
This prevents two pods starting simultaneously from both trying to run the same migration and corrupting the schema history.
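If your tool doesn't do this for you, the same guard is easy to add around a hand-rolled migration runner. A sketch, assuming a DB-API connection; `run_migrations_exclusively` and the key derivation are illustrative, not a standard API:

```python
import zlib

# Derive a stable bigint key from a name instead of hard-coding a magic number
MIGRATION_LOCK_KEY = zlib.crc32(b"schema_migrations")

def run_migrations_exclusively(conn, run_migrations):
    """Hold a session-level advisory lock while migrating; a second
    instance calling this blocks until the first one finishes."""
    cur = conn.cursor()
    cur.execute("SELECT pg_advisory_lock(%s)", (MIGRATION_LOCK_KEY,))
    try:
        run_migrations(conn)
    finally:
        cur.execute("SELECT pg_advisory_unlock(%s)", (MIGRATION_LOCK_KEY,))
```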
Migration Tooling
golang-migrate
Straightforward, language-agnostic, uses plain SQL files:
migrations/
000001_create_users.up.sql
000001_create_users.down.sql
000002_add_orders.up.sql
000002_add_orders.down.sql
Run from CLI or embedded in your Go application:
import (
    "log"

    "github.com/golang-migrate/migrate/v4"
    _ "github.com/golang-migrate/migrate/v4/database/postgres" // registers the postgres driver
    _ "github.com/golang-migrate/migrate/v4/source/file"       // registers the file source
)

m, err := migrate.New(
    "file://migrations",
    "postgres://localhost:5432/mydb?sslmode=disable",
)
if err != nil {
    log.Fatal(err)
}
if err := m.Up(); err != nil && err != migrate.ErrNoChange {
    log.Fatal(err)
}
Strengths: simple, no DSL to learn, works with any language's deployment pipeline. Weaknesses: no built-in support for non-transactional migrations, limited state tracking.
Flyway
JVM-based, more opinionated, supports versioned and repeatable migrations:
-- V1__Create_users.sql
CREATE TABLE users (
id BIGSERIAL PRIMARY KEY,
email VARCHAR(255) NOT NULL UNIQUE,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- V2__Add_orders.sql
CREATE TABLE orders (
id BIGSERIAL PRIMARY KEY,
user_id BIGINT NOT NULL REFERENCES users(id),
total_cents BIGINT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
Flyway tracks state in a flyway_schema_history table and supports callbacks, placeholders, and Java-based migrations for complex logic. It's the standard in Java/Kotlin ecosystems.
Other Notable Tools
- Alembic (Python/SQLAlchemy): generates migrations from model diffs, good for Python shops
- Sqitch: dependency-based rather than version-ordered, powerful but steeper learning curve
- Atlas (by Ariga): declarative schema management, computes diffs automatically, gaining traction in Go ecosystems
- pg_partman / pgloader: for partition management and bulk data loading, respectively
Choosing a Tool
Pick based on your ecosystem and complexity:
| Need | Tool |
|---|---|
| Simple SQL migrations, any language | golang-migrate |
| Java/Kotlin ecosystem | Flyway |
| Python/SQLAlchemy | Alembic |
| Declarative schema-as-code | Atlas |
| Complex dependency graphs | Sqitch |
Data Backfills Done Right
Backfilling data across millions of rows is where most migration disasters happen. The naive approach:
-- DO NOT DO THIS
UPDATE orders SET status_v2 = compute_new_status(status);
This acquires a lock on every row, generates enormous WAL (write-ahead log) volume, and can take hours while blocking other writes.
Instead, backfill in batches with throttling:
import time
import psycopg2

BATCH_SIZE = 5000
SLEEP_SECONDS = 0.1  # throttle to reduce replication lag

conn = psycopg2.connect(dsn)
conn.autocommit = True
cursor = conn.cursor()

cursor.execute("SELECT MIN(id), MAX(id) FROM orders")
min_id, max_id = cursor.fetchone()
if min_id is None:
    raise SystemExit("orders table is empty, nothing to backfill")

current = min_id
while current <= max_id:
    cursor.execute("""
        UPDATE orders
        SET status_v2 = compute_new_status(status)
        WHERE id >= %s AND id < %s
        AND status_v2 IS NULL
    """, (current, current + BATCH_SIZE))
    updated = cursor.rowcount
    print(f"Batch {current}-{current + BATCH_SIZE}: updated {updated} rows")
    current += BATCH_SIZE
    time.sleep(SLEEP_SECONDS)
Key practices:
- Batch by primary key range, not LIMIT/OFFSET (which gets slower as the offset grows)
- Sleep between batches to let replicas catch up and avoid saturating I/O
- Make it idempotent (WHERE status_v2 IS NULL) so you can restart safely if it fails midway
- Monitor replication lag during the backfill and pause if it exceeds your threshold
- Run during low-traffic windows when possible, even if it's technically safe at peak
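Pausing on replication lag can be a small helper called between batches. A sketch against pg_stat_replication (the replay_lag column exists on PostgreSQL 10+); the function names are illustrative:

```python
import time

def replication_lag_seconds(conn):
    """Worst-case replica lag in seconds, measured on the primary."""
    cur = conn.cursor()
    cur.execute("""
        SELECT COALESCE(MAX(EXTRACT(EPOCH FROM replay_lag)), 0)
        FROM pg_stat_replication
    """)
    return float(cur.fetchone()[0])

def wait_for_replicas(conn, max_lag=5.0, poll=2.0):
    """Pause the backfill loop until all replicas are within max_lag."""
    while replication_lag_seconds(conn) > max_lag:
        time.sleep(poll)
```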
For truly massive tables (billions of rows), consider doing the backfill at the application level: update the new column whenever a row is naturally read or written, and run the batch backfill for the long tail of untouched rows.
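The read-side variant of that idea, sketched with hypothetical names — whenever a row is touched and the new column is still empty, migrate it on the spot:

```python
def get_order(conn, order_id):
    """Read path that opportunistically backfills the new column for
    rows the batch job has not reached yet."""
    cur = conn.cursor()
    cur.execute("SELECT id, status, status_v2 FROM orders WHERE id = %s",
                (order_id,))
    row = cur.fetchone()
    if row and row[2] is None:
        # Lazily migrate this row now that we have touched it anyway
        cur.execute("UPDATE orders SET status_v2 = status WHERE id = %s",
                    (order_id,))
        conn.commit()
    return row
```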
Blue-Green Database Deployments
Blue-green deployments for databases are harder than for stateless application servers, but the pattern exists and works well for certain scenarios.
The idea: maintain two database schemas (or databases) — blue (current) and green (next). Migrate green, point traffic to it, keep blue as a rollback target.
Schema-Level Blue-Green
-- Blue schema (current)
CREATE SCHEMA blue;
-- ... tables in blue schema
-- Green schema (next version)
CREATE SCHEMA green;
-- ... migrated tables in green schema
-- Application config points to schema
SET search_path = 'blue'; -- current
SET search_path = 'green'; -- after cutover
This works when your migration is a large structural change that's hard to do incrementally. You build the new schema, backfill it from the old one, then switch the application's search_path.
Limitations
True blue-green for databases requires either:
- A brief write-freeze during cutover (seconds, not minutes)
- Dual-write during the transition period
- Logical replication between old and new schemas
For most teams, the expand-contract pattern is more practical than blue-green for databases. Reserve blue-green for major version upgrades (e.g., PostgreSQL 14 to 16) where you set up logical replication between the old and new clusters.
Rollback Strategies
Down Migrations
Every migration tool supports down migrations. Write them:
-- 000005_add_verified_column.up.sql
ALTER TABLE users ADD COLUMN verified BOOLEAN DEFAULT false;
-- 000005_add_verified_column.down.sql
ALTER TABLE users DROP COLUMN verified;
But understand their limits. A down migration that drops a column destroys data. If you added the column, backfilled it over three days, and then need to roll back, that data is gone.
Forward-Only Rollbacks
A safer pattern: instead of rolling back the migration, roll forward with a new migration that undoes the change:
000005_add_verified_column.up.sql -- adds column
000006_remove_verified_column.up.sql -- removes it (if needed)
This keeps the migration history linear and auditable. Many teams mandate forward-only migrations in production while keeping down migrations for local development.
Point-In-Time Recovery
For catastrophic failures, you need PITR:
# PostgreSQL continuous archiving
archive_mode = on
archive_command = 'cp %p /archive/%f'
# Restore to a specific timestamp
recovery_target_time = '2025-03-15 14:30:00 UTC'
Test your PITR process regularly. The worst time to discover your backups don't work is during an incident.
The Migration Rollback Matrix
| Scenario | Strategy |
|---|---|
| Added a column, code not yet deployed | Down migration (safe, no data loss) |
| Changed column type, data transformed | Forward migration to revert; may lose precision |
| Dropped a column | PITR or restore from backup |
| Added an index | Drop it (fast, safe) |
| Data backfill went wrong | Forward migration to fix; depends on whether original data is preserved |
Testing Migrations
Against Production-Like Data
Your test database with 50 rows will not reveal the problems that show up with 50 million rows. Test against a copy of production data:
# Snapshot production (use your cloud provider's snapshot feature)
# Restore to a test instance
# Run the migration
# Measure: time, locks held, WAL generated, replication lag
Schema Diffing
After running migrations on a staging environment, diff the resulting schema against what you expect:
# Using pg_dump to compare schemas
pg_dump --schema-only production_db > prod_schema.sql
pg_dump --schema-only staging_db > staging_schema.sql
diff prod_schema.sql staging_schema.sql
Atlas has this built in:
atlas schema diff \
--from "postgres://localhost/production" \
--to "postgres://localhost/staging"
CI Pipeline Integration
Run migrations in CI against a fresh database and a database with the previous version's schema:
# .github/workflows/migrations.yml
jobs:
  migration-test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: test
          POSTGRES_PASSWORD: test
        ports:
          - 5432:5432
    env:
      DB_URL: postgres://postgres:test@localhost:5432/test?sslmode=disable
    steps:
      - uses: actions/checkout@v4
      - name: Run all migrations from scratch
        run: migrate -path ./migrations -database "$DB_URL" up
      - name: Verify schema matches expectations
        run: ./scripts/verify-schema.sh
      - name: Test rollback of latest migration
        run: migrate -path ./migrations -database "$DB_URL" down 1
      - name: Re-apply latest migration
        run: migrate -path ./migrations -database "$DB_URL" up
Statement-Level Analysis
Before running a migration in production, analyze what it will actually do:
-- Check if an ALTER TABLE will rewrite the table
-- (PostgreSQL-specific: check pg_catalog after the change)
-- Estimate lock duration using pg_stat_activity during staging test
SELECT pid, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
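One way to make that staging measurement concrete is to sample pg_stat_activity in a loop while the migration runs. A sketch with an illustrative helper name, assuming a DB-API connection:

```python
import time

def sample_lock_waits(conn, duration=30.0, interval=1.0):
    """Poll pg_stat_activity while the migration runs on staging, so you
    can see which sessions blocked on locks and for roughly how long."""
    samples = []
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        cur = conn.cursor()
        cur.execute("""
            SELECT pid, wait_event_type, wait_event, query
            FROM pg_stat_activity
            WHERE wait_event_type = 'Lock'
        """)
        samples.append(cur.fetchall())
        time.sleep(interval)
    return samples
```

Run it in a second terminal while the migration executes; a long run of non-empty samples tells you roughly how long the DDL blocked other queries.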
A Migration Checklist
Before every production migration:
- Is it backward compatible? Can the currently deployed code work with the new schema?
- Is it forward compatible? Can the about-to-be-deployed code work with the old schema (in case you need to roll back the code deploy)?
- Does it hold locks? If so, for how long? On how many rows?
- Has it been tested against production-sized data?
- Is there a rollback plan? Down migration, forward fix, or PITR?
- Is the backfill batched and idempotent?
- Is the migration wrapped in a transaction? Should it be? (CREATE INDEX CONCURRENTLY cannot be.)
- Has someone else reviewed it? Migration PRs deserve as much scrutiny as application code.
Summary
Database migrations will always carry risk. The goal is not to eliminate risk but to reduce it to a level where deploying a migration is routine rather than terrifying. The core practices:
- Use expand-contract for every breaking schema change
- Batch and throttle data backfills
- Test against production-scale data
- Write rollback plans before you need them
- Use CONCURRENTLY for index operations
- Keep migrations small, frequent, and reviewable
The teams that migrate databases confidently are not the ones with the best tools. They are the ones with the most disciplined processes.
Building resilient backend systems? This is article #17 in the Production Backend Patterns series. Follow for the next one.