Zero-Downtime Database Migrations: Patterns for Production PostgreSQL in 2026
Database migrations are the scariest part of deployments. One bad ALTER TABLE can lock your production database for minutes, turning a routine release into an incident. Here's how to make schema changes without any downtime.
The Core Problem
PostgreSQL acquires locks during DDL operations. A simple ALTER TABLE users ADD COLUMN email_verified boolean NOT NULL DEFAULT false on a table with millions of rows will:
- Acquire an ACCESS EXCLUSIVE lock on the table
- Rewrite the entire table (for NOT NULL with DEFAULT on PG < 11)
- Block ALL reads and writes until complete
On a table with 50M rows, this can take 30+ seconds. Every query hitting that table queues up, your connection pool saturates, and your app goes down.
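A cheap safety net that pairs with every pattern below: set a session-local lock_timeout before running DDL, so a blocked ALTER fails fast instead of making every query queue behind its lock request. A minimal sketch of a statement-building helper (the function name and 2-second default are my own choices, not from any library):

```python
# guarded_ddl.py — wrap a DDL statement with a lock_timeout so it fails
# fast instead of queueing all traffic behind its lock request.

def guarded_ddl(ddl: str, lock_timeout: str = "2s") -> list[str]:
    """Return the statements to run, in order: a session-local
    lock_timeout, the DDL itself, then a reset. If the ALTER cannot
    get its lock within the timeout, it errors out and can simply
    be retried, instead of stalling the whole application."""
    return [
        f"SET lock_timeout = '{lock_timeout}';",
        ddl,
        "RESET lock_timeout;",
    ]

for stmt in guarded_ddl("ALTER TABLE users ADD COLUMN email_verified boolean;"):
    print(stmt)
```

A failed, retried ALTER is a non-event; a 30-second queue of blocked queries is an incident.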
Pattern 1: Expand-Contract Migrations
Split every breaking change into three phases:
Phase 1 (Expand): Add new column/table, keep old one working
Phase 2 (Migrate): Backfill data, update app to write both
Phase 3 (Contract): Remove old column/table after verification
Example: Renaming a Column
Wrong (causes downtime):
ALTER TABLE users RENAME COLUMN name TO full_name;
Right (zero-downtime):
-- Migration 1: Expand
ALTER TABLE users ADD COLUMN full_name text;
-- Migration 2: Backfill (in batches)
UPDATE users SET full_name = name WHERE full_name IS NULL AND id BETWEEN 0 AND 10000;
UPDATE users SET full_name = name WHERE full_name IS NULL AND id BETWEEN 10001 AND 20000;
-- ... continue in batches
-- Migration 3: Add trigger to keep columns in sync
CREATE OR REPLACE FUNCTION sync_user_name() RETURNS trigger AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    -- Old app code inserts only name; new app code inserts only full_name
    NEW.full_name := coalesce(NEW.full_name, NEW.name);
    NEW.name := coalesce(NEW.name, NEW.full_name);
  ELSE
    IF NEW.name IS DISTINCT FROM OLD.name THEN
      NEW.full_name := NEW.name;
    END IF;
    IF NEW.full_name IS DISTINCT FROM OLD.full_name THEN
      NEW.name := NEW.full_name;
    END IF;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER sync_user_name_trigger
BEFORE INSERT OR UPDATE ON users
FOR EACH ROW EXECUTE FUNCTION sync_user_name();
-- Deploy app code that reads/writes full_name instead of name
-- Verify for a few days
-- Migration 4: Contract (after app fully migrated)
DROP TRIGGER sync_user_name_trigger ON users;
DROP FUNCTION sync_user_name();
ALTER TABLE users DROP COLUMN name;
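The Phase 2 backfill above is tedious to write by hand for a large table. A small generator can emit the batched UPDATE statements for your migration runner (the helper name and half-open id ranges are my own; the IS NULL guard makes each statement idempotent, so a failed run can be restarted from the top):

```python
# backfill_gen.py — emit batched, idempotent UPDATE statements for an
# expand-phase backfill. Illustrative sketch, not a library API.

def backfill_statements(table: str, assignment: str, guard: str,
                        max_id: int, batch_size: int = 10_000) -> list[str]:
    """One UPDATE per id range. The guard predicate (e.g. a NULL check
    on the new column) skips already-backfilled rows on restart."""
    stmts, lo = [], 0
    while lo < max_id:
        hi = lo + batch_size
        stmts.append(
            f"UPDATE {table} SET {assignment} "
            f"WHERE {guard} AND id > {lo} AND id <= {hi};"
        )
        lo = hi
    return stmts

for s in backfill_statements("users", "full_name = name",
                             "full_name IS NULL", max_id=25_000):
    print(s)
```

Run each statement in its own transaction, with a short sleep in between, exactly as Pattern 5 describes.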
Pattern 2: Safe Column Operations
Adding a Column
-- Safe on PostgreSQL 11+: a constant (non-volatile) default doesn't rewrite the table
ALTER TABLE orders ADD COLUMN status text DEFAULT 'pending';
-- For NOT NULL, add in two steps:
ALTER TABLE orders ADD COLUMN priority integer;
-- Backfill in batches...
ALTER TABLE orders ALTER COLUMN priority SET NOT NULL;
-- SET NOT NULL scans the full table (under ACCESS EXCLUSIVE) but does NOT
-- rewrite it; see Pattern 4 for avoiding the scan entirely
Removing a Column
Never drop columns in the same deploy as removing code references:
-- Step 1: Deploy app code that stops reading/writing the column
-- Step 2: Wait for all old app instances to drain
-- Step 3: Drop the column
ALTER TABLE users DROP COLUMN legacy_field;
Changing a Column Type
-- Wrong: rewrites entire table
ALTER TABLE events ALTER COLUMN payload TYPE jsonb USING payload::jsonb;
-- Right: create new column, backfill, swap
ALTER TABLE events ADD COLUMN payload_jsonb jsonb;
-- Backfill in batches
-- Requires PG 11+: COMMIT inside DO only works when the block is
-- not run inside an outer transaction
DO $$
DECLARE
  batch_size integer := 5000;
  max_id bigint;
  current_id bigint := 0;
BEGIN
  SELECT max(id) INTO max_id FROM events;
  WHILE current_id <= max_id LOOP
    UPDATE events
    SET payload_jsonb = payload::jsonb
    WHERE id > current_id AND id <= current_id + batch_size
      AND payload_jsonb IS NULL;
    current_id := current_id + batch_size;
    COMMIT;
    PERFORM pg_sleep(0.1); -- Brief pause to reduce load
  END LOOP;
END $$;
-- Then swap in app code, then drop old column
Pattern 3: Online Index Creation
Standard CREATE INDEX locks the table for writes. Always use CONCURRENTLY:
-- Wrong: blocks writes
CREATE INDEX idx_orders_customer ON orders (customer_id);
-- Right: doesn't block writes
CREATE INDEX CONCURRENTLY idx_orders_customer ON orders (customer_id);
Key caveats with CONCURRENTLY:
- Cannot run inside a transaction
- Takes longer (two table scans instead of one)
- Can fail and leave an invalid index — always check:
-- Check for invalid indexes after CONCURRENTLY
SELECT indexrelid::regclass, indisvalid
FROM pg_index
WHERE NOT indisvalid;
-- If invalid, drop and retry
DROP INDEX CONCURRENTLY idx_orders_customer;
CREATE INDEX CONCURRENTLY idx_orders_customer ON orders (customer_id);
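If your migration runner manages many concurrent index builds, the drop-and-retry step can be scripted. A small sketch (the function name and the name-to-DDL map are my own; the invalid-index names would come from the pg_index query above):

```python
# index_retry.py — given index names reported invalid by pg_index,
# emit the drop-and-retry DDL. Illustrative helper, not a library API.

def retry_statements(invalid_indexes: list[str],
                     definitions: dict[str, str]) -> list[str]:
    """For each invalid index, drop it CONCURRENTLY and re-issue the
    original CREATE INDEX CONCURRENTLY statement."""
    stmts = []
    for name in invalid_indexes:
        stmts.append(f"DROP INDEX CONCURRENTLY IF EXISTS {name};")
        stmts.append(definitions[name])
    return stmts

ddl = retry_statements(
    ["idx_orders_customer"],
    {"idx_orders_customer":
        "CREATE INDEX CONCURRENTLY idx_orders_customer ON orders (customer_id);"},
)
print("\n".join(ddl))
```

Remember that each statement must run outside a transaction block, so execute them one at a time with autocommit on.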
Replacing an Existing Index
-- Create new index concurrently
CREATE INDEX CONCURRENTLY idx_orders_customer_v2 ON orders (customer_id, created_at);
-- Swap (instant, just changes catalog)
-- Option A: Drop old, rename new
DROP INDEX CONCURRENTLY idx_orders_customer;
ALTER INDEX idx_orders_customer_v2 RENAME TO idx_orders_customer;
-- Option B: For unique constraints, swap via ALTER TABLE
-- (assumes a unique index idx_orders_email_v2 was already built CONCURRENTLY)
ALTER TABLE orders
DROP CONSTRAINT orders_email_key,
ADD CONSTRAINT orders_email_key UNIQUE USING INDEX idx_orders_email_v2;
Pattern 4: Safe Constraint Changes
Adding NOT NULL
-- Wrong: full table scan while holding ACCESS EXCLUSIVE lock
ALTER TABLE users ALTER COLUMN email SET NOT NULL;
-- Better: add CHECK constraint first (validated in background)
ALTER TABLE users ADD CONSTRAINT users_email_not_null
CHECK (email IS NOT NULL) NOT VALID;
-- Then validate (takes only a SHARE UPDATE EXCLUSIVE lock, doesn't block reads or writes)
ALTER TABLE users VALIDATE CONSTRAINT users_email_not_null;
-- On PostgreSQL 12+, SET NOT NULL sees the validated CHECK constraint and skips the table scan
ALTER TABLE users ALTER COLUMN email SET NOT NULL;
ALTER TABLE users DROP CONSTRAINT users_email_not_null;
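The four statements above always follow the same shape, so they are easy to generate per column. A sketch that emits them in order (the helper and the derived constraint name are my own convention):

```python
# not_null_gen.py — emit the safe four-step NOT NULL promotion for a
# column, per the CHECK-constraint pattern (PG 12+ for step 3 to skip
# the scan). Illustrative sketch, not a library API.

def safe_set_not_null(table: str, column: str) -> list[str]:
    """Steps: add CHECK NOT VALID (instant), validate in the background
    (no write blocking), promote to NOT NULL (scan skipped on PG 12+
    thanks to the validated CHECK), then drop the scaffolding CHECK."""
    c = f"{table}_{column}_not_null"  # derived constraint name
    return [
        f"ALTER TABLE {table} ADD CONSTRAINT {c} "
        f"CHECK ({column} IS NOT NULL) NOT VALID;",
        f"ALTER TABLE {table} VALIDATE CONSTRAINT {c};",
        f"ALTER TABLE {table} ALTER COLUMN {column} SET NOT NULL;",
        f"ALTER TABLE {table} DROP CONSTRAINT {c};",
    ]

print("\n".join(safe_set_not_null("users", "email")))
```

Run steps 1 and 2 in separate migrations so the validation scan never shares a deploy window with anything holding stronger locks.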
Adding Foreign Keys
-- Wrong: validates all existing rows while holding lock
ALTER TABLE orders ADD CONSTRAINT fk_orders_customer
FOREIGN KEY (customer_id) REFERENCES customers(id);
-- Right: add unvalidated, then validate separately
ALTER TABLE orders ADD CONSTRAINT fk_orders_customer
FOREIGN KEY (customer_id) REFERENCES customers(id) NOT VALID;
ALTER TABLE orders VALIDATE CONSTRAINT fk_orders_customer;
Pattern 5: Batched Backfills
Large UPDATE operations can cause issues: WAL bloat, replication lag, and lock contention. Always backfill in batches:
# backfill.py — Safe batched backfill
import time

import psycopg

BATCH_SIZE = 5000
SLEEP_BETWEEN_BATCHES = 0.1  # seconds

def backfill_column(conn_string: str, table: str, set_clause: str, where_clause: str):
    with psycopg.connect(conn_string) as conn:
        conn.autocommit = True  # Each batch is its own transaction

        # Get total count for progress reporting
        with conn.execute(f"SELECT count(*) FROM {table} WHERE {where_clause}") as cur:
            total = cur.fetchone()[0]
        if total == 0:
            print("Nothing to backfill")
            return

        processed = 0
        while True:
            with conn.execute(f"""
                WITH batch AS (
                    SELECT ctid FROM {table}
                    WHERE {where_clause}
                    LIMIT {BATCH_SIZE}
                    FOR UPDATE SKIP LOCKED
                )
                UPDATE {table} SET {set_clause}
                WHERE ctid IN (SELECT ctid FROM batch)
            """) as cur:
                affected = cur.rowcount
            if affected == 0:
                break
            processed += affected
            print(f"Progress: {processed}/{total} ({processed * 100 // total}%)")
            time.sleep(SLEEP_BETWEEN_BATCHES)
        print(f"Backfill complete: {processed} rows updated")

# Usage
backfill_column(
    "postgresql://localhost/mydb",
    "users",
    "email_normalized = lower(email)",
    "email_normalized IS NULL",
)
Pattern 6: Migration Linting
Catch dangerous migrations before they reach production. Use squawk for PostgreSQL:
# .github/workflows/migration-lint.yml
name: Migration Safety Check
on:
  pull_request:
    paths:
      - 'migrations/**'
jobs:
  lint-migrations:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Install squawk
        run: npm install -g squawk-cli
      - name: Lint new migrations
        run: |
          # Find migration files added or changed in this PR
          MIGRATIONS=$(git diff --name-only origin/main...HEAD -- 'migrations/*.sql')
          if [ -n "$MIGRATIONS" ]; then
            echo "$MIGRATIONS" | xargs squawk
          fi
squawk catches:
- ALTER TABLE ... ADD COLUMN ... DEFAULT (unsafe before PG 11)
- CREATE INDEX without CONCURRENTLY
- ALTER TABLE ... ADD CONSTRAINT ... FOREIGN KEY without NOT VALID
- Column type changes that rewrite tables
- NOT NULL additions without the CHECK constraint pattern
The Migration Checklist
Before every migration:
- Will it acquire an ACCESS EXCLUSIVE lock? If yes, restructure.
- Does it rewrite the table? Compare pg_class.relfilenode before and after on a staging copy (a rewrite assigns a new one), or check the PG docs.
- Are indexes created CONCURRENTLY?
- Are constraints added NOT VALID first?
- Is the backfill batched with sleep intervals?
- Can you roll back? Every expand migration should be reversible without data loss.
- Is the app compatible with both old and new schema? The expand phase MUST work with the current app code.
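If you want a last-resort check inside your own migration runner rather than CI, a few of these checklist items can be approximated with plain string tests. A deliberately tiny, illustrative subset of what a real linter like squawk does (the rule list and messages are my own, and naive substring matching will miss plenty that a real parser catches):

```python
# migration_lint.py — a tiny, illustrative subset of migration safety
# checks. Rules and wording are my own; use squawk for real linting.

def lint_migration(sql: str) -> list[str]:
    """Return human-readable warnings for obviously dangerous patterns."""
    s = " ".join(sql.upper().split())  # normalize whitespace and case
    warnings = []
    if "CREATE INDEX" in s and "CONCURRENTLY" not in s:
        warnings.append("CREATE INDEX without CONCURRENTLY blocks writes")
    if "FOREIGN KEY" in s and "NOT VALID" not in s:
        warnings.append("add foreign keys NOT VALID, then VALIDATE CONSTRAINT")
    if "SET NOT NULL" in s and "CHECK" not in s:
        warnings.append("SET NOT NULL scans under ACCESS EXCLUSIVE; "
                        "use the CHECK constraint pattern first")
    if "RENAME COLUMN" in s:
        warnings.append("renames break running app code; use expand-contract")
    return warnings

for w in lint_migration("CREATE INDEX idx_orders_customer ON orders (customer_id);"):
    print("WARNING:", w)
```

Substring checks like these are a guardrail, not a parser; keep squawk in CI as the real gate.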
Real-World Timing
On a 100M row table with PostgreSQL 15:
| Operation | Naive | Zero-Downtime |
|---|---|---|
| Add column with default | 45s locked | <1ms (PG 11+) |
| Add NOT NULL | 30s locked | <1ms + background validate |
| Create index | 120s locked | 180s, no locks |
| Change column type | 90s locked | Background, no locks |
| Rename column | Instant but breaks app | 3 deploys, zero downtime |
The zero-downtime approach takes more deploys but zero seconds of user-facing downtime.
Tools That Help
- squawk: Migration linter for PostgreSQL
- pgroll: Automated zero-downtime migrations from Xata
- reshape: Schema migration tool with automatic expand-contract
- pg_repack: Repacks tables without ACCESS EXCLUSIVE locks
- Flyway/Liquibase: Migration runners (use with the patterns above)
Have you experienced a migration-related outage? What patterns does your team use for zero-downtime schema changes? Share in the comments — I'm collecting war stories for a follow-up post on migration disaster recovery.