Zero-Downtime Database Migrations: Patterns for Production PostgreSQL in 2026
Database migrations are the scariest part of deployments. One bad ALTER TABLE can lock your production database for minutes, turning a routine release into an incident. Here's how to make schema changes without any downtime.
The Core Problem
PostgreSQL acquires locks during DDL operations. A simple ALTER TABLE users ADD COLUMN email_verified boolean NOT NULL DEFAULT false on a table with millions of rows will:
- Acquire an ACCESS EXCLUSIVE lock on the table
- Rewrite the entire table (for NOT NULL with DEFAULT on PG < 11)
- Block ALL reads and writes until complete
On a table with 50M rows, this can take 30+ seconds. Every query hitting that table queues up, your connection pool saturates, and your app goes down.
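A cheap safety net that pairs with every pattern below: set a session-local lock_timeout before running DDL, so a blocked ALTER fails fast instead of making every query queue behind its lock request. A minimal sketch of a statement-building helper (the function name and 2-second default are my own choices, not from any library):

```python
# guarded_ddl.py — wrap a DDL statement with a lock_timeout so it fails
# fast instead of queueing all traffic behind its lock request.

def guarded_ddl(ddl: str, lock_timeout: str = "2s") -> list[str]:
    """Return the statements to run, in order: a session-local
    lock_timeout, the DDL itself, then a reset. If the ALTER cannot
    get its lock within the timeout, it errors out and can simply
    be retried, instead of stalling the whole application."""
    return [
        f"SET lock_timeout = '{lock_timeout}';",
        ddl,
        "RESET lock_timeout;",
    ]

for stmt in guarded_ddl("ALTER TABLE users ADD COLUMN email_verified boolean;"):
    print(stmt)
```

A failed, retried ALTER is a non-event; a 30-second queue of blocked queries is an incident.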
Pattern 1: Expand-Contract Migrations
Split every breaking change into three phases:
Phase 1 (Expand): Add new column/table, keep old one working
Phase 2 (Migrate): Backfill data, update app to write both
Phase 3 (Contract): Remove old column/table after verification
Example: Renaming a Column
Wrong (causes downtime):
ALTER TABLE users RENAME COLUMN name TO full_name;
Right (zero-downtime):
-- Migration 1: Expand
ALTER TABLE users ADD COLUMN full_name text;
-- Migration 2: Backfill (in batches)
UPDATE users SET full_name = name WHERE full_name IS NULL AND id BETWEEN 0 AND 10000;
UPDATE users SET full_name = name WHERE full_name IS NULL AND id BETWEEN 10001 AND 20000;
-- ... continue in batches
-- Migration 3: Add trigger to keep columns in sync
CREATE OR REPLACE FUNCTION sync_user_name() RETURNS trigger AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    -- Old app code inserts only name; new app code inserts only full_name
    NEW.full_name := coalesce(NEW.full_name, NEW.name);
    NEW.name := coalesce(NEW.name, NEW.full_name);
  ELSE
    IF NEW.name IS DISTINCT FROM OLD.name THEN
      NEW.full_name := NEW.name;
    END IF;
    IF NEW.full_name IS DISTINCT FROM OLD.full_name THEN
      NEW.name := NEW.full_name;
    END IF;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER sync_user_name_trigger
BEFORE INSERT OR UPDATE ON users
FOR EACH ROW EXECUTE FUNCTION sync_user_name();
-- Deploy app code that reads/writes full_name instead of name
-- Verify for a few days
-- Migration 4: Contract (after app fully migrated)
DROP TRIGGER sync_user_name_trigger ON users;
DROP FUNCTION sync_user_name();
ALTER TABLE users DROP COLUMN name;
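The Phase 2 backfill above is tedious to write by hand for a large table. A small generator can emit the batched UPDATE statements for your migration runner (the helper name and half-open id ranges are my own; the IS NULL guard makes each statement idempotent, so a failed run can be restarted from the top):

```python
# backfill_gen.py — emit batched, idempotent UPDATE statements for an
# expand-phase backfill. Illustrative sketch, not a library API.

def backfill_statements(table: str, assignment: str, guard: str,
                        max_id: int, batch_size: int = 10_000) -> list[str]:
    """One UPDATE per id range. The guard predicate (e.g. a NULL check
    on the new column) skips already-backfilled rows on restart."""
    stmts, lo = [], 0
    while lo < max_id:
        hi = lo + batch_size
        stmts.append(
            f"UPDATE {table} SET {assignment} "
            f"WHERE {guard} AND id > {lo} AND id <= {hi};"
        )
        lo = hi
    return stmts

for s in backfill_statements("users", "full_name = name",
                             "full_name IS NULL", max_id=25_000):
    print(s)
```

Run each statement in its own transaction, with a short sleep in between, exactly as Pattern 5 describes.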
Pattern 2: Safe Column Operations
Adding a Column
-- Safe on PostgreSQL 11+: a constant (non-volatile) default doesn't rewrite the table
ALTER TABLE orders ADD COLUMN status text DEFAULT 'pending';
-- For NOT NULL, add in two steps:
ALTER TABLE orders ADD COLUMN priority integer;
-- Backfill in batches...
ALTER TABLE orders ALTER COLUMN priority SET NOT NULL;
-- SET NOT NULL scans the full table (under ACCESS EXCLUSIVE) but does NOT
-- rewrite it; see Pattern 4 for avoiding the scan entirely
Removing a Column
Never drop columns in the same deploy as removing code references:
-- Step 1: Deploy app code that stops reading/writing the column
-- Step 2: Wait for all old app instances to drain
-- Step 3: Drop the column
ALTER TABLE users DROP COLUMN legacy_field;
Changing a Column Type
-- Wrong: rewrites entire table
ALTER TABLE events ALTER COLUMN payload TYPE jsonb USING payload::jsonb;
-- Right: create new column, backfill, swap
ALTER TABLE events ADD COLUMN payload_jsonb jsonb;
-- Backfill in batches
-- Requires PG 11+: COMMIT inside DO only works when the block is
-- not run inside an outer transaction
DO $$
DECLARE
  batch_size integer := 5000;
  max_id bigint;
  current_id bigint := 0;
BEGIN
  SELECT max(id) INTO max_id FROM events;
  WHILE current_id <= max_id LOOP
    UPDATE events
    SET payload_jsonb = payload::jsonb
    WHERE id > current_id AND id <= current_id + batch_size
      AND payload_jsonb IS NULL;
    current_id := current_id + batch_size;
    COMMIT;
    PERFORM pg_sleep(0.1); -- Brief pause to reduce load
  END LOOP;
END $$;
-- Then swap in app code, then drop old column
Pattern 3: Online Index Creation
Standard CREATE INDEX locks the table for writes. Always use CONCURRENTLY:
-- Wrong: blocks writes
CREATE INDEX idx_orders_customer ON orders (customer_id);
-- Right: doesn't block writes
CREATE INDEX CONCURRENTLY idx_orders_customer ON orders (customer_id);
Key caveats with CONCURRENTLY:
- Cannot run inside a transaction
- Takes longer (two table scans instead of one)
- Can fail and leave an invalid index — always check:
-- Check for invalid indexes after CONCURRENTLY
SELECT indexrelid::regclass, indisvalid
FROM pg_index
WHERE NOT indisvalid;
-- If invalid, drop and retry
DROP INDEX CONCURRENTLY idx_orders_customer;
CREATE INDEX CONCURRENTLY idx_orders_customer ON orders (customer_id);
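If your migration runner manages many concurrent index builds, the drop-and-retry step can be scripted. A small sketch (the function name and the name-to-DDL map are my own; the invalid-index names would come from the pg_index query above):

```python
# index_retry.py — given index names reported invalid by pg_index,
# emit the drop-and-retry DDL. Illustrative helper, not a library API.

def retry_statements(invalid_indexes: list[str],
                     definitions: dict[str, str]) -> list[str]:
    """For each invalid index, drop it CONCURRENTLY and re-issue the
    original CREATE INDEX CONCURRENTLY statement."""
    stmts = []
    for name in invalid_indexes:
        stmts.append(f"DROP INDEX CONCURRENTLY IF EXISTS {name};")
        stmts.append(definitions[name])
    return stmts

ddl = retry_statements(
    ["idx_orders_customer"],
    {"idx_orders_customer":
        "CREATE INDEX CONCURRENTLY idx_orders_customer ON orders (customer_id);"},
)
print("\n".join(ddl))
```

Remember that each statement must run outside a transaction block, so execute them one at a time with autocommit on.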
Replacing an Existing Index
-- Create new index concurrently
CREATE INDEX CONCURRENTLY idx_orders_customer_v2 ON orders (customer_id, created_at);
-- Swap (instant, just changes catalog)
-- Option A: Drop old, rename new
DROP INDEX CONCURRENTLY idx_orders_customer;
ALTER INDEX idx_orders_customer_v2 RENAME TO idx_orders_customer;
-- Option B: For unique constraints, swap via ALTER TABLE
-- (assumes a unique index idx_orders_email_v2 was already built CONCURRENTLY)
ALTER TABLE orders
DROP CONSTRAINT orders_email_key,
ADD CONSTRAINT orders_email_key UNIQUE USING INDEX idx_orders_email_v2;
Pattern 4: Safe Constraint Changes
Adding NOT NULL
-- Wrong: full table scan while holding ACCESS EXCLUSIVE lock
ALTER TABLE users ALTER COLUMN email SET NOT NULL;
-- Better: add CHECK constraint first (validated in background)
ALTER TABLE users ADD CONSTRAINT users_email_not_null
CHECK (email IS NOT NULL) NOT VALID;
-- Then validate (takes only a SHARE UPDATE EXCLUSIVE lock, doesn't block reads or writes)
ALTER TABLE users VALIDATE CONSTRAINT users_email_not_null;
-- On PostgreSQL 12+, SET NOT NULL sees the validated CHECK constraint and skips the table scan
ALTER TABLE users ALTER COLUMN email SET NOT NULL;
ALTER TABLE users DROP CONSTRAINT users_email_not_null;
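The four statements above always follow the same shape, so they are easy to generate per column. A sketch that emits them in order (the helper and the derived constraint name are my own convention):

```python
# not_null_gen.py — emit the safe four-step NOT NULL promotion for a
# column, per the CHECK-constraint pattern (PG 12+ for step 3 to skip
# the scan). Illustrative sketch, not a library API.

def safe_set_not_null(table: str, column: str) -> list[str]:
    """Steps: add CHECK NOT VALID (instant), validate in the background
    (no write blocking), promote to NOT NULL (scan skipped on PG 12+
    thanks to the validated CHECK), then drop the scaffolding CHECK."""
    c = f"{table}_{column}_not_null"  # derived constraint name
    return [
        f"ALTER TABLE {table} ADD CONSTRAINT {c} "
        f"CHECK ({column} IS NOT NULL) NOT VALID;",
        f"ALTER TABLE {table} VALIDATE CONSTRAINT {c};",
        f"ALTER TABLE {table} ALTER COLUMN {column} SET NOT NULL;",
        f"ALTER TABLE {table} DROP CONSTRAINT {c};",
    ]

print("\n".join(safe_set_not_null("users", "email")))
```

Run steps 1 and 2 in separate migrations so the validation scan never shares a deploy window with anything holding stronger locks.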
Adding Foreign Keys
-- Wrong: validates all existing rows while holding lock
ALTER TABLE orders ADD CONSTRAINT fk_orders_customer
FOREIGN KEY (customer_id) REFERENCES customers(id);
-- Right: add unvalidated, then validate separately
ALTER TABLE orders ADD CONSTRAINT fk_orders_customer
FOREIGN KEY (customer_id) REFERENCES customers(id) NOT VALID;
ALTER TABLE orders VALIDATE CONSTRAINT fk_orders_customer;
Pattern 5: Batched Backfills
Large UPDATE operations can cause issues: WAL bloat, replication lag, and lock contention. Always backfill in batches:
# backfill.py — Safe batched backfill
import time

import psycopg

BATCH_SIZE = 5000
SLEEP_BETWEEN_BATCHES = 0.1  # seconds

def backfill_column(conn_string: str, table: str, set_clause: str, where_clause: str):
    with psycopg.connect(conn_string) as conn:
        conn.autocommit = True  # Each batch is its own transaction

        # Get total count for progress reporting
        with conn.execute(f"SELECT count(*) FROM {table} WHERE {where_clause}") as cur:
            total = cur.fetchone()[0]
        if total == 0:
            print("Nothing to backfill")
            return

        processed = 0
        while True:
            with conn.execute(f"""
                WITH batch AS (
                    SELECT ctid FROM {table}
                    WHERE {where_clause}
                    LIMIT {BATCH_SIZE}
                    FOR UPDATE SKIP LOCKED
                )
                UPDATE {table} SET {set_clause}
                WHERE ctid IN (SELECT ctid FROM batch)
            """) as cur:
                affected = cur.rowcount
            if affected == 0:
                break
            processed += affected
            print(f"Progress: {processed}/{total} ({processed * 100 // total}%)")
            time.sleep(SLEEP_BETWEEN_BATCHES)
        print(f"Backfill complete: {processed} rows updated")

# Usage
backfill_column(
    "postgresql://localhost/mydb",
    "users",
    "email_normalized = lower(email)",
    "email_normalized IS NULL",
)
Pattern 6: Migration Linting
Catch dangerous migrations before they reach production. Use squawk for PostgreSQL:
# .github/workflows/migration-lint.yml
name: Migration Safety Check
on:
  pull_request:
    paths:
      - 'migrations/**'
jobs:
  lint-migrations:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Install squawk
        run: npm install -g squawk-cli
      - name: Lint new migrations
        run: |
          # Find migration files added or changed in this PR
          MIGRATIONS=$(git diff --name-only origin/main...HEAD -- 'migrations/*.sql')
          if [ -n "$MIGRATIONS" ]; then
            echo "$MIGRATIONS" | xargs squawk
          fi
squawk catches:
- ALTER TABLE ... ADD COLUMN ... DEFAULT (unsafe before PG 11)
- CREATE INDEX without CONCURRENTLY
- ALTER TABLE ... ADD CONSTRAINT ... FOREIGN KEY without NOT VALID
- Column type changes that rewrite tables
- NOT NULL additions without the CHECK constraint pattern
The Migration Checklist
Before every migration:
- Will it acquire an ACCESS EXCLUSIVE lock? If yes, restructure.
- Does it rewrite the table? Compare pg_class.relfilenode before and after on a staging copy (a rewrite assigns a new one), or check the PG docs.
- Are indexes created CONCURRENTLY?
- Are constraints added NOT VALID first?
- Is the backfill batched with sleep intervals?
- Can you roll back? Every expand migration should be reversible without data loss.
- Is the app compatible with both old and new schema? The expand phase MUST work with the current app code.
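If you want a last-resort check inside your own migration runner rather than CI, a few of these checklist items can be approximated with plain string tests. A deliberately tiny, illustrative subset of what a real linter like squawk does (the rule list and messages are my own, and naive substring matching will miss plenty that a real parser catches):

```python
# migration_lint.py — a tiny, illustrative subset of migration safety
# checks. Rules and wording are my own; use squawk for real linting.

def lint_migration(sql: str) -> list[str]:
    """Return human-readable warnings for obviously dangerous patterns."""
    s = " ".join(sql.upper().split())  # normalize whitespace and case
    warnings = []
    if "CREATE INDEX" in s and "CONCURRENTLY" not in s:
        warnings.append("CREATE INDEX without CONCURRENTLY blocks writes")
    if "FOREIGN KEY" in s and "NOT VALID" not in s:
        warnings.append("add foreign keys NOT VALID, then VALIDATE CONSTRAINT")
    if "SET NOT NULL" in s and "CHECK" not in s:
        warnings.append("SET NOT NULL scans under ACCESS EXCLUSIVE; "
                        "use the CHECK constraint pattern first")
    if "RENAME COLUMN" in s:
        warnings.append("renames break running app code; use expand-contract")
    return warnings

for w in lint_migration("CREATE INDEX idx_orders_customer ON orders (customer_id);"):
    print("WARNING:", w)
```

Substring checks like these are a guardrail, not a parser; keep squawk in CI as the real gate.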
Real-World Timing
On a 100M row table with PostgreSQL 15:
| Operation | Naive | Zero-Downtime |
|---|---|---|
| Add column with default | 45s locked | <1ms (PG 11+) |
| Add NOT NULL | 30s locked | <1ms + background validate |
| Create index | 120s locked | 180s, no locks |
| Change column type | 90s locked | Background, no locks |
| Rename column | Instant but breaks app | 3 deploys, zero downtime |
The zero-downtime approach takes more deploys but zero seconds of user-facing downtime.
Tools That Help
- squawk: Migration linter for PostgreSQL
- pgroll: Automated zero-downtime migrations from Xata
- reshape: Schema migration tool with automatic expand-contract
- pg_repack: Repacks tables without ACCESS EXCLUSIVE locks
- Flyway/Liquibase: Migration runners (use with the patterns above)
Have you experienced a migration-related outage? What patterns does your team use for zero-downtime schema changes? Share in the comments — I'm collecting war stories for a follow-up post on migration disaster recovery.