DEV Community

ZNY
ZNY

Posted on

Database Migration Strategies That Actually Work in Production

Database Migration Strategies That Actually Work in Production

Database migrations are the thing that looks simple until you're凌晨3点 trying to recover from a migration that locked your production table for 45 minutes. After running migrations on systems with billions of rows, here's what I've learned about doing them safely.

The Fundamental Problem

Most migration tutorials show you this:

class AddIndexesToOrders < ActiveRecord::Migration[7.0]
  def change
    add_index :orders, :user_id
  end
end
Enter fullscreen mode Exit fullscreen mode

This is fine in development. In production with 50 million orders, this will lock your table and bring down your application.

The Core Principle: Small Changes, Applied Incrementally

Every production-safe migration follows the same pattern: make the change in small, non-breaking steps that can be rolled back independently.

Expanding-Contract Pattern

Instead of one big migration, use three:

Migration 1: Add new column (nullable)

-- Migration 1: Safe - never locks
ALTER TABLE orders ADD COLUMN user_email VARCHAR(255);

-- Update application to write BOTH old and new columns
-- Deploy this first
Enter fullscreen mode Exit fullscreen mode

Migration 2: Backfill data (in batches)

-- Safe batched backfill
DO $$
DECLARE
  batch_size INT := 10000;
  offset_val INT := 0;
  max_id INT;
  updated INT;
BEGIN
  SELECT MAX(id) INTO max_id FROM orders;

  LOOP
    UPDATE orders 
    SET user_email = (SELECT email FROM users WHERE users.id = orders.user_id)
    WHERE id IN (
      SELECT id FROM orders 
      WHERE user_email IS NULL 
      AND id <= max_id
      ORDER BY id 
      LIMIT batch_size
    );

    GET DIAGNOSTICS updated = ROW_COUNT;
    EXIT WHEN updated = 0;

    -- Prevent lock contention
    PERFORM pg_sleep(0.1);
  END LOOP;
END $$;
Enter fullscreen mode Exit fullscreen mode

Migration 3: Add NOT NULL constraint

-- Now safe because all rows have values
ALTER TABLE orders ALTER COLUMN user_email SET NOT NULL;
Enter fullscreen mode Exit fullscreen mode

Handling Long-Running Migrations

The Lock Timeout Strategy

-- Set a short lock timeout so migration fails fast instead of hanging
SET lock_timeout = '2s';

-- Migration that might need a lock
ALTER TABLE orders ADD COLUMN status VARCHAR(50);

-- If it can't get lock in 2s, it fails immediately
-- Instead of blocking for minutes
Enter fullscreen mode Exit fullscreen mode

Concurrent Index Building

Never use CREATE INDEX in production. Always use CREATE INDEX CONCURRENTLY.

-- BAD: Locks table, blocks reads/writes
CREATE INDEX idx_orders_user_id ON orders(user_id);

-- GOOD: Runs without locking, takes longer but zero downtime
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders(user_id);
Enter fullscreen mode Exit fullscreen mode

Critical note: CREATE INDEX CONCURRENTLY cannot run inside a transaction. Your migration framework needs to handle this.

# Rails: Tell it to run outside a transaction
class AddIndexesToOrders < ActiveRecord::Migration[7.0]
  disable_ddl_transaction!

  def change
    add_index :orders, :user_id, algorithm: :concurrently
  end
end
Enter fullscreen mode Exit fullscreen mode

Schema Versioning: The Branching Model

For complex systems, treat database schema like code with proper branching:

main (production schema)
  └── staging-test (validate migrations)
        └── feature/user-email-migration (your change)
Enter fullscreen mode Exit fullscreen mode
# Before starting a migration
git checkout main
git pull
git checkout -b migration/user-email-fix

# Run migrations locally against fresh production copy
# Once validated:
git checkout main
git merge migration/user-email-fix
# Deploy migration to production
Enter fullscreen mode Exit fullscreen mode

The Pre-Migration Checklist

Before ANY production migration:

□ Tested on production-size dataset (at minimum on staging with production data snapshot)
□ Lock duration estimated (use EXPLAIN ANALYZE)
□ Rollback plan documented
□ Canary/deploy step prepared (migrate 1% of traffic, observe, then full deploy)
□ Alert thresholds set (if migration causes >X% error rate, auto-rollback)
□ Migrations scheduled during low-traffic window
□ On-call engineer aware and standing by
□ Database backup verified (point-in-time recovery tested)
□ Lock timeout set appropriately
□ Query plan examined for full table scans
Enter fullscreen mode Exit fullscreen mode

Real-World Example: Renaming a Column Safely

Renaming a column is a four-migration process:

Migration 1: Add new column (double-write starts)

ALTER TABLE users ADD COLUMN display_name VARCHAR(100);
Enter fullscreen mode Exit fullscreen mode
# Update application to WRITE to both columns
# User.where(name: 'John').update(display_name: 'John') 
# runs in background
Enter fullscreen mode Exit fullscreen mode

Migration 2: Backfill

UPDATE users SET display_name = name WHERE display_name IS NULL;
-- In batches of 10,000 with 0.1s sleep
Enter fullscreen mode Exit fullscreen mode

Migration 3: Stop reading from old column

# Deploy code that only reads from display_name
# Verify everything works
Enter fullscreen mode Exit fullscreen mode

Migration 4: Drop old column

ALTER TABLE users DROP COLUMN name;
-- Must run outside transaction for PostgreSQL
Enter fullscreen mode Exit fullscreen mode

PostgreSQL-Specific Tools

pg_repack: Remove Bloat Without Table Locks

# Install
CREATE EXTENSION pg_repack;

# Repack a bloated table without locking
pg_repack -d mydb -t orders --no-indexes

# Repack with specific index
pg_repack -d mydb -t orders -i idx_orders_user_id
Enter fullscreen mode Exit fullscreen mode

pg_activity: Monitor Migration Progress

# Watch active queries during migration
pg_activity -h localhost -U postgres

# Or query directly
SELECT pid, state, query, query_start, now() - query_start AS duration
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;
Enter fullscreen mode Exit fullscreen mode

The Migration Locking Hierarchy

Understanding lock modes prevents surprises:

Lock Mode Blocks
Access Share DROP TABLE, TRUNCATE
Row Share DELETE, UPDATE, SELECT FOR UPDATE
Row Exclusive INSERT, UPDATE, DELETE
Share Update Exclusive ANALYZE, CREATE INDEX CONCURRENTLY
Share CREATE INDEX (blocking)
Share Row Exclusive ALTER TABLE
Exclusive REFRESH MATERIALIZED VIEW CONCURRENTLY
Access Exclusive DROP TABLE, TRUNCATE, most ALTER TABLE

The Golden Rule

If your migration takes more than 100ms on a production table, it's wrong. Go back and break it into smaller pieces.

The goal is always: zero downtime, zero data loss, instant rollback capability.


What migration horror stories do you have? What's your go-to strategy for risky migrations?

Top comments (0)