Introduction
Database schema changes are among the riskiest operations in software deployment. A botched migration can bring down your entire application, corrupt data, or cause hours of downtime while you scramble to roll back changes.
Yet databases must evolve. Features require new tables and columns. Performance optimizations demand index changes. Refactoring necessitates schema restructuring. The challenge is executing these changes without disrupting service availability.
In this comprehensive guide, we'll explore battle-tested strategies for zero-downtime database migrations, covering everything from simple column additions to complex table restructuring.
Why Database Migrations Are Risky
Unlike stateless application deployments, database changes involve persistent data:
Data Permanence
-- Application code: Easy to rollback
git revert abc123
kubectl rollout undo deployment/myapp
-- Database: Permanent changes
ALTER TABLE users DROP COLUMN email;
-- Data is GONE forever
-- Can't easily rollback
Schema Lock Times
Some operations lock tables, blocking reads/writes:
-- PostgreSQL (before version 11): adding a column with a DEFAULT rewrites the table
ALTER TABLE users
ADD COLUMN email VARCHAR(255) NOT NULL DEFAULT '';
-- Table locked (ACCESS EXCLUSIVE) until the rewrite completes
-- On a 10M row table: 30+ seconds of downtime
-- MySQL (before 5.6, or with ALGORITHM=COPY): adding an index blocks writes
ALTER TABLE users ADD INDEX idx_email (email);
-- Table locked during the index build
-- On a 100M row table: 10+ minutes of blocked writes
Application-Database Version Mismatch
Old code + new schema = errors:
Sequence of events:
1. Deploy schema change: DROP COLUMN first_name
2. Old application code still running: SELECT first_name FROM users
3. ERROR: column "first_name" does not exist
4. Application crashes
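To make the failure concrete, here is a minimal sketch of what the still-running old code sees once the column is gone. It assumes psycopg2 and a hypothetical connection string; the point is that the error surfaces in the application, not in the migration itself:
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical DSN
cur = conn.cursor()

try:
    # Old application code still references the dropped column
    cur.execute("SELECT first_name FROM users LIMIT 1")
except psycopg2.errors.UndefinedColumn as exc:
    # PostgreSQL reports: column "first_name" does not exist
    print(f"Schema/code mismatch: {exc}")
    conn.rollback()  # the failed statement aborted the transaction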
Rollback Complexity
Rolling back destructive changes requires data restoration:
-- Forward migration
ALTER TABLE users DROP COLUMN legacy_id;
-- Rollback requires:
1. Restore from backup
2. Replay transaction log
3. Potential data loss from gap
4. Extended downtime
Fundamental Principles
1. Backward Compatibility
Every schema change must work with both old and new application code:
Phase 1: Deploy schema (works with old code)
Phase 2: Deploy new code (works with new schema)
Phase 3: Cleanup (remove old code support)
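As a sketch of what Phase 2 code can look like, here is a read path that tolerates both schema versions. It assumes the first_name/last_name to full_name change used later in this guide and rows fetched as dictionaries:
def get_display_name(row: dict) -> str:
    # Phase 2 code: works whether or not the backfill has reached this row
    full_name = row.get("full_name")
    if full_name:
        return full_name
    # Fall back to the old columns until the migration completes
    return f"{row.get('first_name', '')} {row.get('last_name', '')}".strip()
Only after every row is backfilled and no reader depends on the old columns does Phase 3 drop first_name and last_name.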
2. Expand-Migrate-Contract Pattern
Never remove or modify schema elements in place. Instead:
1. EXPAND: Add new schema elements
2. MIGRATE: Dual-write to old and new
3. CONTRACT: Remove old schema elements
3. Small, Incremental Changes
Break large migrations into small, deployable chunks:
-- Bad: One large migration
BEGIN;
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);
UPDATE users SET full_name = CONCAT(first_name, ' ', last_name);
ALTER TABLE users DROP COLUMN first_name;
ALTER TABLE users DROP COLUMN last_name;
COMMIT;
-- If this fails halfway: disaster
-- Good: Multiple small migrations
-- Migration 1:
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);
-- Migration 2 (background job):
UPDATE users SET full_name = CONCAT(first_name, ' ', last_name)
WHERE full_name IS NULL;
-- Migration 3 (after new code deployed):
ALTER TABLE users DROP COLUMN first_name;
ALTER TABLE users DROP COLUMN last_name;
4. Test Everything
Test migrations on production-like data:
# Restore production snapshot to staging
pg_dump production_db | psql staging_db
# Run migration
psql staging_db < migration.sql
# Time the operation
time psql staging_db < migration.sql
# If >5 seconds on staging, too slow for production
# Test rollback
psql staging_db < rollback.sql
Safe Migration Patterns
Adding a Nullable Column
Simplest case—usually safe:
-- PostgreSQL: Fast, no locks
ALTER TABLE users ADD COLUMN phone VARCHAR(20);
-- MySQL: may lock the table; use ALGORITHM=INPLACE (or INSTANT on 8.0+) to avoid a copy
ALTER TABLE users
ADD COLUMN phone VARCHAR(20),
ALGORITHM=INPLACE, LOCK=NONE;
Adding a NOT NULL Column
Requires default value to avoid locking:
-- Bad: fails outright on a non-empty table (no default for existing rows),
-- and bundling a DEFAULT rewrites the table on PostgreSQL before 11
ALTER TABLE users ADD COLUMN status VARCHAR(20) NOT NULL;
-- Good: Add as nullable, set default, make NOT NULL
-- Step 1: Add nullable column with default
ALTER TABLE users
ADD COLUMN status VARCHAR(20) DEFAULT 'active';
-- Step 2: Backfill existing rows (batch this on large tables;
-- on PostgreSQL 11+ the default above already fills existing rows)
UPDATE users SET status = 'active' WHERE status IS NULL;
-- Step 3: Add NOT NULL constraint
-- (validates by scanning the table; PostgreSQL 12+ skips the scan
-- if a validated CHECK (status IS NOT NULL) constraint already exists)
ALTER TABLE users
ALTER COLUMN status SET NOT NULL;
Renaming a Column
Use expand-migrate-contract:
-- Migration 1: Add new column
ALTER TABLE users ADD COLUMN email_address VARCHAR(255);
-- Deploy code that writes to BOTH columns:
INSERT INTO users (email, email_address)
VALUES ('user@example.com', 'user@example.com');
-- Migration 2: Backfill data
UPDATE users
SET email_address = email
WHERE email_address IS NULL;
-- Deploy code that reads from email_address:
SELECT email_address FROM users;
-- Deploy code that only writes to email_address:
INSERT INTO users (email_address) VALUES ('user@example.com');
-- Migration 3: Drop old column
ALTER TABLE users DROP COLUMN email;
Changing Column Type
Create new column, migrate data, swap:
-- Goal: Change user_id from INT to BIGINT
-- Migration 1: Add new column
ALTER TABLE orders ADD COLUMN user_id_bigint BIGINT;
-- Migration 2: Backfill in batches to avoid long locks
-- (PostgreSQL has no UPDATE ... LIMIT, so constrain via a subquery)
UPDATE orders
SET user_id_bigint = user_id::BIGINT
WHERE id IN (
SELECT id FROM orders
WHERE user_id_bigint IS NULL
LIMIT 10000
);
-- Repeat until no rows are updated
-- Deploy code that uses user_id_bigint
-- Migration 3: Swap columns
BEGIN;
ALTER TABLE orders DROP COLUMN user_id;
ALTER TABLE orders RENAME COLUMN user_id_bigint TO user_id;
COMMIT;
Adding an Index
Use concurrent index creation:
-- PostgreSQL: CREATE INDEX CONCURRENTLY
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
-- Doesn't block reads or writes
-- Takes longer but zero downtime
-- MySQL 5.6+: ALGORITHM=INPLACE
ALTER TABLE users
ADD INDEX idx_email (email),
ALGORITHM=INPLACE, LOCK=NONE;
Removing an Index
-- PostgreSQL: DROP INDEX CONCURRENTLY
DROP INDEX CONCURRENTLY idx_users_old_email;
-- MySQL: Standard DROP INDEX is fast
ALTER TABLE users DROP INDEX idx_old_email;
Adding a Foreign Key
Add without validation, then validate:
-- PostgreSQL: Add constraint NOT VALID
ALTER TABLE orders
ADD CONSTRAINT fk_orders_users
FOREIGN KEY (user_id) REFERENCES users(id)
NOT VALID;
-- Fast, doesn't check existing rows
-- Validate in separate transaction
ALTER TABLE orders
VALIDATE CONSTRAINT fk_orders_users;
-- Slower, but doesn't block writes
Splitting a Table
Gradual data migration:
-- Goal: Split users table into users and profiles
-- Migration 1: Create new table
CREATE TABLE profiles (
user_id BIGINT PRIMARY KEY REFERENCES users(id),
bio TEXT,
avatar_url VARCHAR(255),
created_at TIMESTAMP DEFAULT NOW()
);
-- Migration 2: Backfill data
INSERT INTO profiles (user_id, bio, avatar_url)
SELECT id, bio, avatar_url FROM users
WHERE id NOT IN (SELECT user_id FROM profiles);
-- Deploy code that reads from profiles table
-- Deploy code that writes to profiles table
-- Migration 3: Remove columns from users
ALTER TABLE users
DROP COLUMN bio,
DROP COLUMN avatar_url;
Large-Scale Data Migrations
Batched Updates
Avoid locking tables with batch processing:
-- Bad: Updates all rows at once
UPDATE users SET status = 'active' WHERE status IS NULL;
-- One giant transaction: row locks on every matching row,
-- blocked concurrent writes, table bloat, and a huge burst of WAL
-- Good: Batch updates
DO $$
DECLARE
batch_size INT := 1000;
processed INT := 0;
BEGIN
LOOP
UPDATE users
SET status = 'active'
WHERE id IN (
SELECT id FROM users
WHERE status IS NULL
LIMIT batch_size
);
IF NOT FOUND THEN
EXIT;
END IF;
processed := processed + batch_size;
RAISE NOTICE 'Processed % rows', processed;
-- Sleep to avoid overwhelming database
PERFORM pg_sleep(0.1);
END LOOP;
END $$;
Background Jobs
Move data migration to background workers:
import time

# Migration creates the new column
# (Migration, BackgroundJob, and db come from your app's migration framework)
class AddUserStatusColumn(Migration):
    def up(self):
        self.execute("""
            ALTER TABLE users
            ADD COLUMN status VARCHAR(20) DEFAULT 'active'
        """)

# Separate background job backfills data in batches
class BackfillUserStatus(BackgroundJob):
    def run(self):
        batch_size = 1000
        while True:
            updated = db.execute("""
                UPDATE users
                SET status = CASE
                    WHEN last_login > NOW() - INTERVAL '30 days'
                        THEN 'active'
                    ELSE 'inactive'
                END
                WHERE id IN (
                    SELECT id FROM users
                    WHERE status IS NULL
                    LIMIT %s
                )
                RETURNING id
            """, [batch_size])
            if not updated:
                break
            time.sleep(0.1)  # Throttle
            self.update_progress(len(updated))
Online Schema Change Tools
pt-online-schema-change (Percona Toolkit for MySQL):
pt-online-schema-change \
--alter "ADD COLUMN status VARCHAR(20) DEFAULT 'active'" \
--execute \
D=mydb,t=users
# How it works:
# 1. Creates new table with desired schema
# 2. Copies data in chunks
# 3. Captures ongoing changes via triggers
# 4. Swaps tables atomically
gh-ost (GitHub Online Schema Migrator):
gh-ost \
--user="root" \
--password="secret" \
--host="localhost" \
--database="mydb" \
--table="users" \
--alter="ADD COLUMN status VARCHAR(20) DEFAULT 'active'" \
--execute
# Triggerless approach:
# 1. Creates ghost table
# 2. Copies data
# 3. Tails binary log for changes
# 4. Swaps tables
Testing Migrations
Local Testing
# 1. Restore production snapshot
pg_dump production | psql migration_test
# 2. Run migration
psql migration_test < migration_001.sql
# 3. Time the operation
time psql migration_test < migration_001.sql
# Goal: <5 seconds for zero downtime
# 4. Verify data integrity
psql migration_test -c "
SELECT COUNT(*) FROM users WHERE email IS NULL;
"
# 5. Test rollback
psql migration_test < rollback_001.sql
Staging Environment
# Test with production-like load
# Terminal 1: Run migration
psql staging < migration.sql
# Terminal 2: Simulate traffic
while true; do
psql staging -c "SELECT * FROM users LIMIT 100;"
psql staging -c "INSERT INTO users (email) VALUES ('test@example.com');"
done
# Monitor for locks and errors
psql staging -c "
SELECT pid, state, wait_event_type, query
FROM pg_stat_activity
WHERE state != 'idle';
"
Load Testing
# Simulate production load during migration
import threading
import psycopg2

def simulate_traffic():
    conn = psycopg2.connect("dbname=staging")
    while True:
        cur = conn.cursor()
        cur.execute("SELECT * FROM users ORDER BY RANDOM() LIMIT 10")
        cur.execute("INSERT INTO users (email) VALUES ('test@example.com')")
        conn.commit()

# Start 10 threads of simulated traffic
threads = [threading.Thread(target=simulate_traffic) for _ in range(10)]
for t in threads:
    t.start()

# Run the migration while the traffic is running (in a separate shell):
#   psql staging < migration.sql
# Measure:
# - Query latency increase
# - Lock wait times
# - Error rates
Rollback Strategies
Forward-Only Migrations
Preferred approach—always roll forward:
-- Migration adds column
ALTER TABLE users ADD COLUMN email VARCHAR(255);
-- If there's a problem, don't roll back
-- Instead, deploy a fix:
ALTER TABLE users DROP COLUMN email;
-- This is still "rolling forward"
Reversible Migrations
When rollback is necessary:
class AddEmailColumn(Migration):
    def up(self):
        self.execute("""
            ALTER TABLE users ADD COLUMN email VARCHAR(255)
        """)

    def down(self):
        self.execute("""
            ALTER TABLE users DROP COLUMN email
        """)
Point-in-Time Recovery
Last resort for catastrophic failures:
# PostgreSQL: Restore to point before migration
pg_restore \
--dbname=mydb \
--clean \
--if-exists \
backup_before_migration.dump
# MySQL: Restore from backup
mysql mydb < backup_before_migration.sql
# Warning: Loses all changes after backup
Database-Specific Considerations
PostgreSQL
-- Strengths:
-- - CREATE INDEX CONCURRENTLY
-- - ADD CONSTRAINT ... NOT VALID (validate later)
-- - Transactional DDL
-- Watch out for:
-- - SET NOT NULL scans the whole table under an exclusive lock
-- - Long-running transactions block DDL (and queued DDL blocks everything behind it)
-- - Concurrent index creation can fail and leave an INVALID index behind
-- Best practices:
SET lock_timeout = '2s'; -- Fail fast if can't acquire lock
CREATE INDEX CONCURRENTLY idx_name ON table(column);
ANALYZE table; -- Update statistics after migration
MySQL
-- Strengths:
-- - ALGORITHM=INPLACE for many operations
-- - Online DDL (5.6+)
-- Watch out for:
-- - No transactional DDL
-- - Some operations still lock table
-- - pt-online-schema-change often needed
-- Best practices:
ALTER TABLE users
ADD COLUMN status VARCHAR(20),
ALGORITHM=INPLACE, -- Avoid table copy
LOCK=NONE; -- Allow reads/writes
-- Check whether an operation can run online: request it explicitly;
-- MySQL rejects the ALTER if the requested ALGORITHM/LOCK isn't supported
MongoDB
// Schema-less, but still need migrations
// Add field to all documents
db.users.updateMany(
{ status: { $exists: false } },
{ $set: { status: 'active' } }
);
// Rename field
db.users.updateMany(
{},
{ $rename: { 'email': 'email_address' } }
);
// Best practice: batch updates (updateMany has no "limit" option,
// so select a batch of _ids first, then update them)
const batchSize = 1000;
while (true) {
  const ids = db.users
    .find({ status: { $exists: false } }, { _id: 1 })
    .limit(batchSize)
    .toArray()
    .map(doc => doc._id);
  if (ids.length === 0) break;
  db.users.updateMany(
    { _id: { $in: ids } },
    { $set: { status: 'active' } }
  );
  sleep(100); // Throttle (mongosh helper)
}
Migration Tools
Flyway
-- V1__create_users_table.sql
CREATE TABLE users (
id BIGSERIAL PRIMARY KEY,
email VARCHAR(255) NOT NULL
);
-- V2__add_status_column.sql
ALTER TABLE users ADD COLUMN status VARCHAR(20) DEFAULT 'active';
# Run migrations
flyway migrate
# Verify
flyway info
Liquibase
<!-- changeset-001.xml -->
<databaseChangeLog>
<changeSet id="1" author="developer">
<createTable tableName="users">
<column name="id" type="BIGINT" autoIncrement="true">
<constraints primaryKey="true"/>
</column>
<column name="email" type="VARCHAR(255)">
<constraints nullable="false"/>
</column>
</createTable>
<rollback>
<dropTable tableName="users"/>
</rollback>
</changeSet>
</databaseChangeLog>
Alembic (Python)
# versions/001_add_status.py
def upgrade():
op.add_column(
'users',
sa.Column('status', sa.String(20), server_default='active')
)
def downgrade():
op.drop_column('users', 'status')
# Apply migrations
alembic upgrade head
# Rollback
alembic downgrade -1
Best Practices Checklist
□ Test migration on production snapshot
□ Time the operation (<5s for zero downtime)
□ Ensure backward compatibility
□ Use expand-migrate-contract for destructive changes
□ Add indexes concurrently
□ Use NOT VALID for constraints, validate separately
□ Batch large data updates
□ Set lock_timeout to fail fast
□ Have rollback plan
□ Monitor database locks during migration
□ Schedule during low-traffic period (if possible)
□ Alert team before migration
□ Monitor error rates after deployment
□ Keep old schema elements until new code fully deployed
□ Document migration in changelog
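Several of these items (lock_timeout, timing, failing fast) are easy to bake into whatever runs your migrations. Here is a minimal sketch, assuming psycopg2, a hypothetical DSN, and an example DDL statement, of a runner that refuses to queue behind long transactions:
import psycopg2

MIGRATION_SQL = "ALTER TABLE users ADD COLUMN phone VARCHAR(20)"  # example DDL

def run_migration(dsn: str, sql: str) -> None:
    conn = psycopg2.connect(dsn)
    try:
        with conn, conn.cursor() as cur:
            cur.execute("SET lock_timeout = '2s'")        # fail fast if the lock isn't free
            cur.execute("SET statement_timeout = '30s'")  # abort anything unexpectedly slow
            cur.execute(sql)                              # committed on clean exit from "with conn"
    finally:
        conn.close()

if __name__ == "__main__":
    run_migration("dbname=mydb", MIGRATION_SQL)
If the lock or statement timeout fires, the transaction rolls back and you can retry later instead of blocking production traffic.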
Monitoring During Migrations
-- PostgreSQL: Monitor locks
SELECT
pid,
state,
wait_event_type,
query_start,
NOW() - query_start AS duration,
query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;
-- Check table bloat
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE tablename = 'users';
# Monitor query performance
mysql> SHOW PROCESSLIST;
mysql> SHOW ENGINE INNODB STATUS\G
# Check replication lag (SHOW REPLICA STATUS on MySQL 8.0.22+)
mysql> SHOW SLAVE STATUS\G
Conclusion
Zero-downtime database migrations are achievable with careful planning and the right techniques:
Key Principles:
- Maintain backward compatibility
- Use expand-migrate-contract pattern
- Make small, incremental changes
- Test on production-like data
- Monitor during migrations
Common Patterns:
- Add nullable columns, backfill, add constraints
- Create concurrent indexes
- Use NOT VALID constraints
- Batch large updates
- Deploy code changes separately from schema changes
Remember: The safest migration is the one that's been tested thoroughly on staging with production load. Never skip testing, even for "simple" changes.
Need help with database migrations? InstaDevOps provides expert consulting for database architecture, migration strategies, and performance optimization. Contact us for a free consultation.
Need Help with Your DevOps Infrastructure?
At InstaDevOps, we specialize in helping startups and scale-ups build production-ready infrastructure without the overhead of a full-time DevOps team.
Our Services:
- 🏗️ AWS Consulting - Cloud architecture, cost optimization, and migration
- ☸️ Kubernetes Management - Production-ready clusters and orchestration
- 🚀 CI/CD Pipelines - Automated deployment pipelines that just work
- 📊 Monitoring & Observability - See what's happening in your infrastructure
Special Offer: Get a free DevOps audit - 50+ point checklist covering security, performance, and cost optimization.
📅 Book a Free 15-Min Consultation
Originally published at instadevops.com