Every backend engineer has a migration horror story. Maybe it was the ALTER TABLE that locked a 200-million-row table for 47 minutes. Maybe it was the column rename that brought down every API server simultaneously. Or maybe it was the "quick fix" migration that corrupted data in a way that took three days to untangle.
Database migrations are the most dangerous routine operation in backend engineering. Your code deploys are (hopefully) stateless and reversible. Your database changes are neither. They mutate persistent state that your entire system depends on, and getting them wrong can mean downtime, data loss, or both.
This article covers the patterns, tooling, and discipline required to run migrations in production without breaking things.
Why Migrations Are Scary
The fundamental problem is simple: schema changes and code changes don't deploy atomically. There is always a window where your running application code and your database schema are out of sync. During a typical rolling deployment:
1. Migration runs, changing the schema
2. Old application code is still running against the new schema
3. New application code starts rolling out
4. Both old and new code run simultaneously against the new schema
5. Old code finishes draining
Steps 2-4 are where things break. If your migration renames a column, old code looking for the old name throws errors. If it drops a column, same problem. If it adds a NOT NULL column without a default, old code inserting rows fails.
The second problem is locking. Most ALTER TABLE operations in traditional databases acquire locks that block reads, writes, or both. On a table with millions of rows, a lock held for even a few seconds can cascade into connection pool exhaustion, request timeouts, and a full outage.
The third problem is irreversibility. You can roll back a code deploy in seconds. Rolling back a migration that dropped a column means restoring from backup — if you even have a recent one that's consistent.
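The locking problem has a practical mitigation: set a lock_timeout before running DDL so the ALTER gives up quickly instead of queueing (and blocking every query queued behind it), then retry. Here is a minimal sketch, assuming a DB-API-style connection (psycopg2 or similar); `alter_with_lock_timeout` is a hypothetical helper, not part of any migration tool:

```python
import time

def alter_with_lock_timeout(conn, ddl, timeout_ms=2000, retries=5,
                            initial_delay=1.0, backoff=2.0):
    """Run a DDL statement with a lock_timeout so it fails fast instead of
    queueing behind long transactions, then retry with exponential backoff.
    Returns the attempt number that succeeded."""
    delay = initial_delay
    for attempt in range(1, retries + 1):
        cur = conn.cursor()
        try:
            # Give up if the lock is not acquired within timeout_ms
            cur.execute(f"SET lock_timeout = '{timeout_ms}ms'")
            cur.execute(ddl)
            conn.commit()
            return attempt
        except Exception:
            conn.rollback()
            if attempt == retries:
                raise
            time.sleep(delay)
            delay *= backoff
        finally:
            cur.close()
```

When the timeout fires, the ALTER fails cleanly and the queries stacked behind it proceed; the retry gets a fresh chance at a quieter moment.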
Zero-Downtime Migration: The Core Principle
The rule is straightforward: at every point during the migration and deployment process, all running code must be compatible with the current database schema.
This means:
- Never rename a column in a single step
- Never drop a column that running code still references
- Never add a NOT NULL constraint without a default
- Never assume the migration and code deploy happen simultaneously
Every migration must be backward compatible with the currently deployed code and forward compatible with the code about to be deployed.
The Expand-Contract Pattern
This is the most important pattern in zero-downtime migrations. Instead of making a breaking change in one step, you split it into three phases:
Phase 1: Expand
Add the new structure alongside the old one. Both old and new code work.
-- Migration: Add new column (nullable, so old code can still INSERT)
ALTER TABLE orders ADD COLUMN status_v2 VARCHAR(50);
Phase 2: Migrate
Deploy code that writes to both old and new structures. Backfill existing data.
-- Backfill in batches (more on this later)
UPDATE orders SET status_v2 = status WHERE status_v2 IS NULL
AND id BETWEEN $start AND $end;
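The Phase 2 write path can be sketched like this — a hypothetical `update_order_status` helper, assuming a DB-API connection, that writes both columns together so readers of either column see consistent data:

```python
def update_order_status(conn, order_id, new_status):
    """Phase 2 dual-write: keep the old and new columns in sync so code
    reading either column sees consistent data during the migration."""
    cur = conn.cursor()
    cur.execute(
        "UPDATE orders SET status = %s, status_v2 = %s WHERE id = %s",
        (new_status, new_status, order_id),
    )
    conn.commit()
```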
Phase 3: Contract
Once all code uses the new structure and all data is migrated, remove the old one.
-- Only after all application code stops reading `status`
ALTER TABLE orders DROP COLUMN status;
ALTER TABLE orders RENAME COLUMN status_v2 TO status;
Each phase is a separate migration tied to a separate code deploy. Phase 1 goes out with or before the code that starts using the new column. Phase 3 goes out only after you've verified all services have been updated and the old column is truly unused.
Real-world example: renaming a column
Renaming user_name to display_name on a users table with 10M rows:
Deploy 1: Migration adds `display_name` column
Deploy 2: Code writes to both columns, reads from `display_name` with fallback to `user_name`
Deploy 3: Backfill script copies remaining data
Deploy 4: Code reads only from `display_name`
Deploy 5: Migration drops `user_name`
Yes, that's five deploys for a column rename. That's the cost of zero downtime. In practice, deploys 1-2 often ship together, and 4-5 ship together, so it's usually three deploy cycles.
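The Deploy 2 read path is often just a COALESCE in SQL, so rows the backfill hasn't reached transparently fall back to the old column. A sketch with hypothetical names, assuming a DB-API connection:

```python
# Deploy 2 read path: prefer the new column, fall back to the old one
# for rows the backfill has not reached yet.
SELECT_DISPLAY_NAME = """
    SELECT id, COALESCE(display_name, user_name) AS display_name
    FROM users
    WHERE id = %s
"""

def get_display_name(conn, user_id):
    cur = conn.cursor()
    cur.execute(SELECT_DISPLAY_NAME, (user_id,))
    row = cur.fetchone()
    return row[1] if row else None
```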
Handling Large Table Alterations
On PostgreSQL, many ALTER TABLE operations are fast because they only update catalog metadata:
-- These are ~instant in PostgreSQL, regardless of table size
ALTER TABLE orders ADD COLUMN notes TEXT;
ALTER TABLE orders ALTER COLUMN notes SET DEFAULT '';
ALTER TABLE orders DROP COLUMN old_field;
But some operations are not:
-- These rewrite the entire table or scan all rows
ALTER TABLE orders ALTER COLUMN amount TYPE BIGINT; -- full rewrite
ALTER TABLE orders ADD COLUMN verified BOOLEAN NOT NULL DEFAULT true;
-- (instant in PG 11+ but rewrites in older versions)
CREATE INDEX ON orders (customer_id); -- full table scan
For type changes on large tables, use the expand-contract pattern: add a new column with the desired type, backfill, switch reads, drop the old column.
For index creation, always use CONCURRENTLY:
-- Blocks writes:
CREATE INDEX idx_orders_customer ON orders (customer_id);
-- Does NOT block writes (takes longer, but safe):
CREATE INDEX CONCURRENTLY idx_orders_customer ON orders (customer_id);
Important caveat: CREATE INDEX CONCURRENTLY cannot run inside a transaction. Most migration tools wrap each migration in a transaction by default. You need to either disable that behavior for this specific migration or use a tool that handles it.
In golang-migrate, there is no per-migration directive for this; the practical workaround is to put the statement in a migration file by itself, since a file containing multiple statements is executed as a single implicit transaction:
-- 000003_add_customer_index.up.sql (nothing else in this file)
CREATE INDEX CONCURRENTLY idx_orders_customer ON orders (customer_id);
In Flyway, you would set executeInTransaction=false on the migration.
Advisory Locks and Migration Safety
When running migrations in a horizontally scaled environment, you need to ensure only one instance runs migrations at a time. Most tools handle this with advisory locks:
-- golang-migrate uses PostgreSQL advisory locks
SELECT pg_advisory_lock(12345);
-- ... run migrations ...
SELECT pg_advisory_unlock(12345);
This prevents two pods starting simultaneously from both trying to run the same migration and corrupting the schema history.
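If your tool doesn't do this for you, the same guard is easy to add around a hand-rolled migration runner. A sketch, assuming a DB-API connection; `run_migrations_exclusively` and the key derivation are illustrative, not a standard API:

```python
import zlib

# Derive a stable bigint key from a name instead of hard-coding a magic number
MIGRATION_LOCK_KEY = zlib.crc32(b"schema_migrations")

def run_migrations_exclusively(conn, run_migrations):
    """Hold a session-level advisory lock while migrating; a second
    instance calling this blocks until the first one finishes."""
    cur = conn.cursor()
    cur.execute("SELECT pg_advisory_lock(%s)", (MIGRATION_LOCK_KEY,))
    try:
        run_migrations(conn)
    finally:
        cur.execute("SELECT pg_advisory_unlock(%s)", (MIGRATION_LOCK_KEY,))
```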
Migration Tooling
golang-migrate
Straightforward, language-agnostic, uses plain SQL files:
migrations/
000001_create_users.up.sql
000001_create_users.down.sql
000002_add_orders.up.sql
000002_add_orders.down.sql
Run from CLI or embedded in your Go application:
import (
    "log"

    "github.com/golang-migrate/migrate/v4"
    _ "github.com/golang-migrate/migrate/v4/database/postgres" // registers the postgres driver
    _ "github.com/golang-migrate/migrate/v4/source/file"       // registers the file source
)

m, err := migrate.New(
    "file://migrations",
    "postgres://localhost:5432/mydb?sslmode=disable",
)
if err != nil {
    log.Fatal(err)
}
if err := m.Up(); err != nil && err != migrate.ErrNoChange {
    log.Fatal(err)
}
Strengths: simple, no DSL to learn, works with any language's deployment pipeline. Weaknesses: no built-in support for non-transactional migrations, limited state tracking.
Flyway
JVM-based, more opinionated, supports versioned and repeatable migrations:
-- V1__Create_users.sql
CREATE TABLE users (
id BIGSERIAL PRIMARY KEY,
email VARCHAR(255) NOT NULL UNIQUE,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- V2__Add_orders.sql
CREATE TABLE orders (
id BIGSERIAL PRIMARY KEY,
user_id BIGINT NOT NULL REFERENCES users(id),
total_cents BIGINT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
Flyway tracks state in a flyway_schema_history table and supports callbacks, placeholders, and Java-based migrations for complex logic. It's the standard in Java/Kotlin ecosystems.
Other Notable Tools
- Alembic (Python/SQLAlchemy): generates migrations from model diffs, good for Python shops
- Sqitch: dependency-based rather than version-ordered, powerful but steeper learning curve
- Atlas (by Ariga): declarative schema management, computes diffs automatically, gaining traction in Go ecosystems
- pg_partman / pgloader: for partition management and bulk data loading, respectively
Choosing a Tool
Pick based on your ecosystem and complexity:
| Need | Tool |
|---|---|
| Simple SQL migrations, any language | golang-migrate |
| Java/Kotlin ecosystem | Flyway |
| Python/SQLAlchemy | Alembic |
| Declarative schema-as-code | Atlas |
| Complex dependency graphs | Sqitch |
Data Backfills Done Right
Backfilling data across millions of rows is where most migration disasters happen. The naive approach:
-- DO NOT DO THIS
UPDATE orders SET status_v2 = compute_new_status(status);
This acquires a lock on every row, generates enormous WAL (write-ahead log) volume, and can take hours while blocking other writes.
Instead, backfill in batches with throttling:
import time
import psycopg2

BATCH_SIZE = 5000
SLEEP_SECONDS = 0.1  # throttle to reduce replication lag

conn = psycopg2.connect(dsn)
conn.autocommit = True
cursor = conn.cursor()

cursor.execute("SELECT MIN(id), MAX(id) FROM orders")
min_id, max_id = cursor.fetchone()
if min_id is None:
    raise SystemExit("orders table is empty, nothing to backfill")

current = min_id
while current <= max_id:
    cursor.execute("""
        UPDATE orders
        SET status_v2 = compute_new_status(status)
        WHERE id >= %s AND id < %s
        AND status_v2 IS NULL
    """, (current, current + BATCH_SIZE))
    updated = cursor.rowcount
    print(f"Batch {current}-{current + BATCH_SIZE}: updated {updated} rows")
    current += BATCH_SIZE
    time.sleep(SLEEP_SECONDS)
Key practices:
- Batch by primary key range, not LIMIT/OFFSET (which gets slower as the offset grows)
- Sleep between batches to let replicas catch up and avoid saturating I/O
- Make it idempotent (WHERE status_v2 IS NULL) so you can restart safely if it fails midway
- Monitor replication lag during the backfill and pause if it exceeds your threshold
- Run during low-traffic windows when possible, even if it's technically safe at peak
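Pausing on replication lag can be a small helper called between batches. A sketch against pg_stat_replication (the replay_lag column exists on PostgreSQL 10+); the function names are illustrative:

```python
import time

def replication_lag_seconds(conn):
    """Worst-case replica lag in seconds, measured on the primary."""
    cur = conn.cursor()
    cur.execute("""
        SELECT COALESCE(MAX(EXTRACT(EPOCH FROM replay_lag)), 0)
        FROM pg_stat_replication
    """)
    return float(cur.fetchone()[0])

def wait_for_replicas(conn, max_lag=5.0, poll=2.0):
    """Pause the backfill loop until all replicas are within max_lag."""
    while replication_lag_seconds(conn) > max_lag:
        time.sleep(poll)
```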
For truly massive tables (billions of rows), consider doing the backfill at the application level: update the new column whenever a row is naturally read or written, and run the batch backfill for the long tail of untouched rows.
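The read-side variant of that idea, sketched with hypothetical names — whenever a row is touched and the new column is still empty, migrate it on the spot:

```python
def get_order(conn, order_id):
    """Read path that opportunistically backfills the new column for
    rows the batch job has not reached yet."""
    cur = conn.cursor()
    cur.execute("SELECT id, status, status_v2 FROM orders WHERE id = %s",
                (order_id,))
    row = cur.fetchone()
    if row and row[2] is None:
        # Lazily migrate this row now that we have touched it anyway
        cur.execute("UPDATE orders SET status_v2 = status WHERE id = %s",
                    (order_id,))
        conn.commit()
    return row
```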
Blue-Green Database Deployments
Blue-green deployments for databases are harder than for stateless application servers, but the pattern exists and works well for certain scenarios.
The idea: maintain two database schemas (or databases) — blue (current) and green (next). Migrate green, point traffic to it, keep blue as a rollback target.
Schema-Level Blue-Green
-- Blue schema (current)
CREATE SCHEMA blue;
-- ... tables in blue schema
-- Green schema (next version)
CREATE SCHEMA green;
-- ... migrated tables in green schema
-- Application config points to schema
SET search_path = 'blue'; -- current
SET search_path = 'green'; -- after cutover
This works when your migration is a large structural change that's hard to do incrementally. You build the new schema, backfill it from the old one, then switch the application's search_path.
Limitations
True blue-green for databases requires either:
- A brief write-freeze during cutover (seconds, not minutes)
- Dual-write during the transition period
- Logical replication between old and new schemas
For most teams, the expand-contract pattern is more practical than blue-green for databases. Reserve blue-green for major version upgrades (e.g., PostgreSQL 14 to 16) where you set up logical replication between the old and new clusters.
Rollback Strategies
Down Migrations
Every migration tool supports down migrations. Write them:
-- 000005_add_verified_column.up.sql
ALTER TABLE users ADD COLUMN verified BOOLEAN DEFAULT false;
-- 000005_add_verified_column.down.sql
ALTER TABLE users DROP COLUMN verified;
But understand their limits. A down migration that drops a column destroys data. If you added the column, backfilled it over three days, and then need to roll back, that data is gone.
Forward-Only Rollbacks
A safer pattern: instead of rolling back the migration, roll forward with a new migration that undoes the change:
000005_add_verified_column.up.sql -- adds column
000006_remove_verified_column.up.sql -- removes it (if needed)
This keeps the migration history linear and auditable. Many teams mandate forward-only migrations in production while keeping down migrations for local development.
Point-In-Time Recovery
For catastrophic failures, you need PITR:
# PostgreSQL continuous archiving
archive_mode = on
archive_command = 'cp %p /archive/%f'
# Restore to a specific timestamp
recovery_target_time = '2025-03-15 14:30:00 UTC'
Test your PITR process regularly. The worst time to discover your backups don't work is during an incident.
The Migration Rollback Matrix
| Scenario | Strategy |
|---|---|
| Added a column, code not yet deployed | Down migration (safe, no data loss) |
| Changed column type, data transformed | Forward migration to revert; may lose precision |
| Dropped a column | PITR or restore from backup |
| Added an index | Drop it (fast, safe) |
| Data backfill went wrong | Forward migration to fix; depends on whether original data is preserved |
Testing Migrations
Against Production-Like Data
Your test database with 50 rows will not reveal the problems that show up with 50 million rows. Test against a copy of production data:
# Snapshot production (use your cloud provider's snapshot feature)
# Restore to a test instance
# Run the migration
# Measure: time, locks held, WAL generated, replication lag
Schema Diffing
After running migrations on a staging environment, diff the resulting schema against what you expect:
# Using pg_dump to compare schemas
pg_dump --schema-only production_db > prod_schema.sql
pg_dump --schema-only staging_db > staging_schema.sql
diff prod_schema.sql staging_schema.sql
Atlas has this built in:
atlas schema diff \
--from "postgres://localhost/production" \
--to "postgres://localhost/staging"
CI Pipeline Integration
Run migrations in CI against a fresh database and a database with the previous version's schema:
# .github/workflows/migrations.yml
jobs:
  migration-test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_DB: test
          POSTGRES_PASSWORD: test
        ports:
          - 5432:5432
    env:
      DB_URL: postgres://postgres:test@localhost:5432/test?sslmode=disable
    steps:
      - uses: actions/checkout@v4
      - name: Run all migrations from scratch
        run: migrate -path ./migrations -database "$DB_URL" up
      - name: Verify schema matches expectations
        run: ./scripts/verify-schema.sh
      - name: Test rollback of latest migration
        run: migrate -path ./migrations -database "$DB_URL" down 1
      - name: Re-apply latest migration
        run: migrate -path ./migrations -database "$DB_URL" up
Statement-Level Analysis
Before running a migration in production, analyze what it will actually do:
-- Check if an ALTER TABLE will rewrite the table
-- (PostgreSQL-specific: check pg_catalog after the change)
-- Estimate lock duration using pg_stat_activity during staging test
SELECT pid, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
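One way to make that staging measurement concrete is to sample pg_stat_activity in a loop while the migration runs. A sketch with an illustrative helper name, assuming a DB-API connection:

```python
import time

def sample_lock_waits(conn, duration=30.0, interval=1.0):
    """Poll pg_stat_activity while the migration runs on staging, so you
    can see which sessions blocked on locks and for roughly how long."""
    samples = []
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        cur = conn.cursor()
        cur.execute("""
            SELECT pid, wait_event_type, wait_event, query
            FROM pg_stat_activity
            WHERE wait_event_type = 'Lock'
        """)
        samples.append(cur.fetchall())
        time.sleep(interval)
    return samples
```

Run it in a second terminal while the migration executes; a long run of non-empty samples tells you roughly how long the DDL blocked other queries.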
A Migration Checklist
Before every production migration:
- Is it backward compatible? Can the currently deployed code work with the new schema?
- Is it forward compatible? Can the about-to-be-deployed code work with the old schema (in case you need to roll back the code deploy)?
- Does it hold locks? If so, for how long? On how many rows?
- Has it been tested against production-sized data?
- Is there a rollback plan? Down migration, forward fix, or PITR?
- Is the backfill batched and idempotent?
- Is the migration wrapped in a transaction? Should it be? (CREATE INDEX CONCURRENTLY cannot be.)
- Has someone else reviewed it? Migration PRs deserve as much scrutiny as application code.
Summary
Database migrations will always carry risk. The goal is not to eliminate risk but to reduce it to a level where deploying a migration is routine rather than terrifying. The core practices:
- Use expand-contract for every breaking schema change
- Batch and throttle data backfills
- Test against production-scale data
- Write rollback plans before you need them
- Use CONCURRENTLY for index operations
- Keep migrations small, frequent, and reviewable
The teams that migrate databases confidently are not the ones with the best tools. They are the ones with the most disciplined processes.
Building resilient backend systems? This is article #17 in the Production Backend Patterns series. Follow for the next one.