DEV Community

Cover image for Zero-Downtime Database Migrations
Samson Tanimawo
Samson Tanimawo

Posted on

Zero-Downtime Database Migrations

Database migrations without downtime are a superpower. Here's the playbook that's survived dozens of schema changes in production.

The core idea

Never change schema and code at the same time. Split every migration into phases where the old code and new code can both run against the same schema.

Adding a column

Phase 1: Add the column as nullable. Deploy.
Phase 2: Deploy code that writes the new column. Reads still fall back if the column is null.
Phase 3: Backfill the column for existing rows.
Phase 4: Deploy code that assumes the column is populated. Remove the fallback.
Phase 5: Make the column NOT NULL if needed.

Yes, it's 5 deploys. Yes, it's worth it. Zero downtime.

Renaming a column

You can't just rename. You have to:

Phase 1: Add the new column.
Phase 2: Write to both columns (old and new).
Phase 3: Backfill the new column from the old.
Phase 4: Switch reads to the new column.
Phase 5: Stop writing to the old column.
Phase 6: Drop the old column.

6 deploys. Yes. Unless you can take downtime, there's no shortcut.

The dangerous ones

Dropping a column. Always more dangerous than it looks. Services you didn't know about might still reference it. Grep the entire codebase. Check query logs for references.

Changing a column type. Usually means creating a new column, migrating data, switching reads, dropping the old. Do not issue ALTER COLUMN TYPE on a large table during traffic hours.

Adding indexes. CREATE INDEX CONCURRENTLY in Postgres. Without it, the table is locked. I've taken down production this way. Don't.

The backfill problem

Backfilling millions of rows with one big UPDATE will lock your table. Batch it. Update 1,000 rows at a time, sleep 100ms between batches, monitor replication lag.

A backfill that runs for 8 hours is fine. A backfill that holds a lock for 30 seconds is an outage.

The rollback plan

Write down, before starting, what the rollback looks like for each phase. For most phases it's 'revert the code deploy.' For the drop phase, rollback is harder — if you've dropped the column, you'll need to restore from backup.

The safest rule: never drop anything in the same week you stop using it. Leave the column for 2 weeks after the last read. You'll thank me when you need to rollback.


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com

Top comments (0)