How I Took Down Prod With a 400ms Migration (And The Playbook I Use Now)
The 3 AM ALTER TABLE That Ruined My Weekend
It was a regular Tuesday deploy. The Jira ticket was straightforward: add a foreign key constraint linking orders to customers. The table had 50 million rows, but I had tested the migration against a staging dump, and it took exactly 400 milliseconds.
"It works locally and on staging. Ship it," I confidently told my team lead.
What I completely missed was the underlying database locking mechanism. The ALTER TABLE needed an ACCESS EXCLUSIVE lock on both tables — orders and customers. But at that exact moment in production, a heavy business analytics dashboard was running a long SELECT query, holding an ACCESS SHARE lock on customers.
So, my simple 400ms migration just sat there waiting in the queue for its turn.
And then the disaster happened. The moment my ACCESS EXCLUSIVE lock request entered the queue, every single subsequent query—every SELECT, INSERT, and UPDATE against customers—started queuing up behind my migration.
Within 90 seconds, our Postgres connection pool was completely exhausted. The authentication service, which frequently checked user records on customers, started throwing 503s. Fifteen seconds of invisible lock contention turned into 20 minutes of cascading API downtime. PagerDuty lost its mind.
My SQL was perfectly valid. The schema change was correct. And yet, I brought down the entire system.
I learned this the hard way, and this is what I drill into my team today: schema migrations are the most dangerous operations you will perform as a backend engineer. Not because of "bad SQL", but because lock queues are invisible until they blow up in your face.
Why Schema Migrations Are Dangerous (The Real Mechanics)
Most engineers think the risk of a migration is writing invalid syntax. It almost never is. The real dangers are purely mechanical:
DDL locks are nuclear. In PostgreSQL, most ALTER TABLE commands demand an ACCESS EXCLUSIVE lock. This lock conflicts with every other lock type, even the basic locks held by simple SELECT statements. While your migration waits, the table is untouchable.
The lock queue is the real killer. Even if your DDL would run instantly once it gets the lock, the waiting is what bankrupts your uptime. Your database goes from "responding normally" to "completely blocked" in a heartbeat because all read/write traffic forms a traffic jam behind your migration's lock request.
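To make the head-of-line blocking concrete, here is a toy Python model of a FIFO lock queue. This is my own illustration, not Postgres's actual lock manager: it reduces everything to two modes, "share" (a plain SELECT) and "exclusive" (most ALTER TABLE forms), and shows how a single exclusive waiter stalls every request that arrives after it — even requests that would be perfectly compatible with the current lock holders.

```python
def simulate(held, requests):
    """Toy FIFO lock queue: return 'granted'/'waiting' for each request.

    held: lock modes already granted on the table.
    requests: lock modes arriving in order.
    A request is granted only if it conflicts with nothing already held
    AND no earlier request is still waiting (queue-order fairness).
    """
    def conflicts(a, b):
        # 'exclusive' conflicts with everything, 'share' only with 'exclusive'
        return a == "exclusive" or b == "exclusive"

    granted = list(held)
    waiting = []
    results = []
    for mode in requests:
        if not waiting and all(not conflicts(mode, g) for g in granted):
            granted.append(mode)
            results.append("granted")
        else:
            waiting.append(mode)
            results.append("waiting")
    return results
```

With one long SELECT already holding a share lock, `simulate(["share"], ["exclusive", "share", "share"])` returns all "waiting": the migration queues behind the SELECT, and every later read queues behind the migration. That is the traffic jam.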
Replication lag adds insult to injury. In MySQL, an online DDL operation finishes on the primary first. The replicas then apply the DDL sequentially. Any normal data writes (DML) that happened concurrently on the primary only get applied to the replica after the DDL finishes. On a billion-row table, your reads will suddenly be serving stale data for minutes.
Old code vs. New code. During a rolling deployment, some of your app instances are running the old code, and some the new. If your migration drops a column that old code still tries to read, you'll get 500 server errors from the instances that haven't been swapped out yet.
My 4 Difficulty Tiers for Migrations
Not all migrations are created equal. Over the years, I've divided them into four tiers, and the tier dictates my strategy.
Tier 1: Safe (Execute Anytime)
These operations acquire brief metadata locks and don't rewrite the table data. You can run them during peak system traffic and sleep soundly.
Add a nullable column:
ALTER TABLE orders ADD COLUMN discount_code text;
This is metadata-only in Postgres (and MySQL 8.0+). No table rewrite, no data scan. The lock is held for microseconds.
Create a brand new table:
No existing data, no old queries to block. Go for it.
Add an index concurrently (Postgres):
CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);
Without CONCURRENTLY, Postgres locks the table against all writes for the entire duration of the index build. On 100 million rows, that's minutes of downtime. With CONCURRENTLY, Postgres builds the index in the background, allowing normal reads and writes to continue.
Pro Tip: CREATE INDEX CONCURRENTLY cannot run inside a transaction block, so your ORM/migration tool needs to run it standalone. And if the build fails partway through, it leaves behind an INVALID index that you must drop before retrying.
Add an index in MySQL (online):
ALTER TABLE orders ADD INDEX idx_customer_id (customer_id), ALGORITHM=INPLACE, LOCK=NONE;
Tier 2: Careful (Patience is Key)
These operations are safe only if you follow a strict sequence. Try a shortcut, and you'll be firefighting.
Add a NOT NULL column with a default:
Never do this in one statement on a large table (especially Postgres < 11):
-- DANGEROUS: Rewrites the entire table under an exclusive lock
ALTER TABLE users ADD COLUMN status text DEFAULT 'active' NOT NULL;
The proper, safe multi-step way that works everywhere:
-- Step 1: Add nullable column (almost instant)
ALTER TABLE users ADD COLUMN status text;
-- Step 2: Set default for all NEW rows moving forward
ALTER TABLE users ALTER COLUMN status SET DEFAULT 'active';
-- Step 3: Backfill old rows in small chunks from a background script
-- UPDATE users SET status = 'active' WHERE id BETWEEN 1 AND 10000 AND status IS NULL;
-- ... sleep for a bit, repeat chunk by chunk
-- Step 4: Add NOT NULL constraint safely
ALTER TABLE users ADD CONSTRAINT users_status_not_null CHECK (status IS NOT NULL) NOT VALID;
ALTER TABLE users VALIDATE CONSTRAINT users_status_not_null;
-- Postgres 12+ uses the validated CHECK as proof the column is non-null,
-- so SET NOT NULL skips the full-table scan
ALTER TABLE users ALTER COLUMN status SET NOT NULL;
ALTER TABLE users DROP CONSTRAINT users_status_not_null;
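The Step 3 backfill worker can be a small script. Here is a minimal sketch: the table, column, and batch size come from the example above, while the connection handling is hypothetical (psycopg-style) and commented out because it needs a live database.

```python
def chunk_ranges(min_id, max_id, size):
    """Yield inclusive (lo, hi) id ranges covering [min_id, max_id]."""
    lo = min_id
    while lo <= max_id:
        yield lo, min(lo + size - 1, max_id)
        lo += size

# Hypothetical driver loop (requires a live connection, e.g. psycopg):
#
# import time
# def backfill(conn):
#     for lo, hi in chunk_ranges(1, 50_000_000, 10_000):
#         with conn.cursor() as cur:
#             cur.execute(
#                 "UPDATE users SET status = 'active' "
#                 "WHERE id BETWEEN %s AND %s AND status IS NULL",
#                 (lo, hi),
#             )
#         conn.commit()   # one short transaction per chunk keeps locks brief
#         time.sleep(0.1) # let replicas and autovacuum catch up
```

The key design choice is one commit per chunk: each UPDATE holds row locks for milliseconds instead of holding everything for the duration of a 50-million-row transaction.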
Drop an unused column:
Don't just DROP COLUMN. First, verify the column is truly unused via your observability tools (Datadog/New Relic). Then mark it as "deferred" in your ORM (like SQLAlchemy's deferred) so the app stops selecting it, and keep it that way for a full deploy cycle.
Only then, drop it. Even then, use a timeout because dropping a column still needs an exclusive lock to update the metadata:
SET lock_timeout = '5s';
ALTER TABLE users DROP COLUMN legacy_avatar_url;
RESET lock_timeout;
Add a foreign key constraint:
This is the exact command that took down my production environment.
Safe approach:
-- Step 1: Add without validation (fast metadata lock)
ALTER TABLE orders ADD CONSTRAINT orders_customer_fk FOREIGN KEY (customer_id) REFERENCES customers (id) NOT VALID;
-- Step 2: Validate separately (uses a weaker SHARE UPDATE EXCLUSIVE lock)
ALTER TABLE orders VALIDATE CONSTRAINT orders_customer_fk;
Tier 3: Complex (The Expand-Contract Pattern)
These require multiple, separate deployments. You simply cannot do them in a single migration file.
Rename a column:
If you run ALTER TABLE RENAME COLUMN, the old app pods checking the old name will instantly crash. You need four deploys:
- Deploy 1 (Expand): Add the new column (display_name).
- Deploy 2 (Dual Write): App writes to BOTH old and new columns. Run a script to backfill existing rows.
- Deploy 3 (Cut Over): App stops writing to the old column and completely switches to reading/writing the new column.
- Deploy 4 (Contract): Drop the old column.
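In application code, the four deploys usually differ only in a phase flag. A sketch of that idea — the name/display_name columns match the rename example, but the phase table and helper are hypothetical, and in practice the phase would come from config so each deploy just flips a setting:

```python
# Which columns the app writes and reads in each deploy phase of a
# name -> display_name rename.
PHASES = {
    "expand":     {"write": ["name"],                 "read": "name"},
    "dual_write": {"write": ["name", "display_name"], "read": "name"},
    "cut_over":   {"write": ["display_name"],         "read": "display_name"},
}

def build_update(phase, value):
    """Return the column/value mapping the ORM should persist in this phase."""
    return {col: value for col in PHASES[phase]["write"]}
```

During dual write, `build_update("dual_write", "Ada")` yields both columns, so instances still reading the old column and instances already on the new one each see consistent data.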
Change a column type:
If you change a column from text to jsonb, Postgres rewrites the entire table under an exclusive lock. On a 10GB table that can easily mean 30 minutes of blocking. You must use the Expand-Contract pattern here too: add new column -> dual-write -> backfill -> cut over -> drop old column.
Here is what the Expand-Contract pattern architecture looks like:
sequenceDiagram
participant DB as Database
participant App as Application
participant BG as Background Job
Note over DB,App: Phase 1: Expand
App->>DB: ALTER TABLE ADD new_column
Note over DB: Old + new columns exist
Note over DB,App: Phase 2: Dual Write
App->>DB: Write to old_column AND new_column
BG->>DB: Backfill new_column in batches
Note over DB,App: Phase 3: Cut Over Reads
App->>DB: Read from new_column
App->>DB: Still write to both columns
Note over DB,App: Phase 4: Stop Old Writes
App->>DB: Write only to new_column
Note over DB,App: Phase 5: Contract
App->>DB: ALTER TABLE DROP old_column
Note over DB: Migration complete
Tier 4: Dangerous (Clear your schedule)
Change a primary key, change a partitioning scheme, or merge massive tables. These require full table rewrites affecting every foreign key and index. These aren't just migration tasks. They are fully-fledged infrastructure projects needing runbooks, specific rollback points, and dedicated maintenance windows.
The Tooling Landscape (No need to reinvent the wheel)
If you have a massive, write-heavy MySQL or Postgres database, you don't always manually manage the Expand-Contract dance. We rely on established tools:
- gh-ost (GitHub) for MySQL: Instead of heavy triggers, it creates a "ghost" table, alters it, and uses the MySQL binlog to replicate changes asynchronously. Great for write-heavy systems.
- pt-online-schema-change (Percona): The old veteran. Uses DB triggers to sync data to a shadow table. Simpler than gh-ost but adds trigger overhead, meaning your writes get a bit slower during the migration.
- pgroll (Xata): Automates the Expand-Contract pattern natively for Postgres, handling dual-schema versions simultaneously.
- strong_migrations (Ruby on Rails): My favorite defense. A gem that blocks dangerous migrations before they run and suggests the safe, multi-step alternatives.
The One Setting That Saves Careers
If you take away just one thing from this article, it's this. Add this to every single migration:
SET lock_timeout = '5s';
If your query waits more than 5 seconds for a lock, Postgres just kills your migration. Your deployment pipeline fails, but production stays up. You can just retry at 2 AM.
For advanced database management, here's a 50ms exponential backoff retry pattern (adapted from PostgresAI):
DO $do$
DECLARE
    lock_timeout CONSTANT text := '50ms';
    max_attempts CONSTANT int := 1000;
    ddl_completed boolean := false;
BEGIN
    PERFORM set_config('lock_timeout', lock_timeout, false);
    FOR i IN 1..max_attempts LOOP
        BEGIN
            EXECUTE 'ALTER TABLE my_table ADD COLUMN new_col int';
            ddl_completed := true;
            EXIT;
        EXCEPTION WHEN lock_not_available THEN
            PERFORM pg_sleep(LEAST(60, 0.01 * power(2, i)) * (0.5 + random()));
        END;
    END LOOP;
    IF NOT ddl_completed THEN
        RAISE EXCEPTION 'Migration failed after % attempts', max_attempts;
    END IF;
END $do$;
This approach attempts to get the lock, gives up after 50ms, backs off, and tries again. The migration never holds up production traffic for more than 50 milliseconds at a time.
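The same pattern can also live in your migration runner instead of the database. A sketch of that idea: the LockNotAvailable class here is a stand-in for your driver's lock-timeout error (psycopg raises errors.LockNotAvailable), and execute_ddl is whatever callable runs your ALTER with lock_timeout = '50ms' set.

```python
import random
import time

class LockNotAvailable(Exception):
    """Stand-in for the driver's lock_not_available error."""

def backoff_delay(attempt, cap=60.0):
    # Mirrors the PL/pgSQL block above: min(cap, 0.01 * 2^attempt),
    # jittered by a random factor in [0.5, 1.5).
    return min(cap, 0.01 * 2 ** attempt) * (0.5 + random.random())

def run_with_retry(execute_ddl, max_attempts=1000):
    """Retry a lock-timing-out DDL with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            execute_ddl()
            return attempt  # number of attempts it took
        except LockNotAvailable:
            time.sleep(backoff_delay(attempt))
    raise RuntimeError(f"migration failed after {max_attempts} attempts")
```

The jitter matters: without it, several blocked migrations (or retrying workers) wake up at the same instant and collide again.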
The Anti-Patterns That Will Ruin Your Uptime
- Mixing schema and data migrations. Don't run ADD COLUMN and UPDATE table SET column = 'x' in the same transaction block. You hold the schema lock hostage while your data backfills for 10 minutes.
- Assuming ROLLBACK is a valid strategy. If your script successfully backfilled a million rows and then failed on the last step, causing a schema ROLLBACK, your backfilled data is gone. Make phases independently reversible.
- Running schema changes during peak traffic. "But I used CONCURRENTLY!" It doesn't matter. You still need metadata locks, and peak traffic means a high chance of blocking. Schedule it off-peak.
- Not testing on production-sized data. The query planner behaves differently for 10K rows versus 100 million rows. Always validate against a staging environment that structurally mimics production.
Before hitting deploy on production:
- [ ] Did I set lock_timeout = '5s'?
- [ ] For Postgres: did I use CONCURRENTLY for index builds?
- [ ] For constraints: am I using NOT VALID with a separate VALIDATE phase?
- [ ] Have I split the schema modification and the data update into separate scripts/transactions?
- [ ] Is my read-replica lag monitor open and visible?
The engineers who maintain zero-downtime aren't necessarily inherently better developers. They are just incredibly, systematically cautious. Always assume the database lock queue is a risk to your uptime.
Sources & Further Reading
- Zero-downtime Postgres migrations - the hard parts (GoCardless)
- Zero-downtime Postgres schema migrations: lock_timeout and retries (PostgresAI)
- How we make database schema migrations safe and robust at Defacto
- Using the expand and contract pattern (Prisma Data Guide)
- Zero-Downtime MySQL Schema Migrations with gh-ost and pt-online-schema-change (DCHost)