Most PostgreSQL migration outages are caused by the same five patterns. Each one involves a lock that blocks queries longer than expected, a table rewrite that no one anticipated, or a cascade effect that turns a simple schema change into minutes of downtime. Here are the five mistakes, the production incidents they cause, and the safe patterns to avoid them.
Mistake #1: CREATE INDEX Without CONCURRENTLY
The dangerous migration
-- This looks harmless
CREATE INDEX idx_orders_customer_id ON orders (customer_id);
What happens in production
A regular CREATE INDEX acquires a SHARE lock on the table. This lock blocks all INSERT, UPDATE, and DELETE operations for the entire duration of the index build.
On a table with 1 million rows, an index build might take 5-30 seconds. On a table with 100 million rows, it can take minutes. During that entire time, every write to the table is blocked. Every API request that tries to insert or update a row in that table will queue up, waiting.
The cascading effect is what makes this dangerous: as write queries queue up, they consume connection pool slots. When the connection pool is exhausted, new requests start failing. What started as an index build on one table becomes a total service outage.
The safe pattern
-- Add CONCURRENTLY — builds the index without blocking writes
CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);
CREATE INDEX CONCURRENTLY acquires only a SHARE UPDATE EXCLUSIVE lock, which allows reads and writes to continue while the index is being built. It takes longer (two table scans instead of one), but your application stays up.
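On PostgreSQL 12 and newer, you can watch the concurrent build (including its two table scans) from another session via the progress view:

```sql
-- Progress of any in-flight CREATE INDEX [CONCURRENTLY]
SELECT phase, blocks_done, blocks_total
FROM pg_stat_progress_create_index;
```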
Important caveat: CONCURRENTLY cannot run inside a transaction. If your migration framework wraps migrations in transactions by default (Flyway, Liquibase, Knex), you need to disable the transaction for this specific migration.
-- Flyway: disable transaction for this migration
-- flyway:executeInTransaction=false
CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);
-- If the concurrent build fails, clean up the invalid index:
DROP INDEX CONCURRENTLY IF EXISTS idx_orders_customer_id;
-- Then retry the CREATE INDEX CONCURRENTLY
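A failed concurrent build leaves the half-built index behind flagged as invalid; you can list any such leftovers from the catalog before retrying:

```sql
-- Indexes left INVALID by failed CONCURRENTLY builds
SELECT indexrelid::regclass AS index_name,
       indrelid::regclass   AS table_name
FROM   pg_index
WHERE  NOT indisvalid;
```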
Mistake #2: ALTER TABLE SET NOT NULL Without the CHECK Pattern
The dangerous migration
-- Adding NOT NULL to an existing column
ALTER TABLE users ALTER COLUMN email SET NOT NULL;
What happens in production
SET NOT NULL acquires an ACCESS EXCLUSIVE lock — the most restrictive lock in PostgreSQL. This lock blocks everything: reads, writes, even plain SELECT queries.
While holding this lock, PostgreSQL scans the entire table to verify that no existing rows have NULL values in the column. On a 50-million-row table, this scan can take minutes. During those minutes, nothing else can touch the table. Not even reads.
"Nothing quite like the surprise you get the first time you rewrite every tuple on a table with 20M rows because you added a column with a default value." — Hacker News
The safe pattern (PostgreSQL 12+)
-- Step 1: Add a CHECK constraint with NOT VALID
-- This only checks new/modified rows, no full table scan
ALTER TABLE users ADD CONSTRAINT users_email_not_null
CHECK (email IS NOT NULL) NOT VALID;
-- Step 2: Validate the constraint in a separate transaction
-- This scans the table but only holds a SHARE UPDATE EXCLUSIVE lock
-- (allows reads AND writes to continue)
ALTER TABLE users VALIDATE CONSTRAINT users_email_not_null;
-- Step 3: Now set NOT NULL — PostgreSQL sees the valid CHECK constraint
-- and skips the full table scan entirely
ALTER TABLE users ALTER COLUMN email SET NOT NULL;
-- Step 4: Drop the CHECK constraint (redundant now)
ALTER TABLE users DROP CONSTRAINT users_email_not_null;
This four-step pattern is the standard safe approach. The key insight: once PostgreSQL sees a validated CHECK constraint that guarantees no NULLs exist, the SET NOT NULL becomes a metadata-only operation that completes instantly.
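If you are unsure whether the validation in step 2 actually ran, the constraint's state is visible in the system catalog:

```sql
-- convalidated = true means the VALIDATE CONSTRAINT step completed
SELECT conname, convalidated
FROM   pg_constraint
WHERE  conrelid = 'users'::regclass;
```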
PostgreSQL 18 update: As of PostgreSQL 18 (September 2025), NOT NULL constraints can themselves be added as NOT VALID and validated in a separate step, which replaces the CHECK-constraint workaround on 18+. On earlier versions, the four-step CHECK pattern above is still the way to go.
Mistake #3: ALTER COLUMN TYPE (Silent Table Rewrite)
The dangerous migration
-- Changing a column type
ALTER TABLE events ALTER COLUMN payload TYPE jsonb USING payload::jsonb;
-- Or seemingly innocent changes like:
ALTER TABLE events ALTER COLUMN id TYPE bigint; -- was int
What happens in production
Most column type changes trigger a full table rewrite. PostgreSQL creates a new copy of the table with the new column type, copies every row, then swaps the tables. The entire operation runs under an ACCESS EXCLUSIVE lock.
On a 10GB table, a rewrite can take 10-30 minutes. During that time, no reads or writes are possible. Your application is completely blocked on that table. If the table is central to your application (users, orders, events), this is a full outage.
What makes this especially dangerous is that some type changes look safe. Widening int to bigint seems trivial, but PostgreSQL rewrites the entire table. (Not every change does: since PostgreSQL 9.2, growing a varchar limit, say varchar(255) to varchar(500), is a metadata-only change — but shrinking one still forces a rewrite.)
Safe alternatives
There is no universal safe pattern for column type changes. The approach depends on the specific change:
-- varchar(255) → text: SAFE, no rewrite (just removes the length limit)
ALTER TABLE users ALTER COLUMN name TYPE text;
-- varchar(255) → varchar(500): metadata-only since PostgreSQL 9.2
-- varchar(500) → varchar(255): REWRITES — if limits keep changing, use text:
ALTER TABLE users ALTER COLUMN name TYPE text;
-- int → bigint: REWRITES the table
-- Safe alternative: expand-contract migration
-- 1. Add new column
ALTER TABLE events ADD COLUMN id_new bigint;
-- 2. Backfill in batches (application-level)
-- 3. Swap columns
-- 4. Drop old column
The general principle: if the new type requires more storage per row or a different binary representation, PostgreSQL rewrites the table. The safe alternative is usually an expand-contract migration: add a new column, backfill, swap, drop.
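Step 2 of the expand-contract sequence above can be done with plain batched UPDATEs; a sketch (the batch bounds are illustrative — drive the loop from your application or a script):

```sql
-- Backfill sketch: small batches keep row locks short-lived
-- and let autovacuum keep up; loop until no rows remain
UPDATE events
SET    id_new = id
WHERE  id BETWEEN 1 AND 10000   -- next batch: 10001-20000, and so on
  AND  id_new IS NULL;
```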
Mistake #4: Running Migrations Without lock_timeout
The dangerous migration
-- No lock_timeout set — will wait indefinitely for the lock
ALTER TABLE orders ADD COLUMN discount_pct numeric(5,2);
What happens in production
Even a fast DDL operation like adding a nullable column needs to acquire a lock on the table. If there is a long-running query or transaction holding a conflicting lock on that table, your DDL statement will wait indefinitely for the lock.
While your DDL statement is waiting, it goes into the lock queue. Here's the critical part: every subsequent query on that table also gets queued behind your waiting DDL. Even simple SELECT queries. The lock queue is FIFO, and your DDL operation is blocking everything behind it.
This is the "lock cascade" pattern. One long-running analytics query + one migration without lock_timeout = total table lockout. Your migration is waiting for the analytics query, and every application query is waiting for your migration.
The safe pattern
-- Always set lock_timeout before DDL operations
SET lock_timeout = '5s';
-- Now if the lock can't be acquired within 5 seconds,
-- the statement fails instead of waiting indefinitely
ALTER TABLE orders ADD COLUMN discount_pct numeric(5,2);
-- Reset for safety
RESET lock_timeout;
With lock_timeout set, a failed lock acquisition gives you an error you can handle: retry after a short delay, kill the blocking query, or run the migration during a lower-traffic window. Without it, your only option is to manually kill the migration session while your application is down.
A good deployment wrapper retries the migration 3-5 times with a short delay between attempts. If a 5-second lock_timeout fails 5 times in a row (25 seconds of retrying), something is genuinely blocking the table and you need to investigate — not wait forever.
# Retry wrapper (sketch: bash + psql; ON_ERROR_STOP makes psql exit nonzero
# when the statement fails, e.g. on a lock_timeout error)
for attempt in 1 2 3 4 5; do
  if psql -v ON_ERROR_STOP=1 "$DATABASE_URL" <<'SQL'
BEGIN;
SET LOCAL lock_timeout = '5s';
ALTER TABLE orders ADD COLUMN discount_pct numeric(5,2);
COMMIT;
SQL
  then
    break                     # success, exit loop
  fi
  sleep $((2 ** attempt))     # exponential backoff: 2s, 4s, 8s, ...
done
# If all 5 attempts fail, alert and investigate
Mistake #5: Adding a Foreign Key Without NOT VALID
The dangerous migration
-- Adding a foreign key constraint
ALTER TABLE orders
ADD CONSTRAINT orders_customer_fk
FOREIGN KEY (customer_id) REFERENCES customers (id);
What happens in production
Adding a foreign key constraint acquires locks on both tables — the referencing table (orders) and the referenced table (customers). PostgreSQL needs to scan the entire referencing table to verify that every value in the foreign key column exists in the referenced table.
GoCardless documented a 15-second API outage caused by exactly this pattern: a foreign key constraint that locked both tables simultaneously, blocking all queries on both.
The lock on the referenced table (customers) is what most developers don't expect. You think you are modifying the orders table, but you are actually locking customers too. If customers is your most-queried table, this cascades quickly.
The safe pattern
-- Step 1: Add the constraint with NOT VALID
-- This only checks new/modified rows, no full table scan
-- Still locks both tables briefly, but completes instantly
ALTER TABLE orders
ADD CONSTRAINT orders_customer_fk
FOREIGN KEY (customer_id) REFERENCES customers (id)
NOT VALID;
-- Step 2: Validate the constraint in a separate transaction
-- This scans the table but holds a weaker lock (SHARE UPDATE EXCLUSIVE)
-- Reads and writes continue normally on both tables
ALTER TABLE orders VALIDATE CONSTRAINT orders_customer_fk;
The NOT VALID + VALIDATE pattern splits the operation into two phases: a fast metadata change (milliseconds, locks both tables briefly) and a slow validation scan (seconds to minutes, but only holds a SHARE UPDATE EXCLUSIVE lock that allows reads and writes).
Common Thread: The Lock Queue Problem
All five mistakes share the same underlying mechanism: they acquire a lock that's too strong for too long. The PostgreSQL lock queue makes this worse than you'd expect:
- Your DDL statement requests an ACCESS EXCLUSIVE lock
- A long-running query is holding a conflicting lock — your DDL waits
- While your DDL waits, all new queries on the table queue behind your DDL
- The lock queue grows: 10 queries, 50 queries, 200 queries
- Connection pool fills up with blocked queries
- New requests can't get a connection — service returns 5xx errors
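When this cascade is happening live, the blocking chain is visible from another session (pg_blocking_pids, PostgreSQL 9.6+):

```sql
-- Who is waiting, and which backend(s) they are waiting on
SELECT pid, pg_blocking_pids(pid) AS blocked_by, state, query
FROM   pg_stat_activity
WHERE  cardinality(pg_blocking_pids(pid)) > 0;
```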
The total outage duration is not the time your DDL takes. It's the time the blocking query was already running + the time your DDL takes + the time to drain the queued requests. A 3-second DDL operation behind a 60-second analytics query causes a 63+ second outage.
Preventing These Mistakes
The safe patterns above are well-documented but rarely enforced. A developer who writes migrations twice a month is unlikely to remember the NOT VALID + VALIDATE pattern or the lock_timeout requirement every time.
The most effective prevention is automated linting in CI. When a migration is opened in a pull request, a linter can catch all five patterns before they reach production:
# Check all migration files for safety issues
npx migrationpilot check migrations/*.sql --fail-on critical
# Example output:
# migrations/V42__add_index.sql
# MP001 [critical] CREATE INDEX without CONCURRENTLY
# Acquires: SHARE lock (blocks writes)
# Safe alternative:
# CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);
#
# MP004 [warning] Missing lock_timeout
# Safe alternative:
# SET lock_timeout = '5s';
Free tools that catch these patterns:
- Squawk — 32 rules, Rust, GitHub Action
- strong_migrations — Rails only, excellent error messages
- MigrationPilot — 80 rules, lock classification, auto-fix, GitHub Action
The specific tool matters less than having any automated check. These five mistakes are repetitive patterns that a linter catches in milliseconds. The production incidents they cause take hours to diagnose and recover from.
Quick Reference
| Mistake | Lock Acquired | Safe Pattern |
|---|---|---|
| CREATE INDEX (no CONCURRENTLY) | SHARE | CREATE INDEX CONCURRENTLY |
| SET NOT NULL (direct) | ACCESS EXCLUSIVE | CHECK NOT VALID + VALIDATE |
| ALTER COLUMN TYPE | ACCESS EXCLUSIVE | Expand-contract migration |
| No lock_timeout | Any (queues indefinitely) | SET lock_timeout = '5s' |
| FK without NOT VALID | SHARE ROW EXCLUSIVE (both tables) | NOT VALID + VALIDATE |