Handshake deployed what looked like a routine migration during one of their regular multiple-times-a-day releases:
ALTER TABLE table_name
ADD CONSTRAINT fk_user_id
FOREIGN KEY (user_id) REFERENCES users(id);
60 seconds later, their entire site was down.
The problem? Adding a foreign key constraint requires an ACCESS EXCLUSIVE lock on the referenced table. Postgres grants locks first-come, first-served. A long-running query on the users table was holding an ACCESS SHARE lock, so the migration queued up waiting for it. But here's the kicker: since ACCESS EXCLUSIVE conflicts with everything, all the regular SELECT queries that came after started queuing behind the migration. The lock queue grew. The site stopped responding. They had to abort the migration to bring the site back up.
GoCardless hit a similar issue. They were recreating foreign key constraints on renamed tables. The tables were empty, so it seemed safe. But adding the constraints required locks on the parent tables, which were heavily used. API timeouts across the board for 15 seconds.
Both incidents came from database migrations that looked completely normal. Both ran fine in staging. Both only showed their true colors in production.
The Postgres Lock Problem
Postgres uses different lock levels to keep your data consistent. The worst one is ACCESS EXCLUSIVE. When something holds this lock, nothing else can touch that table. Not SELECT, not INSERT, nothing.
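To see what that means in practice, here's a hedged two-session sketch (the nickname column is just for illustration). Holding the ALTER open in a transaction in one session blocks even a plain read in another:
-- Session 1: ALTER TABLE takes ACCESS EXCLUSIVE, held until COMMIT or ROLLBACK
BEGIN;
ALTER TABLE users ADD COLUMN nickname TEXT;
-- Session 2: even a simple read now waits for session 1 to finish
SELECT count(*) FROM users;
-- Session 1: release the lock
COMMIT;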
A bunch of common migration operations take ACCESS EXCLUSIVE or other heavyweight locks:
Creating Indexes
CREATE INDEX idx_users_email ON users(email);
This takes a SHARE lock, which blocks all writes (INSERT, UPDATE, DELETE) while the index builds. On a table with millions of rows, this can take minutes.
Adding NOT NULL
ALTER TABLE users ALTER COLUMN email SET NOT NULL;
Postgres has to verify every single row has a non-null value. ACCESS EXCLUSIVE lock the entire time.
Changing Column Types
ALTER TABLE products ALTER COLUMN price TYPE NUMERIC(10,2);
Complete table rewrite. Every row converted to the new type. ACCESS EXCLUSIVE lock throughout.
The Lock Queue Makes It Worse
Here's what makes this really nasty: Postgres's lock queue is first-come, first-served. If a migration is waiting for a lock, all subsequent queries queue behind it, even if they don't conflict with the original lock holder.
So this happens:
- A long-running query holds an ACCESS SHARE lock on the users table.
- The migration tries to get an ACCESS EXCLUSIVE lock and queues up behind it.
- A new SELECT query comes in; it would normally be fine alongside ACCESS SHARE.
- But since the migration is waiting, this SELECT goes to the back of the queue.
- Another SELECT comes in, also queued.
- Your application starts timing out.
The migration hasn't even started yet, and you're already down.
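You can watch a queue like this form while it's happening. A minimal sketch, assuming Postgres 9.6+ for pg_blocking_pids:
-- Which sessions are stuck waiting on a lock, and who is blocking them?
SELECT pid,
       state,
       wait_event_type,
       pg_blocking_pids(pid) AS blocked_by,
       left(query, 60)       AS query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';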
Why This Is Hard to Catch
These migrations look fine. They work great in development where you have 100 test records. They run instantly in staging with 10,000 rows. Then in production with 5 million rows, they lock the table for minutes.
You don't find out until it's running against real data with real traffic.
Enter diesel-guard
diesel-guard was built to catch these problems before they hit production. It's a static analysis tool that scans your Diesel migration files for dangerous operations.
Install it:
cargo install diesel-guard
Run it:
diesel-guard check migrations/
When it finds something risky:
❌ Unsafe migration detected in migrations/2024_01_01_add_fk/up.sql
❌ ADD COLUMN with DEFAULT
Problem:
Adding column 'status' with DEFAULT on table 'orders' requires a full
table rewrite on PostgreSQL < 11, which acquires an ACCESS EXCLUSIVE
lock. On large tables, this can take significant time and block all
operations.
Safe alternative:
1. Add the column without a default:
ALTER TABLE orders ADD COLUMN status TEXT;
2. Backfill data in batches (outside migration):
UPDATE orders SET status = 'pending' WHERE status IS NULL;
3. Add default for new rows only:
ALTER TABLE orders ALTER COLUMN status SET DEFAULT 'pending';
Note: For Postgres 11+, this is safe if the default is a constant.
You get:
- What's dangerous about it
- What lock it takes
- Step-by-step fix with SQL
Safe Alternatives for Common Operations
Creating Indexes
Instead of:
CREATE INDEX idx_orders_created_at ON orders(created_at);
Do this:
CREATE INDEX CONCURRENTLY idx_orders_created_at ON orders(created_at);
CONCURRENTLY means writes can continue while the index builds. The catch is that CREATE INDEX CONCURRENTLY can't run inside a transaction block, and Diesel wraps each migration in one by default, so you need to add a metadata.toml next to the migration:
# migrations/2024_01_01_add_order_index/metadata.toml
run_in_transaction = false
Adding NOT NULL
Instead of:
ALTER TABLE users ALTER COLUMN email SET NOT NULL;
Do this:
-- Add CHECK constraint without validating existing rows
ALTER TABLE users
ADD CONSTRAINT users_email_not_null_check
CHECK (email IS NOT NULL) NOT VALID;
-- Validate separately (lighter lock)
ALTER TABLE users VALIDATE CONSTRAINT users_email_not_null_check;
-- Add NOT NULL (fast since we validated)
ALTER TABLE users ALTER COLUMN email SET NOT NULL;
-- Clean up
ALTER TABLE users DROP CONSTRAINT users_email_not_null_check;
The VALIDATE step uses a SHARE UPDATE EXCLUSIVE lock, which allows reads and writes to continue. On Postgres 12+, the final SET NOT NULL sees the validated CHECK constraint and skips the full-table scan, which is why it's fast.
Adding UNIQUE Constraints
Instead of:
ALTER TABLE users ADD CONSTRAINT users_email_key UNIQUE (email);
Do this:
-- Build index concurrently
CREATE UNIQUE INDEX CONCURRENTLY users_email_idx ON users(email);
-- Add constraint using existing index (instant)
ALTER TABLE users
ADD CONSTRAINT users_email_key
UNIQUE USING INDEX users_email_idx;
Adding Foreign Keys
This is what bit Handshake. Instead of:
ALTER TABLE posts
ADD CONSTRAINT posts_user_id_fkey
FOREIGN KEY (user_id) REFERENCES users(id);
Do this:
-- Add constraint without validating existing rows
ALTER TABLE posts
ADD CONSTRAINT posts_user_id_fkey
FOREIGN KEY (user_id) REFERENCES users(id)
NOT VALID;
-- Validate separately (lighter lock)
ALTER TABLE posts VALIDATE CONSTRAINT posts_user_id_fkey;
NOT VALID means the constraint is added without scanning existing rows. VALIDATE happens separately and takes only a SHARE UPDATE EXCLUSIVE lock on posts (plus a ROW SHARE lock on users), so normal traffic keeps flowing.
Setting Timeouts
One thing both Handshake and GoCardless learned: set lock_timeout in your migrations.
-- At the top of your migration
SET lock_timeout = '2s';
ALTER TABLE users ADD COLUMN email TEXT;
If the migration can't get a lock within 2 seconds, it fails instead of queuing indefinitely. Your app stays up, and you can retry during lower traffic.
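If you'd rather not repeat that SET line in every migration file, lock_timeout is a regular Postgres setting, so you can also set a default at the role or database level. A minimal sketch; migration_user and app_production are placeholder names:
-- Applies to every session opened by the role your migrations run as
ALTER ROLE migration_user SET lock_timeout = '2s';
-- Or set a default for the whole database
ALTER DATABASE app_production SET lock_timeout = '2s';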
When You Know It's Safe
Sometimes you know a migration is safe (tiny table, maintenance window, etc.):
-- safety-assured:start
-- Safe because: table has 50 rows, deploying during maintenance window
ALTER TABLE countries ADD COLUMN flag_emoji TEXT DEFAULT '🏳️';
-- safety-assured:end
diesel-guard will skip anything in these blocks.
Configuration
Put a diesel-guard.toml in your project:
# Skip checking migrations before this date
start_after = "2024_01_01_000000"
# Check down.sql too
check_down = true
# Disable specific checks if needed
disable_checks = ["CreateExtensionCheck"]
CI/CD Integration
Add to GitHub Actions:
name: Check Migrations
on: [pull_request]
jobs:
  check-migrations:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ayarotsky/diesel-guard@v0.3.0
Now dangerous migrations get caught in PR review, not production.
What It Checks
diesel-guard currently detects 19 different issues:
- ADD COLUMN with DEFAULT
- CREATE INDEX without CONCURRENTLY
- ADD NOT NULL
- ADD UNIQUE constraint
- ADD FOREIGN KEY without NOT VALID
- ADD PRIMARY KEY to existing tables
- ADD SERIAL columns
- ALTER COLUMN TYPE
- CREATE EXTENSION
- DROP COLUMN
- DROP INDEX without CONCURRENTLY
- DROP PRIMARY KEY
- RENAME COLUMN
- RENAME TABLE
- Short integer primary keys (SMALLINT/INT)
- TRUNCATE TABLE
- Unnamed constraints
- Using JSON instead of JSONB
- Wide indexes (4+ columns)
More are coming. The target is 40 checks covering the most dangerous Postgres operations.
Why This Matters for Rust
The Rust ecosystem has great tooling. clippy lints your code. cargo audit catches security problems. But we didn't have anything for database migrations.
I've seen too many production incidents from migrations. The fix is usually obvious in hindsight, but you only find out when it's causing downtime.
diesel-guard brings the fix forward to development time.
Should You Use This?
Maybe you're thinking "my tables are small."
Tables grow. The users table with 100 rows today might have a million rows next year. The migration that's instant now might take minutes then.
Building safe migrations from the start is way easier than fixing them during an incident. And diesel-guard takes a couple seconds to run.
Wrapping Up
Database migrations can be tricky. Operations that look perfectly safe can cause serious production issues. The gap between how they behave in staging versus production can be huge, and often you only find out when it's too late.
Static analysis can catch these problems early. Whether you use diesel-guard or build your own checks, having something review your migrations before they reach production is worth it. The patterns are well-documented and you just need tooling to enforce them.
Building safety into your database migration workflow pays off.