Isabelle Hue

Posted on Jun 2 • Originally published at dev.to

The 15 Postgres migration footguns that lock production — and how to catch them in PR review

#postgres #opensource #database #devops

TL;DR — Most prod migration incidents come from a small, well-known catalogue of SQL patterns. Your CI doesn't catch them. Your staging database doesn't either, because it has 30 000 rows and production has 1.2 billion. This post catalogues the 15 footguns, with safe rewrites, and ships a free GitHub Action (Marketplace) that blocks them in the PR review.

A migration that ran in 0.4s locked production for 20 minutes

The exact pattern, from a real postmortem:

ALTER TABLE posts ALTER COLUMN category SET NOT NULL;

Tested on staging (30 000 rows) — green in 0.4 seconds. Shipped to production (a posts table with over 1 billion rows). Postgres took an ACCESS EXCLUSIVE lock, scanned every row to validate the constraint, and held the lock for 20 minutes.

For 20 minutes, no read, no write, nothing. Reader traffic queues, writers time out, the app degrades into a 503 page.

The mechanism is well documented. The fix takes three lines of safer SQL. And yet teams keep doing it, because nothing in the standard PR review flow surfaces "this statement will lock writers for the duration of a full-table scan".

This is one of fifteen footguns I want to put on the table.

Why staging never catches it

Staging databases lie to you in three different ways:

Volume. A SET NOT NULL that scans 30 000 rows runs in milliseconds. The same statement against 1.2 billion rows can run for the duration of an episode of a sitcom. Lock duration is roughly linear in row count.
Contention. Your staging instance has one developer poking at it; production has thousands of concurrent connections waiting on every row they need. The lock that's free on staging is a queue depth bomb on prod.
Versions and extensions. Lock-upgrade behavior changed between Postgres 11, 12, and 13. Subtle index-build semantics depend on patch versions. Staging often lags. Production often leads.

The team that wrote the postmortem above had a staging environment. The migration was reviewed. CI passed. None of that helped.

The catalogue — 15 patterns to never merge

Data loss

These permanently delete data, or break the running app version mid-deploy.

1. DROP COLUMN

ALTER TABLE users DROP COLUMN email; -- ❌

Permanent. And during the deploy window, the previous app version still reads email and crashes.

Safe rewrite: ship a code-side stop-reading-this-column release first. Optionally rename to email_deprecated for a release. Drop the column in a later migration once nothing references it.

2. DROP TABLE

DROP TABLE legacy_orders; -- ❌

Same as above, table-scale. Add a check: is anything still referencing it? Foreign keys? A delete in production at 3am can take a service down at 9am.

Safe rewrite: rename, observe for a release, drop later.

3. TRUNCATE

TRUNCATE sessions; -- ❌ unless you really mean it

Often used as a quick reset in dev that survives into a migration. Production teams rarely intend it.

Production locks

These take an ACCESS EXCLUSIVE lock and scan every row. On a big table, that's minutes of blocked writes.

4. SET NOT NULL on an existing column

ALTER TABLE users ALTER COLUMN phone SET NOT NULL; -- ❌

Postgres validates the constraint by scanning every row, holding ACCESS EXCLUSIVE for the duration. Even checking an existing NOT NULL constraint requires a scan.

Safe rewrite (Postgres 12+): add a CHECK (phone IS NOT NULL) NOT VALID constraint, VALIDATE CONSTRAINT in a separate transaction, then SET NOT NULL (which becomes O(1) because the check is already validated).

5. ADD COLUMN ... NOT NULL without a default

ALTER TABLE users ADD COLUMN age integer NOT NULL; -- ❌

Outright fails on any non-empty table — Postgres can't synthesize a value.

Safe rewrite: add nullable, backfill in batches, then add the constraint via the NOT VALID dance above.

6. ALTER COLUMN ... TYPE

ALTER TABLE users ALTER COLUMN id TYPE bigint; -- ❌

Rewrites the entire table. Locks for the duration. The "free" int → bigint upgrade has bitten more teams than any other type change.

Safe rewrite: add a new nullable column of the target type, backfill, swap the application read path, drop the old column much later.

7. CREATE INDEX without CONCURRENTLY

CREATE INDEX idx_users_email ON users (email); -- ❌ on Postgres

Holds SHARE lock — blocks writes until the index is built. On a big table, that's minutes.

Safe rewrite:

CREATE INDEX CONCURRENTLY idx_users_email ON users (email); -- ✅

8. CREATE INDEX CONCURRENTLY inside a Prisma/Drizzle migration

-- Looks safe! But this fails at runtime.
CREATE INDEX CONCURRENTLY idx_users_email ON users (email);

Both Prisma and Drizzle wrap migrations in a transaction by default. CONCURRENTLY cannot run inside a transaction. The migration errors out, your deploy fails halfway through, and now you have a half-applied schema to clean up by hand at 3am.

Safe rewrite: split the index into a separate migration with BEGIN/COMMIT stripped, or use the ORM's escape hatch (Prisma: a raw SQL migration created with --create-only and the transaction wrapper removed).

Silent breaks

These don't lock — they break consumers in subtle ways the deployed app version can't anticipate.

9. ADD CONSTRAINT ... FOREIGN KEY (without NOT VALID)

ALTER TABLE posts
ADD CONSTRAINT fk_user FOREIGN KEY (user_id) REFERENCES users (id); -- ❌

Validates every existing row under a lock.

Safe rewrite: ADD CONSTRAINT ... NOT VALID, then VALIDATE CONSTRAINT in a separate, lock-free transaction.

10. ADD CONSTRAINT ... CHECK (without NOT VALID)
Same shape as the foreign-key case. Same fix.

11. ADD CONSTRAINT ... UNIQUE

ALTER TABLE users ADD CONSTRAINT uq_email UNIQUE (email); -- ❌

Builds the underlying unique index under a lock.

Safe rewrite: CREATE UNIQUE INDEX CONCURRENTLY, then ADD CONSTRAINT ... UNIQUE USING INDEX.

12. RENAME COLUMN

ALTER TABLE users RENAME COLUMN fname TO first_name; -- ❌ during deploy

Atomic on the DB side — but the previous app version still selects fname and crashes the second the rename commits.

Safe rewrite: add a new column, dual-write, migrate readers, drop the old column much later. Or, if downtime is acceptable, schedule it.

13. RENAME TABLE
Same pattern as rename column, one level up.

Lower severity but worth a heads-up

14. DROP INDEX without CONCURRENTLY

DROP INDEX idx_old; -- ⚠

Brief ACCESS EXCLUSIVE. Usually fast, but on a busy table can cause noticeable timeouts. Prefer DROP INDEX CONCURRENTLY (Postgres 9.2+).

15. Volatile defaults on ADD COLUMN

ALTER TABLE users ADD COLUMN token uuid DEFAULT gen_random_uuid(); -- ⚠

Postgres 11+ can skip the table rewrite for constant defaults — but gen_random_uuid() and now() are volatile and will trigger a full rewrite.

Safe rewrite: add the column nullable, backfill with the generator in batches, set the default afterwards.

What about `strong_migrations` / Squawk / Atlas?

These are excellent tools. None of them quite fits the modern stack:

ankane/strong_migrations — Rails-only. If you ship Rails, install this gem today.
Squawk — Postgres + raw SQL. Excellent linter, no understanding of Prisma/Drizzle migration semantics. You wire up the CI yourself.
Atlas — atlas migrate lint is a top-tier tool. They paywalled it in October 2025 ($9/dev + $59/CI project + $39/database).
Generic AI review bots (CodeRabbit, Greptile, Qodo) — don't model lock semantics. They'll happily approve a SET NOT NULL.

There's a gap: hosted, zero-config, deterministic, multi-ORM. So I built it.

Catch them in the PR — `migration-autopilot`

Migration Autopilot is a free, MIT-licensed GitHub Action that runs the 15 rules above on every pull request and blocks the merge if it finds a high-severity issue.

📺 60-second demo → — three real PRs, three rules triggered, three blocked merges.

Two minutes to set up:

# .github/workflows/migration-review.yml
name: Migration Autopilot
on: pull_request
permissions:
  contents: read
  pull-requests: write

jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: isabellehuecloser-ctrl/migration-autopilot@v0
        with:
          fail-on: high # block merge on dangerous migrations
          dialect: postgres

That's it. The detection is a deterministic rule engine — no OpenAI key required, no LLM hallucinations, no false positives. A merge-gating tool that cries wolf gets disabled within a week. The rules are derived from Squawk, strong_migrations, the Atlas PG301-311 series, and the postmortems linked above.

Supported migration sources (auto-detected): Prisma prisma/migrations/, Drizzle drizzle/, Rails db/migrate/*.rb (the Ruby DSL is parsed directly — no Ruby runtime needed), and raw SQL in any directory that smells like migrations.

Dialects: Postgres and MySQL. Postgres-specific rules (like CREATE INDEX CONCURRENTLY) don't fire on MySQL files.

What I'd love your feedback on

I'm a solo dev. The free Action is feature-complete; a hosted Pro version (1-click install, dashboard, multi-repo policy) is in early access.

What rule is missing? If you've hit a footgun the catalogue doesn't cover, open an issue.
What's the false-positive rate on your real codebase? Install the Action on a recent branch and tell me what fires wrongly. Zero false positives is the hardest commitment to keep.
Is the safe-rewrite text useful, or noise? I'd rather it be slightly opinionated than vague.

Source code, issues, contributions welcome:
github.com/isabellehuecloser-ctrl/migration-autopilot

If the Action catches a real footgun on a PR of yours — drop a ⭐ on the repo. That's how I know it's landing somewhere.

References & further reading:

DEV Community

The 15 Postgres migration footguns that lock production — and how to catch them in PR review

A migration that ran in 0.4s locked production for 20 minutes

Why staging never catches it

The catalogue — 15 patterns to never merge

Data loss

Production locks

Silent breaks

Lower severity but worth a heads-up

What about `strong_migrations` / Squawk / Atlas?

Catch them in the PR — `migration-autopilot`

What I'd love your feedback on

Top comments (0)

A migration that ran in 0.4s locked production for 20 minutes

Why staging never catches it

The catalogue — 15 patterns to never merge

Data loss

Production locks

Silent breaks

Lower severity but worth a heads-up

What about strong_migrations / Squawk / Atlas?

Catch them in the PR — migration-autopilot

What I'd love your feedback on

What about `strong_migrations` / Squawk / Atlas?

Catch them in the PR — `migration-autopilot`