Originally published on Hashnode. Cross-posted for the DEV.to community.
The first time I broke production with a migration, I was adding a NOT NULL constraint to a 12-million-row users table during peak traffic. The query locked the table, the API timed out, payments failed, and I spent the next 90 minutes hunched over a Slack incident channel watching my MRR bleed in real time. The migration was technically correct. The execution context was completely wrong.
Since then I've built a Claude Code workflow that catches dangerous migrations before they ship. Every migration in my codebase goes through an AI safety review that flags lock contention, missing rollback paths, and concurrent write hazards. I haven't shipped a migration-driven incident in 14 months. This is the workflow.
What Makes Database Migrations Different
A regular code change is reversible. If you ship a bug, you redeploy. The damage is bounded by how long it took you to notice.
Migrations are not reversible. Once you've added a column, dropped an index, or changed a type, the data is in the new shape. Rolling back means another migration that has to handle whatever data accumulated in between. If the migration corrupted data, no rollback recovers it.
The asymmetry between regular code and migrations is why migration review is its own discipline. The questions you ask are different. The risks are different. The tools are different. Treating migrations like regular PRs is how teams break production.
Every migration is a one-way door. Treating it like a two-way door is how you end up writing apology letters to customers.
The Eight Risk Categories
After analyzing every migration incident I've seen across three companies, I categorized the failure modes into eight buckets. My migration review skill checks all eight on every migration.
1. Lock Contention
Some operations lock the entire table. On a small table this is invisible. On a multi-million row table during peak traffic, it kills your API.
Operations that lock in PostgreSQL:
- ADD COLUMN with a default value (in older versions)
- ALTER COLUMN type changes
- ADD CONSTRAINT and SET NOT NULL
- CREATE INDEX without CONCURRENTLY
- VACUUM FULL
The skill flags these and suggests safer alternatives.
2. Concurrent Write Hazards
Backfill migrations that update millions of rows can deadlock with concurrent writes. The skill checks for backfills and flags whether they batch and whether they use appropriate isolation levels.
3. Missing Rollback Paths
Every migration should have a documented rollback strategy. The skill flags migrations that don't define rollback or where the rollback would lose data.
4. Data Loss Risks
Some operations destroy data: DROP COLUMN, DROP TABLE, TRUNCATE. The skill flags these and verifies the destruction is intentional.
5. Replication Lag
Long-running migrations can fall behind on replicas. The skill estimates row counts and flags migrations that would take more than 60 seconds to replicate.
6. Constraint Validation
Adding a constraint to existing data can fail mid-migration if any row violates the constraint. The skill flags constraint additions and asks whether existing data has been validated.
7. Index Dependencies
Dropping an index can tank query performance if any query was using it. The skill checks if there are queries in the codebase that match the index pattern and flags potential performance regressions.
8. Foreign Key Implications
Adding or removing foreign keys can affect cascade behavior. The skill flags FK changes and asks about cascade implications.
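Several of these categories can be pre-screened with cheap static checks before the AI review even runs, so the most obvious hazards fail fast. A minimal sketch — the regexes, category names, and messages here are illustrative, not the production skill, and a real checker would want a proper SQL parser rather than pattern matching:

```python
import re

# Illustrative patterns covering a few of the eight categories.
# Regexes will miss variants (e.g. CREATE UNIQUE INDEX); this is a
# fast pre-screen, not a replacement for the full review.
RISK_PATTERNS = [
    (re.compile(r"\bCREATE\s+INDEX\b(?!\s+CONCURRENTLY)", re.I),
     "lock_contention", "CREATE INDEX without CONCURRENTLY blocks writes"),
    (re.compile(r"\bVACUUM\s+FULL\b", re.I),
     "lock_contention", "VACUUM FULL takes an exclusive lock"),
    (re.compile(r"\bDROP\s+(TABLE|COLUMN)\b|\bTRUNCATE\b", re.I),
     "data_loss", "Destructive operation - verify intent"),
    (re.compile(r"\bUPDATE\b", re.I),
     "concurrent_writes", "Backfill UPDATE - verify it batches"),
]

def flag_migration(sql: str) -> list[tuple[str, str]]:
    """Return (category, message) findings for a migration file's SQL."""
    findings = []
    for pattern, category, message in RISK_PATTERNS:
        if pattern.search(sql):
            findings.append((category, message))
    return findings
```

A pre-screen like this runs in milliseconds, which means it can live in a pre-commit hook while the full AI review runs in CI.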
The Migration Review Skill
The skill that runs all eight checks looks like this.
---
name: migration-review
description: Reviews a database migration file for safety risks before deployment.
---
# Migration Review
You are a senior database engineer reviewing a migration before it ships to production. The database has 50M+ rows in major tables and serves 10K req/min at peak. Downtime is unacceptable.
## Your Task
Review the migration file at the provided path. For each of the eight risk categories below, produce a finding: PASS, FLAG, or BLOCK.
1. Lock Contention
2. Concurrent Write Hazards
3. Missing Rollback Paths
4. Data Loss Risks
5. Replication Lag
6. Constraint Validation
7. Index Dependencies
8. Foreign Key Implications
## Output Format
| Category | Status | Issue | Recommended Action |
|----------|--------|-------|--------------------|
PASS = no issues. FLAG = potential issue, author should verify. BLOCK = ship this and you will have an incident. Do not ship.
## Rules
- For each FLAG or BLOCK, suggest the safer alternative pattern
- If the migration uses a backfill, verify it batches in chunks of <= 10K rows
- Assume PostgreSQL 14+ unless specified otherwise
- If you can't determine the risk because context is missing, mark it FLAG with note "Verify: <missing context>"
## Final Output
After the table, give an overall verdict: SAFE TO SHIP / SAFE WITH MODIFICATIONS / BLOCK. Justify in one sentence.
This skill is invoked on every PR that touches a migrations/ directory. The output appears as a PR comment within 60 seconds.
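To make that PR comment actionable in CI, the final verdict line can be parsed into an exit status that gates the merge. A sketch, assuming the review output ends with one of the three verdict strings the skill defines; the function names are mine, not part of any tooling:

```python
def extract_verdict(review_output: str) -> str:
    """Find the overall verdict in the skill's review output.

    Scans from the last line backwards so the final-verdict line wins
    over any BLOCK statuses that appear in the per-category table.
    """
    for line in reversed(review_output.strip().splitlines()):
        for verdict in ("SAFE WITH MODIFICATIONS", "SAFE TO SHIP", "BLOCK"):
            if verdict in line.upper():
                return verdict
    return "BLOCK"  # fail closed if no verdict is found

def ci_exit_code(review_output: str) -> int:
    """0 = merge allowed, 1 = block the PR."""
    return 0 if extract_verdict(review_output) != "BLOCK" else 1
```

Failing closed matters here: if the review output is malformed or truncated, the migration should be blocked, not waved through.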
The Patterns Library
Beyond catching dangerous migrations, the skill points authors to safer patterns. I maintain a patterns file the skill references.
Pattern: Add NOT NULL Column on Large Table
Bad:
ALTER TABLE users ADD COLUMN email_verified BOOLEAN NOT NULL DEFAULT false;
This rewrites the table under an exclusive lock on PostgreSQL 10 and earlier. On 11+ a constant default is metadata-only, but the ALTER TABLE still takes a brief ACCESS EXCLUSIVE lock; if it queues behind a long-running transaction, every query behind it queues too.
Good:
-- Migration 1: Add nullable column
ALTER TABLE users ADD COLUMN email_verified BOOLEAN;
-- Migration 2: Backfill in batches (separate migration, deployed later)
UPDATE users SET email_verified = false
WHERE email_verified IS NULL AND id BETWEEN 1 AND 10000;
-- Repeat in batches
-- Migration 3: Add NOT NULL constraint after backfill
ALTER TABLE users ALTER COLUMN email_verified SET NOT NULL;
The skill flags single-migration NOT NULL additions and suggests the three-migration pattern.
Pattern: Create Index on Large Table
Bad:
CREATE INDEX idx_users_email ON users(email);
This locks the table for writes during index creation.
Good:
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
CONCURRENTLY doesn't block writes, but it takes longer and can fail, leaving behind an INVALID index that you have to drop and recreate. The skill flags non-concurrent index creation.
Pattern: Drop Column on Large Table
Bad:
ALTER TABLE users DROP COLUMN deprecated_field;
This is destructive and immediately drops the column. If a deploy is in flight or a rollback is needed, you're stuck.
Good:
-- Step 1: Stop writing to the column (code change, deploy, verify)
-- Step 2: Stop reading from the column (code change, deploy, verify)
-- Step 3: After 1+ weeks of no usage, run the drop
ALTER TABLE users DROP COLUMN deprecated_field;
The skill flags column drops and asks for evidence that reads and writes have been removed.
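The "evidence that reads and writes have been removed" can be gathered mechanically. A hedged sketch that greps a codebase for references to the column — this is my own helper, not part of the skill, and name-based matching will produce false positives on common column names:

```python
from pathlib import Path

def find_column_references(repo_root: str, column: str,
                           extensions=(".py", ".sql", ".rb", ".js")) -> list[tuple[str, int]]:
    """Return (file path, line number) pairs that mention the column name."""
    hits = []
    for path in Path(repo_root).rglob("*"):
        if path.suffix not in extensions or not path.is_file():
            continue
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if column in line:
                hits.append((str(path), lineno))
    return hits
```

An empty result isn't proof of zero usage (dynamic queries, external consumers), which is why the 1+ week waiting period in Step 3 still applies.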
The patterns library is what turns the review skill from a flagging tool into a teaching tool. New engineers see the patterns in action and learn the safe defaults instead of relearning them through incidents.
The PR Workflow
Here's the full flow from migration written to migration deployed.
Step 1: Author Writes Migration
Author drafts the migration in a feature branch. They run the migration locally against a copy of production data (size-reduced).
Step 2: PR Opens
GitHub Action detects a change in migrations/ and triggers the migration review skill. The skill produces the eight-category report and posts it as a PR comment.
Step 3: Author Addresses Findings
The author reviews FLAG and BLOCK findings. For FLAGs, they either fix the migration or add a comment explaining why the flag is acceptable. BLOCKs require restructuring.
Step 4: Senior Review
Once the AI review is clean, a senior engineer (me, on my team) does the human review. The human review focuses on business logic correctness and performance implications that the AI can't see.
Step 5: Staging Deployment
The migration runs against staging first. Staging has production-shaped data (anonymized). I monitor lock duration and replication lag during the staging run. Anything > 5 seconds gets flagged for re-architecture before it touches production.
Step 6: Production Deployment
The migration runs in production with a runbook attached. The runbook documents:
- Estimated duration based on staging
- Monitoring thresholds (alert if lock > 30s, replication lag > 60s)
- Rollback procedure
- Who to call if it goes sideways
The runbook is the most underrated artifact in migration deployment. If you can't write the rollback procedure before you ship, you're not ready to ship.
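The monitoring thresholds can be encoded so the during-migration check is mechanical rather than judgment-in-the-moment. A sketch — the SQL strings query real PostgreSQL system views (pg_stat_activity, and pg_stat_replication on 10+), but the thresholds are just the ones from my runbook:

```python
# Run during the migration, from the primary:
LOCK_WAIT_QUERY = """
    SELECT max(extract(epoch FROM now() - query_start)) AS longest_wait_s
    FROM pg_stat_activity
    WHERE wait_event_type = 'Lock';
"""
# replay_lag is available in pg_stat_replication on PostgreSQL 10+
REPLICATION_LAG_QUERY = """
    SELECT max(extract(epoch FROM replay_lag)) AS lag_s
    FROM pg_stat_replication;
"""

# Runbook thresholds: alert if lock > 30s or replication lag > 60s.
LOCK_ALERT_S = 30
LAG_ALERT_S = 60

def runbook_alerts(longest_lock_wait_s: float, replication_lag_s: float) -> list[str]:
    """Return the alerts to page on, given the two measurements."""
    alerts = []
    if longest_lock_wait_s > LOCK_ALERT_S:
        alerts.append(f"lock wait {longest_lock_wait_s:.0f}s exceeds {LOCK_ALERT_S}s")
    if replication_lag_s > LAG_ALERT_S:
        alerts.append(f"replication lag {replication_lag_s:.0f}s exceeds {LAG_ALERT_S}s")
    return alerts
```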
What I Learned About Backfills
Backfills are the migration pattern that bites teams hardest. They look simple. They are not.
Always Batch
Never run a backfill that updates the entire table in one transaction. Batch in chunks of 1K to 10K rows depending on table size. Larger batches lock more rows for longer.
Throttle Between Batches
Add a delay between batches: pg_sleep(0.1) or equivalent. This lets concurrent transactions in. Without a delay you'll cause replication lag and CPU spikes.
Track Progress
Backfills can take hours. They can fail mid-run. Track progress in a separate table so you can resume from where you left off.
CREATE TABLE backfill_progress (
backfill_name TEXT PRIMARY KEY,
last_id BIGINT NOT NULL,
completed_at TIMESTAMP
);
-- In each batch, update last_id
UPDATE backfill_progress
SET last_id = $batch_max_id
WHERE backfill_name = 'email_verified_backfill';
Idempotent Updates
Write the backfill so re-running it on already-updated rows is safe. Use WHERE column IS NULL filters or equivalent.
The skill flags backfills that don't batch, don't track progress, or aren't idempotent.
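All four rules combine into one driver loop. A sketch using sqlite3 so it runs anywhere — the table and column names mirror the email_verified example above, and in production you'd swap in your Postgres driver and tune the batch size and throttle to your write load:

```python
import sqlite3
import time

def run_backfill(conn, name: str, batch_size: int = 10_000, throttle_s: float = 0.1):
    """Batched, throttled, resumable, idempotent backfill of email_verified.

    Progress is tracked in backfill_progress so a failed run resumes
    from the last completed batch instead of starting over.
    """
    conn.execute("""CREATE TABLE IF NOT EXISTS backfill_progress (
                        backfill_name TEXT PRIMARY KEY, last_id INTEGER NOT NULL)""")
    row = conn.execute("SELECT last_id FROM backfill_progress WHERE backfill_name = ?",
                       (name,)).fetchone()
    last_id = row[0] if row else 0
    (max_id,) = conn.execute("SELECT coalesce(max(id), 0) FROM users").fetchone()

    while last_id < max_id:
        batch_end = last_id + batch_size
        # Idempotent: the IS NULL filter makes re-running a batch a no-op.
        conn.execute("""UPDATE users SET email_verified = 0
                        WHERE email_verified IS NULL AND id > ? AND id <= ?""",
                     (last_id, batch_end))
        conn.execute("""INSERT INTO backfill_progress (backfill_name, last_id)
                        VALUES (?, ?)
                        ON CONFLICT(backfill_name) DO UPDATE SET last_id = excluded.last_id""",
                     (name, batch_end))
        conn.commit()            # one transaction per batch, not per table
        last_id = batch_end
        time.sleep(throttle_s)   # let concurrent transactions in
```

Because progress is keyed by batch boundary rather than row count, ID gaps from deleted rows don't break resumption; some batches just touch fewer rows.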
Real Incidents the Skill Caught
Three real examples from the past year where the skill prevented incidents.
Case 1: Concurrent Index Creation
A junior engineer wrote:
CREATE INDEX idx_orders_status ON orders(status);
The orders table has 80M rows. This would have locked writes for ~25 minutes. The skill flagged "Missing CONCURRENTLY clause on large table." The fix took 30 seconds.
Case 2: Implicit Cascade Delete
A migration added a foreign key:
ALTER TABLE order_items
ADD CONSTRAINT fk_orders
FOREIGN KEY (order_id) REFERENCES orders(id)
ON DELETE CASCADE;
The cascade was unintentional - it would have caused unrelated cleanup jobs to start cascading deletes that the team didn't expect. The skill flagged "ON DELETE CASCADE - verify cascade is intentional." The author confirmed it was a mistake and changed to ON DELETE RESTRICT.
Case 3: Backfill Without Batching
A migration to add a tenant_id to legacy records:
UPDATE legacy_records
SET tenant_id = (SELECT id FROM tenants WHERE name = 'default')
WHERE tenant_id IS NULL;
23M rows. Single transaction. Would have caused 4+ minutes of lock contention. The skill flagged "Backfill not batched on table > 1M rows." The author rewrote with batching.
Want the full migration review skill plus the patterns library and PR templates I use? Grab the database safety toolkit - it ships with the eight-category review prompt, the safe pattern examples, and the runbook template I attach to every production migration.
What I'd Do Differently
Three lessons from building this system.
First: start with the patterns library, not the skill. The skill is only as good as the patterns it can point to. I built the skill first and the patterns ad hoc, which meant early reviews caught issues but didn't teach. Start by writing 10 to 15 safe pattern examples, then build the skill that references them.
Second: invest in the staging data. The review skill catches what it can see in the migration file. The staging deployment catches what only shows up at production scale. If your staging environment doesn't have production-shaped data, you'll get bitten by issues the AI couldn't predict.
Third: write the runbook before writing the migration. If you can't articulate the rollback procedure, you don't understand the migration well enough to ship it. The runbook is a forcing function for clear thinking.
What's Next
The migration review skill is one piece of a broader database safety push. The next layer I'm building is automated query plan analysis: every migration that creates an index also runs EXPLAIN ANALYZE on the queries that should use the index, to verify the index actually helps. The layer after that is data drift detection: a daily job that compares the schema in production to the migrations in the repo and flags drift.
The pattern is the same: take an area where humans are doing high-stakes work without good tools, build the tools that catch the obvious mistakes, and free up the humans for the judgment-heavy parts that AI can't do.
If you take one thing from this article, take this: every dangerous migration shares the same root cause - it ran in a context the author didn't fully understand. The review skill exists to surface that context before the migration ships. Build it once, save yourself from a dozen incidents.