
Nex Tools

Posted on • Originally published at nextools.hashnode.dev

Claude Code for Database Migrations: How I Stopped Breaking Production With Schema Changes

Originally published on Hashnode. Cross-posted for the DEV.to community.

The first time I broke production with a migration, I was running a NOT NULL constraint on a 12-million-row users table during peak traffic. The query locked the table, the API timed out, payments failed, and I spent the next 90 minutes hunched over a Slack incident channel watching my MRR bleed in real time. The migration was technically correct. The execution context was completely wrong.

Since then I've built a Claude Code workflow that catches dangerous migrations before they ship. Every migration in my codebase goes through an AI safety review that flags lock contention, missing rollback paths, and concurrent write hazards. I haven't shipped a migration-driven incident in 14 months. This is the workflow.


What Makes Database Migrations Different

A regular code change is reversible. If you ship a bug, you redeploy. The damage is bounded by how long it took you to notice.

Migrations are not reversible. Once you've added a column, dropped an index, or changed a type, the data is in the new shape. Rolling back means another migration that has to handle whatever data accumulated in between. If the migration corrupted data, no rollback recovers it.

The asymmetry between regular code and migrations is why migration review is its own discipline. The questions you ask are different. The risks are different. The tools are different. Treating migrations like regular PRs is how teams break production.

Every migration is a one-way door. Treating it like a two-way door is how you end up writing apology letters to customers.


The Eight Risk Categories

After analyzing every migration incident I've seen across three companies, I categorized the failure modes into eight buckets. My migration review skill checks all eight on every migration.

1. Lock Contention

Some operations lock the entire table. On a small table this is invisible. On a multi-million row table during peak traffic, it kills your API.

Operations that lock in PostgreSQL:

  • ADD COLUMN with a default value (PostgreSQL 10 and earlier)
  • ALTER COLUMN type changes
  • SET NOT NULL on an existing column (full table scan under an exclusive lock)
  • CREATE INDEX without CONCURRENTLY
  • VACUUM FULL

The skill flags these and suggests safer alternatives.
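Before the model ever sees a migration, the unambiguous cases can be caught with a plain pattern pre-filter. This is a hypothetical sketch of that idea, not the skill itself; the pattern list and hints are mine, and a real checker would need to handle comments and dynamic SQL:

```python
import re

# Each entry pairs a locking operation with the safer alternative to suggest.
LOCKING_PATTERNS = [
    (re.compile(r"ADD\s+COLUMN\b[^;]*\bDEFAULT\b", re.I | re.S),
     "Rewrites the table on PostgreSQL 10 and earlier; add nullable, then backfill"),
    (re.compile(r"ALTER\s+COLUMN\s+\w+\s+TYPE\b", re.I),
     "Type changes rewrite the table; add a new column and migrate gradually"),
    (re.compile(r"SET\s+NOT\s+NULL\b", re.I),
     "Full table scan under an exclusive lock; backfill and validate first"),
    (re.compile(r"CREATE\s+(UNIQUE\s+)?INDEX\s+(?!CONCURRENTLY)", re.I),
     "Blocks writes for the duration; use CREATE INDEX CONCURRENTLY"),
    (re.compile(r"VACUUM\s+FULL\b", re.I),
     "Takes an ACCESS EXCLUSIVE lock; use plain VACUUM or pg_repack"),
]

def lock_flags(sql: str) -> list[str]:
    """Return the safer-alternative hint for every locking pattern found."""
    return [hint for pattern, hint in LOCKING_PATTERNS if pattern.search(sql)]
```

A pre-filter like this costs nothing to run on every PR and guarantees the obvious cases get flagged even if the LLM review misses them.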

2. Concurrent Write Hazards

Backfill migrations that update millions of rows can deadlock with concurrent writes. The skill checks for backfills and flags whether they batch and whether they use appropriate isolation levels.

3. Missing Rollback Paths

Every migration should have a documented rollback strategy. The skill flags migrations that don't define rollback or where the rollback would lose data.

4. Data Loss Risks

Some operations destroy data: DROP COLUMN, DROP TABLE, TRUNCATE. The skill flags these and verifies the destruction is intentional.

5. Replication Lag

Long-running migrations can cause replicas to fall behind. The skill estimates row counts and flags migrations that would take more than 60 seconds to replicate.

6. Constraint Validation

Adding a constraint to existing data can fail mid-migration if any row violates the constraint. The skill flags constraint additions and asks whether existing data has been validated.

7. Index Dependencies

Dropping an index can tank query performance if any query was using it. The skill checks if there are queries in the codebase that match the index pattern and flags potential performance regressions.
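The codebase check can be as simple as a recursive grep for the indexed column inside query clauses. A deliberately crude sketch (the file extensions and clause list are my assumptions; false positives are fine here because a FLAG just asks the author to verify):

```python
import re
from pathlib import Path

def queries_touching(column: str, repo: Path) -> list[str]:
    """Find source files whose queries filter, join, or sort on a column.

    Crude by design: a dropped index is only safe if nothing filters or
    orders by its columns, so any WHERE/ORDER BY/JOIN mention earns a FLAG.
    """
    needle = re.compile(
        rf"(WHERE|ORDER\s+BY|JOIN)[^;]*\b{re.escape(column)}\b", re.I | re.S
    )
    hits = []
    for path in repo.rglob("*"):
        if path.is_file() and path.suffix in {".py", ".sql", ".rb", ".go"}:
            if needle.search(path.read_text(errors="ignore")):
                hits.append(str(path))
    return hits
```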

8. Foreign Key Implications

Adding or removing foreign keys can affect cascade behavior. The skill flags FK changes and asks about cascade implications.


The Migration Review Skill

The skill that runs all eight checks looks like this.

---
name: migration-review
description: Reviews a database migration file for safety risks before deployment.
---

# Migration Review

You are a senior database engineer reviewing a migration before it ships to production. The database has 50M+ rows in major tables and serves 10K req/min at peak. Downtime is unacceptable.

## Your Task

Review the migration file at the provided path. For each of the eight risk categories below, produce a finding: PASS, FLAG, or BLOCK.

1. Lock Contention
2. Concurrent Write Hazards
3. Missing Rollback Paths
4. Data Loss Risks
5. Replication Lag
6. Constraint Validation
7. Index Dependencies
8. Foreign Key Implications

## Output Format

| Category | Status | Issue | Recommended Action |
|----------|--------|-------|--------------------|

PASS = no issues. FLAG = potential issue, author should verify. BLOCK = ship this and you will have an incident. Do not ship.

## Rules

- For each FLAG or BLOCK, suggest the safer alternative pattern
- If the migration uses a backfill, verify it batches in chunks of <= 10K rows
- Assume PostgreSQL 14+ unless specified otherwise
- If you can't determine the risk because context is missing, mark it FLAG with note "Verify: <missing context>"

## Final Output

After the table, give an overall verdict: SAFE TO SHIP / SAFE WITH MODIFICATIONS / BLOCK. Justify in one sentence.

This skill is invoked on every PR that touches a migrations/ directory. The output appears as a PR comment within 60 seconds.
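The roll-up from eight findings to one verdict is mechanical, which makes it worth encoding outside the model so the PR gate can't drift. A sketch of that aggregation (the names mirror the skill's output format; the enum is my own framing):

```python
from enum import Enum

class Status(Enum):
    PASS = 1
    FLAG = 2
    BLOCK = 3

def overall_verdict(findings: dict[str, Status]) -> str:
    """Roll the eight per-category findings up into the final verdict."""
    worst = max(findings.values(), key=lambda s: s.value)
    if worst is Status.BLOCK:
        return "BLOCK"
    if worst is Status.FLAG:
        return "SAFE WITH MODIFICATIONS"
    return "SAFE TO SHIP"
```

Keeping the verdict logic in deterministic code means the CI job can fail the check on BLOCK without trusting the model to phrase the verdict consistently.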


The Patterns Library

Beyond catching dangerous migrations, the skill points authors to safer patterns. I maintain a patterns file the skill references.

Pattern: Add NOT NULL Column on Large Table

Bad:

ALTER TABLE users ADD COLUMN email_verified BOOLEAN NOT NULL DEFAULT false;

This rewrites the entire table under lock on PostgreSQL 10 and earlier. On 11+ a constant default is metadata-only, but the migration still takes a brief ACCESS EXCLUSIVE lock and can queue behind long-running queries, blocking everything queued after it.

Good:

-- Migration 1: Add nullable column
ALTER TABLE users ADD COLUMN email_verified BOOLEAN;

-- Migration 2: Backfill in batches (separate migration, deployed later)
UPDATE users SET email_verified = false 
WHERE email_verified IS NULL AND id BETWEEN 1 AND 10000;
-- Repeat in batches

-- Migration 3: Add NOT NULL constraint after backfill
ALTER TABLE users ALTER COLUMN email_verified SET NOT NULL;

The skill flags single-migration NOT NULL additions and suggests the three-migration pattern.

Pattern: Create Index on Large Table

Bad:

CREATE INDEX idx_users_email ON users(email);

This locks the table for writes during index creation.

Good:

CREATE INDEX CONCURRENTLY idx_users_email ON users(email);

CONCURRENTLY doesn't block writes but takes longer and can fail. A failed run leaves behind an INVALID index that must be dropped before retrying. The skill flags non-concurrent index creation.

Pattern: Drop Column on Large Table

Bad:

ALTER TABLE users DROP COLUMN deprecated_field;

This is destructive and immediately drops the column. If a deploy is in flight or a rollback is needed, you're stuck.

Good:

-- Step 1: Stop writing to the column (code change, deploy, verify)
-- Step 2: Stop reading from the column (code change, deploy, verify)
-- Step 3: After 1+ weeks of no usage, run the drop
ALTER TABLE users DROP COLUMN deprecated_field;

The skill flags column drops and asks for evidence that reads and writes have been removed.

The patterns library is what turns the review skill from a flagging tool into a teaching tool. New engineers see the patterns in action and learn the safe defaults instead of relearning them through incidents.


The PR Workflow

Here's the full flow from migration written to migration deployed.

Step 1: Author Writes Migration

Author drafts the migration in a feature branch. They run the migration locally against a copy of production data (size-reduced).

Step 2: PR Opens

GitHub Action detects a change in migrations/ and triggers the migration review skill. The skill produces the eight-category report and posts it as a PR comment.

Step 3: Author Addresses Findings

The author reviews FLAG and BLOCK findings. For FLAGs, they either fix the migration or add a comment explaining why the flag is acceptable. BLOCKs require restructuring.

Step 4: Senior Review

Once the AI review is clean, a senior engineer (me, on my team) does the human review. The human review focuses on business logic correctness and performance implications that the AI can't see.

Step 5: Staging Deployment

The migration runs against staging first. Staging has production-shaped data (anonymized). I monitor lock duration and replication lag during the staging run. Anything > 5 seconds gets flagged for re-architecture before it touches production.
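The staging check boils down to two numbers and one threshold. A sketch under stated assumptions: the SQL constant uses the standard pg_stat_activity view, and the gate function is hypothetical glue for the 5-second rule above:

```python
# Longest current lock wait, from PostgreSQL's standard activity view.
LOCK_WAIT_SQL = """
SELECT max(now() - query_start)
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
"""

def staging_gate(max_lock_wait_s: float, max_replication_lag_s: float,
                 threshold_s: float = 5.0) -> str:
    """Decide what happens after a staging run: anything over the
    threshold gets re-architected before it touches production."""
    worst = max(max_lock_wait_s, max_replication_lag_s)
    if worst > threshold_s:
        return "re-architect before production"
    return "proceed to production"
```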

Step 6: Production Deployment

The migration runs in production with a runbook attached. The runbook documents:

  • Estimated duration based on staging
  • Monitoring thresholds (alert if lock > 30s, replication lag > 60s)
  • Rollback procedure
  • Who to call if it goes sideways

The runbook is the most underrated artifact in migration deployment. If you can't write the rollback procedure before you ship, you're not ready to ship.


What I Learned About Backfills

Backfills are the migration pattern that bites teams hardest. They look simple. They are not.

Always Batch

Never run a backfill that updates the entire table in one transaction. Batch in chunks of 1K to 10K rows depending on table size. Larger batches lock more rows for longer.

Throttle Between Batches

Add a delay between batches: pg_sleep(0.1) or equivalent. This lets concurrent transactions in. Without a delay you'll cause replication lag and CPU spikes.

Track Progress

Backfills can take hours. They can fail mid-run. Track progress in a separate table so you can resume from where you left off.

CREATE TABLE backfill_progress (
    backfill_name TEXT PRIMARY KEY,
    last_id BIGINT NOT NULL,
    completed_at TIMESTAMP
);

-- In each batch, update last_id
UPDATE backfill_progress 
SET last_id = $batch_max_id 
WHERE backfill_name = 'email_verified_backfill';

Idempotent Updates

Write the backfill so re-running it on already-updated rows is safe. Use WHERE column IS NULL filters or equivalent.

The skill flags backfills that don't batch, don't track progress, or aren't idempotent.
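All four rules combine into one loop shape. Here's a minimal sketch of it, using sqlite3 so it runs anywhere; the table and column names mirror the email_verified example above, and against PostgreSQL you'd swap in a psycopg connection, ON CONFLICT for the progress upsert, and keep the per-batch commit:

```python
import sqlite3
import time

def run_backfill(conn, name, batch_size=10_000, pause_s=0.1):
    """Batched, throttled, resumable, idempotent backfill sketch."""
    conn.execute("""CREATE TABLE IF NOT EXISTS backfill_progress (
        backfill_name TEXT PRIMARY KEY, last_id INTEGER NOT NULL)""")
    row = conn.execute(
        "SELECT last_id FROM backfill_progress WHERE backfill_name = ?",
        (name,)).fetchone()
    last_id = row[0] if row else 0  # resume where the last run stopped
    while True:
        # Idempotent: the IS NULL filter makes already-done rows a no-op.
        conn.execute(
            "UPDATE users SET email_verified = 0 "
            "WHERE email_verified IS NULL AND id > ? AND id <= ?",
            (last_id, last_id + batch_size))
        max_id = conn.execute("SELECT max(id) FROM users").fetchone()[0] or 0
        last_id += batch_size
        conn.execute("INSERT OR REPLACE INTO backfill_progress VALUES (?, ?)",
                     (name, last_id))
        conn.commit()  # one transaction per batch, never one for the table
        if last_id >= max_id:
            break
        time.sleep(pause_s)  # throttle: let concurrent transactions in
```

The client-side sleep plays the same role as pg_sleep between batches: it gives replication and concurrent writers room to breathe.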


Real Incidents the Skill Caught

Three real examples from the past year where the skill prevented incidents.

Case 1: Concurrent Index Creation

A junior engineer wrote:

CREATE INDEX idx_orders_status ON orders(status);

The orders table has 80M rows. This would have locked writes for ~25 minutes. The skill flagged "Missing CONCURRENTLY clause on large table." The fix took 30 seconds.

Case 2: Implicit Cascade Delete

A migration added a foreign key:

ALTER TABLE order_items 
ADD CONSTRAINT fk_orders 
FOREIGN KEY (order_id) REFERENCES orders(id) 
ON DELETE CASCADE;

The cascade was unintentional - it would have caused unrelated cleanup jobs to start cascading deletes that the team didn't expect. The skill flagged "ON DELETE CASCADE - verify cascade is intentional." The author confirmed it was a mistake and changed to ON DELETE RESTRICT.

Case 3: Backfill Without Batching

A migration to add a tenant_id to legacy records:

UPDATE legacy_records 
SET tenant_id = (SELECT id FROM tenants WHERE name = 'default') 
WHERE tenant_id IS NULL;

23M rows. Single transaction. Would have caused 4+ minutes of lock contention. The skill flagged "Backfill not batched on table > 1M rows." The author rewrote with batching.

Want the full migration review skill plus the patterns library and PR templates I use? Grab the database safety toolkit - it ships with the eight-category review prompt, the safe pattern examples, and the runbook template I attach to every production migration.


What I'd Do Differently

Three lessons from building this system.

First: start with the patterns library, not the skill. The skill is only as good as the patterns it can point to. I built the skill first and the patterns ad hoc, which meant early reviews caught issues but didn't teach. Start by writing 10 to 15 safe pattern examples, then build the skill that references them.

Second: invest in the staging data. The review skill catches what it can see in the migration file. The staging deployment catches what only shows up at production scale. If your staging environment doesn't have production-shaped data, you'll get bitten by issues the AI couldn't predict.

Third: write the runbook before writing the migration. If you can't articulate the rollback procedure, you don't understand the migration well enough to ship it. The runbook is a forcing function for clear thinking.


What's Next

The migration review skill is one piece of a broader database safety push. The next layer I'm building is automated query plan analysis: every migration that creates an index also runs EXPLAIN ANALYZE on the queries that should use the index, to verify the index actually helps. The layer after that is data drift detection: a daily job that compares the schema in production to the migrations in the repo and flags drift.
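The query plan check I'm describing reduces to scanning EXPLAIN output for the index name. A sketch of that core, assuming PostgreSQL's text plan format, which prints "Index Scan using <name>", "Index Only Scan using <name>", or "Bitmap Index Scan on <name>" when an index is chosen:

```python
def index_used(explain_output: str, index_name: str) -> bool:
    """Check whether a PostgreSQL text-format EXPLAIN plan uses an index.

    Covers Index Scan / Index Only Scan ("using <name>") and
    Bitmap Index Scan ("on <name>") plan nodes.
    """
    return (f"using {index_name}" in explain_output
            or f"Bitmap Index Scan on {index_name}" in explain_output)
```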

The pattern is the same: take an area where humans are doing high-stakes work without good tools, build the tools that catch the obvious mistakes, and free up the humans for the judgment-heavy parts that AI can't do.

If you take one thing from this article, take this: every dangerous migration shares the same root cause - it ran in a context the author didn't fully understand. The review skill exists to surface that context before the migration ships. Build it once, save yourself from a dozen incidents.
