The first production database migration I ran that broke things took down an internal tool for forty-two minutes. The migration looked harmless. It added a NOT NULL column to a table with thirty-eight million rows. I ran it on a Wednesday afternoon, watched it sit at "pending" for a few seconds, then watched our entire app stop responding. Postgres was rewriting the table. Every read and write was queued behind an ACCESS EXCLUSIVE lock. I had no idea this would happen because in development the same migration ran in two hundred milliseconds.
That was the day I learned the difference between a migration that works on a small table and a migration that works on a real production database. They are not the same operation. They have different cost models, different failure modes, and different blast radius. The Postgres docs describe the locking behaviour of every command, but you have to know to look. Most ORM migration tutorials do not even mention locks.
This is the post I wish I had read before that Wednesday. It covers the operations that quietly lock your tables, the expand-and-contract pattern that lets you change schema without downtime, and the migrations I now refuse to run during business hours no matter how confident I am.
The Mental Model You Need First
A Postgres database serving live traffic is a high-frequency machine. Reads and writes are happening every millisecond. Every operation you run has to share the same data with everything else. The way Postgres makes that sharing safe is locks.
There are eight lock modes in Postgres, but the one that matters most for migration safety is ACCESS EXCLUSIVE. That lock blocks every other operation on the table, including reads. If your migration takes ACCESS EXCLUSIVE on a large table and holds it for thirty seconds, your app is down for thirty seconds. If it holds it for thirty minutes, your app is down for thirty minutes.
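If you want to watch this happen in real time rather than infer it from a frozen app, the statistics views will tell you which sessions are blocked and what is blocking them. A diagnostic sketch, using functions available in every supported Postgres version:
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       state,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
Run it from a second connection while a migration is in flight. If the blocking PID is your migration session, everyone behind it is waiting on you.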
The dangerous operations are the ones that quietly take ACCESS EXCLUSIVE while also requiring a full table scan or table rewrite. These are the ones that work fine on a small table in staging and freeze you in production. The list of operations like this is short but important to memorise:
- ALTER TABLE ... ADD COLUMN ... NOT NULL without a constant default
- ALTER TABLE ... ADD COLUMN ... DEFAULT with a volatile default (older Postgres versions)
- ALTER TABLE ... ALTER COLUMN TYPE for most type changes that require a rewrite
- ALTER TABLE ... ADD CONSTRAINT for CHECK, FOREIGN KEY, or UNIQUE without using NOT VALID first
- CREATE INDEX without CONCURRENTLY
- VACUUM FULL
- CLUSTER
Every other migration operation is either cheap (metadata change only) or has a safe variant if you know the magic words. The safe variants are the entire point of this post.
The Five-Step Migration That Made Me Stop Breaking Things
The pattern that solves almost every schema change is called expand-and-contract, or sometimes parallel change. It works because at no point in the migration is the application unable to read or write. The database holds both shapes of the schema at once, and the application code learns to deal with the old shape and the new shape simultaneously during the transition.
The five steps:
- Expand. Add the new schema (new column, new table, new index) in a way that does not break the current application.
- Dual-write. Update the application so every write that touches the old schema also writes to the new schema.
- Backfill. Copy historical data from the old schema to the new schema, in batches, without locking.
- Switch reads. Update the application to read from the new schema. The old schema is still being written to, just in case.
- Contract. Stop writing to the old schema. Drop it.
Each step is a separate deploy. That is not optional. The whole point is that the database is always in a state where both the previous deploy and the next deploy work against it. If you collapse steps, you reintroduce coupling between the database state and the application state and you lose the zero-downtime property.
The most common mistake is trying to do steps 1, 2, and 5 in one deploy. That is the old-school way of writing migrations. It only works if you take downtime. The moment you have real users, you cannot take downtime, and you have to think in deploys.
Adding A Column Safely
The classic example is "add a phone_number column to users." A naive migration looks like this:
ALTER TABLE users
ADD COLUMN phone_number TEXT NOT NULL DEFAULT '';
On Postgres 11 and later, this is actually fast because Postgres stores the default in the catalog and does not rewrite the table. On Postgres 10 or earlier, it rewrites every row. So step zero is "know which version of Postgres you are on." Postgres 18 is the current major as of May 2026, and you should be on it or close to it for any new project.
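Checking takes one statement, so there is no reason to guess:
SHOW server_version;
-- or, machine-readable (major * 10000 + minor, e.g. 180000 for 18.0):
SELECT current_setting('server_version_num');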
Even on modern Postgres, the safe pattern for a column the app actually needs is:
Step 1, expand. Add the column as nullable with no default. This is a metadata-only change and runs in milliseconds.
ALTER TABLE users ADD COLUMN phone_number TEXT;
Step 2, dual-write. Update the application to write phone_number on every insert and update where you have it. Old rows still have NULL. New rows have the value.
Step 3, backfill. If you have a source for historical phone numbers, copy them in batches. Use a script that does small updates with sleeps in between, not a single UPDATE users SET phone_number = ... over the whole table.
UPDATE users
SET phone_number = legacy_phone
WHERE id IN (
SELECT id FROM users
WHERE phone_number IS NULL AND legacy_phone IS NOT NULL
ORDER BY id
LIMIT 1000
);
Run that in a loop with a short pause between batches. Each batch holds row locks for milliseconds. Your application keeps serving traffic.
Step 4, switch reads. Update the application to read phone_number instead of legacy_phone. Both columns still exist.
Step 5, contract. Once you are confident nothing reads legacy_phone, drop it.
ALTER TABLE users DROP COLUMN legacy_phone;
Dropping a column in Postgres is metadata-only. The data is reclaimed by autovacuum later. The drop itself is instant.
If you also want phone_number to be NOT NULL, add the constraint after the backfill, using the safe two-step variant covered later.
Renaming A Column Without Downtime
Renaming a column is the migration most people get wrong. The naive version:
ALTER TABLE users RENAME COLUMN email_address TO email;
This is metadata-only and fast. The problem is not the database. The problem is the deploy. The old application code is still running and querying email_address. The new code queries email. Whichever one runs against the renamed column at any moment will break.
The safe version uses expand-and-contract:
- Add a new email column. Empty.
- Dual-write: every write to email_address also writes to email.
- Backfill: copy historical email_address to email.
- Switch reads to email.
- Stop writing to email_address. Drop it.
Five deploys. Boring. Survives every concurrent deploy and rollback in between.
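If wiring the dual-write through every application code path is painful, one option is to let the database do it with a trigger for the duration of the transition. This is a sketch with illustrative names, roughly the kind of thing tools like pgroll automate; you drop the trigger and function at the contract step.
CREATE OR REPLACE FUNCTION users_sync_email() RETURNS trigger AS $$
BEGIN
  -- Mirror whichever column was written into the other one.
  IF NEW.email IS NULL AND NEW.email_address IS NOT NULL THEN
    NEW.email := NEW.email_address;
  ELSIF NEW.email_address IS NULL AND NEW.email IS NOT NULL THEN
    NEW.email_address := NEW.email;
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_sync_email_trg
BEFORE INSERT OR UPDATE ON users
FOR EACH ROW EXECUTE FUNCTION users_sync_email();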
If you are using an ORM that auto-generates migrations from your models, this pattern requires you to break out of the ORM's default behaviour. You will need to write the SQL by hand or use a tool that understands expand-and-contract natively. pgroll is the open-source tool that does this for Postgres, and it has become the default I reach for in 2026.
Adding An Index Without Locking The Table
Indexes are the single most common cause of accidental downtime. The default CREATE INDEX takes a write lock on the table for the entire duration of the build. On a table with hundreds of millions of rows, that can be ten or twenty minutes. Every write to the table during that time waits.
The fix is one word: CONCURRENTLY.
CREATE INDEX CONCURRENTLY idx_users_email ON users (email);
The concurrent variant takes a much weaker lock that does not block writes. It is slower overall because it scans the table twice, but it does not stop your app.
Two gotchas with CONCURRENTLY that bite people:
The first is that it cannot run inside a transaction. Most migration frameworks wrap migrations in a transaction by default. You have to tell the framework not to. In Prisma, this means writing the migration as a raw SQL file with the right annotation. In Rails, you set disable_ddl_transaction!. In every framework, it is one line of config and forgetting it makes the migration fail loudly.
The second is that CREATE INDEX CONCURRENTLY can fail partway and leave an invalid index sitting on the table. If that happens, the index exists but is marked invalid and will not be used by queries. You have to drop the invalid index and try again. Always check the state of an index after building it:
SELECT indexrelid::regclass, indisvalid
FROM pg_index
WHERE indrelid = 'users'::regclass;
If any indisvalid is false, drop that index and rebuild.
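The cleanup for a failed concurrent build is short. Note that DROP INDEX CONCURRENTLY cannot run inside a transaction either, so treat it the same way as the build:
DROP INDEX CONCURRENTLY IF EXISTS idx_users_email;
CREATE INDEX CONCURRENTLY idx_users_email ON users (email);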
Changing A Column Type Without Rewriting The Table
Some type changes are free. VARCHAR(50) to VARCHAR(100) is free. INTEGER to BIGINT is not free; it rewrites the table. TEXT to INTEGER is definitely not free.
The expansion-and-contract pattern works for type changes too:
- Add a new column with the target type.
- Dual-write to both the old column and the new column, casting the value on the way in.
- Backfill historical rows in batches.
- Switch reads to the new column.
- Drop the old column. Rename the new column to the old name (this is metadata-only).
The "rename to the old name" step is what makes this transparent to the rest of the application. By the end, the column name is the same, the type is different, and there was no table rewrite.
For columns that need to grow from INTEGER to BIGINT (because you ran out of auto-increment space), this pattern is the standard answer. Doing it the naive way on a billion-row table is a multi-hour outage.
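As a sketch of step 5 for a plain (non-key) column, with made-up names for a hypothetical events.account_id being widened: primary keys need extra care for sequences and constraints, but the contract itself is two metadata-only statements.
-- Step 5, contract: run only after dual-write, backfill, and switched reads.
BEGIN;
ALTER TABLE events DROP COLUMN account_id;
ALTER TABLE events RENAME COLUMN account_id_bigint TO account_id;
COMMIT;
Both statements take ACCESS EXCLUSIVE, but since no data moves they hold it for milliseconds.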
Constraints Without Lock Storms
Adding a NOT NULL constraint on an existing column without taking a long lock requires two steps:
-- Step 1: add the constraint as NOT VALID
ALTER TABLE users
ADD CONSTRAINT users_email_not_null
CHECK (email IS NOT NULL) NOT VALID;
-- Step 2: validate the constraint without an exclusive lock
ALTER TABLE users VALIDATE CONSTRAINT users_email_not_null;
The NOT VALID part tells Postgres "trust me, future writes will satisfy this, but do not check the existing rows yet." That part is fast and only locks briefly. The VALIDATE step scans the table but does so without blocking reads or writes.
Once the constraint is validated, you can convert it to a proper NOT NULL on the column itself if you want. In Postgres 12 and later, you can do this without a full table scan because Postgres knows the validated check constraint already proved every row is non-null.
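With the constraint validated, the promotion looks like this on Postgres 12 or later:
-- Fast: the validated check constraint stands in as proof of non-null
ALTER TABLE users ALTER COLUMN email SET NOT NULL;
-- The check constraint is now redundant and can be dropped
ALTER TABLE users DROP CONSTRAINT users_email_not_null;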
The same pattern works for FOREIGN KEY constraints. Add as NOT VALID, then VALIDATE separately. Splitting the operation into the cheap declaration and the slower scan is how you keep production responsive.
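The foreign key version of the same two-step, with illustrative table and column names:
ALTER TABLE orders
ADD CONSTRAINT orders_user_id_fkey
FOREIGN KEY (user_id) REFERENCES users (id) NOT VALID;

ALTER TABLE orders VALIDATE CONSTRAINT orders_user_id_fkey;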
For UNIQUE constraints, the trick is to build the underlying index CONCURRENTLY, then attach the constraint to it:
CREATE UNIQUE INDEX CONCURRENTLY idx_users_email_unique ON users (email);
ALTER TABLE users ADD CONSTRAINT users_email_unique UNIQUE USING INDEX idx_users_email_unique;
The constraint creation in the second step is metadata-only because the index already exists. Zero downtime.
The Backfill Script That Will Not Crash Your Database
Backfills are where production migrations go to die. People write a single UPDATE over millions of rows, that update holds locks and bloats the WAL, and the database starts misbehaving. Then they kill the query, which Postgres has to roll back, which takes longer than running the query in the first place.
The pattern that works is small batches, slow pace, observable progress. Something like:
// Assumes a Postgres client exposing query(text, params), e.g. node-postgres.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function backfillPhoneNumbers() {
  // Find the highest id once so the loop knows when it has covered the table.
  const { rows } = await db.query('SELECT COALESCE(MAX(id), 0) AS max_id FROM users');
  const maxUserId = Number(rows[0].max_id);

  let lastId = 0;
  const batchSize = 1000;

  while (lastId < maxUserId) {
    // One primary-key range per statement; each batch holds row locks briefly.
    await db.query(
      `UPDATE users
       SET phone_number = legacy_phone
       WHERE id > $1
         AND id <= $2
         AND phone_number IS NULL
         AND legacy_phone IS NOT NULL`,
      [lastId, lastId + batchSize]
    );
    lastId += batchSize;
    await sleep(100); // breathing room for normal traffic and replication
  }
}
The script processes a thousand rows at a time, sleeps for a hundred milliseconds between batches, and tracks progress by primary key. Each batch holds row locks for a tiny window. If the script dies, it can resume from the last lastId. If it falls behind, the sleep can be reduced. If the database starts struggling, the sleep can be increased without changing the logic.
Run this from a worker, not from inside your migration tool. Migrations should be schema changes. Data backfills are application code with retry semantics, and they belong in your background job system.
The Tools That Make Expand-And-Contract Less Painful
The expand-and-contract pattern is correct but verbose. Five deploys for one schema change is a lot of process for a small team. Two tools have changed how I do this in 2026.
pgroll by Xata. An open-source tool that runs schema migrations as multi-step plans, automatically maintaining both the old and new schemas during the rollout. You describe the migration in a JSON or YAML file (add column, rename column, change type), and pgroll figures out the expand-contract steps and runs them safely. It serves both schema versions through Postgres views so the old app and the new app can both read and write during the transition. The contraction step happens when you explicitly complete the migration.
The other one is Atlas by Ariga. A schema-as-code tool that lints migrations for unsafe operations, warns you about lock implications, and integrates with CI to block dangerous migrations before they merge. The linter alone has saved me twice from running an ALTER COLUMN TYPE that would have rewritten a large table.
For projects already on a framework-specific migration system (Prisma, Drizzle, ActiveRecord, Alembic), the workflow is usually:
- Write the migration in the framework's native format for the safe operations.
- For the dangerous operations, write raw SQL and add the safe variants by hand.
- Run pgroll or Atlas alongside the framework, or graduate to them for any non-trivial change.
The Prisma migrate workflow in particular has gotten better about generating CREATE INDEX CONCURRENTLY and similar safe variants in 2025 and 2026, but it still cannot reason about expand-contract across multiple deploys. That part is your job. The Prisma vs Neon comparison covers more of the Postgres-on-modern-platforms angle if you are picking a stack from scratch.
What I Refuse To Run During Business Hours
Even with the safe patterns, some operations should not be run during peak traffic. The list:
VACUUM FULL on any table, even one that fits in cache. It rewrites the table from scratch and takes ACCESS EXCLUSIVE for the entire duration. There is almost never a case where you actually need it in 2026 because routine autovacuum handles bloat for most workloads. If you do need it, schedule a maintenance window.
Any backfill that touches a hot table (high write volume from your app). Even with small batches, the contention with normal traffic can degrade performance. Run backfills during low-traffic hours, even if the script is technically safe at any time.
Reindexing a unique index, even with CONCURRENTLY. The brief windows of exclusive locking around the rebuild can stack with your app traffic and cause weird stalls. Schedule it for quiet hours.
Any ALTER TABLE that you have not tested against a production-sized dataset. Staging databases are usually a fraction of production size. The lock behaviour is the same, but the duration scales with row count. Test against a copy of production or against a representative subset.
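Knowing the production row count and on-disk size before you run anything is a two-statement check:
SELECT pg_size_pretty(pg_total_relation_size('users'));          -- table plus indexes and TOAST
SELECT reltuples::bigint FROM pg_class WHERE relname = 'users';  -- row estimate maintained by autovacuum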
The general rule: if a migration could theoretically lock the table, the time of day matters even if you think it will be fast. The cost of being wrong at 2pm is much higher than the cost of being wrong at 2am.
Connection Pools And Migration Coordination
The other thing that bites people during migrations is the connection pool. Many migration tools take a separate connection to run the migration. Your app holds connections through PgBouncer or a built-in pool. If your migration needs an exclusive lock while an app connection sits mid-transaction on the same table, the migration queues behind that transaction, and every new query on the table queues behind the migration.
The fix is to set a lock_timeout on the migration session:
SET lock_timeout = '5s';
ALTER TABLE users ADD COLUMN phone_number TEXT;
If the migration cannot acquire its lock within five seconds, it fails fast instead of blocking the entire database. You catch the error in your migration tool, wait, and retry. This is much better than the default behaviour, which is to wait forever and queue every other query behind you.
Pair this with statement_timeout to prevent any single migration statement from running indefinitely. Five seconds is a good default for the lock_timeout. The statement_timeout depends on the operation, but for any single DDL statement on a properly-sized table, anything over thirty seconds is suspicious.
SET lock_timeout = '5s';
SET statement_timeout = '30s';
For long backfills, the statement_timeout needs to be larger, but those should be running as application code with their own retry logic, not as migration statements.
Rollback Plans That Actually Work
The version of "rollback" you learn in tutorials does not work for real production migrations. The naive idea is "if the new code breaks, roll back to the old code, and roll back the migration." This fails because the migration usually cannot be reversed without data loss. Dropping a column you just added is fine. Reverting a column rename is fine if you saved the old data, which you did because you used expand-and-contract. But reverting a backfill that mutated data is generally impossible without restoring from backup.
The actual rollback strategy that works:
For the expand-and-contract pattern, the rollback at any step is to deploy the previous version of the application. The database state is forward-compatible at every step because both schemas are present. You never need to undo a migration mid-flight, you just deploy the previous code and the schema continues to support it.
For destructive steps (dropping a column, dropping a table), there is no rollback. The data is gone. The mitigation is to leave the destructive step until you are absolutely certain nothing reads or writes the old column. In practice this means waiting a full release cycle (or two) between the "switch reads" step and the "contract" step.
For data mutations (a backfill that transformed a column), the rollback is a restore from backup. This is fine for non-critical data. For critical data, the right strategy is to not mutate in place. Add a new column, backfill the new column with the transformed value, and switch reads. The original data is preserved until the final contraction step, which you delay until you trust the new column.
Take a logical backup before every production migration that touches data. Postgres pg_dump of the affected tables, copied somewhere safe, dated. The backup will be useless 99% of the time. The 1% will save your business.
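The dump itself is one command; the table list and file naming here are placeholders for whatever fits your setup:
pg_dump "$DATABASE_URL" --table=users --format=custom --file="users-$(date +%Y-%m-%d).dump"
# then copy the dump to object storage with whatever tooling you already use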
What I Run In Production
The setup I have arrived at after enough scars:
Every migration goes through a CI pipeline that runs Atlas in lint mode against the migration file. The lint catches unsafe operations (missing CONCURRENTLY, NOT NULL without NOT VALID, type changes without expand-contract). The pipeline blocks the PR if any unsafe pattern is detected.
Every dangerous migration is broken into expand-and-contract steps. I use pgroll when the team is willing to learn it. For projects where pgroll is too much overhead, I write the steps as separate migration files with clear comments about which deploy each one belongs to.
Every backfill runs as a worker job, not as part of the migration. The worker uses small batches, primary-key cursors, and a sleep between batches. The job's progress is visible in the same observability stack the rest of the app uses. The production observability post for solo developers covers what that stack looks like for small teams.
Every migration session sets lock_timeout = '5s' and statement_timeout appropriate to the operation. Migrations fail fast instead of blocking the app.
Every destructive step (dropping a column or table) is delayed until at least one full release cycle after the switch-reads step. I have never regretted the wait. I have regretted not waiting.
Every production migration that touches data takes a pg_dump of the affected tables first. The dump is stored in object storage with a date prefix and a retention policy. The dump itself takes seconds for most tables. The peace of mind is worth the seconds.
Every migration is reviewed by at least one other engineer who has the explicit job of asking "what does this lock, and for how long." The review is not about the SQL syntax. The review is about the production behaviour.
What I Would Tell You If You Asked
Most production database outages are not from queries. They are from migrations. The migration runs, takes a lock you did not expect, the lock cascades into every other operation, and your app is down before you have finished refreshing the deploy log.
The fix is not to be smarter at writing migrations. The fix is to assume every operation is dangerous until you have verified it is not. Look up the lock mode in the Postgres docs. Check the version-specific behaviour. Test against a production-sized dataset. Use CONCURRENTLY, NOT VALID, expand-and-contract. Split every dangerous change into multiple deploys.
The reason this matters more in 2026 than it did five years ago is that databases have gotten bigger faster than tooling has gotten safer. A SaaS that hit ten million rows in 2020 hits a hundred million in 2026 because the underlying compute is cheaper and AI features generate more rows per user. The migrations that were fine on small tables are not fine on the new tables. The patterns that protected you at small scale do not protect you at the scale you are about to operate at.
The good news is that the safe patterns are well-known and the tools are good. pgroll, Atlas, and the safe variants built into your migration framework cover almost every case. The hard part is internalising that "safe" is not the default and you have to ask for it explicitly. Once you do, the production-Postgres game becomes much less scary.
The broader instinct, the one that generalises beyond databases, is to ask "what locks while this runs and what happens to everyone waiting." That question is the difference between code that runs in isolation and code that runs in production. It is the same question that separates a working webhook handler from a Stripe webhook handler that does not double-bill users, and the same question that separates a real-time feature from a real-time feature that survives a network blip. Production is the place where invisible costs become visible. Migrations are the part where they become visible to your customers.
Run your migrations like the database is also doing other things. Because it is.