Gabriel Anhaia

Posted on May 24

Logical Replication for Migrations: Zero-Downtime Postgres Upgrades in 2026

#postgres #database #devops #replication

Book: Database Playbook: Choosing the Right Store for Every System You Build
Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

Major Postgres upgrades used to mean a maintenance window. With logical replication, they're zero-downtime, if you set it up right. There are three traps that turn a 20-minute cutover into a 6-hour incident, and none of them are in the official upgrade docs.

This is the playbook a team I work with uses for 13 to 16 to 17 jumps on databases that can't take a banner. Real SQL, the sequence script that always gets forgotten, and the verification step that catches the silent data drift before users do.

Why pg_upgrade is the old way

pg_upgrade is fast. On the same machine, with hard links, it can move a 500GB cluster in under a minute. The catch is everything around it.

You stop the database. You run the upgrade. You restart it. You analyze statistics so the planner doesn't fall over on the first query. If anything fails partway, you're recovering from backup while your status page bleeds. And pg_upgrade doesn't help at all if you're moving across machines: different host, different storage class, different cloud region.

A team I talked to last quarter ran pg_upgrade on a 2TB cluster as part of a 13 to 16 jump. The binary swap took 90 seconds. The ANALYZE on the upgraded cluster took 47 minutes. During those 47 minutes, the application was up but every query that touched a non-trivial index ran a sequential scan. Their p99 went from 80ms to 14 seconds. Customers noticed.

Logical replication sidesteps all of that. You build the new cluster while the old one is still serving traffic. You analyze on the new cluster while nobody's watching it. You cut over at a moment of your choosing, with seconds of write pause instead of minutes of downtime.

Logical replication primer

Logical replication moves row-level changes from a publisher to a subscriber over a normal Postgres connection. It's been stable since Postgres 10 and has gotten meaningfully better every release since.

Three pieces you need to understand:

Publication: a named set of tables on the source. Created with CREATE PUBLICATION.
Subscription: a connection from the target back to the source that pulls changes for a publication. Created with CREATE SUBSCRIPTION.
Replication slot: a server-side bookmark on the publisher that tracks how far the subscriber has consumed. The slot keeps WAL files around until the subscriber acks them, which is why a stuck subscriber can fill your disk.

The flow: the publisher decodes its WAL into row changes, ships them over the wire, the subscriber applies them with regular SQL. Because it's logical (rows, not pages), the publisher and subscriber can run different Postgres versions. That's what makes this whole thing work.

The 5-step zero-downtime upgrade

The whole sequence assumes you have a target cluster running the new Postgres version, reachable from the source over the network, with enough disk for the data plus 30% headroom for the initial sync.

Step 1. Prepare the schema on the target

Logical replication does not copy the schema. You have to dump it yourself.

# on a host that can reach the source
pg_dump \
  --host=postgres-13.internal \
  --port=5432 \
  --username=migrator \
  --dbname=orders \
  --schema-only \
  --no-owner \
  --no-acl \
  --file=orders-schema.sql

# apply it to the target
psql \
  --host=postgres-16.internal \
  --port=5432 \
  --username=migrator \
  --dbname=orders \
  --file=orders-schema.sql

Disable foreign key checks on the subscriber? No, leave them on. Logical replication applies changes in commit order, so FK ordering works out. What you do need to disable, for the initial sync, is any trigger-based logic that would double-fire on rows that already happened upstream. The subscriber respects session_replication_role = replica, but only for triggers explicitly marked ENABLE REPLICA TRIGGER. By default, your application triggers sit out during replication apply. Good.

Step 2. Create the publication on the source

-- on postgres-13.internal
CREATE PUBLICATION orders_migration FOR ALL TABLES;

FOR ALL TABLES is the right call for a full-cluster migration. If you want a subset, list them: FOR TABLE customers, orders, line_items. Either way, this is cheap. It's metadata, no data movement yet.

Confirm the publisher's wal_level is logical. If it's not, you're restarting the source to flip it, which is the one downtime moment in this whole plan that you can't avoid. Check ahead of time:

SHOW wal_level;
-- if 'replica', set 'logical' in postgresql.conf and restart

Step 3. Create the subscription on the target

-- on postgres-16.internal
CREATE SUBSCRIPTION orders_migration_sub
  CONNECTION 'host=postgres-13.internal port=5432 dbname=orders user=replicator password=...'
  PUBLICATION orders_migration
  WITH (
    copy_data = true,
    create_slot = true,
    slot_name = 'orders_migration_slot',
    streaming = 'parallel',
    binary = true
  );

A few choices worth knowing:

copy_data = true triggers the initial table sync: every existing row gets copied before incremental changes start applying. For a 2TB cluster expect hours.
streaming = 'parallel' (Postgres 16+) lets large transactions stream as they happen instead of waiting for commit. Big win for write-heavy workloads.
binary = true is faster but requires matching column types on both sides. If you changed a type during this migration, leave it false.

Watch the sync progress:

SELECT
  subname,
  pid,
  received_lsn,
  latest_end_lsn,
  pg_size_pretty(
    pg_wal_lsn_diff(latest_end_lsn, received_lsn)
  ) AS lag_bytes
FROM pg_stat_subscription;

When lag_bytes is 0 bytes and stays there during normal write load, you're caught up.

Step 4. Cut over

The cutover is where teams get nervous. Here's the actual sequence:

Set the application to read-only mode (or pause writes entirely). One feature flag, one deploy.
Wait 5 seconds for in-flight transactions to commit on the source.
Confirm lag_bytes = 0 from the subscription stat query above.
Bump sequences (next section: the trap that bites everyone).
Flip DNS / connection string / PgBouncer config to point at the new cluster.
Take the application out of read-only mode.

Total write pause: 10 to 30 seconds depending on how fast your config rollout is. That's the "zero-downtime": reads kept serving from a read replica throughout, writes paused for under a minute.

Step 5. Decommission

Don't drop the old cluster yet. Run the new one as the source of truth for at least 24 hours, with the old cluster still up but no traffic. If something goes wrong, you can flip back.

After 24 hours of clean operation:

-- on postgres-16.internal
DROP SUBSCRIPTION orders_migration_sub;

-- on postgres-13.internal
DROP PUBLICATION orders_migration;

-- check the slot is gone (DROP SUBSCRIPTION should have cleaned it,
-- but verify because a leaked slot will fill disk)
SELECT slot_name, active, pg_size_pretty(
  pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
) AS retained_wal
FROM pg_replication_slots;

Then snapshot the old cluster and shut it down.

Trap 1. Sequences don't replicate

This is the one. Logical replication copies row changes. Sequences are not rows; they're a separate object Postgres bumps independently. When you cut over, every sequence on the new cluster is at whatever value the schema dump set it to (probably 1), while your application is about to insert a row that conflicts with an existing primary key.

You have to bump every sequence to match the source, right before the cutover. Here's the script that does it:

-- run on the SOURCE to generate the bump statements
SELECT format(
  'SELECT setval(%L, %s, true);',
  schemaname || '.' || sequencename,
  last_value
)
FROM pg_sequences
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY schemaname, sequencename;

That returns one line per sequence, like:

SELECT setval('public.orders_id_seq', 8472913, true);
SELECT setval('public.customers_id_seq', 412877, true);
SELECT setval('public.line_items_id_seq', 31204882, true);

You collect the output, then during step 4 of the cutover (between the write pause and the DNS flip), you run those statements on the target. The true flag means "the next call to nextval returns a value greater than this," which is what you want.

A safer variant, if you don't trust your write-pause window: add a buffer.

SELECT format(
  'SELECT setval(%L, %s, true);',
  schemaname || '.' || sequencename,
  last_value + 10000  -- buffer for in-flight writes
)
FROM pg_sequences
WHERE schemaname NOT IN ('pg_catalog', 'information_schema');

You lose 10,000 IDs per sequence. That's nothing. What you gain is bulletproof: even if a write slipped through your read-only flag, the new cluster will not collide.

Trap 2. DDL doesn't replicate

Logical replication moves row data, not schema changes. If you ALTER TABLE orders ADD COLUMN canceled_at TIMESTAMPTZ on the source while replication is active, the subscriber will start failing on the next insert that includes that column, with an error like:

ERROR: logical replication target relation "public.orders" is missing
replicated column: "canceled_at"

Replication halts. The slot keeps growing. You're now in an incident.

The fix is process, not SQL. During the migration window, you freeze DDL. No schema changes, no new columns, no index rebuilds that drop and recreate. If you absolutely must ship a schema change mid-migration:

Apply the DDL on the subscriber first.
Apply the DDL on the publisher second.
Use only additive changes (ADD COLUMN NULL, ADD INDEX CONCURRENTLY). Never drop, rename, or change a type.

For longer migrations where you can't freeze DDL for days, look at the pglogical extension or Postgres 17's experimental DDL replication. For a 1–2 day window, just freeze.

Trap 3. Large objects and unlogged tables get skipped

Two categories of data that logical replication silently ignores:

Large objects (the pg_largeobject system table, accessed via lo_* functions). If your app stores PDFs or images using the LO API, those bytes are not in your subscription. You'll have an empty file table on the new cluster and not know it.

Unlogged tables. These are tables created with CREATE UNLOGGED TABLE, used for ephemeral data because they skip the WAL. Logical replication reads from the WAL. No WAL, no replication.

Workarounds:

For large objects: use pg_dump --large-objects separately, restore into the target after the initial sync, then accept that any LOs written during the cutover window will need a delta sync. The cleaner long-term answer is to move away from large objects to bytea columns or external object storage. Postgres 16 marked the LO API as effectively legacy.
For unlogged tables: dump and restore them at cutover time, accepting that data written between dump and cutover is lost. If the table genuinely doesn't matter (cache, session store), that's fine. If it does matter, it shouldn't be unlogged.

There's a third quiet skip worth knowing: tables without a REPLICA IDENTITY. By default that's the primary key. If you have a table with no PK and no REPLICA IDENTITY FULL set, updates and deletes will fail to replicate with cannot update table without primary key. The check:

SELECT n.nspname, c.relname, c.relreplident
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND c.relreplident = 'd'  -- default = primary key
  AND NOT EXISTS (
    SELECT 1 FROM pg_index i
    WHERE i.indrelid = c.oid AND i.indisprimary
  );

Any row that comes back needs a REPLICA IDENTITY FULL set before you start.

Verifying the cutover

Row counts are not enough. A table can have the right count and the wrong data. You want checksums.

-- run on both source and target, compare results
SELECT
  'orders' AS table_name,
  COUNT(*) AS row_count,
  md5(string_agg(
    md5(t::text),
    ',' ORDER BY id
  )) AS table_checksum
FROM orders t;

That builds a per-row hash, concatenates them in primary-key order, and hashes the result. Two clusters with identical data return identical checksums. Two clusters off by one row return wildly different hashes.

For tables with hundreds of millions of rows, this gets slow. Sample instead:

SELECT
  'orders' AS table_name,
  COUNT(*) AS sample_count,
  md5(string_agg(md5(t::text), ',' ORDER BY id)) AS checksum
FROM orders t TABLESAMPLE BERNOULLI (1)  -- 1% sample
WHERE id BETWEEN 1000000 AND 2000000;     -- bounded range

Run the same sample bounds on both sides. If checksums match on three or four random ranges, you're good.

The 60-second smoke test for the application: hit your top 5 read paths, hit your top 3 write paths, then run a query you know exact-counts the result for (a small reference table, an enum lookup). If all three pass, you're shipping.

The gotcha. Replication lag during cutover

Right when you cut over, replication lag tends to spike. The subscriber suddenly stops receiving the load it was streaming, the source is finishing in-flight transactions, and pg_stat_subscription can briefly show non-zero lag_bytes even after you've paused writes.

If your application has a read replica pattern, keep reads going to the old primary's read replica for a 5-minute window after cutover. Writes go to the new cluster, reads to the old one. Once lag_bytes has been zero for 60 seconds straight, flip reads over too.

This sounds like overkill. It isn't. The class of bug it prevents (a user submits a form, the write hits the new cluster, the immediate GET reads from a stale read replica) is exactly the kind of "ghost data" issue that gets logged as "intermittent UI bug" and lives in your backlog for months.

Zero-downtime is doable. It is not effortless. The five-step playbook gets you there; the three traps and the checksum verification keep you out of the incident channel.

What's the worst cutover you've shipped, and which of these traps bit you? Drop it in the comments.

If this was useful

This kind of migration sits in the gap between "knows Postgres" and "knows Postgres in production." If you want the longer treatment (picking the right replication tool for your topology, when to choose physical over logical, what happens to your stats and autovacuum after a major version jump), that's chapter 7 of the Database Playbook.