
Anton Borisov

Originally published at Medium

Switching Me Softly: Zero‑Downtime PostgreSQL 12 to 17 at Fresha

Killing me softly with his bash song, Zero‑downtime all along…

Artwork by Bojan Jevtić. Used with permission.


Introduction

Upgrading PostgreSQL in production is never just about running pg_upgrade.

At Fresha we run more than 200 PostgreSQL databases, and by the end of 2024 some of our most critical and heavily loaded ones, about 20 databases in total, were still stuck on Postgres 12. PostgreSQL 12 reached community end-of-life in November 2024, with Amazon RDS standard support ending in February 2025, but the risk and pain of upgrades had been high enough that we postponed them again and again.

This created a dangerous situation: business‑critical data on a soon‑to‑be unsupported version, and no safe path forward. We had Debezium streaming change events into Kafka, outbox connectors producing ordered domain events, and fleets of replicas serving reads. Dropping into “maintenance mode” for even a few minutes was unacceptable.

For years, upgrades had a deer-in-headlights effect on us: everyone froze. I wanted to break that cycle. By combining database internals, streaming knowledge, and a healthy dose of Bash automation, I showed how zero‑downtime upgrades could work in practice. Once we proved it on the hardest PG12 databases, we scaled the method into a repeatable solution for dozens of teams and hundreds of databases.

Let’s step back for a moment. The obvious path was always “schedule maintenance windows” and take the hit. But that’s not realistic when your system is 24/7 and globally distributed. The real challenge wasn’t just moving the data — it was dealing with the entire ecosystem around it:

  • Debezium CDC connectors continuously streaming changes into Kafka
  • Outbox pattern connectors generating ordered business events
  • Replication slots and WAL management that can’t be dropped and recreated without data loss
  • Physical replicas serving production reads
  • PgBouncer pools handling thousands of concurrent connections

Instacart’s zero‑downtime cutover post describes a well‑known approach in the DBA world:

  1. Take a consistent base backup (RDS snapshot or pg_basebackup)
  2. Keep a replication slot open on the primary so WAL keeps accumulating
  3. Align the replica’s replication origin with the backup’s last known LSN
  4. Stream the changes from the slot until the replica catches up
  5. Promote the replica and cut traffic over

Logical replication during a blue‑green upgrade: WAL accumulates while the snapshot is restored, then the replication origin is advanced on the new cluster so it catches up without loss.
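
For steps 3 and 4, a minimal sketch of the idea looks roughly like this (the slot, publication, and subscription names match the config shown later; GREEN_DSN and SNAPSHOT_LSN are illustrative placeholders, and the snapshot's last known LSN has to be captured when the backup is taken):

# Minimal sketch, assuming the slot and publication already exist on Blue
# and the restored snapshot already contains the data up to SNAPSHOT_LSN.
SNAPSHOT_LSN='0/AB12CD34'   # illustrative value

psql "$GREEN_DSN" <<SQL
-- Create the subscription disabled: skip the initial copy (the snapshot has the data)
-- and attach to the pre-created slot on Blue instead of creating a new one.
CREATE SUBSCRIPTION dms_sub
  CONNECTION 'host=blue-primary dbname=testdb user=testuser'
  PUBLICATION dms_pub
  WITH (create_slot = false, slot_name = 'dms_slot', copy_data = false, enabled = false);

-- Fast-forward the subscription's replication origin to the snapshot LSN,
-- so streaming resumes exactly where the backup left off.
SELECT pg_replication_origin_advance('pg_' || oid, '$SNAPSHOT_LSN'::pg_lsn)
  FROM pg_subscription
 WHERE subname = 'dms_sub';

ALTER SUBSCRIPTION dms_sub ENABLE;
SQL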


That mechanism works, but it’s only the foundation. In practice it doesn’t address the messy parts we faced: Debezium connectors tied to those slots, outbox event ordering, sequence alignment, PgBouncer switchovers, and idempotency guarantees. Those were the hard problems we had to solve to make upgrades safe and repeatable at Fresha. And yes, we didn’t stop at a single major step — we did 12 → 17.

High‑level switchover phases


Why Classic Approaches Don’t Work

The classic “restore from snapshot, upgrade, and cut over” is conceptually simple. But in reality:

  • Logical replication must continue through the upgrade process
  • Every physical replica needs to remain consistent and available for read‑only queries
  • Kafka connectors tied to slots on the old cluster must transition cleanly; otherwise they either lose data or emit duplicates in unpredictable order

Before building our own method, we looked at the obvious options:

RDS Blue/Green Deployments

Amazon RDS offers Blue/Green upgrades that spin up a parallel environment and promise minimal downtime. Sounds perfect until you realize it doesn’t integrate with the logical replication slots and Debezium connectors we rely on. Blue/Green can swap endpoints, but it doesn’t handle Kafka offsets, connector states, or downstream deduplication requirements. For us, that’s a non‑starter.

Direct pg_upgrade on Production

In‑place upgrades sound easy: stop Postgres, run pg_upgrade, restart. But that implies minutes or hours of downtime on terabyte‑scale databases. It also assumes you’re comfortable with no rollback option beyond restoring backups. We weren’t.

Why classic approaches fall short

So we built something different:

  • Start the target database as a logical replica restored from a snapshot
  • Rebuild its physical replicas alongside, so the new cluster mirrors the old one
  • Use PgBouncer as the traffic switch, with a scripted pause/resume to freeze connections during switchover
  • Design explicit Debezium handling modes for both CDC and outbox connectors

Synchronizing slots and connectors: duplicate replication slots, create new Kafka Connect connectors, then switch PgBouncer to move traffic to the upgraded database.


The rest of this article walks through how we turned that into an automated, developer‑friendly upgrade framework that anyone on our team could run in ~5 minutes per database.


Winter Is Coming: Prewarm & Analyze

Catching up a replica is only half the story. Before you flip traffic, you need to make sure the new cluster can actually carry production.

Consistency checks. On staging we ran PG17 with full app tests: migrations, queries, extensions. Don’t skip this. If the app isn’t green here, it won’t be green in prod.

Prewarm and analyze. A fresh RDS restore is cold. First queries crawl while blocks come in from object storage. We pre‑warmed critical tables and ran ANALYZE so stats were correct and the planner didn’t go wild on the first real query.
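
For illustration, the warm-up itself is just a handful of statements; a minimal sketch, assuming the pg_prewarm extension is available and using placeholder names for the hottest relations:

psql "$GREEN_DSN" <<SQL
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('bookings');        -- pull the heap into shared buffers
SELECT pg_prewarm('bookings_pkey');   -- indexes are just as cold after a restore
ANALYZE;                              -- refresh planner statistics on the new cluster
SQL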

Two‑phase switchover.

1) Read‑only flip — move replicas first. All writes still hit Blue, but reads come from Green. This is the production smoke test — if queries fail here, you still have a rollback button.

2) Full flip — once RO is solid, pause PgBouncer pools, wait for replication to be fully caught up, then point writes at Green. Under the hood, “full” just means “RO first, then RW.”

PgBouncer routes traffic during switchover: read‑only pools are switched first, then read‑write.

PgBouncer tricks. Running it on Kubernetes has its quirks. One big one: how do you safely reschedule pods without jolting the on‑call at 3 a.m.? Our pain point was ConfigMaps. Redeploying them was too slow. The fix: pre‑mount all configs on every pod, and when it’s time to switch, skip redeploy and just fire off a few admin commands.

SET conffile = '/etc/pgbouncer/pgbouncer_new_rw.ini';
RELOAD;

No pod restarts, no reconciliation delay.
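
Put together, the write flip boils down to a few commands; a sketch, assuming PGB_ADMIN points at the PgBouncer admin console and reusing the slot name and lag threshold from the config shown later:

psql "$PGB_ADMIN" -c "PAUSE testdb;"   # hold client connections, let in-flight queries finish

# Wait until the logical slot on Blue has (almost) nothing left to send.
until [ "$(psql "$BLUE_DSN" -Atc "SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) FROM pg_replication_slots WHERE slot_name = 'dms_slot';")" -lt 20000 ]; do
  sleep 1
done

psql "$PGB_ADMIN" -c "SET conffile = '/etc/pgbouncer/pgbouncer_new_rw.ini';"
psql "$PGB_ADMIN" -c "RELOAD;"
psql "$PGB_ADMIN" -c "RESUME testdb;"  # clients reconnect, now landing on Green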


Mind the Gap (setval +100k)

  • Advance early, outside the pause. Right before switchover we run a sync_sequences script which iterates all sequences from pg_sequences on Blue, reads each sequence’s last_value, and then bumps the same sequence on Green to last_value + sync_sequences_gap (default 100000). This keeps the flip window short.
  • Exact mechanics. For each schemaname.sequencename:
-- On Green
SET transaction_read_only = OFF;
SELECT setval('<schema.seq>', <blue.last_value> + <gap>, true);

Why this is fine. We only need sequences to never collide after the flip. Pushing Green well ahead of Blue guarantees new inserts won’t reuse a value, and sequences are monotonic — no need to “pull back” on rollback.
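
A stripped-down sketch of what sync_sequences does (the real script takes the gap and connection details from the YAML config shown later):

# Illustrative only: iterate Blue's sequences and bump the same ones on Green.
GAP=100000
psql "$BLUE_DSN" -Atc \
  "SELECT schemaname, sequencename, COALESCE(last_value, 0) FROM pg_sequences;" |
while IFS='|' read -r schema seq last; do
  psql "$GREEN_DSN" -c \
    "SET transaction_read_only = OFF; SELECT setval('${schema}.${seq}', $((last + GAP)), true);"
done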


Handling Debezium During Switchover

The problem. Debezium ties each connector to a logical replication slot on the source (Blue). When you bring up the target (Green) and want the same data to keep flowing into the same topics, you have two options: accept a short overlap window in which both Blue and Green emit the same tail of changes, or orchestrate a strict handover. We used both, depending on what the consumers could tolerate.

CDC to Warehouse (easy mode: dedup)

For CDC → Debezium → Snowflake, we already deduplicate downstream. So we intentionally create a small overlap:

Managing WAL overlap: both Blue and Green CDC slots accumulate changes, with a controlled overlap to ensure continuity.

  • Keep Blue’s connector running
  • Create Green’s connector with the same schema/topic config, start from the fresh slot on Green
  • Allow a short duplication window so both produce the same tail of WAL
  • Stop Blue’s connector

Result: Snowflake reconciles quickly and retains one copy of each row/version. Reliable, fast, and zero‑risk if dedup is in place.
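
Concretely, that handover is two calls against the CDC Connect cluster; the sketch below uses hypothetical connector names and only the handover-relevant fields (credentials, converters, and table filters omitted):

# Register the Green CDC connector: same topics as Blue, fresh slot on Green.
curl -s -X PUT http://localhost:8083/connectors/cdc-testdb-green/config \
  -H 'Content-Type: application/json' \
  -d '{
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres-target",
        "database.port": "5432",
        "database.dbname": "testdb",
        "slot.name": "cdc_slot_green",
        "publication.name": "cdc_pub",
        "topic.prefix": "testdb",
        "snapshot.mode": "never"
      }'

# ... let both connectors overlap for a short while, then retire the Blue one.
curl -s -X DELETE http://localhost:8083/connectors/cdc-testdb-blue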

Outbox/Event Streams (stricter)

Outbox topics care about ordering semantics. Our rule of thumb:

  • During the flip there should be no “funny ordering” — the consumer either sees an event it has already processed, or it’s a brand‑new event.

We guarantee that by:

  • Briefly pausing writes on Blue right before the RW flip, giving Debezium time to flush the tail (the wait is config‑controlled)
  • Only then starting the Green connector
  • Keeping topic keying and partitioning identical, so any duplicates are the same key in the same partition

Outbox connector cutover: the old connector flushes its WAL tail, then the new connector starts reading from the overlapped slot on the upgraded database.

If your consumers are idempotent (exactly‑once at the business level), this pattern is safe: duplicates are replays of the same message, never re‑ordered across the cut.

Overlap window illustration

Slots & offsets. You don’t “move” a slot; you create a new slot on Green and start the connector there. The old connector keeps its Blue slot until you stop it. Kafka offsets continue per topic/partition and duplicates are identical payloads, not new keys.
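
If you want full control over the starting point, the slot (and, if needed, a publication) can be pre-created on Green before the connector is registered; a sketch with illustrative names:

psql "$GREEN_DSN" <<SQL
CREATE PUBLICATION cdc_pub FOR ALL TABLES;
SELECT pg_create_logical_replication_slot('cdc_slot_green', 'pgoutput');
SQL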

Overlap discipline. Keep the window short. Long overlaps create needless churn and noisy metrics.

Sensible checks. Only flip RW after:

  • Blue → Green replication lag under threshold
  • Outbox writes paused and flush complete (give it time to drain)
  • Green connector running and healthy

Rollback. If anything looks off, stop the Green connector, resume Blue, unpause writes. Because we haven’t changed keys/ordering, consumers remain consistent.


Orchestration Mode: Strict Ordering, Zero Confusion

When idempotency isn’t a sure bet for outbox consumers, we split the stream cleanly: old DB emits only “old” events, new DB emits only “new” events. No interleaving, no ambiguity.

How we mark events

  • Add a boolean version column to outbox tables, e.g. use_pg17 BOOLEAN DEFAULT NULL (present on all partitions)
  • Right before the flip, change the default on the target (Green) to TRUE while keeping it NULL/FALSE on the source (Blue), so new writes on Green automatically carry use_pg17 = true (see the sketch below)
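
A sketch of both steps, using the column and table names from the config shown later (the column is added on Blue early enough to be present in Green's snapshot; the default flip happens on Green only):

# Done well ahead of the upgrade, so the column also ends up in the snapshot.
psql "$BLUE_DSN" -c \
  "ALTER TABLE outbox_events_1 ADD COLUMN IF NOT EXISTS use_pg17 BOOLEAN DEFAULT NULL;"

# Right before the write flip, on Green only: new local writes start carrying the tag.
psql "$GREEN_DSN" -c \
  "ALTER TABLE outbox_events_1 ALTER COLUMN use_pg17 SET DEFAULT TRUE;"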

How we route events

  • Create a new outbox connector on Green with a JSR‑223 Groovy filter that only passes rows where use_pg17 == true. The old connector keeps running on Blue but filters to “not true” (null/false). This cleanly divides the topic stream.

Green filter:

value.after.use_pg17 == true

Blue filter:

value.after.use_pg17 == null || value.after.use_pg17 == false

  • In the connector config set:
transforms=filter,outbox
transforms.filter.type=io.debezium.transforms.Filter
transforms.filter.language=jsr223.groovy
transforms.filter.condition=<condition above>
transforms.outbox.type=io.debezium.transforms.outbox.EventRouter
snapshot.mode=never

The same configuration is applied to both the old and the new connector, each with its own filter condition.

Switchover choreography

  1. Prep: ensure use_pg17 exists on all outbox tables; the script checks this and fails fast if missing
  2. Flip default on Green: set use_pg17 DEFAULT TRUE so any new events after the write flip carry the tag
  3. Start Green outbox connector with the use_pg17 == true filter; old Blue connector still runs with “not true” filter
  4. Pause writes on Blue, let Debezium flush and the old connector process the remaining events, switch over to Green, then resume writes. Green now produces new (tagged) events, and the filters keep old‑DB duplicates out of the stream
  5. Retire the Blue connector once the tail is drained; drop the filter on Green if desired (after Blue is off, everything is “new” anyway). Connector lifecycle and config mutation are automated via the Connect REST API (see the sketch below)
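
Those lifecycle calls are plain Connect REST requests; a sketch with hypothetical connector names against the outbox cluster from the config shown later:

# Health gate before the flip: the Green connector must be RUNNING.
curl -s http://localhost:8084/connectors/outbox-testdb-green/status

# Once the tail is drained, retire the Blue connector ...
curl -s -X DELETE http://localhost:8084/connectors/outbox-testdb-blue

# ... and, if desired, drop the Groovy filter on Green by rewriting its config.
curl -s -X PUT http://localhost:8084/connectors/outbox-testdb-green/config \
  -H 'Content-Type: application/json' \
  -d @outbox-green-config-without-filter.json   # hypothetical file with the filter removed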

Outbox orchestration: old and new connectors overlap, using a version flag to filter duplicate events during switchover.

Why this works

  • Topics never receive mixed “old/new” events during the window. Consumers either see already‑processed events (from Blue) or brand‑new events (from Green). There’s no reordering across the boundary because we pause writes, drain, then resume.
  • Rollback is symmetric: flip the defaults back, adjust filters (the script handles forward/reverse conditions automatically).

The Greatest Teacher, Failure Is

No matter how carefully you plan, switchover days have a way of surfacing unknowns. We built in multiple safety nets to make sure “wrong” never meant “irreversible.”

Safety nets diagram

CDC connectors are the easy part. If a Debezium CDC connector stumbles, our downstream (Snowflake) is already deduplicating on keys, so we can always re‑consume the stream. Worst‑case, we spin up a new connector with snapshot.mode=initial against a trimmed dataset, relying on table indices to cut over at a known safe point. Consumers will reconcile cleanly.

Outbox connectors are trickier. Because they carry business events, we can’t just blindly replay snapshots. That’s why we lean on idempotent consumers and, when possible, the Kafka Connect REST API to drop offsets. On older Connect versions, the fallback is heavier: create a fresh connector/slot pair, trim the table online to the events we care about, then re‑bootstrap. You can use something like my previous article as a recipe.
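
On Kafka Connect 3.6+ the offset reset is two REST calls, stop the connector and then delete its stored offsets; a hedged sketch with an illustrative connector name:

# Requires Connect 3.6+ (KIP-875); the connector must be in STOPPED state first.
curl -s -X PUT    http://localhost:8084/connectors/outbox-testdb/stop
curl -s -X DELETE http://localhost:8084/connectors/outbox-testdb/offsets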

If the new database turns out to be unusable. Our orchestration doesn’t just go forward — it can reverse. Once the switchover completes, we can run the same scripts in reverse: switch PgBouncer back, flip replication slots/publications, and re‑attach connectors to the old cluster. In theory, you can loop this forward/backward multiple times until you’re satisfied.

Dry‑run mode as insurance. We added a “test catch‑up” mode in switchover.sh: it pauses writes, lets logical replication and slots catch up, but never actually flips traffic to the new cluster. This gave us two superpowers:

  • Measure how long the pause would last under real workload
  • Test whether the application tolerated read‑only windows

If it didn’t, we’d just resume writes on the source and treat it as a rehearsal, not a disaster.

Takeaway: A zero‑downtime upgrade isn’t just about the happy path. It’s about building escape hatches: reset CDC, replay outbox carefully, reverse the switchover if needed, and dry‑run until you trust the process.


Everything Everywhere All at Once, a.k.a. YAML

One of the biggest lessons we learned after the first few upgrades was: you don’t want hardcoded scripts. Each database had its quirks — different connector sets, replica layouts, outbox usage — so we pushed everything into a YAML config and a single entrypoint.

Modules with Bash
Yeah, you can do modules with Bash :)

Our switchover.sh script is a thin wrapper around sub‑modules in lib/, reading from switchover-config.yaml. That config defines:

Source and target clusters

source:
  internal_name: postgres_source
  host: localhost
  port: 5433
  replicas:
    - name: postgres_source_ro_1
    - name: postgres_source_ro_2
target:
  internal_name: postgres_target
  host: localhost
  port: 5434
  replicas:
    - name: postgres_target_ro_1
    - name: postgres_target_ro_2

This lets us model not just primaries, but physical replicas, so PgBouncer can safely switch read‑only traffic separately from read‑write.

Database‑level details (slots, publications, subscriptions)

database:
  name: testdb
  user: testuser
  password: testpass
  logical_slot: dms_slot
  logical_publication: dms_pub
  logical_subscription: dms_sub
  transaction_wait_time: 3  # allow txs to finish before going RO on Blue

PgBouncer pools and pre‑delivered switchover configs

pgbouncer:
  switchover_rw_config_file: pgbouncer_new_rw.ini
  switchover_ro_config_file: pgbouncer_new_ro.ini

Kafka Connect clusters and outbox mapping

kafka:
  connect_clusters:
    - name: cdc
      url: http://localhost:8083
    - name: outbox
      url: http://localhost:8084
debezium:
  outbox:
    version_column: use_pg17
    tables:
      - outbox_events_1
      - outbox_events_2

Replication tolerances (lag thresholds, sequence gaps, catch‑up timeout)

replication:
  max_lag_bytes: 20000        # what to consider "no lag"
  catchup_timeout: 10         # how long to pause to catch up
  sync_sequences_gap: 100000  # safe gap for sequences

Why expose so many options? Because upgrades are never uniform:

  • Sometimes you want duplication mode for Debezium (let duplicates flow, downstream deduplicates)
  • Sometimes you need orchestration mode (JSR filters + version columns to separate old vs new events)
  • Some teams only wanted to test read‑only switchover, without touching writes
  • On the biggest DBs, we needed to pause transactions, wait for catch‑up, then flip. On smaller ones, we could skip the wait entirely

Instead of maintaining a dozen scripts, we ended up with a single command:

./run.sh switchover \
  --direction forward \
  --mode full \
  --debezium-mode orchestration \
  --transaction-mode wait \
  --test-catchup disabled \
  --config configs/switchover-config.yaml

Every option maps directly to one of those operational decisions. This made the process reproducible, reversible, and explainable.


Results

Between January and February, we upgraded 20+ PostgreSQL databases from PG12 to PG17 without a single minute of downtime. Even the largest and most sensitive database — the one where we couldn’t assume idempotent consumers — was migrated safely using the orchestration trick with version columns and JSR filters.

And yes, after the final switchover, we had a cake. 🎂

Cake

More importantly, we ended up with a repeatable framework:

  • The same scripts and configs could be applied across clusters
  • Every option (duplication vs orchestration, test catch‑up vs full switchover) was codified and reusable
  • Rollback and dry‑run modes gave us confidence under pressure

End‑to‑end switchover flow: from prewarming and replication slot setup, through pausing PgBouncer and Debezium orchestration, to resuming traffic on Green.


Conclusion

Zero‑downtime PostgreSQL upgrades are not just possible — they’re now our reality. With careful orchestration of CDC connectors, outbox event streams, replication slots, and PgBouncer traffic, we moved our production fleet forward to PG17 without disruption.

That said, the switchover is only half the story. The core mechanism is elegant, but the supporting work around it can still feel tedious:

  • Disabling migrations while replicas catch up
  • Tweaking PgBouncer configs
  • Changing Terraform definitions for connectors
  • Creating PRs, waiting for approvals, and fixing the inevitable typo

We’re not stopping here. The scripts and configs are good enough today that teams across Fresha can already run upgrades on their own by following the documented process. The next step is to take the human glue out of it:

  • Templated configs
  • PR generation
  • Orchestration with AWS Step Functions
  • And finally a workflow where upgrading Postgres feels like a safe, boring routine

We’ve proven this works at Fresha scale, and now we’re investing in making it a one‑button operation. When PG18 lands, we don’t want anyone on‑call to break a sweat.


Credits

  • Anton Borisov — rule‑breaker and architect of the impossible; brought the idea nobody thought would fly, built the first script, solved the hardest DB hurdle, and set the pace as the project’s boldest innovator. Basically the one person you shouldn’t let near production… unless you want results.
  • Emiliano Mancuso — Head of Infrastructure; believed in the process when others said it was impossible. Also doubles as our premier rubber duck.
  • Jan Zaremba — leader of Infrastructure; encyclopedic knowledge and steady hand that made every iteration safer than the last.
  • Robert Pyciarz — doomsday architect; mapped out every failure path and tucked Kubernetes into all the right corners.
  • Paritosh Anand & Paweł Michna — deadline wrestlers; ran upgrades at crunch time and wrote down every step so the rest of us could sleep.
  • Blend Halilaj — snapshot tactician; turned the idea of blue‑green RDS restores into a clean, repeatable workflow.
  • Special encore for Rehan Ullah, Sam Stewart, and (again) Paweł Michna, the crew turning all of this into a one‑button upgrade right now.
