Deleting Millions of Rows in PostgreSQL Without Taking Down Production

Shanthan Kondapalli — Sun, 14 Jun 2026 18:35:33 +0000

MVCC, WAL, vacuum, and replication slots under sustained delete load - and how to delete billions of rows without your database noticing

Most "how to delete a lot of rows" articles stop at "batch it and delete children before parents." That advice is correct, it's table stakes, and everyone already knows it. This article is about everything after that - the parts that actually decide whether your cleanup runs quietly in the background for a week or pages you at 3 a.m. with a full disk and a replica that's six hours behind.
The thesis: at scale, your DELETE statement is the easy part. The adversaries are the subsystems a delete feeds - MVCC tuple versioning, the write-ahead log, autovacuum, and the replication machinery. Bulk deletion is really an exercise in flow control across those subsystems. Get the SQL right and the systems wrong, and you'll still take production down.
We'll assume PostgreSQL (the internals are PG-specific), a live OLTP primary with at least one physical replica and one or more logical/CDC consumers, and a target of hundreds of millions to billions of rows across many related tables.

The one paragraph of "basics," so we can move on: delete in dependency order (referencing rows before referenced rows); collect parent keys once; never rely on ON DELETE CASCADE for huge deletes because you can't throttle a cascade. Done. Now the real material.

1. What a DELETE actually costs

A delete is not "remove a row." Under MVCC it's "mark a row version dead and write that fact everywhere." For each deleted tuple, PostgreSQL:

Sets xmax on the heap tuple to your transaction id. The row is still physically present; it becomes a dead tuple once your transaction commits and no snapshot can still see it.

Writes a WAL record for the heap change. If this is the first modification of that page since the last checkpoint, it also writes a full-page image (FPI) - potentially 8 KB of WAL for a single-row change.

Touches every index. Index entries aren't removed at delete time; they're left dangling and cleaned up later by vacuum. So a table with 12 indexes generates index-cleanup debt across all 12.

Feeds logical decoding. If a logical replication slot exists, the change is decoded from WAL and streamed to consumers - and how much gets written depends on the table's REPLICA IDENTITY (Section 5). Multiply by a billion. The deleted rows don't shrink the table - they inflate it with dead tuples until vacuum reclaims them, and they inflate WAL, which every replica and slot must consume. The delete is cheap; the wake it leaves is expensive. Everything below is about managing that wake.

- ## 2. Batching is transaction-lifetime control, not just "smaller deletes" People batch to avoid one long lock. True, but the deeper reason is snapshot lifetime, and it's the single most important concept in this entire article. A transaction holds a snapshot. That snapshot defines an xmin horizon - the oldest transaction id still considered "in progress" from the system's point of view. Two brutal consequences follow, and they're the root cause of most bulk-delete disasters:

Vacuum cannot reclaim any dead tuple newer than the oldest snapshot's xmin. A long-running transaction anywhere in the database freezes the cleanup horizon. So if your delete runs as one multi-hour transaction, autovacuum literally cannot clean up the dead tuples you're producing - they pile up until you commit.

Logical replication slots cannot advance past an open transaction. More on this in Section 6, but the headline is the same: a long-lived transaction pins WAL on disk. So a billion-row delete in one transaction is a triple threat: a giant lock, an un-vacuumable mountain of dead tuples, and pinned WAL. Batching with per-batch COMMIT is how you keep the xmin horizon moving so vacuum and replication can do their jobs while you work.

 - The shape that matters: tiny transactions, monotonic cursor, durable progress.
LOOP
 SELECT max(id) INTO v_hi
 FROM (SELECT id FROM t WHERE owner=:o AND id > v_lo ORDER BY id LIMIT :n) s;
 EXIT WHEN v_hi IS NULL;
 DELETE FROM t WHERE owner=:o AND id > v_lo AND id <= v_hi;
 v_lo := v_hi;
 COMMIT; - releases the snapshot; advances xmin horizon
 PERFORM dbo_backpressure_sleep(); - Section 7
END LOOP;

Keyset pagination (id > v_lo ORDER BY id LIMIT n) rides the primary-key index and is O(1) per batch regardless of depth, unlike OFFSET which re-scans everything it skips. The COMMIT is doing far more than releasing locks - it's resetting your contribution to the global xmin horizon every few thousand rows.

COMMIT inside a loop requires a procedure (or anonymous DO block). Functions run inside the caller's transaction and cannot commit. This distinction will matter again in Section 7.

- ## 3. Dead tuples and the vacuum you must run yourself Autovacuum is designed for steady-state churn, not for a deliberate billion-row purge. During a heavy delete you out-produce it, and three things go wrong: the table bloats, indexes bloat worse, and - if the run is long enough - you flirt with transaction-id wraparound pressure. You need to take vacuum into your own hands. ### 3.1 Why "just let autovacuum handle it" fails Autovacuum triggers when dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples. With the default scale_factor = 0.2, a 2-billion-row table won't trigger autovacuum until ~400 million dead tuples have accumulated. By then you have catastrophic bloat and a vacuum that takes hours and competes with your delete for I/O. For a purge window, tighten autovacuum on the target tables for the duration, then reset:

ALTER TABLE big SET (
 autovacuum_vacuum_scale_factor = 0.005, - fire at 0.5% dead, not 20%
 autovacuum_vacuum_threshold = 5000000,
 autovacuum_vacuum_cost_delay = 5, - throttled but not glacial
 autovacuum_vacuum_cost_limit = 2000 - ~10x default budget per cycle
);
 - … run the purge …
ALTER TABLE big RESET (autovacuum_vacuum_scale_factor,
 autovacuum_vacuum_threshold, autovacuum_vacuum_cost_delay,
 autovacuum_vacuum_cost_limit);

3.2 Vacuuming during the run, from inside a procedure

You also want explicit VACUUM checkpoints - say every N million rows deleted from a given table. Two PostgreSQL facts make this awkward:

VACUUM cannot run inside a transaction block. Your batched-delete procedure is always in a transaction, so you can't call it directly.
The clean workaround is to run it over a separate connection. dblink is the idiomatic trick - dblink_exec opens its own backend where VACUUM is a legitimate top-level statement:

PERFORM dblink_exec(:conn, format('VACUUM (ANALYZE) %I.%I', :schema, :table));

And the scripting-layer trap that catches everyone: psql -c "stmt1; stmt2" wraps both statements in one implicit transaction, so a VACUUM after a SET fails with "VACUUM cannot run inside a transaction block." Use separate -c flags - each becomes its own transaction:

psql -c "SET statement_timeout=0" -c "VACUUM ANALYZE big" # works

3.3 The horizon makes mid-run vacuum partly futile - unless you batch

Here's the payoff from Section 2: a VACUUM can only remove tuples older than the global oldest xmin. If your own delete is one long transaction, your mid-run vacuum reclaims almost nothing, because your snapshot is the horizon. Only because you batch-and-commit does mid-run vacuum actually free space. Vacuum and batching are not independent tactics - batching is what makes vacuuming work at all.

3.4 Index bloat outlives table bloat

VACUUM makes dead space reusable but doesn't return it to the OS, and B-tree indexes bloat faster than heaps under massive deletes. After a large purge, estimate index bloat (the catalog/pgstattuple estimators) and REINDEX INDEX CONCURRENTLY the worst offenders - concurrently, so you don't take an AccessExclusive lock on a live table. Don't forget TOAST tables for wide columns; they bloat too.

3.5 Wraparound, briefly

A days-long purge consumes transaction ids (one per batch commit, plus the deletes). It won't cause wraparound on its own, but combined with suppressed vacuuming it raises the table's age. Keep an eye on age(relfrozenxid) for the big tables if your purge runs for many days; the tightened autovacuum above also drives freezing.

- ## 4. WAL: the firehose you're pointing at your replicas Every batch you delete becomes WAL, and WAL is the shared resource that couples your delete to everything downstream. Two amplifiers turn a reasonable delete into a WAL flood. ### 4.1 Full-page writes and checkpoints The first write to a page after a checkpoint emits a full 8 KB page image. A delete that dirties many pages right after a checkpoint can emit far more WAL than the logical change suggests. You can't avoid FPIs, but you can avoid making them worse: don't run a purge concurrently with aggressive checkpoint settings, and spread the load (throttle) so checkpoints amortize. Watch pg_stat_bgwriter and the WAL generation rate. ### 4.2 REPLICA IDENTITY FULL is a WAL multiplier This one silently doubles or triples your WAL and most people never check it. For logical decoding, PostgreSQL must write enough of the old row to identify it on the subscriber:
REPLICA IDENTITY DEFAULT → writes only the primary key of the deleted row. Cheap.
REPLICA IDENTITY FULL → writes the entire old row, every column, to WAL for each delete. On a wide table, that's a multi-hundred-byte WAL record per deleted row instead of a few bytes. Tables without a primary key are often set to FULL (sometimes implicitly via tooling). If you're deleting a billion rows from such a table with a logical slot attached, you may write an order of magnitude more WAL than necessary. Give big tables a real primary key and REPLICA IDENTITY DEFAULT before a large purge. Measure WAL per batch (pg_current_wal_lsn() deltas) so amplification shows up immediately:

SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), :prev_lsn) AS wal_bytes_this_batch;

- ## 5. Replication slots: where bulk deletes go to die This is the section that separates a clean purge from an outage. Your WAL has to be consumed by every replica and every replication slot before the primary can recycle it. If you produce WAL faster than they consume - or worse, if you prevent them from advancing - WAL accumulates on the primary's disk until it fills, and a full WAL volume is a hard production stop. ### 5.1 Two kinds of lag, and one of them is invisible
Physical replicas expose replay_lag (a time interval) in pg_stat_replication. Easy to see.
Logical slots (CDC, decoding-based replication) do not appear in pg_stat_replication. They live in pg_replication_slots, and their backlog is measured in bytes of retained WAL: pg_current_wal_lsn() - confirmed_flush_lsn. A stuck or slow logical consumer is the classic silent killer - it shows nothing in the obvious place while quietly pinning hundreds of gigabytes of WAL.

 - physical, seconds behind
SELECT coalesce(max(extract(epoch FROM replay_lag))::int,0) FROM pg_stat_replication;
 - logical, bytes of WAL retained (CHECK THIS - it's the one that bites)
SELECT coalesce(max(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)),0)
FROM pg_replication_slots WHERE slot_type='logical' AND active;

5.2 `restart_lsn`, `confirmed_flush_lsn`, and the open-transaction trap

A logical slot tracks two LSNs: confirmed_flush_lsn (what the consumer has acknowledged) and restart_lsn (the oldest WAL the slot might still need if decoding restarts). The primary can only recycle WAL older than the minimum restart_lsn across all slots.
Now the trap. restart_lsn cannot advance past the start of the oldest in-progress transaction, because logical decoding emits whole transactions at commit and may need to re-read from a transaction's beginning. So an open transaction does two terrible things at once:

It pins the xmin horizon → vacuum can't clean up (Section 3).
It pins restart_lsn → the primary retains all WAL written since that transaction began, by every session. This is why a pause loop that sleeps inside an open transaction is actively harmful: you stop deleting to let consumers catch up, but your idle transaction's snapshot keeps restart_lsn frozen, so the WAL backlog grows during the pause instead of draining. You engineered a backpressure mechanism that makes backpressure worse. The fix is one line, and it's the whole reason this needs to be a procedure:

WHILE lag_exceeds_threshold LOOP
 EXIT WHEN deadline_passed; - escape hatch, Section 7
 COMMIT; - release snapshot BEFORE sleeping → restart_lsn can advance
 PERFORM pg_sleep(:recheck);
 - re-measure lag
END LOOP;

Commit, then sleep. Each pause iteration becomes a sub-10-second transaction, and restart_lsn is free to chase the consumers while you wait. There's also catalog_xmin on logical slots that pins catalog vacuuming similarly - same fix, same reasoning.

5.3 Slot disk exhaustion is the actual failure mode

The incident isn't "replication is slow." It's "the primary's disk filled with retained WAL and the database stopped accepting writes." Set max_slot_wal_keep_size (PG 13+) as a safety valve so a dead slot gets invalidated rather than taking the primary down - and alert on per-slot retained bytes and on replication-instance disk headroom. A slow consumer should degrade your purge, never your primary.

- ## 6. Designing the backpressure controller Put Sections 4–5 together into an explicit control loop that runs between batches. It's a feedback controller: measure lag, decide, act.

after each batch COMMIT:
 sleep(baseline_throttle) # steady-state pacing
loop: # physical replica gate
 lag = max(replay_lag)
 if lag >= ALARM: warn(operator) # loud, no auto-abort
 if lag < PAUSE: break
 COMMIT; sleep(recheck); continue # pause WITHOUT pinning WAL
loop: # logical slot gate (separate!)
 backlog = max(wal_retained_bytes)
 if backlog >= ALARM: warn(operator)
 if backlog < PAUSE: break
 COMMIT; sleep(recheck); continue

Design notes that matter in production:

Two independent gates. Physical lag is in seconds; logical backlog is in bytes. They fail differently and need different thresholds. Checking only pg_stat_replication (the common mistake) leaves the logical-slot killer completely unmonitored.
Generous physical thresholds. Cross-region async replicas naturally lag minutes under write load. Pausing too aggressively makes your purge crawl while protecting nothing real. Pause at minutes, not seconds.
Tight logical thresholds. A logical slot growing past ~1 GB retained is a genuine signal - a consumer is falling behind and disk is at risk. Pause early here.
A deadline escape hatch in every pause loop. A dead consumer never recovers, so an unconditional "wait until lag clears" loop waits forever and bypasses your run deadline. Every pause must also exit when the run's wall-clock deadline passes (Section 7) - otherwise one stuck slot hangs your job indefinitely. (Note the tension with continuous/no-deadline runs: if you disable the deadline, you also disable this escape hatch, so a stuck consumer pauses you forever. Only run deadline-less under supervision.)
- ## 7. Resumability at scale, and a planner trap that masquerades as a hang Long purges get interrupted - deploys, deadlines, crashes. The job must resume and make forward progress. The non-obvious failure here isn't crashing; it's a resume pattern that runs forever doing nothing. ### 7.1 The re-walk anti-pattern Suppose you precompute a key list (millions of parent keys) and drive child-table deletes by paging through that list with a cursor. First run: fine. But the key list is static while the child table shrinks as you delete. On resume the cursor resets to the start and re-walks the entire already-cleared prefix - issuing a delete-commit-throttle cycle per key that matches nothing. If the surviving rows sit at the high end of the key space, a single pass spends its whole budget re-walking empties and times out before reaching live data. Every restart repeats it. Net progress: zero. The job can run for days "deleting," and delete nothing. It looks like a hang; it's actually an algorithmic mismatch between a static driver and a shrinking target. ### 7.2 Make the driver track the work, not the original plan Two fixes, both of which let a fresh run jump straight to remaining work: Self-referential - page the target table itself. Deleted rows are gone, so id > cursor from a cold start returns only survivors:

SELECT id FROM child WHERE owner=:o AND id > :cursor ORDER BY id LIMIT :n;

EXISTS-guarded driver - if you must keep a separate key list, filter it to keys that still have rows:

SELECT k.id FROM key_list k
WHERE k.id > :cursor
 AND EXISTS (SELECT 1 FROM child c WHERE c.fk = k.id) - skip cleared keys
ORDER BY k.id LIMIT :n;

The planner difference is the whole game. The EXISTS becomes a semi-join: with an index on child.fk it's an index-only probe per key (microseconds, no heap touch) or a merge semi-join between two sorted key streams - either way it skips millions of cleared keys at index speed instead of dragging each through a commit + throttle. Without an index on child.fk, that same EXISTS is a sequential scan of a billion-row table per probe - so the "fix" becomes a far worse disaster. Verify the index exists and check the EXPLAIN before trusting this at scale. Run EXPLAIN (ANALYZE, BUFFERS) on the real driver query against production-sized data; an Index Only Scan with single-digit heap fetches is what you want to see, not a Seq Scan.

7.3 Idempotency and audit

Record progress per (entity, table) in an audit table - rows found vs deleted, status, timestamp. Resume logic reads it to skip completed work; operators read it to answer "did it run, where is it, is it stuck." Re-running a drained table must be a clean no-op. Treat interruption as the normal path.

- ## 8. Locking and online DDL around the purge Bulk deletes themselves take only row locks, but the schema work around them (adding a PK to enable REPLICA IDENTITY DEFAULT, building an index for the EXISTS guard) needs care on a live table:
CREATE INDEX CONCURRENTLY - builds without an AccessExclusive lock (takes only ShareUpdateExclusive). Slower, online. Mandatory on hot tables.
SET NOT NULL without a full-table lock - the column-constraint route scans the table under AccessExclusive. On PG 12+ use the validated CHECK trick: add a CHECK (col IS NOT NULL) NOT VALID (instant, metadata-only), VALIDATE CONSTRAINT (online, ShareUpdateExclusive), then SET NOT NULL (now metadata-only because the validated check proves non-nullity), then drop the check.
lock_timeout + retry - wrap brief AccessExclusive operations (promoting a unique index to PK, switching replica identity) in a short lock_timeout with a retry loop, so they back off under contention instead of forming a lock queue that stalls the application.
idle_in_transaction_session_timeout - set it on the role running ad-hoc work near these tables, so an abandoned BEGIN in someone's SQL console can't pin your xmin horizon and restart_lsn for hours.
- ## 9. The escape hatch: don't delete at all The fastest delete is the one you never run. If the data you purge aligns with a time range or a tenant boundary, partition for deletion:
Range/list partition by the dimension you delete along (date, tenant, region).
Then deletion is ALTER TABLE … DETACH PARTITION (instant, metadata-only) followed by DROP TABLE on the detached partition - O(1), no dead tuples, no WAL flood, no vacuum debt, no replica strain. A DROP TABLE writes a trivial amount of WAL versus a billion row-delete records. Partitioning has its own costs (planning overhead, constraint exclusion, cross-partition uniqueness), and it only helps if your deletion axis matches a partition key - you can't partition by "arbitrary tenant scattered across an integer PK." But when it fits, it turns a week-long throttled purge into a millisecond DROP. Always ask this question first; the rest of this article is what you do when the answer is "we can't partition along that axis."
- ## 10. Running it as a background job The orchestration layer matters as much as the SQL:
Wall-clock deadline. Bound each run (MAX_RUN_HOURS); check it after every batch and at the top of every pause iteration; stop cleanly and let the next run resume. Bounds blast radius and prevents a run from colliding with the next scheduled one. Tip: make 0 an explicit "no deadline" sentinel for supervised runs, but guard every deadline comparison on > 0 first - otherwise MAX_RUN_HOURS=0 reads as "deadline is now" and aborts instantly.
No overlap. Before starting, assert no run is already in progress (advisory lock, or a "is the worker already running?" preflight). Two concurrent purges on the same tables double the WAL and contend on locks.
Dry-run mode. A preview pass that reports what would be deleted (counts per table) without deleting, so a misconfigured run fails safe.
Observability. Emit per-batch rows-deleted, both lag readings, WAL-per-batch, and vacuum events to your metrics pipeline. A dashboard of "WAL generated vs replica lag vs slot backlog" during a purge is worth more than any log.
- ## 11. The gotchas that cost a weekend
NULL = NULL is NULL, not true. Any anti-join or dedup on a nullable column silently mis-handles NULL rows on re-runs. Dedup on a guaranteed-non-null key.
Chunk only on a NOT NULL key. Slicing work by a nullable column corrupts your "next chunk" boundaries (NULLs sort last and get skipped or mangled). Confirm is_nullable = 'NO' in the catalog, not "it's always populated in practice."
Diagnostic queries match themselves. SELECT … FROM pg_stat_activity WHERE query LIKE '%vacuum%' finds itself. Add pid <> pg_backend_pid().
grep -q under set -o pipefail. grep -q exits on first match, SIGPIPEs the upstream producer (exit 141), and pipefail turns the pipeline non-zero - flipping your if ! logic. In shell glue, prefer pure-bash matching to printf … | grep -q. Stochastic and timing-dependent; it'll pass on your laptop and fail in the container.
Stale planner stats after a big delete. Row-count estimates go stale; the next query may flip from an index scan to a seq scan. ANALYZE the big tables after the purge.
Snapshot drift in restore paths. If a recovery/copy path runs alongside, never SELECT * between two databases that can drift in column set/order - compute the explicit column intersection.
- ## A pre-flight checklist
[ ] Can I partition-and-DROP instead of deleting? (Section 9 - ask first.)
[ ] Batched deletes, keyset cursor, per-batch COMMIT (procedure, not function)
[ ] Big tables have a real PK and REPLICA IDENTITY DEFAULT (not FULL)
[ ] Indexes on every EXISTS/join/chunk key - confirmed via EXPLAIN
[ ] Resume reads the target (or EXISTS-guards the key list); no static-list re-walk
[ ] Backpressure on both physical lag (sec) and logical slot backlog (bytes)
[ ] COMMIT before every pause-loop sleep; deadline escape hatch in each loop
[ ] Per-table autovacuum tuned for the window; mid-run VACUUM via separate connection
[ ] max_slot_wal_keep_size set; disk + per-slot retained-bytes alerts armed
[ ] Deadline, no-overlap guard, dry-run, audit table, metrics ## Closing A DELETE is a single line of SQL. A bulk delete on a live system is a distributed-systems problem wearing a SQL costume: you're modulating the rate at which you feed WAL, dead tuples, and decoded transactions into a web of replicas, slots, and vacuum workers - none of which you control, all of which can take production down if you outrun them. The mental model that makes it tractable: keep the xmin horizon moving (batch + commit), keep WAL recyclable (commit before you pause, don't write more than you must), and keep the slowest consumer in your control loop (measure both lags, pause without pinning). Everything else is detail. Delete slowly, commit often, watch your slots - and when you can, don't delete at all; drop a partition.

DEV Community: Shanthan Kondapalli

Deleting Millions of Rows in PostgreSQL Without Taking Down Production

MVCC, WAL, vacuum, and replication slots under sustained delete load - and how to delete billions of rows without your database noticing

1. What a `DELETE` actually costs

3.2 Vacuuming during the run, from inside a procedure

3.3 The horizon makes mid-run vacuum partly futile - unless you batch

3.4 Index bloat outlives table bloat

3.5 Wraparound, briefly

5.2 `restart_lsn`, `confirmed_flush_lsn`, and the open-transaction trap

5.3 Slot disk exhaustion is the actual failure mode

7.3 Idempotency and audit

DEV Community: Shanthan Kondapalli

Deleting Millions of Rows in PostgreSQL Without Taking Down Production

MVCC, WAL, vacuum, and replication slots under sustained delete load - and how to delete billions of rows without your database noticing

1. What a DELETE actually costs

3.2 Vacuuming during the run, from inside a procedure

3.3 The horizon makes mid-run vacuum partly futile - unless you batch

3.4 Index bloat outlives table bloat

3.5 Wraparound, briefly

5.2 restart_lsn, confirmed_flush_lsn, and the open-transaction trap

5.3 Slot disk exhaustion is the actual failure mode

7.3 Idempotency and audit

1. What a `DELETE` actually costs

5.2 `restart_lsn`, `confirmed_flush_lsn`, and the open-transaction trap