<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shanthan Kondapalli </title>
    <description>The latest articles on DEV Community by Shanthan Kondapalli  (@shanthan_kumar_3aa868cb94).</description>
    <link>https://dev.to/shanthan_kumar_3aa868cb94</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3984284%2F12911ebc-dbf5-44fa-9314-3fa9b7493882.png</url>
      <title>DEV Community: Shanthan Kondapalli </title>
      <link>https://dev.to/shanthan_kumar_3aa868cb94</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shanthan_kumar_3aa868cb94"/>
    <language>en</language>
    <item>
      <title>The Deep Mechanics of Online Bulk Deletion in PostgreSQL</title>
      <dc:creator>Shanthan Kondapalli </dc:creator>
      <pubDate>Sun, 14 Jun 2026 18:35:33 +0000</pubDate>
      <link>https://dev.to/shanthan_kumar_3aa868cb94/the-deep-mechanics-of-online-bulk-deletion-in-postgresql-3dhb</link>
      <guid>https://dev.to/shanthan_kumar_3aa868cb94/the-deep-mechanics-of-online-bulk-deletion-in-postgresql-3dhb</guid>
      <description>&lt;h3&gt;
  
  
  MVCC, WAL, vacuum, and replication slots under sustained delete load - and how to delete billions of rows without your database noticing
&lt;/h3&gt;

&lt;p&gt;Most "how to delete a lot of rows" articles stop at &lt;em&gt;"batch it and delete children before parents."&lt;/em&gt; That advice is correct, it's table stakes, and everyone already knows it. This article is about everything &lt;strong&gt;after&lt;/strong&gt; that - the parts that actually decide whether your cleanup runs quietly in the background for a week or pages you at 3 a.m. with a full disk and a replica that's six hours behind.&lt;br&gt;
The thesis: at scale, &lt;strong&gt;your &lt;code&gt;DELETE&lt;/code&gt; statement is the easy part.&lt;/strong&gt; The adversaries are the subsystems a delete &lt;em&gt;feeds&lt;/em&gt; - MVCC tuple versioning, the write-ahead log, autovacuum, and the replication machinery. Bulk deletion is really an exercise in &lt;strong&gt;flow control across those subsystems&lt;/strong&gt;. Get the SQL right and the systems wrong, and you'll still take production down.&lt;br&gt;
We'll assume PostgreSQL (the internals are PG-specific), a live OLTP primary with at least one physical replica and one or more logical/CDC consumers, and a target of hundreds of millions to billions of rows across many related tables.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The one paragraph of "basics," so we can move on: delete in dependency order (referencing rows before referenced rows); collect parent keys once; never rely on &lt;code&gt;ON DELETE CASCADE&lt;/code&gt; for huge deletes because you can't throttle a cascade. Done. Now the real material.&lt;/p&gt;
&lt;h2&gt;
  
  
  1. What a &lt;code&gt;DELETE&lt;/code&gt; actually costs
&lt;/h2&gt;

&lt;p&gt;A delete is not "remove a row." Under MVCC it's "mark a row version dead and write that fact everywhere." For each deleted tuple, PostgreSQL:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Sets &lt;code&gt;xmax&lt;/code&gt;&lt;/strong&gt; on the heap tuple to your transaction id. The row is still physically present; it becomes a &lt;em&gt;dead tuple&lt;/em&gt; once your transaction commits and no snapshot can still see it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Writes a WAL record&lt;/strong&gt; for the heap change. If this is the first modification of that page since the last checkpoint, it also writes a &lt;strong&gt;full-page image (FPI)&lt;/strong&gt; - potentially 8 KB of WAL for a single-row change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Touches every index.&lt;/strong&gt; Index entries aren't removed at delete time; they're left dangling and cleaned up later by vacuum. So a table with 12 indexes generates index-cleanup debt across all 12.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feeds logical decoding.&lt;/strong&gt; If a logical replication slot exists, the change is decoded from WAL and streamed to consumers - and how &lt;em&gt;much&lt;/em&gt; gets written depends on the table's &lt;code&gt;REPLICA IDENTITY&lt;/code&gt; (Section 5).
Multiply by a billion. The deleted rows don't shrink the table - they inflate it with dead tuples until vacuum reclaims them, and they inflate WAL, which every replica and slot must consume. &lt;strong&gt;The delete is cheap; the wake it leaves is expensive.&lt;/strong&gt; Everything below is about managing that wake.&lt;/li&gt;
&lt;li&gt;-
## 2. Batching is transaction-lifetime control, not just "smaller deletes"
People batch to avoid one long lock. True, but the deeper reason is &lt;strong&gt;snapshot lifetime&lt;/strong&gt;, and it's the single most important concept in this entire article.
A transaction holds a snapshot. That snapshot defines an &lt;strong&gt;xmin horizon&lt;/strong&gt; - the oldest transaction id still considered "in progress" from the system's point of view. Two brutal consequences follow, and they're the root cause of most bulk-delete disasters:&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vacuum cannot reclaim any dead tuple newer than the oldest snapshot's xmin.&lt;/strong&gt; A long-running transaction &lt;em&gt;anywhere&lt;/em&gt; in the database freezes the cleanup horizon. So if your delete runs as one multi-hour transaction, autovacuum literally cannot clean up the dead tuples you're producing - they pile up until you commit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logical replication slots cannot advance past an open transaction.&lt;/strong&gt; More on this in Section 6, but the headline is the same: a long-lived transaction pins WAL on disk.
So a billion-row delete in one transaction is a triple threat: a giant lock, an un-vacuumable mountain of dead tuples, and pinned WAL. &lt;strong&gt;Batching with per-batch &lt;code&gt;COMMIT&lt;/code&gt; is how you keep the xmin horizon moving&lt;/strong&gt; so vacuum and replication can do their jobs &lt;em&gt;while you work&lt;/em&gt;.
&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt; &lt;span class="n"&gt;that&lt;/span&gt; &lt;span class="n"&gt;matters&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tiny&lt;/span&gt; &lt;span class="n"&gt;transactions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;monotonic&lt;/span&gt; &lt;span class="k"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;durable&lt;/span&gt; &lt;span class="n"&gt;progress&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;
&lt;span class="n"&gt;LOOP&lt;/span&gt;
&lt;span class="err"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;v_hi&lt;/span&gt;
&lt;span class="err"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;owner&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;v_lo&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt;&lt;span class="err"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="err"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="n"&gt;EXIT&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;v_hi&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="err"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;owner&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;v_lo&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;v_hi&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="err"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="n"&gt;v_lo&lt;/span&gt;&lt;span class="err"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;v_hi&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="err"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="k"&gt;COMMIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;releases&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;advances&lt;/span&gt; &lt;span class="n"&gt;xmin&lt;/span&gt; &lt;span class="n"&gt;horizon&lt;/span&gt;
&lt;span class="err"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="n"&gt;PERFORM&lt;/span&gt; &lt;span class="n"&gt;dbo_backpressure_sleep&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;Section&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="n"&gt;LOOP&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Keyset pagination (&lt;code&gt;id &amp;gt; v_lo ORDER BY id LIMIT n&lt;/code&gt;) rides the primary-key index and is &lt;strong&gt;O(1) per batch regardless of depth&lt;/strong&gt;, unlike &lt;code&gt;OFFSET&lt;/code&gt; which re-scans everything it skips. The &lt;code&gt;COMMIT&lt;/code&gt; is doing far more than releasing locks - it's resetting your contribution to the global xmin horizon every few thousand rows.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;COMMIT&lt;/code&gt; inside a loop requires a &lt;strong&gt;procedure&lt;/strong&gt; (or anonymous &lt;code&gt;DO&lt;/code&gt; block). Functions run inside the caller's transaction and cannot commit. This distinction will matter again in Section 7.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;-
## 3. Dead tuples and the vacuum you must run yourself
Autovacuum is designed for steady-state churn, not for a deliberate billion-row purge. During a heavy delete you out-produce it, and three things go wrong: the table bloats, indexes bloat worse, and - if the run is long enough - you flirt with transaction-id wraparound pressure. You need to take vacuum into your own hands.
### 3.1 Why "just let autovacuum handle it" fails
Autovacuum triggers when dead tuples exceed &lt;code&gt;autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples&lt;/code&gt;. With the default &lt;code&gt;scale_factor = 0.2&lt;/code&gt;, a 2-billion-row table won't trigger autovacuum until &lt;strong&gt;~400 million&lt;/strong&gt; dead tuples have accumulated. By then you have catastrophic bloat and a vacuum that takes hours and competes with your delete for I/O.
For a purge window, &lt;strong&gt;tighten autovacuum on the target tables&lt;/strong&gt; for the duration, then reset:
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;big&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="err"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="n"&gt;autovacuum_vacuum_scale_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;005&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;fire&lt;/span&gt; &lt;span class="k"&gt;at&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;dead&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;
&lt;span class="err"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="n"&gt;autovacuum_vacuum_threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="err"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="n"&gt;autovacuum_vacuum_cost_delay&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;throttled&lt;/span&gt; &lt;span class="n"&gt;but&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="n"&gt;glacial&lt;/span&gt;
&lt;span class="err"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="n"&gt;autovacuum_vacuum_cost_limit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="n"&gt;budget&lt;/span&gt; &lt;span class="n"&gt;per&lt;/span&gt; &lt;span class="k"&gt;cycle&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="err"&gt; …&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;purge&lt;/span&gt;&lt;span class="err"&gt;&amp;nbsp;…&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;big&lt;/span&gt; &lt;span class="k"&gt;RESET&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;autovacuum_vacuum_scale_factor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="err"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="n"&gt;autovacuum_vacuum_threshold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;autovacuum_vacuum_cost_delay&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="err"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="n"&gt;autovacuum_vacuum_cost_limit&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  3.2 Vacuuming &lt;em&gt;during&lt;/em&gt; the run, from inside a procedure
&lt;/h3&gt;

&lt;p&gt;You also want explicit &lt;code&gt;VACUUM&lt;/code&gt; checkpoints - say every N million rows deleted from a given table. Two PostgreSQL facts make this awkward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;VACUUM&lt;/code&gt; cannot run inside a transaction block.&lt;/strong&gt; Your batched-delete procedure is always in a transaction, so you can't call it directly.&lt;/li&gt;
&lt;li&gt;The clean workaround is to run it over a &lt;strong&gt;separate connection&lt;/strong&gt;. &lt;code&gt;dblink&lt;/code&gt; is the idiomatic trick - &lt;code&gt;dblink_exec&lt;/code&gt; opens its own backend where &lt;code&gt;VACUUM&lt;/code&gt; is a legitimate top-level statement:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;PERFORM&lt;/span&gt; &lt;span class="n"&gt;dblink_exec&lt;/span&gt;&lt;span class="p"&gt;(:&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'VACUUM (ANALYZE) %I.%I'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="k"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="k"&gt;table&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;And the scripting-layer trap that catches everyone: &lt;strong&gt;&lt;code&gt;psql -c "stmt1; stmt2"&lt;/code&gt; wraps both statements in one implicit transaction&lt;/strong&gt;, so a &lt;code&gt;VACUUM&lt;/code&gt; after a &lt;code&gt;SET&lt;/code&gt; fails with &lt;em&gt;"VACUUM cannot run inside a transaction block."&lt;/em&gt; Use separate &lt;code&gt;-c&lt;/code&gt; flags - each becomes its own transaction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;psql &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"SET statement_timeout=0"&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"VACUUM ANALYZE big"&lt;/span&gt; &lt;span class="c"&gt;# works&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.3 The horizon makes mid-run vacuum partly futile - unless you batch
&lt;/h3&gt;

&lt;p&gt;Here's the payoff from Section 2: a &lt;code&gt;VACUUM&lt;/code&gt; can only remove tuples older than the &lt;strong&gt;global&lt;/strong&gt; oldest xmin. If your own delete is one long transaction, your mid-run vacuum reclaims almost nothing, because &lt;em&gt;your&lt;/em&gt; snapshot is the horizon. Only because you batch-and-commit does mid-run vacuum actually free space. Vacuum and batching are not independent tactics - &lt;strong&gt;batching is what makes vacuuming work at all.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3.4 Index bloat outlives table bloat
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;VACUUM&lt;/code&gt; makes dead space reusable but doesn't return it to the OS, and B-tree indexes bloat faster than heaps under massive deletes. After a large purge, estimate index bloat (the catalog/&lt;code&gt;pgstattuple&lt;/code&gt; estimators) and &lt;code&gt;REINDEX INDEX CONCURRENTLY&lt;/code&gt; the worst offenders - concurrently, so you don't take an &lt;code&gt;AccessExclusive&lt;/code&gt; lock on a live table. Don't forget TOAST tables for wide columns; they bloat too.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.5 Wraparound, briefly
&lt;/h3&gt;

&lt;p&gt;A days-long purge consumes transaction ids (one per batch commit, plus the deletes). It won't cause wraparound on its own, but combined with suppressed vacuuming it raises the table's age. Keep an eye on &lt;code&gt;age(relfrozenxid)&lt;/code&gt; for the big tables if your purge runs for many days; the tightened autovacuum above also drives freezing.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;-
## 4. WAL: the firehose you're pointing at your replicas
Every batch you delete becomes WAL, and WAL is the shared resource that couples your delete to &lt;em&gt;everything downstream&lt;/em&gt;. Two amplifiers turn a reasonable delete into a WAL flood.
### 4.1 Full-page writes and checkpoints
The first write to a page after a checkpoint emits a full 8 KB page image. A delete that dirties many pages right after a checkpoint can emit far more WAL than the logical change suggests. You can't avoid FPIs, but you can avoid making them worse: don't run a purge concurrently with aggressive checkpoint settings, and spread the load (throttle) so checkpoints amortize. Watch &lt;code&gt;pg_stat_bgwriter&lt;/code&gt; and the WAL generation rate.
### 4.2 &lt;code&gt;REPLICA IDENTITY FULL&lt;/code&gt; is a WAL multiplier
This one silently doubles or triples your WAL and most people never check it. For logical decoding, PostgreSQL must write enough of the &lt;em&gt;old&lt;/em&gt; row to identify it on the subscriber:&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;REPLICA IDENTITY DEFAULT&lt;/code&gt; → writes only the &lt;strong&gt;primary key&lt;/strong&gt; of the deleted row. Cheap.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;REPLICA IDENTITY FULL&lt;/code&gt; → writes the &lt;strong&gt;entire old row&lt;/strong&gt;, every column, to WAL for each delete. On a wide table, that's a multi-hundred-byte WAL record per deleted row instead of a few bytes.
Tables without a primary key are often set to &lt;code&gt;FULL&lt;/code&gt; (sometimes implicitly via tooling). If you're deleting a billion rows from such a table with a logical slot attached, you may write &lt;em&gt;an order of magnitude&lt;/em&gt; more WAL than necessary. &lt;strong&gt;Give big tables a real primary key and &lt;code&gt;REPLICA IDENTITY DEFAULT&lt;/code&gt; before a large purge.&lt;/strong&gt; Measure WAL per batch (&lt;code&gt;pg_current_wal_lsn()&lt;/code&gt; deltas) so amplification shows up immediately:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pg_wal_lsn_diff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_current_wal_lsn&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;&lt;span class="err"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;prev_lsn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;wal_bytes_this_batch&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;-
## 5. Replication slots: where bulk deletes go to die
This is the section that separates a clean purge from an outage. Your WAL has to be consumed by every replica and every replication slot before the primary can recycle it. If you produce WAL faster than they consume - or worse, if you &lt;em&gt;prevent&lt;/em&gt; them from advancing - WAL accumulates on the primary's disk until it fills, and a full WAL volume is a hard production stop.
### 5.1 Two kinds of lag, and one of them is invisible&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Physical replicas&lt;/strong&gt; expose &lt;code&gt;replay_lag&lt;/code&gt; (a time interval) in &lt;code&gt;pg_stat_replication&lt;/code&gt;. Easy to see.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logical slots&lt;/strong&gt; (CDC, decoding-based replication) do &lt;strong&gt;not&lt;/strong&gt; appear in &lt;code&gt;pg_stat_replication&lt;/code&gt;. They live in &lt;code&gt;pg_replication_slots&lt;/code&gt;, and their backlog is measured in &lt;strong&gt;bytes of retained WAL&lt;/strong&gt;: &lt;code&gt;pg_current_wal_lsn() - confirmed_flush_lsn&lt;/code&gt;. A stuck or slow logical consumer is the classic silent killer - it shows nothing in the obvious place while quietly pinning hundreds of gigabytes of WAL.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;physical&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;seconds&lt;/span&gt; &lt;span class="n"&gt;behind&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;epoch&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;replay_lag&lt;/span&gt;&lt;span class="p"&gt;))::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_replication&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;logical&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bytes&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;WAL&lt;/span&gt; &lt;span class="n"&gt;retained&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CHECK&lt;/span&gt; &lt;span class="n"&gt;THIS&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="s1"&gt;'s the one that bites)
SELECT coalesce(max(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)),0)
FROM pg_replication_slots WHERE slot_type='&lt;/span&gt;&lt;span class="n"&gt;logical&lt;/span&gt;&lt;span class="s1"&gt;' AND active;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5.2 &lt;code&gt;restart_lsn&lt;/code&gt;, &lt;code&gt;confirmed_flush_lsn&lt;/code&gt;, and the open-transaction trap
&lt;/h3&gt;

&lt;p&gt;A logical slot tracks two LSNs: &lt;code&gt;confirmed_flush_lsn&lt;/code&gt; (what the consumer has acknowledged) and &lt;code&gt;restart_lsn&lt;/code&gt; (the oldest WAL the slot might still need if decoding restarts). The primary can only recycle WAL older than the &lt;strong&gt;minimum &lt;code&gt;restart_lsn&lt;/code&gt; across all slots&lt;/strong&gt;.&lt;br&gt;
Now the trap. &lt;strong&gt;&lt;code&gt;restart_lsn&lt;/code&gt; cannot advance past the start of the oldest in-progress transaction&lt;/strong&gt;, because logical decoding emits whole transactions at commit and may need to re-read from a transaction's beginning. So an open transaction does two terrible things at once:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It pins the &lt;strong&gt;xmin horizon&lt;/strong&gt; → vacuum can't clean up (Section 3).&lt;/li&gt;
&lt;li&gt;It pins &lt;strong&gt;&lt;code&gt;restart_lsn&lt;/code&gt;&lt;/strong&gt; → the primary retains all WAL written since that transaction began, by &lt;em&gt;every&lt;/em&gt; session.
This is why a pause loop that &lt;strong&gt;sleeps inside an open transaction is actively harmful&lt;/strong&gt;: you stop deleting to let consumers catch up, but your idle transaction's snapshot keeps &lt;code&gt;restart_lsn&lt;/code&gt; frozen, so the WAL backlog &lt;em&gt;grows&lt;/em&gt; during the pause instead of draining. You engineered a backpressure mechanism that makes backpressure worse.
The fix is one line, and it's the whole reason this needs to be a procedure:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;WHILE&lt;/span&gt; &lt;span class="n"&gt;lag_exceeds_threshold&lt;/span&gt; &lt;span class="n"&gt;LOOP&lt;/span&gt;
&lt;span class="err"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="n"&gt;EXIT&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;deadline_passed&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="k"&gt;escape&lt;/span&gt; &lt;span class="n"&gt;hatch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Section&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;
&lt;span class="err"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="k"&gt;COMMIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;release&lt;/span&gt; &lt;span class="n"&gt;snapshot&lt;/span&gt; &lt;span class="k"&gt;BEFORE&lt;/span&gt; &lt;span class="n"&gt;sleeping&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;restart_lsn&lt;/span&gt; &lt;span class="n"&gt;can&lt;/span&gt; &lt;span class="n"&gt;advance&lt;/span&gt;
&lt;span class="err"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="n"&gt;PERFORM&lt;/span&gt; &lt;span class="n"&gt;pg_sleep&lt;/span&gt;&lt;span class="p"&gt;(:&lt;/span&gt;&lt;span class="k"&gt;recheck&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;measure&lt;/span&gt; &lt;span class="n"&gt;lag&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="n"&gt;LOOP&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Commit, &lt;em&gt;then&lt;/em&gt; sleep. Each pause iteration becomes a sub-10-second transaction, and &lt;code&gt;restart_lsn&lt;/code&gt; is free to chase the consumers while you wait. There's also &lt;code&gt;catalog_xmin&lt;/code&gt; on logical slots that pins catalog vacuuming similarly - same fix, same reasoning.&lt;/p&gt;
&lt;h3&gt;
  
  
  5.3 Slot disk exhaustion is the actual failure mode
&lt;/h3&gt;

&lt;p&gt;The incident isn't "replication is slow." It's "the primary's disk filled with retained WAL and the database stopped accepting writes." Set &lt;code&gt;max_slot_wal_keep_size&lt;/code&gt; (PG 13+) as a safety valve so a dead slot gets invalidated rather than taking the primary down - and alert on per-slot retained bytes and on replication-instance disk headroom. A slow consumer should degrade your purge, never your primary.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;-
## 6. Designing the backpressure controller
Put Sections 4–5 together into an explicit control loop that runs between batches. It's a feedback controller: measure lag, decide, act.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;after&lt;/span&gt; &lt;span class="n"&gt;each&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="n"&gt;COMMIT&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="err"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;baseline_throttle&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# steady-state pacing
&lt;/span&gt;&lt;span class="n"&gt;loop&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# physical replica gate
&lt;/span&gt;&lt;span class="err"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="n"&gt;lag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;replay_lag&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="err"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;lag&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;ALARM&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;operator&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# loud, no auto-abort
&lt;/span&gt;&lt;span class="err"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;lag&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;PAUSE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;break&lt;/span&gt;
&lt;span class="err"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="n"&gt;COMMIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;recheck&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="k"&gt;continue&lt;/span&gt; &lt;span class="c1"&gt;# pause WITHOUT pinning WAL
&lt;/span&gt;&lt;span class="n"&gt;loop&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# logical slot gate (separate!)
&lt;/span&gt;&lt;span class="err"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="n"&gt;backlog&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wal_retained_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="err"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;backlog&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;ALARM&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;operator&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="err"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;backlog&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;PAUSE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;break&lt;/span&gt;
&lt;span class="err"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="n"&gt;COMMIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;recheck&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="k"&gt;continue&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Design notes that matter in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Two independent gates.&lt;/strong&gt; Physical lag is in &lt;em&gt;seconds&lt;/em&gt;; logical backlog is in &lt;em&gt;bytes&lt;/em&gt;. They fail differently and need different thresholds. Checking only &lt;code&gt;pg_stat_replication&lt;/code&gt; (the common mistake) leaves the logical-slot killer completely unmonitored.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generous physical thresholds.&lt;/strong&gt; Cross-region async replicas &lt;em&gt;naturally&lt;/em&gt; lag minutes under write load. Pausing too aggressively makes your purge crawl while protecting nothing real. Pause at minutes, not seconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tight logical thresholds.&lt;/strong&gt; A logical slot growing past ~1 GB retained is a genuine signal - a consumer is falling behind and disk is at risk. Pause early here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A deadline escape hatch in every pause loop.&lt;/strong&gt; A &lt;em&gt;dead&lt;/em&gt; consumer never recovers, so an unconditional "wait until lag clears" loop waits forever and bypasses your run deadline. Every pause must also exit when the run's wall-clock deadline passes (Section 7) - otherwise one stuck slot hangs your job indefinitely. (Note the tension with continuous/no-deadline runs: if you disable the deadline, you also disable this escape hatch, so a stuck consumer pauses you forever. Only run deadline-less under supervision.)&lt;/li&gt;
&lt;li&gt;-
## 7. Resumability at scale, and a planner trap that masquerades as a hang
Long purges get interrupted - deploys, deadlines, crashes. The job must resume and make forward progress. The non-obvious failure here isn't crashing; it's a resume pattern that runs forever doing nothing.
### 7.1 The re-walk anti-pattern
Suppose you precompute a key list (millions of parent keys) and drive child-table deletes by paging through that list with a cursor. First run: fine. But the key list is &lt;strong&gt;static&lt;/strong&gt; while the child table &lt;strong&gt;shrinks&lt;/strong&gt; as you delete. On resume the cursor resets to the start and re-walks the entire already-cleared prefix - issuing a delete-commit-throttle cycle per key that matches &lt;em&gt;nothing&lt;/em&gt;.
If the surviving rows sit at the high end of the key space, a single pass spends its whole budget re-walking empties and &lt;strong&gt;times out before reaching live data&lt;/strong&gt;. Every restart repeats it. Net progress: zero. The job can run for days "deleting," and delete nothing. It looks like a hang; it's actually an algorithmic mismatch between a static driver and a shrinking target.
### 7.2 Make the driver track the work, not the original plan
Two fixes, both of which let a fresh run jump straight to remaining work:
&lt;strong&gt;Self-referential&lt;/strong&gt; - page the &lt;em&gt;target table itself&lt;/em&gt;. Deleted rows are gone, so &lt;code&gt;id &amp;gt; cursor&lt;/code&gt; from a cold start returns only survivors:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;child&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;owner&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="err"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="k"&gt;cursor&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt;&lt;span class="err"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;&lt;code&gt;EXISTS&lt;/code&gt;-guarded driver&lt;/strong&gt; - if you must keep a separate key list, filter it to keys that still have rows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;key_list&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="err"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="k"&gt;cursor&lt;/span&gt;
&lt;span class="err"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;child&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="err"&gt; &lt;/span&gt;&lt;span class="n"&gt;skip&lt;/span&gt; &lt;span class="n"&gt;cleared&lt;/span&gt; &lt;span class="n"&gt;keys&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt;&lt;span class="err"&gt;&amp;nbsp;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The planner difference is the whole game. The &lt;code&gt;EXISTS&lt;/code&gt; becomes a semi-join: with an index on &lt;code&gt;child.fk&lt;/code&gt; it's an &lt;strong&gt;index-only probe per key&lt;/strong&gt; (microseconds, no heap touch) or a &lt;strong&gt;merge semi-join&lt;/strong&gt; between two sorted key streams - either way it skips millions of cleared keys at index speed instead of dragging each through a commit + throttle. Without an index on &lt;code&gt;child.fk&lt;/code&gt;, that same &lt;code&gt;EXISTS&lt;/code&gt; is a sequential scan of a billion-row table &lt;em&gt;per probe&lt;/em&gt; - so the "fix" becomes a far worse disaster. &lt;strong&gt;Verify the index exists and check the &lt;code&gt;EXPLAIN&lt;/code&gt; before trusting this at scale.&lt;/strong&gt; Run &lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/code&gt; on the real driver query against production-sized data; an &lt;code&gt;Index Only Scan&lt;/code&gt; with single-digit heap fetches is what you want to see, not a &lt;code&gt;Seq Scan&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  7.3 Idempotency and audit
&lt;/h3&gt;

&lt;p&gt;Record progress per (entity, table) in an audit table - rows found vs deleted, status, timestamp. Resume logic reads it to skip completed work; operators read it to answer "did it run, where is it, is it stuck." Re-running a drained table must be a clean no-op. Treat interruption as the normal path.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;-
## 8. Locking and online DDL around the purge
Bulk deletes themselves take only row locks, but the schema work around them (adding a PK to enable &lt;code&gt;REPLICA IDENTITY DEFAULT&lt;/code&gt;, building an index for the &lt;code&gt;EXISTS&lt;/code&gt; guard) needs care on a live table:&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt;&lt;/strong&gt; - builds without an &lt;code&gt;AccessExclusive&lt;/code&gt; lock (takes only &lt;code&gt;ShareUpdateExclusive&lt;/code&gt;). Slower, online. Mandatory on hot tables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SET NOT NULL&lt;/code&gt; without a full-table lock&lt;/strong&gt; - the column-constraint route scans the table under &lt;code&gt;AccessExclusive&lt;/code&gt;. On PG 12+ use the &lt;em&gt;validated CHECK&lt;/em&gt; trick: add a &lt;code&gt;CHECK (col IS NOT NULL) NOT VALID&lt;/code&gt; (instant, metadata-only), &lt;code&gt;VALIDATE CONSTRAINT&lt;/code&gt; (online, &lt;code&gt;ShareUpdateExclusive&lt;/code&gt;), then &lt;code&gt;SET NOT NULL&lt;/code&gt; (now metadata-only because the validated check proves non-nullity), then drop the check.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;lock_timeout&lt;/code&gt; + retry&lt;/strong&gt; - wrap brief &lt;code&gt;AccessExclusive&lt;/code&gt; operations (promoting a unique index to PK, switching replica identity) in a short &lt;code&gt;lock_timeout&lt;/code&gt; with a retry loop, so they back off under contention instead of forming a lock queue that stalls the application.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;idle_in_transaction_session_timeout&lt;/code&gt;&lt;/strong&gt; - set it on the role running ad-hoc work near these tables, so an abandoned &lt;code&gt;BEGIN&lt;/code&gt; in someone's SQL console can't pin your xmin horizon and &lt;code&gt;restart_lsn&lt;/code&gt; for hours.&lt;/li&gt;
&lt;li&gt;-
## 9. The escape hatch: don't delete at all
The fastest delete is the one you never run. If the data you purge aligns with a time range or a tenant boundary, &lt;strong&gt;partition for deletion&lt;/strong&gt;:&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Range/list partition&lt;/strong&gt; by the dimension you delete along (date, tenant, region).&lt;/li&gt;
&lt;li&gt;Then deletion is &lt;code&gt;ALTER TABLE&amp;nbsp;… DETACH PARTITION&lt;/code&gt; (instant, metadata-only) followed by &lt;code&gt;DROP TABLE&lt;/code&gt; on the detached partition - &lt;strong&gt;O(1), no dead tuples, no WAL flood, no vacuum debt, no replica strain.&lt;/strong&gt; A &lt;code&gt;DROP TABLE&lt;/code&gt; writes a trivial amount of WAL versus a billion row-delete records.
Partitioning has its own costs (planning overhead, constraint exclusion, cross-partition uniqueness), and it only helps if your deletion axis matches a partition key - you can't partition by "arbitrary tenant scattered across an integer PK." But when it fits, it turns a week-long throttled purge into a millisecond &lt;code&gt;DROP&lt;/code&gt;. Always ask this question first; the rest of this article is what you do when the answer is "we can't partition along that axis."&lt;/li&gt;
&lt;li&gt;-
## 10. Running it as a background job
The orchestration layer matters as much as the SQL:&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wall-clock deadline.&lt;/strong&gt; Bound each run (&lt;code&gt;MAX_RUN_HOURS&lt;/code&gt;); check it after every batch and at the top of every pause iteration; stop cleanly and let the next run resume. Bounds blast radius and prevents a run from colliding with the next scheduled one. Tip: make &lt;code&gt;0&lt;/code&gt; an explicit "no deadline" sentinel for supervised runs, but guard every deadline comparison on &lt;code&gt;&amp;gt; 0&lt;/code&gt; first - otherwise &lt;code&gt;MAX_RUN_HOURS=0&lt;/code&gt; reads as "deadline is now" and aborts instantly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No overlap.&lt;/strong&gt; Before starting, assert no run is already in progress (advisory lock, or a "is the worker already running?" preflight). Two concurrent purges on the same tables double the WAL and contend on locks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dry-run mode.&lt;/strong&gt; A preview pass that reports &lt;em&gt;what would be deleted&lt;/em&gt; (counts per table) without deleting, so a misconfigured run fails safe.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability.&lt;/strong&gt; Emit per-batch rows-deleted, both lag readings, WAL-per-batch, and vacuum events to your metrics pipeline. A dashboard of "WAL generated vs replica lag vs slot backlog" during a purge is worth more than any log.&lt;/li&gt;
&lt;li&gt;-
## 11. The gotchas that cost a weekend&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;NULL = NULL&lt;/code&gt; is &lt;code&gt;NULL&lt;/code&gt;, not &lt;code&gt;true&lt;/code&gt;.&lt;/strong&gt; Any anti-join or dedup on a nullable column silently mis-handles NULL rows on re-runs. Dedup on a guaranteed-non-null key.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chunk only on a &lt;code&gt;NOT NULL&lt;/code&gt; key.&lt;/strong&gt; Slicing work by a nullable column corrupts your "next chunk" boundaries (NULLs sort last and get skipped or mangled). Confirm &lt;code&gt;is_nullable = 'NO'&lt;/code&gt; in the catalog, not "it's always populated in practice."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diagnostic queries match themselves.&lt;/strong&gt; &lt;code&gt;SELECT&amp;nbsp;… FROM pg_stat_activity WHERE query LIKE '%vacuum%'&lt;/code&gt; finds &lt;em&gt;itself&lt;/em&gt;. Add &lt;code&gt;pid &amp;lt;&amp;gt; pg_backend_pid()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;grep -q&lt;/code&gt; under &lt;code&gt;set -o pipefail&lt;/code&gt;.&lt;/strong&gt; &lt;code&gt;grep -q&lt;/code&gt; exits on first match, SIGPIPEs the upstream producer (exit 141), and &lt;code&gt;pipefail&lt;/code&gt; turns the pipeline non-zero - flipping your &lt;code&gt;if&amp;nbsp;!&lt;/code&gt; logic. In shell glue, prefer pure-bash matching to &lt;code&gt;printf&amp;nbsp;… | grep -q&lt;/code&gt;. Stochastic and timing-dependent; it'll pass on your laptop and fail in the container.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stale planner stats after a big delete.&lt;/strong&gt; Row-count estimates go stale; the &lt;em&gt;next&lt;/em&gt; query may flip from an index scan to a seq scan. &lt;code&gt;ANALYZE&lt;/code&gt; the big tables after the purge.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snapshot drift in restore paths.&lt;/strong&gt; If a recovery/copy path runs alongside, never &lt;code&gt;SELECT *&lt;/code&gt; between two databases that can drift in column set/order - compute the explicit column intersection.&lt;/li&gt;
&lt;li&gt;-
## A pre-flight checklist&lt;/li&gt;
&lt;li&gt;[ ] Can I partition-and-&lt;code&gt;DROP&lt;/code&gt; instead of deleting? (Section 9 - ask first.)&lt;/li&gt;
&lt;li&gt;[ ] Batched deletes, keyset cursor, per-batch &lt;code&gt;COMMIT&lt;/code&gt; (procedure, not function)&lt;/li&gt;
&lt;li&gt;[ ] Big tables have a real PK and &lt;code&gt;REPLICA IDENTITY DEFAULT&lt;/code&gt; (not &lt;code&gt;FULL&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;[ ] Indexes on every &lt;code&gt;EXISTS&lt;/code&gt;/join/chunk key - confirmed via &lt;code&gt;EXPLAIN&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;[ ] Resume reads the &lt;em&gt;target&lt;/em&gt; (or &lt;code&gt;EXISTS&lt;/code&gt;-guards the key list); no static-list re-walk&lt;/li&gt;
&lt;li&gt;[ ] Backpressure on &lt;strong&gt;both&lt;/strong&gt; physical lag (sec) and logical slot backlog (bytes)&lt;/li&gt;
&lt;li&gt;[ ] &lt;code&gt;COMMIT&lt;/code&gt; before every pause-loop &lt;code&gt;sleep&lt;/code&gt;; deadline escape hatch in each loop&lt;/li&gt;
&lt;li&gt;[ ] Per-table autovacuum tuned for the window; mid-run &lt;code&gt;VACUUM&lt;/code&gt; via separate connection&lt;/li&gt;
&lt;li&gt;[ ] &lt;code&gt;max_slot_wal_keep_size&lt;/code&gt; set; disk + per-slot retained-bytes alerts armed&lt;/li&gt;
&lt;li&gt;[ ] Deadline, no-overlap guard, dry-run, audit table, metrics
## Closing
A &lt;code&gt;DELETE&lt;/code&gt; is a single line of SQL. A &lt;em&gt;bulk&lt;/em&gt; delete on a live system is a distributed-systems problem wearing a SQL costume: you're modulating the rate at which you feed WAL, dead tuples, and decoded transactions into a web of replicas, slots, and vacuum workers - none of which you control, all of which can take production down if you outrun them.
The mental model that makes it tractable: &lt;strong&gt;keep the xmin horizon moving (batch + commit), keep WAL recyclable (commit before you pause, don't write more than you must), and keep the slowest consumer in your control loop (measure both lags, pause without pinning).&lt;/strong&gt; Everything else is detail. Delete slowly, commit often, watch your slots - and when you can, don't delete at all; drop a partition.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>database</category>
      <category>performance</category>
      <category>postgres</category>
      <category>sql</category>
    </item>
  </channel>
</rss>
