<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Philip McClarence</title>
    <description>The latest articles on DEV Community by Philip McClarence (@philip_mcclarence_2ef9475).</description>
    <link>https://dev.to/philip_mcclarence_2ef9475</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2690053%2F913499a1-620d-4487-a868-d677f1aca106.png</url>
      <title>DEV Community: Philip McClarence</title>
      <link>https://dev.to/philip_mcclarence_2ef9475</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/philip_mcclarence_2ef9475"/>
    <language>en</language>
    <item>
      <title>PostgreSQL Query Rewriting Techniques</title>
      <dc:creator>Philip McClarence</dc:creator>
      <pubDate>Mon, 04 May 2026 14:00:07 +0000</pubDate>
      <link>https://dev.to/philip_mcclarence_2ef9475/postgresql-query-rewriting-techniques-50pe</link>
      <guid>https://dev.to/philip_mcclarence_2ef9475/postgresql-query-rewriting-techniques-50pe</guid>
      <description>&lt;h1&gt;
  
  
  PostgreSQL Query Rewriting Techniques
&lt;/h1&gt;

&lt;p&gt;The previous articles in this series covered performance problems you fix by adding indexes, restructuring joins, or tuning memory. This one is about the queries where the plan is "fine" — every node is doing something reasonable — but the query itself is asking the wrong question, producing unnecessarily large intermediate results or forcing the planner down a path that a different SQL shape would avoid.&lt;/p&gt;

&lt;p&gt;These rewrites don't change what the query returns. They change how PostgreSQL goes about computing it. Once you learn to recognise the patterns, most of them are mechanical (if the original form matches X, rewrite to Y), and the performance improvement is often an order of magnitude or more, usually with little or no downside.&lt;/p&gt;

&lt;p&gt;This article is the seventh in the &lt;a href="https://mydba.dev/blog/postgres-query-analysis-complete-guide" rel="noopener noreferrer"&gt;Complete Guide to PostgreSQL SQL Query Analysis &amp;amp; Optimization&lt;/a&gt; series. Every EXPLAIN block below is captured from the same Neon Postgres 17.8 database used throughout.&lt;/p&gt;

&lt;h2&gt;
  
  
  Offset pagination → keyset pagination
&lt;/h2&gt;

&lt;p&gt;The single highest-impact rewrite in this article. &lt;code&gt;OFFSET N LIMIT M&lt;/code&gt; is the default pagination shape in most ORMs and REST API frameworks. It's also a performance landmine as soon as users deep-paginate. To return page 1,000 of a 500,000-row table at 20 rows per page, PostgreSQL must read and discard 19,980 rows before returning the 20 you want. Page 1 is fast; page 1,000 is slow; page 10,000 is a disaster.&lt;/p&gt;

&lt;p&gt;Captured against our 500,000-row &lt;code&gt;sim_bp_orders&lt;/code&gt; table, requesting page 24,001 of 25,000 at 20 orders per page:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="k"&gt;OFFSET&lt;/span&gt; &lt;span class="mi"&gt;480000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Limit  (cost=28511.34..28512.53 rows=20 width=16) (actual time=1900.713..1900.731 rows=20 loops=1)
  Buffers: shared hit=481693 read=1775
  -&amp;gt;  Index Scan Backward using idx_sim_bp_orders_created_at
        (actual time=0.018..1878.609 rows=480020 loops=1)
 Execution Time: 1900.750 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;1.9 seconds for 20 rows. The &lt;code&gt;Index Scan Backward&lt;/code&gt; returns rows=480020 before the &lt;code&gt;Limit&lt;/code&gt; takes 20: PostgreSQL walked the &lt;code&gt;created_at&lt;/code&gt; index backwards, visited every heap tuple for visibility checks, and discarded 99.996% of them. &lt;code&gt;Buffers: shared hit=481693 read=1775&lt;/code&gt; is 483,468 buffer accesses, roughly 3.7 GiB of page traffic at the default 8 kB block size, for a result the size of a tweet.&lt;/p&gt;

&lt;p&gt;The fix is &lt;strong&gt;keyset pagination&lt;/strong&gt; — instead of &lt;code&gt;OFFSET 480000&lt;/code&gt;, remember the cursor value of the last row you returned and ask for rows &lt;em&gt;less than&lt;/em&gt; that:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Pass the created_at value from the last row of the previous page.&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="s1"&gt;'2024-03-01'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Limit  (actual time=0.979..1.014 rows=20 loops=1)
  Buffers: shared hit=22 read=1
  -&amp;gt;  Index Scan Backward using idx_sim_bp_orders_created_at
        Index Cond: (created_at &amp;lt; '2024-03-01'::timestamptz)
        (actual time=0.978..1.010 rows=20 loops=1)
 Execution Time: 1.032 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;1 ms, 23 buffers touched (22 hit, 1 read).&lt;/strong&gt; The &lt;code&gt;Index Cond&lt;/code&gt; means the planner could start the index scan from the cursor position rather than the beginning: no discarded rows, no wasted buffer reads. Page 1 and page 10,000 cost essentially the same.&lt;/p&gt;

&lt;p&gt;Three things to know about keyset pagination:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Use a composite cursor for uniqueness.&lt;/strong&gt; &lt;code&gt;ORDER BY created_at DESC&lt;/code&gt; isn't a deterministic total order unless &lt;code&gt;created_at&lt;/code&gt; is unique. For production systems, use &lt;code&gt;(created_at, id)&lt;/code&gt; or similar: &lt;code&gt;WHERE (created_at, order_id) &amp;lt; ('2024-03-01 14:22:00+00', 984523) ORDER BY created_at DESC, order_id DESC LIMIT 20&lt;/code&gt;. This ensures no rows are skipped or duplicated at page boundaries when multiple rows share the same timestamp.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The index has to match the sort.&lt;/strong&gt; &lt;code&gt;ORDER BY created_at DESC, order_id DESC&lt;/code&gt; works against &lt;code&gt;(created_at DESC, order_id DESC)&lt;/code&gt; directly or &lt;code&gt;(created_at, order_id)&lt;/code&gt; read backwards. A mismatch forces an explicit Sort node over every qualifying row, which undoes the keyset win.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You give up random-access "jump to page N" semantics.&lt;/strong&gt; Keyset pagination is forward/backward through an ordered stream. Most APIs and infinite-scroll UIs don't actually need random access; if yours does, you're stuck with OFFSET (or need a completely different data model).&lt;/li&gt;
&lt;/ol&gt;
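
&lt;p&gt;Points 1 and 2 can be combined into a sketch like the following. The index name &lt;code&gt;idx_sim_bp_orders_created_at_order_id&lt;/code&gt; and the &lt;code&gt;$1&lt;/code&gt;/&lt;code&gt;$2&lt;/code&gt; parameters are illustrative assumptions, not objects captured from the test database, so confirm the resulting plan with EXPLAIN on your own data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical two-column index; read backwards it serves
-- ORDER BY created_at DESC, order_id DESC without a sort.
CREATE INDEX IF NOT EXISTS idx_sim_bp_orders_created_at_order_id
    ON sim_bp_orders (created_at, order_id);

-- $1 and $2 are the created_at and order_id of the last row on the previous page.
SELECT order_id, user_id, created_at
FROM sim_bp_orders
WHERE (created_at, order_id) &amp;lt; ($1, $2)
ORDER BY created_at DESC, order_id DESC
LIMIT 20;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The row comparison &lt;code&gt;(created_at, order_id) &amp;lt; ($1, $2)&lt;/code&gt; is evaluated left to right, so &lt;code&gt;order_id&lt;/code&gt; only breaks ties between rows that share a timestamp.&lt;/p&gt;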

&lt;h2&gt;
  
  
  Correlated scalar subquery → aggregating JOIN
&lt;/h2&gt;

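&lt;p&gt;A scalar subquery in the SELECT list runs once per outer row (&lt;code&gt;SubPlan N&lt;/code&gt; in the plan). Total cost is the number of outer rows multiplied by the cost of one subquery execution, so it degrades quickly as the outer set grows and approaches O(n²) when each execution has to scan the inner table. The rewrite is a LEFT JOIN to a pre-aggregated subquery or CTE:&lt;br&gt;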
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Before: SubPlan runs once per user.&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
        &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pending'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;pending_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- After: single aggregation, left-joined.&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pending_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;pending_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;pending_count&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pending'&lt;/span&gt;
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



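&lt;p&gt;The rewrite computes all per-user counts in a single aggregating scan over &lt;code&gt;sim_bp_orders&lt;/code&gt;, then joins them against users. On large outer sets (say, all 200k active users rather than a small filtered subset), the rewrite is usually 20-100× faster because the aggregation happens once rather than 200,000 times. For a small outer set, the correlated form with an index on &lt;code&gt;user_id&lt;/code&gt; can still win, because the join version aggregates every pending order regardless of how many users are returned.&lt;/p&gt;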

&lt;p&gt;For "top-N related rows per outer" (not just count), use &lt;code&gt;LATERAL JOIN&lt;/code&gt; with &lt;code&gt;LIMIT N&lt;/code&gt;.&lt;/p&gt;
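
&lt;p&gt;As a rough sketch of that LATERAL shape, using only the columns that appear in the queries above (the value 3 for N is an arbitrary choice for illustration):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Latest 3 orders for each active user.
-- LEFT JOIN ... ON true keeps users that have no orders at all.
SELECT u.user_id, u.email, o.order_id, o.created_at
FROM sim_bp_users u
LEFT JOIN LATERAL (
    SELECT order_id, created_at
    FROM sim_bp_orders
    WHERE user_id = u.user_id
    ORDER BY created_at DESC
    LIMIT 3
) o ON true
WHERE u.status = 'active';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The subquery still runs once per outer row, so this shape assumes an index on &lt;code&gt;(user_id, created_at)&lt;/code&gt; that lets each execution stop after N index entries.&lt;/p&gt;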

&lt;h2&gt;
  
  
  &lt;code&gt;NOT IN&lt;/code&gt; → &lt;code&gt;NOT EXISTS&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;The most insidious bug in SQL, bar none. &lt;code&gt;NOT IN&lt;/code&gt; returns no rows whenever the inner set contains a single NULL, because &lt;code&gt;x NOT IN (a, b, NULL)&lt;/code&gt; evaluates to &lt;code&gt;x &amp;lt;&amp;gt; a AND x &amp;lt;&amp;gt; b AND x &amp;lt;&amp;gt; NULL&lt;/code&gt;, and &lt;code&gt;x &amp;lt;&amp;gt; NULL&lt;/code&gt; is unknown, making the whole AND evaluate to unknown (not-true, hence excluded).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- If any user in the inner query has a NULL email, this returns empty.&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;unsubscribed_users&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Correct, NULL-safe equivalent:&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;unsubscribed_users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



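&lt;p&gt;&lt;code&gt;NOT EXISTS&lt;/code&gt; only asks whether a matching row exists, so a NULL in the inner set simply fails to match instead of turning the whole predicate unknown. The two forms also usually produce different plans: &lt;code&gt;NOT EXISTS&lt;/code&gt; typically becomes an Anti Join (&lt;code&gt;Hash Anti Join&lt;/code&gt; or &lt;code&gt;Nested Loop Anti Join&lt;/code&gt;), which PostgreSQL executes about as cheaply as a regular join. &lt;code&gt;NOT IN&lt;/code&gt; against a subquery is not converted to an anti join, because of its NULL semantics; it runs as a &lt;code&gt;SubPlan&lt;/code&gt; filter, hashed when the inner result fits in &lt;code&gt;work_mem&lt;/code&gt; and re-evaluated per outer row when it doesn't, and that is usually slower.&lt;/p&gt;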

&lt;p&gt;Rule: never write &lt;code&gt;NOT IN&lt;/code&gt; against a subquery unless you've confirmed the compared column is &lt;code&gt;NOT NULL&lt;/code&gt; at the schema level. In production code, just default to &lt;code&gt;NOT EXISTS&lt;/code&gt;.&lt;/p&gt;
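
&lt;p&gt;The NULL behaviour is easy to reproduce without any tables. This minimal sketch uses an inline &lt;code&gt;VALUES&lt;/code&gt; list (the alias &lt;code&gt;t(x)&lt;/code&gt; is made up for the example) as the inner set:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- 2 NOT IN (1, NULL) is "2 &amp;lt;&amp;gt; 1 AND 2 &amp;lt;&amp;gt; NULL", i.e. true AND unknown = unknown.
-- The WHERE clause rejects unknown, so this returns zero rows.
SELECT 'kept' AS result
WHERE 2 NOT IN (SELECT x FROM (VALUES (1), (NULL)) AS t(x));

-- No inner row satisfies t.x = 2, so NOT EXISTS is true and one row comes back.
SELECT 'kept' AS result
WHERE NOT EXISTS (
    SELECT 1 FROM (VALUES (1), (NULL)) AS t(x) WHERE t.x = 2
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;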

&lt;h2&gt;
  
  
  &lt;code&gt;DISTINCT&lt;/code&gt; → &lt;code&gt;GROUP BY&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;SELECT DISTINCT&lt;/code&gt; tells PostgreSQL to deduplicate the output; &lt;code&gt;GROUP BY&lt;/code&gt; on the same columns does the same thing. When the only goal is deduplication (no aggregate functions), the two are equivalent, and the planner usually produces the same plan for each. But &lt;code&gt;GROUP BY&lt;/code&gt; is strictly more flexible — it composes with &lt;code&gt;HAVING&lt;/code&gt;, plays nicely with window functions, and handles expressions more cleanly.&lt;/p&gt;

&lt;p&gt;The rewrite that actually matters is when &lt;code&gt;DISTINCT&lt;/code&gt; is used in a query shape that's really asking for something else. "The first order per user" is often written as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Wrong: this gets any order, not the first.&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;DISTINCT ON (user_id)&lt;/code&gt; returns one row per &lt;code&gt;user_id&lt;/code&gt;, but &lt;em&gt;which&lt;/em&gt; row is unspecified without an ORDER BY. Usually you want:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



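&lt;p&gt;This returns the &lt;em&gt;latest&lt;/em&gt; order per user; sort by &lt;code&gt;created_at ASC&lt;/code&gt; instead to get the first. PostgreSQL requires the ORDER BY to begin with the &lt;code&gt;DISTINCT ON&lt;/code&gt; expressions, and the remaining sort keys decide which row in each group is kept. An index on &lt;code&gt;(user_id, created_at DESC)&lt;/code&gt; lets this run as an index scan that emits one row per user without a separate sort.&lt;/p&gt;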

&lt;p&gt;&lt;code&gt;DISTINCT ON&lt;/code&gt; is a PostgreSQL extension (not standard SQL) but it's the cleanest expression of "top-1 per group" when the pattern fits. For top-N with N &amp;gt; 1, use LATERAL (below) or a window function with a &lt;code&gt;Run Condition&lt;/code&gt;.&lt;/p&gt;
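
&lt;p&gt;A minimal sketch of the window-function variant for N &amp;gt; 1, with N = 3 chosen arbitrarily and &lt;code&gt;ranked&lt;/code&gt; as a made-up alias:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Latest 3 orders per user via row_number().
SELECT user_id, order_id, created_at
FROM (
    SELECT user_id, order_id, created_at,
           row_number() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
    FROM sim_bp_orders
) ranked
WHERE rn &amp;lt;= 3;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;On PostgreSQL 15 and later, the &lt;code&gt;rn &amp;lt;= 3&lt;/code&gt; filter can show up as a &lt;code&gt;Run Condition&lt;/code&gt; on the WindowAgg node, which lets it stop numbering rows in a partition once the limit is passed.&lt;/p&gt;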

&lt;h2&gt;
  
  
  Chunked deletes and updates
&lt;/h2&gt;

&lt;p&gt;Large &lt;code&gt;DELETE&lt;/code&gt; or &lt;code&gt;UPDATE&lt;/code&gt; statements take locks on every row they touch, generate WAL proportional to the row count, and can trigger autovacuum storms. A 10-million-row delete often locks out writers for minutes. The rewrite is to do it in chunks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Problematic: single massive delete.&lt;/span&gt;
&lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_logs&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="s1"&gt;'90 days'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Chunked: loop until no more rows to delete.&lt;/span&gt;
&lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;
&lt;span class="k"&gt;DECLARE&lt;/span&gt;
    &lt;span class="n"&gt;deleted_count&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;BEGIN&lt;/span&gt;
    &lt;span class="n"&gt;LOOP&lt;/span&gt;
        &lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_logs&lt;/span&gt;
        &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;log_id&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;log_id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_logs&lt;/span&gt;
            &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="s1"&gt;'90 days'&lt;/span&gt;
            &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;
        &lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;GET&lt;/span&gt; &lt;span class="k"&gt;DIAGNOSTICS&lt;/span&gt; &lt;span class="n"&gt;deleted_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;ROW_COUNT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;EXIT&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;deleted_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;COMMIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;-- Releases locks; next iteration starts fresh txn.&lt;/span&gt;
    &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="n"&gt;LOOP&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="err"&gt;$$&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



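&lt;p&gt;Each chunk commits separately, releasing locks and letting autovacuum catch up between iterations. The &lt;code&gt;IN (SELECT ... LIMIT ...)&lt;/code&gt; subquery is needed because &lt;code&gt;DELETE ... LIMIT&lt;/code&gt; isn't valid PostgreSQL syntax (unlike MySQL). &lt;code&gt;COMMIT&lt;/code&gt; inside a &lt;code&gt;DO&lt;/code&gt; block requires PostgreSQL 11 or later and only works when the block is not itself run inside an explicit transaction.&lt;/p&gt;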

&lt;p&gt;The same pattern applies to bulk &lt;code&gt;UPDATE&lt;/code&gt;s. Batch size depends on row width and lock contention tolerance — 1,000 for wide rows with heavy concurrent load, up to 100,000 for narrow rows on an off-hours maintenance window.&lt;/p&gt;
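
&lt;p&gt;A sketch of the chunked &lt;code&gt;UPDATE&lt;/code&gt; form follows. The status values &lt;code&gt;'completed'&lt;/code&gt; and &lt;code&gt;'archived'&lt;/code&gt; and the one-year cutoff are assumptions made up for the example. The important property is that the update makes each row stop matching the subquery, otherwise the loop never terminates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;DO $$
DECLARE
    updated_count int;
BEGIN
    LOOP
        UPDATE sim_bp_orders
        SET status = 'archived'              -- hypothetical target status
        WHERE order_id IN (
            SELECT order_id FROM sim_bp_orders
            WHERE status = 'completed'       -- hypothetical source status
              AND created_at &amp;lt; now() - interval '1 year'
            LIMIT 10000
        );
        GET DIAGNOSTICS updated_count = ROW_COUNT;
        EXIT WHEN updated_count = 0;
        COMMIT;  -- Archived rows no longer match, so the next chunk moves on.
    END LOOP;
END $$;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;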

&lt;h2&gt;
  
  
  &lt;code&gt;INSERT ... ON CONFLICT&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Pre-existing code often uses a read-then-write pattern for upserts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Anti-pattern: race condition between the SELECT and INSERT.&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_users&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- (application: if not found) INSERT INTO sim_bp_users ...;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two round trips, and two sessions can both read "not found" and both try to insert, producing a unique-constraint violation. The PostgreSQL idiom is &lt;code&gt;INSERT ... ON CONFLICT&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;sim_bp_users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;CONFLICT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt;
    &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;EXCLUDED&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;status&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;
&lt;span class="n"&gt;RETURNING&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One round trip, atomic, race-free. &lt;code&gt;EXCLUDED&lt;/code&gt; references the row that would have been inserted (before the conflict). For "do nothing on duplicate," use &lt;code&gt;ON CONFLICT (col) DO NOTHING&lt;/code&gt;. The conflict target must be a column or constraint that has a unique index — without one, PostgreSQL has no way to detect "a conflicting row already exists."&lt;/p&gt;
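
&lt;p&gt;For comparison, a sketch of the "do nothing on duplicate" form against the same table, again assuming a unique index on &lt;code&gt;email&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;INSERT INTO sim_bp_users (email, username, status)
VALUES ($1, $2, 'active')
ON CONFLICT (email) DO NOTHING
RETURNING user_id;
-- RETURNING yields a row only when the insert actually happened.
-- On a conflict the statement succeeds but returns zero rows.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If the caller always needs the &lt;code&gt;user_id&lt;/code&gt;, the &lt;code&gt;DO UPDATE&lt;/code&gt; form above is the simpler choice, since it returns the row in both cases.&lt;/p&gt;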

&lt;h2&gt;
  
  
  &lt;code&gt;SELECT *&lt;/code&gt; in production queries
&lt;/h2&gt;

&lt;p&gt;Not a rewrite of the query's logic, but a rewrite of its projection. &lt;code&gt;SELECT *&lt;/code&gt; from a wide table pulls every column over the wire and through every plan node — Index Only Scans degrade to regular Index Scans (heap fetches required for the extra columns), join memory usage multiplies, sort widths explode.&lt;/p&gt;

&lt;p&gt;The specific cost isn't always catastrophic, but the robustness cost is. Adding, dropping, or retyping a column on an upstream table can break downstream consumers that didn't know they depended on the old row shape. In production code, name every column you actually need.&lt;/p&gt;

&lt;p&gt;The exceptions: dump tools, ad-hoc debugging, and CTEs that genuinely pass all columns through. It is context-dependent, but the default should be "name the columns."&lt;/p&gt;
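
&lt;p&gt;To make the Index Only Scan point concrete, here is a hedged sketch. The index name and its column list are hypothetical, and whether the planner actually picks an Index Only Scan also depends on how much of the table the visibility map marks all-visible:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical covering index (INCLUDE requires PostgreSQL 11 or later).
CREATE INDEX IF NOT EXISTS idx_sim_bp_orders_user_created_incl
    ON sim_bp_orders (user_id, created_at DESC) INCLUDE (order_id);

-- Every referenced column lives in the index, so an Index Only Scan is possible.
SELECT order_id, created_at
FROM sim_bp_orders
WHERE user_id = $1
ORDER BY created_at DESC
LIMIT 20;

-- SELECT * with the same WHERE clause needs columns the index does not hold,
-- so each matching row requires a heap fetch.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;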

&lt;h2&gt;
  
  
  &lt;code&gt;HAVING&lt;/code&gt; vs &lt;code&gt;WHERE&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;HAVING&lt;/code&gt; filters after aggregation; &lt;code&gt;WHERE&lt;/code&gt; filters before. If a predicate &lt;em&gt;could&lt;/em&gt; apply before aggregation, it should — the aggregate then operates on fewer rows. A classic misuse:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Inefficient: aggregate over all orders, then filter.&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;order_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;active_users&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Better: filter before aggregation.&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;order_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;active_users&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The WHERE clause restricts the set of rows that go into the GROUP BY, so the aggregate runs over a smaller input. Only predicates that depend on the &lt;em&gt;aggregate result&lt;/em&gt; (e.g., &lt;code&gt;HAVING count(*) &amp;gt; 5&lt;/code&gt;) belong in HAVING; anything else is almost always more efficient in WHERE.&lt;/p&gt;

&lt;p&gt;The planner usually pushes predicates from HAVING to WHERE when it's safe, but not always — especially when there are subqueries or complex expressions involved. Writing the filter in WHERE to begin with removes the uncertainty.&lt;/p&gt;
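
&lt;p&gt;The two clauses combine naturally. In this sketch the row-level filter (reusing the &lt;code&gt;status = 'pending'&lt;/code&gt; predicate from earlier) goes in WHERE, and only the predicate on the aggregate result goes in HAVING:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Users with more than 5 pending orders.
SELECT user_id, count(*) AS order_count
FROM sim_bp_orders
WHERE status = 'pending'      -- row-level filter: applied before grouping
GROUP BY user_id
HAVING count(*) &amp;gt; 5;          -- depends on the aggregate: must stay in HAVING
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;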

&lt;h2&gt;
  
  
  Composite rewrites: correlated subquery + LATERAL + keyset pagination
&lt;/h2&gt;

&lt;p&gt;Real-world queries often combine several anti-patterns. The "show me the latest 20 orders for each of the top 100 users by lifetime spend" query is classic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Naive: one subquery for the user list, window function for the per-user top-N.&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;top_users&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt;
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;
    &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_amount_cents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
    &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;row_number&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;top_users&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CTE lists 100 top users; the window function computes row numbers for &lt;em&gt;all&lt;/em&gt; their orders (potentially thousands each); then the outer WHERE keeps only the top 20 per user. The window function is doing 10-100× the work that's actually needed.&lt;/p&gt;

&lt;p&gt;Rewritten with LATERAL + LIMIT:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;top_users&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_amount_cents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_spent&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt;
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;
    &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
    &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_spent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;top_users&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;LATERAL&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_amount_cents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
    &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
    &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;recent&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For each of the 100 top users, a LATERAL subquery returns their 20 most recent orders — at most 2000 rows total, vs potentially hundreds of thousands in the window-function form. PostgreSQL 15+ can sometimes optimise the window-function form via &lt;code&gt;Run Condition&lt;/code&gt;, but LATERAL is both clearer and more reliably cheap.&lt;/p&gt;
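
&lt;p&gt;The LATERAL form is only cheap when each per-user lookup can read the newest rows straight from an index; without one, every iteration has to find and sort all of that user's orders. A sketch of a supporting index, assuming none exists yet (the index name is hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Matches the LATERAL subquery: equality on user_id, then newest-first order.
CREATE INDEX idx_sim_bp_orders_user_created
    ON sim_bp_orders (user_id, created_at DESC);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With an index of this shape, each of the 100 iterations can be an index scan that stops after 20 rows. Confirm it with &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; on the rewritten query, and on a busy table consider &lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt;.&lt;/p&gt;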

&lt;h2&gt;
  
  
  When not to rewrite
&lt;/h2&gt;

&lt;p&gt;Every rewrite has a small risk of changing semantics in an edge case. Before deploying:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Diff the results.&lt;/strong&gt; Run the old and new forms against the same data; check the row counts and a representative sample match exactly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check the plan with EXPLAIN ANALYZE.&lt;/strong&gt; The rewrite should show the improvement you expect in actual time and row counts; if it doesn't, the planner chose a different plan than you assumed, and the rewrite needs another look.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run both under load.&lt;/strong&gt; Synthetic benchmarks rarely capture the real cache and concurrency effects. A rewrite that's 10× faster in isolation might be only 2× faster in production — still worth it, but measure.&lt;/li&gt;
&lt;/ul&gt;
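
&lt;p&gt;One way to run the first check is a symmetric difference with &lt;code&gt;EXCEPT ALL&lt;/code&gt;. The sketch below assumes the HAVING and WHERE forms of a per-user order count as the two candidates (the CTE names and the &lt;code&gt;side&lt;/code&gt; label are illustrative); substitute your own pair. An empty result means both forms returned the same rows, duplicates included.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;WITH old_form AS (
    SELECT user_id, count(*) AS order_count
    FROM sim_bp_orders
    GROUP BY user_id
    HAVING user_id IN (SELECT user_id FROM active_users)
),
new_form AS (
    SELECT user_id, count(*) AS order_count
    FROM sim_bp_orders
    WHERE user_id IN (SELECT user_id FROM active_users)
    GROUP BY user_id
)
-- Rows only in the old form, then rows only in the new form.
(SELECT 'old_only' AS side, * FROM old_form EXCEPT ALL SELECT 'old_only', * FROM new_form)
UNION ALL
(SELECT 'new_only' AS side, * FROM new_form EXCEPT ALL SELECT 'new_only', * FROM old_form);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;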

&lt;p&gt;Rewriting for performance is the right move after indexing, before buying bigger hardware. The patterns in this article cover most of what you'll find in a typical OLTP codebase; for the actually-broken queries — the ones that are wrong by construction — see the companion article on &lt;a href="https://mydba.dev/blog/postgres-query-anti-patterns" rel="noopener noreferrer"&gt;PostgreSQL Query Anti-Patterns and Common Mistakes&lt;/a&gt;.&lt;/p&gt;





&lt;p&gt;Full series and canonical version: &lt;a href="https://mydba.dev/blog/postgres-query-rewriting-techniques" rel="noopener noreferrer"&gt;https://mydba.dev/blog/postgres-query-rewriting-techniques&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>performance</category>
      <category>postgres</category>
      <category>sql</category>
    </item>
    <item>
      <title>PostgreSQL WHERE Clause Optimization</title>
      <dc:creator>Philip McClarence</dc:creator>
      <pubDate>Fri, 01 May 2026 14:00:04 +0000</pubDate>
      <link>https://dev.to/philip_mcclarence_2ef9475/postgresql-where-clause-optimization-59da</link>
      <guid>https://dev.to/philip_mcclarence_2ef9475/postgresql-where-clause-optimization-59da</guid>
      <description>&lt;p&gt;The single question that decides whether an index helps your query is: &lt;em&gt;can the planner match the WHERE clause against the index?&lt;/em&gt; If the answer is yes, you get an index or bitmap scan and the query returns quickly. If the answer is no — because you wrapped the indexed column in a function, used an implicit cast, or combined conditions with OR in a way the planner can't decompose — the index is silently unused and the table is sequentially scanned.&lt;/p&gt;

&lt;p&gt;The catch is that "the planner can match the predicate" isn't a yes-or-no rule; it's a long list of conditions. This article is the sixth in the &lt;a href="https://mydba.dev/blog/postgres-query-analysis-complete-guide" rel="noopener noreferrer"&gt;Complete Guide to PostgreSQL SQL Query Analysis &amp;amp; Optimization&lt;/a&gt; series and covers the conditions most often violated in production SQL. Every EXPLAIN block is captured from the series' Neon Postgres 17.8 database.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sargable predicates — the rule in one sentence
&lt;/h2&gt;

&lt;p&gt;A predicate is &lt;em&gt;sargable&lt;/em&gt; (Search ARGument ABLE) when it compares an indexed column, or a leading prefix of an indexed expression, against a constant or parameter — without wrapping the indexed value in a function the planner can't invert. The term isn't formal PostgreSQL terminology, but it's the right mental model: sargable ⇒ indexable; non-sargable ⇒ sequential scan, no matter how many indexes you add.&lt;/p&gt;

&lt;p&gt;The canonical non-sargable predicate is a function on the column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Not sargable — the index on email can't help.&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_users&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'user42@example.com'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An index on &lt;code&gt;email&lt;/code&gt; doesn't help here because the planner's test is &lt;code&gt;lower(email) = constant&lt;/code&gt;, and the index doesn't store &lt;code&gt;lower(email)&lt;/code&gt;. The fix is one of:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Normalise on write.&lt;/strong&gt; Store emails lowercased; query against the raw column. Most applications should have been doing this anyway.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expression index on the function.&lt;/strong&gt; &lt;code&gt;CREATE INDEX ON sim_bp_users (lower(email))&lt;/code&gt; — the index stores the lowercased value, and &lt;code&gt;lower(email) = 'x'&lt;/code&gt; becomes sargable against it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;citext&lt;/code&gt; extension.&lt;/strong&gt; A case-insensitive text type with its own operator class. Indexes on citext columns work for equality and pattern operators; which exact cases are index-usable depends on the operator class and the collation semantics. &lt;code&gt;citext&lt;/code&gt; is usually the cleanest solution for "case-insensitive equality everywhere" in the schema; for prefix-heavy workloads, an expression index with &lt;code&gt;text_pattern_ops&lt;/code&gt; (covered in the &lt;a href="https://mydba.dev/blog/postgres-index-usage-optimization" rel="noopener noreferrer"&gt;index usage article&lt;/a&gt;) is often a better fit because its semantics are simpler.&lt;/li&gt;
&lt;/ol&gt;
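
&lt;p&gt;A sketch of the second option, with hypothetical index names. The query has to use the same expression as the index definition, and prefix searches need a pattern operator class unless the database uses the C collation.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Expression index for case-insensitive equality.
CREATE INDEX idx_sim_bp_users_email_lower
    ON sim_bp_users (lower(email));

-- Sargable against the expression index above.
SELECT * FROM sim_bp_users WHERE lower(email) = 'user42@example.com';

-- Variant for prefix searches such as lower(email) LIKE 'user12%'.
CREATE INDEX idx_sim_bp_users_email_lower_pattern
    ON sim_bp_users (lower(email) text_pattern_ops);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;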

&lt;p&gt;A real capture shows the difference. Non-sargable &lt;code&gt;lower(email) LIKE 'user12%'&lt;/code&gt; against 200k rows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Parallel Seq Scan on sim_bp_users
    Filter: (lower((email)::text) ~~ 'user12%'::text)
    Rows Removed by Filter: 94444
 Execution Time: 122.833 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sargable &lt;code&gt;email LIKE 'user12%'&lt;/code&gt; with the existing &lt;code&gt;text_pattern_ops&lt;/code&gt; index:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Index Only Scan using idx_sim_bp_users_email_pattern on sim_bp_users
    Index Cond: ((email ~&amp;gt;=~ 'user12'::text) AND (email ~&amp;lt;~ 'user13'::text))
    Heap Fetches: 0
 Execution Time: 24.757 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same data, same 20-row output, 5× faster — and the ratio widens with table size.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implicit casts that silently disable indexes
&lt;/h2&gt;

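&lt;p&gt;PostgreSQL's type system is strict, but it will coerce types when the operator allows it. The coercion usually lands on the &lt;em&gt;constant&lt;/em&gt; side of the comparison, which is safe. When a cast lands on the &lt;em&gt;column&lt;/em&gt; side, whether PostgreSQL adds it implicitly or someone writes it to make the query compile, it silently bypasses the index:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Sargable: PG resolves the untyped literal '123' to int; index on int_col still applies.&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;int_col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'123'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Not sargable: PG implicitly casts int_col to numeric to match the literal.&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;int_col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;123.0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Not sargable: bare text_col = 123 is an error in PG, and the explicit cast wraps the column.&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;text_col&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;123&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the second form, the planner sees &lt;code&gt;(int_col)::numeric = 123.0&lt;/code&gt;; in the third, &lt;code&gt;(text_col)::integer = 123&lt;/code&gt;. Either way the cast prevents a plain index on the column from matching. PostgreSQL never adds the text-to-integer cast for you: without it, &lt;code&gt;text_col = 123&lt;/code&gt; fails with &lt;code&gt;operator does not exist: text = integer&lt;/code&gt;. The fix is usually "use the right type in the query," but occasionally a text column genuinely needs to index-match integer literals, in which case an expression index on the cast solves it: &lt;code&gt;CREATE INDEX ON t ((text_col::int))&lt;/code&gt;, provided every stored value casts cleanly. Rare, but real.&lt;/p&gt;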

&lt;p&gt;More insidious: the &lt;code&gt;varchar(N)&lt;/code&gt; ↔ &lt;code&gt;text&lt;/code&gt; case. &lt;code&gt;status varchar(20)&lt;/code&gt; is indexed; the query does &lt;code&gt;WHERE status = 'pending'&lt;/code&gt;. PostgreSQL picks the right operator and the index is used. Change the column type to &lt;code&gt;citext&lt;/code&gt; or an application-specific domain, and operator resolution can pick a different candidate — sometimes applying a cast on the column side and silently disabling the index. Schema-type changes are a plan-breaking migration; re-run &lt;code&gt;EXPLAIN&lt;/code&gt; on the key queries after any column-type change.&lt;/p&gt;

&lt;h2&gt;
  
  
  The leftmost-prefix rule (quickly, with the consequences)
&lt;/h2&gt;

&lt;p&gt;A composite btree on &lt;code&gt;(a, b, c)&lt;/code&gt; helps queries that use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;a = ?&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;a = ? AND b = ?&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;a = ? AND b = ? AND c = ?&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;a = ? AND b &amp;lt; ?&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;a = ? AND b = ? ORDER BY c&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

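&lt;p&gt;It does &lt;strong&gt;not&lt;/strong&gt; help queries that filter only on &lt;code&gt;b&lt;/code&gt; or only on &lt;code&gt;c&lt;/code&gt;. With a range on the leading column plus equality on a trailing one (&lt;code&gt;a &amp;gt; ? AND b = ?&lt;/code&gt;), only the range on &lt;code&gt;a&lt;/code&gt; narrows the scan; the condition on &lt;code&gt;b&lt;/code&gt; is checked against every index entry in that range. The planner can efficiently use a prefix of the index's columns starting from the leading one.&lt;/p&gt;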

&lt;p&gt;The practical implication: for a composite index, put equality predicates first and range/ORDER BY columns last. An index on &lt;code&gt;(tenant_id, created_at)&lt;/code&gt; serves a tenant-scoped time-range filter cleanly; with &lt;code&gt;(created_at, tenant_id)&lt;/code&gt;, the same query has to walk every tenant's entries in the time range and discard the ones that don't match, and the planner may prefer a seq scan.&lt;/p&gt;

&lt;p&gt;A common mistake is trying to "cover multiple access patterns with one composite index." If the app filters sometimes by &lt;code&gt;status&lt;/code&gt;, sometimes by &lt;code&gt;user_id&lt;/code&gt;, and sometimes by both, neither &lt;code&gt;(status, user_id)&lt;/code&gt; nor &lt;code&gt;(user_id, status)&lt;/code&gt; serves both single-column filters efficiently. You usually want &lt;em&gt;two&lt;/em&gt; single-column indexes — the planner will combine them with a &lt;code&gt;BitmapAnd&lt;/code&gt; when both are filtered — or one composite index plus one lone single-column index on whichever column is the more common filter in isolation.&lt;/p&gt;
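
&lt;p&gt;A sketch of the two-index option on &lt;code&gt;sim_bp_orders&lt;/code&gt;. The index names are hypothetical, an integer &lt;code&gt;user_id&lt;/code&gt; is assumed, and the table may already have an equivalent index on either column.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE INDEX idx_sim_bp_orders_status  ON sim_bp_orders (status);
CREATE INDEX idx_sim_bp_orders_user_id ON sim_bp_orders (user_id);

-- Either filter alone can use its own index. With both filters the planner
-- may combine the two via BitmapAnd, or use only the more selective index,
-- depending on its row estimates.
EXPLAIN
SELECT order_id
FROM sim_bp_orders
WHERE status = 'pending' AND user_id = 42;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;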

&lt;h2&gt;
  
  
  OR across indexed columns — the BitmapOr pattern
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;OR&lt;/code&gt; in a WHERE clause used to be a classic "can't use an index" gotcha. Modern PostgreSQL handles the common case well via &lt;code&gt;BitmapOr&lt;/code&gt;. Each branch of the OR produces a bitmap from its respective index; the bitmaps are unioned; a single heap scan visits only the matching pages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_users&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'user42@example.com'&lt;/span&gt;
   &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'user42'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Bitmap Heap Scan on sim_bp_users
  Recheck Cond: (((email)::text = 'user42@example.com'::text)
              OR ((username)::text = 'user42'::text))
  -&amp;gt;  BitmapOr
        -&amp;gt;  Bitmap Index Scan on idx_sim_bp_users_email_pattern
              Index Cond: ((email)::text = 'user42@example.com'::text)
        -&amp;gt;  Bitmap Index Scan on idx_sim_bp_users_username_pattern
              Index Cond: ((username)::text = 'user42'::text)
 Execution Time: 4.406 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;4.4 ms. Both branches of the OR hit an index; the &lt;code&gt;BitmapOr&lt;/code&gt; merges the two TID bitmaps (automatically deduplicating tuple IDs that appeared in both branches, since the bitmap is a set structure indexed by TID); the Bitmap Heap Scan visits each matched page once, rechecks the combined condition, and emits matching rows. No rewrite needed.&lt;/p&gt;

&lt;p&gt;OR becomes a problem when &lt;em&gt;only some&lt;/em&gt; of the branches are indexable, or when the branches match most of the table. A branch with no usable index rules out &lt;code&gt;BitmapOr&lt;/code&gt; altogether, because every row has to be tested against that branch anyway. When the branches match most of the table, the estimated cost of two bitmap scans + union + recheck is no better than a single sequential pass. If the planner picks a seq scan for an OR you thought would hit an index, check each branch individually; the non-sargable one is usually the culprit.&lt;/p&gt;

&lt;h2&gt;
  
  
  OR → UNION: when the planner won't decompose
&lt;/h2&gt;

&lt;p&gt;For the classic "OR across tables" case (&lt;code&gt;WHERE t.x = 1 OR u.y = 2&lt;/code&gt; in a join), the planner can't produce a BitmapOr, because a BitmapOr combines index scans on a single relation and here the two sides are in different relations. The rewrite:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Before: OR across joined tables.&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;sim_bp_users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pending'&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'suspended'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- After: UNION (not UNION ALL — we want deduplication).&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;sim_bp_users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pending'&lt;/span&gt;
&lt;span class="k"&gt;UNION&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;sim_bp_users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'suspended'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each branch of the UNION is a separate query the planner can optimise independently — one can use an index on &lt;code&gt;o.status&lt;/code&gt;, the other on &lt;code&gt;u.status&lt;/code&gt;, and the dedup at the top removes overlap. This only wins when both branches are individually selective; if one branch matches most of the table, UNION isn't faster.&lt;/p&gt;

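&lt;p&gt;UNION vs UNION ALL matters for correctness: UNION dedupes (expensive if the output is large and has many overlaps); UNION ALL doesn't (faster, but returns the overlapping rows twice). Default to UNION when rewriting an OR, with one caveat: UNION dedupes on the selected columns, so the result matches the OR form only when those columns identify a row uniquely, as a key like &lt;code&gt;order_id&lt;/code&gt; does.&lt;/p&gt;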
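
&lt;p&gt;When the dedup step is the expensive part, one alternative is to make the branches disjoint and use UNION ALL. This is a sketch of that swapped-in technique, not the form shown above: the second branch explicitly excludes the rows the first branch already returned, and &lt;code&gt;IS DISTINCT FROM&lt;/code&gt; keeps rows whose &lt;code&gt;o.status&lt;/code&gt; is NULL.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Branch 1: every pending order.
SELECT o.order_id
FROM sim_bp_orders o JOIN sim_bp_users u ON u.user_id = o.user_id
WHERE o.status = 'pending'
UNION ALL
-- Branch 2: orders of suspended users that branch 1 did not return.
SELECT o.order_id
FROM sim_bp_orders o JOIN sim_bp_users u ON u.user_id = o.user_id
WHERE u.status = 'suspended'
  AND o.status IS DISTINCT FROM 'pending';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the branches cannot overlap, this returns each joined row once, like the OR form, even when the selected columns are not unique.&lt;/p&gt;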

&lt;h2&gt;
  
  
  &lt;code&gt;LIKE '%needle%'&lt;/code&gt; — leading wildcards
&lt;/h2&gt;

&lt;p&gt;Standard btree indexes can only help LIKE when the pattern has a fixed prefix. &lt;code&gt;LIKE 'user12%'&lt;/code&gt; is range-scannable (with &lt;code&gt;text_pattern_ops&lt;/code&gt; or C collation); &lt;code&gt;LIKE '%user12%'&lt;/code&gt; isn't — there's no way to translate it into a range on a sorted index.&lt;/p&gt;

&lt;p&gt;The fix is a trigram index (&lt;code&gt;pg_trgm&lt;/code&gt; extension, GIN):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
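&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="k"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;pg_trgm&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;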
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_users_email_trgm&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;sim_bp_users&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;gin&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="n"&gt;gin_trgm_ops&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Now this can use a bitmap scan on the GIN index instead of a seq scan:&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_users&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%user12%'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Trigram GIN indexes store overlapping 3-character substrings of the text. The query engine decomposes &lt;code&gt;%user12%&lt;/code&gt; the same way and looks up candidate rows in the index. The match set is usually narrow enough that the subsequent heap scan is cheap, even though it has to re-verify each candidate against the full pattern.&lt;/p&gt;

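&lt;p&gt;GIN indexes have write amplification: each row contributes one index entry per distinct trigram, so an insert, or an update that writes a new row version, touches many index entries where a btree would touch one. Inserts can be several times slower than for a btree. Use trigram GINs sparingly on high-write tables.&lt;/p&gt;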

&lt;h2&gt;
  
  
  &lt;code&gt;IS NULL&lt;/code&gt;, &lt;code&gt;IS NOT NULL&lt;/code&gt;, and three-valued logic
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;IS NULL&lt;/code&gt; is sargable against a btree and against most specialised indexes. &lt;code&gt;column IS NULL&lt;/code&gt; can use an index if the index covers nulls (the default for btrees), producing a fast point-scan of the null rows. This is worth knowing because "find the records that haven't been processed yet" is a common pattern on append-mostly tables.&lt;/p&gt;

&lt;p&gt;The failure mode is &lt;code&gt;&amp;lt;&amp;gt;&lt;/code&gt; with nullable columns. &lt;code&gt;status &amp;lt;&amp;gt; 'completed'&lt;/code&gt; excludes rows where &lt;code&gt;status&lt;/code&gt; is NULL: comparing NULL with anything yields NULL (unknown), neither true nor false, and WHERE keeps only rows where the predicate is true. If you actually want "all rows where status is not completed or is unknown," you have to write it explicitly: &lt;code&gt;status &amp;lt;&amp;gt; 'completed' OR status IS NULL&lt;/code&gt;, or &lt;code&gt;status IS DISTINCT FROM 'completed'&lt;/code&gt; (which treats NULL as a value).&lt;/p&gt;
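
&lt;p&gt;A self-contained illustration using an inline &lt;code&gt;VALUES&lt;/code&gt; list instead of a real table (the column aliases are made up for the example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Compare the two operators on a completed, a pending, and a NULL status.
SELECT status,
       status &amp;lt;&amp;gt; 'completed'               AS not_equal,
       status IS DISTINCT FROM 'completed' AS is_distinct
FROM (VALUES ('completed'), ('pending'), (NULL)) AS v(status);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For the NULL row, &lt;code&gt;not_equal&lt;/code&gt; is NULL while &lt;code&gt;is_distinct&lt;/code&gt; is true, so only the second form keeps that row when used in a WHERE clause.&lt;/p&gt;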

&lt;p&gt;&lt;code&gt;NOT IN (SELECT ...)&lt;/code&gt; on a nullable inner column is the same trap at a higher level: if any row in the subquery has &lt;code&gt;NULL&lt;/code&gt; for the compared column, &lt;code&gt;NOT IN&lt;/code&gt; returns no rows at all. Use &lt;code&gt;NOT EXISTS&lt;/code&gt; (see the &lt;a href="https://mydba.dev/blog/postgres-subquery-cte-optimization" rel="noopener noreferrer"&gt;subquery/CTE article&lt;/a&gt;) unless you've proven the inner column is &lt;code&gt;NOT NULL&lt;/code&gt;.&lt;/p&gt;
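
&lt;p&gt;The trap is easy to reproduce without any tables. A small sketch with inline &lt;code&gt;VALUES&lt;/code&gt; lists standing in for the outer table and the subquery:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Returns no rows: 2 NOT IN (1, NULL) evaluates to NULL, not true.
SELECT x
FROM (VALUES (1), (2), (3)) AS a(x)
WHERE x NOT IN (SELECT y FROM (VALUES (1), (NULL)) AS b(y));

-- Returns 2 and 3: NOT EXISTS only asks whether a matching row exists.
SELECT x
FROM (VALUES (1), (2), (3)) AS a(x)
WHERE NOT EXISTS (
    SELECT 1 FROM (VALUES (1), (NULL)) AS b(y) WHERE b.y = a.x
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;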

&lt;h2&gt;
  
  
  Function calls on the constant side — safe
&lt;/h2&gt;

&lt;p&gt;The non-sargable warning applies to functions on the &lt;em&gt;column&lt;/em&gt; side, not the constant side. &lt;code&gt;WHERE created_at &amp;gt; now() - interval '1 day'&lt;/code&gt; is sargable because &lt;code&gt;now() - interval '1 day'&lt;/code&gt; evaluates to a constant (once per query), and the planner compares the indexed &lt;code&gt;created_at&lt;/code&gt; to that constant.&lt;/p&gt;

&lt;p&gt;The subtlety is that functions marked &lt;code&gt;VOLATILE&lt;/code&gt; (like &lt;code&gt;random()&lt;/code&gt;) can't be evaluated once and cached; they're re-evaluated per row, so a comparison against them can't serve as an index condition and is applied as a per-row filter instead. User-defined functions default to &lt;code&gt;VOLATILE&lt;/code&gt; unless you explicitly mark them &lt;code&gt;STABLE&lt;/code&gt; or &lt;code&gt;IMMUTABLE&lt;/code&gt;. If you're calling a UDF in a WHERE clause that should be constant for the duration of the query, mark it &lt;code&gt;STABLE&lt;/code&gt;; otherwise the planner treats it as volatile and loses optimisation opportunities.&lt;/p&gt;
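
&lt;p&gt;A sketch of what that declaration looks like. The function &lt;code&gt;reporting_cutoff()&lt;/code&gt; is hypothetical, and &lt;code&gt;created_at&lt;/code&gt; is assumed to be a timestamp column with a btree index.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical helper: returns the same value for every row within one statement.
CREATE FUNCTION reporting_cutoff() RETURNS timestamptz
LANGUAGE plpgsql
STABLE
AS $$
BEGIN
    RETURN now() - interval '1 day';
END;
$$;

-- Because the function is STABLE, this comparison is eligible to be an index condition.
SELECT order_id
FROM sim_bp_orders
WHERE created_at &amp;gt; reporting_cutoff();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Only declare a function &lt;code&gt;STABLE&lt;/code&gt; when its body really is stable; the declaration is a promise to the planner, not something PostgreSQL verifies.&lt;/p&gt;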

&lt;h2&gt;
  
  
  Partial index predicate implication
&lt;/h2&gt;

&lt;p&gt;Partial indexes have their own sargability requirement, in addition to the usual one: the query's WHERE clause must &lt;em&gt;imply&lt;/em&gt; the partial index's predicate, from the planner's perspective. The planner uses a built-in theorem prover on predicates, which handles equality, inequality, and simple boolean structure. It doesn't handle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Function calls (&lt;code&gt;WHERE lower(status) = 'pending'&lt;/code&gt; won't match a partial index on &lt;code&gt;WHERE status = 'pending'&lt;/code&gt; because the function disables the implication).&lt;/li&gt;
&lt;li&gt;OR-wrapped forms that don't obviously decompose.&lt;/li&gt;
&lt;li&gt;Casts that the theorem prover doesn't recognise as reversible.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When a partial index isn't being used, the most common reason is that the query's predicate isn't obviously implying the partial predicate. Rewrite the query to match the partial predicate as literally as possible.&lt;/p&gt;
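
&lt;p&gt;A sketch of the difference, assuming a hypothetical partial index on pending orders (the index name and the &lt;code&gt;'shipped'&lt;/code&gt; status value are made up for illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical partial index covering only pending orders.
CREATE INDEX idx_sim_bp_orders_pending_created
    ON sim_bp_orders (created_at)
    WHERE status = 'pending';

-- Can use the index: the WHERE clause contains the index predicate verbatim.
SELECT order_id FROM sim_bp_orders
WHERE status = 'pending' AND created_at &amp;gt; now() - interval '1 day';

-- Cannot use it: "pending or shipped" does not imply status = 'pending'.
SELECT order_id FROM sim_bp_orders
WHERE status IN ('pending', 'shipped') AND created_at &amp;gt; now() - interval '1 day';

-- Cannot use it: the prover does not reason through lower().
SELECT order_id FROM sim_bp_orders
WHERE lower(status) = 'pending' AND created_at &amp;gt; now() - interval '1 day';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;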

&lt;h2&gt;
  
  
  A diagnostic recipe
&lt;/h2&gt;

&lt;p&gt;When an index exists but a query isn't using it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Look at the &lt;code&gt;Filter:&lt;/code&gt; line in EXPLAIN.&lt;/strong&gt; If the filter mentions the indexed column with any function around it, that's the non-sargable form. Rewrite.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check column types match literal types.&lt;/strong&gt; &lt;code&gt;WHERE int_column = '123'&lt;/code&gt; is fine; &lt;code&gt;WHERE int_column = 123.0&lt;/code&gt; or &lt;code&gt;WHERE text_column::int = 123&lt;/code&gt; casts the column and loses the index.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check the &lt;code&gt;Index Cond:&lt;/code&gt; line for the expected index.&lt;/strong&gt; If the index is available but the plan shows &lt;code&gt;Filter:&lt;/code&gt; instead of &lt;code&gt;Index Cond:&lt;/code&gt;, the planner decided the predicate couldn't use the index — look for functions or casts on the column.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Try &lt;code&gt;SET enable_seqscan = off;&lt;/code&gt;&lt;/strong&gt; just for the session. The resulting plan tells you what the planner &lt;em&gt;would&lt;/em&gt; use if forced. If it's still a seq scan or a bizarre fallback, the predicate is genuinely unindexable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For partial indexes, read the partial predicate carefully.&lt;/strong&gt; The query's WHERE clause has to imply it literally, not just semantically.&lt;/li&gt;
&lt;/ol&gt;
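
&lt;p&gt;Steps 1, 3 and 4 can be run together in one throwaway transaction. This is a sketch that assumes the series' &lt;code&gt;sim_bp_orders&lt;/code&gt; table; the query is a placeholder for whichever statement you are investigating:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;BEGIN;

-- SET LOCAL is undone at ROLLBACK or COMMIT, so the setting cannot leak
-- into the rest of the session.
SET LOCAL enable_seqscan = off;

-- Note: EXPLAIN ANALYZE really executes the statement.
EXPLAIN (ANALYZE, BUFFERS)
SELECT order_id
FROM sim_bp_orders
WHERE status = 'pending';   -- placeholder predicate

ROLLBACK;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In the output, a predicate listed under &lt;code&gt;Index Cond:&lt;/code&gt; is being served by the index, while the same predicate under &lt;code&gt;Filter:&lt;/code&gt; is being applied row by row after the fetch.&lt;/p&gt;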

&lt;h2&gt;
  
  
  Next steps
&lt;/h2&gt;

&lt;p&gt;When the predicates are right but the query itself is structured awkwardly, the next article — &lt;a href="https://mydba.dev/blog/postgres-query-rewriting-techniques" rel="noopener noreferrer"&gt;Query Rewriting Techniques&lt;/a&gt; — covers the systematic transformations that turn expensive SQL into cheap SQL without changing results: &lt;code&gt;DISTINCT → GROUP BY&lt;/code&gt;, keyset pagination, batch operations, and the other rewrites every production SQL writer eventually needs.&lt;/p&gt;


&lt;p&gt;Originally published at &lt;a href="https://mydba.dev/blog/postgres-where-clause-optimization" rel="noopener noreferrer"&gt;mydba.dev/blog/postgres-where-clause-optimization&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>database</category>
      <category>performance</category>
      <category>postgres</category>
      <category>sql</category>
    </item>
    <item>
      <title>PostgreSQL Aggregate and Window Function Tuning</title>
      <dc:creator>Philip McClarence</dc:creator>
      <pubDate>Thu, 30 Apr 2026 14:00:08 +0000</pubDate>
      <link>https://dev.to/philip_mcclarence_2ef9475/postgresql-aggregate-and-window-function-tuning-4nbl</link>
      <guid>https://dev.to/philip_mcclarence_2ef9475/postgresql-aggregate-and-window-function-tuning-4nbl</guid>
      <description>&lt;p&gt;&lt;code&gt;GROUP BY&lt;/code&gt; and window functions look declarative — the query says what it wants, and PostgreSQL figures out how to compute it. In practice the planner has strong opinions about &lt;em&gt;how&lt;/em&gt;: whether to hash or sort, whether to parallelise, whether to spill memory to disk, whether a matching index changes the plan entirely. Learn to read what the planner picked and why, and aggregate-heavy queries become one of the easiest categories to tune.&lt;/p&gt;

&lt;p&gt;This article is the fifth in the &lt;a href="https://mydba.dev/blog/postgres-query-analysis-complete-guide" rel="noopener noreferrer"&gt;Complete Guide to PostgreSQL SQL Query Analysis &amp;amp; Optimization&lt;/a&gt; series. Every EXPLAIN block below is captured from a real run on the series' Neon Postgres 17.8 database (500,000-row &lt;code&gt;sim_bp_orders&lt;/code&gt; and friends).&lt;/p&gt;

&lt;h2&gt;
  
  
  The two aggregate strategies
&lt;/h2&gt;

&lt;p&gt;For &lt;code&gt;GROUP BY&lt;/code&gt;, the planner chooses primarily between two implementations — plus some parallel and distinct variants layered on top.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;HashAggregate&lt;/code&gt;&lt;/strong&gt; builds a hash table keyed by the group-by columns; each incoming row probes the hash and either creates a new entry or updates an existing one's running aggregate state. Fast when the hash table fits in &lt;code&gt;work_mem&lt;/code&gt;. Doesn't care about input order.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;GroupAggregate&lt;/code&gt;&lt;/strong&gt; requires input already sorted on the group-by columns. Each group's rows arrive contiguously, so the aggregate can emit a result row and clear its state between groups — constant memory regardless of group count. Picked when the input is already sorted (typically because the group-by matches an index order) or when the planner thinks the hash table won't fit.&lt;/p&gt;

&lt;p&gt;The distinguishing signal in EXPLAIN is the node type itself: &lt;code&gt;HashAggregate&lt;/code&gt; vs &lt;code&gt;GroupAggregate&lt;/code&gt;. When you see &lt;code&gt;Sort → GroupAggregate&lt;/code&gt; and no matching index, the planner has decided a sort + streaming aggregate is cheaper than trying to hash. In parallel plans you'll often see a &lt;em&gt;composite&lt;/em&gt; shape — &lt;code&gt;Partial HashAggregate&lt;/code&gt; inside each worker, topped by &lt;code&gt;Finalize GroupAggregate&lt;/code&gt; on the leader — which is a parallel partial-aggregation pattern rather than "just a HashAggregate."&lt;/p&gt;

&lt;p&gt;Here's that exact shape, from the classic dashboard query "how many orders in each status?":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_amount_cents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Finalize GroupAggregate  (cost=8334.96..8336.27 rows=5 width=49)
    (actual time=148.938..151.912 rows=5 loops=1)
  Group Key: sim_bp_orders.status
  Buffers: shared hit=3705
  -&amp;gt;  Gather Merge  (actual time=148.924..151.895 rows=15 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        -&amp;gt;  Sort  (actual time=140.390..140.391 rows=5 loops=3)
              Sort Key: sim_bp_orders.status
              Sort Method: quicksort  Memory: 25kB
              -&amp;gt;  Partial HashAggregate
                    (actual time=140.366..140.367 rows=5 loops=3)
                    Group Key: sim_bp_orders.status
                    Batches: 1  Memory Usage: 24kB
                    -&amp;gt;  Parallel Seq Scan on sim_bp_orders
                          (actual time=0.006..32.097 rows=166667 loops=3)
 Execution Time: 151.973 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;152 ms. This is &lt;strong&gt;parallel partial aggregation&lt;/strong&gt;: each parallel worker (plus the leader, making three process loops) computes a partial HashAggregate over its slice of the table (&lt;code&gt;rows=166667 loops=3&lt;/code&gt; ≈ 500k total), produces its five-row partial result, sorts those by &lt;code&gt;status&lt;/code&gt;, and feeds them up to &lt;code&gt;Gather Merge&lt;/code&gt;. The leader then finalises with &lt;code&gt;Finalize GroupAggregate&lt;/code&gt; — combining the three sets of partial states into five final rows. Partial aggregation is the reason aggregate queries scale so well with parallel workers: only the partial group states (5 rows per worker here, 15 rows total) cross the worker-to-leader boundary, no matter how big the input was.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;Batches: 1  Memory Usage: 24kB&lt;/code&gt; on the Partial HashAggregate means the hash table fit in &lt;code&gt;work_mem&lt;/code&gt; and didn't spill. Five groups with running sum/count fits easily in 24 kB.&lt;/p&gt;

&lt;h2&gt;
  
  
  The aggregate spill — and how to diagnose it
&lt;/h2&gt;

&lt;p&gt;Things get interesting when the number of groups grows. A HashAggregate spill on a 117k-group count looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HashAggregate  (actual time=347.354..392.138 rows=117060 loops=1)
  Group Key: u.email
  Planned Partitions: 4  Batches: 5  Memory Usage: 8241kB  Disk Usage: 6920kB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;Disk Usage: 6920kB&lt;/code&gt; and &lt;code&gt;Batches &amp;gt; 1&lt;/code&gt; are the spill signals. PostgreSQL 13+ handles this gracefully — the executor detects that not all groups fit in memory, writes partial state to per-partition spill files, and processes them in additional passes — but the extra I/O is not free. On our database it cost roughly 40% of the query's total time.&lt;/p&gt;

&lt;p&gt;Two fixes for HashAggregate spills:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Raise &lt;code&gt;work_mem&lt;/code&gt; for the session or role&lt;/strong&gt; so the hash table fits in memory. Scope the change to a session or a role (&lt;code&gt;ALTER ROLE analytics SET work_mem = '64MB'&lt;/code&gt;) rather than the whole cluster: &lt;code&gt;work_mem&lt;/code&gt; is a limit per sort or hash node &lt;em&gt;per connection&lt;/em&gt;, so a cluster-wide raise is multiplied by both plan complexity and concurrency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sort + GroupAggregate&lt;/strong&gt; can be cheaper than a spilling HashAggregate, especially when an index on the group-by column supplies the ordering for free. Force it with &lt;code&gt;SET enable_hashagg = off;&lt;/code&gt; as a diagnostic. If the resulting plan is faster, the underlying issue is "too many groups for current work_mem." Without such an index, raising &lt;code&gt;work_mem&lt;/code&gt; for the session is usually still the right answer, since the Sort is bound by &lt;code&gt;work_mem&lt;/code&gt; too.&lt;/li&gt;
&lt;/ol&gt;
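
&lt;p&gt;Both fixes can be tried side by side without touching server configuration. The sketch below uses a stand-in high-cardinality aggregate on &lt;code&gt;sim_bp_orders&lt;/code&gt; (not the 117k-group query shown above), and the 64MB figure is borrowed from the &lt;code&gt;ALTER ROLE&lt;/code&gt; example rather than measured:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Fix 1: more memory for this transaction only.
BEGIN;
SET LOCAL work_mem = '64MB';          -- assumed value, tune to your data
EXPLAIN (ANALYZE, BUFFERS)
SELECT user_id, count(*)              -- stand-in query
FROM sim_bp_orders
GROUP BY user_id;
ROLLBACK;

-- Fix 2 (diagnostic): forbid HashAggregate and compare the timing.
BEGIN;
SET LOCAL enable_hashagg = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT user_id, count(*)
FROM sim_bp_orders
GROUP BY user_id;
ROLLBACK;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In the first plan, look for &lt;code&gt;Batches: 1&lt;/code&gt; and no &lt;code&gt;Disk Usage&lt;/code&gt; on the HashAggregate. In the second, check whether the Sort reports &lt;code&gt;quicksort&lt;/code&gt; or &lt;code&gt;external merge&lt;/code&gt;.&lt;/p&gt;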

&lt;p&gt;The MyDBA analyzer rule &lt;code&gt;temp_blocks_written&lt;/code&gt; fires when a node's &lt;code&gt;Temp Written Blocks&lt;/code&gt; exceeds 100. That field is populated from JSON-format EXPLAIN output — MyDBA's visualiser runs the rules over JSON plans, not the text format pasted here — so the rule fires automatically on both HashAggregate spills and Sort spills when captured through the native integration.&lt;/p&gt;
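
&lt;p&gt;To see the same field by hand, ask for JSON output with &lt;code&gt;BUFFERS&lt;/code&gt; enabled. This is a sketch using the status query from earlier in the article; how MyDBA itself captures plans is not shown here:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Each plan node in the JSON carries "Temp Read Blocks" and
-- "Temp Written Blocks" keys alongside the shared-buffer counters.
EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
SELECT status, count(*), avg(total_amount_cents)
FROM sim_bp_orders
GROUP BY status;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A non-zero &lt;code&gt;Temp Written Blocks&lt;/code&gt; on a Sort or HashAggregate node corresponds to the &lt;code&gt;temp ... written=&lt;/code&gt; figure in the text format.&lt;/p&gt;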

&lt;h2&gt;
  
  
  Sort spills — the external merge
&lt;/h2&gt;

&lt;p&gt;When a sort doesn't fit in &lt;code&gt;work_mem&lt;/code&gt;, PostgreSQL falls back to an external merge sort: write sorted runs to disk, then merge them. You see this as &lt;code&gt;Sort Method: external merge&lt;/code&gt; with a &lt;code&gt;Disk:&lt;/code&gt; size in the sort node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;percentile_cont&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;WITHIN&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_amount_cents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;median&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;percentile_cont&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;95&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;WITHIN&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_amount_cents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;p95&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Percentiles are expensive because the implementation needs an ordered sample per group. PostgreSQL's &lt;code&gt;percentile_cont&lt;/code&gt; evaluates as an ordered-set aggregate, which requires sorting the input per group:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GroupAggregate  (actual time=202.589..358.067 rows=5 loops=1)
  Group Key: status
  Buffers: shared hit=3697, temp read=2707 written=2494
  -&amp;gt;  Sort  (actual time=146.514..202.485 rows=500000 loops=1)
        Sort Key: status
        Sort Method: external merge  Disk: 12048kB
        Buffers: shared hit=3689, temp read=1506 written=1512
        -&amp;gt;  Seq Scan on sim_bp_orders  (actual time=0.007..49.886 rows=500000 loops=1)
 Execution Time: 358.230 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;358 ms. The Sort spilled 12 MB of temp files. The &lt;code&gt;GroupAggregate&lt;/code&gt; node above it shows its own &lt;code&gt;temp read=2707 written=2494&lt;/code&gt; — that's the ordered-set aggregate's internal tuplestore materialising per-group sorted input for the percentile computation, not a generic "every aggregate spills" phenomenon. Ordered-set aggregates like &lt;code&gt;percentile_cont&lt;/code&gt;, &lt;code&gt;percentile_disc&lt;/code&gt;, and &lt;code&gt;mode()&lt;/code&gt; all force per-group materialisation; a simple &lt;code&gt;count()&lt;/code&gt; or &lt;code&gt;avg()&lt;/code&gt; on the same plan wouldn't produce that second temp-I/O figure. The MyDBA rule &lt;code&gt;sort_on_disk&lt;/code&gt; fires on any Sort with &lt;code&gt;Sort Space Type = Disk&lt;/code&gt;, which this plan has.&lt;/p&gt;

&lt;p&gt;The right fix depends on the workload. For a one-off analytical report, raising &lt;code&gt;work_mem&lt;/code&gt; to ~40 MB for that session turns the external merge into an in-memory quicksort. For a dashboard that runs this every minute, you want a materialised view:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;MATERIALIZED&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;order_amount_percentiles&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;percentile_cont&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;WITHIN&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_amount_cents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;median&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;percentile_cont&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;95&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;WITHIN&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_amount_cents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;p95&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

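&lt;span class="c1"&gt;-- CONCURRENTLY needs a unique index on the view (index name is illustrative):&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;UNIQUE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;order_amount_percentiles_status_idx&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;order_amount_percentiles&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Refresh on whatever schedule fits your freshness requirement:&lt;/span&gt;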
&lt;span class="n"&gt;REFRESH&lt;/span&gt; &lt;span class="n"&gt;MATERIALIZED&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;CONCURRENTLY&lt;/span&gt; &lt;span class="n"&gt;order_amount_percentiles&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;REFRESH MATERIALIZED VIEW CONCURRENTLY&lt;/code&gt; requires a unique index on the view. In exchange, it rebuilds the result without blocking concurrent reads of the view, so the dashboard keeps seeing the old rows until the refresh commits. The dashboard then queries the view instead of re-running the percentile calculation, and the 358 ms query becomes a 0.5 ms scan of a five-row view.&lt;/p&gt;

&lt;h2&gt;
  
  
  Window functions
&lt;/h2&gt;

&lt;p&gt;A window function produces an output row for every input row, but with access to a &lt;em&gt;frame&lt;/em&gt; of related rows. The syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;agg_func&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;col1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col2&lt;/span&gt;     &lt;span class="c1"&gt;-- split input into independent groups&lt;/span&gt;
    &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;col3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col4&lt;/span&gt;          &lt;span class="c1"&gt;-- order within each partition&lt;/span&gt;
    &lt;span class="k"&gt;ROWS&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;     &lt;span class="c1"&gt;-- or RANGE BETWEEN, or GROUPS BETWEEN&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The planner implements window functions via a &lt;code&gt;WindowAgg&lt;/code&gt; node that consumes an input ordered appropriately and emits one output row per input. If the input isn't already ordered, the planner inserts a &lt;code&gt;Sort&lt;/code&gt; before the WindowAgg — which is often where the cost lives.&lt;/p&gt;

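&lt;p&gt;Consider a common pattern: "the most recent order per user." The usual formulation, on any PostgreSQL version, numbers each user's orders with &lt;code&gt;row_number()&lt;/code&gt; and keeps only &lt;code&gt;rn = 1&lt;/code&gt;:&lt;br&gt;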
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;row_number&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The PostgreSQL 15+ optimisation for this is the &lt;strong&gt;WindowAgg Run Condition&lt;/strong&gt; — the planner notices that &lt;code&gt;WHERE rn = 1&lt;/code&gt; can be pushed into the WindowAgg, so it can stop computing row numbers for each partition as soon as &lt;code&gt;rn &amp;gt; 1&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Limit  (actual time=0.093..0.525 rows=100 loops=1)
  Buffers: shared hit=305
  -&amp;gt;  WindowAgg  (actual time=0.092..0.518 rows=100 loops=1)
        Run Condition: (row_number() OVER (?) &amp;lt;= 1)
        -&amp;gt;  Incremental Sort
              Sort Key: user_id, created_at DESC
              Presorted Key: user_id
              Full-sort Groups: 9  Sort Method: quicksort
              -&amp;gt;  Index Scan using idx_sim_bp_orders_user_id on sim_bp_orders
                    (actual time=0.014..0.339 rows=302 loops=1)
 Execution Time: 0.543 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;0.54 ms. Two optimisations are visible:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Run Condition: (row_number() OVER (?) &amp;lt;= 1)&lt;/code&gt;&lt;/strong&gt; — the WindowAgg stops producing rows for a partition once &lt;code&gt;rn&lt;/code&gt; exceeds 1, so only the first row per user is computed. This lets the plan short-circuit once &lt;code&gt;LIMIT 100&lt;/code&gt; is satisfied after only 302 input rows (not the full 500k).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Incremental Sort&lt;/code&gt; with &lt;code&gt;Presorted Key: user_id&lt;/code&gt;&lt;/strong&gt; — the input arrives already sorted by &lt;code&gt;user_id&lt;/code&gt; (from &lt;code&gt;idx_sim_bp_orders_user_id&lt;/code&gt;), and the WindowAgg needs it sorted by &lt;code&gt;(user_id, created_at DESC)&lt;/code&gt;. An Incremental Sort only sorts &lt;em&gt;within each &lt;code&gt;user_id&lt;/code&gt; group&lt;/em&gt; rather than globally, which costs drastically less memory and allows pipelined execution.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Even so, a &lt;code&gt;LATERAL&lt;/code&gt; join with &lt;code&gt;LIMIT 1&lt;/code&gt; inside is often simpler and at least as fast for "top-N per group" with small N.&lt;/p&gt;
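
&lt;p&gt;For comparison, here is a sketch of the &lt;code&gt;LATERAL&lt;/code&gt; form of the same "most recent order per user" query. It assumes that &lt;code&gt;sim_bp_users&lt;/code&gt; has a &lt;code&gt;user_id&lt;/code&gt; key column, which the article does not show, and it is not a captured plan:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT u.user_id, o.order_id, o.created_at
FROM sim_bp_users AS u                 -- assumed key column: user_id
CROSS JOIN LATERAL (
    -- Runs once per user and stops after the newest order.
    SELECT order_id, created_at
    FROM sim_bp_orders
    WHERE sim_bp_orders.user_id = u.user_id
    ORDER BY created_at DESC
    LIMIT 1
) AS o
LIMIT 100;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Like the &lt;code&gt;rn = 1&lt;/code&gt; query, &lt;code&gt;CROSS JOIN LATERAL&lt;/code&gt; drops users with no orders. An index on &lt;code&gt;(user_id, created_at DESC)&lt;/code&gt; would let each inner lookup read a single index entry; with only the &lt;code&gt;user_id&lt;/code&gt; index, each lookup sorts that user's handful of orders.&lt;/p&gt;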

&lt;h2&gt;
  
  
  Frame specifications
&lt;/h2&gt;

&lt;p&gt;Most window function work defaults to an implicit frame clause that trips people up. The rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No &lt;code&gt;ORDER BY&lt;/code&gt; clause&lt;/strong&gt; → every row in the partition is a peer of every other, so the default frame is effectively &lt;code&gt;RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING&lt;/code&gt;: the whole partition. This is what you want for &lt;code&gt;sum()&lt;/code&gt; or &lt;code&gt;avg()&lt;/code&gt; over an entire partition.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY&lt;/code&gt; clause present&lt;/strong&gt; → the frame defaults to &lt;code&gt;RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW&lt;/code&gt;: everything from the start of the partition through the current row and any rows that tie with it on the &lt;code&gt;ORDER BY&lt;/code&gt; value. This is what you want for running sums, but tied rows all receive the same total, which is easy to get wrong.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ranking functions&lt;/strong&gt; (&lt;code&gt;row_number()&lt;/code&gt;, &lt;code&gt;rank()&lt;/code&gt;, &lt;code&gt;dense_rank()&lt;/code&gt;) — the frame is irrelevant because the function's result only depends on the ordering.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A common mistake: computing a "running average over the last 7 rows" and getting a running average over all preceding rows because the frame clause was omitted. The fix is explicit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt;
    &lt;span class="k"&gt;ROWS&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt; &lt;span class="k"&gt;PRECEDING&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;CURRENT&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;ROWS BETWEEN N PRECEDING AND CURRENT ROW&lt;/code&gt; is a physical window of N+1 rows. &lt;code&gt;RANGE BETWEEN '7 days' PRECEDING AND CURRENT ROW&lt;/code&gt; is a logical window based on the ORDER BY value — useful when timestamps aren't evenly spaced. &lt;code&gt;GROUPS BETWEEN N PRECEDING AND CURRENT ROW&lt;/code&gt; (PostgreSQL 11+) treats ties as a single "group" and counts those.&lt;/p&gt;
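
&lt;p&gt;As an illustration of the logical-window form, the sketch below computes a trailing seven-day average of order amounts for one user, using the columns that appear elsewhere in this article. The alias &lt;code&gt;avg_7d&lt;/code&gt; is made up for the example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT created_at,
       total_amount_cents,
       avg(total_amount_cents) OVER (
           ORDER BY created_at
           -- Frame = every row whose created_at lies within the 7 days
           -- before this row's created_at, however many rows that is.
           RANGE BETWEEN INTERVAL '7 days' PRECEDING AND CURRENT ROW
       ) AS avg_7d
FROM sim_bp_orders
WHERE user_id = 42;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A &lt;code&gt;RANGE&lt;/code&gt; frame with an offset needs exactly one &lt;code&gt;ORDER BY&lt;/code&gt; column, because the offset is added to or subtracted from that column's value.&lt;/p&gt;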

&lt;h2&gt;
  
  
  LAG, LEAD, and first/last value
&lt;/h2&gt;

&lt;p&gt;The navigation functions — &lt;code&gt;lag(x, n)&lt;/code&gt;, &lt;code&gt;lead(x, n)&lt;/code&gt;, &lt;code&gt;first_value(x)&lt;/code&gt;, &lt;code&gt;last_value(x)&lt;/code&gt; — let you reference rows offset from the current one. Classic use: detect state transitions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;lag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;prev_status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;created_at&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each row gets the status of the user's &lt;em&gt;previous&lt;/em&gt; order. The window can then be wrapped in a subquery or CTE to find "orders where the status changed":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;ordered&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;lag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;prev_status&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;ordered&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;prev_status&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two performance notes. First, &lt;code&gt;last_value()&lt;/code&gt; with a default frame is surprising — because the default frame ends at the current row, &lt;code&gt;last_value()&lt;/code&gt; returns the current row's value, not the partition's last. To actually get the partition's last value, specify &lt;code&gt;ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING&lt;/code&gt;. Second, LAG and LEAD compile to very cheap operations (just a pointer to the previous/next row in the window), while first_value/last_value with an explicit full-partition frame can force materialisation.&lt;/p&gt;
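
&lt;p&gt;The &lt;code&gt;last_value()&lt;/code&gt; surprise is easiest to see with both frames side by side. This is a sketch against &lt;code&gt;sim_bp_orders&lt;/code&gt;; the window names &lt;code&gt;w_default&lt;/code&gt; and &lt;code&gt;w_full&lt;/code&gt; are arbitrary:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT order_id,
       created_at,
       status,
       -- Default frame ends at the current row: echoes this row's status.
       last_value(status) OVER w_default AS last_with_default_frame,
       -- Explicit full-partition frame: the user's most recent status.
       last_value(status) OVER w_full    AS last_in_partition
FROM sim_bp_orders
WHERE user_id = 42
WINDOW w_default AS (PARTITION BY user_id ORDER BY created_at),
       w_full    AS (PARTITION BY user_id ORDER BY created_at
                     ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;An alternative that avoids the frame question altogether is &lt;code&gt;first_value(status)&lt;/code&gt; over the same partition ordered by &lt;code&gt;created_at DESC&lt;/code&gt;.&lt;/p&gt;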

&lt;h2&gt;
  
  
  Aggregate-related indexes
&lt;/h2&gt;

&lt;p&gt;An index on the GROUP BY columns is a straightforward win when it exists: the planner can use &lt;code&gt;GroupAggregate&lt;/code&gt; over an index scan and skip the hash build entirely. The group key has to be the &lt;em&gt;leading&lt;/em&gt; part of the index: a composite index on &lt;code&gt;(status, created_at)&lt;/code&gt; serves &lt;code&gt;GROUP BY status&lt;/code&gt;, but one on &lt;code&gt;(created_at, status)&lt;/code&gt; doesn't.&lt;/p&gt;
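
&lt;p&gt;A sketch of the supporting index, with a hypothetical name. Whether the planner actually switches to it depends on costs; on a table with only five distinct statuses it may well keep the parallel hash plan shown earlier:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical index: status leads, so rows come back grouped by status.
CREATE INDEX idx_orders_status_created
    ON sim_bp_orders (status, created_at);

-- Check whether the plan becomes GroupAggregate over an index scan.
EXPLAIN
SELECT status, count(*)
FROM sim_bp_orders
GROUP BY status;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;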

&lt;p&gt;For queries that frequently aggregate a narrow window of a big table (&lt;code&gt;WHERE created_at &amp;gt; ... GROUP BY user_id&lt;/code&gt;), a partial index or a materialised view of the aggregate result is usually the right answer, because not re-aggregating millions of rows on every call beats any planner optimisation. Precomputation is the most robust performance tactic for aggregates.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick diagnostic checklist
&lt;/h2&gt;

&lt;p&gt;When an aggregate query is slow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Is the aggregate node a &lt;code&gt;HashAggregate&lt;/code&gt; with &lt;code&gt;Batches &amp;gt; 1&lt;/code&gt; or &lt;code&gt;Disk Usage &amp;gt; 0&lt;/code&gt;?&lt;/strong&gt; The hash table spilled. Raise &lt;code&gt;work_mem&lt;/code&gt; for the session, or create a supporting index to enable GroupAggregate instead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Is there a &lt;code&gt;Sort&lt;/code&gt; feeding a &lt;code&gt;GroupAggregate&lt;/code&gt; with &lt;code&gt;Sort Method: external merge&lt;/code&gt;?&lt;/strong&gt; The sort spilled. Same fix: more &lt;code&gt;work_mem&lt;/code&gt;, or an index that provides pre-sorted input.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Is there a &lt;code&gt;WindowAgg&lt;/code&gt; over a &lt;code&gt;Sort&lt;/code&gt; that processes all input before the &lt;code&gt;LIMIT&lt;/code&gt;?&lt;/strong&gt; Check if a &lt;code&gt;Run Condition&lt;/code&gt; is possible (PG 15+) or if the problem can be rewritten as &lt;code&gt;LATERAL + LIMIT N&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Is the aggregate running every time the dashboard loads?&lt;/strong&gt; Move it behind a materialised view refreshed on schedule. This is usually the biggest win of all.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Does the MyDBA analyzer flag &lt;code&gt;sort_on_disk&lt;/code&gt;, &lt;code&gt;hash_batches_spill&lt;/code&gt;, or &lt;code&gt;temp_blocks_written&lt;/code&gt;?&lt;/strong&gt; These are the three rules that specifically target aggregate-related spills; if any fire, follow the suggestion inline.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Next steps
&lt;/h2&gt;

&lt;p&gt;Aggregates interact closely with the shape of your WHERE clauses — a filter that narrows the input set before aggregation is almost always cheaper than aggregating and then filtering. The next article, &lt;a href="https://mydba.dev/blog/postgres-where-clause-optimization" rel="noopener noreferrer"&gt;WHERE Clause Optimisation&lt;/a&gt;, covers sargability and composite-index ordering in detail, with an eye toward getting predicates to apply as early in the plan as possible.&lt;/p&gt;





&lt;p&gt;Full article with the complete series: &lt;a href="https://mydba.dev/blog/postgres-aggregate-window-tuning" rel="noopener noreferrer"&gt;https://mydba.dev/blog/postgres-aggregate-window-tuning&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>performance</category>
      <category>postgres</category>
      <category>sql</category>
    </item>
    <item>
      <title>PostgreSQL Subquery and CTE Optimization</title>
      <dc:creator>Philip McClarence</dc:creator>
      <pubDate>Wed, 29 Apr 2026 14:00:05 +0000</pubDate>
      <link>https://dev.to/philip_mcclarence_2ef9475/postgresql-subquery-and-cte-optimization-53f4</link>
      <guid>https://dev.to/philip_mcclarence_2ef9475/postgresql-subquery-and-cte-optimization-53f4</guid>
      <description>&lt;p&gt;Every &lt;code&gt;SELECT&lt;/code&gt; in PostgreSQL is made of smaller &lt;code&gt;SELECT&lt;/code&gt;s, even when it doesn't look that way. &lt;code&gt;WHERE col IN (SELECT ...)&lt;/code&gt;, &lt;code&gt;WHERE EXISTS (SELECT ...)&lt;/code&gt;, &lt;code&gt;(SELECT count(*) FROM ... WHERE ...)&lt;/code&gt; in the column list, &lt;code&gt;WITH x AS (SELECT ...)&lt;/code&gt; — these look syntactically different but all get rewritten into plan nodes at plan time. Which plan node the planner chooses determines whether your query runs in three milliseconds or three seconds, and the rules are different for each pattern.&lt;/p&gt;

&lt;p&gt;This is part of the &lt;a href="https://mydba.dev/blog/postgres-query-analysis-complete-guide" rel="noopener noreferrer"&gt;Complete Guide to PostgreSQL SQL Query Analysis &amp;amp; Optimization&lt;/a&gt;. It assumes you can &lt;a href="https://mydba.dev/blog/postgres-explain-analyze-reading" rel="noopener noreferrer"&gt;read EXPLAIN output&lt;/a&gt; and are familiar with how the planner &lt;a href="https://mydba.dev/blog/postgres-join-optimization" rel="noopener noreferrer"&gt;chooses join strategies&lt;/a&gt;. The running dataset is a 500k-row &lt;code&gt;sim_bp_orders&lt;/code&gt; table and a 200k-row &lt;code&gt;sim_bp_users&lt;/code&gt; table on Neon Postgres 17.8.&lt;/p&gt;

&lt;p&gt;We'll cover: scalar and existence subqueries (&lt;code&gt;SubPlan&lt;/code&gt;, &lt;code&gt;EXISTS&lt;/code&gt;, &lt;code&gt;IN&lt;/code&gt;), when correlated subqueries should be rewritten as joins, how CTEs are executed on modern PostgreSQL, when to use &lt;code&gt;MATERIALIZED&lt;/code&gt; vs &lt;code&gt;NOT MATERIALIZED&lt;/code&gt;, &lt;code&gt;LATERAL&lt;/code&gt; joins, and recursive CTEs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scalar correlated subqueries — the SubPlan trap
&lt;/h2&gt;

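&lt;p&gt;A scalar subquery in the column list is the easiest way to accidentally write a query that repeats its inner work for every outer row (O(n²) in the worst case, when the inner lookup has no usable index):&lt;br&gt;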
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
          &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
         &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
           &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pending'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;pending_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The query reads naturally: "for each active user, count their pending orders." The plan is what that description implies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;Limit&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;1642&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;83&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;33&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;088&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;438&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;Buffers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;565&lt;/span&gt; &lt;span class="k"&gt;read&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;
  &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="k"&gt;Index&lt;/span&gt; &lt;span class="n"&gt;Scan&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;sim_bp_users_pkey&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;sim_bp_users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;3118066&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;189807&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;33&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;087&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;433&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;SubPlan&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
          &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="k"&gt;Aggregate&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;033&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;033&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="n"&gt;Bitmap&lt;/span&gt; &lt;span class="n"&gt;Heap&lt;/span&gt; &lt;span class="n"&gt;Scan&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
                      &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;032&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;033&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                      &lt;span class="k"&gt;Recheck&lt;/span&gt; &lt;span class="n"&gt;Cond&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                      &lt;span class="n"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pending'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                      &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="n"&gt;Bitmap&lt;/span&gt; &lt;span class="k"&gt;Index&lt;/span&gt; &lt;span class="n"&gt;Scan&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;idx_sim_bp_orders_user_id&lt;/span&gt;
                            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;029&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;029&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                            &lt;span class="k"&gt;Index&lt;/span&gt; &lt;span class="n"&gt;Cond&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="n"&gt;Execution&lt;/span&gt; &lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;444&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two signals. First, the &lt;code&gt;SubPlan 1&lt;/code&gt; node sits inside the outer index scan, so it runs &lt;em&gt;once per outer row&lt;/em&gt;. &lt;code&gt;actual time=0.033..0.033 rows=1 loops=100&lt;/code&gt; tells you the subquery was executed 100 times (once per user returned). With &lt;code&gt;LIMIT 100&lt;/code&gt; it's cheap; without the limit it would run once per active user, roughly 190,000 times going by the planner's &lt;code&gt;rows=189807&lt;/code&gt; estimate, and at 0.033 ms per execution that is more than six seconds of subquery time before any other work.&lt;/p&gt;

&lt;p&gt;Second, &lt;code&gt;SubPlan N&lt;/code&gt; in a plan is a heads-up that the query is executing per-outer-row work, which is almost always worth rewriting — either as an aggregating JOIN or a correlated aggregate pushed into a LATERAL. Both rewrites scale better as the outer set grows.&lt;/p&gt;
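
&lt;p&gt;The aggregating-JOIN rewrite could look roughly like the sketch below. It reuses the tables and columns from the query above; the alias &lt;code&gt;p&lt;/code&gt; and the &lt;code&gt;COALESCE&lt;/code&gt; are illustrative choices, and no timing is claimed for it here.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Count pending orders once per user_id, then join the result to users.
SELECT u.user_id,
       u.email,
       COALESCE(p.pending_count, 0) AS pending_count  -- users with no pending orders get 0
FROM sim_bp_users u
LEFT JOIN (
    SELECT user_id, count(*) AS pending_count
    FROM sim_bp_orders
    WHERE status = 'pending'
    GROUP BY user_id
) p ON p.user_id = u.user_id
WHERE u.status = 'active';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;LEFT JOIN&lt;/code&gt; plus &lt;code&gt;COALESCE&lt;/code&gt; preserves the original semantics, where a user with no pending orders still appears with a count of zero. Whether this beats the SubPlan form depends on how many outer rows you need: the derived table aggregates every pending order up front, so with a small &lt;code&gt;LIMIT&lt;/code&gt; the per-row version may still be cheaper. Check both with &lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/code&gt;.&lt;/p&gt;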

&lt;h2&gt;
  
  
  EXISTS, IN, and JOIN — three ways to express "filter by related rows"
&lt;/h2&gt;

&lt;p&gt;For the "find rows that have at least one related row" pattern, SQL offers three syntactic choices. They don't all produce the same plan.&lt;/p&gt;

&lt;p&gt;Here is the earlier query recast with &lt;code&gt;EXISTS&lt;/code&gt;. Instead of counting, it asks a boolean question, "find active users who have at least one pending order":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
      &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
        &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pending'&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;Limit&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;137&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;98&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;089&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;238&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;Buffers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;726&lt;/span&gt; &lt;span class="k"&gt;read&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="n"&gt;Merge&lt;/span&gt; &lt;span class="n"&gt;Semi&lt;/span&gt; &lt;span class="k"&gt;Join&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;088&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;234&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;Merge&lt;/span&gt; &lt;span class="n"&gt;Cond&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="k"&gt;Index&lt;/span&gt; &lt;span class="n"&gt;Scan&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;sim_bp_users_pkey&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;sim_bp_users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
              &lt;span class="n"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="k"&gt;Index&lt;/span&gt; &lt;span class="n"&gt;Scan&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;idx_sim_bp_orders_user_id&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
              &lt;span class="n"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pending'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
              &lt;span class="k"&gt;Rows&lt;/span&gt; &lt;span class="n"&gt;Removed&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;586&lt;/span&gt;
 &lt;span class="n"&gt;Execution&lt;/span&gt; &lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;240&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The planner picked a &lt;strong&gt;Merge Semi Join&lt;/strong&gt;. A semi join stops at the first match per outer row, which is exactly what EXISTS semantics require. Both sides arrive as user_id-ordered streams (left from the users primary-key btree; right from &lt;code&gt;idx_sim_bp_orders_user_id&lt;/code&gt; with &lt;code&gt;status='pending'&lt;/code&gt; as a filter), and the merge walks them in lockstep. No per-outer-row SubPlan, no re-execution. The planner doesn't always pick Merge Semi Join; a Nested Loop Semi Join with an index probe is also common, especially with a tight outer &lt;code&gt;LIMIT&lt;/code&gt;. Both shapes scale roughly linearly with the input, whereas the SubPlan pattern pays a full subquery execution per outer row and degrades to quadratic when the inner lookup cannot use an index.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;IN (SELECT ...)&lt;/code&gt;&lt;/strong&gt; is a third way. Most of the time PostgreSQL treats &lt;code&gt;WHERE col IN (SELECT ...)&lt;/code&gt; and &lt;code&gt;WHERE EXISTS (SELECT ... WHERE ... = col)&lt;/code&gt; identically, producing the same plan. Two gotchas:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;NOT IN&lt;/code&gt; with nullable columns&lt;/strong&gt; is not equivalent to &lt;code&gt;NOT EXISTS&lt;/code&gt;. If any value in the inner set is NULL, &lt;code&gt;NOT IN&lt;/code&gt; can never evaluate to true (it yields false or unknown), so the query returns no rows. Always prefer &lt;code&gt;NOT EXISTS&lt;/code&gt; unless you've proven the column is NOT NULL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;IN&lt;/code&gt; on a literal value list&lt;/strong&gt; (&lt;code&gt;WHERE id IN (1, 2, 3)&lt;/code&gt;) is a different beast: it is syntactic sugar for &lt;code&gt;id = ANY (ARRAY[1,2,3])&lt;/code&gt; and has nothing to do with subqueries.&lt;/li&gt;
&lt;/ol&gt;
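
&lt;p&gt;The first gotcha is easy to reproduce without any tables. The sketch below uses inline &lt;code&gt;VALUES&lt;/code&gt; lists (the aliases &lt;code&gt;t&lt;/code&gt;, &lt;code&gt;s&lt;/code&gt;, &lt;code&gt;x&lt;/code&gt;, and &lt;code&gt;y&lt;/code&gt; are made up for the example) to show the two forms diverging once the inner set contains a NULL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- NOT IN: the NULL in the inner set means the predicate is never true.
-- x = 1 gives false; x = 2 and x = 3 give unknown. Result: zero rows.
SELECT x
FROM (VALUES (1), (2), (3)) AS t(x)
WHERE x NOT IN (SELECT y FROM (VALUES (1), (NULL)) AS s(y));

-- NOT EXISTS: the NULL simply never matches. Result: rows 2 and 3.
SELECT x
FROM (VALUES (1), (2), (3)) AS t(x)
WHERE NOT EXISTS (
    SELECT 1
    FROM (VALUES (1), (NULL)) AS s(y)
    WHERE s.y = t.x
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;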

&lt;p&gt;&lt;strong&gt;An explicit JOIN&lt;/strong&gt; works too, but duplicates outer rows for each matching inner row:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pending'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;DISTINCT&lt;/code&gt; is required because a user with five pending orders would otherwise appear five times. This form is usually slower than EXISTS, because it produces every matching row and then de-duplicates them, and it is easy to forget the &lt;code&gt;DISTINCT&lt;/code&gt;. Use EXISTS for existence questions, JOIN for data you actually want from the related table.&lt;/p&gt;

&lt;p&gt;Rule of thumb:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Count&lt;/strong&gt; of related rows → aggregating subquery or aggregating JOIN with GROUP BY.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Existence&lt;/strong&gt; → &lt;code&gt;EXISTS&lt;/code&gt;. (&lt;strong&gt;Non-existence&lt;/strong&gt; → &lt;code&gt;NOT EXISTS&lt;/code&gt;.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data from related rows&lt;/strong&gt; → regular JOIN.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;First/last/top-N of related rows per outer&lt;/strong&gt; → LATERAL (below).&lt;/li&gt;
&lt;/ul&gt;
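
&lt;p&gt;The non-existence case mirrors the &lt;code&gt;EXISTS&lt;/code&gt; query shown earlier. A minimal sketch, reusing the same tables: "find active users with no pending orders". The planner typically implements this as an anti join, though the exact node depends on statistics and version, so treat that as an expectation to verify with &lt;code&gt;EXPLAIN&lt;/code&gt; rather than a guarantee.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Active users for whom no pending order exists.
SELECT u.user_id, u.email
FROM sim_bp_users u
WHERE u.status = 'active'
  AND NOT EXISTS (
      SELECT 1
      FROM sim_bp_orders o
      WHERE o.user_id = u.user_id
        AND o.status = 'pending'
  );
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;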

&lt;h2&gt;
  
  
  LATERAL — top-N per group without window functions
&lt;/h2&gt;

&lt;p&gt;A LATERAL join lets a subquery on the right side of a FROM reference columns from the left side: "for each row on the left, evaluate this subquery with those columns bound, and join the result." It is the SQL-standard way to express "the latest order per customer" or "the most recent status message per ticket", or any other top-N per outer group.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;latest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;latest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_amount_cents&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
&lt;span class="k"&gt;CROSS&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="k"&gt;LATERAL&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_amount_cents&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
    &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
    &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;latest&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



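&lt;p&gt;"For each active user, return their most recent order." Because this is a &lt;code&gt;CROSS JOIN LATERAL&lt;/code&gt;, users with no orders drop out of the result; &lt;code&gt;LEFT JOIN LATERAL ... ON true&lt;/code&gt; would keep them, with NULLs in the order columns.&lt;br&gt;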
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;Nested&lt;/span&gt; &lt;span class="n"&gt;Loop&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;016&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;452&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;Buffers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;314&lt;/span&gt;
  &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="k"&gt;Index&lt;/span&gt; &lt;span class="n"&gt;Scan&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;sim_bp_users_pkey&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;sim_bp_users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
        &lt;span class="n"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="n"&gt;Subquery&lt;/span&gt; &lt;span class="n"&gt;Scan&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;latest&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;008&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;008&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;54&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="n"&gt;Sort&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;007&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;007&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;54&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
              &lt;span class="n"&gt;Sort&lt;/span&gt; &lt;span class="k"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
              &lt;span class="n"&gt;Sort&lt;/span&gt; &lt;span class="k"&gt;Method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;quicksort&lt;/span&gt;  &lt;span class="n"&gt;Memory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="n"&gt;kB&lt;/span&gt;
              &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="n"&gt;Bitmap&lt;/span&gt; &lt;span class="n"&gt;Heap&lt;/span&gt; &lt;span class="n"&gt;Scan&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;003&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;006&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;54&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="k"&gt;Recheck&lt;/span&gt; &lt;span class="n"&gt;Cond&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="n"&gt;Execution&lt;/span&gt; &lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;462&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;0.46 ms. The executor ran the lateral subquery 54 times: once per active user, stopping when the outer &lt;code&gt;LIMIT 50&lt;/code&gt; was satisfied. Four of those users had no orders and contributed no row, which is why 54 loops produced 50 rows. Each lateral execution was a cheap bitmap scan on the &lt;code&gt;user_id&lt;/code&gt; index plus a tiny sort bounded by &lt;code&gt;LIMIT 1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The window-function equivalent — &lt;code&gt;ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC)&lt;/code&gt; with an outer &lt;code&gt;WHERE rn = 1&lt;/code&gt; — often produces a worse plan on PostgreSQL when only the top 1 or 2 per group are needed, because it computes row numbers for every row before filtering. LATERAL with a &lt;code&gt;LIMIT&lt;/code&gt; inside lets the planner stop early.&lt;/p&gt;
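
&lt;p&gt;For comparison, here is a sketch of the window-function form described above. It assumes the same tables and columns that appear in this article's plans and queries (&lt;code&gt;sim_bp_users.user_id&lt;/code&gt;, &lt;code&gt;email&lt;/code&gt;, &lt;code&gt;status&lt;/code&gt;; &lt;code&gt;sim_bp_orders.order_id&lt;/code&gt;, &lt;code&gt;user_id&lt;/code&gt;, &lt;code&gt;created_at&lt;/code&gt;); the CTE name &lt;code&gt;ranked&lt;/code&gt; and the alias &lt;code&gt;rn&lt;/code&gt; are illustrative only.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Window-function form: orders are typically sorted and numbered per user
-- across the whole table before the rn = 1 filter keeps one row per user.
WITH ranked AS (
    SELECT o.order_id,
           o.user_id,
           o.created_at,
           ROW_NUMBER() OVER (PARTITION BY o.user_id
                              ORDER BY o.created_at DESC) AS rn
    FROM sim_bp_orders o
)
SELECT u.user_id, u.email, r.order_id, r.created_at
FROM sim_bp_users u
JOIN ranked r ON r.user_id = u.user_id
WHERE u.status = 'active'
  AND r.rn = 1
LIMIT 50;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Running both forms under &lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/code&gt; is the reliable way to compare them on a given dataset; the gap depends on how many orders each user has and on the PostgreSQL version.&lt;/p&gt;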

&lt;p&gt;Two practical notes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;CROSS JOIN LATERAL&lt;/code&gt; vs &lt;code&gt;LEFT JOIN LATERAL&lt;/code&gt;.&lt;/strong&gt; &lt;code&gt;CROSS JOIN LATERAL&lt;/code&gt; drops outer rows where the subquery returns nothing. &lt;code&gt;LEFT JOIN LATERAL ... ON TRUE&lt;/code&gt; preserves them with NULLs. Swapping them changes results silently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Indexes matter more than for anything else.&lt;/strong&gt; The subquery runs per outer row, so any table scan inside it multiplies. The lateral on &lt;code&gt;sim_bp_orders.user_id&lt;/code&gt; was quick because &lt;code&gt;idx_sim_bp_orders_user_id&lt;/code&gt; exists. Without it, each of the 54 executions would fall back to scanning the whole 500,000-row orders table, and the cost would grow with every additional outer row.&lt;/li&gt;
&lt;/ul&gt;
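
&lt;p&gt;The first note can be made concrete with a pair of queries. This is a sketch that assumes the same &lt;code&gt;sim_bp_users&lt;/code&gt; and &lt;code&gt;sim_bp_orders&lt;/code&gt; columns used elsewhere in this article; only the join keyword differs between the two statements.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- CROSS JOIN LATERAL: users with no orders disappear from the result.
SELECT u.user_id, latest.order_id, latest.created_at
FROM sim_bp_users u
CROSS JOIN LATERAL (
    SELECT o.order_id, o.created_at
    FROM sim_bp_orders o
    WHERE o.user_id = u.user_id
    ORDER BY o.created_at DESC
    LIMIT 1
) AS latest
WHERE u.status = 'active';

-- LEFT JOIN LATERAL ... ON TRUE: users with no orders are kept,
-- with NULL in latest.order_id and latest.created_at.
SELECT u.user_id, latest.order_id, latest.created_at
FROM sim_bp_users u
LEFT JOIN LATERAL (
    SELECT o.order_id, o.created_at
    FROM sim_bp_orders o
    WHERE o.user_id = u.user_id
    ORDER BY o.created_at DESC
    LIMIT 1
) AS latest ON TRUE
WHERE u.status = 'active';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If the two statements return different row counts, the difference is the number of active users without any orders.&lt;/p&gt;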

&lt;h2&gt;
  
  
  CTEs: no longer materialised by default
&lt;/h2&gt;

&lt;p&gt;Before PostgreSQL 12, every &lt;code&gt;WITH&lt;/code&gt; clause was an &lt;strong&gt;optimisation fence&lt;/strong&gt;: the CTE was computed in full and stored in a temporary buffer, and the planner could not push predicates from the outer query into the CTE. People used this intentionally (the "CTE trick" to force materialisation), but it also silently hurt a lot of queries.&lt;/p&gt;

&lt;p&gt;PostgreSQL 12 reversed the default. A CTE that is non-recursive, free of side effects (a plain &lt;code&gt;SELECT&lt;/code&gt; with no volatile functions), and referenced &lt;strong&gt;once&lt;/strong&gt; is now inlined: the planner treats it like a subquery, and predicate pushdown works as expected. CTEs referenced &lt;strong&gt;multiple times&lt;/strong&gt; are still materialised by default, and recursive CTEs or CTEs containing &lt;code&gt;INSERT/UPDATE/DELETE&lt;/code&gt; are never inlined.&lt;/p&gt;

&lt;p&gt;Two keywords override the default:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WITH foo AS NOT MATERIALIZED (...)&lt;/code&gt;&lt;/strong&gt; — force inlining even if referenced multiple times.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WITH foo AS MATERIALIZED (...)&lt;/code&gt;&lt;/strong&gt; — force materialisation even if referenced only once.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Typical cases:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Inlined by default — works like a subquery, predicates push in.&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;recent_pending&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pending'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;rp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;recent_pending&lt;/span&gt; &lt;span class="n"&gt;rp&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;sim_bp_users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;rp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="s1"&gt;'7 days'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;created_at &amp;gt; now() - interval '7 days'&lt;/code&gt; filter is pushed into the CTE, so the combined filter (&lt;code&gt;status = 'pending' AND created_at &amp;gt; ...&lt;/code&gt;) can use a single index scan rather than materialising all pending orders first.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Expensive aggregation referenced twice — worth materialising once.&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;user_totals&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;MATERIALIZED&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_amount_cents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt;
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ut&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;user_totals&lt;/span&gt; &lt;span class="n"&gt;ut&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;ut&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ut&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1000000&lt;/span&gt;

&lt;span class="k"&gt;UNION&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;user_totals&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



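&lt;p&gt;Because &lt;code&gt;user_totals&lt;/code&gt; is referenced twice, PostgreSQL would materialise it by default even without the keyword. Writing &lt;code&gt;MATERIALIZED&lt;/code&gt; makes the intent explicit and keeps the behaviour stable if a later edit removes one of the references. Either way the aggregation runs once and both references read the stored result; it would run twice (once per reference) only under &lt;code&gt;NOT MATERIALIZED&lt;/code&gt;.&lt;/p&gt;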
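
&lt;p&gt;The opposite override can be sketched the same way. In this hypothetical example the CTE is cheap and each reference applies its own selective filter, so inlining it twice may beat materialising every pending order once. The CTE name &lt;code&gt;pending&lt;/code&gt; and the bucket labels are illustrative; whether it actually wins depends on the data and indexes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Referenced twice, so it would be materialised by default.
-- NOT MATERIALIZED lets each reference be planned with its own filter pushed in.
WITH pending AS NOT MATERIALIZED (
    SELECT order_id, user_id, created_at
    FROM sim_bp_orders
    WHERE status = 'pending'
)
SELECT 'last_day' AS bucket, count(*) AS pending_orders
FROM pending
WHERE created_at &amp;gt; now() - interval '1 day'

UNION ALL

SELECT 'last_week', count(*)
FROM pending
WHERE created_at &amp;gt; now() - interval '7 days';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In the &lt;code&gt;EXPLAIN&lt;/code&gt; output, a materialised CTE shows up as a &lt;code&gt;CTE Scan&lt;/code&gt; node; when the CTE is inlined, that node is absent and the underlying table scans appear directly.&lt;/p&gt;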

&lt;h2&gt;
  
  
  Recursive CTEs
&lt;/h2&gt;

&lt;p&gt;Recursive CTEs are for hierarchical data: trees, graphs, transitive closures, category parents, reporting chains.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="k"&gt;RECURSIVE&lt;/span&gt; &lt;span class="n"&gt;employee_tree&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="c1"&gt;-- Base case: root of the tree&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;employee_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;manager_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;manager_id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;

    &lt;span class="k"&gt;UNION&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt;

    &lt;span class="c1"&gt;-- Recursive step: children of previously-found rows&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;employee_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;et&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
    &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;employee_tree&lt;/span&gt; &lt;span class="n"&gt;et&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;et&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;employee_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employee_tree&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PostgreSQL computes the base case, then repeatedly applies the recursive step to previously-produced rows until no new rows are generated. Two practical concerns:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
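&lt;strong&gt;No termination on cyclic data.&lt;/strong&gt; The recursion stops only when an iteration produces no new rows. If the traversal can reach a cycle (for example, it starts from an employee who is indirectly their own manager), the recursive step with &lt;code&gt;UNION ALL&lt;/code&gt; keeps producing rows forever. Use &lt;code&gt;depth &amp;lt; N&lt;/code&gt; as a guard when testing, and track the visited path (or use the &lt;code&gt;CYCLE&lt;/code&gt; clause on PostgreSQL 14+) when cycles are possible.&lt;/li&gt;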
&lt;li&gt;
&lt;strong&gt;Index the join column.&lt;/strong&gt; The recursive step joins the rows produced by the previous iteration against the source table, once per iteration. Without an index on &lt;code&gt;employees.manager_id&lt;/code&gt;, every iteration has to scan the whole table.&lt;/li&gt;
&lt;/ol&gt;
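
&lt;p&gt;Both concerns can be addressed in one sketch. It assumes the &lt;code&gt;employees&lt;/code&gt; table from the example above; the index name, the starting employee id &lt;code&gt;42&lt;/code&gt;, the depth cap of 20, and the &lt;code&gt;path&lt;/code&gt; column are all illustrative choices, not part of the original query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Supporting index for the recursive join (name is hypothetical).
CREATE INDEX IF NOT EXISTS idx_employees_manager_id
    ON employees (manager_id);

WITH RECURSIVE employee_tree AS (
    -- Base case: one starting employee (id 42 is a placeholder)
    SELECT employee_id, manager_id, name, 1 AS depth,
           ARRAY[employee_id] AS path
    FROM employees
    WHERE employee_id = 42

    UNION ALL

    -- Recursive step: direct reports of rows found in the previous iteration
    SELECT e.employee_id, e.manager_id, e.name, et.depth + 1,
           et.path || e.employee_id
    FROM employees e
    JOIN employee_tree et ON et.employee_id = e.manager_id
    WHERE et.depth &amp;lt; 20                         -- depth guard: arbitrary cap
      AND e.employee_id &amp;lt;&amp;gt; ALL (et.path)        -- skip ids already on this path
)
SELECT * FROM employee_tree;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The path check stops a branch as soon as it would revisit an employee, so the query terminates even on cyclic data; the depth cap is a second, cruder safety net that can be dropped once the data is trusted.&lt;/p&gt;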

&lt;p&gt;For transitive-closure queries (shortest paths, graph traversals), recursive CTEs work but scale poorly beyond a few tens of thousands of rows. For heavier graph workloads, look at dedicated extensions or materialised adjacency tables.&lt;/p&gt;

&lt;h2&gt;
  
  
  Subqueries in the FROM clause
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;SELECT ... FROM (SELECT ...) AS sub&lt;/code&gt; is semantically just a derived table. The planner normally flattens a simple subquery into the parent query and pushes predicates in; since PG 12, a CTE referenced once gets the same treatment.&lt;/p&gt;

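&lt;p&gt;One case where FROM subqueries matter: forcing a computation to happen once per row rather than once per mention. If you have &lt;code&gt;SELECT ..., f(x) AS computed_val FROM t WHERE f(x) &amp;gt; 10&lt;/code&gt;, PostgreSQL evaluates &lt;code&gt;f(x)&lt;/code&gt; for the filter and again for the projection of every row that passes, because it does not eliminate common subexpressions; marking &lt;code&gt;f&lt;/code&gt; as &lt;code&gt;STABLE&lt;/code&gt; or &lt;code&gt;IMMUTABLE&lt;/code&gt; does not change that. Wrapping the expensive call in a FROM subquery gives one call per row only if the planner does not flatten the subquery back into the outer query, which usually means adding a fence such as &lt;code&gt;OFFSET 0&lt;/code&gt; or using a &lt;code&gt;MATERIALIZED&lt;/code&gt; CTE.&lt;/p&gt;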
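
&lt;p&gt;A minimal sketch of that wrapping, assuming a hypothetical table &lt;code&gt;t(id, x)&lt;/code&gt; and a hypothetical expensive function &lt;code&gt;expensive_score()&lt;/code&gt; standing in for &lt;code&gt;f&lt;/code&gt;. The &lt;code&gt;OFFSET 0&lt;/code&gt; is there only to keep the subquery from being flattened, which is what would otherwise put the function call back into both the filter and the projection.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- t and expensive_score() are placeholders, not objects from the sample schema.
SELECT sub.id, sub.computed_val
FROM (
    SELECT t.id, expensive_score(t.x) AS computed_val
    FROM t
    OFFSET 0          -- keeps the planner from flattening this subquery
) AS sub
WHERE sub.computed_val &amp;gt; 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The trade-off is that the fence also blocks predicate pushdown, so the function is computed for every row of &lt;code&gt;t&lt;/code&gt; before the filter applies. That costs nothing extra here, because the filter needs the value anyway, but it would hurt if the outer query also filtered on an indexed column.&lt;/p&gt;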

&lt;h2&gt;
  
  
  Practical rules
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SubPlan in the plan output&lt;/strong&gt; → consider rewriting as a JOIN or LATERAL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EXISTS / IN / JOIN+DISTINCT&lt;/strong&gt; → default to EXISTS for boolean questions; it's usually clearest &lt;em&gt;and&lt;/em&gt; gets the best plan on PostgreSQL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NOT IN on a nullable column&lt;/strong&gt; → almost always a bug. Use NOT EXISTS.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CTE used once&lt;/strong&gt; → inlined by default in PG 12+. Don't wrap something in a CTE hoping to force materialisation; use &lt;code&gt;MATERIALIZED&lt;/code&gt; explicitly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CTE used multiple times with expensive aggregation&lt;/strong&gt; → materialisation wins, and it is already the default; add &lt;code&gt;MATERIALIZED&lt;/code&gt; to make the intent explicit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Top-N per group&lt;/strong&gt; → LATERAL with &lt;code&gt;LIMIT&lt;/code&gt; inside. Cleaner plan than window functions for small N.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recursive traversals&lt;/strong&gt; → &lt;code&gt;WITH RECURSIVE&lt;/code&gt;, but index the join column and put a depth guard on anything you're not sure terminates.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Next in the series: &lt;a href="https://mydba.dev/blog/postgres-where-clause-optimization" rel="noopener noreferrer"&gt;WHERE Clause Optimisation&lt;/a&gt; — sargability, composite-index column ordering, and the operators that silently disable indexes.&lt;/p&gt;


&lt;p&gt;Canonical version: &lt;a href="https://mydba.dev/blog/postgres-subquery-cte-optimization" rel="noopener noreferrer"&gt;https://mydba.dev/blog/postgres-subquery-cte-optimization&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>performance</category>
      <category>postgres</category>
      <category>sql</category>
    </item>
    <item>
      <title>PostgreSQL Join Optimization: Nested Loop, Hash, and Merge</title>
      <dc:creator>Philip McClarence</dc:creator>
      <pubDate>Tue, 28 Apr 2026 14:00:09 +0000</pubDate>
      <link>https://dev.to/philip_mcclarence_2ef9475/postgresql-join-optimization-nested-loop-hash-and-merge-1cn9</link>
      <guid>https://dev.to/philip_mcclarence_2ef9475/postgresql-join-optimization-nested-loop-hash-and-merge-1cn9</guid>
      <description>&lt;p&gt;PostgreSQL has three join algorithms. The planner picks between them for every join in every query, driven by several things at once: the estimated sizes of the two inputs, whether they arrive already sorted on the join key, the type of join (inner vs left/semi/anti), which operators are &lt;code&gt;mergejoinable&lt;/code&gt; or &lt;code&gt;hashjoinable&lt;/code&gt;, whether a hash table will fit in &lt;code&gt;work_mem&lt;/code&gt;, and the cost parameters that weigh I/O against CPU. Get the decision right and a three-way join across millions of rows runs in tens of milliseconds. Get it wrong — usually by encouraging a Nested Loop on two large unsorted inputs — and the same query takes minutes.&lt;/p&gt;

&lt;p&gt;This article is the third in the &lt;a href="https://mydba.dev/blog/postgres-query-analysis-complete-guide" rel="noopener noreferrer"&gt;Complete Guide to PostgreSQL SQL Query Analysis &amp;amp; Optimization&lt;/a&gt; series. We assume the reader can &lt;a href="https://mydba.dev/blog/postgres-explain-analyze-reading" rel="noopener noreferrer"&gt;read EXPLAIN output&lt;/a&gt; and is familiar with the &lt;a href="https://mydba.dev/blog/postgres-index-usage-optimization" rel="noopener noreferrer"&gt;indexing vocabulary&lt;/a&gt;. The running dataset is the same Neon Postgres 17.8 database used throughout the series: 500,000-row &lt;code&gt;sim_bp_orders&lt;/code&gt;, 1,000,000-row &lt;code&gt;sim_bp_order_items&lt;/code&gt;, 200,000-row &lt;code&gt;sim_bp_users&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We'll cover how each of the three join strategies works, when the planner picks each, what indexes each one wants, and how to read multi-way joins.&lt;/p&gt;

&lt;h2&gt;
  
  
  Nested Loop — small outer, indexed inner
&lt;/h2&gt;

&lt;p&gt;Nested Loop is the simplest strategy: for each row on the outer side, scan the inner side for matches. Without any index on the inner side, this is a full scan per outer row — O(outer × inner) — and catastrophic for two large tables. With an index on the inner side's join key, each "scan" of the inner is a handful of page reads (a btree descent plus a heap fetch for any columns not in the index), so the total cost is &lt;em&gt;outer-rows × random-I/O-per-probe&lt;/em&gt; rather than a polynomial blowup. When the outer side is small and the inner has an index, Nested Loop is nearly unbeatable.&lt;/p&gt;

&lt;p&gt;Here's a three-way join that the planner executes as a tower of Nested Loops. The query is "twenty recent pending orders with the user's email and the items in each order":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;oi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;oi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unit_price_cents&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;sim_bp_order_items&lt;/span&gt; &lt;span class="n"&gt;oi&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;oi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pending'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Limit  (cost=1.28..18.06 rows=20 width=41) (actual time=5.098..43.038 rows=20 loops=1)
  Buffers: shared hit=96 read=45
  -&amp;gt;  Nested Loop  (cost=1.28..159741.54 rows=190376 width=41)
        (actual time=5.097..43.027 rows=20 loops=1)
        -&amp;gt;  Nested Loop  (cost=0.85..78058.04 rows=95188 width=33)
              (actual time=2.949..10.878 rows=9 loops=1)
              Inner Unique: true
              -&amp;gt;  Index Scan Backward using idx_sim_bp_orders_created_at on sim_bp_orders o
                    (cost=0.42..30949.29 rows=100300 width=16)
                    (actual time=1.566..2.610 rows=9 loops=1)
                    Filter: ((o.status)::text = 'pending'::text)
                    Rows Removed by Filter: 44
              -&amp;gt;  Memoize  (cost=0.43..0.55 rows=1 width=25)
                    (actual time=0.916..0.916 rows=1 loops=9)
                    Cache Key: o.user_id
                    Cache Mode: logical
                    Hits: 0  Misses: 9  Evictions: 0  Overflows: 0  Memory Usage: 2kB
                    -&amp;gt;  Index Scan using sim_bp_users_pkey on sim_bp_users u
                          (cost=0.42..0.54 rows=1 width=25)
                          (actual time=0.846..0.846 rows=1 loops=9)
                          Index Cond: (u.user_id = o.user_id)
                          Filter: ((u.status)::text = 'active'::text)
        -&amp;gt;  Index Scan using idx_sim_bp_order_items_order_id on sim_bp_order_items oi
              (cost=0.42..0.83 rows=3 width=12)
              (actual time=2.418..3.567 rows=2 loops=9)
              Index Cond: (oi.order_id = o.order_id)
 Execution Time: 43.129 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;43 ms for a three-way join across 200k × 500k × 1M rows is good. The plan is a tower of two Nested Loops: the inner one joins orders and users, and the outer one joins that intermediate result with order items. Read it in execution order, starting from the innermost node:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;code&gt;Index Scan Backward&lt;/code&gt; on &lt;code&gt;sim_bp_orders.created_at&lt;/code&gt; walks the index in reverse — newest first — looking for pending orders. &lt;code&gt;rows=9 loops=1&lt;/code&gt; means the outer driver produced nine orders before the whole pipeline had enough downstream rows to satisfy &lt;code&gt;LIMIT 20&lt;/code&gt;. Forty-four rows were read and filtered as non-pending along the way.&lt;/li&gt;
&lt;li&gt;For each of those nine orders, a &lt;code&gt;Memoize → Index Scan on sim_bp_users_pkey&lt;/code&gt; looks up the user. Memoize is a PostgreSQL 14+ cache that short-circuits the inner scan when the same key appears repeatedly; here the nine orders happen to be from nine different users, so it's effectively nine primary-key lookups with no cache hits.&lt;/li&gt;
&lt;li&gt;For each matching &lt;code&gt;(order, user)&lt;/code&gt; pair, the &lt;code&gt;Index Scan using idx_sim_bp_order_items_order_id&lt;/code&gt; on the inner side of the outer Nested Loop returns that order's line items. &lt;code&gt;rows=2 loops=9&lt;/code&gt; is a per-loop average rounded to a whole number; the join produced 20 rows over 9 loops, about 2.2 items per order. The &lt;code&gt;LIMIT 20&lt;/code&gt; applies to the final joined row count, so the executor stops as soon as 20 &lt;code&gt;(order, user, item)&lt;/code&gt; tuples have been produced, which here happens while processing the ninth order.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is the Nested Loop success case: the outer driver returns a tiny number of rows thanks to the &lt;code&gt;LIMIT&lt;/code&gt; + ordered index, and every inner lookup is an indexed point query. Without the &lt;code&gt;LIMIT&lt;/code&gt;, the planner would likely pick a very different strategy, possibly a Hash Join cascade, because it would have to produce the full join (estimated at 190,376 rows in the plan above) instead of twenty rows.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Nested Loop failure mode
&lt;/h3&gt;

&lt;p&gt;The same strategy is a disaster when the outer side is large. Consider "count the items across all pending orders," which must process 100,000 pending orders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;sim_bp_order_items&lt;/span&gt; &lt;span class="n"&gt;oi&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;oi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pending'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we force the planner to use a Nested Loop (by disabling hash and merge joins), the result is telling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Aggregate (actual time=1621.494..1621.495 rows=1 loops=1)
  Buffers: shared hit=398994 read=2894
  -&amp;gt;  Nested Loop  (actual time=6.422..1606.338 rows=200535 loops=1)
        -&amp;gt;  Index Only Scan on sim_bp_orders o
              (actual time=4.859..123.354 rows=100252 loops=1)
        -&amp;gt;  Index Only Scan on sim_bp_order_items oi
              (actual time=0.013..0.014 rows=2 loops=100252)
              Index Cond: (oi.order_id = o.order_id)
 Execution Time: 1621.525 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;1.6 seconds for the same result the planner produces in 1.2 seconds via a Parallel Hash Join (next section). More interestingly, the &lt;code&gt;Buffers&lt;/code&gt; line shows &lt;strong&gt;398,994 pages hit&lt;/strong&gt; — that's from 100,252 inner-index probes, each one re-traversing the btree descent of &lt;code&gt;idx_sim_bp_order_items_order_id&lt;/code&gt;. Many of those probes hit the same upper index pages over and over (that's why it's mostly &lt;code&gt;hit&lt;/code&gt;, not &lt;code&gt;read&lt;/code&gt;), but it's still enormous repeated page traffic that dominates CPU even when the data is fully cached. Under concurrency, other queries would find their own working set evicted from &lt;code&gt;shared_buffers&lt;/code&gt; to make room.&lt;/p&gt;

&lt;p&gt;The MyDBA analyzer rule &lt;code&gt;nested_loop_large&lt;/code&gt; is specifically for this failure mode: it fires when a Nested Loop has &lt;code&gt;Plan Rows &amp;gt; 1000&lt;/code&gt; on the outer side and &lt;code&gt;Plan Rows &amp;gt; 100&lt;/code&gt; on the inner side. At those sizes the Nested Loop is almost always the wrong strategy.&lt;/p&gt;
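
&lt;p&gt;A forced plan like the one above can be reproduced by switching off the other join methods for a single transaction. This is a sketch of one way to do it; the exact settings used for the plan shown in this article are not stated, so treat the toggles below as an assumption.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;BEGIN;

-- Planner toggles; SET LOCAL confines them to this transaction.
SET LOCAL enable_hashjoin = off;
SET LOCAL enable_mergejoin = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM sim_bp_orders o
JOIN sim_bp_order_items oi ON oi.order_id = o.order_id
WHERE o.status = 'pending';

ROLLBACK;  -- settings revert when the transaction ends
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;These switches are diagnostic tools: they make the disabled strategies look prohibitively expensive to the planner rather than forbidding them outright, and they are best kept out of application code.&lt;/p&gt;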

&lt;h2&gt;
  
  
  Hash Join — larger sides, unsorted input
&lt;/h2&gt;

&lt;p&gt;Hash Join works in two phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Build phase.&lt;/strong&gt; Read the smaller side in full, building an in-memory hash table keyed by the join column(s). This happens inside the &lt;code&gt;Hash&lt;/code&gt; node you see in the plan.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Probe phase.&lt;/strong&gt; Stream the larger side through the hash table, emitting matched rows as they come.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Hash Join doesn't care whether the inputs are sorted, which makes it the fallback when Merge Join isn't available. It wants the build side to fit in &lt;code&gt;work_mem&lt;/code&gt;; if it doesn't, the join spills: PostgreSQL partitions both sides by the join key and processes one pair of partitions at a time. Spilling is visible in the plan as &lt;code&gt;Batches &amp;gt; 1&lt;/code&gt; on the &lt;code&gt;Hash&lt;/code&gt; or &lt;code&gt;Hash Join&lt;/code&gt; node, and the MyDBA analyzer rule &lt;code&gt;hash_batches_spill&lt;/code&gt; fires on it.&lt;/p&gt;

&lt;p&gt;Here's the plan the planner actually chose for the same count query, a Parallel Hash Join:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Finalize Aggregate  (actual time=1196.234..1199.894 rows=1 loops=1)
  Buffers: shared hit=3827 read=6356
  -&amp;gt;  Gather (Workers Planned: 2, Workers Launched: 2)
        -&amp;gt;  Partial Aggregate  (actual time=1179.014..1179.016 rows=1 loops=3)
              -&amp;gt;  Parallel Hash Join
                    (actual time=170.554..1143.676 rows=333333 loops=3)
                    Hash Cond: (oi.order_id = o.order_id)
                    -&amp;gt;  Parallel Seq Scan on sim_bp_order_items oi
                          (actual time=1.589..703.241 rows=333333 loops=3)
                    -&amp;gt;  Parallel Hash
                          Buckets: 524288  Batches: 1  Memory Usage: 23712kB
                          -&amp;gt;  Parallel Seq Scan on sim_bp_orders o
                                (actual time=0.009..38.403 rows=166667 loops=3)
                                Filter: ((o.status)::text = 'pending'::text)
 Execution Time: 1199.945 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;1.2 seconds, 10,183 buffer pages touched — about 40× fewer than the forced Nested Loop. The planner built the hash table from &lt;code&gt;sim_bp_orders&lt;/code&gt; (the smaller filtered side, 100k pending rows) and probed it with &lt;code&gt;sim_bp_order_items&lt;/code&gt;. &lt;code&gt;Batches: 1&lt;/code&gt; means the hash table fit in &lt;code&gt;work_mem&lt;/code&gt; entirely, so there was no spill.&lt;/p&gt;

&lt;p&gt;Note the Parallel Seq Scan on both sides. That is not a planner mistake — when you're going to read every pending row anyway, a sequential scan is cheaper than an indexed scan because it avoids random I/O and plays nicely with read-ahead. Hash Join is perfectly happy to consume an unsorted stream.&lt;/p&gt;

&lt;p&gt;The Parallel Hash Join is a newer variant (PostgreSQL 11+) where workers collaborate to build one shared hash table and then probe it in parallel. Under the hood, &lt;code&gt;Parallel Hash&lt;/code&gt; coordinates the build; each worker contributes to it and then proceeds to scan its share of the probe side. This is why you see &lt;code&gt;Workers Planned: 2, Workers Launched: 2&lt;/code&gt; at the top and three loops in each node (one leader + two workers).&lt;/p&gt;

&lt;h3&gt;
  
  
  When Hash Join is suboptimal
&lt;/h3&gt;

&lt;p&gt;Three cases:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build side too large.&lt;/strong&gt; If even the smaller input is several times larger than &lt;code&gt;work_mem&lt;/code&gt;, the hash join spills to disk in batches and performance degrades sharply. The fix is either to raise &lt;code&gt;work_mem&lt;/code&gt; (per session, not cluster-wide) or to add an index that makes a different strategy viable. &lt;code&gt;hash_batches_spill&lt;/code&gt; flags this in the analyzer output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One side is tiny.&lt;/strong&gt; If one input is five rows and the other is fifty million, a Nested Loop into an indexed inner is cheaper than a Hash Join: the hash table itself is trivial to build, but probing it means scanning all fifty million rows, whereas the Nested Loop does five index lookups. PostgreSQL's cost model handles this case correctly most of the time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Both inputs already sorted.&lt;/strong&gt; If both sides come out of index scans that produce rows in join-key order, Merge Join is usually cheaper because it skips the hash build. The planner usually figures this out on its own when it sees the access paths.&lt;/p&gt;
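
&lt;p&gt;For the build-side case, a session-scoped &lt;code&gt;work_mem&lt;/code&gt; change can be tried before touching cluster configuration. The sketch below is illustrative only: the &lt;code&gt;64MB&lt;/code&gt; value is an assumption, and the query is a reconstruction of the pending-orders count discussed above, not the exact statement behind the plans.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;BEGIN;
-- SET LOCAL only lasts until the end of this transaction.
SET LOCAL work_mem = '64MB';  -- assumed value; size it to the build side

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM sim_bp_orders o
JOIN sim_bp_order_items oi ON oi.order_id = o.order_id
WHERE o.status = 'pending';
-- In the output, check that the Hash node reports "Batches: 1".
ROLLBACK;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Wrapping the experiment in a transaction keeps the setting from leaking into other statements on the same connection.&lt;/p&gt;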

&lt;h2&gt;
  
  
  Merge Join — both sides sorted
&lt;/h2&gt;

&lt;p&gt;Merge Join walks two pre-sorted inputs in lockstep, pairing rows with matching keys in a single pass. It's optimal when both inputs are already sorted on the join key, typically because both are served from index scans on the join column, or because the query itself requires an &lt;code&gt;ORDER BY&lt;/code&gt; that aligns with the join key.&lt;/p&gt;

&lt;p&gt;The planner picks Merge Join less often than you might expect, because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If one side is small and the other has an index, Nested Loop is usually cheaper per row.&lt;/li&gt;
&lt;li&gt;If neither side is sorted and both are large, Hash Join wins — sorting both sides just to merge them is rarely cost-effective.&lt;/li&gt;
&lt;li&gt;Merge Join's sweet spot is two large pre-sorted streams, which is often a signal that a materialised view or a pre-joined table would be cheaper still.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A canonical Merge Join shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;oi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;sim_bp_order_items&lt;/span&gt; &lt;span class="n"&gt;oi&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;oi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If both tables have indexes on &lt;code&gt;order_id&lt;/code&gt; (they do — the primary key on orders and &lt;code&gt;idx_sim_bp_order_items_order_id&lt;/code&gt;) and the ORDER BY forces ordered output, the planner may produce something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Merge Join
  Merge Cond: (o.order_id = oi.order_id)
  -&amp;gt;  Index Scan using sim_bp_orders_pkey on sim_bp_orders o
  -&amp;gt;  Index Scan using idx_sim_bp_order_items_order_id on sim_bp_order_items oi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Single pass through both indexes, no hash build, no explicit sort. When the prerequisites are met (both sides produced in join-key order), Merge Join is often the cheapest option by a wide margin.&lt;/p&gt;

&lt;p&gt;In practice you'll see Merge Join most often on joins with explicit ordering, or in the middle of larger plans where the planner noticed that an upstream node was already producing sorted output.&lt;/p&gt;
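
&lt;p&gt;To see what a Merge Join would cost on your own data, one option is to discourage the other strategies for a single transaction and compare the plans. This is a diagnostic sketch, not something to leave enabled: the &lt;code&gt;enable_*&lt;/code&gt; settings only penalise a strategy heavily, they do not strictly forbid it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;BEGIN;
-- Planner toggles for experimentation; SET LOCAL reverts at transaction end.
SET LOCAL enable_hashjoin = off;
SET LOCAL enable_nestloop = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT o.order_id, oi.quantity
FROM sim_bp_orders o
JOIN sim_bp_order_items oi ON oi.order_id = o.order_id
ORDER BY o.order_id;
ROLLBACK;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Run the same &lt;code&gt;EXPLAIN&lt;/code&gt; without the two &lt;code&gt;SET LOCAL&lt;/code&gt; lines to compare against the plan the planner picks by default.&lt;/p&gt;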

&lt;h2&gt;
  
  
  How the planner chooses
&lt;/h2&gt;

&lt;p&gt;PostgreSQL's planner is cost-based. For each join, it enumerates the plausible strategies (Nested Loop, Hash Join, Merge Join, and each direction for each — which side is inner, which is outer) and picks the lowest-cost option. The cost model incorporates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Estimated row counts from both sides (crucially — if these are wrong, everything downstream is wrong).&lt;/li&gt;
&lt;li&gt;Whether each side has a useful index on the join column.&lt;/li&gt;
&lt;li&gt;Current &lt;code&gt;work_mem&lt;/code&gt; — the planner knows whether a hash table will fit or whether it'll have to plan a spill.&lt;/li&gt;
&lt;li&gt;Whether inputs are already sorted (from index scans or prior sort nodes).&lt;/li&gt;
&lt;li&gt;The cost parameters: &lt;code&gt;random_page_cost&lt;/code&gt;, &lt;code&gt;seq_page_cost&lt;/code&gt;, &lt;code&gt;cpu_tuple_cost&lt;/code&gt;, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The single biggest cause of wrong-strategy joins is &lt;strong&gt;bad row estimates&lt;/strong&gt;. If the planner thinks a side will produce 15 rows and it actually produces 150,000, it might pick a Nested Loop (optimal for 15) when a Hash Join (optimal for 150,000) would be 100× faster. The MyDBA analyzer rule &lt;code&gt;row_estimate_inaccurate&lt;/code&gt; fires when the actual-to-estimated ratio exceeds 10× in either direction, and the fix is almost always &lt;code&gt;ANALYZE&lt;/code&gt; on the affected table, or extended statistics if the bad estimate comes from a correlation the planner doesn't know about.&lt;/p&gt;

&lt;p&gt;The second biggest cause is &lt;strong&gt;missing cross-column statistics on correlated predicates&lt;/strong&gt;. The planner assumes predicates are independent. If &lt;code&gt;WHERE tenant_id = 7 AND region = 'eu'&lt;/code&gt; actually matches far more rows than &lt;code&gt;P(tenant_id=7) × P(region='eu')&lt;/code&gt; predicts (for example, because tenant 7 lives entirely in &lt;code&gt;eu&lt;/code&gt;), the planner will underestimate and pick the wrong join strategy. Extended statistics (&lt;code&gt;CREATE STATISTICS ... ON tenant_id, region FROM ...&lt;/code&gt;) are the specific fix.&lt;/p&gt;
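
&lt;p&gt;A minimal sketch of that fix follows. The table name &lt;code&gt;tenant_events&lt;/code&gt; and the statistics name are hypothetical, chosen only for illustration; the &lt;code&gt;mcv&lt;/code&gt; kind assumes PostgreSQL 12 or later.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical table: tenant_events(tenant_id, region, ...)
CREATE STATISTICS stats_tenant_events_tenant_region (dependencies, mcv)
    ON tenant_id, region
    FROM tenant_events;

-- Extended statistics are only populated when the table is analyzed.
ANALYZE tenant_events;

-- Compare the estimated row count with the actual one, before and after.
EXPLAIN ANALYZE
SELECT *
FROM tenant_events
WHERE tenant_id = 7 AND region = 'eu';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;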

&lt;h2&gt;
  
  
  Join order: how PostgreSQL decides what to join first
&lt;/h2&gt;

&lt;p&gt;In a three-way join &lt;code&gt;A ⨝ B ⨝ C&lt;/code&gt;, there are several possible orders: &lt;code&gt;(A ⨝ B) ⨝ C&lt;/code&gt;, &lt;code&gt;A ⨝ (B ⨝ C)&lt;/code&gt;, and if the join conditions allow it, &lt;code&gt;(A ⨝ C) ⨝ B&lt;/code&gt;. For a fourth table you get a lot more permutations. PostgreSQL's planner searches through them.&lt;/p&gt;

&lt;p&gt;The heuristic is: &lt;strong&gt;do the most selective joins first&lt;/strong&gt;, so the intermediate result is as small as possible. A join that filters &lt;code&gt;rows_A × rows_B&lt;/code&gt; down to 100 rows should happen before a join that would blow the intermediate to millions.&lt;/p&gt;

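&lt;p&gt;For queries with fewer than 12 FROM items, PostgreSQL uses dynamic programming to search join orders, exhaustively within the limits set by &lt;code&gt;join_collapse_limit&lt;/code&gt; and &lt;code&gt;from_collapse_limit&lt;/code&gt; (both default to 8). At 12 or more (the default &lt;code&gt;geqo_threshold&lt;/code&gt;), the planner switches to the Genetic Query Optimizer (GEQO), which uses heuristic search and sometimes produces non-optimal plans on complex joins. If you have a very wide query (12+ tables, complex conditions), tune &lt;code&gt;geqo_threshold&lt;/code&gt;, &lt;code&gt;join_collapse_limit&lt;/code&gt;, and &lt;code&gt;from_collapse_limit&lt;/code&gt;, or consider splitting the problem with CTEs marked &lt;code&gt;AS MATERIALIZED&lt;/code&gt; (since PostgreSQL 12, a plain CTE can be inlined and then no longer acts as a planning boundary).&lt;/p&gt;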

&lt;p&gt;A few practical levers when the planner picks a wrong join order:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Add or fix indexes.&lt;/strong&gt; A missing index on a join column often drives the planner to avoid that join until later, resulting in large intermediates. Indexing fixes it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ANALYZE&lt;/code&gt; recently.&lt;/strong&gt; Stale row counts → bad estimates → bad orders. Autovacuum handles this for active tables, but statistics are often out of date right after a bulk load.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extended statistics.&lt;/strong&gt; For correlated columns within one table, &lt;code&gt;CREATE STATISTICS&lt;/code&gt; on the correlated columns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rewriting to constrain the planner.&lt;/strong&gt; &lt;code&gt;STRAIGHT_JOIN&lt;/code&gt; doesn't exist in PostgreSQL, but you can force the order by using explicit &lt;code&gt;JOIN&lt;/code&gt; syntax and setting &lt;code&gt;join_collapse_limit = 1&lt;/code&gt;. Use sparingly — the cost model is usually right.&lt;/li&gt;
&lt;/ul&gt;
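
&lt;p&gt;The last lever can be sketched as follows. The query is an illustrative three-way join over the article's tables, not one of the measured queries. With &lt;code&gt;join_collapse_limit = 1&lt;/code&gt; the planner joins explicit &lt;code&gt;JOIN&lt;/code&gt; items in the order written, but it still chooses the join strategy for each step.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;BEGIN;
-- Scope the override to this transaction only.
SET LOCAL join_collapse_limit = 1;

-- Joined as written: users to orders first, then order items.
EXPLAIN
SELECT u.user_id, count(*)
FROM sim_bp_users u
JOIN sim_bp_orders o       ON o.user_id = u.user_id
JOIN sim_bp_order_items oi ON oi.order_id = o.order_id
WHERE o.status = 'pending'
GROUP BY u.user_id;
ROLLBACK;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;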

&lt;h2&gt;
  
  
  When join strategy doesn't matter — and what does
&lt;/h2&gt;

&lt;p&gt;Sometimes the join strategy is correct and the query is still slow. The real costs are upstream:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A slow sub-query or CTE feeding the join.&lt;/strong&gt; The join isn't the problem; its input is. Diagnose by looking at the actual timing of each side.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An expensive filter that prevents index use.&lt;/strong&gt; If one side of the join is doing a sequential scan because of a non-sargable WHERE clause, the join strategy can't save you. See &lt;a href="https://mydba.dev/blog/postgres-where-clause-optimization" rel="noopener noreferrer"&gt;WHERE Clause Optimisation&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Over-wide projections.&lt;/strong&gt; &lt;code&gt;SELECT *&lt;/code&gt; on a 400-column table passed through a join is expensive in row width; projecting only the columns you need tightens the whole pipeline.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When reading a multi-way join plan, resist the urge to focus on the outermost join. Instead, scan the leaves of the plan tree for the biggest &lt;code&gt;actual rows × loops&lt;/code&gt; node — that's where the time is actually going.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Outer size&lt;/th&gt;
&lt;th&gt;Inner size&lt;/th&gt;
&lt;th&gt;Inner indexed?&lt;/th&gt;
&lt;th&gt;Inputs sorted?&lt;/th&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Small (≤1K)&lt;/td&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Nested Loop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Large&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Nested Loop or Hash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Large&lt;/td&gt;
&lt;td&gt;Large&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Both&lt;/td&gt;
&lt;td&gt;Merge Join&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Large&lt;/td&gt;
&lt;td&gt;Large&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Hash Join (may spill)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Large&lt;/td&gt;
&lt;td&gt;Large (build side &amp;gt; work_mem)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Hash Join with spill: raise work_mem or add an index&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Plan shapes that should always prompt investigation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nested Loop with outer rows &amp;gt; 1,000 and no Memoize cache → fires &lt;code&gt;nested_loop_large&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Hash or Hash Join with &lt;code&gt;Batches &amp;gt; 1&lt;/code&gt; → fires &lt;code&gt;hash_batches_spill&lt;/code&gt;; either raise &lt;code&gt;work_mem&lt;/code&gt; or add an index so the planner can pick a different strategy.&lt;/li&gt;
&lt;li&gt;Any join where &lt;code&gt;row_estimate_inaccurate&lt;/code&gt; fires on either side — fix statistics first, then re-examine the join.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Next steps
&lt;/h2&gt;

&lt;p&gt;Joins are the category most affected by the quality of your WHERE clauses. The next article in the series covers &lt;a href="https://mydba.dev/blog/postgres-where-clause-optimization" rel="noopener noreferrer"&gt;WHERE Clause Optimisation&lt;/a&gt; — sargability, composite-index column ordering, and the operators that silently disable indexes. If your joins look right but the inputs to them are slow, that's almost always where the fix lives.&lt;/p&gt;

&lt;p&gt;For the subquery/CTE patterns that sometimes appear in place of explicit joins (&lt;code&gt;EXISTS&lt;/code&gt;, correlated subqueries, LATERAL), see &lt;a href="https://mydba.dev/blog/postgres-subquery-cte-optimization" rel="noopener noreferrer"&gt;Subquery &amp;amp; CTE Optimisation&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;#postgres #performance #database #sql&lt;/p&gt;

&lt;p&gt;Originally published at &lt;a href="https://mydba.dev/blog/postgres-join-optimization" rel="noopener noreferrer"&gt;mydba.dev/blog/postgres-join-optimization&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>database</category>
      <category>performance</category>
      <category>postgres</category>
      <category>sql</category>
    </item>
    <item>
      <title>PostgreSQL Index Usage and Optimization</title>
      <dc:creator>Philip McClarence</dc:creator>
      <pubDate>Mon, 27 Apr 2026 14:00:03 +0000</pubDate>
      <link>https://dev.to/philip_mcclarence_2ef9475/postgresql-index-usage-and-optimization-4jgf</link>
      <guid>https://dev.to/philip_mcclarence_2ef9475/postgresql-index-usage-and-optimization-4jgf</guid>
      <description>&lt;h1&gt;
  
  
  PostgreSQL Index Usage and Optimization
&lt;/h1&gt;

&lt;p&gt;Indexing is the single biggest lever in SQL performance, and it is also the category where most of the bad advice lives. "Add an index" solves a narrow class of problems. "Add the right index, in the right shape, for the right query, and drop the ones you don't need" is the actual job — and it's more design work than most teams expect.&lt;/p&gt;

&lt;p&gt;This is article 2 in a series on PostgreSQL query analysis. The pillar is &lt;a href="https://mydba.dev/blog/postgres-query-analysis-complete-guide" rel="noopener noreferrer"&gt;The Complete Guide to PostgreSQL SQL Query Analysis &amp;amp; Optimization&lt;/a&gt;; article 1 covers &lt;a href="https://mydba.dev/blog/postgres-explain-analyze-reading" rel="noopener noreferrer"&gt;reading EXPLAIN output&lt;/a&gt;. The running dataset is 500k-row &lt;code&gt;sim_bp_orders&lt;/code&gt; / 200k-row &lt;code&gt;sim_bp_users&lt;/code&gt; / 50k-row &lt;code&gt;sim_bp_products&lt;/code&gt; on Neon Postgres 17.8; every EXPLAIN block is from a real run.&lt;/p&gt;

&lt;p&gt;We'll cover: when the planner actually uses an index, the four design choices that matter most (column selection, partial, covering, expression), the less-common index types and when they beat btrees, how to find unused indexes, and four cases where &lt;em&gt;not&lt;/em&gt; adding an index is the correct call.&lt;/p&gt;

&lt;h2&gt;
  
  
  When the planner picks an index
&lt;/h2&gt;

&lt;p&gt;An index is a data structure; "using an index" is a planner decision. PostgreSQL estimates the cost of each candidate plan — sequential scan, index scan, index-only scan, bitmap scan — and picks the cheapest. Three things drive that choice:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Selectivity.&lt;/strong&gt; The estimated fraction of rows the query will return. If the filter returns 0.1% of rows, an index scan is almost always cheaper. If the filter returns 30%, it depends on the rest of the query shape. If the filter returns 70%, the planner will almost always choose a sequential scan because visiting most of the heap sequentially costs less than reading index pages plus random heap I/O.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Correlation.&lt;/strong&gt; If the rows matching the filter are physically clustered on disk, the planner's random-access penalty shrinks and an index scan becomes more attractive. If they're scattered, random I/O dominates and seq scan wins. The &lt;code&gt;pg_stats.correlation&lt;/code&gt; column (range -1 to 1) tells you how clustered each column's values are. Time-series tables (&lt;code&gt;created_at&lt;/code&gt;) often have near-1 correlation because they're append-mostly; status columns usually hover near 0.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost parameters.&lt;/strong&gt; &lt;code&gt;random_page_cost&lt;/code&gt; (default 4.0) vs &lt;code&gt;seq_page_cost&lt;/code&gt; (default 1.0). On SSD-backed storage those defaults are too conservative; lowering &lt;code&gt;random_page_cost&lt;/code&gt; to 1.5 or 2.0 makes the planner reach for indexes more readily. Setting it &lt;em&gt;below&lt;/em&gt; &lt;code&gt;seq_page_cost&lt;/code&gt; is almost always wrong — it implies random I/O is faster than sequential, which isn't true on any real storage. If you're tempted to go there, you probably want to raise &lt;code&gt;effective_cache_size&lt;/code&gt; instead.&lt;/p&gt;

&lt;p&gt;If a plan has a &lt;code&gt;Seq Scan&lt;/code&gt;, no index-type nodes, and more than two nodes total, you probably have a missing or ignored index. It's a signal, not a verdict — some queries genuinely don't want an index — but it's worth checking.&lt;/p&gt;
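
&lt;p&gt;The correlation and cost inputs described above can be inspected directly. The following is a sketch: the &lt;code&gt;public&lt;/code&gt; schema and the &lt;code&gt;1.5&lt;/code&gt; value are assumptions, and &lt;code&gt;pg_stats&lt;/code&gt; only has rows for tables that have been analyzed.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- How clustered is each column on disk? Values near 1 or -1 favour index scans.
SELECT attname, n_distinct, correlation
FROM pg_stats
WHERE schemaname = 'public'          -- assumed schema
  AND tablename  = 'sim_bp_orders'
  AND attname IN ('created_at', 'status');

-- Try a lower random_page_cost for one transaction and see whether the plan changes.
BEGIN;
SET LOCAL random_page_cost = 1.5;    -- assumed value for SSD-backed storage
EXPLAIN
SELECT *
FROM sim_bp_orders
WHERE created_at &amp;gt; now() - interval '1 day';
ROLLBACK;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;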

&lt;h2&gt;
  
  
  The boring case — primary key lookup
&lt;/h2&gt;

&lt;p&gt;The cheapest index in any database is the primary-key btree:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_users&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;12345&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Index Scan using sim_bp_users_pkey on sim_bp_users
  (cost=0.42..8.44 rows=1 width=51) (actual time=8.683..8.686 rows=1 loops=1)
  Index Cond: (sim_bp_users.user_id = 12345)
  Buffers: shared read=4
 Execution Time: 9.700 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four page reads for a 200,000-row table (&lt;code&gt;shared read=4&lt;/code&gt; means none of the pages were already in &lt;code&gt;shared_buffers&lt;/code&gt;). The 9.7 ms execution time is dominated by cold-cache reads against Neon's networked storage; on a warm-cache benchmark this drops to sub-millisecond. This is the shape every OLTP single-row lookup should have.&lt;/p&gt;

&lt;h2&gt;
  
  
  The four design choices that matter
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Column selection — matching the query shape
&lt;/h3&gt;

&lt;p&gt;A composite index on &lt;code&gt;(user_id, created_at)&lt;/code&gt; helps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;WHERE user_id = ?&lt;/code&gt; (uses the leading column alone).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;WHERE user_id = ? AND created_at &amp;gt; ?&lt;/code&gt; (uses both).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;WHERE user_id = ? ORDER BY created_at DESC LIMIT n&lt;/code&gt; (uses leading equality + sorted trailing column).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It does &lt;strong&gt;not&lt;/strong&gt; help &lt;code&gt;WHERE created_at &amp;gt; ?&lt;/code&gt; in isolation. This is the leftmost-prefix rule: a btree composite index can efficiently answer queries that use a contiguous prefix of its columns, starting with the leading one. PostgreSQL 17, used here, has no btree skip scan; PostgreSQL 18 adds one, but it only pays off when the leading column has very few distinct values.&lt;/p&gt;

&lt;p&gt;Rule of thumb: leading columns should be equality predicates, trailing columns range predicates or sort keys. &lt;code&gt;(tenant_id, created_at)&lt;/code&gt;, not &lt;code&gt;(created_at, tenant_id)&lt;/code&gt;.&lt;/p&gt;
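
&lt;p&gt;As a concrete sketch of that rule on the orders table (the index name is hypothetical), equality on the leading column plus a sort on the trailing column lets a "latest N for one user" query read just the first few index entries:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical index name; equality column first, sort column second.
CREATE INDEX idx_bp_orders_user_created
    ON sim_bp_orders (user_id, created_at DESC);

-- Served by the index in order: no separate Sort node should be needed.
SELECT order_id, created_at
FROM sim_bp_orders
WHERE user_id = 12345
ORDER BY created_at DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;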

&lt;h3&gt;
  
  
  2. Partial indexes — when 80% of the table is irrelevant
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_bp_orders_pending_recent&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pending'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The index only contains rows where &lt;code&gt;status = 'pending'&lt;/code&gt;, so it's roughly one-fifth the size of a full index on &lt;code&gt;created_at&lt;/code&gt;. The planner will use it for any query whose &lt;code&gt;WHERE&lt;/code&gt; clause &lt;em&gt;implies&lt;/em&gt; &lt;code&gt;status = 'pending'&lt;/code&gt;, which it checks with a simple implication prover over the predicates. So &lt;code&gt;WHERE status = 'pending' AND created_at &amp;gt; now() - interval '1 day'&lt;/code&gt; works, but &lt;code&gt;WHERE status IN ('pending', 'shipped') AND ...&lt;/code&gt; doesn't (the &lt;code&gt;IN&lt;/code&gt; predicate doesn't imply the partial predicate).&lt;/p&gt;

&lt;p&gt;Two gotchas: they're fragile to query rewording (a function, a cast, a reworded predicate can break the implication proof), and they pay write cost whenever a row moves into or out of the partial predicate.&lt;/p&gt;
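
&lt;p&gt;A quick way to check whether a given query can use the partial index is to compare the two plans. This sketch assumes &lt;code&gt;idx_bp_orders_pending_recent&lt;/code&gt; exists as defined above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Implies status = 'pending': the partial index is a candidate.
EXPLAIN
SELECT order_id
FROM sim_bp_orders
WHERE status = 'pending'
  AND created_at &amp;gt; now() - interval '1 day';

-- Does not imply the index predicate: the partial index cannot be used.
EXPLAIN
SELECT order_id
FROM sim_bp_orders
WHERE status IN ('pending', 'shipped')
  AND created_at &amp;gt; now() - interval '1 day';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;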

&lt;h3&gt;
  
  
  3. Covering indexes — eliminating heap fetches
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;INCLUDE&lt;/code&gt; tucks non-key columns into the leaf pages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_bp_orders_pending_by_amount&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_amount_cents&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;INCLUDE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pending'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A query that &lt;code&gt;SELECT&lt;/code&gt;s any combination of &lt;code&gt;order_id, user_id, total_amount_cents, created_at&lt;/code&gt; from this index can be served entirely from index pages — provided the visibility map marks the relevant heap pages as all-visible. On a write-heavy table where autovacuum can't keep up, you'll see non-zero &lt;code&gt;Heap Fetches:&lt;/code&gt; in EXPLAIN, which defeats most of the benefit.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;INCLUDE&lt;/code&gt; columns cannot be used for index conditions. Rule: put columns used for filtering/joining/ordering in the key; put columns you're only retrieving in &lt;code&gt;INCLUDE&lt;/code&gt;.&lt;/p&gt;
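
&lt;p&gt;A query shaped to match &lt;code&gt;idx_bp_orders_pending_by_amount&lt;/code&gt; might look like the sketch below. The &lt;code&gt;LIMIT&lt;/code&gt; value is arbitrary; the line to watch in the output is &lt;code&gt;Heap Fetches:&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Every referenced column is in the index key, the INCLUDE list,
-- or the partial predicate, so an Index Only Scan is possible.
EXPLAIN (ANALYZE, BUFFERS)
SELECT order_id, user_id, total_amount_cents, created_at
FROM sim_bp_orders
WHERE status = 'pending'
ORDER BY total_amount_cents DESC
LIMIT 20;

-- If Heap Fetches is high, a manual VACUUM refreshes the visibility map.
VACUUM sim_bp_orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;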

&lt;h3&gt;
  
  
  4. Expression indexes — indexing computed values
&lt;/h3&gt;

&lt;p&gt;This is where most "why isn't my index being used?" problems live. A btree on &lt;code&gt;email&lt;/code&gt; can't serve &lt;code&gt;WHERE lower(email) = ?&lt;/code&gt; or &lt;code&gt;WHERE lower(email) LIKE 'prefix%'&lt;/code&gt;. Case-insensitive prefix search on a 200k-row table without an expression index:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Gather  (cost=1000.00..5841.09 rows=1000 width=25) (actual time=0.553..122.758 rows=1 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  -&amp;gt;  Parallel Seq Scan on sim_bp_users
        Filter: (lower((email)::text) ~~ 'user12%'::text)
        Rows Removed by Filter: 94444
 Execution Time: 122.833 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Parallel seq scan, 94k rows filtered per worker, 122 ms. The fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_bp_users_email_lower&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;sim_bp_users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;text_pattern_ops&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For equality on lowercased email, a plain &lt;code&gt;CREATE INDEX ... (lower(email))&lt;/code&gt; is enough. For prefix &lt;code&gt;LIKE&lt;/code&gt;, &lt;code&gt;text_pattern_ops&lt;/code&gt; is needed because PostgreSQL can only rewrite &lt;code&gt;LIKE 'prefix%'&lt;/code&gt; into an index range scan when the index orders text by byte value rather than by locale collation.&lt;/p&gt;
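
&lt;p&gt;For the equality-only case, a minimal sketch looks like this. The index name and the email literal are made up for illustration. The expression in the query has to match the indexed expression for the planner to consider the index.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical index name; default operator class is enough for equality.
CREATE INDEX idx_bp_users_email_lower_eq
    ON sim_bp_users (lower(email));

-- lower(email) on the left matches the indexed expression.
SELECT user_id
FROM sim_bp_users
WHERE lower(email) = lower('User12345@example.com');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;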

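&lt;p&gt;The same mechanism is visible with the existing &lt;code&gt;idx_sim_bp_users_email_pattern&lt;/code&gt; index on &lt;code&gt;email text_pattern_ops&lt;/code&gt;, which covers the raw &lt;code&gt;email&lt;/code&gt; column, so this query filters on &lt;code&gt;email&lt;/code&gt; rather than &lt;code&gt;lower(email)&lt;/code&gt;:&lt;br&gt;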
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Index Only Scan using idx_sim_bp_users_email_pattern on sim_bp_users
  (cost=0.42..29.87 rows=20 width=8) (actual time=0.057..24.729 rows=20 loops=1)
  Index Cond: ((email ~&amp;gt;=~ 'user12'::text) AND (email ~&amp;lt;~ 'user13'::text))
  Filter: ((email)::text ~~ 'user12%'::text)
  Heap Fetches: 0
 Execution Time: 24.757 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Index Cond uses &lt;code&gt;~&amp;gt;=~&lt;/code&gt; and &lt;code&gt;~&amp;lt;~&lt;/code&gt; — real PostgreSQL operators from &lt;code&gt;text_pattern_ops&lt;/code&gt; that do byte-order comparisons. 24.7 ms vs 122.8 ms — five times faster, and the gap widens on larger tables.&lt;/p&gt;

&lt;h2&gt;
  
  
  Index types beyond btree
&lt;/h2&gt;

&lt;h3&gt;
  
  
  GIN — when equality becomes containment
&lt;/h3&gt;

&lt;p&gt;For values with internal structure (arrays, JSONB, full-text search vectors, trigrams):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_events_data_gin&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;gin&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_data&lt;/span&gt; &lt;span class="n"&gt;jsonb_path_ops&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Now this is sargable:&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;event_data&lt;/span&gt; &lt;span class="o"&gt;@&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'{"type": "purchase"}'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



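&lt;p&gt;&lt;code&gt;jsonb_path_ops&lt;/code&gt; supports only containment (&lt;code&gt;@&amp;gt;&lt;/code&gt;) and the jsonpath match operators (&lt;code&gt;@?&lt;/code&gt;, &lt;code&gt;@@&lt;/code&gt;), but produces a significantly smaller and faster index than the default &lt;code&gt;jsonb_ops&lt;/code&gt;. Use it unless you need the key-existence operators (&lt;code&gt;?&lt;/code&gt;, &lt;code&gt;?|&lt;/code&gt;, &lt;code&gt;?&amp;amp;&lt;/code&gt;).&lt;/p&gt;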

&lt;p&gt;GIN with &lt;code&gt;pg_trgm&lt;/code&gt; turns substring &lt;code&gt;LIKE&lt;/code&gt; queries (&lt;code&gt;LIKE '%needle%'&lt;/code&gt;) into index-backed scans.&lt;/p&gt;
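
&lt;p&gt;A sketch of the trigram setup on the users table follows. The index name is hypothetical, and creating the extension requires sufficient privileges on the database. Trigram indexes help most when the search pattern contains at least three consecutive characters.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Hypothetical index name; gin_trgm_ops indexes the trigrams of each value.
CREATE INDEX idx_bp_users_email_trgm
    ON sim_bp_users USING gin (email gin_trgm_ops);

-- Substring match that can now use a Bitmap Index Scan instead of a Seq Scan.
SELECT user_id, email
FROM sim_bp_users
WHERE email LIKE '%needle%';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;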

&lt;h3&gt;
  
  
  BRIN — when the data is physically ordered
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;idx_bp_orders_created_at_brin&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;brin&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For our 500,000-row orders table, a BRIN index is ~24 kB; a btree on the same column is ~5 MB. BRIN loses effectiveness immediately if the data isn't correlated — on a shuffled table, the min/max of every page range overlaps the whole value domain and the planner can't skip anything. BRIN is effectively useless on uncorrelated columns and brilliant on time-series data.&lt;/p&gt;
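
&lt;p&gt;Before relying on a BRIN index it is worth confirming both halves of that claim on your own table. The sketch below assumes the index above has been created and the table has been analyzed; the seven-day window is arbitrary.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- On-disk size of the BRIN index.
SELECT pg_size_pretty(pg_relation_size('idx_bp_orders_created_at_brin')) AS brin_size;

-- Physical ordering of the column: BRIN wants this close to 1 or -1.
SELECT correlation
FROM pg_stats
WHERE tablename = 'sim_bp_orders'
  AND attname  = 'created_at';

-- A range query; BRIN shows up as a Bitmap Index Scan feeding a Bitmap Heap Scan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM sim_bp_orders
WHERE created_at &amp;gt;= now() - interval '7 days';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A large &lt;code&gt;Rows Removed by Index Recheck&lt;/code&gt; count in the output suggests the page ranges overlap too much for BRIN to skip effectively.&lt;/p&gt;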

&lt;h3&gt;
  
  
  GiST / SP-GiST / hash
&lt;/h3&gt;

&lt;p&gt;Geometric types, ranges, and fuzzy matching use GiST or SP-GiST. Hash indexes only support equality and are usually beaten by btrees even for point lookups — use them only when you've measured a specific case where they win.&lt;/p&gt;

&lt;h2&gt;
  
  
  When NOT to add an index
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Write-heavy, read-light tables.&lt;/strong&gt; Every index is write cost.&lt;/li&gt;
&lt;li&gt;
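&lt;strong&gt;Low selectivity.&lt;/strong&gt; A btree on a boolean &lt;code&gt;is_active&lt;/code&gt; where 90% of rows are active will not be used to find the active rows, because a sequential scan is cheaper. If you only ever look up the rare value, a partial index on those rows is better.&lt;/li&gt;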
&lt;li&gt;
&lt;strong&gt;Queries that need most of the table.&lt;/strong&gt; Reports over large windows are best served by parallel seq scan.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redundant indexes.&lt;/strong&gt; &lt;code&gt;(a, b, c)&lt;/code&gt; can serve every lookup that &lt;code&gt;(a, b)&lt;/code&gt; or &lt;code&gt;(a)&lt;/code&gt; can. Drop the prefixes and keep the longest, unless a shorter index enforces a unique constraint.&lt;/li&gt;
&lt;/ol&gt;
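
&lt;p&gt;For the low-selectivity case, a partial index might look like the sketch below. It assumes &lt;code&gt;sim_bp_users&lt;/code&gt; has a boolean &lt;code&gt;is_active&lt;/code&gt; column, which is a hypothetical column used only for illustration, as is the index name:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Index only the rare rows (inactive users), not the 90% majority.
CREATE INDEX idx_sim_bp_users_inactive
    ON sim_bp_users (user_id)
    WHERE NOT is_active;

-- The query repeats the index predicate so the planner can match it.
SELECT user_id
FROM sim_bp_users
WHERE NOT is_active;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The index stays small because the common value is never stored in it, and writes to active rows don't have to maintain it.&lt;/p&gt;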

&lt;h2&gt;
  
  
  Finding unused indexes
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;indexrelname&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;index_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pg_size_pretty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;indexrelid&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;idx_scan&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_indexes&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schemaname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'public'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;idx_scan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_constraint&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
      &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conindid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;indexrelid&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;contype&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'p'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'u'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'x'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;indexrelid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Real result from the running database:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;index_name&lt;/th&gt;
&lt;th&gt;size&lt;/th&gt;
&lt;th&gt;idx_scan&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;idx_sim_bp_users_username_pattern&lt;/td&gt;
&lt;td&gt;6184 kB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;idx_sim_bp_users_email_pattern&lt;/td&gt;
&lt;td&gt;7960 kB&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;One 6 MB index with zero scans is a straightforward drop. The &lt;code&gt;NOT EXISTS&lt;/code&gt; clause skips PK/unique/exclusion constraint indexes — those enforce integrity and are used internally even if no user query hits them.&lt;/p&gt;

&lt;p&gt;Two caveats: &lt;code&gt;pg_stat_reset()&lt;/code&gt; zeros the counter (check the stats timestamp before acting), and a replica's stats only count scans on that replica (don't drop an index from the primary based on replica stats alone).&lt;/p&gt;
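
&lt;p&gt;To check the first caveat, you can look at when the statistics for the current database were last reset. A small sketch using the standard &lt;code&gt;pg_stat_database&lt;/code&gt; view:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- If stats_reset is recent, idx_scan = 0 says very little:
-- the index may simply not have been needed yet since the reset.
SELECT datname, stats_reset
FROM pg_stat_database
WHERE datname = current_database();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;How long a window is long enough depends on the workload; an index used only by a monthly report needs at least a month of stats before a zero means anything.&lt;/p&gt;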

&lt;h2&gt;
  
  
  Adding the right index — a complete example
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_amount_cents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pending'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_amount_cents&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without a supporting index, this runs as a 51 ms sequential scan over 500k rows feeding a top-N heapsort. Three plausible index candidates:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;(status)&lt;/code&gt;&lt;/strong&gt; — cheapest, most general, but the planner still needs a sort step.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;(status, total_amount_cents DESC)&lt;/code&gt;&lt;/strong&gt; — solves filter and sort. The sort is free because the index is already ordered on the trailing column within each status group.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;(total_amount_cents DESC) WHERE status = 'pending'&lt;/code&gt;&lt;/strong&gt; — only pending rows indexed. Smaller, faster to maintain, but only helps pending queries.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Option 3 plus &lt;code&gt;INCLUDE (order_id, user_id, created_at)&lt;/code&gt; gives an Index Only Scan and is the right call for this specific query. If the dashboard later adds &lt;code&gt;status IN ('pending', 'processing')&lt;/code&gt;, you'd want option 2 instead. Design indexes for the query you have, and re-read the plans every six months.&lt;/p&gt;
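
&lt;p&gt;Written out, option 3 with the covering columns could look like this. The index name is an assumption; the columns and predicate come from the query above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Partial, covering index: ordered by amount, restricted to pending rows,
-- carrying the remaining SELECT-list columns as non-key payload.
CREATE INDEX idx_sim_bp_orders_pending_amount
    ON sim_bp_orders (total_amount_cents DESC)
    INCLUDE (order_id, user_id, created_at)
    WHERE status = 'pending';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Whether the scan really stays index-only also depends on the visibility map being up to date, so confirm with &lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/code&gt; after a vacuum.&lt;/p&gt;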





&lt;p&gt;Originally published at &lt;a href="https://mydba.dev/blog/postgres-index-usage-optimization" rel="noopener noreferrer"&gt;mydba.dev/blog/postgres-index-usage-optimization&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>database</category>
      <category>performance</category>
      <category>postgres</category>
      <category>sql</category>
    </item>
    <item>
      <title>Reading PostgreSQL EXPLAIN and EXPLAIN ANALYZE Output</title>
      <dc:creator>Philip McClarence</dc:creator>
      <pubDate>Fri, 24 Apr 2026 14:00:07 +0000</pubDate>
      <link>https://dev.to/philip_mcclarence_2ef9475/reading-postgresql-explain-and-explain-analyze-output-3o74</link>
      <guid>https://dev.to/philip_mcclarence_2ef9475/reading-postgresql-explain-and-explain-analyze-output-3o74</guid>
      <description>&lt;p&gt;Every PostgreSQL performance conversation eventually lands on a question that sounds trivial: &lt;em&gt;what does this EXPLAIN mean?&lt;/em&gt; The output is almost readable. There are node names in English, numbers that look familiar, and enough structure that you can guess at the intent. But if you're guessing, you're going to miss the signal that actually matters — and the difference between a plan that returns in 0.3 ms and one that returns in 400 ms is often one line of EXPLAIN output that looks like boilerplate.&lt;/p&gt;

&lt;p&gt;This article is a systematic walk through how to read an EXPLAIN plan on PostgreSQL 17, using real output captured from a live database. By the end you should be able to look at a plan, identify what each node is doing and why, spot the three places where things usually go wrong, and articulate in one sentence why the query is slow — or whether it's actually fine and something else is wrong.&lt;/p&gt;

&lt;p&gt;This is part of the &lt;a href="https://mydba.dev/blog/postgres-query-analysis-complete-guide" rel="noopener noreferrer"&gt;Complete Guide to PostgreSQL SQL Query Analysis &amp;amp; Optimization&lt;/a&gt; series.&lt;/p&gt;

&lt;h2&gt;
  
  
  EXPLAIN vs EXPLAIN ANALYZE vs EXPLAIN (ANALYZE, BUFFERS)
&lt;/h2&gt;

&lt;p&gt;The three variants you'll use in practice:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;EXPLAIN&lt;/code&gt;&lt;/strong&gt; — asks the planner what it &lt;em&gt;would&lt;/em&gt; do, without running the query. Fast (milliseconds), safe for expensive queries, but every number is an estimate. Useful for "how expensive does the planner think this is?" and "did my new index change the plan shape?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt;&lt;/strong&gt; — actually runs the query and reports what happened. You get both the planner's estimates and the real measured results, side by side. Use this in development and staging; use it on production only after thinking about the cost. &lt;strong&gt;Three warnings:&lt;/strong&gt; (1) &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; on an &lt;code&gt;INSERT&lt;/code&gt;/&lt;code&gt;UPDATE&lt;/code&gt;/&lt;code&gt;DELETE&lt;/code&gt; will execute the DML — wrap in a &lt;code&gt;BEGIN; ... ROLLBACK;&lt;/code&gt; if you don't want the side effects. (2) The query runs end-to-end, so a slow query is slow again, and any locks it takes are held for real. (3) &lt;code&gt;ANALYZE&lt;/code&gt; pulls rows into the buffer cache and may evict other working-set pages; running it on a busy production system can perturb the performance of the exact thing you're measuring. On hot-path queries, prefer capturing a representative plan via &lt;code&gt;auto_explain&lt;/code&gt; or an EXPLAIN visualiser in a monitoring tool rather than running &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; ad-hoc under load.&lt;/p&gt;
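
&lt;p&gt;The transaction wrapper from the first warning looks like this in practice. The &lt;code&gt;UPDATE&lt;/code&gt; itself is a made-up example, and the &lt;code&gt;order_id&lt;/code&gt; value and the &lt;code&gt;'cancelled'&lt;/code&gt; status are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;BEGIN;

-- The UPDATE really executes here, so row locks are taken for real...
EXPLAIN (ANALYZE, BUFFERS)
UPDATE sim_bp_orders
SET status = 'cancelled'
WHERE order_id = 12345;

-- ...and are released, with the change discarded, at ROLLBACK.
ROLLBACK;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keep the transaction short: other sessions touching the same rows wait until the &lt;code&gt;ROLLBACK&lt;/code&gt;.&lt;/p&gt;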

&lt;p&gt;&lt;strong&gt;&lt;code&gt;EXPLAIN (ANALYZE, BUFFERS, VERBOSE, SETTINGS)&lt;/code&gt;&lt;/strong&gt;: the version you should default to. &lt;code&gt;BUFFERS&lt;/code&gt; adds per-node cache-hit/read/dirtied counts. On PostgreSQL 17 and earlier it must be specified explicitly, because &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; on its own does &lt;em&gt;not&lt;/em&gt; include buffer statistics (PostgreSQL 18 turns it on automatically alongside &lt;code&gt;ANALYZE&lt;/code&gt;). &lt;code&gt;VERBOSE&lt;/code&gt; adds the output column list at each node (useful for spotting why indexes aren't being chosen). &lt;code&gt;SETTINGS&lt;/code&gt; reports any non-default planner knobs that might be influencing the plan.&lt;/p&gt;

&lt;p&gt;You can also ask for structured output with &lt;code&gt;FORMAT JSON&lt;/code&gt;, &lt;code&gt;FORMAT YAML&lt;/code&gt;, or &lt;code&gt;FORMAT XML&lt;/code&gt;. JSON preserves every field and is what you want for programmatic analysis; the text format is easier to read inline.&lt;/p&gt;
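
&lt;p&gt;Putting the options together, a full invocation with JSON output might look like this. The query is only a stand-in; substitute whatever statement you are investigating:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- All options go inside one parenthesised list; FORMAT is just another option.
EXPLAIN (ANALYZE, BUFFERS, VERBOSE, SETTINGS, FORMAT JSON)
SELECT order_id, user_id, total_amount_cents, created_at
FROM sim_bp_orders
WHERE status = 'pending'
ORDER BY total_amount_cents DESC
LIMIT 50;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Drop &lt;code&gt;FORMAT JSON&lt;/code&gt; to get the indented text form used in the rest of this article.&lt;/p&gt;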

&lt;h2&gt;
  
  
  The plan tree
&lt;/h2&gt;

&lt;p&gt;Every EXPLAIN output is a tree. The root is the outermost node, which is whatever produces the query's final rows; children feed their output up to their parent. PostgreSQL indents children under their parent with arrows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Parent Node
  -&amp;gt;  Child A
  -&amp;gt;  Child B
        -&amp;gt;  Grandchild
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The top-down narrative is: "to produce &lt;code&gt;Parent Node&lt;/code&gt;'s output, PostgreSQL runs &lt;code&gt;Child A&lt;/code&gt; and &lt;code&gt;Child B&lt;/code&gt;, feeding both into the parent. &lt;code&gt;Child B&lt;/code&gt; itself is produced by running &lt;code&gt;Grandchild&lt;/code&gt;." Rows flow bottom-up: each parent pulls rows from its children on demand, so the leaves are the first nodes to produce anything. The way to &lt;em&gt;read&lt;/em&gt; the plan is still top-down: start with "what is this query ultimately asking for?" and then follow the tree down to understand how PostgreSQL intends to answer.&lt;/p&gt;

&lt;p&gt;Here's a real example — "show twenty recent pending orders with the user's email." The plan is against a 500,000-row &lt;code&gt;sim_bp_orders&lt;/code&gt; table and 200,000-row &lt;code&gt;sim_bp_users&lt;/code&gt; table on PostgreSQL 17.8:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;Limit&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;37&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;075&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;277&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;Buffers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;211&lt;/span&gt;
  &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="n"&gt;Nested&lt;/span&gt; &lt;span class="n"&gt;Loop&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;77853&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;84&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100300&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;37&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;074&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;275&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;Inner&lt;/span&gt; &lt;span class="k"&gt;Unique&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;
        &lt;span class="n"&gt;Buffers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;211&lt;/span&gt;
        &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="k"&gt;Index&lt;/span&gt; &lt;span class="n"&gt;Scan&lt;/span&gt; &lt;span class="k"&gt;Backward&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;idx_sim_bp_orders_created_at&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
              &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;30949&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;29&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100300&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
              &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;012&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;151&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
              &lt;span class="n"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pending'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
              &lt;span class="k"&gt;Rows&lt;/span&gt; &lt;span class="n"&gt;Removed&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;106&lt;/span&gt;
              &lt;span class="n"&gt;Buffers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;131&lt;/span&gt;
        &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="n"&gt;Memoize&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;43&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;55&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;006&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;006&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
              &lt;span class="k"&gt;Cache&lt;/span&gt; &lt;span class="k"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
              &lt;span class="k"&gt;Cache&lt;/span&gt; &lt;span class="k"&gt;Mode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;logical&lt;/span&gt;
              &lt;span class="n"&gt;Hits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  &lt;span class="n"&gt;Misses&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;  &lt;span class="n"&gt;Evictions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  &lt;span class="n"&gt;Overflows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  &lt;span class="n"&gt;Memory&lt;/span&gt; &lt;span class="k"&gt;Usage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="n"&gt;kB&lt;/span&gt;
              &lt;span class="n"&gt;Buffers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;
              &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="k"&gt;Index&lt;/span&gt; &lt;span class="n"&gt;Scan&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;sim_bp_users_pkey&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;sim_bp_users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;54&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;003&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;003&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="k"&gt;Index&lt;/span&gt; &lt;span class="n"&gt;Cond&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="n"&gt;Buffers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;
 &lt;span class="n"&gt;Planning&lt;/span&gt; &lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;183&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
 &lt;span class="n"&gt;Execution&lt;/span&gt; &lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;309&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Read top-down. The root is &lt;code&gt;Limit&lt;/code&gt;, which caps the result at twenty rows. Below it is a &lt;code&gt;Nested Loop&lt;/code&gt; that joins two sources: an &lt;code&gt;Index Scan Backward&lt;/code&gt; over &lt;code&gt;sim_bp_orders&lt;/code&gt; and a &lt;code&gt;Memoize&lt;/code&gt; wrapping an &lt;code&gt;Index Scan&lt;/code&gt; on &lt;code&gt;sim_bp_users&lt;/code&gt;. The outer loop walks the orders index backwards (newest first) filtering for &lt;code&gt;status = 'pending'&lt;/code&gt;, and for each matching order, looks up the user via the primary-key index — but the &lt;code&gt;Memoize&lt;/code&gt; caches results by &lt;code&gt;user_id&lt;/code&gt; in case the same user appears multiple times (they don't in this particular run, so all 20 are cache misses).&lt;/p&gt;

&lt;p&gt;This is a very good plan. 0.309 ms, 211 shared-buffer hits, no reads from disk. The &lt;code&gt;LIMIT 20&lt;/code&gt; short-circuits the nested loop early: the orders scan reads only 126 rows (106 discarded by the filter, 20 kept) before it has twenty matches. The same query with a much larger &lt;code&gt;LIMIT&lt;/code&gt; would have very different numbers.&lt;/p&gt;
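
&lt;p&gt;The statement behind this plan isn't reproduced here. A query of roughly this shape would fit it; the join, filter, ordering and limit follow from the plan, while the exact select list is a guess:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Reconstructed for illustration only; the select list is assumed.
SELECT o.order_id, o.created_at, u.email
FROM sim_bp_orders o
JOIN sim_bp_users u ON u.user_id = o.user_id
WHERE o.status = 'pending'
ORDER BY o.created_at DESC   -- served by the backward index scan
LIMIT 20;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;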

&lt;p&gt;Now let's break down what each number means.&lt;/p&gt;

&lt;h2&gt;
  
  
  Per-node fields: cost, rows, width, time, loops
&lt;/h2&gt;

&lt;p&gt;On every node, PostgreSQL prints something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;Node&lt;/span&gt; &lt;span class="n"&gt;Name&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;R&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first parenthesis (&lt;code&gt;cost=... rows=... width=...&lt;/code&gt;) is the &lt;strong&gt;planner's estimate&lt;/strong&gt;. The second (&lt;code&gt;actual time=... rows=... loops=...&lt;/code&gt;) is &lt;strong&gt;what actually happened&lt;/strong&gt; when the query ran. &lt;code&gt;EXPLAIN&lt;/code&gt; without &lt;code&gt;ANALYZE&lt;/code&gt; only prints the first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;cost=startup..total&lt;/code&gt;.&lt;/strong&gt; Dimensionless units, scaled relative to &lt;code&gt;seq_page_cost&lt;/code&gt; (1.0 by default). The other cost GUCs (&lt;code&gt;random_page_cost&lt;/code&gt;, &lt;code&gt;cpu_tuple_cost&lt;/code&gt;, &lt;code&gt;cpu_index_tuple_cost&lt;/code&gt;, &lt;code&gt;cpu_operator_cost&lt;/code&gt;) are all expressed in the same arbitrary unit, which lets the planner compare heterogeneous operations against each other. &lt;code&gt;startup&lt;/code&gt; is the estimated cost to produce the first row from this node; &lt;code&gt;total&lt;/code&gt; is the estimated cost to produce all rows. The difference matters: a &lt;code&gt;Sort&lt;/code&gt; node has a high startup cost (it has to consume all input before it can produce the first row) but a low marginal cost per row after that. An &lt;code&gt;Index Scan&lt;/code&gt; has a very low startup cost. When a node sits beneath a &lt;code&gt;LIMIT&lt;/code&gt;, its &lt;em&gt;startup&lt;/em&gt; cost matters far more than its total, because the limit stops asking for rows as soon as it has enough.&lt;/p&gt;
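
&lt;p&gt;You can see this in the example plan. As far as I understand the planner, a &lt;code&gt;Limit&lt;/code&gt; node's total cost is its child's startup cost plus the fraction of the child's remaining cost that corresponds to the rows it expects to fetch. Plugging in the displayed &lt;code&gt;Nested Loop&lt;/code&gt; figures (20 of an estimated 100300 rows) as a rough check:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- child startup + (child total - child startup) * rows wanted / rows estimated
SELECT round(0.85 + (77853.84 - 0.85) * 20 / 100300.0, 2) AS approx_limit_total_cost;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That comes out at about 16.37 against the 16.38 the plan reports, a gap small enough to be explained by rounding in the displayed figures.&lt;/p&gt;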

&lt;p&gt;&lt;strong&gt;&lt;code&gt;rows=N&lt;/code&gt;.&lt;/strong&gt; The planner's estimate of how many rows this node will emit. &lt;em&gt;Per loop&lt;/em&gt; — see below.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;width=W&lt;/code&gt;.&lt;/strong&gt; Estimated average row width in bytes. Mostly informational; you use it to sanity-check whether a &lt;code&gt;Sort&lt;/code&gt; or &lt;code&gt;Hash&lt;/code&gt; might spill to disk (row width × estimated rows ≈ memory requirement).&lt;/p&gt;
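
&lt;p&gt;As a rough illustration of that sanity check, take the &lt;code&gt;rows=100300 width=37&lt;/code&gt; estimate from the plan above and imagine those rows feeding a &lt;code&gt;Sort&lt;/code&gt;. This only approximates the raw payload; real memory use is higher because of per-tuple overhead:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Estimated rows x estimated width, next to the current work_mem setting.
SELECT pg_size_pretty(100300::bigint * 37) AS rough_payload,
       current_setting('work_mem')         AS work_mem;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If the payload is already close to &lt;code&gt;work_mem&lt;/code&gt;, expect the sort or hash to spill once overhead is added.&lt;/p&gt;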

&lt;p&gt;&lt;strong&gt;&lt;code&gt;actual time=startup..total&lt;/code&gt;.&lt;/strong&gt; Wall-clock milliseconds, measured &lt;em&gt;per loop&lt;/em&gt;. &lt;code&gt;startup&lt;/code&gt; is the time to produce the first row from this node; &lt;code&gt;total&lt;/code&gt; is the time to produce the last row.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;actual rows=r loops=l&lt;/code&gt;.&lt;/strong&gt; &lt;code&gt;rows&lt;/code&gt; is the number of rows produced &lt;em&gt;per loop&lt;/em&gt;, averaged over all &lt;code&gt;l&lt;/code&gt; loops. To get the total rows this node emitted, multiply: &lt;code&gt;rows × loops&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Loops matter. In the nested loop example above, the &lt;code&gt;Memoize&lt;/code&gt; node reports &lt;code&gt;rows=1 loops=20&lt;/code&gt; — meaning the node was executed 20 times (once per outer row), and each execution produced 1 row. Total output: 20 rows. But the &lt;code&gt;actual time=0.006..0.006&lt;/code&gt; is &lt;em&gt;per loop&lt;/em&gt;, so the total time spent in &lt;code&gt;Memoize&lt;/code&gt; was about &lt;code&gt;0.006 ms × 20 = 0.12 ms&lt;/code&gt;. Forgetting to multiply by &lt;code&gt;loops&lt;/code&gt; is the single most common mistake in reading EXPLAIN output — a node that looks fast per loop can still dominate the query time if it runs 50,000 times.&lt;/p&gt;

&lt;p&gt;The relationship between the &lt;code&gt;rows&lt;/code&gt; estimate and &lt;code&gt;actual rows&lt;/code&gt; is arguably &lt;em&gt;the&lt;/em&gt; most important signal in a plan. If the planner estimated 15 and the actual was 8,000, the plan was built on bad assumptions: every decision that depended on that estimate (join strategy, memory allocation, whether to parallelise) is suspect. A ratio past 10× in either direction is worth treating as a warning; past 100× it's usually critical. The fix is almost always &lt;code&gt;ANALYZE&lt;/code&gt; on the affected table, or extended statistics if the bad estimate comes from correlated columns that the planner assumes are independent.&lt;/p&gt;
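
&lt;p&gt;Both fixes are short. In the sketch below the statistics name is made up, and the column pair is purely illustrative; use whichever columns your own filters combine:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- First resort: refresh the per-column statistics.
ANALYZE sim_bp_orders;

-- If estimates are still off for filters on two related columns,
-- tell the planner to track them together (hypothetical column pair).
CREATE STATISTICS stx_sim_bp_orders_status_user (dependencies, ndistinct)
    ON status, user_id FROM sim_bp_orders;

-- Extended statistics are only populated by the next ANALYZE.
ANALYZE sim_bp_orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Re-run &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; afterwards and compare the estimated and actual row counts again.&lt;/p&gt;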

&lt;h2&gt;
  
  
  Node types you'll see most often
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Scan nodes&lt;/strong&gt; — where rows enter the plan.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Seq Scan&lt;/code&gt;&lt;/strong&gt; — read every row of a table. Reports &lt;code&gt;Filter:&lt;/code&gt; when there's a WHERE clause applied, and &lt;code&gt;Rows Removed by Filter:&lt;/code&gt; telling you how many rows were read and discarded. Cheap on small tables, catastrophic on large ones with selective filters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Index Scan&lt;/code&gt;&lt;/strong&gt; — use an index to find rows, then fetch each matching row from the heap for any columns the index doesn't contain. Reports &lt;code&gt;Index Cond:&lt;/code&gt; for conditions satisfied by the index, and optionally &lt;code&gt;Filter:&lt;/code&gt; for conditions the index can't evaluate, which are applied to each row after the heap fetch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Index Only Scan&lt;/code&gt;&lt;/strong&gt; — use an index to find rows &lt;em&gt;and&lt;/em&gt; return all requested columns directly from the index, skipping the heap for every page the visibility map marks as all-visible. Requires either that the index includes every referenced column (see &lt;code&gt;INCLUDE&lt;/code&gt;) or that all columns are part of the index keys. Reports &lt;code&gt;Heap Fetches:&lt;/code&gt; — this number should be close to zero; a high count means the visibility map didn't cover those pages (typically because the table hasn't been vacuumed since recent writes) and PostgreSQL had to check the heap anyway, defeating the point.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Bitmap Index Scan&lt;/code&gt; + &lt;code&gt;Bitmap Heap Scan&lt;/code&gt;&lt;/strong&gt; — two-step pattern for combining multiple index conditions or for queries that match many rows. First, the index scan builds a bitmap of heap pages that might have matches. Then the heap scan visits those pages once each, avoiding re-reading pages that contain multiple matches. Reports &lt;code&gt;Heap Blocks: exact=N lossy=M&lt;/code&gt; in text output (&lt;code&gt;Exact Heap Blocks&lt;/code&gt; and &lt;code&gt;Lossy Heap Blocks&lt;/code&gt; in the JSON, XML, and YAML formats). A high lossy count means &lt;code&gt;work_mem&lt;/code&gt; was too small to track individual tuples, so PostgreSQL fell back to page-level tracking and has to recheck every row on those pages against the condition (&lt;code&gt;Recheck Cond:&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
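
&lt;p&gt;To see the &lt;code&gt;Index Only Scan&lt;/code&gt; and &lt;code&gt;Heap Fetches:&lt;/code&gt; behaviour for yourself, something like the following should work against the article's &lt;code&gt;sim_bp_orders&lt;/code&gt; table. The index name is made up, and whether the planner actually picks an index-only scan depends on the data and statistics, so treat this as a sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Covering index: status is the key, user_id rides along as a non-key column.
-- (Index name is illustrative.)
CREATE INDEX sim_bp_orders_status_incl_user_id_idx
    ON sim_bp_orders (status)
    INCLUDE (user_id);

-- VACUUM updates the visibility map, which is what keeps Heap Fetches low.
VACUUM (ANALYZE) sim_bp_orders;

-- Only indexed columns are referenced, so an Index Only Scan is possible.
EXPLAIN (ANALYZE, BUFFERS)
SELECT user_id
FROM sim_bp_orders
WHERE status = 'pending';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If &lt;code&gt;Heap Fetches:&lt;/code&gt; climbs again after a burst of writes, that usually points at autovacuum not keeping up on this table.&lt;/p&gt;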

&lt;p&gt;&lt;strong&gt;Join nodes&lt;/strong&gt; — combining two inputs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Nested Loop&lt;/code&gt;&lt;/strong&gt; — for each row on the outer side, scan the inner side. Optimal when the outer side is small and the inner side has an index on the join column. Pathological when both sides are large.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Hash Join&lt;/code&gt;&lt;/strong&gt; — build a hash table from the smaller side (the &lt;code&gt;Hash&lt;/code&gt; child), then probe it with each row from the other side. Optimal for equi-joins on unordered data when the smaller side fits in the hash memory budget (&lt;code&gt;work_mem&lt;/code&gt;, scaled by &lt;code&gt;hash_mem_multiplier&lt;/code&gt; on PostgreSQL 13+). The &lt;code&gt;Hash&lt;/code&gt; child reports &lt;code&gt;Batches:&lt;/code&gt; (&lt;code&gt;Hash Batches&lt;/code&gt; in the JSON, XML, and YAML formats) — if this is greater than 1, the hash table didn't fit in memory and had to spill.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Merge Join&lt;/code&gt;&lt;/strong&gt; — two pre-sorted inputs, walked in parallel. Optimal when both sides are already sorted (or can be sorted cheaply via an index). Reports &lt;code&gt;Merge Cond:&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
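
&lt;p&gt;When you suspect the planner picked the wrong join strategy, one way to check is to disable a strategy for a single transaction and compare the two plans. A sketch using the article's tables; the planner toggles are real settings, but they are a diagnostic aid rather than something to leave on in production:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;BEGIN;

-- SET LOCAL reverts automatically when the transaction ends.
SET LOCAL enable_hashjoin = off;

-- With hash joins discouraged, the planner falls back to a merge join or nested loop.
EXPLAIN (ANALYZE, BUFFERS)
SELECT u.email, o.status
FROM sim_bp_orders AS o
JOIN sim_bp_users  AS u ON u.user_id = o.user_id
WHERE o.status = 'pending';

ROLLBACK;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If the forced alternative is faster, the usual culprit is a bad row estimate rather than the join algorithm itself.&lt;/p&gt;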

&lt;p&gt;&lt;strong&gt;Sort and aggregation nodes.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Sort&lt;/code&gt;&lt;/strong&gt; — ordering rows. Reports &lt;code&gt;Sort Key:&lt;/code&gt; (the columns being sorted) and &lt;code&gt;Sort Method:&lt;/code&gt; (the algorithm), followed by &lt;code&gt;Memory:&lt;/code&gt; or &lt;code&gt;Disk:&lt;/code&gt; and the space used in kB. The JSON, XML, and YAML formats label those last two &lt;code&gt;Sort Space Type&lt;/code&gt; and &lt;code&gt;Sort Space Used&lt;/code&gt;.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;top-N heapsort&lt;/code&gt;&lt;/strong&gt; — used under a &lt;code&gt;LIMIT N&lt;/code&gt;. Keeps only N rows in a heap regardless of input size. Efficient in memory and time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;quicksort&lt;/code&gt;&lt;/strong&gt; — everything fits in &lt;code&gt;work_mem&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;external merge&lt;/code&gt;&lt;/strong&gt; — didn't fit; spilled to disk.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;&lt;code&gt;Aggregate&lt;/code&gt; / &lt;code&gt;HashAggregate&lt;/code&gt; / &lt;code&gt;GroupAggregate&lt;/code&gt;&lt;/strong&gt; — SUM/AVG/COUNT/GROUP BY. &lt;code&gt;HashAggregate&lt;/code&gt; builds a hash table keyed by the group-by columns; &lt;code&gt;GroupAggregate&lt;/code&gt; requires presorted input. &lt;code&gt;HashAggregate&lt;/code&gt; can spill to disk with &lt;code&gt;Planned Partitions: N Batches: M&lt;/code&gt;.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;&lt;code&gt;Limit&lt;/code&gt;&lt;/strong&gt; — cap the number of rows. Often the shortcut that makes a plan fast.&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;&lt;code&gt;WindowAgg&lt;/code&gt;&lt;/strong&gt; — window functions like &lt;code&gt;ROW_NUMBER()&lt;/code&gt; and &lt;code&gt;SUM() OVER&lt;/code&gt;.&lt;/li&gt;

&lt;/ul&gt;
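
&lt;p&gt;The three sort methods are easy to provoke on a scratch connection. This sketch uses &lt;code&gt;generate_series&lt;/code&gt; so it needs no tables; the row count and memory values are arbitrary choices, and the method actually reported may vary with version and settings:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;BEGIN;

-- 64kB is the minimum work_mem; a 200,000-row sort should not fit and will likely go to disk.
SET LOCAL work_mem = '64kB';
EXPLAIN (ANALYZE)
SELECT g FROM generate_series(1, 200000) AS g ORDER BY g DESC;

-- Same sort with plenty of memory; expect an in-memory method instead.
SET LOCAL work_mem = '64MB';
EXPLAIN (ANALYZE)
SELECT g FROM generate_series(1, 200000) AS g ORDER BY g DESC;

-- With a LIMIT, the executor only needs to keep the top 20 rows in a bounded heap.
EXPLAIN (ANALYZE)
SELECT g FROM generate_series(1, 200000) AS g ORDER BY g DESC LIMIT 20;

ROLLBACK;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Compare the &lt;code&gt;Sort Method:&lt;/code&gt; line across the three plans, along with whether it reports &lt;code&gt;Memory:&lt;/code&gt; or &lt;code&gt;Disk:&lt;/code&gt;.&lt;/p&gt;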

&lt;p&gt;&lt;strong&gt;Parallelism.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Gather&lt;/code&gt; / &lt;code&gt;Gather Merge&lt;/code&gt;&lt;/strong&gt; — the leader process collecting results from parallel workers. &lt;code&gt;Workers Planned:&lt;/code&gt; and &lt;code&gt;Workers Launched:&lt;/code&gt; tell you how many workers the planner asked for vs actually got. When &lt;code&gt;Launched &amp;lt; Planned&lt;/code&gt;, the system is short on parallel worker slots.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Parallel Seq Scan&lt;/code&gt; / &lt;code&gt;Parallel Index Scan&lt;/code&gt; / &lt;code&gt;Parallel Hash Join&lt;/code&gt;&lt;/strong&gt; — parallel-aware variants of the base node types.&lt;/li&gt;
&lt;/ul&gt;
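
&lt;p&gt;When &lt;code&gt;Workers Launched&lt;/code&gt; falls short of &lt;code&gt;Workers Planned&lt;/code&gt;, these three settings are the first place to look. They are standard PostgreSQL parameters; the right values depend on core count and concurrency, so none are suggested here:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SHOW max_worker_processes;             -- cluster-wide pool of background workers
SHOW max_parallel_workers;             -- how many of those may serve parallel queries
SHOW max_parallel_workers_per_gather;  -- cap for a single Gather or Gather Merge node
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;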

&lt;p&gt;&lt;strong&gt;Utility nodes.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Materialize&lt;/code&gt;&lt;/strong&gt; — cache an intermediate result so the parent can rescan it without redoing the work. Common above the inner side of a Nested Loop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Memoize&lt;/code&gt;&lt;/strong&gt; (new in PostgreSQL 14) — LRU cache above an inner loop. Reports &lt;code&gt;Cache Key:&lt;/code&gt;, &lt;code&gt;Hits:&lt;/code&gt;, &lt;code&gt;Misses:&lt;/code&gt;, &lt;code&gt;Evictions:&lt;/code&gt;, and &lt;code&gt;Memory Usage:&lt;/code&gt;. A high hit ratio is good; a high miss ratio means the cache didn't help this particular query and cost it a small amount of lookup overhead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;CTE Scan&lt;/code&gt;&lt;/strong&gt; — reading from a materialised CTE. In PostgreSQL 12+ most CTEs are inlined and this node disappears; you see it when a CTE is referenced more than once, is recursive or has side effects, or is marked &lt;code&gt;MATERIALIZED&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SubPlan&lt;/code&gt;&lt;/strong&gt; — a correlated subquery, executed once per outer row. Almost always worth rewriting as a JOIN.&lt;/li&gt;
&lt;/ul&gt;
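
&lt;p&gt;As an illustration of the &lt;code&gt;SubPlan&lt;/code&gt; rewrite, here is a correlated count and an equivalent join form, written against the article's two tables. The &lt;code&gt;order_count&lt;/code&gt; alias is illustrative, and whether the join form wins depends on how many outer rows there are:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Correlated form: the inner count is typically planned as a SubPlan run once per user row.
SELECT u.email,
       (SELECT count(*)
        FROM sim_bp_orders AS o
        WHERE o.user_id = u.user_id) AS order_count
FROM sim_bp_users AS u;

-- Join form: aggregate once, then join.
-- LEFT JOIN plus COALESCE keeps users with no orders, matching the form above.
SELECT u.email,
       COALESCE(c.order_count, 0) AS order_count
FROM sim_bp_users AS u
LEFT JOIN (
    SELECT user_id, count(*) AS order_count
    FROM sim_bp_orders
    GROUP BY user_id
) AS c ON c.user_id = u.user_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Run both under &lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/code&gt; and check the &lt;code&gt;loops&lt;/code&gt; count under the &lt;code&gt;SubPlan&lt;/code&gt; in the first plan.&lt;/p&gt;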

&lt;h2&gt;
  
  
  The Buffers line
&lt;/h2&gt;

&lt;p&gt;With &lt;code&gt;BUFFERS&lt;/code&gt; enabled, every node reports how many 8 KB pages it touched:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Buffers: shared hit=3689
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The four counters to know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;shared hit=N&lt;/code&gt;&lt;/strong&gt; — pages found in &lt;code&gt;shared_buffers&lt;/code&gt; (PostgreSQL's cache). No I/O system calls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;shared read=N&lt;/code&gt;&lt;/strong&gt; — pages the backend had to read into &lt;code&gt;shared_buffers&lt;/code&gt; via a &lt;code&gt;read()&lt;/code&gt; system call. Whether the OS page cache satisfied the read without touching disk is invisible to EXPLAIN — these show up as reads regardless.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;shared dirtied=N&lt;/code&gt;&lt;/strong&gt; — pages the query modified in cache. Common with DML; in a read-only SELECT, usually comes from hint-bit updates or cleanup.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;shared written=N&lt;/code&gt;&lt;/strong&gt; — pages written back out during this node's execution. Usually this is the backend itself being forced to evict dirty pages to make room for new ones, not the background writer — so a high &lt;code&gt;written&lt;/code&gt; count means your query is doing someone else's work because the dirty-page pool was already full.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There's also &lt;code&gt;local hit/read/dirtied/written&lt;/code&gt; for per-session temporary tables, and &lt;code&gt;temp read=N written=N&lt;/code&gt; for work files produced by sorts and hash joins that spilled.&lt;/p&gt;

&lt;p&gt;A query doing &lt;code&gt;shared read=2016, temp written=2051&lt;/code&gt; in a single node is telling you two things: the table isn't fitting in cache, &lt;em&gt;and&lt;/em&gt; the query itself is generating its own on-disk temp files because some operation (hash, sort, bitmap) exceeded &lt;code&gt;work_mem&lt;/code&gt;. Both are fixable; both hurt.&lt;/p&gt;
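
&lt;p&gt;Buffer counts are in blocks, which makes them easy to misjudge. A small sketch that converts the two counters above into sizes, assuming the server's &lt;code&gt;block_size&lt;/code&gt; setting (8192 bytes on a default build):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Multiply each block count by the server's block size and pretty-print the result.
SELECT counter,
       blocks,
       pg_size_pretty(blocks * current_setting('block_size')::bigint) AS approx_size
FROM (VALUES
        ('shared read',  2016::bigint),
        ('temp written', 2051::bigint)
     ) AS t(counter, blocks);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With 8 KB blocks, each of the two counters works out to roughly 16 MB.&lt;/p&gt;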

&lt;h2&gt;
  
  
  A harder plan: the HashAggregate spill
&lt;/h2&gt;

&lt;p&gt;Here's a plan with more going on — "the twenty users with the most pending-or-shipped orders." Against the same 500,000-row orders table and 200,000-row users table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;Limit&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42281&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;42282&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;29&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;408&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;141&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;408&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;145&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;Buffers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3737&lt;/span&gt; &lt;span class="k"&gt;read&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2016&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;temp&lt;/span&gt; &lt;span class="k"&gt;read&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1320&lt;/span&gt; &lt;span class="n"&gt;written&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2051&lt;/span&gt;
  &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="n"&gt;Sort&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42281&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;42722&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;81&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;176383&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;29&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;406&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;664&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;406&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;667&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;Sort&lt;/span&gt; &lt;span class="k"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
        &lt;span class="n"&gt;Sort&lt;/span&gt; &lt;span class="k"&gt;Method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;top&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="n"&gt;heapsort&lt;/span&gt;  &lt;span class="n"&gt;Memory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;26&lt;/span&gt;&lt;span class="n"&gt;kB&lt;/span&gt;
        &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="n"&gt;HashAggregate&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;33757&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;54&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;37588&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;36&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;176383&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;29&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
              &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;347&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;354&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;392&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;138&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;117060&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
              &lt;span class="k"&gt;Group&lt;/span&gt; &lt;span class="k"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;
              &lt;span class="n"&gt;Planned&lt;/span&gt; &lt;span class="n"&gt;Partitions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;  &lt;span class="n"&gt;Batches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;  &lt;span class="n"&gt;Memory&lt;/span&gt; &lt;span class="k"&gt;Usage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8241&lt;/span&gt;&lt;span class="n"&gt;kB&lt;/span&gt;  &lt;span class="n"&gt;Disk&lt;/span&gt; &lt;span class="k"&gt;Usage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;6920&lt;/span&gt;&lt;span class="n"&gt;kB&lt;/span&gt;
              &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="n"&gt;Hash&lt;/span&gt; &lt;span class="k"&gt;Join&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;7932&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;21080&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;02&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;176383&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;21&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;140&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;974&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;285&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;275&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;175263&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="n"&gt;Hash&lt;/span&gt; &lt;span class="n"&gt;Cond&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="n"&gt;Seq&lt;/span&gt; &lt;span class="n"&gt;Scan&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;9939&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;176383&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                          &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;018&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;51&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;809&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;175263&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                          &lt;span class="n"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;ANY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'{pending,shipped}'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[]))&lt;/span&gt;
                          &lt;span class="k"&gt;Rows&lt;/span&gt; &lt;span class="n"&gt;Removed&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;324737&lt;/span&gt;
                    &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="n"&gt;Hash&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4064&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;4064&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200000&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                          &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;140&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;864&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;140&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;865&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200000&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                          &lt;span class="n"&gt;Buckets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;131072&lt;/span&gt;  &lt;span class="n"&gt;Batches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;  &lt;span class="n"&gt;Memory&lt;/span&gt; &lt;span class="k"&gt;Usage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;6822&lt;/span&gt;&lt;span class="n"&gt;kB&lt;/span&gt;
                          &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="n"&gt;Seq&lt;/span&gt; &lt;span class="n"&gt;Scan&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;sim_bp_users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;4064&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200000&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;735&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;87&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;218&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200000&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="n"&gt;Planning&lt;/span&gt; &lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;107&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
 &lt;span class="n"&gt;Execution&lt;/span&gt; &lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;408&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;215&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
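
&lt;p&gt;Only the plan is shown, not the statement behind it. A query of roughly this shape would be consistent with the group key, filter, join condition, sort key, and limit in the plan; it is a reconstruction for orientation, and the &lt;code&gt;order_count&lt;/code&gt; alias is an assumption:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Reconstructed from the plan above; not necessarily the original statement.
EXPLAIN (ANALYZE, BUFFERS)
SELECT u.email, count(*) AS order_count
FROM sim_bp_orders AS o
JOIN sim_bp_users  AS u ON u.user_id = o.user_id
WHERE o.status IN ('pending', 'shipped')
GROUP BY u.email
ORDER BY count(*) DESC
LIMIT 20;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;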



&lt;p&gt;408 ms. Let's read it top-down and find where the time actually goes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root: &lt;code&gt;Limit&lt;/code&gt; + &lt;code&gt;Sort&lt;/code&gt;.&lt;/strong&gt; The Sort is &lt;code&gt;top-N heapsort, Memory: 26 kB&lt;/code&gt; — fine. Under the &lt;code&gt;LIMIT 20&lt;/code&gt;, a top-N sort is almost free regardless of input size.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;HashAggregate&lt;/code&gt; — the first red flag.&lt;/strong&gt; The Group Key is &lt;code&gt;u.email&lt;/code&gt;; the aggregate is a &lt;code&gt;count(*)&lt;/code&gt; across the 175k joined rows. Two numbers jump out: &lt;code&gt;Planned Partitions: 4  Batches: 5&lt;/code&gt; and &lt;code&gt;Memory Usage: 8241 kB  Disk Usage: 6920 kB&lt;/code&gt;. PostgreSQL 13+ can spill a &lt;code&gt;HashAggregate&lt;/code&gt; to disk when the hash table exceeds its memory budget (&lt;code&gt;work_mem&lt;/code&gt; multiplied by &lt;code&gt;hash_mem_multiplier&lt;/code&gt;): the executor detects that not all groups will fit in memory, writes unfinished groups out to per-partition spill files, and processes them in a second pass. The exact number of spill-and-resume cycles isn't something you should read literally from the &lt;code&gt;Batches&lt;/code&gt; count, but the presence of &lt;code&gt;Disk Usage&lt;/code&gt; at all is the signal: this query is paying for temp file I/O on every run. The &lt;code&gt;temp written=2051&lt;/code&gt; buffer count at the top is driven partly by this (6920 kB is about 865 of those 2,051 pages); the rest comes from the Hash Join below. Subtracting the child's time (392.138 − 285.275), the aggregate accounts for roughly 107 ms of the 408 ms, the largest share of any single node.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;Hash Join&lt;/code&gt; + &lt;code&gt;Hash&lt;/code&gt; child — the second red flag.&lt;/strong&gt; &lt;code&gt;Buckets: 131072  Batches: 2  Memory Usage: 6822 kB&lt;/code&gt;. The hash table built from &lt;code&gt;sim_bp_users&lt;/code&gt; needed about 13 MB (the build side is 200k rows at ~64 bytes each) and didn't fit in the hash memory budget. (The 6822 kB reported is above a 4 MB &lt;code&gt;work_mem&lt;/code&gt; because hash nodes may use &lt;code&gt;work_mem&lt;/code&gt; × &lt;code&gt;hash_mem_multiplier&lt;/code&gt;, which defaults to 2.0 from PostgreSQL 15; that presumably applies here.) When a hash join spills, PostgreSQL partitions &lt;em&gt;both&lt;/em&gt; sides by the join key and processes one matched pair of partitions at a time — each probe row is tested only against its matching partition, not against every batch. The cost is the extra I/O of writing the build and probe sides to per-partition temp files and reading them back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;Seq Scan on sim_bp_orders&lt;/code&gt;.&lt;/strong&gt; 175k rows returned, 325k removed by filter (total = 500k, the whole table). The filter is &lt;code&gt;status IN ('pending', 'shipped')&lt;/code&gt;. No index on &lt;code&gt;status&lt;/code&gt;, so the whole table is scanned.&lt;/p&gt;

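&lt;p&gt;&lt;strong&gt;&lt;code&gt;Seq Scan on sim_bp_users&lt;/code&gt;.&lt;/strong&gt; 200k rows returned, no filter — we need all users. Reads 2016 pages from outside &lt;code&gt;shared_buffers&lt;/code&gt; (&lt;code&gt;shared read=2016&lt;/code&gt;, some of which the OS page cache may have served without touching disk); the users table is mostly cold in PostgreSQL's own cache.&lt;/p&gt;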
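
&lt;p&gt;Because each node's &lt;code&gt;actual time&lt;/code&gt; includes its children, it helps to subtract them before ranking nodes. The sketch below does that by hand for this plan: the inclusive totals are copied from the output above, every node has &lt;code&gt;loops=1&lt;/code&gt;, and treating a child's total time as fully contained in its parent is an approximation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- inclusive_ms: the node's own "total" actual time.
-- children_ms:  the sum of its direct children's totals.
WITH plan_nodes(node, inclusive_ms, children_ms) AS (
    VALUES
        ('Limit',                  408.145, 406.667),
        ('Sort',                   406.667, 392.138),
        ('HashAggregate',          392.138, 285.275),
        ('Hash Join',              285.275, 51.809 + 140.865),
        ('Hash',                   140.865, 87.218),
        ('Seq Scan sim_bp_users',   87.218, 0.0),
        ('Seq Scan sim_bp_orders',  51.809, 0.0)
)
SELECT node,
       inclusive_ms - children_ms AS exclusive_ms
FROM plan_nodes
ORDER BY exclusive_ms DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;On these figures the &lt;code&gt;HashAggregate&lt;/code&gt; comes out highest at about 107 ms, with the &lt;code&gt;Hash Join&lt;/code&gt; close behind at about 93 ms.&lt;/p&gt;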

&lt;p&gt;The bottleneck order, from biggest to smallest: HashAggregate spill, Hash Join build-side batches, Seq Scans. Three different fixes are plausible, and which one is appropriate depends on how often this query runs, how much &lt;code&gt;work_mem&lt;/code&gt; the rest of the workload can tolerate, and whether the data is append-mostly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Raise &lt;code&gt;work_mem&lt;/code&gt; per-session&lt;/strong&gt; to ~20 MB so both the HashAggregate and the Hash Join stay in memory. Caveat: &lt;code&gt;work_mem&lt;/code&gt; is allocated per sort/hash node per connection, so raising it globally multiplies by the number of concurrent sort and hash operations. Set it per-role (&lt;code&gt;ALTER ROLE dashboard SET work_mem = '20MB'&lt;/code&gt;) or per-session in the dashboard's connection pool, not cluster-wide.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Index &lt;code&gt;sim_bp_orders.status&lt;/code&gt;&lt;/strong&gt; so the scan becomes a Bitmap or Index Scan instead of reading all 500k rows. At ~35% selectivity a plain btree is unlikely to beat a seq scan by much, but a partial index or a multi-column &lt;code&gt;(status, user_id)&lt;/code&gt; could, because either can serve the query from the index alone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Materialise the aggregate&lt;/strong&gt; into a small summary table refreshed on a schedule or via triggers, if the query is a dashboard that runs every 10 seconds and the underlying data is append-mostly.&lt;/li&gt;
&lt;/ul&gt;
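
&lt;p&gt;For concreteness, the three options might look like this. The index and view names are made up, the 20 MB figure is the estimate from the list above, and the third option uses a materialised view as one possible stand-in for the summary table described there:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- 1. More memory for this transaction only.
BEGIN;
SET LOCAL work_mem = '20MB';
-- ... run the dashboard query here ...
COMMIT;

-- 2. Partial index covering only the statuses the query filters on (name is illustrative).
CREATE INDEX sim_bp_orders_open_user_id_idx
    ON sim_bp_orders (user_id)
    WHERE status IN ('pending', 'shipped');

-- 3. Precomputed summary (name is illustrative).
CREATE MATERIALIZED VIEW open_orders_per_user AS
SELECT u.email, count(*) AS order_count
FROM sim_bp_orders AS o
JOIN sim_bp_users  AS u ON u.user_id = o.user_id
WHERE o.status IN ('pending', 'shipped')
GROUP BY u.email;

-- Re-run on whatever schedule the data's freshness requirements allow.
REFRESH MATERIALIZED VIEW open_orders_per_user;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each option trades something different: memory headroom, write amplification on &lt;code&gt;sim_bp_orders&lt;/code&gt;, or staleness of the result.&lt;/p&gt;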

&lt;p&gt;A fair DBA answer is "measure each fix in isolation and pick based on the workload" — not any specific prescribed order. If the query runs once a day in a reporting job, the &lt;code&gt;work_mem&lt;/code&gt; bump is cheapest; if it runs constantly and powers a UI, the materialised result wins.&lt;/p&gt;

&lt;h2&gt;
  
  
  The five most common mistakes in reading plans
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Comparing &lt;code&gt;rows&lt;/code&gt; without multiplying by &lt;code&gt;loops&lt;/code&gt;.&lt;/strong&gt; A node reporting &lt;code&gt;rows=1 loops=50000&lt;/code&gt; produced 50,000 rows. A node reporting &lt;code&gt;rows=50000 loops=1&lt;/code&gt; produced the same 50,000 rows in a very different shape. Always look at &lt;code&gt;loops&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Looking at top-line cost/time and calling it a day.&lt;/strong&gt; The top-line number tells you the query is slow; it doesn't tell you &lt;em&gt;which node&lt;/em&gt; is slow. A node's &lt;code&gt;actual time&lt;/code&gt; includes the time spent in its children, so for each node take &lt;code&gt;actual time × loops&lt;/code&gt; and subtract the same figure for its children. The node with the largest remainder is where the time is spent, and usually where the fix is.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Trusting the planner's estimates when &lt;code&gt;actual rows&lt;/code&gt; disagrees.&lt;/strong&gt; If &lt;code&gt;rows=15&lt;/code&gt; on the estimate and &lt;code&gt;actual rows=8000&lt;/code&gt;, every downstream decision was built on the wrong premise. Don't try to understand why the plan is shaped the way it is until you've fixed the estimate (usually with &lt;code&gt;ANALYZE&lt;/code&gt; or extended statistics).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Missing the &lt;code&gt;Rows Removed by Filter&lt;/code&gt; line.&lt;/strong&gt; A &lt;code&gt;Seq Scan&lt;/code&gt; returning a reasonable number of rows looks fine — until you notice the filter line says ten million rows were read and discarded to produce those few. The scan was fine; the cost is in the discard.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ignoring the &lt;code&gt;Buffers&lt;/code&gt; line.&lt;/strong&gt; Two plans can have identical shapes and wildly different performance if one hits cache and the other doesn't. &lt;code&gt;shared hit=5&lt;/code&gt; means "hot"; &lt;code&gt;shared read=50000&lt;/code&gt; means "the storage layer did all the work, and next time it might be even worse." The &lt;code&gt;Buffers&lt;/code&gt; line is the only way to see this without looking at the timing.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Next steps
&lt;/h2&gt;

&lt;p&gt;If the first plan in this article (the nested loop) looked straightforward and the second (the HashAggregate spill) made sense, you've mostly got it. The rest of the series digs into specific bottleneck categories — missing indexes, join-strategy mistakes, aggregate spills, non-sargable WHERE clauses — and what to do about each. The next piece is &lt;a href="https://mydba.dev/blog/postgres-index-usage-optimization" rel="noopener noreferrer"&gt;PostgreSQL Index Usage and Optimization&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;#postgres #performance #database #sql&lt;/p&gt;

&lt;p&gt;Originally published at &lt;a href="https://mydba.dev/blog/postgres-explain-analyze-reading" rel="noopener noreferrer"&gt;https://mydba.dev/blog/postgres-explain-analyze-reading&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>performance</category>
      <category>postgres</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>The Complete Guide to PostgreSQL SQL Query Analysis &amp; Optimization</title>
      <dc:creator>Philip McClarence</dc:creator>
      <pubDate>Thu, 23 Apr 2026 14:00:06 +0000</pubDate>
      <link>https://dev.to/philip_mcclarence_2ef9475/the-complete-guide-to-postgresql-sql-query-analysis-optimization-3lbe</link>
      <guid>https://dev.to/philip_mcclarence_2ef9475/the-complete-guide-to-postgresql-sql-query-analysis-optimization-3lbe</guid>
      <description>&lt;p&gt;Most PostgreSQL performance work is wasted because it starts from the wrong end. Someone notices a slow query, skim-reads &lt;code&gt;EXPLAIN&lt;/code&gt;, pattern-matches to "missing index," adds one, and moves on. Sometimes that works. Often it doesn't — and when it doesn't, the next attempt is usually an even blunter instrument: "just add more RAM," "just use a read replica," "just cache it."&lt;/p&gt;

&lt;p&gt;This guide is a systematic alternative. The argument is that a large fraction of single-query latency problems in OLTP workloads fall into one of a small number of bottleneck categories, each with a characteristic EXPLAIN signature and a well-understood fix. (Lock contention, vacuum bloat, replication lag, and the generic-plan vs custom-plan behaviour of prepared statements are real and common, but they are cluster-level or protocol-level problems rather than single-plan problems; this guide is strictly about the latter.) If you can name the category in sixty seconds of reading the plan, the fix usually follows in minutes.&lt;/p&gt;

&lt;p&gt;We'll work through the full workflow end-to-end on a real query against a real PostgreSQL 17 database, then map the eight bottleneck categories to the eight deep-dive articles that make up this series. Every EXPLAIN snippet below is captured from an actual run against a 500,000-row &lt;code&gt;sim_bp_orders&lt;/code&gt; table on a Neon Postgres 17.8 database — not a synthetic example.&lt;/p&gt;

&lt;h2&gt;
  
  
  The workflow
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Read the EXPLAIN plan&lt;/strong&gt; — specifically the three signals that matter most: estimated-vs-actual row counts, access path at each scan node, and where time is actually spent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Categorise the bottleneck&lt;/strong&gt; — translate the plan signals into one of eight categories.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apply the matching fix&lt;/strong&gt; — index, rewrite, tune memory, or restructure the query.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify with a second EXPLAIN&lt;/strong&gt; — before/after is how you know you actually fixed something.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. The rest of this article walks through each step on a concrete example.&lt;/p&gt;

&lt;h2&gt;
  
  
  A typical slow query
&lt;/h2&gt;

&lt;p&gt;Our running example is a dashboard query: "show me the fifty highest-value pending orders."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;total_amount_cents&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;created_at&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pending'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_amount_cents&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The table is 500,000 rows, with roughly 20% in &lt;code&gt;status = 'pending'&lt;/code&gt;. There's a primary key on &lt;code&gt;order_id&lt;/code&gt;, indexes on &lt;code&gt;user_id&lt;/code&gt; and &lt;code&gt;created_at&lt;/code&gt;, but &lt;strong&gt;no index on &lt;code&gt;status&lt;/code&gt; or &lt;code&gt;total_amount_cents&lt;/code&gt;&lt;/strong&gt;. We've disabled parallel execution (&lt;code&gt;SET max_parallel_workers_per_gather = 0&lt;/code&gt;) for this example so the plan reads cleanly. Here's the plan:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;Limit&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;13270&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;89&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;13271&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;02&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;873&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;883&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;Buffers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3689&lt;/span&gt;
  &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="n"&gt;Sort&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;13270&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;89&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;13521&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100300&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;871&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;877&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;Sort&lt;/span&gt; &lt;span class="k"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_amount_cents&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
        &lt;span class="n"&gt;Sort&lt;/span&gt; &lt;span class="k"&gt;Method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;top&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="n"&gt;heapsort&lt;/span&gt;  &lt;span class="n"&gt;Memory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="n"&gt;kB&lt;/span&gt;
        &lt;span class="n"&gt;Buffers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3689&lt;/span&gt;
        &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="n"&gt;Seq&lt;/span&gt; &lt;span class="n"&gt;Scan&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;9939&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100300&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;011&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;37&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;781&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100252&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
              &lt;span class="n"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pending'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
              &lt;span class="k"&gt;Rows&lt;/span&gt; &lt;span class="n"&gt;Removed&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;399748&lt;/span&gt;
              &lt;span class="n"&gt;Buffers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3689&lt;/span&gt;
 &lt;span class="n"&gt;Planning&lt;/span&gt; &lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;073&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
 &lt;span class="n"&gt;Execution&lt;/span&gt; &lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;908&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fifty-one milliseconds is not a disaster on its own. It's the kind of number that gets shrugged at until a hundred of these queries run concurrently on a busy application server, at which point CPU saturates and every request starts stacking.&lt;/p&gt;
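
&lt;p&gt;A plan like the one above can be captured along these lines. The exact &lt;code&gt;EXPLAIN&lt;/code&gt; options used for this article are not stated, so &lt;code&gt;BUFFERS&lt;/code&gt; is an assumption inferred from the &lt;code&gt;Buffers:&lt;/code&gt; lines in the output.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Session-local: keep the plan single-process so it reads cleanly.
SET max_parallel_workers_per_gather = 0;

-- ANALYZE executes the query; BUFFERS adds the "Buffers:" lines.
EXPLAIN (ANALYZE, BUFFERS)
SELECT order_id,
       user_id,
       total_amount_cents,
       created_at
FROM sim_bp_orders
WHERE status = 'pending'
ORDER BY total_amount_cents DESC
LIMIT 50;

-- Restore the default for the rest of the session.
RESET max_parallel_workers_per_gather;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; really runs the statement, wrap data-modifying queries in a transaction and roll back.&lt;/p&gt;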

&lt;h2&gt;
  
  
  Step 1 — Read the plan
&lt;/h2&gt;

&lt;p&gt;Three signals tell you nearly everything about a plan node.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signal 1 — how the table is accessed.&lt;/strong&gt; At the leaf of this plan is &lt;code&gt;Seq Scan on sim_bp_orders&lt;/code&gt;. A sequential scan means the planner's cost model decided reading every row was cheaper than any available index — sometimes because no useful index exists, sometimes because existing indexes don't match the query shape, occasionally because statistics are misleading the cost estimate. On small tables, or when the query needs a large fraction of the table anyway, a seq scan is often genuinely the cheapest plan. But on a 500k-row table with a selective filter and an &lt;code&gt;ORDER BY ... LIMIT 50&lt;/code&gt;, it's the wrong shape.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signal 2 — rows removed by filter.&lt;/strong&gt; The scan reports &lt;code&gt;Rows Removed by Filter: 399748&lt;/code&gt; against &lt;code&gt;actual rows=100252&lt;/code&gt; kept, so it touched every row in the table. A filter that keeps ~20% of rows is not pathological by itself, but it still means roughly 400,000 rows are read and discarded every time the dashboard refreshes. An index that matches the filter would let PostgreSQL skip them entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Signal 3 — planner estimate vs reality.&lt;/strong&gt; &lt;code&gt;rows=100,300&lt;/code&gt; estimated vs &lt;code&gt;rows=100,252&lt;/code&gt; actual. Essentially perfect. If the ratio had been ten-to-one or worse in either direction, the plan would be built on bad assumptions and &lt;code&gt;ANALYZE&lt;/code&gt; would be the first move. Here, statistics are healthy.&lt;/p&gt;
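
&lt;p&gt;When the estimate and the actual count do diverge, it helps to look at what the planner believes before reaching for &lt;code&gt;ANALYZE&lt;/code&gt;. A minimal sketch using the standard statistics views, assuming the table lives in a schema on the current &lt;code&gt;search_path&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- When were statistics last refreshed, and how many rows changed since?
SELECT relname,
       last_analyze,
       last_autoanalyze,
       n_mod_since_analyze
FROM pg_stat_user_tables
WHERE relname = 'sim_bp_orders';

-- What the planner thinks the status distribution looks like.
SELECT most_common_vals,
       most_common_freqs
FROM pg_stats
WHERE tablename = 'sim_bp_orders'
  AND attname = 'status';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In this example the frequency listed for &lt;code&gt;pending&lt;/code&gt; should sit close to 0.2, which is consistent with the estimate of 100,300 rows out of 500,000.&lt;/p&gt;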

&lt;p&gt;One more node is worth naming: the &lt;code&gt;Sort&lt;/code&gt; above the scan uses &lt;code&gt;top-N heapsort&lt;/code&gt;. Unlike a full sort, a top-N heapsort streams all input rows through a heap of size N (50 here). It reads all 100,252 pending rows but only ever holds 50 in memory, which is why &lt;code&gt;Memory: 30kB&lt;/code&gt; is so small. Even so, it's 100,252 rows of unnecessary work: an index on &lt;code&gt;(total_amount_cents DESC) WHERE status = 'pending'&lt;/code&gt; would let the planner walk the index from the largest value downward and stop after fifty entries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2 — Categorise the bottleneck
&lt;/h2&gt;

&lt;p&gt;Once you've read the plan, map what you see to one of eight categories. Each category has a characteristic signature; each maps to a deep-dive article in this series.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Plan signal&lt;/th&gt;
&lt;th&gt;Bottleneck category&lt;/th&gt;
&lt;th&gt;Fix article&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;You can't even read the plan confidently&lt;/td&gt;
&lt;td&gt;Plan literacy&lt;/td&gt;
&lt;td&gt;&lt;a href="https://mydba.dev/blog/postgres-explain-analyze-reading" rel="noopener noreferrer"&gt;Reading EXPLAIN / EXPLAIN ANALYZE Output&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;Seq Scan&lt;/code&gt; with large row counts, many &lt;code&gt;Rows Removed by Filter&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Missing or wrong index&lt;/td&gt;
&lt;td&gt;&lt;a href="https://mydba.dev/blog/postgres-index-usage-optimization" rel="noopener noreferrer"&gt;Index Usage &amp;amp; Optimisation&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nested Loop joining large tables; Hash Join spilling to disk&lt;/td&gt;
&lt;td&gt;Join strategy&lt;/td&gt;
&lt;td&gt;&lt;a href="https://mydba.dev/blog/postgres-join-optimization" rel="noopener noreferrer"&gt;Join Optimisation&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;CTE Scan&lt;/code&gt; feeding a filter; &lt;code&gt;SubPlan&lt;/code&gt; running per outer row&lt;/td&gt;
&lt;td&gt;Subquery / CTE structure&lt;/td&gt;
&lt;td&gt;&lt;a href="https://mydba.dev/blog/postgres-subquery-cte-optimization" rel="noopener noreferrer"&gt;Subquery &amp;amp; CTE Optimisation&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;HashAggregate&lt;/code&gt; or &lt;code&gt;Sort&lt;/code&gt; spilling, expensive window functions&lt;/td&gt;
&lt;td&gt;Aggregate or window tuning&lt;/td&gt;
&lt;td&gt;&lt;a href="https://mydba.dev/blog/postgres-aggregate-window-tuning" rel="noopener noreferrer"&gt;Aggregate &amp;amp; Window Function Tuning&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Index exists but isn't being used; function on indexed column&lt;/td&gt;
&lt;td&gt;WHERE clause shape&lt;/td&gt;
&lt;td&gt;&lt;a href="https://mydba.dev/blog/postgres-where-clause-optimization" rel="noopener noreferrer"&gt;WHERE Clause Optimisation&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Plan is "fine" but the query itself is the problem&lt;/td&gt;
&lt;td&gt;Query rewriting&lt;/td&gt;
&lt;td&gt;&lt;a href="https://mydba.dev/blog/postgres-query-rewriting-techniques" rel="noopener noreferrer"&gt;Query Rewriting Techniques&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;SELECT *&lt;/code&gt;, implicit casts, deep pagination, N+1 from ORM&lt;/td&gt;
&lt;td&gt;Anti-pattern&lt;/td&gt;
&lt;td&gt;&lt;a href="https://mydba.dev/blog/postgres-query-anti-patterns" rel="noopener noreferrer"&gt;Anti-Patterns &amp;amp; Common Mistakes&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Our example plan maps cleanly. A &lt;code&gt;Seq Scan&lt;/code&gt; with 400,000 rows removed by filter, sitting under an &lt;code&gt;ORDER BY ... LIMIT&lt;/code&gt; that can't exploit any existing index, is the textbook signature for the &lt;em&gt;Missing or wrong index&lt;/em&gt; category. The &lt;code&gt;Sort&lt;/code&gt; above it is solvable in the same stroke — a single partial index can eliminate both the scan and the sort.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3 — Apply the fix
&lt;/h2&gt;

&lt;p&gt;The fix is a partial index, with the non-filter columns tucked into &lt;code&gt;INCLUDE&lt;/code&gt; so the planner can serve the query from the index alone without touching the heap:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;CONCURRENTLY&lt;/span&gt; &lt;span class="n"&gt;idx_sim_bp_orders_pending_by_amount&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_amount_cents&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;INCLUDE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'pending'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four non-obvious choices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Partial index.&lt;/strong&gt; Only pending orders are indexed, because that's the only status the query cares about. A full index on &lt;code&gt;(status, total_amount_cents)&lt;/code&gt; would work too; it would contain roughly 5× more entries. Partial indexes only help queries whose WHERE clause implies the index's predicate — so if this dashboard later adds &lt;code&gt;WHERE status IN ('pending', 'processing')&lt;/code&gt;, the planner will skip this index.&lt;/li&gt;
&lt;li&gt;
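&lt;strong&gt;Sort direction in the index.&lt;/strong&gt; Specifying &lt;code&gt;total_amount_cents DESC&lt;/code&gt; means the planner can scan the btree forward and get rows in the needed order without an explicit &lt;code&gt;Sort&lt;/code&gt; node. For a single-column index this is mostly a statement of intent, since a btree can also be scanned backward; the direction starts to matter once the index has several columns with mixed sort orders.&lt;/li&gt;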
&lt;li&gt;
&lt;strong&gt;Tiebreaker.&lt;/strong&gt; In a real dashboard you'd almost always want a tiebreaker column — &lt;code&gt;ORDER BY total_amount_cents DESC&lt;/code&gt; isn't deterministic for ties, and two rows with equal totals would shuffle between pages; adding &lt;code&gt;, order_id DESC&lt;/code&gt; to both the index and the query fixes that.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Covering (&lt;code&gt;INCLUDE&lt;/code&gt;).&lt;/strong&gt; The SELECT list is satisfied entirely from index tuples, which lets the planner serve the query as an Index Only Scan without heap fetches. Index Only Scan also requires the visibility map to mark the relevant heap pages all-visible, so on a write-heavy table where autovacuum can't keep up, you may still see heap fetches even with a covering index.&lt;/li&gt;
&lt;/ul&gt;
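
&lt;p&gt;The tiebreaker variant described in the list above could look like the following. The index name is made up for this sketch, and &lt;code&gt;order_id&lt;/code&gt; moves from &lt;code&gt;INCLUDE&lt;/code&gt; into the key so the index order matches the new &lt;code&gt;ORDER BY&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical index name; same partial predicate as before.
CREATE INDEX CONCURRENTLY idx_sim_bp_orders_pending_by_amount_id
    ON sim_bp_orders (total_amount_cents DESC, order_id DESC)
    INCLUDE (user_id, created_at)
    WHERE status = 'pending';

-- The ORDER BY must list the same columns in the same directions,
-- otherwise the planner has to add a Sort node again.
SELECT order_id,
       user_id,
       total_amount_cents,
       created_at
FROM sim_bp_orders
WHERE status = 'pending'
ORDER BY total_amount_cents DESC, order_id DESC
LIMIT 50;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With &lt;code&gt;order_id&lt;/code&gt; as the second key, rows with equal totals come back in a stable order on every refresh.&lt;/p&gt;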

&lt;p&gt;&lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt; avoids the &lt;code&gt;SHARE&lt;/code&gt; lock that a plain &lt;code&gt;CREATE INDEX&lt;/code&gt; holds for the whole build. That lock allows reads but blocks inserts, updates, and deletes, so the concurrent form is what keeps normal writes flowing while the index builds. It still takes weaker locks (&lt;code&gt;SHARE UPDATE EXCLUSIVE&lt;/code&gt;) twice, waits for transactions that hold old snapshots on the target table to finish before advancing between phases, and runs two passes over the table. As a result it's slower than &lt;code&gt;CREATE INDEX&lt;/code&gt; in wall-clock time, and a single long-running transaction that has touched this table can stall the build indefinitely. On a 500k-row table the build takes seconds; on a 500M-row table it can take hours. The application stays up the whole time. A partial index on &lt;code&gt;status = 'pending'&lt;/code&gt; still pays write cost when rows are inserted into or updated out of that state, so if &lt;code&gt;pending&lt;/code&gt; is a high-churn status, weigh the read win against the write overhead.&lt;/p&gt;
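
&lt;p&gt;A concurrent build is worth watching while it runs. The queries below are a sketch using standard catalog and statistics views; they also cover one detail not mentioned above, namely that a failed or cancelled concurrent build leaves an invalid index behind, which has to be dropped before retrying.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- While the build runs: which phase is it in, and is it waiting on lockers?
SELECT pid, phase,
       lockers_done, lockers_total, current_locker_pid,
       blocks_done, blocks_total
FROM pg_stat_progress_create_index;

-- Oldest open transactions, the usual reason a concurrent build stalls.
SELECT pid, xact_start, state, left(query, 60) AS query_start
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 5;

-- After a failed or cancelled build: list leftover invalid indexes.
SELECT c.relname AS index_name
FROM pg_index AS i
JOIN pg_class AS c ON c.oid = i.indexrelid
WHERE NOT i.indisvalid;

-- Clean up before retrying (only if the index above is listed as invalid).
DROP INDEX CONCURRENTLY IF EXISTS idx_sim_bp_orders_pending_by_amount;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;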

&lt;h2&gt;
  
  
  Step 4 — Verify
&lt;/h2&gt;

&lt;p&gt;Same query, same data, index in place:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;Limit&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;021&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;031&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;Buffers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
  &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;  &lt;span class="k"&gt;Index&lt;/span&gt; &lt;span class="k"&gt;Only&lt;/span&gt; &lt;span class="n"&gt;Scan&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;idx_sim_bp_orders_pending_by_amount&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;sim_bp_orders&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;3544&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;68&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100300&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;021&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;026&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;Heap&lt;/span&gt; &lt;span class="n"&gt;Fetches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="n"&gt;Buffers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;shared&lt;/span&gt; &lt;span class="n"&gt;hit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
 &lt;span class="n"&gt;Planning&lt;/span&gt; &lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;186&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
 &lt;span class="n"&gt;Execution&lt;/span&gt; &lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;045&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;0.045 ms&lt;/strong&gt;, down from 50.9 ms: roughly 1,100× faster. Buffers dropped from 3,689 hit to 5 hit. The &lt;code&gt;Sort&lt;/code&gt; node is gone entirely because the index already stores entries in the required order. The &lt;code&gt;Filter&lt;/code&gt; line is gone because the partial index only contains rows that satisfy &lt;code&gt;status = 'pending'&lt;/code&gt;. &lt;code&gt;Heap Fetches: 0&lt;/code&gt; means the visibility map marked every heap page behind those index entries as all-visible, so PostgreSQL served all 50 tuples from the index without reading a single heap page.&lt;/p&gt;

&lt;p&gt;Two caveats on the headline number. First, &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt;'s &lt;code&gt;Execution Time&lt;/code&gt; measures server-side SQL execution only: it excludes network round-trip, client-side tuple deserialisation, and connection pool overhead. Real application latency for this query is probably closer to 2–10 ms depending on your region. Second, the measurement is on a hot cache with an immediately-post-vacuum visibility map; a colder cache would show &lt;code&gt;Buffers: shared read=N&lt;/code&gt; instead of all &lt;code&gt;hit&lt;/code&gt;. The more durable improvement is the ~700× drop in buffers touched (3,689 down to 5), because that's what translates into lower CPU under concurrency.&lt;/p&gt;
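
&lt;p&gt;Whether &lt;code&gt;Heap Fetches&lt;/code&gt; stays at zero depends on how much of the table the visibility map currently covers. A rough check, assuming the table name is unique on the &lt;code&gt;search_path&lt;/code&gt; (both counters are estimates maintained by &lt;code&gt;VACUUM&lt;/code&gt; and &lt;code&gt;ANALYZE&lt;/code&gt;, not live values):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- relallvisible close to relpages means index-only scans can skip the heap.
SELECT relname,
       relpages,
       relallvisible
FROM pg_class
WHERE relname = 'sim_bp_orders';

-- If coverage is low, a manual vacuum refreshes the visibility map
-- (and ANALYZE refreshes planner statistics in the same pass).
VACUUM (ANALYZE) sim_bp_orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;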

&lt;h2&gt;
  
  
  The eight categories
&lt;/h2&gt;

&lt;p&gt;The workflow above treats "spot the category" as a two-sentence step. In practice, each category has its own rules, exceptions, and non-obvious variants. The rest of this series is eight standalone articles, each diving into one category.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Reading EXPLAIN / EXPLAIN ANALYZE output
&lt;/h3&gt;

&lt;p&gt;Before you can optimise a plan, you have to be able to read one. EXPLAIN reports the planner's estimated plan; EXPLAIN ANALYZE executes the query and reports what actually happened. The deep dive covers every common node type, the meaning of &lt;code&gt;loops&lt;/code&gt;, &lt;code&gt;Buffers&lt;/code&gt;, &lt;code&gt;Memory&lt;/code&gt;, &lt;code&gt;Workers Planned vs Launched&lt;/code&gt;, and the five most common ways to misread a plan. → &lt;a href="https://mydba.dev/blog/postgres-explain-analyze-reading" rel="noopener noreferrer"&gt;Reading EXPLAIN / EXPLAIN ANALYZE Output&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Index usage and optimisation
&lt;/h3&gt;

&lt;p&gt;A large share of single-query OLTP latency problems come down to indexing — either missing, or present but not matching the query shape. But "add an index" understates what's actually required: choosing columns in the right order, deciding between full and partial indexes, using &lt;code&gt;INCLUDE&lt;/code&gt; for covering indexes, expression indexes for computed predicates, GIN/GiST/BRIN for the data types where btrees are wrong, and knowing when &lt;em&gt;not&lt;/em&gt; to add one. → &lt;a href="https://mydba.dev/blog/postgres-index-usage-optimization" rel="noopener noreferrer"&gt;Index Usage &amp;amp; Optimisation&lt;/a&gt;.&lt;/p&gt;
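
&lt;p&gt;To make the index variants above concrete, here is a small sketch. The &lt;code&gt;users&lt;/code&gt; table, its &lt;code&gt;email&lt;/code&gt; column, and the &lt;code&gt;metadata&lt;/code&gt; jsonb column are hypothetical and exist only for illustration; the index names are made up as well.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Expression index: matches predicates written as lower(email) = '...'.
-- (users and email are hypothetical.)
CREATE INDEX idx_users_email_lower
    ON users (lower(email));

-- GIN: containment queries on a jsonb column, e.g. metadata @&amp;gt; '{"gift": true}'.
-- (metadata is a hypothetical column.)
CREATE INDEX idx_sim_bp_orders_metadata
    ON sim_bp_orders USING gin (metadata);

-- BRIN: a very small index for a large, append-mostly table whose
-- created_at values roughly follow physical row order.
CREATE INDEX idx_sim_bp_orders_created_brin
    ON sim_bp_orders USING brin (created_at);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;On a busy table each of these would normally be built with &lt;code&gt;CONCURRENTLY&lt;/code&gt;, which is omitted here for brevity.&lt;/p&gt;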

&lt;h3&gt;
  
  
  3. Join optimisation
&lt;/h3&gt;

&lt;p&gt;The planner picks between Nested Loop, Hash Join, and Merge Join based on cost estimates. Each has a regime where it's best, and the worst joins are the ones using the wrong strategy — usually a Nested Loop on two large tables. → &lt;a href="https://mydba.dev/blog/postgres-join-optimization" rel="noopener noreferrer"&gt;Join Optimisation&lt;/a&gt;.&lt;/p&gt;
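
&lt;p&gt;One way to test whether a Nested Loop is the wrong strategy is to forbid it for a single transaction and compare plans. This is a diagnostic sketch only, not a production setting, and the &lt;code&gt;users&lt;/code&gt; table with its &lt;code&gt;email&lt;/code&gt; column is hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;BEGIN;

-- SET LOCAL reverts automatically at COMMIT or ROLLBACK.
SET LOCAL enable_nestloop = off;

EXPLAIN (ANALYZE, BUFFERS)
SELECT o.order_id, u.email
FROM sim_bp_orders AS o
JOIN users AS u ON u.user_id = o.user_id   -- users is hypothetical
WHERE o.status = 'pending';

ROLLBACK;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If the alternative plan is much faster, the usual root cause is a row-count misestimate on one side of the join, which is better fixed with statistics than by leaving the setting off.&lt;/p&gt;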

&lt;h3&gt;
  
  
  4. Subquery and CTE optimisation
&lt;/h3&gt;

&lt;p&gt;PostgreSQL 12 changed CTE semantics: a CTE that used to always be materialised is now inlined into the parent query by default when it is non-recursive, free of side effects, and referenced only once, unless you ask for materialisation explicitly with &lt;code&gt;MATERIALIZED&lt;/code&gt;. That change made many old "CTE as optimisation fence" tricks silently stop working. → &lt;a href="https://mydba.dev/blog/postgres-subquery-cte-optimization" rel="noopener noreferrer"&gt;Subquery &amp;amp; CTE Optimisation&lt;/a&gt;.&lt;/p&gt;
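
&lt;p&gt;The two behaviours can be requested explicitly on PostgreSQL 12 and later. A minimal sketch against the running example table; the &lt;code&gt;user_id&lt;/code&gt; value is arbitrary:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Fence: the CTE is computed in full first, then filtered by the outer query.
WITH pending AS MATERIALIZED (
    SELECT order_id, user_id, total_amount_cents
    FROM sim_bp_orders
    WHERE status = 'pending'
)
SELECT order_id, total_amount_cents
FROM pending
WHERE user_id = 42;

-- Inlined: the outer predicate can be pushed down into the scan,
-- so an index on user_id becomes usable.
WITH pending AS NOT MATERIALIZED (
    SELECT order_id, user_id, total_amount_cents
    FROM sim_bp_orders
    WHERE status = 'pending'
)
SELECT order_id, total_amount_cents
FROM pending
WHERE user_id = 42;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In &lt;code&gt;EXPLAIN&lt;/code&gt;, the first form shows a &lt;code&gt;CTE Scan&lt;/code&gt; node; the second does not.&lt;/p&gt;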

&lt;h3&gt;
  
  
  5. Aggregate and window function tuning
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;GROUP BY&lt;/code&gt; and window functions look declarative, but the planner has strong opinions about how to execute them: HashAggregate versus GroupAggregate, partial and parallel aggregation, window frame optimisation. Sorts and hashes that spill to disk are almost always the visible symptom, and &lt;code&gt;work_mem&lt;/code&gt; is almost always the knob. → &lt;a href="https://mydba.dev/blog/postgres-aggregate-window-tuning" rel="noopener noreferrer"&gt;Aggregate &amp;amp; Window Function Tuning&lt;/a&gt;.&lt;/p&gt;
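
&lt;p&gt;A cautious way to test the &lt;code&gt;work_mem&lt;/code&gt; hypothesis is to raise it for one transaction only and compare the plan. The &lt;code&gt;64MB&lt;/code&gt; value and the aggregate query are illustrative assumptions, not recommendations.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;BEGIN;

-- Applies to this transaction only; each sort or hash node may use
-- up to this much memory, so keep the global setting conservative.
SET LOCAL work_mem = '64MB';

EXPLAIN (ANALYZE, BUFFERS)
SELECT user_id,
       sum(total_amount_cents) AS total_cents
FROM sim_bp_orders
GROUP BY user_id;

ROLLBACK;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A spill shows up as &lt;code&gt;Disk Usage&lt;/code&gt; on a &lt;code&gt;HashAggregate&lt;/code&gt; node or &lt;code&gt;Sort Method: external merge&lt;/code&gt; on a &lt;code&gt;Sort&lt;/code&gt; node; if those disappear at the higher setting, memory was the constraint.&lt;/p&gt;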

&lt;h3&gt;
  
  
  6. WHERE clause optimisation
&lt;/h3&gt;

&lt;p&gt;An index is only useful if the WHERE clause is &lt;em&gt;sargable&lt;/em&gt;. Wrapping an indexed column in a function (&lt;code&gt;lower(email) = '...'&lt;/code&gt;), comparing it with a value of another type so that the column itself is cast (&lt;code&gt;integer_column = 123.0&lt;/code&gt; is evaluated as a &lt;code&gt;numeric&lt;/code&gt; comparison), or putting arithmetic on the column side of an operator (&lt;code&gt;created_at + interval '1 day' &amp;gt; now()&lt;/code&gt;) all silently disable indexes that look like they should apply. → &lt;a href="https://mydba.dev/blog/postgres-where-clause-optimization" rel="noopener noreferrer"&gt;WHERE Clause Optimisation&lt;/a&gt;.&lt;/p&gt;
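
&lt;p&gt;As an illustration of a sargable rewrite, the sketch below filters one month of orders two ways, using the existing btree index on &lt;code&gt;created_at&lt;/code&gt;. The dates are arbitrary.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Non-sargable: the function wraps the column, so a plain btree on
-- created_at cannot be used for this predicate.
SELECT count(*)
FROM sim_bp_orders
WHERE date_trunc('month', created_at) = '2026-04-01';

-- Sargable: the same rows, expressed as a half-open range on the bare column.
SELECT count(*)
FROM sim_bp_orders
WHERE created_at &amp;gt;= '2026-04-01'
  AND created_at &amp;lt;  '2026-05-01';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The half-open range also avoids the off-by-one problems that come with &lt;code&gt;BETWEEN&lt;/code&gt; on timestamps.&lt;/p&gt;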

&lt;h3&gt;
  
  
  7. Query rewriting techniques
&lt;/h3&gt;

&lt;p&gt;Sometimes the plan is "fine" but the query itself is asking the wrong question. Correlated subqueries can usually become lateral joins; &lt;code&gt;NOT IN&lt;/code&gt; with NULLs should be &lt;code&gt;NOT EXISTS&lt;/code&gt;; offset pagination past a few hundred pages should be keyset pagination; &lt;code&gt;DISTINCT&lt;/code&gt; over a large set is often &lt;code&gt;GROUP BY&lt;/code&gt; in disguise. → &lt;a href="https://mydba.dev/blog/postgres-query-rewriting-techniques" rel="noopener noreferrer"&gt;Query Rewriting Techniques&lt;/a&gt;.&lt;/p&gt;
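
&lt;p&gt;Two of these rewrites, sketched against hypothetical &lt;code&gt;customers&lt;/code&gt; and &lt;code&gt;orders&lt;/code&gt; tables. The column names and the cursor values in the second query are placeholders, not values from a real system:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- NOT IN returns no rows at all if the subquery yields a single NULL.
-- NOT EXISTS has no such trap and can be planned as an anti-join.
SELECT c.id
FROM customers c
WHERE NOT EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.id
);

-- Keyset pagination: continue after the last row of the previous page
-- instead of reading and discarding OFFSET rows.
SELECT id, created_at
FROM orders
WHERE (created_at, id) &amp;lt; ('2026-01-15 10:00:00+00', 12345)
ORDER BY created_at DESC, id DESC
LIMIT 50;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The keyset form relies on an index on &lt;code&gt;(created_at, id)&lt;/code&gt; to stay fast at any page depth.&lt;/p&gt;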

&lt;h3&gt;
  
  
  8. Anti-patterns and common mistakes
&lt;/h3&gt;

&lt;p&gt;The final category is queries that are wrong by construction: &lt;code&gt;SELECT *&lt;/code&gt; in hot paths, implicit type casts that silently disable indexes, missing &lt;code&gt;LIMIT&lt;/code&gt; on exploratory joins, N+1 patterns coming out of ORMs, inserting one row at a time instead of batching. → &lt;a href="https://mydba.dev/blog/postgres-query-anti-patterns" rel="noopener noreferrer"&gt;Anti-Patterns &amp;amp; Common Mistakes&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to start
&lt;/h2&gt;

&lt;p&gt;If you can already read EXPLAIN confidently, the highest-value articles are probably &lt;strong&gt;Index Usage&lt;/strong&gt; and &lt;strong&gt;Query Rewriting&lt;/strong&gt;, because those are where the largest wins hide. If reading the plans in this article felt like work, start with &lt;strong&gt;Reading EXPLAIN&lt;/strong&gt; and come back here.&lt;/p&gt;

&lt;p&gt;Slow queries are not mysterious. They fall into a small number of categories, each with a characteristic plan signature and a well-understood fix. Learn to recognise the signatures and most of the rest follows.&lt;/p&gt;





&lt;p&gt;Canonical version with the full series linked: &lt;a href="https://mydba.dev/blog/postgres-query-analysis-complete-guide" rel="noopener noreferrer"&gt;https://mydba.dev/blog/postgres-query-analysis-complete-guide&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>performance</category>
      <category>postgres</category>
      <category>sql</category>
    </item>
    <item>
      <title>PostgreSQL Parallel Query: Configuration &amp; Performance Tuning</title>
      <dc:creator>Philip McClarence</dc:creator>
      <pubDate>Wed, 22 Apr 2026 10:00:02 +0000</pubDate>
      <link>https://dev.to/philip_mcclarence_2ef9475/postgresql-parallel-query-configuration-performance-tuning-1oih</link>
      <guid>https://dev.to/philip_mcclarence_2ef9475/postgresql-parallel-query-configuration-performance-tuning-1oih</guid>
      <description>&lt;h1&gt;
  
  
  PostgreSQL Parallel Query: Configuration &amp;amp; Performance Tuning
&lt;/h1&gt;

&lt;p&gt;Your analytical query scans a 50 GB table, aggregates 200 million rows, and takes 25 seconds. Your server has 16 CPU cores. PostgreSQL uses... 3 of them at most: the leader plus 2 workers. The other 13 sit idle. The &lt;code&gt;max_parallel_workers_per_gather&lt;/code&gt; default of 2 is leaving most of the available speedup on the table. Let's fix that -- and understand when you should not.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Parallel Query Works
&lt;/h2&gt;

&lt;p&gt;PostgreSQL divides large operations across multiple CPU cores. Worker processes each scan a portion of the data and feed their results through a Gather node to the leader process, which combines them. Sequential scans, hash joins, aggregates, and B-tree index scans all support parallel execution.&lt;/p&gt;

&lt;p&gt;The key defaults:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;max_parallel_workers_per_gather = 2&lt;/code&gt; -- max workers per parallel operation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max_parallel_workers = 8&lt;/code&gt; -- total parallel workers across all sessions&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max_worker_processes = 8&lt;/code&gt; -- total background workers (shared with other subsystems)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;min_parallel_table_scan_size = 8MB&lt;/code&gt; -- minimum table size for parallel scan&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;parallel_setup_cost = 1000&lt;/code&gt; -- planner's estimate for starting a worker&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;parallel_tuple_cost = 0.1&lt;/code&gt; -- per-tuple transfer cost estimate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are tuned for general-purpose workloads. For analytical queries on large tables, they're far too conservative.&lt;/p&gt;

&lt;h2&gt;
  
  
  Detecting the Problem
&lt;/h2&gt;

&lt;p&gt;Check your current settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;setting&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;short_desc&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_settings&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%parallel%'&lt;/span&gt;
   &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'max_worker_processes'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check whether queries actually use parallel workers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ANALYZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;VERBOSE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BUFFERS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2026-01-01'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_region&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Gather&lt;/code&gt; or &lt;code&gt;Gather Merge&lt;/code&gt; -- parallel execution is happening&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Workers Planned: 2&lt;/code&gt; and &lt;code&gt;Workers Launched: 2&lt;/code&gt; -- how many workers&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Workers Launched &amp;lt; Workers Planned&lt;/code&gt; -- system ran out of workers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check if workers are being exhausted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Currently active parallel workers&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;active_parallel_workers&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_activity&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;backend_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'parallel worker'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- The limit&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;setting&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;max_parallel_workers&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_settings&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'max_parallel_workers'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If active workers frequently approach the limit, queries are competing for workers and some run with fewer than planned.&lt;/p&gt;
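
&lt;p&gt;The two checks above can be combined into one query. This is a sketch that assumes PostgreSQL 13 or later for the &lt;code&gt;leader_pid&lt;/code&gt; column; the column aliases are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Pool usage: active parallel workers against the server-wide limit
SELECT
    (SELECT count(*)
     FROM pg_stat_activity
     WHERE backend_type = 'parallel worker') AS active_parallel_workers,
    current_setting('max_parallel_workers')::int AS max_parallel_workers;

-- Which leader sessions currently hold the workers (leader_pid: PostgreSQL 13+)
SELECT leader_pid, count(*) AS workers
FROM pg_stat_activity
WHERE backend_type = 'parallel worker'
GROUP BY leader_pid
ORDER BY workers DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Sampling this every few seconds during peak load gives a rough picture of how close the pool runs to its limit.&lt;/p&gt;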

&lt;h2&gt;
  
  
  Tuning for Analytical Workloads
&lt;/h2&gt;

&lt;p&gt;If your database runs analytical queries on large tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- More workers per query&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;max_parallel_workers_per_gather&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- More total workers&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;max_parallel_workers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Enough background worker slots&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;max_worker_processes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Lower table size threshold&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;min_parallel_table_scan_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'1MB'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Apply (max_worker_processes requires restart)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pg_reload_conf&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rule of thumb: &lt;code&gt;max_parallel_workers_per_gather&lt;/code&gt; = half your CPU cores, &lt;code&gt;max_parallel_workers&lt;/code&gt; = total cores. On a 16-core server: 8 and 16 respectively.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lower Cost Thresholds
&lt;/h3&gt;

&lt;p&gt;If medium-sized tables aren't getting parallelized despite adequate configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;parallel_setup_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;    &lt;span class="c1"&gt;-- default: 1000&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;parallel_tuple_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="c1"&gt;-- default: 0.1&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pg_reload_conf&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lower &lt;code&gt;parallel_setup_cost&lt;/code&gt; makes the planner consider parallelism for smaller operations. Lower &lt;code&gt;parallel_tuple_cost&lt;/code&gt; makes parallel plans look cheaper.&lt;/p&gt;
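
&lt;p&gt;Before changing these server-wide, it can help to try the values in a single session and compare plans. A minimal sketch, reusing the &lt;code&gt;orders&lt;/code&gt; query from earlier:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Session-only: nothing here touches postgresql.auto.conf
SET parallel_setup_cost = 100;
SET parallel_tuple_cost = 0.01;

-- Look for a Gather node that was absent with the default costs
EXPLAIN
SELECT customer_region, count(*)
FROM orders
GROUP BY customer_region;

-- Return this session to the configured values
RESET parallel_setup_cost;
RESET parallel_tuple_cost;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;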

&lt;h3&gt;
  
  
  Per-Session Overrides
&lt;/h3&gt;

&lt;p&gt;For mixed workloads, set parallelism based on the connection type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Reporting query: maximum parallelism&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;max_parallel_workers_per_gather&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;parallel_setup_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;parallel_tuple_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_region&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- OLTP session: disable parallelism&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;max_parallel_workers_per_gather&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
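
&lt;p&gt;If reporting and application traffic connect as different database roles, the same split can be attached to the role so that every new session picks it up automatically. A sketch, assuming hypothetical roles named &lt;code&gt;reporting_user&lt;/code&gt; and &lt;code&gt;app_user&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Applied at login for every new session of each role
ALTER ROLE reporting_user SET max_parallel_workers_per_gather = 8;
ALTER ROLE app_user SET max_parallel_workers_per_gather = 0;

-- Inspect which per-role overrides are in place
SELECT r.rolname, s.setconfig
FROM pg_db_role_setting s
JOIN pg_roles r ON r.oid = s.setrole;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Role-level settings only affect sessions opened after the change, so existing pooled connections need to be recycled.&lt;/p&gt;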



&lt;h3&gt;
  
  
  Per-Table Settings
&lt;/h3&gt;

&lt;p&gt;For critical large tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Request 8 workers for scans on this table (a planner hint, not a guarantee)&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parallel_workers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



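&lt;p&gt;This overrides the planner's automatic, size-based worker count calculation. The result is still capped by &lt;code&gt;max_parallel_workers_per_gather&lt;/code&gt;, and fewer workers may launch if the worker pool is exhausted.&lt;/p&gt;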
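
&lt;p&gt;To see whether a table carries such an override, and to remove it again, something along these lines should work (shown for the same &lt;code&gt;orders&lt;/code&gt; table):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Show the storage parameters currently set on the table
SELECT relname, reloptions
FROM pg_class
WHERE relname = 'orders';

-- Go back to the planner's size-based calculation
ALTER TABLE orders RESET (parallel_workers);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;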

&lt;h2&gt;
  
  
  What Parallelizes (and What Doesn't)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Parallel?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Sequential Scan&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B-tree Index Scan&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bitmap Heap Scan&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hash Join&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Merge Join&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nested Loop&lt;/td&gt;
&lt;td&gt;Yes (outer side)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aggregate (count, sum, avg)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CREATE INDEX (B-tree)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Append (UNION ALL, partitions)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UPDATE, DELETE&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CTE scans (materialised WITH queries)&lt;/td&gt;
&lt;td&gt;No (leader only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cursors&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FOR UPDATE/SHARE&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Memory Multiplication Trap
&lt;/h2&gt;

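&lt;p&gt;&lt;code&gt;work_mem&lt;/code&gt; applies per process, for each sort or hash operation. A query with &lt;code&gt;work_mem = 256MB&lt;/code&gt; and 4 parallel workers runs in 5 processes, because the leader participates too, so a single sort or hash step can consume 5 × 256 MB = 1,280 MB (roughly 1.28 GB). Budget accordingly:&lt;br&gt;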
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;max_connections * (max_parallel_workers_per_gather + 1) * work_mem &amp;lt; available RAM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This catches people who increase parallelism without accounting for memory.&lt;/p&gt;
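
&lt;p&gt;The budget can be estimated from the live settings. The query below is a back-of-the-envelope sketch, not an exact accounting: it assumes one sort or hash node per query, counts the leader as an extra process, and the column aliases are made up for the example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT
    current_setting('work_mem') AS work_mem,
    -- workers plus the leader
    current_setting('max_parallel_workers_per_gather')::int + 1 AS processes_per_query,
    -- one sort or hash node in one parallel query
    pg_size_pretty(
        pg_size_bytes(current_setting('work_mem'))
        * (current_setting('max_parallel_workers_per_gather')::int + 1)
    ) AS per_node_worst_case,
    -- every connection running such a query at the same time
    pg_size_pretty(
        pg_size_bytes(current_setting('work_mem'))
        * (current_setting('max_parallel_workers_per_gather')::int + 1)
        * current_setting('max_connections')::int
    ) AS all_connections_worst_case;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Plans with several sort or hash nodes can exceed the per-node figure, so treat the result as a lower bound on the worst case.&lt;/p&gt;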

&lt;h2&gt;
  
  
  Verify the Impact
&lt;/h2&gt;

&lt;p&gt;Compare sequential vs parallel execution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Disable parallelism&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;max_parallel_workers_per_gather&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ANALYZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BUFFERS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TIMING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_region&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Enable parallelism&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;max_parallel_workers_per_gather&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ANALYZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BUFFERS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TIMING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_region&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the ideal case the parallel plan takes roughly &lt;code&gt;sequential_time / (1 + num_workers)&lt;/code&gt;, since the leader also processes rows. In practice, 60-80% of that theoretical speedup is typical because of worker startup and Gather overhead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prevention Strategy
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;OLAP databases&lt;/strong&gt;: aggressive parallelism. &lt;code&gt;max_parallel_workers_per_gather = CPU_cores / 2&lt;/code&gt;, &lt;code&gt;max_parallel_workers = CPU_cores&lt;/code&gt;, lower cost thresholds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OLTP databases&lt;/strong&gt;: keep defaults or disable. Many short concurrent queries don't benefit -- worker overhead exceeds speedup on small queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mixed workloads&lt;/strong&gt;: per-connection settings. Reporting connections get high parallelism. App connections get zero.&lt;/p&gt;

&lt;p&gt;Monitor &lt;code&gt;Workers Launched&lt;/code&gt; vs &lt;code&gt;Workers Planned&lt;/code&gt;. Consistent shortfall means you need more &lt;code&gt;max_parallel_workers&lt;/code&gt;. If CPU hits 100% during parallel queries and other sessions slow down, reduce &lt;code&gt;max_parallel_workers_per_gather&lt;/code&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://mydba.dev/blog/postgres-parallel-query" rel="noopener noreferrer"&gt;mydba.dev/blog/postgres-parallel-query&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>performance</category>
      <category>postgres</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>PostgreSQL Point-in-Time Recovery with pgBackRest</title>
      <dc:creator>Philip McClarence</dc:creator>
      <pubDate>Tue, 21 Apr 2026 10:00:03 +0000</pubDate>
      <link>https://dev.to/philip_mcclarence_2ef9475/postgresql-point-in-time-recovery-with-pgbackrest-1cg6</link>
      <guid>https://dev.to/philip_mcclarence_2ef9475/postgresql-point-in-time-recovery-with-pgbackrest-1cg6</guid>
      <description>&lt;h1&gt;
  
  
  PostgreSQL Point-in-Time Recovery with pgBackRest
&lt;/h1&gt;

&lt;p&gt;&lt;code&gt;pg_dump&lt;/code&gt; gives you a snapshot at the moment you ran it. If your last dump was 6 hours ago and someone accidentally deletes a production table, those 6 hours are gone. Even with hourly dumps, you lose everything between the last dump and the incident. For a database processing thousands of transactions per minute, that gap is devastating. Point-in-time recovery (PITR) eliminates that gap -- restoring your database to any specific second by replaying the write-ahead log on top of a base backup.&lt;/p&gt;

&lt;h2&gt;
  
  
  How PITR Works
&lt;/h2&gt;

&lt;p&gt;Two mechanisms combine:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Base backups&lt;/strong&gt; -- periodic snapshots of all database files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WAL archiving&lt;/strong&gt; -- continuous streaming of every WAL segment to a backup repository&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The WAL records every change made to the database. By replaying WAL segments from a base backup forward to a target timestamp, you reconstruct the exact state at that moment. If the last archived WAL is 30 seconds old, your maximum data loss is 30 seconds -- not 6 hours.&lt;/p&gt;

&lt;p&gt;pgBackRest is the standard tool for this. It handles base backups (full, incremental, differential), WAL archiving, retention, verification, and recovery -- with parallel compression, encryption, and remote repository support.&lt;/p&gt;

&lt;h2&gt;
  
  
  Detecting Whether You're Protected
&lt;/h2&gt;

&lt;p&gt;Check if WAL archiving is even enabled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;setting&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_settings&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s1"&gt;'archive_mode'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'archive_command'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'archive_timeout'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'wal_level'&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You need &lt;code&gt;archive_mode = on&lt;/code&gt; and &lt;code&gt;wal_level = replica&lt;/code&gt; (or &lt;code&gt;logical&lt;/code&gt;). If &lt;code&gt;archive_mode&lt;/code&gt; is &lt;code&gt;off&lt;/code&gt;, PITR is impossible.&lt;/p&gt;

&lt;p&gt;Check for archiving failures:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;archived_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;failed_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;last_archived_wal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;last_archived_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;last_failed_wal&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;last_failed_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;last_archived_time&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;archive_lag&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_archiver&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A &lt;code&gt;failed_count&lt;/code&gt; that keeps rising (the counter is cumulative since the last stats reset) or an &lt;code&gt;archive_lag&lt;/code&gt; greater than a few minutes means the pipeline is broken. WAL segments are accumulating on the primary and will eventually fill the disk.&lt;/p&gt;
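
&lt;p&gt;The same checks can be turned into flags for a monitoring job. This is only a sketch: the 5-minute threshold and the column aliases are assumptions to adjust, and &lt;code&gt;pg_ls_waldir()&lt;/code&gt; is by default restricted to superusers and members of &lt;code&gt;pg_monitor&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Is the most recent archive attempt a failure, or is the archive stale?
SELECT
    failed_count,
    last_failed_time IS NOT NULL
        AND (last_archived_time IS NULL
             OR last_failed_time &amp;gt; last_archived_time) AS last_attempt_failed,
    now() - last_archived_time &amp;gt; interval '5 minutes' AS archive_stale
FROM pg_stat_archiver;

-- How much WAL is currently sitting in pg_wal on the primary
SELECT count(*) AS wal_files,
       pg_size_pretty(sum(size)) AS wal_size
FROM pg_ls_waldir();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;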

&lt;p&gt;Verify backup freshness:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pgbackrest info &lt;span class="nt"&gt;--stanza&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;main
pgbackrest verify &lt;span class="nt"&gt;--stanza&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Setting Up pgBackRest
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Install and Configure
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Debian/Ubuntu&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nb"&gt;install &lt;/span&gt;pgbackrest

&lt;span class="c"&gt;# RHEL/Rocky&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;dnf &lt;span class="nb"&gt;install &lt;/span&gt;pgbackrest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create &lt;code&gt;/etc/pgbackrest/pgbackrest.conf&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[main]&lt;/span&gt;
&lt;span class="py"&gt;pg1-path&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/var/lib/postgresql/18/main&lt;/span&gt;

&lt;span class="nn"&gt;[global]&lt;/span&gt;
&lt;span class="py"&gt;repo1-path&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/var/lib/pgbackrest&lt;/span&gt;
&lt;span class="py"&gt;repo1-retention-full&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;2&lt;/span&gt;
&lt;span class="py"&gt;repo1-retention-diff&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;7&lt;/span&gt;
&lt;span class="py"&gt;repo1-cipher-type&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;aes-256-cbc&lt;/span&gt;
&lt;span class="py"&gt;repo1-cipher-pass&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;your-secure-encryption-passphrase&lt;/span&gt;

&lt;span class="py"&gt;process-max&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;4&lt;/span&gt;
&lt;span class="py"&gt;compress-type&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;zst&lt;/span&gt;
&lt;span class="py"&gt;compress-level&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;6&lt;/span&gt;

&lt;span class="py"&gt;log-level-console&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;info&lt;/span&gt;
&lt;span class="py"&gt;log-level-file&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;detail&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Configure PostgreSQL
&lt;/h3&gt;

&lt;p&gt;Add to &lt;code&gt;postgresql.conf&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="n"&gt;wal_level&lt;/span&gt; = &lt;span class="n"&gt;replica&lt;/span&gt;
&lt;span class="n"&gt;archive_mode&lt;/span&gt; = &lt;span class="n"&gt;on&lt;/span&gt;
&lt;span class="n"&gt;archive_command&lt;/span&gt; = &lt;span class="s1"&gt;'pgbackrest --stanza=main archive-push %p'&lt;/span&gt;
&lt;span class="n"&gt;archive_timeout&lt;/span&gt; = &lt;span class="m"&gt;60&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



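&lt;p&gt;The &lt;code&gt;archive_timeout = 60&lt;/code&gt; forces a WAL switch every 60 seconds even if the segment isn't full, as long as there has been write activity. This caps maximum data loss at roughly 60 seconds, plus the time it takes to push the segment to the repository.&lt;/p&gt;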

&lt;p&gt;Restart PostgreSQL, then initialize the stanza:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pgbackrest &lt;span class="nt"&gt;--stanza&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;main stanza-create
pgbackrest &lt;span class="nt"&gt;--stanza&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;main check
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
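
&lt;p&gt;The archive path can also be exercised by hand from SQL once the stanza exists. A small sketch; note that &lt;code&gt;pg_switch_wal()&lt;/code&gt; needs superuser rights by default and does nothing if no WAL has been written since the last switch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Close the current WAL segment so it becomes eligible for archiving
SELECT pg_switch_wal();

-- A few seconds later, last_archived_time should be recent
SELECT last_archived_wal, last_archived_time, failed_count
FROM pg_stat_archiver;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;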



&lt;h3&gt;
  
  
  Schedule Backups
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Full backup (weekly)&lt;/span&gt;
pgbackrest &lt;span class="nt"&gt;--stanza&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;main &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;full backup

&lt;span class="c"&gt;# Differential (daily -- changes since last full)&lt;/span&gt;
pgbackrest &lt;span class="nt"&gt;--stanza&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;main &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;diff backup

&lt;span class="c"&gt;# Incremental (every 6 hours -- changes since the last backup of any type)&lt;/span&gt;
pgbackrest &lt;span class="nt"&gt;--stanza&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;main &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;incr backup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cron schedule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0 2 * * 0  pgbackrest --stanza=main --type=full backup
0 2 * * 1-6  pgbackrest --stanza=main --type=diff backup
0 */6 * * *  pgbackrest --stanza=main --type=incr backup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Performing Recovery
&lt;/h2&gt;

&lt;p&gt;When disaster strikes, restore to a specific timestamp:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl stop postgresql

pgbackrest &lt;span class="nt"&gt;--stanza&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;main &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"2026-02-28 14:30:00+00"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--target-action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;promote &lt;span class="se"&gt;\&lt;/span&gt;
    restore

&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl start postgresql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



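&lt;p&gt;Set &lt;code&gt;--target&lt;/code&gt; to a moment just before the incident. &lt;code&gt;--target-action=promote&lt;/code&gt; ends recovery at the target and opens the database for read-write. pgBackRest refuses to restore into a non-empty data directory, so either clear the directory first or add &lt;code&gt;--delta&lt;/code&gt; to the restore command.&lt;/p&gt;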

&lt;p&gt;You can also restore to a named restore point:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create before a risky operation&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pg_create_restore_point&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'before_schema_migration'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pgbackrest &lt;span class="nt"&gt;--stanza&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;main &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;name &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"before_schema_migration"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--target-action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;promote &lt;span class="se"&gt;\&lt;/span&gt;
    restore
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Test Recovery Regularly
&lt;/h2&gt;

&lt;p&gt;This is the most critical step. Schedule monthly recovery tests into a separate server or data directory, never over the production cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pgbackrest &lt;span class="nt"&gt;--stanza&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;main &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"2026-02-28 12:00:00+00"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--target-action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;promote &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--pg1-path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/var/lib/postgresql/18/test_recovery &lt;span class="se"&gt;\&lt;/span&gt;
    restore
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify the restored database contains expected data at the target timestamp. If recovery fails, fix the configuration before you need it in an emergency. Record the actual recovery time -- that's your real RTO.&lt;/p&gt;
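
&lt;p&gt;As a starting point for that verification, here is a minimal sketch using two built-in functions. Run on the restored test instance, they report whether recovery has finished and the commit time of the last transaction that was replayed, which should fall just before the target timestamp:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Run on the restored test instance, not on production
SELECT
    pg_is_in_recovery()             AS still_recovering,      -- false once promoted
    pg_last_xact_replay_timestamp() AS last_replayed_commit;  -- stays fixed after recovery ends
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This only confirms where replay stopped. Application-level checks (row counts, latest timestamps in key tables) still have to be written for your own schema.&lt;/p&gt;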

&lt;h2&gt;
  
  
  Prevention
&lt;/h2&gt;

&lt;p&gt;Build PITR into infrastructure from day one. Every production PostgreSQL database should have WAL archiving before its first production write.&lt;/p&gt;

&lt;p&gt;Monitor three metrics continuously:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Archive lag&lt;/strong&gt; -- alert if &amp;gt; 5 minutes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failed archive count&lt;/strong&gt; -- any non-zero value requires investigation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backup age&lt;/strong&gt; -- alert if exceeding your backup interval + buffer&lt;/li&gt;
&lt;/ol&gt;
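
&lt;p&gt;The first two metrics can be read from the built-in &lt;code&gt;pg_stat_archiver&lt;/code&gt; view. The query below is a minimal sketch; the alert thresholds and how you wire it into monitoring are left as assumptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT
    archived_count,
    last_archived_wal,
    now() - last_archived_time AS time_since_last_archive,  -- archive lag
    failed_count,                                           -- cumulative since stats_reset
    last_failed_wal,
    last_failed_time
FROM pg_stat_archiver;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because &lt;code&gt;failed_count&lt;/code&gt; is cumulative, alerting on an increase between checks is usually more useful than alerting on the raw value. For backup age, &lt;code&gt;pgbackrest --stanza=main info&lt;/code&gt; lists the backups it knows about and when each one finished.&lt;/p&gt;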

&lt;p&gt;An untested backup is not a backup. Test quarterly at minimum. Document the procedure, the expected recovery time, and who executes it. Run end-to-end: restore, replay WAL, verify data, record duration.&lt;/p&gt;

&lt;p&gt;Store backups off-host. A backup on the same disk is destroyed by the same failure. Use S3, Azure Blob, GCS, or a separate server. Enable encryption.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://mydba.dev/blog/postgres-point-in-time-recovery" rel="noopener noreferrer"&gt;mydba.dev/blog/postgres-point-in-time-recovery&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>devops</category>
      <category>postgres</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>PostgreSQL BRIN Indexes: When &amp; How to Use Block Range Indexes</title>
      <dc:creator>Philip McClarence</dc:creator>
      <pubDate>Mon, 20 Apr 2026 10:00:03 +0000</pubDate>
      <link>https://dev.to/philip_mcclarence_2ef9475/postgresql-brin-indexes-when-how-to-use-block-range-indexes-3g6d</link>
      <guid>https://dev.to/philip_mcclarence_2ef9475/postgresql-brin-indexes-when-how-to-use-block-range-indexes-3g6d</guid>
      <description>&lt;h1&gt;
  
  
  PostgreSQL BRIN Indexes: When &amp;amp; How to Use Block Range Indexes
&lt;/h1&gt;

&lt;p&gt;You have a 500-million-row events table. The B-tree index on &lt;code&gt;created_at&lt;/code&gt; consumes 12 GB. Every insert must update that 12 GB index. Backups include 12 GB of index data. The buffer cache is full of index pages. And all you ever do is range queries: "give me events from last week." There's a better way. A BRIN index on the same column would typically be measured in megabytes -- not gigabytes -- and for this query pattern it can perform nearly as well.&lt;/p&gt;

&lt;h2&gt;
  
  
  How BRIN Works
&lt;/h2&gt;

&lt;p&gt;Instead of indexing every individual row (like B-tree), BRIN stores the minimum and maximum values for ranges of consecutive physical blocks. The default is 128 pages (~1 MB of table data) per range entry.&lt;/p&gt;
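
&lt;p&gt;To get a feel for how few entries that is, the sketch below estimates the number of block ranges a BRIN index would hold. It assumes the table is named &lt;code&gt;events&lt;/code&gt; and the default of 128 pages per range:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- One BRIN entry (a min/max pair) per block range
SELECT
    pg_size_pretty(pg_relation_size('events')) AS table_size,
    ceil(pg_relation_size('events')
         / (128.0 * current_setting('block_size')::int)) AS brin_ranges;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each range stores one small summary, compared with one B-tree entry per row. That difference is where the size saving comes from.&lt;/p&gt;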

&lt;p&gt;To find rows where &lt;code&gt;created_at = '2026-01-15'&lt;/code&gt;, PostgreSQL reads the BRIN index, identifies which block ranges &lt;em&gt;could&lt;/em&gt; contain that date (any range where min &amp;lt;= '2026-01-15' &amp;lt;= max), and scans only those ranges. Block ranges that can't contain the target value are skipped entirely.&lt;/p&gt;

&lt;p&gt;The trade-off is precision. B-tree points to exact rows. BRIN points to block ranges that &lt;em&gt;might&lt;/em&gt; contain matching rows, then scans those blocks sequentially. This is fine when matching rows are clustered together (time-series data), but terrible when values are scattered randomly across the table.&lt;/p&gt;

&lt;h2&gt;
  
  
  When BRIN Works (and When It Doesn't)
&lt;/h2&gt;

&lt;p&gt;The key metric is &lt;strong&gt;physical correlation&lt;/strong&gt; -- how closely the column values track with the physical row position on disk.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Check correlation for candidate columns&lt;/span&gt;
&lt;span class="c1"&gt;-- Values close to 1.0 or -1.0 = good for BRIN&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;attname&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;column_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;correlation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_distinct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;null_frac&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stats&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;schemaname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'public'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;tablename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'events'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;attname&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'created_at'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'event_id'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'user_id'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;correlation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Absolute value above 0.9&lt;/strong&gt;: ideal for BRIN. Matching rows are tightly clustered in a few block ranges.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Absolute value 0.7 to 0.9&lt;/strong&gt;: can still benefit, but more false-positive blocks are scanned.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Absolute value below 0.7&lt;/strong&gt;: BRIN will scan too many irrelevant blocks. Use B-tree.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The ideal BRIN candidate:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Characteristic&lt;/th&gt;
&lt;th&gt;Why It Matters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Append-only or mostly-append&lt;/td&gt;
&lt;td&gt;Physical order matches logical order&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time-series or log data&lt;/td&gt;
&lt;td&gt;Timestamp correlates with insertion order&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Table size &amp;gt; 1 GB&lt;/td&gt;
&lt;td&gt;B-tree overhead becomes significant&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Range queries are primary access&lt;/td&gt;
&lt;td&gt;BRIN excels at range filtering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Low update/delete frequency&lt;/td&gt;
&lt;td&gt;Updates break physical correlation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The failure mode: creating a BRIN index on &lt;code&gt;user_id&lt;/code&gt; in a table where inserts come from many users in random order. Nearly every block range then spans almost the full range of &lt;code&gt;user_id&lt;/code&gt; values, so its min/max summary rules nothing out and PostgreSQL must scan the entire table anyway.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding BRIN Candidates
&lt;/h2&gt;

&lt;p&gt;Identify large tables with oversized B-tree indexes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schemaname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pg_size_pretty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relid&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;table_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;indexrelname&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;index_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pg_size_pretty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;indexrelid&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;index_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;indexrelid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;NULLIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relid&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;index_to_table_pct&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_tables&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_indexes&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relid&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1073741824&lt;/span&gt;  &lt;span class="c1"&gt;-- tables &amp;gt; 1 GB&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;indexrelid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tables over 1 GB with B-tree indexes consuming 10%+ of the table size are prime candidates -- if the indexed columns have high correlation.&lt;/p&gt;
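
&lt;p&gt;The query above lists every index on large tables, whatever its type. To narrow the list to B-tree indexes only, one option is to join through &lt;code&gt;pg_class&lt;/code&gt; and &lt;code&gt;pg_am&lt;/code&gt;. This is a sketch of that variant:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT
    t.relname AS table_name,
    i.indexrelname AS index_name,
    pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size
FROM pg_stat_user_tables t
JOIN pg_stat_user_indexes i ON i.relid = t.relid
JOIN pg_class c ON c.oid = i.indexrelid   -- the index's own catalog row
JOIN pg_am am ON am.oid = c.relam         -- its access method
WHERE am.amname = 'btree'
  AND pg_relation_size(t.relid) &amp;gt; 1073741824  -- tables &amp;gt; 1 GB
ORDER BY pg_relation_size(i.indexrelid) DESC
LIMIT 20;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;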

&lt;h2&gt;
  
  
  Creating and Tuning BRIN Indexes
&lt;/h2&gt;

&lt;p&gt;Basic BRIN index:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;CONCURRENTLY&lt;/span&gt; &lt;span class="n"&gt;idx_events_created_brin&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;brin&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Tune pages_per_range
&lt;/h3&gt;

&lt;p&gt;The default of 128 pages summarizes ~1 MB of table data per entry. You can tune this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- More granular: larger index, fewer false positives&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;CONCURRENTLY&lt;/span&gt; &lt;span class="n"&gt;idx_events_created_brin_fine&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;brin&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pages_per_range&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Less granular: tiny index, more false positives&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;CONCURRENTLY&lt;/span&gt; &lt;span class="n"&gt;idx_events_created_brin_coarse&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;brin&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pages_per_range&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For most time-series tables, the default of 128 works well.&lt;/p&gt;

&lt;h3&gt;
  
  
  Enable Autosummarize
&lt;/h3&gt;

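&lt;p&gt;By default, newly filled block ranges are not summarized in the BRIN index until vacuum runs. An unsummarized range is always treated as a possible match, so every query has to scan the most recent blocks regardless of its filter. Enable &lt;code&gt;autosummarize&lt;/code&gt; to have ranges summarized as they fill:&lt;br&gt;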
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;CONCURRENTLY&lt;/span&gt; &lt;span class="n"&gt;idx_events_created_brin&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;brin&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;autosummarize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For append-heavy workloads, &lt;code&gt;autosummarize = on&lt;/code&gt; is strongly recommended.&lt;/p&gt;
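
&lt;p&gt;If the index already exists, it does not need to be rebuilt to change this setting. The sketch below assumes the index name used earlier, &lt;code&gt;idx_events_created_brin&lt;/code&gt;, and also shows how to summarize outstanding ranges by hand:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Turn on autosummarize for an index that already exists
ALTER INDEX idx_events_created_brin SET (autosummarize = on);

-- Summarize any ranges that are currently unsummarized;
-- returns the number of block ranges that were summarized
SELECT brin_summarize_new_values('idx_events_created_brin');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Autosummarize requests are carried out by autovacuum workers, so autovacuum needs to be running for the setting to have any effect.&lt;/p&gt;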

&lt;h3&gt;
  
  
  Multi-Column BRIN
&lt;/h3&gt;

&lt;p&gt;A BRIN index can cover multiple columns, which is worthwhile when each of them correlates with physical order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;CONCURRENTLY&lt;/span&gt; &lt;span class="n"&gt;idx_events_multi_brin&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;brin&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both &lt;code&gt;created_at&lt;/code&gt; and an auto-incrementing &lt;code&gt;event_id&lt;/code&gt; increase together, so both have high correlation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Verify the Improvement
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ANALYZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BUFFERS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="s1"&gt;'2026-01-01'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="s1"&gt;'2026-01-31'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see &lt;code&gt;Bitmap Heap Scan&lt;/code&gt; with &lt;code&gt;Bitmap Index Scan on idx_events_created_brin&lt;/code&gt;. Buffer count should be much lower than a full sequential scan.&lt;/p&gt;

&lt;p&gt;Compare index sizes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;indexrelname&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;index_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pg_size_pretty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;indexrelid&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;index_size&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_indexes&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'events'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The size difference should be dramatic -- often 100-1000x smaller.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prevention
&lt;/h2&gt;

&lt;p&gt;Always check &lt;code&gt;pg_stats.correlation&lt;/code&gt; before creating a BRIN index. A BRIN on a low-correlation column is worse than useless -- it costs maintenance time and fools you into thinking data is indexed.&lt;/p&gt;

&lt;p&gt;Monitor BRIN effectiveness after creation. Compare buffer counts from &lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/code&gt; between BRIN-indexed queries and sequential scans. If BRIN isn't reducing buffer reads by at least 50%, the correlation is too low.&lt;/p&gt;

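&lt;p&gt;Watch for operations that break physical correlation: UPDATEs that change the indexed column, CLUSTER on a different index, or bulk DELETEs followed by new inserts. If correlation degrades, consider running &lt;code&gt;CLUSTER&lt;/code&gt; to restore physical order. Note that &lt;code&gt;CLUSTER&lt;/code&gt; needs an index that supports clustering, such as a B-tree (a BRIN index cannot be used), and it holds an &lt;code&gt;ACCESS EXCLUSIVE&lt;/code&gt; lock while it rewrites the table.&lt;/p&gt;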

&lt;p&gt;For time-series tables growing by gigabytes per day, switching from B-tree to BRIN can reduce index storage by 99% -- and your insert performance will thank you.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://mydba.dev/blog/postgres-brin-index" rel="noopener noreferrer"&gt;mydba.dev/blog/postgres-brin-index&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>performance</category>
      <category>postgres</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>PostgreSQL Covering Indexes: Eliminate Heap Fetches with INCLUDE</title>
      <dc:creator>Philip McClarence</dc:creator>
      <pubDate>Sun, 19 Apr 2026 10:00:02 +0000</pubDate>
      <link>https://dev.to/philip_mcclarence_2ef9475/postgresql-covering-indexes-eliminate-heap-fetches-with-include-3lcl</link>
      <guid>https://dev.to/philip_mcclarence_2ef9475/postgresql-covering-indexes-eliminate-heap-fetches-with-include-3lcl</guid>
      <description>&lt;h1&gt;
  
  
  PostgreSQL Covering Indexes: Eliminate Heap Fetches with INCLUDE
&lt;/h1&gt;

&lt;p&gt;You have an index on &lt;code&gt;customer_id&lt;/code&gt;. Your query filters by &lt;code&gt;customer_id&lt;/code&gt; and selects &lt;code&gt;customer_name&lt;/code&gt; and &lt;code&gt;customer_email&lt;/code&gt;. PostgreSQL finds the matching rows in the index (fast), then fetches each row from the heap table to get the name and email (slow). Those heap fetches are random I/O operations scattered across the table. On a 100M-row table returning 1,000 rows, that is 1,000 random reads -- and they dominate the query execution time. A covering index eliminates them entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Index Lookups Actually Work
&lt;/h2&gt;

&lt;p&gt;Every standard B-tree index lookup is two steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Index scan&lt;/strong&gt;: find matching row pointers (TIDs) in the index -- fast, sequential access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Heap fetch&lt;/strong&gt;: retrieve actual row data from the heap table -- slow, random I/O&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The heap fetch is the bottleneck. For single-row lookups it's barely noticeable. For queries returning hundreds or thousands of rows, it dominates execution time.&lt;/p&gt;

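&lt;p&gt;An &lt;strong&gt;index-only scan&lt;/strong&gt; skips step 2. If all columns the query needs exist in the index, PostgreSQL can answer from the index alone, with no heap access and no random I/O. The one condition is visibility: PostgreSQL skips the heap only for pages that the visibility map marks all-visible, so a table that has not been vacuumed recently still incurs heap fetches.&lt;/p&gt;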
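
&lt;p&gt;Since index-only scans rely on the visibility map, it can be worth checking how much of a table is currently marked all-visible. The sketch below reads the planner's estimates from &lt;code&gt;pg_class&lt;/code&gt;, assuming a table named &lt;code&gt;customers&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- relpages and relallvisible are estimates refreshed by VACUUM and ANALYZE
SELECT
    relname,
    relpages,
    relallvisible,
    round(100.0 * relallvisible / NULLIF(relpages, 0), 1) AS pct_all_visible
FROM pg_class
WHERE relname = 'customers';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A low percentage suggests that a &lt;code&gt;VACUUM&lt;/code&gt; on the table would let more of each index-only scan avoid the heap.&lt;/p&gt;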

&lt;h2&gt;
  
  
  The INCLUDE Clause (PostgreSQL 11+)
&lt;/h2&gt;

&lt;p&gt;Before PostgreSQL 11, covering indexes required composite indexes on all columns: &lt;code&gt;CREATE INDEX ON customers (customer_id, customer_name, customer_email)&lt;/code&gt;. This works but has a cost -- the index maintains sort order on all three columns, wasting CPU on sorts for columns nobody searches or orders by.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;INCLUDE&lt;/code&gt; adds columns to the index leaf pages without including them in the sort key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;CONCURRENTLY&lt;/span&gt; &lt;span class="n"&gt;idx_customers_covering&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;INCLUDE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_email&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key column (&lt;code&gt;customer_id&lt;/code&gt;) is the search key for &lt;code&gt;WHERE&lt;/code&gt;, &lt;code&gt;ORDER BY&lt;/code&gt;, and joins. The included columns are stored alongside but not sorted -- they exist solely to enable index-only scans.&lt;/p&gt;

&lt;h2&gt;
  
  
  Detecting Covering Index Opportunities
&lt;/h2&gt;

&lt;p&gt;Look for index scans with heap fetches in EXPLAIN output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ANALYZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BUFFERS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_email&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Watch for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Index Scan using idx_customers_id&lt;/code&gt; -- a plain index scan, which visits the heap for every matching row&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Heap Fetches: 1000&lt;/code&gt; under an &lt;code&gt;Index Only Scan&lt;/code&gt; node -- 1,000 rows still needed a heap visit, usually because the visibility map is stale&lt;/li&gt;
&lt;li&gt;High &lt;code&gt;Buffers: shared hit=...&lt;/code&gt; from random heap access&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal: &lt;code&gt;Index Only Scan&lt;/code&gt; with &lt;code&gt;Heap Fetches: 0&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Find candidates system-wide:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;schemaname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;indexrelname&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;index_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;idx_scan&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;idx_tup_read&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;idx_tup_fetch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;pg_size_pretty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg_relation_size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;indexrelid&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;index_size&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_indexes&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;idx_scan&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;idx_tup_fetch&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Indexes with high &lt;code&gt;idx_tup_fetch&lt;/code&gt; are driving many heap fetches. Cross-reference them with your most frequent queries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Examples
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Dashboard Query
&lt;/h3&gt;

&lt;p&gt;A reporting dashboard showing recent orders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- The query&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_status&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="s1"&gt;'7 days'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- The covering index&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;CONCURRENTLY&lt;/span&gt; &lt;span class="n"&gt;idx_orders_recent_covering&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;INCLUDE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_status&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This single index serves the WHERE filter, the ORDER BY, and the LIMIT, and it returns every selected column straight from the index. The heap is skipped for every page that the visibility map marks all-visible.&lt;/p&gt;
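
&lt;p&gt;One way to confirm this is to run the query under &lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/code&gt;. The sketch below assumes the index above has been built and the table has been vacuumed recently. The comments describe the plan shape to look for, not guaranteed output.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Run the dashboard query and inspect the plan
EXPLAIN (ANALYZE, BUFFERS)
SELECT order_id, customer_name, order_total, order_status
FROM orders
WHERE created_at &amp;gt;= now() - interval '7 days'
ORDER BY created_at DESC
LIMIT 50;

-- What to look for in the output:
--   an "Index Only Scan" node using idx_orders_recent_covering
--   "Heap Fetches: 0" (or a small number) on that node
--   no separate Sort node, because the index already supplies the order
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A non-zero &lt;code&gt;Heap Fetches&lt;/code&gt; count usually means some of the pages touched are not yet marked all-visible.&lt;/p&gt;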

&lt;h3&gt;
  
  
  INCLUDE vs Composite
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- INCLUDE: customer_name and order_total are retrieved but never searched&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;INCLUDE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_total&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Composite: both columns are used in WHERE or ORDER BY&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use &lt;code&gt;INCLUDE&lt;/code&gt; when the extra columns appear only in the SELECT list. Use a composite index when the columns appear in WHERE or ORDER BY, because only key columns can be searched or used for ordering.&lt;/p&gt;
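
&lt;p&gt;The two forms also combine. The sketch below assumes a hypothetical query that filters on &lt;code&gt;customer_id&lt;/code&gt;, sorts by &lt;code&gt;order_date&lt;/code&gt;, and selects only &lt;code&gt;order_total&lt;/code&gt;. The index name is made up for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical query this index is meant to serve
SELECT order_date, order_total
FROM orders
WHERE customer_id = 42
ORDER BY order_date DESC;

-- Key columns serve WHERE and ORDER BY; INCLUDE carries the payload column
CREATE INDEX CONCURRENTLY idx_orders_customer_date_covering
    ON orders (customer_id, order_date)
    INCLUDE (order_total);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping &lt;code&gt;order_total&lt;/code&gt; out of the key means it is stored only in the leaf pages and plays no part in the sort order of the index.&lt;/p&gt;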

&lt;h2&gt;
  
  
  The Vacuum Dependency
&lt;/h2&gt;

&lt;p&gt;Index-only scans require pages to be marked "all-visible" in the visibility map. PostgreSQL can skip the heap only for these pages. Vacuum maintains the visibility map.&lt;/p&gt;
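
&lt;p&gt;Visibility map coverage can be estimated from &lt;code&gt;pg_class&lt;/code&gt;, which stores &lt;code&gt;relpages&lt;/code&gt; and &lt;code&gt;relallvisible&lt;/code&gt;. Both are planner estimates refreshed by VACUUM, ANALYZE, and a few DDL commands, so treat the result as approximate. A rough sketch, assuming the table is named &lt;code&gt;orders&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Estimate what share of heap pages is marked all-visible
SELECT
    relname,
    relpages,
    relallvisible,
    round(100.0 * relallvisible / nullif(relpages, 0), 1) AS pct_all_visible
FROM pg_class
WHERE relname = 'orders'
  AND relkind = 'r';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A percentage near 100 suggests most heap pages can be skipped. A low value suggests index-only scans on this table will still do many heap fetches.&lt;/p&gt;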

&lt;p&gt;If vacuum falls behind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pages are not marked all-visible&lt;/li&gt;
&lt;li&gt;Index-only scans still fetch from the heap for those pages, or the planner falls back to regular index scans&lt;/li&gt;
&lt;li&gt;Your covering index provides little or no benefit
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Check vacuum activity, which keeps the visibility map current&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;relname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_live_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;n_dead_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;last_autovacuum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;last_vacuum&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_tables&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'customers'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;last_autovacuum&lt;/code&gt; is stale and dead tuples are accumulating, tune autovacuum to run more frequently on that table. A covering index without healthy vacuum is a wasted investment.&lt;/p&gt;
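
&lt;p&gt;Per-table storage parameters are one way to do that. The sketch below assumes the table in question is &lt;code&gt;orders&lt;/code&gt;. The thresholds are illustrative assumptions, not recommendations; suitable values depend on table size and write rate.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Trigger autovacuum after roughly 2% of rows change
-- (the default scale factor is 0.2, i.e. 20%)
ALTER TABLE orders SET (
    autovacuum_vacuum_scale_factor = 0.02,
    autovacuum_vacuum_threshold = 1000
);

-- For a one-off catch-up, vacuum manually and refresh statistics
VACUUM (ANALYZE) orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;On PostgreSQL 13 and later, insert-heavy tables are also governed by &lt;code&gt;autovacuum_vacuum_insert_scale_factor&lt;/code&gt;, which may be worth lowering for append-mostly tables.&lt;/p&gt;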

&lt;h2&gt;
  
  
  Prevention
&lt;/h2&gt;

&lt;p&gt;When writing a new query that selects specific columns from an indexed table, ask: "Can I add these columns to the index with INCLUDE to eliminate heap fetches?"&lt;/p&gt;

&lt;p&gt;Keep covering indexes focused. Don't add every column -- include only the columns your most frequent queries select. Different queries needing different columns should get targeted covering indexes, not one massive index.&lt;/p&gt;

&lt;p&gt;Monitor heap fetch counts over time. A query showing &lt;code&gt;Heap Fetches: 0&lt;/code&gt; today may regress if vacuum falls behind or if a new column is added to the SELECT list. Track &lt;code&gt;idx_tup_fetch&lt;/code&gt; on important indexes -- a sudden increase signals regression.&lt;/p&gt;
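
&lt;p&gt;A minimal sketch of that tracking, assuming a hypothetical snapshot table named &lt;code&gt;index_fetch_snapshots&lt;/code&gt; that a scheduled job fills periodically. The table, its columns, and the schedule are all assumptions for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical table holding periodic copies of the cumulative counters
CREATE TABLE IF NOT EXISTS index_fetch_snapshots (
    captured_at   timestamptz NOT NULL DEFAULT now(),
    indexrelname  text        NOT NULL,
    idx_scan      bigint,
    idx_tup_fetch bigint
);

-- Run on a schedule of your choosing
INSERT INTO index_fetch_snapshots (indexrelname, idx_scan, idx_tup_fetch)
SELECT indexrelname, idx_scan, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE relname = 'orders';

-- Compare each snapshot with the previous one for the same index
SELECT
    captured_at,
    indexrelname,
    idx_tup_fetch - lag(idx_tup_fetch) OVER w AS fetches_since_last,
    idx_scan - lag(idx_scan) OVER w AS scans_since_last
FROM index_fetch_snapshots
WINDOW w AS (PARTITION BY indexrelname ORDER BY captured_at)
ORDER BY indexrelname, captured_at;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The counters are cumulative and restart from zero when statistics are reset, so a negative delta marks a reset rather than an improvement.&lt;/p&gt;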

&lt;p&gt;Review covering indexes when query patterns change. Application refactors that add columns to SELECT clauses break index-only scans silently -- the query works, but performance drops.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://mydba.dev/blog/postgres-covering-index" rel="noopener noreferrer"&gt;mydba.dev/blog/postgres-covering-index&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>performance</category>
      <category>postgres</category>
      <category>sql</category>
    </item>
  </channel>
</rss>
