<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: JoongHyuk Shin</title>
    <description>The latest articles on DEV Community by JoongHyuk Shin (@joonghyukshin).</description>
    <link>https://dev.to/joonghyukshin</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3911251%2F63b302ae-0e4c-4fa7-916b-72b2a119b393.png</url>
      <title>DEV Community: JoongHyuk Shin</title>
      <link>https://dev.to/joonghyukshin</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/joonghyukshin"/>
    <language>en</language>
    <item>
      <title>Closing Chapter 1: From Query to Data</title>
      <dc:creator>JoongHyuk Shin</dc:creator>
      <pubDate>Sun, 21 Jun 2026 09:18:04 +0000</pubDate>
      <link>https://dev.to/joonghyukshin/closing-chapter-1-from-query-to-data-33nb</link>
      <guid>https://dev.to/joonghyukshin/closing-chapter-1-from-query-to-data-33nb</guid>
      <description>&lt;p&gt;We opened Chapter 1 with a single line, &lt;code&gt;SELECT * FROM users WHERE id = 1&lt;/code&gt;. For that line to leave the client and come back as a result row, the PostgreSQL backend went through five stages. First it decided which processing path the message should take; then the parser and analyzer turned the text into a tree and gave it meaning from the catalog. The rewriter expanded views and injected policies to transform the tree, the planner weighed the possible execution paths by cost and picked the cheapest one, and the executor followed that plan, pulling up one tuple at a time and sending them back to the client.&lt;/p&gt;

&lt;p&gt;Chapter 1 was a story about &lt;em&gt;how a query is processed&lt;/em&gt;. What tree a given SQL becomes, what plan it turns into, in what order it runs. From start to finish, a chain of logical transformations.&lt;/p&gt;

&lt;p&gt;But what every one of those stages ultimately deals with is data. The executor pulls up tuples, yet where on disk those tuples lie and in what shape, how they come up into memory, Chapter 1 never asked. When the planner judged an index scan cheaper than a sequential scan, it never opened up what that index physically is. Chapter 1 followed only the logical journey of a query, leaving untouched the substance of the data that journey stands on.&lt;/p&gt;

&lt;p&gt;Chapter 2, Storage &amp;amp; Access Methods, opens up that substance. In what unit data sits on disk (page), where disk and memory meet (buffer manager), where and how a row survives (heap), and how that row is found quickly (B-tree and the specialized indexes). The very tuple the planner weighed by cost and the executor pulled up in Chapter 1, where it actually came from and how it came to be there, is what Chapter 2 reveals.&lt;/p&gt;

&lt;p&gt;If Chapter 1 was the logical life of a query, Chapter 2 is the physical dwelling of data. We now look at how the data a query reaches for actually lives on disk.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>internals</category>
      <category>query</category>
    </item>
    <item>
      <title>1.5.5 Parallel Query: How Worker Processes Cooperate</title>
      <dc:creator>JoongHyuk Shin</dc:creator>
      <pubDate>Sun, 21 Jun 2026 09:16:59 +0000</pubDate>
      <link>https://dev.to/joonghyukshin/155-parallel-query-how-worker-processes-cooperate-1l3k</link>
      <guid>https://dev.to/joonghyukshin/155-parallel-query-how-worker-processes-cooperate-1l3k</guid>
      <description>&lt;p&gt;The executor we have seen so far is a single flow: one process pulls the plan tree from the top. An aggregate or join over a large table forces that one process to read every page from start to finish, so even when several CPU cores sit idle, only one of them works. Parallel query hands that work to several processes. But PostgreSQL's parallelism is not threads inside one process. It is &lt;em&gt;separate processes&lt;/em&gt; cooperating. Since processes do not share memory, three structural challenges follow: distributing pages, sharing state, and forbidding writes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Gather: Parallelism Happens in Only One Stretch of the Plan Tree
&lt;/h2&gt;

&lt;p&gt;A parallel query does not run the whole plan tree in parallel. Somewhere in the tree sits a node called &lt;code&gt;Gather&lt;/code&gt;, and only the subtree &lt;em&gt;below&lt;/em&gt; that &lt;code&gt;Gather&lt;/code&gt; is split across processes. Everything above &lt;code&gt;Gather&lt;/code&gt; runs in a single process, as usual.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Gather&lt;/code&gt; does two things. It launches the worker processes, and it merges the rows those workers produce together with its own rows into a single output stream for its parent. The process that launches the workers is called the leader. The leader does not just sit and collect: it also runs the subtree below it and produces rows itself. So with two workers, three participants run the same plan, counting the leader.&lt;/p&gt;

&lt;p&gt;This only works under one condition. The subtree below &lt;code&gt;Gather&lt;/code&gt; must produce &lt;em&gt;no duplicate rows even when the leader and the workers each run it independently&lt;/em&gt;. If three participants run the same subtree with no coordination, every row comes out three times and the result triples, which is wrong. The node that solves this is a 'parallel-aware' node. A parallel-aware scan node coordinates among the processes so that each one reads a &lt;em&gt;different part&lt;/em&gt; of the table even though they run the same node. This is the node EXPLAIN shows as &lt;code&gt;Parallel Seq Scan&lt;/code&gt;. A plain &lt;code&gt;Seq Scan&lt;/code&gt; has one process read the whole table; a &lt;code&gt;Parallel Seq Scan&lt;/code&gt; has several processes running the same node split the pages among themselves.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Workers Split Pages Without Overlap
&lt;/h2&gt;

&lt;p&gt;So three workers read the same table. How do they avoid reading the same page twice? The answer is surprisingly simple. There is one counter in shared memory, and they advance it atomically to divide the pages. (Atomic here means that even when several processes touch the same value at the same instant, the operation completes as one indivisible step, with no one slipping in halfway.)&lt;/p&gt;

&lt;p&gt;A table lays out its pages in order, from page 0 to page N. The shared state of a parallel sequential scan holds a counter called &lt;code&gt;phs_nallocated&lt;/code&gt;, which says how many pages have been handed out to workers so far. When a worker runs out of pages to read, it does a fetch-and-add on this counter. Fetch-and-add is an atomic operation that reads the counter's current value and adds to it in the same step, so even if several workers call it at the same moment, each one receives a different value. One worker gets "starting at 100," and the next automatically gets the range after it. Coordinating through this single counter means the workers never have to take locks and talk to each other.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fdzx012mhephilp43irnp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fdzx012mhephilp43irnp.png" alt="Workers split a table's pages into chunks" width="800" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is one more piece to how pages get handed out. A worker does not take pages one at a time. It takes a &lt;em&gt;run of consecutive pages&lt;/em&gt; at once, and that run is called a chunk. If a worker took only page 100, then knocked on the counter again and got page 250, its reads would scatter all over the table. From the OS's point of view the disk accesses would look sparse, and read-ahead (the OS prefetching the next pages from disk before they are asked for) would not kick in, because each backend is a separate process and the OS cannot recognize them as one sequential stream. So PG hands a worker a whole consecutive chunk, and only after the worker finishes that chunk does it knock on the counter again for the next one. Since a chunk is contiguous, reads within it are sequential and read-ahead stays alive.&lt;/p&gt;

&lt;p&gt;Chunk size is not fixed; it is derived from the table size. PG splits the table into roughly 1024 to 2048 pieces and uses that as the base chunk size, capped at 8192 pages so it never grows too large. Then, as the scan nears the end, when only about 64 chunks remain, it halves the chunk size. This final touch matters. With large chunks, one worker can pick up the last big chunk late and keep reading alone for a long time while the others sit idle with nothing to do. Splitting the tail into smaller chunks spreads the remaining work evenly across the workers, so they all finish at about the same time.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Workers See the Same Data as the Leader
&lt;/h2&gt;

&lt;p&gt;Splitting the pages is settled. But there is a deeper problem. A worker is a separate process, so it starts fresh, inheriting none of the in-memory state the leader holds. The worker does not know how far the transaction has progressed, which snapshot it is viewing data through (a snapshot is the point in time a transaction sees its data from, and it determines which rows are visible and which are not), or what the GUC settings are. Left like this, a worker could see different data than the leader. For instance, inside the same transaction, a worker could take a different snapshot, so a row invisible to the leader becomes visible to the worker.&lt;/p&gt;

&lt;p&gt;The snapshot, transaction IDs, and row visibility (which rows are visible to this transaction right now) that show up here are all covered in detail in Chapter 3, Transactions and MVCC.&lt;/p&gt;

&lt;p&gt;PG solves this by copying state. Before launching workers, the leader allocates a piece of dynamic shared memory (a region of memory that several processes look into together, abbreviated DSM) and serializes into it the state a worker needs to behave just like the leader. As soon as a worker starts, it attaches to this DSM and restores that state as its own. The key pieces copied are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The transaction snapshot and the active snapshot. This is the most direct mechanism for making a worker see data from the same point in time as the leader.&lt;/li&gt;
&lt;li&gt;The current transaction's XID and the list of in-progress XIDs. This makes a tuple's visibility check return the same answer in the worker and the leader.&lt;/li&gt;
&lt;li&gt;The combo CID mappings. This is the information needed to handle consistently the cases where a visibility decision gets complicated, such as a transaction that created a row and then deleted it.&lt;/li&gt;
&lt;li&gt;All GUC values. So a worker runs the plan with the same &lt;code&gt;work_mem&lt;/code&gt; and the same settings as the leader.&lt;/li&gt;
&lt;li&gt;The authenticated user ID, the database, and the security context. The worker connects to the same database with the same privileges.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once the copy is done, a worker runs the plan on top of the same transaction, the same snapshot, and the same settings as the leader. So no matter who reads which page, the visibility decisions match.&lt;/p&gt;

&lt;p&gt;How do the result rows a worker produces get back to the leader? Through a queue set up inside the DSM. Each worker has its own shared-memory queue (a tuple queue): the worker writes the rows it produces into that queue, and the leader cycles through the queues pulling rows out. From the leader's side, it gathers its own rows and the rows pulled from the workers' queues into a single stream and sends them up.&lt;/p&gt;

&lt;p&gt;These queues carry more than results; they carry errors too. When an error happens in a worker, the error message is placed on a dedicated queue and the worker signals the leader. At its next interrupt check, the leader reads that message and re-throws it as if it had raised the error itself. So an error that fired inside a worker still appears to the user as a single, ordinary query error.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Parallel Queries Are Read-Only
&lt;/h2&gt;

&lt;p&gt;Parallel query carries one big restriction. While running below &lt;code&gt;Gather&lt;/code&gt;, it cannot write to the database and cannot run DDL. The parallel stretch is strictly read-only. This is not an unfinished implementation or a temporary limit; it follows directly from the state-copying structure above.&lt;/p&gt;

&lt;p&gt;At the heart of it is the combo CID. When a transaction deletes or updates, within the same transaction, a row it just created, the ordinary XID alone is not enough to decide that row's visibility, so an extra mapping called the combo CID is built. In parallel mode the leader copies its combo CID mapping to the workers at startup. The catch is that this copy happens &lt;em&gt;once, at the start&lt;/em&gt;. If a worker were to write data mid-execution, a new combo CID would be created, and there would be no way to tell the leader or the other workers about it. The processes inside the same transaction would then diverge, each holding different visibility information. There is no way to prevent that divergence, so the choice was to forbid writes outright.&lt;/p&gt;

&lt;p&gt;PG enforces this read-only rule in code as well. Before building a parallel context, it enters parallel mode and arms checks that block unsafe operations, then disarms them when the parallel work ends. While those checks are armed, attempting a dangerous operation such as a data write or a permanent GUC change fails with an error.&lt;/p&gt;

&lt;p&gt;Because of this, the SELECT inside a data-modifying statement like &lt;code&gt;INSERT ... SELECT&lt;/code&gt; or &lt;code&gt;UPDATE&lt;/code&gt; gets no parallelism, no matter how heavy that SELECT is. The whole statement is a writing transaction.&lt;/p&gt;

&lt;p&gt;Not every write is like that, though. &lt;code&gt;CREATE TABLE AS&lt;/code&gt;, &lt;code&gt;SELECT INTO&lt;/code&gt;, and &lt;code&gt;CREATE MATERIALIZED VIEW&lt;/code&gt; are allowed to go parallel. The trick is that the part that runs in parallel is only the &lt;em&gt;read&lt;/em&gt;. &lt;code&gt;CREATE TABLE AS ... SELECT&lt;/code&gt; has the workers split the reading of its SELECT, while the leader alone loads the gathered results into the new table. The workers still only read, so the read-only rule is not broken. On top of that, the table the results go into is being created at that very moment, so no worker can see or touch it, and it collides with no one's visibility. That is why PG allows parallelism for these three commands.&lt;/p&gt;

&lt;h2&gt;
  
  
  Gather vs GatherMerge: Dropping Order or Keeping It
&lt;/h2&gt;

&lt;p&gt;There are two kinds of nodes that collect workers' results: &lt;code&gt;Gather&lt;/code&gt; and &lt;code&gt;GatherMerge&lt;/code&gt;. The difference is how they handle result order.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Gather&lt;/code&gt; does not care about order. Whichever worker delivers a row first, that row goes out first, so the workers' results come out interleaved as they arrive. For a query that needs no ordering, such as a plain aggregate or a scan with only a filter, this is the lightest option.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;GatherMerge&lt;/code&gt; preserves order. It is used when each worker delivers &lt;em&gt;sorted&lt;/em&gt; results. Since each worker's output is sorted within itself, the leader merges while preserving the overall sort order: it compares only the front rows of each worker's queue and emits the smallest. To do this comparison efficiently it uses a heap internally (a data structure that quickly yields the smallest among many values). &lt;code&gt;GatherMerge&lt;/code&gt; is chosen for an &lt;code&gt;ORDER BY&lt;/code&gt; parallel query where each worker sorts separately and the results have to be combined without breaking that sort.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means in Practice
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;For a job that precomputes a heavy aggregate, pulling the result into a new table and swapping it in by name preserves the parallel gain, rather than loading directly into an existing summary table.&lt;/strong&gt; An &lt;code&gt;INSERT INTO&lt;/code&gt; or &lt;code&gt;UPDATE&lt;/code&gt; is a writing transaction and loses parallelism even for the SELECT inside it, whereas &lt;code&gt;CREATE TABLE AS&lt;/code&gt; lets the workers split the reading part among themselves.&lt;br&gt;
&lt;br&gt;
&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>internals</category>
      <category>executor</category>
    </item>
    <item>
      <title>1.5.4 Aggregation: HashAgg, GroupAgg, Sort-Based</title>
      <dc:creator>JoongHyuk Shin</dc:creator>
      <pubDate>Sun, 21 Jun 2026 09:15:54 +0000</pubDate>
      <link>https://dev.to/joonghyukshin/154-aggregation-hashagg-groupagg-sort-based-4bbn</link>
      <guid>https://dev.to/joonghyukshin/154-aggregation-hashagg-groupagg-sort-based-4bbn</guid>
      <description>&lt;p&gt;A scan node reads rows from a table one at a time, and a join node pairs up rows arriving from its two children. Both can take one input row and emit one result row right away. Aggregation breaks that pattern. In &lt;code&gt;SELECT region, COUNT(*) FROM sales GROUP BY region&lt;/code&gt;, producing the single row for &lt;code&gt;region = 'Seoul'&lt;/code&gt; requires counting &lt;em&gt;every&lt;/em&gt; row that belongs to &lt;code&gt;'Seoul'&lt;/code&gt; first. A group's answer is only settled after all of that group's rows have been seen.&lt;/p&gt;

&lt;p&gt;But the executor runs on a pull model that drags results up one row at a time. The top node asks for "the next row," that request travels down the tree, and a row climbs back up one level at a time. The "you have to see everything before you get an answer" nature of aggregation collides, at first glance, with "pull one row at a time." PostgreSQL resolves this in two ways. One is GroupAggregate, which sorts the input by the grouping key so that rows of the same group come out one after another. The other is HashAggregate, which skips sorting and accumulates per-group state in a hash table. The &lt;code&gt;GroupAggregate&lt;/code&gt; and &lt;code&gt;HashAggregate&lt;/code&gt; nodes you see in EXPLAIN for a &lt;code&gt;GROUP BY&lt;/code&gt; query are exactly these two.&lt;/p&gt;

&lt;h2&gt;
  
  
  Aggregation builds its answer by accumulating state
&lt;/h2&gt;

&lt;p&gt;Before looking at how the two methods differ, we need to see how a single group's answer is built, because both methods share this accumulation engine.&lt;/p&gt;

&lt;p&gt;Each aggregate function carries one running state value, the transition value. It starts by setting that state to an initial condition, and every time a row in the group arrives, a transition function updates the state. Once the group's last row is processed, a final function extracts the final result from that state. A comment in the PostgreSQL source sums this up in three lines.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;transvalue = initcond
foreach input_tuple do
    transvalue = transfunc(transvalue, input_value)
result = finalfunc(transvalue)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Take &lt;code&gt;SUM(amount)&lt;/code&gt;. The initial condition is 0, and the transition function is "add this row's &lt;code&gt;amount&lt;/code&gt; to the running sum." As rows come in as &lt;code&gt;10&lt;/code&gt;, &lt;code&gt;20&lt;/code&gt;, &lt;code&gt;30&lt;/code&gt;, the state grows &lt;code&gt;0 → 10 → 30 → 60&lt;/code&gt;. &lt;code&gt;SUM&lt;/code&gt; has no separate final function because the accumulated value is the answer. &lt;code&gt;AVG&lt;/code&gt;, by contrast, accumulates both a sum and a count, and at the end its final function computes "sum divided by count" to produce the average. Whether or not a final function exists varies by aggregate, but the flow of carrying a state, updating it per row, and finishing at the end is the same for every aggregate.&lt;/p&gt;

&lt;p&gt;One thing becomes clear here: aggregation has to push &lt;em&gt;every&lt;/em&gt; row of a group through the transition function before that group's state is complete. So the key question becomes "how do you gather all the rows of the same group into one state value, missing none?" GroupAggregate and HashAggregate only differ in how they gather; the way they accumulate is identical.&lt;/p&gt;

&lt;p&gt;The figure below shows, side by side, how the two methods handle the same input. The grouping key has three values &lt;code&gt;a&lt;/code&gt;, &lt;code&gt;b&lt;/code&gt;, &lt;code&gt;c&lt;/code&gt;, and the number attached to each row is the value to &lt;code&gt;SUM&lt;/code&gt;. Adding the numbers within the same key gives that group's result (&lt;code&gt;a&lt;/code&gt; is &lt;code&gt;10 + 20 = 30&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fzkdx7urgpdt0v5xkrpsb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fzkdx7urgpdt0v5xkrpsb.png" alt="Two aggregation strategies compared" width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  GroupAggregate: catch the point where a group ends in sorted input
&lt;/h2&gt;

&lt;p&gt;GroupAggregate starts from the assumption that the input is &lt;em&gt;already sorted&lt;/em&gt; by the grouping key. When it is sorted, rows of the same group come out one after another without a break. If you group by &lt;code&gt;region&lt;/code&gt;, all the &lt;code&gt;'Busan'&lt;/code&gt; rows come out, then the &lt;code&gt;'Seoul'&lt;/code&gt; rows follow, then the &lt;code&gt;'Incheon'&lt;/code&gt; rows.&lt;/p&gt;

&lt;p&gt;Because they are grouped together this way, you can tell immediately where a group ends. Compare the grouping key of the previous row with the one just received: if they are equal, it is the same group, so add it to the state through the transition function; if they differ, that spot is a group boundary. A changed key means you have seen the previous group's last row, so at that moment you run the previous group's final function, complete one result row, and send it up to the parent. Then you reset the new group's state to the initial condition and start accumulating the new group from the row you just received.&lt;/p&gt;

&lt;p&gt;This behavior meshes cleanly with the pull model. GroupAggregate does not gather the entire input and wait. Each time a group ends, it emits that group's result right away. When the parent asks for the first row, it pulls input only as far as the first group's end, returns one result row, and stops. If there is a &lt;code&gt;LIMIT 3&lt;/code&gt; above it, it produces only three groups and never touches the rest of the input. What it holds in memory is just the state value of &lt;em&gt;the one group it is currently processing&lt;/em&gt;, so even with millions of groups, only one sits in memory at a time.&lt;/p&gt;

&lt;p&gt;The catch is that the "input must be sorted" assumption is not free. If the input arrives already sorted from a grouping-key index, it splits groups with no extra cost; otherwise a Sort node is placed below GroupAggregate to sort the input first, and aggregation only starts after that. In EXPLAIN, whether a &lt;code&gt;Sort&lt;/code&gt; hangs under &lt;code&gt;GroupAggregate&lt;/code&gt;, or a sorted &lt;code&gt;Index Scan&lt;/code&gt; feeds in directly, tells you whether this sorting cost was paid.&lt;/p&gt;

&lt;h2&gt;
  
  
  HashAggregate: gather groups in a hash table without sorting
&lt;/h2&gt;

&lt;p&gt;HashAggregate drops the sorting assumption and uses a hash table instead. Feeding the grouping key into a hash function yields a fixed slot per key, and the same key always goes to the same slot. So no matter what order the input comes in, you hash a row's key, find that key's slot, and apply the transition function to the state sitting there. A &lt;code&gt;'Seoul'&lt;/code&gt; row, wherever it is scattered across the input, all gathers into the one &lt;code&gt;'Seoul'&lt;/code&gt; slot.&lt;/p&gt;

&lt;p&gt;Not needing sorting is HashAggregate's strength. Since there is no need to line up the input in advance, it saves the entire sorting cost in situations where there is no suitable index and a sort would otherwise be required. It shines especially when the number of groups is small. With only a handful of regions, the hash table holds only a few slots, and you can read a large input quickly while just accumulating into those few slots.&lt;/p&gt;

&lt;p&gt;In return, a constraint that GroupAggregate did not have appears. HashAggregate cannot emit a single result row until it has pulled in &lt;em&gt;all&lt;/em&gt; of the input and completed the hash table. Until the last row arrives, no group can be sure "this is all of this group." So in the build phase it exhausts the input to the end and completes the state of every group, and only then does it scan the table and pull out group results one by one to send up. Unlike GroupAggregate, which emits a result every time a group ends, HashAggregate cannot emit even the first row before the input is fully read. Even with a &lt;code&gt;LIMIT 3&lt;/code&gt; above it, the build phase must read the entire input, so it cannot stop as early as the sort-based method.&lt;/p&gt;

&lt;h2&gt;
  
  
  When memory runs out: HashAggregate's disk spill
&lt;/h2&gt;

&lt;p&gt;HashAggregate's other weakness is that the hash table takes up memory. When groups are few, only a handful of slots are needed, so it is light; but when the grouping key is nearly unique and groups balloon into the millions, the hash table presses on memory. The memory limit available here is &lt;code&gt;work_mem&lt;/code&gt; multiplied by &lt;code&gt;hash_mem_multiplier&lt;/code&gt;. &lt;code&gt;work_mem&lt;/code&gt; is the amount of memory a single operation like sorting or hashing can use before spilling to disk, and &lt;code&gt;hash_mem_multiplier&lt;/code&gt; is a factor that raises that limit several-fold for hash-based operations only. Since its default is 2.0, the hash table can use up to twice &lt;code&gt;work_mem&lt;/code&gt; by default.&lt;/p&gt;

&lt;p&gt;When this limit is exceeded, HashAggregate enters spill mode. Groups that already have a slot in the table keep accumulating in memory, while a row that &lt;em&gt;would need to create a new group&lt;/em&gt; does not claim another slot in memory and is instead written aside to disk. When writing to disk, rows are split across several partitions by the hash value of the key. After reading through the input once and completing the in-memory groups, the partitions stacked on disk are pulled back in one at a time and aggregated the same way. If a partition is too large to fit in memory again, that partition is split into finer pieces.&lt;/p&gt;

&lt;p&gt;Concretely, it looks like this. Suppose &lt;code&gt;work_mem&lt;/code&gt; is 4MB and you run &lt;code&gt;SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id&lt;/code&gt; over 10 million orders. If there are 2 million customers, the hash table needs 2 million slots, and the moment these slots exceed the memory limit (4MB times &lt;code&gt;hash_mem_multiplier&lt;/code&gt;) during accumulation, it switches to spill mode. The customer slots that had entered the table up to that point keep adding their sums in memory, while rows of new customers that did not yet have a slot are pushed to disk partitions. After reading through all 10 million orders once, the sums of the customers remaining in memory are completed and emitted as results, and the customers pushed to disk are pulled back partition by partition and finished. Whether a given customer stays in memory or is pushed to disk is just a matter of whether its slot was created before or after the limit was crossed.&lt;/p&gt;

&lt;p&gt;This disk split resembles the way HashJoin, seen earlier, splits into batches when the hash table overflows memory. Both follow the same principle of spilling to disk once the hash table exceeds the &lt;code&gt;work_mem&lt;/code&gt; limit; they only differ in purpose, with the join matching pairs across both sides and the aggregate accumulating per-group state.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; shows in numbers whether this spill happened. A &lt;code&gt;Batches&lt;/code&gt; value greater than 1 on the &lt;code&gt;HashAggregate&lt;/code&gt; node means it was split and processed on disk, &lt;code&gt;Memory Usage&lt;/code&gt; reports the amount of memory used, and &lt;code&gt;Disk Usage&lt;/code&gt; reports how much was written to disk. A &lt;code&gt;Batches&lt;/code&gt; of 1 signals that the hash table fit entirely in memory.&lt;/p&gt;

&lt;p&gt;Disk spill is a feature introduced in PostgreSQL 13. Thanks to it, HashAggregate can be attempted even when groups are so numerous that the hash table does not fit entirely in memory.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>internals</category>
      <category>executor</category>
    </item>
    <item>
      <title>1.5.3 Join Nodes: NestLoop, HashJoin, MergeJoin</title>
      <dc:creator>JoongHyuk Shin</dc:creator>
      <pubDate>Sun, 21 Jun 2026 09:14:47 +0000</pubDate>
      <link>https://dev.to/joonghyukshin/153-join-nodes-nestloop-hashjoin-mergejoin-2fek</link>
      <guid>https://dev.to/joonghyukshin/153-join-nodes-nestloop-hashjoin-mergejoin-2fek</guid>
      <description>&lt;p&gt;A scan node sits at the leaf of the tree and pulls rows from a single table. A join node sits in the middle and brings together the rows that its two children send up. It takes one row from &lt;code&gt;users&lt;/code&gt;, one row from &lt;code&gt;orders&lt;/code&gt;, checks whether they belong to the same user, and if they match, emits the combined row. PostgreSQL has three nodes for this one job: NestLoop, HashJoin, and MergeJoin. The reason a single task splits into three nodes is much like the reason scans did. There is more than one way to find matching pairs from two inputs, and which way is cheapest depends on the size of the inputs and the shape of the join condition.&lt;/p&gt;

&lt;p&gt;Deciding which way is cheapest, by costing the alternatives, was the planner's job in an earlier chapter. This section looks at what those three nodes actually do when they execute. Given the same two tables, the three find matches in completely different ways, and that difference in approach is exactly what tells them apart.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the three nodes route requests
&lt;/h2&gt;

&lt;p&gt;All three join nodes are internal nodes with two children. One child is called the outer, the other the inner. All three run on the Volcano model's pull framework: when the parent asks for the next row, the join node takes rows from its two children, builds one matched row, and sends it up. The only difference is the order and manner in which it routes pull requests to its two children.&lt;/p&gt;

&lt;p&gt;NestLoop pulls the inner from the start all over again for each outer row it receives. HashJoin slurps the inner in one pass to build an index in memory, then takes outer rows one at a time and probes that index. MergeJoin, on the assumption that both sides are sorted in the same order, advances both sides one step at a time in lockstep.&lt;/p&gt;

&lt;h2&gt;
  
  
  NestLoop: rescan the inner for every outer row
&lt;/h2&gt;

&lt;p&gt;The simplest method is NestLoop. As the name says, the loops are nested. The outer loop takes one row from the outer; the inner loop scans the inner from beginning to end, looking for inner rows that match that outer row. When the inner is exhausted, the outer loop moves to its next row and the inner loop scans the inner &lt;em&gt;from the start again&lt;/em&gt;. If the outer has N rows and the inner has M rows, the node performs N×M match comparisons.&lt;/p&gt;

&lt;p&gt;At runtime the node keeps only two pieces of state: whether the inner scan for the current outer row is finished so the next outer row is needed, and whether the current outer row has matched the inner at least once. That second flag exists for outer joins like &lt;code&gt;LEFT JOIN&lt;/code&gt;. A &lt;code&gt;LEFT JOIN&lt;/code&gt; must emit even an outer row that never finds a match, filling the inner side with NULLs, and that NULL-filled row should be produced only when the inner has been fully scanned without a single match. So the node remembers whether a match has happened. A plain &lt;code&gt;INNER JOIN&lt;/code&gt;, which simply discards an unmatched outer row, has no need for this flag. When the next outer row is needed, the node tells the inner child to start over, and that rewind resets the inner subtree so it flows again from its first row.&lt;/p&gt;

&lt;p&gt;Judging by the number N×M alone, NestLoop looks like a textbook case of inefficiency. Yet there is a decisive reason PostgreSQL picks it often: when the inner can use an index, the inner loop's "scan the whole inner" turns into "fetch just one row by index." The key is that when the outer loop takes an outer row, it passes that row's join key value down to the inner. The inner plugs that value into its index condition and, instead of scanning the whole table, goes straight to the matching rows through the index.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'2026-01-01'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Suppose &lt;code&gt;customers.id&lt;/code&gt; has a primary key index. When the planner picks NestLoop here, the outer loop pulls qualifying &lt;code&gt;orders&lt;/code&gt; rows one at a time and passes each row's &lt;code&gt;customer_id&lt;/code&gt; down to the inner. The inner does not scan &lt;code&gt;customers&lt;/code&gt; from the start; it resolves &lt;code&gt;WHERE c.id = &amp;lt;that customer_id&amp;gt;&lt;/code&gt; through the index and fetches just that one customer. So even with ten thousand outer rows, the inner's real work is "ten thousand index lookups," not "ten thousand full table scans." In EXPLAIN, when an &lt;code&gt;Index Scan&lt;/code&gt; sits under the inner side of a &lt;code&gt;Nested Loop&lt;/code&gt; and its index condition references an outer column, this is the shape you are seeing.&lt;/p&gt;

&lt;p&gt;A second strength of NestLoop follows from this: it does not care about the shape of the join condition. All NestLoop does is pair an outer row with an inner row and evaluate the join condition, so the condition can be an equality (&lt;code&gt;=&lt;/code&gt;), an inequality (&lt;code&gt;&amp;lt;&lt;/code&gt;), a range, or a function call. The HashJoin and MergeJoin we will see next work only when the join condition is an equality; NestLoop has no such restriction. A join that connects two tables with an inequality, or matches them through an arbitrary function like a distance calculation, has no option other than NestLoop.&lt;/p&gt;

&lt;p&gt;The trouble is when the inner cannot use an index and there are many matches. Scanning the entire inner for every outer row makes N×M a reality, and with even tens of thousands of outer rows the comparison count climbs into the billions. When the planner misjudges this boundary and picks NestLoop, you get the classic incident of a single query spinning the CPU without end.&lt;/p&gt;

&lt;h2&gt;
  
  
  HashJoin: build a hash table from the inner, look up from the outer
&lt;/h2&gt;

&lt;p&gt;HashJoin's idea is not to rescan the inner over and over, but to read it just once and build a fast index in memory. That index is the hash table.&lt;/p&gt;

&lt;p&gt;HashJoin runs in two phases. First, in the build phase, it pulls the entire inner and builds a hash table keyed on the join key. It runs the key through a hash function to decide which bucket it lands in, and collects inner rows with the same hash value into the same bucket. This phase slurps the inner all at once: unlike other nodes that emit one row at a time, it consumes the entire inner to complete the table before moving on. Then, in the probe phase, it takes outer rows one at a time, runs each row's join key through the same hash function to find its bucket, and compares only against the inner rows in that bucket.&lt;/p&gt;

&lt;p&gt;This asymmetry is the heart of HashJoin. The inner is read exactly once to become an index, and after that only the outer flows. It is the opposite of NestLoop, which reads the inner as many times as there are outer rows. Processing one outer row compares not against the whole inner but against the few inner rows in the same bucket, so in a well-distributed hash table the comparisons per outer row drop to a handful.&lt;/p&gt;

&lt;p&gt;In return, HashJoin carries two restrictions NestLoop does not. First, it works only for equi-joins. Hashing guarantees "equal values land in the same bucket," but it tells you nothing about "greater than" or "less than." So an equality condition like &lt;code&gt;a.x = b.y&lt;/code&gt; can be resolved, but an inequality like &lt;code&gt;a.x &amp;lt; b.y&lt;/code&gt; cannot be narrowed by hashing. Second, the hash table occupies memory. As long as the inner is small enough to fit in memory there is no problem, but once the inner exceeds the memory limit, the story changes.&lt;/p&gt;

&lt;p&gt;That memory limit is not &lt;code&gt;work_mem&lt;/code&gt; itself but &lt;code&gt;work_mem&lt;/code&gt; multiplied by &lt;code&gt;hash_mem_multiplier&lt;/code&gt; (default 2). In other words, with the default setting the hash table can use up to twice &lt;code&gt;work_mem&lt;/code&gt;. If, during the build phase, the hash table looks like it will not fit within this limit, PostgreSQL splits the inner and outer into several batches by the hash value of the join key in advance. It loads one batch into memory to finish build and probe, then writes the remaining batches to temporary files on disk and brings them back one at a time. The &lt;code&gt;Buckets: ... Batches: ... Memory Usage: ...&lt;/code&gt; printed on the &lt;code&gt;Hash&lt;/code&gt; node in EXPLAIN ANALYZE shows this state. &lt;code&gt;Batches&lt;/code&gt; of 1 means the hash table fit entirely in memory; 2 or more means it was split out to disk, adding that much temporary-file I/O. A large &lt;code&gt;Batches&lt;/code&gt; is a signal that &lt;code&gt;work_mem&lt;/code&gt; is short for this join.&lt;/p&gt;

&lt;p&gt;PostgreSQL adds one more optimization here, for the case where the data is not evenly spread but skewed toward particular values. Picture joining an orders table to customers where one large customer accounts for a sizable share of all orders. In that case a single customer's id appears unusually often in the outer's join key. PostgreSQL can know from statistics which values appear this often (the most common values), and it pulls the inner rows for those popular values into a dedicated space that is never spilled to a disk batch. This dedicated space is called the skew bucket: skew means the distribution leans to one side, and this is a separate bucket for those leaning values. Those values will be probed frequently from the outer, and if their inner rows were sitting in a disk batch, each probe would have to read them back from disk. Holding them in memory in the skew bucket eliminates that repeated disk read.&lt;/p&gt;

&lt;h2&gt;
  
  
  MergeJoin: advance two sorted inputs in lockstep
&lt;/h2&gt;

&lt;p&gt;The third method starts from the assumption that both inputs are &lt;em&gt;already sorted&lt;/em&gt; by the join key. If both are sorted lists, you can match them by walking down both at once, one step at a time, without indexing one side wholesale or rescanning it. It is like laying two sorted name lists side by side and walking down to find the same name.&lt;/p&gt;

&lt;p&gt;It begins with a read position on each side. It compares the join keys at the two positions: if they are equal, that is a match, so it emits the combined row; if they differ, it advances only the position on the &lt;em&gt;smaller&lt;/em&gt; side by one. A smaller value means that row's match has already passed or will never come, so it is safe to drop it and move on. Since the two positions only ever move in one direction until each reaches its end, the join finishes after reading each input exactly once. Unlike NestLoop, which reads the inner repeatedly, and HashJoin, which loads the whole inner into memory, MergeJoin streams both sides exactly once.&lt;/p&gt;

&lt;p&gt;For MergeJoin to work, both sides must be sorted, and that sorting is not free. If both inputs happen to come up already sorted because they were read through a join-key index, the merge can begin right away at no sorting cost; otherwise a Sort node is placed under one or both sides to sort the input first, and only then does the merge begin. In EXPLAIN, whether a &lt;code&gt;Sort&lt;/code&gt; hangs under the &lt;code&gt;Merge Join&lt;/code&gt; or a sorted &lt;code&gt;Index Scan&lt;/code&gt; feeds straight in tells you whether that sorting cost was paid.&lt;/p&gt;

&lt;p&gt;A tricky case slips in here: when the join key has duplicate values. If the outer has the key &lt;code&gt;5&lt;/code&gt; twice and the inner has &lt;code&gt;5&lt;/code&gt; three times, the first outer &lt;code&gt;5&lt;/code&gt; must match all three inner &lt;code&gt;5&lt;/code&gt; rows, and the second outer &lt;code&gt;5&lt;/code&gt; must &lt;em&gt;also&lt;/em&gt; match those same three inner &lt;code&gt;5&lt;/code&gt; rows. But by the time the first outer &lt;code&gt;5&lt;/code&gt; is processed, the inner's read position has already moved past the three &lt;code&gt;5&lt;/code&gt; rows to the next value. To match the second outer &lt;code&gt;5&lt;/code&gt;, the inner position must be moved back to where &lt;code&gt;5&lt;/code&gt; began.&lt;/p&gt;

&lt;p&gt;This rewinding is called mark and restore. A mark is placed at the position where a run of equal values begins on the inner, and when the same value appears again on the outer, the inner position is restored to that mark. So although we said MergeJoin reads both sides only once, more precisely a stretch of the inner can be rewound and reread because of duplicate values on the outer. If the join key is nearly unique on both sides, this rewinding barely happens and the join finishes most cleanly; if there are many duplicates, that much rewinding cost is added.&lt;/p&gt;

&lt;p&gt;One more thing: mark and restore work only if the inner child can accept the request "remember this position and go back to it." Not every execution node can return to an arbitrary past position. A node that streams sort output or another join's result one row at a time has no way back to a row it has already passed. In that case PostgreSQL inserts a Materialize node above the inner. Materialize is a node that stacks up the rows the inner sends, in memory (spilling to disk if it overflows). Because mark and restore happen on this stacked-up copy rather than on the inner original, the copy can be walked back and forth freely. The &lt;code&gt;Materialize&lt;/code&gt; you see under a &lt;code&gt;Merge Join&lt;/code&gt; in EXPLAIN is that mechanism.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>internals</category>
      <category>executor</category>
    </item>
    <item>
      <title>1.5.2 Scan Nodes: SeqScan, IndexScan, BitmapScan</title>
      <dc:creator>JoongHyuk Shin</dc:creator>
      <pubDate>Sun, 21 Jun 2026 09:13:42 +0000</pubDate>
      <link>https://dev.to/joonghyukshin/152-scan-nodes-seqscan-indexscan-bitmapscan-2gpo</link>
      <guid>https://dev.to/joonghyukshin/152-scan-nodes-seqscan-indexscan-bitmapscan-2gpo</guid>
      <description>&lt;p&gt;A scan node is a leaf of the executor tree. Having no children, it reads rows straight from disk. That single phrase, "reads straight from disk," actually hides three different methods. To pull rows that pass the same condition from the same table, PG has three separate scan nodes: SeqScan, IndexScan, and BitmapScan. The job is identical, so why three nodes? One reason: depending on what fraction of the rows pass the condition, a different way of reading the heap pages turns out to be cheaper.&lt;/p&gt;

&lt;p&gt;All three run on the same skeleton. The procedure a scan node follows to produce one row is to fetch a raw tuple through the access method, check the qual (the filter expression, like a WHERE condition), and if it passes, project out the needed columns and return it. The difference between the three nodes lies in the first step: where, and in what order, the next raw tuple comes from. SeqScan sweeps the whole table, IndexScan probes only the spots the index points to, and BitmapScan gathers those spots all at once and then reads them in an organized order.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sequential scan: read every table page from first to last
&lt;/h2&gt;

&lt;p&gt;The simplest method is the sequential scan. Without going through an index, it reads the table's heap pages in their physical storage order, from the first page to the last. From each page it pulls rows one by one, checks the qual, and sends up only those that pass. When a query like &lt;code&gt;SELECT * FROM orders WHERE amount &amp;gt; 100&lt;/code&gt; matches a large portion of the table, most pages have to be read anyway, so this method is fastest.&lt;/p&gt;

&lt;p&gt;A sequential scan is fast for two reasons. First, there is no side work of searching an index. Second, reading pages in storage order makes it sequential I/O for the disk. Disks and the OS apply read-ahead, prefetching the next pages while reading consecutive blocks, which is far more efficient than hunting down scattered locations one at a time. This is also the basis on which the planner assigns different costs to sequential access and random access, and that cost model is covered in 1.4.&lt;/p&gt;

&lt;p&gt;Inside PG, the per-row supply for a sequential scan is handled by &lt;code&gt;SeqNext&lt;/code&gt;. It just calls the access method that sweeps the heap in order and hands back the next tuple, and in that same step it also judges whether the tuple it read is visible (visibility) under the current transaction's snapshot. The specific rules this visibility judgment follows are covered in Chapter 3 on transactions and MVCC. Here it is enough to know that this judgment happens together with every row the scan sends up.&lt;/p&gt;

&lt;h2&gt;
  
  
  Index scan: take an address and probe the heap one row at a time
&lt;/h2&gt;

&lt;p&gt;When only a tiny fraction of the rows pass the condition, the story changes. Sweeping a ten-million-row table to find a single row, as in &lt;code&gt;SELECT * FROM orders WHERE id = 42&lt;/code&gt;, is wasteful. This is where the index scan comes in.&lt;/p&gt;

&lt;p&gt;To understand the index scan, you first need the idea of a TID. A TID (tuple identifier) is the physical address of a row inside the heap. It is a (page number, offset within the page) pair pointing to which page and which slot the row sits in, and the index's leaf holds exactly this TID alongside the key value. When you follow the index down to find a key, you get a TID saying "the row with this key lives at this heap address."&lt;/p&gt;

&lt;p&gt;So one row of an index scan consists of two actions. First it pulls the next matching TID from the index, then it goes to the heap address that TID points to and reads the actual row. The row it reads is checked for snapshot visibility right there. Because one index key can point to several versions of a row (under MVCC, modifying a row can leave the old and new versions coexisting), the index handing back a TID does not guarantee that row is currently visible. So the visibility check has to happen at the heap, not at the index. The index scan repeats this "one TID from the index, one row from the heap, visibility check" bundle for as many matching keys as there are.&lt;/p&gt;

&lt;p&gt;A weakness hides in this. The TIDs the index hands back follow the index key order, not the physical storage order in the heap. So the path of probing the heap jumps around between pages, becoming random access. With one or two matches it touches only a page or two and there is no problem, but as matches grow into the thousands, it jumps to scattered pages thousands of times, and may even revisit the same page repeatedly. If a page holds several matching rows that come out far apart in index order, that page is read again each time. Because random access costs more per page than sequential access, the advantage of the index scan fades quickly as matches grow. This is where "reading through an index is always faster" breaks down. If the planner's estimate of few matching rows misses and a large portion of the table actually matches, the index scan ends up slower than reading everything sequentially from the start, because it jumps across scattered pages without end.&lt;/p&gt;

&lt;p&gt;In exchange, the index scan has an advantage the sequential scan lacks: it returns rows in index key order. If a query has &lt;code&gt;ORDER BY id&lt;/code&gt; and happens to scan via the &lt;code&gt;id&lt;/code&gt; index, the rows the index scan sends up are already sorted, so there is no need for a separate Sort node on top. This is why a query like &lt;code&gt;SELECT * FROM orders ORDER BY id LIMIT 10&lt;/code&gt; is fast. Instead of gathering everything to sort it, you just pull the first ten rows in index order and you are done.&lt;/p&gt;

&lt;p&gt;A variant of the index scan is the index-only scan. If every column the query needs is contained in the index, the answer can be built by reading only the index tuples, without going to the heap. Visibility still has to be confirmed, but instead of reading the heap directly, it consults the Visibility Map (an auxiliary structure that marks, per page, whether all tuples on that page are visible to everyone). If the page is marked all-visible, it skips the heap visit; otherwise it goes to the heap for just that row.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bitmap scan: collect all addresses, then read pages once in physical order
&lt;/h2&gt;

&lt;p&gt;There is a middle range where the matching rows are too many for an index scan and too few for a sequential scan. Consider finding a single customer's few hundred orders, as in &lt;code&gt;SELECT * FROM orders WHERE customer_id = 7&lt;/code&gt;. A few hundred is too much for a sequential scan that reads all ten million rows, yet probing scattered pages a few hundred times with an index scan, revisiting the same page over and over, is also inefficient. The bitmap scan is for this range.&lt;/p&gt;

&lt;p&gt;The idea behind the bitmap scan is to turn the index scan's random access into something close to sequential. The key is to not read rows one by one as you follow the index, but to first gather all the matching TIDs, then sort those addresses into the heap's page order and read them. That way you read front to back in one direction instead of jumping between pages, and even if a page holds several matching rows, that page is read exactly once.&lt;/p&gt;

&lt;p&gt;Two nodes split this work. The lower BitmapIndexScan sweeps the index and gathers all matching TIDs into a single bitmap. The bitmap is a data structure that marks "there is a match at this slot of this page" with a set bit. Unlike the other scan nodes that send their results up one row at a time, this node sweeps the index to the end in one shot and hands the completed bitmap up whole. The upper BitmapHeapScan takes that bitmap and visits the heap pages its set bits point to in physical order, emitting rows one by one.&lt;/p&gt;

&lt;p&gt;This structure carries one constraint: the size of the bitmap. Recording the exact slot (offset within the page) of every matching row makes the representation precise, but when matches reach into the millions, that bitmap eats too much memory. So when the memory for the bitmap (&lt;code&gt;work_mem&lt;/code&gt;) runs short, PG drops the precision by one level. What was marked per row becomes marked per page, recording only "there is a match somewhere on this page." The first, precise method is called exact, the second, coarse one is called lossy.&lt;/p&gt;

&lt;p&gt;A lossy page comes with a price. Since you only know "there is a match on this page," after reading the page you have to check again whether the rows on it actually pass the condition. This re-check is called recheck. Rows from an exact page need no recheck because the bitmap pinned the exact slot, but rows from a lossy page have the original condition applied once more to filter them. The &lt;code&gt;Heap Blocks: exact=N lossy=M&lt;/code&gt; in EXPLAIN output shows the counts of these two kinds of pages. A large lossy number signals that the bitmap did not fit in memory and gave up precision, and raising &lt;code&gt;work_mem&lt;/code&gt; brings that many pages back to exact.&lt;/p&gt;

&lt;p&gt;Another advantage of the bitmap approach is that it can combine several indexes. If &lt;code&gt;WHERE customer_id = 7 AND status = 'paid'&lt;/code&gt; has a &lt;code&gt;customer_id&lt;/code&gt; index and a &lt;code&gt;status&lt;/code&gt; index each, two BitmapIndexScans can each build a bitmap, and those two can be overlapped with a bitwise AND to keep only the TIDs that satisfy both conditions. Unlike an index scan that narrows by one index and filters the rest at the heap, this merges both conditions at the bitmap stage before going to the heap. For an OR condition, the two bitmaps are unioned.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F1p1u6a5boejnf8s0fk3m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F1p1u6a5boejnf8s0fk3m.png" alt="Heap page visit patterns of the three scans" width="800" height="683"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The figure above compares how the three scans visit the heap when matching rows are scattered across pages 2, 7, and 9 of the same table. The sequential scan reads every page from 1 to 10 in order, regardless of matches. The index scan touches only the matching pages, but following index key order it jumps to 7, 2, 9, and revisits 7 and 2 when matches come out apart (the numbers in the circles are the visit order). The bitmap scan sorts the same matching pages into the physical order 2, 7, 9 and reads each exactly once. The jagged path of the index scan straightened into a one-directional line in the bitmap scan is the heart of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means in practice
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;When an index scan is slow, the culprit is usually a missed estimate and random access.&lt;/strong&gt; If in EXPLAIN ANALYZE the index scan's estimated rows and actual rows diverge sharply with the actual far higher, the planner chose the index scan expecting few matches while many actually matched, and it is jumping across pages without end. Stale statistics behind a missed estimate are a common cause, so run ANALYZE again first to check whether the estimate matches reality. When the estimate and the actual diverge widely and the matching rows are far more than expected, you can set &lt;code&gt;enable_indexscan = off&lt;/code&gt; to force the planner to leave the index scan out of its choices, so you can weigh the cost of a BitmapScan or a SeqScan directly. If the estimate keeps missing even after ANALYZE, that column's data distribution is not being captured well enough by the default statistics target (&lt;code&gt;default_statistics_target&lt;/code&gt;, default 100), and the root fix is to raise that column's statistics target with &lt;code&gt;ALTER TABLE ... ALTER COLUMN ... SET STATISTICS&lt;/code&gt;.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>internals</category>
      <category>executor</category>
    </item>
    <item>
      <title>1.5.1 The Volcano Iterator Model</title>
      <dc:creator>JoongHyuk Shin</dc:creator>
      <pubDate>Sun, 21 Jun 2026 09:12:36 +0000</pubDate>
      <link>https://dev.to/joonghyukshin/151-the-volcano-iterator-model-3ebf</link>
      <guid>https://dev.to/joonghyukshin/151-the-volcano-iterator-model-3ebf</guid>
      <description>&lt;p&gt;In 1.5 we saw the big picture: the executor pulls results up one row at a time. That pulling structure ties together nodes that do completely different jobs, SeqScan, HashJoin, Sort, into a single execution loop. Nodes that do such different things all run in one loop because every node follows the same interface, and that interface also explains what the actual rows and loops in EXPLAIN ANALYZE are.&lt;/p&gt;

&lt;h2&gt;
  
  
  Every node follows the same interface
&lt;/h2&gt;

&lt;p&gt;The executor tree has many kinds of nodes. SeqScan sweeps a table, IndexScan follows an index, HashJoin and NestLoop bring two inputs together, Sort orders rows, HashAgg groups and counts. They all do different things. Yet the executor binds them with a single promise. &lt;strong&gt;A node is called like a function, and each call hands back exactly one result row.&lt;/strong&gt; Call the same node again and it gives you the row right after the one it last returned; when there are no more rows, it returns an empty result to signal the end. The unit that travels between nodes one at a time is the tuple.&lt;/p&gt;

&lt;p&gt;Inside PG this "next row, please" request is a single function called &lt;code&gt;ExecProcNode&lt;/code&gt;. Most nodes (&lt;code&gt;PlanState&lt;/code&gt;) such as scans, joins, and sorts carry this function, so the caller can invoke &lt;code&gt;ExecProcNode&lt;/code&gt; and get the next tuple without knowing whether the other side is a SeqScan or a HashJoin. The calling code does not have to differ per node type. Some nodes (Sort among them) do need to take in all of their input before they can emit the first row. Those nodes still send their output one row at a time; the only difference is that the first row does not come out until all the input has been gathered.&lt;/p&gt;

&lt;p&gt;The name for fitting every operator to the same interface this way is the Volcano model. Goetz Graefe laid it out in 1994: every operator is shaped into three operations, open, next, and close. In PG, open is &lt;code&gt;ExecInitNode&lt;/code&gt;, which prepares a node; next is &lt;code&gt;ExecProcNode&lt;/code&gt;, which emits the next tuple; close is &lt;code&gt;ExecEndNode&lt;/code&gt;, which tears it down. Match those three and any operator slots into the tree.&lt;/p&gt;

&lt;p&gt;Why the uniform interface matters becomes clear if you imagine the opposite. If each node had its own output convention, you would need different glue code to put Sort on top of HashJoin versus Sort on top of SeqScan. With N node types the combinations explode. Unified under one next call, Sort only needs to know "pull one row at a time from whatever child it has," and a new operator only needs one next function implemented to sit on top of any existing node. The executor can assemble nodes in the exact shape of the plan tree because of this promise.&lt;/p&gt;

&lt;h2&gt;
  
  
  One next call goes down to the leaf and back up
&lt;/h2&gt;

&lt;p&gt;A node's next cannot make a tuple on its own, because it needs input. So it calls its child's next, takes one row in, and works it into one result row of its own. That call goes to the child's child, then further down, until it reaches the scan node at the leaf of the tree. The scan node has no child, so it reads a row directly from disk.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fenhrdtcdrynrba7bba5v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fenhrdtcdrynrba7bba5v.png" alt="pull flow through the executor tree" width="800" height="621"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Take &lt;code&gt;SELECT * FROM orders WHERE amount &amp;gt; 100 LIMIT 5&lt;/code&gt;. The execution tree for this query is two nodes: Limit on top, SeqScan below. The WHERE condition does not become a separate node; it is evaluated as a filter inside the SeqScan node. In EXPLAIN you see &lt;code&gt;Filter: (amount &amp;gt; 100)&lt;/code&gt; attached to the SeqScan, with no separate filter node. When Limit's next is called, Limit calls SeqScan's next. SeqScan reads one row from a page, checks right there whether &lt;code&gt;amount&lt;/code&gt; exceeds 100, passes it up if it does, and discards it and reads the next row if it does not. When a passing row reaches Limit, Limit sends it to the caller (the client, or the target table of an INSERT). And once Limit has received five rows, it stops calling SeqScan's next.&lt;/p&gt;

&lt;p&gt;The key is that the flow runs in two directions. &lt;strong&gt;The request (next) goes top to bottom, and the tuple goes bottom to top.&lt;/strong&gt; Rows are not pushed up from below on their own; only when something above pulls a row does exactly that much get drawn up from below. To generalize the way SeqScan just checked &lt;code&gt;amount&lt;/code&gt;, the internal procedure a scan node follows to make one row is this: it fetches one raw tuple through the access method, checks the qual (a filter expression like a WHERE condition, the same idea as the SQL standard's predicate), and if the tuple passes, runs a projection to keep only the needed columns and returns it. A raw tuple that fails the qual is discarded and the next one is fetched. So a single next on a scan node may internally sweep several raw tuples until it finds "the next row that passes the condition."&lt;/p&gt;

&lt;h2&gt;
  
  
  Why pull one row at a time
&lt;/h2&gt;

&lt;p&gt;Compared with building the whole result and stacking it up, the benefits of pull come out.&lt;/p&gt;

&lt;p&gt;First, &lt;strong&gt;memory stays constant regardless of result size, and no intermediate result is written to disk.&lt;/strong&gt; Flowing one row at a time, a node only needs to hold the single tuple it is processing. Whether a query returns ten million rows or ten, the number of tuples any node holds at once is one, and a node reuses its single slot (&lt;code&gt;TupleTableSlot&lt;/code&gt;) instead of allocating a fresh tuple. A row that SeqScan read and passed through the filter also rises all the way to the caller in one go, through the nodes above, so there is no need to build an intermediate table between stages. This processing, where one tuple passes through several nodes in a row, is called pipelining. (A node like Sort or HashJoin that must gather all of its input does use memory inside itself, but that is that node's own business and is different from the whole tree stacking up the result.)&lt;/p&gt;

&lt;p&gt;Second, &lt;strong&gt;stop at the top and the bottom stops too.&lt;/strong&gt; For a query with &lt;code&gt;LIMIT 5&lt;/code&gt;, the topmost Limit node stops calling its child's next after it has received five rows. Then the nodes below it also do only as much work as it took to push up those five rows, and stop. Even if the table has ten million rows, SeqScan reads only as many pages as it takes to pass five rows up. With the build-it-all approach, LIMIT or not, you would compute ten million rows first and then cut to five.&lt;/p&gt;

&lt;p&gt;That said, passing tuples one at a time (tuple-at-a-time) is not every database's answer. A function call happens every time a node is called, and when a single query sweeps hundreds of millions of rows, that call cost alone adds up enough to matter. So engines built mainly for analytical queries took a different road. Analytical databases like ClickHouse and DuckDB use vectorized execution, where each call passes not one row but a batch of thousands. DuckDB processes about 2,048 rows at a time, ClickHouse up to 65,536 rows in one chunk, which divides the call cost across the row count and lets the CPU's SIMD instructions (one instruction computing several values at once) process the batch together. On analytical workloads this difference reaches several times. PostgreSQL stays with tuple-at-a-time because it is a general-purpose engine handling both OLTP queries that touch few rows and large analytical queries in one engine. Flowing one row at a time fits patterns that stop early, like the LIMIT we just saw, and fits transaction processing that runs row by row.&lt;/p&gt;

&lt;h2&gt;
  
  
  A cursor: the client pulls one row at a time
&lt;/h2&gt;

&lt;p&gt;This one-row-at-a-time structure is not confined to the executor's insides. Receiving a large result all at once swells the client's memory by the size of the result, because SELECTing millions of rows dumps all of those millions onto the client. The channel that lets you receive that result in smaller pieces is the cursor. When the client opens a cursor with &lt;code&gt;DECLARE CURSOR&lt;/code&gt; and takes a few rows at a time with &lt;code&gt;FETCH&lt;/code&gt;, the executor calls next from the top of the tree by that many each time a FETCH arrives. It does not build the whole result up front; it pulls only as much as the FETCH asks for, then and there.&lt;/p&gt;

&lt;p&gt;So even a result of millions of rows, taken through a cursor 1,000 at a time, keeps the amount the client holds at once at 1,000. And while you receive through the cursor in pieces, the executor keeps that execution tree and transaction snapshot held as is, so FETCHing across several calls still reads consistently against the data as of when the cursor was opened.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scan a node more than once and loops climbs
&lt;/h2&gt;

&lt;p&gt;So far the picture has been each node swept once, start to finish. In a nested loop join it is different. Every time a row is pulled from the outer input, the entire inner input must be swept again from the start. If the outer has 1,000 rows, the inner is scanned from the beginning 1,000 times. Resetting a node back to its starting state like this is called a rescan, and PG handles it with &lt;code&gt;ExecReScan&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This rescan count surfaces as loops in EXPLAIN ANALYZE output. The executor gathers statistics while a node runs: each time one scan (one loop) finishes, it accumulates the tuples that loop produced and bumps the loop count by one. So the two numbers are defined like this.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;loops&lt;/strong&gt;: how many times the node was scanned from the start. For a nested loop inner, this equals the outer row count. A node scanned only once has loops of 1.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;actual rows&lt;/strong&gt;: the average tuples produced per loop. The total tuples across all loops, divided by loops.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is the point people often get wrong. actual rows is &lt;strong&gt;the per-loop average, not the total&lt;/strong&gt; of tuples the node produced. For example, if a nested loop inner IndexScan shows &lt;code&gt;rows=1.00 loops=1000&lt;/code&gt;, it means the node was scanned 1,000 times and produced 1 row on average each time, so the total tuples it actually produced is 1 times 1,000, about 1,000 rows. When you see a node with a large loops, do not read actual rows alone and conclude "this node only touches one row." You have to multiply rows by loops to get the scale of what the node actually did.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Nested Loop  (actual rows=1000.00 loops=1)
  -&amp;gt;  Seq Scan on orders  (actual rows=1000.00 loops=1)
  -&amp;gt;  Index Scan on customers  (actual rows=1.00 loops=1000)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the output above, the inner Index Scan's &lt;code&gt;loops=1000&lt;/code&gt; means the outer orders had 1,000 rows, so the customers index was re-traversed 1,000 times, and &lt;code&gt;rows=1.00&lt;/code&gt; means each of those 1,000 found 1 row on average. The product, about 1,000 index lookups, is what this node actually did.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means in practice
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;First, &lt;code&gt;LIMIT&lt;/code&gt; can stop early only when there is no Sort.&lt;/strong&gt; A &lt;code&gt;LIMIT&lt;/code&gt; after &lt;code&gt;ORDER BY&lt;/code&gt; cannot emit its first row until the Sort node has received all input and finished sorting, so even if the top stops early, everything below the Sort has already been read. Which plans can stop early and which cannot is decided by the node types.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second, holding a cursor open for a long time carries a cost.&lt;/strong&gt; If you leave a cursor open for a long time because receiving one row at a time feels light, the snapshot survives that much longer, and this leads to the side effects of a long-running transaction (vacuum delay among them, covered in Chapter 3). Close a cursor as soon as you are done reading, and if you are not going to receive the result to the end, do not stretch the transaction out.&lt;br&gt;
&lt;br&gt;
&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>internals</category>
      <category>executor</category>
    </item>
    <item>
      <title>1.5 Executor: How Results Come Back</title>
      <dc:creator>JoongHyuk Shin</dc:creator>
      <pubDate>Sun, 21 Jun 2026 09:11:30 +0000</pubDate>
      <link>https://dev.to/joonghyukshin/15-executor-how-results-come-back-5978</link>
      <guid>https://dev.to/joonghyukshin/15-executor-how-results-come-back-5978</guid>
      <description>&lt;p&gt;By the time 1.4 ends, the planner has produced one PlannedStmt. Inside it is an execution tree built from Plan nodes, frozen into a form you can follow step by step, something like "go into the primary key index on users, fetch the one matching row, then output that whole row." But that is still only a blueprint. Reading actual pages off disk, picking out the rows that match the condition, handing results back to the caller: none of that has happened yet. The stage that takes that blueprint and produces actual rows is the executor.&lt;/p&gt;

&lt;p&gt;The difference between the planner and the executor is the difference between deciding and doing. The planner was the stage that weighed "which index, in what order, with what join method" by cost and &lt;em&gt;chose&lt;/em&gt;. The executor takes the chosen approach and &lt;em&gt;carries it out as is&lt;/em&gt;. There is nothing left to choose. It just runs the nodes baked into the plan tree and pulls rows out of them.&lt;/p&gt;

&lt;p&gt;To run it, the executor takes the Plan tree it received and turns it into a PlanState tree. The Plan tree is the static blueprint the planner made, and it does not change during execution. But to actually run, each node needs state that changes as execution proceeds: which row it is reading now, whether the hash table is fully built, what tuple it has buffered from a child. So when execution begins, a PlanState tree with the exact same shape as the Plan tree is created. The blueprint Plan tree is left untouched, and the running state lives in that PlanState tree instead.&lt;/p&gt;

&lt;p&gt;How the executor produces result rows is the heart of the stage. The executor does not build the entire result set at once and stack it up. Instead, it asks the topmost node of the tree for "the next row," and that request travels down the tree to the leaves. When a leaf scan node reads one row from a page and passes it up to its parent, that row climbs up one level at a time through joins and filters until it reaches the top. The top sends that single row to the caller (the client, or the target table of an INSERT). When the next row is needed, the top is asked for "the next row" again. Rows are pulled from above whenever they are needed, and that pull propagates downward. This structure is called the pull model, or the Volcano iterator model.&lt;/p&gt;

&lt;p&gt;Why pull one row at a time? Building everything at once inflates memory by the size of the result, and even when the client only needs the first 10 rows, as in &lt;code&gt;LIMIT 10&lt;/code&gt;, the whole thing gets computed anyway. Pulling one row at a time lets the top stop after it has received ten rows, and then the nodes beneath it do exactly that much work and no more. This is possible because rows do not flow top to bottom: &lt;em&gt;requests flow top to bottom, and rows flow bottom to top.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the 1.4 planner chapter, the planner decided which scan, join, or aggregation method to use and whether to go parallel, by cost. This chapter looks at how the chosen nodes actually run on top of the pull skeleton.&lt;/p&gt;

&lt;p&gt;This chapter splits into five sections.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1.5.1 The Volcano iterator model&lt;/strong&gt;: a closer look at the "pull one row at a time" skeleton we just saw. Every node returns one next tuple through the same interface, and we explain how that single call ends up driving the whole tree.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1.5.2 Scan nodes: SeqScan, IndexScan, BitmapScan&lt;/strong&gt;: the nodes at the leaves that actually read rows off disk. The conditions under which each of the three scan methods is used.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1.5.3 Join nodes: NestLoop, HashJoin, MergeJoin&lt;/strong&gt;: how the three nodes that bring rows from two children together behave on top of the pull skeleton.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1.5.4 Aggregation: HashAgg, GroupAgg, sort-based&lt;/strong&gt;: the nodes that execute &lt;code&gt;GROUP BY&lt;/code&gt; and aggregate functions. How aggregation, which needs to see everything before it can answer, is handled inside a skeleton that pulls one row at a time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1.5.5 Parallel query: how worker processes cooperate&lt;/strong&gt;: until now, execution has been one backend process driving the tree. Here we look at parallel execution, where several worker processes split a large scan among themselves and run it concurrently. The shared memory and background worker infrastructure that supports this cooperation is covered in detail in Chapter 6.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The executor is the stage that runs the plan as is, not the stage that makes new decisions. The node tree shown in EXPLAIN output is exactly this execution tree, and the actual rows and loops that &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; attaches to each node measure how many rows the node actually produced and how many times it ran. Exactly what those two numbers count is explained in 1.5.1 along with the pull model. Once you get through this chapter, those numbers in EXPLAIN read not as abstract statistics but as traces of real execution.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>internals</category>
      <category>executor</category>
    </item>
    <item>
      <title>1.4.10 Planner Hook: When It Fires, How to Use It</title>
      <dc:creator>JoongHyuk Shin</dc:creator>
      <pubDate>Sun, 21 Jun 2026 09:10:21 +0000</pubDate>
      <link>https://dev.to/joonghyukshin/1410-planner-hook-when-it-fires-how-to-use-it-1nec</link>
      <guid>https://dev.to/joonghyukshin/1410-planner-hook-when-it-fires-how-to-use-it-1nec</guid>
      <description>&lt;p&gt;Everything from 1.4.1 through 1.4.9 happened inside a single function, &lt;code&gt;standard_planner()&lt;/code&gt;. Building paths, costing them, searching for a join order, estimating cardinality from statistics: all of it runs inside that one function. Yet PostgreSQL does not call &lt;code&gt;standard_planner()&lt;/code&gt; directly. It puts another function, &lt;code&gt;planner()&lt;/code&gt;, one step ahead of it, and has &lt;code&gt;planner()&lt;/code&gt; call &lt;code&gt;standard_planner()&lt;/code&gt;. And &lt;code&gt;planner()&lt;/code&gt; can be made to call some other function instead of &lt;code&gt;standard_planner()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That replacement is what the planner hook enables. When pg_stat_statements measures per-query planning time, or pg_hint_plan rewrites a plan according to hints, it all goes through this hook. Let's look at how PostgreSQL provides a way to observe or change planning behavior without touching a single line of the core, and how external code plugs into it.&lt;/p&gt;

&lt;h2&gt;
  
  
  All planner() does is check the hook
&lt;/h2&gt;

&lt;p&gt;The body of &lt;code&gt;planner()&lt;/code&gt; is essentially this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;planner_hook&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;planner_hook&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cursorOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;boundParams&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;standard_planner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cursorOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;boundParams&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;planner_hook&lt;/code&gt; is a global function pointer. Its default value is &lt;code&gt;NULL&lt;/code&gt;, in which case &lt;code&gt;standard_planner()&lt;/code&gt; is called right away. A plain PostgreSQL build always takes this path: &lt;code&gt;planner_hook&lt;/code&gt; is empty, so the incoming query goes straight to &lt;code&gt;standard_planner()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The key here is the type of &lt;code&gt;planner_hook&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;typedef&lt;/span&gt; &lt;span class="n"&gt;PlannedStmt&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;planner_hook_type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Query&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                           &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;query_string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                           &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;cursorOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                           &lt;span class="n"&gt;ParamListInfo&lt;/span&gt; &lt;span class="n"&gt;boundParams&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This signature is identical, down to the character, to that of &lt;code&gt;planner()&lt;/code&gt; and &lt;code&gt;standard_planner()&lt;/code&gt;. It takes the same &lt;code&gt;Query&lt;/code&gt; and returns the same &lt;code&gt;PlannedStmt&lt;/code&gt; (the execution plan). So external code only has to write a planner function matching this type and store its address in &lt;code&gt;planner_hook&lt;/code&gt;. Let's call this function, written by external code to register in &lt;code&gt;planner_hook&lt;/code&gt;, a custom planner function. The moment its address is stored, every planning request enters this custom planner function first, instead of &lt;code&gt;standard_planner()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Leaving a single function pointer empty is a simple device, but it is the core of PostgreSQL's extension model. Rather than editing the core source to insert new behavior, you connect an external function to a pointer the core left empty in advance. This is why you can change behavior while leaving the compiled PostgreSQL binary untouched.&lt;/p&gt;

&lt;h2&gt;
  
  
  A custom planner function delegates to standard_planner
&lt;/h2&gt;

&lt;p&gt;So when and how does external code fill in &lt;code&gt;planner_hook&lt;/code&gt;? When an extension is loaded, PostgreSQL calls its library's &lt;code&gt;_PG_init()&lt;/code&gt; function once. Hook installation almost always happens here. The &lt;code&gt;_PG_init()&lt;/code&gt; of pg_stat_statements ends in two lines.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;prev_planner_hook&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;planner_hook&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;planner_hook&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pgss_planner&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The second line stores the address of the custom planner function &lt;code&gt;pgss_planner&lt;/code&gt; into the global pointer. At a glance this looks like it fully replaces &lt;code&gt;standard_planner()&lt;/code&gt;. But what pg_stat_statements wants to do is not to plan in place of PostgreSQL; it wants to measure the time PostgreSQL spends planning. The plan itself must still be built by PostgreSQL.&lt;/p&gt;

&lt;p&gt;So &lt;code&gt;pgss_planner&lt;/code&gt; is called first, but it hands the actual planning back to &lt;code&gt;standard_planner()&lt;/code&gt;. The order is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pre-work: start a timer and record buffer usage.&lt;/li&gt;
&lt;li&gt;Delegate: call &lt;code&gt;standard_planner()&lt;/code&gt; to get the actual plan.&lt;/li&gt;
&lt;li&gt;Post-work: stop the timer, compute the elapsed time, and store that figure in its own statistics table.&lt;/li&gt;
&lt;li&gt;Return: hand the plan received in step 2 back to the caller unchanged.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It does not touch the plan itself; it only measures time around it. Having the custom planner function run first and then hand the real work to &lt;code&gt;standard_planner()&lt;/code&gt; is the most common shape. If instead a function does not call &lt;code&gt;standard_planner()&lt;/code&gt; and returns a plan it built itself, that amounts to replacing planning altogether. An extension that changes plans, like pg_hint_plan, is closer to this. That said, even hooks that change plans usually adjust the plan &lt;code&gt;standard_planner()&lt;/code&gt; produced rather than building one from scratch; doing the latter is rare.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I once built dynamic row-level security (RLS) on top of planner_hook, going beyond what PostgreSQL's built-in RLS can do. Built-in RLS only accepts a fixed condition expression in a policy, so it cannot carry a condition that changes per query, like one a policy function produces at execution time. So I intercepted the &lt;code&gt;Query&lt;/code&gt; with planner_hook right before planning and injected the dynamic condition there. For a SELECT, I adjusted the &lt;code&gt;Query&lt;/code&gt; before delegating to gate the read path; for an INSERT/UPDATE, I reworked the plan returned after delegating to check the value on the write path. A single hook thus passed through both before and after delegation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A custom planner function wrapping &lt;code&gt;standard_planner()&lt;/code&gt; with pre-work and post-work, as above, is the common case, but not every function has all three parts. Some only touch the &lt;code&gt;Query&lt;/code&gt; before delegating and return the resulting plan as is (pre-work only); some only adjust the plan after delegating (post-work only). What it does depends on the extension's purpose; only the skeleton, with delegation sitting in the middle, stays the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why save prev_planner_hook
&lt;/h2&gt;

&lt;p&gt;The first of those two lines has not been explained yet: &lt;code&gt;prev_planner_hook = planner_hook&lt;/code&gt;. Before storing its own function, it saves the value already there. If only one extension is in use, this line is unnecessary, since the saved value would be &lt;code&gt;NULL&lt;/code&gt; anyway. The line matters when several extensions are loaded.&lt;/p&gt;

&lt;p&gt;Suppose &lt;code&gt;shared_preload_libraries&lt;/code&gt; lists two extensions. Extension A, listed first, loads and stores its function in &lt;code&gt;planner_hook&lt;/code&gt;. The &lt;code&gt;prev&lt;/code&gt; A saved is &lt;code&gt;NULL&lt;/code&gt;. Then extension B loads, and &lt;code&gt;planner_hook&lt;/code&gt; currently holds A's function. When B runs &lt;code&gt;prev_planner_hook = planner_hook&lt;/code&gt;, that captures A's function address, and then &lt;code&gt;planner_hook&lt;/code&gt; gets B's function. If B had skipped this save and just stored its own function, the function A installed would be referenced by nothing and vanish. A's feature would die silently.&lt;/p&gt;

&lt;p&gt;So each custom planner function, when delegating, does not call &lt;code&gt;standard_planner()&lt;/code&gt; directly; it checks the saved &lt;code&gt;prev&lt;/code&gt; first.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prev_planner_hook&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prev_planner_hook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cursorOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;boundParams&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;standard_planner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cursorOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;boundParams&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;prev&lt;/code&gt; exists, it calls that; only when there is none does it call &lt;code&gt;standard_planner()&lt;/code&gt;. This delegation happens at step 2 from the previous section, after the pre-work is done. Not after the extension has finished all its work, but with the delegation sitting between pre-work and post-work.&lt;/p&gt;

&lt;p&gt;The result is that the extensions' functions form a chain.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;planner()  →  extension B's function  →  extension A's function  →  standard_planner()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The call order is the reverse of the install order. Extension B, loaded last, is called first; B calls A through &lt;code&gt;prev&lt;/code&gt;; A, whose &lt;code&gt;prev&lt;/code&gt; is &lt;code&gt;NULL&lt;/code&gt;, calls &lt;code&gt;standard_planner()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Here, B calling A happens in the middle of B's function code (the delegation step). Likewise, A calls &lt;code&gt;standard_planner()&lt;/code&gt; from the middle of its own code. So when &lt;code&gt;standard_planner()&lt;/code&gt; finishes building the plan and returns its value, control comes back to A's function and the post-work code remaining in A runs; when A returns, control goes back to B's function and B's post-work runs. Calls go in the order B → A → &lt;code&gt;standard_planner()&lt;/code&gt;, and returns come out in the reverse order, &lt;code&gt;standard_planner()&lt;/code&gt; → A → B. When a function calls another function from the middle of its code, the rest of the outer function continues after the inner one finishes; this is just how ordinary function calls work. That structure is why the two extensions can each slot in their own pre-work and post-work without colliding.&lt;/p&gt;

&lt;h2&gt;
  
  
  When it fires
&lt;/h2&gt;

&lt;p&gt;Looking at where &lt;code&gt;planner()&lt;/code&gt; is called tells you when the hook fires. &lt;code&gt;planner()&lt;/code&gt; is called by a thin wrapper, &lt;code&gt;pg_plan_query()&lt;/code&gt;, and that wrapper runs whenever an optimizable query needs a plan. So the hook fires every time a query like SELECT/INSERT/UPDATE/DELETE enters planning, one query at a time.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;EXPLAIN&lt;/code&gt; is included here too. &lt;code&gt;EXPLAIN&lt;/code&gt; simply does not execute the query; it still builds the plan the same way, so it passes through the planner hook. This is also why pg_stat_statements captures the planning time of &lt;code&gt;EXPLAIN&lt;/code&gt;-ed queries.&lt;/p&gt;

&lt;p&gt;Prepared statements are an exception, though. While a plan that was built once stays in the plan cache and gets reused, no planning happens, so the hook does not fire either. The hook fires again only when the cache has no plan and one must be built anew, for instance when a generic plan (a plan built once and reused regardless of the parameter values) is first created or a custom plan (a plan rebuilt for each value) is built each time. When a tool that measures planning time via the hook shows "some executions have zero planning time" for a prepared statement, that is not a bug but the cache doing its job.&lt;/p&gt;

&lt;h2&gt;
  
  
  You can choose how deep to intervene
&lt;/h2&gt;

&lt;p&gt;The planner hook intervenes at the very outside of the whole planning process. So it does not see what happens inside; it only receives the incoming &lt;code&gt;Query&lt;/code&gt; and the outgoing &lt;code&gt;PlannedStmt&lt;/code&gt;. But sometimes you want to step into just one intermediate stage of planning. For example, if you only want to add one more path to the scan path candidates of a particular table, taking over all of planning is excessive.&lt;/p&gt;

&lt;p&gt;PostgreSQL therefore provides separate hooks at deeper stages. A hook that fires right after a base table's scan paths are all collected, a hook at the point where join paths are gathered, a hook that replaces the join-order search itself with your own algorithm (the stage where the DP and GEQO from 1.4.5 run), a hook that fires right after the planner reads table information from the catalog so you can add or remove index information, and so on.&lt;/p&gt;

&lt;p&gt;The criterion for choosing is the depth of intervention. Work that only needs the input and output of the whole planning process (timing, plan post-processing) uses planner_hook; work that has to touch an intermediate result of a specific stage uses that stage's hook. The deeper the hook, the more of the planner's internal data structures you need to know, but you do not have to take responsibility for the whole of planning.&lt;/p&gt;

&lt;h2&gt;
  
  
  standard_planner mutates its input
&lt;/h2&gt;

&lt;p&gt;There is a trap that those who write a hook directly can easily fall into. The source comment above &lt;code&gt;planner()&lt;/code&gt; warns about it directly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cm"&gt;/* standard_planner() scribbles on its Query input, so you'd
 * better copy that data structure if you want to plan more than once. */&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;standard_planner()&lt;/code&gt; does not just read the &lt;code&gt;Query&lt;/code&gt; tree it receives; it rewrites it in place. Once a plan has been built, the input &lt;code&gt;Query&lt;/code&gt; is already mutated. If a hook tries to call &lt;code&gt;standard_planner()&lt;/code&gt; twice on the same &lt;code&gt;Query&lt;/code&gt; (say, to plan two ways and compare them), the second call receives already-mutated input. So a hook that needs to plan twice must copy the &lt;code&gt;Query&lt;/code&gt; before handing it to &lt;code&gt;standard_planner()&lt;/code&gt;. If you hit a symptom while writing a hook where "the plan is fine the first time but goes wrong from the second," it is usually a case of reusing the same input without knowing about this mutation.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means in practice
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;First, the more extensions use the planner hook, the more overhead is added per query.&lt;/strong&gt; Turning pg_stat_statements on with &lt;code&gt;track_planning = on&lt;/code&gt; measures planning time for each query, and that measurement is not free. The custom planner function being called first to turn the timer and buffer counters on and off is added to every planning operation. Usually this is negligible, but when you process tens of thousands of short queries per second, this overhead can show. Stacking several extensions that use the planner hook means their functions are called in a chain on every query, so keep this in mind on systems that plan frequently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second, the load order of extensions can affect behavior.&lt;/strong&gt; Hooks are chained in reverse install order, so the order you list them in &lt;code&gt;shared_preload_libraries&lt;/code&gt; determines the order they fire. Among extensions that only measure plans, the order is meaningless; but if you use an extension that modifies plans (like pg_hint_plan) together with one that measures, the order decides who touches the plan first and who measures the result. One thing to check when extensions do not mesh as expected is the listing order in &lt;code&gt;shared_preload_libraries&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third, the planner hook is an entry point you can use for debugging and diagnosis.&lt;/strong&gt; Without touching the core, you can look right there at what &lt;code&gt;Query&lt;/code&gt; a given query comes in as and what &lt;code&gt;PlannedStmt&lt;/code&gt; it goes out as. With SQL you can only see the execution plan via &lt;code&gt;EXPLAIN&lt;/code&gt;, but to inspect the structure of the &lt;code&gt;Query&lt;/code&gt; tree itself, before and after it goes to the planner, the hook is just about the only avenue. A small extension that installs a planner hook and logs the incoming &lt;code&gt;Query&lt;/code&gt; and outgoing &lt;code&gt;PlannedStmt&lt;/code&gt; lets you observe the input and output of planning without recompiling the core.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;When several extensions interleave, or rewrite stages stack up, the &lt;code&gt;Query&lt;/code&gt; tree often ends up changed in ways I did not intend. For those cases I built a monitoring tool that dumps the &lt;code&gt;Query&lt;/code&gt; tree before and after the planner and used it for debugging. Wiring that tool to a single GUC meant I could toggle monitoring by changing the GUC value through SQL alone, with no rebuild and no touching the server. Combining a hook with a GUC lets you observe internal behavior on a live server using nothing but SQL.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>internals</category>
      <category>planner</category>
    </item>
    <item>
      <title>1.4.9 Planner Preprocessing: Subquery Pull-up, Predicate Pushdown, Equivalence Classes</title>
      <dc:creator>JoongHyuk Shin</dc:creator>
      <pubDate>Sun, 21 Jun 2026 09:09:16 +0000</pubDate>
      <link>https://dev.to/joonghyukshin/149-planner-preprocessing-subquery-pull-up-predicate-pushdown-equivalence-classes-20l7</link>
      <guid>https://dev.to/joonghyukshin/149-planner-preprocessing-subquery-pull-up-predicate-pushdown-equivalence-classes-20l7</guid>
      <description>&lt;p&gt;So far we have seen the planner build candidate paths and rank them by cost, choose join methods and join orders, and estimate those costs from statistics. But before any of that cost comparison begins, the planner does something else first. It takes the incoming Query tree and rewrites it once into a shape that is easier to optimize. That rewriting is preprocessing.&lt;/p&gt;

&lt;p&gt;In processing order, preprocessing is the planner's very first step. Yet why it pays off only becomes clear once you understand cost and joins. Why it is cheap to shrink the rows entering a join is explained by join cost, and why it matters to manufacture a missing join condition to avoid a Cartesian product is explained by join order. That is why a step that runs first in processing is the last one we look at in this book.&lt;/p&gt;

&lt;h2&gt;
  
  
  Preprocessing rewrites the Query tree before cost comparison starts
&lt;/h2&gt;

&lt;p&gt;The planner's entry point runs preprocessing once for each Query it receives. The function that gathers this work is &lt;code&gt;subquery_planner&lt;/code&gt;. The name makes it sound like it only deals with subqueries, but it actually does everything that should happen exactly once per Query object. Only after that preparation finishes does the cost-based candidate comparison begin.&lt;/p&gt;

&lt;p&gt;One thing is worth pinning down here. The rewriter from chapter 1.3 also rewrites the Query tree, but its purpose is different from this preprocessing. The rewriter expands views into their contents, injects RLS policies as conditions, and fills in default values for columns omitted from an INSERT. In other words, it settles what the query actually means. Preprocessing does not touch that settled meaning at all. It leaves the rows that will come out exactly as they are, and rearranges only the shape of the tree so the same result can be produced more cheaply. If the rewriter finishes deciding "what to do," preprocessing is the first trimming of "how to do it."&lt;/p&gt;

&lt;p&gt;Why rewrite at all? The Query tree the rewriter hands over still carries the shape the SQL author wrote. People write for readability, not for the planner's convenience. For clarity they wrap parts of the query in subqueries and pile several conditions into one WHERE clause. Leave that shape alone, and the cost and join comparison never sees the better candidates. A condition that only needs to filter one table tags along all the way to the join and inflates the join's input, and with no join condition spelled out in the SQL, the only join order available between two tables is a Cartesian product. Preprocessing tidies up this shape so the cost comparison gets to see all its candidates. The five sections that follow look at that tidying one piece at a time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pull subqueries up into the parent query
&lt;/h2&gt;

&lt;p&gt;The first thing it does is pull subqueries in the FROM clause up into the parent query. Merging a subquery into the parent as a single level is called flattening, or subquery pull-up, in the sense of lifting the subquery upward. It runs in the opposite direction from the predicate pushdown we will see later, which pushes conditions down. Take this query as an example.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The author wrapped the part that selects only active customers in a subquery to keep it readable. If the planner leaves this subquery as is, selecting active customers and joining with orders split into two stages. The inner subquery first produces an intermediate result of "active customers," and the outer query joins that result with orders. The problem is that because these two stages are processed separately, the planner cannot put customers and orders together into its join-order search. customers is trapped inside the subquery while orders sits outside, so the two belong to different processing units.&lt;/p&gt;

&lt;p&gt;Flattening removes this boundary. It pulls the subquery up into the parent query, conceptually turning the query above into this shape.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SQL above is not text that actually gets re-emitted; it just renders the result of the tree transformation in SQL form. Now customers and orders are two tables in the same query, and the planner can put both into its join-order search to compare which one to read first and how to join them. The &lt;code&gt;WHERE status = 'active'&lt;/code&gt; condition also moves outward and becomes a target for the predicate pushdown we will see shortly.&lt;/p&gt;

&lt;p&gt;Not every subquery can be merged. If the subquery contains aggregation (GROUP BY or aggregate functions), window functions, DISTINCT, LIMIT, ordering, or HAVING, it is not merged. These features need the subquery computed as one closed result for its meaning to be preserved. For instance, if the subquery computes per-customer order totals with GROUP BY, mixing and merging it with the outer table breaks the aggregation grouping. So the planner only pulls up simple subqueries that produce the same result either way, and plans the rest as closed units on their own.&lt;/p&gt;

&lt;p&gt;IN and EXISTS subqueries that appear in a WHERE clause or a JOIN condition are handled similarly. Conditions like &lt;code&gt;WHERE customer_id IN (SELECT ...)&lt;/code&gt; or &lt;code&gt;WHERE EXISTS (SELECT ...)&lt;/code&gt; are converted into joins when possible. Converting them lets the planner pick an order and method like any other join and optimize it, instead of checking the subquery separately for each outer row. This internal form of an IN or EXISTS turned into a join is called a semi-join, and it is not SQL syntax. The author does not write &lt;code&gt;SEMI JOIN&lt;/code&gt;; the planner turns the IN or EXISTS into a kind of join internally.&lt;/p&gt;

&lt;h2&gt;
  
  
  Compute the expressions that can be reduced to constants ahead of time
&lt;/h2&gt;

&lt;p&gt;Next the planner scans the expressions in the query and computes ahead of time the parts that can be computed early. This is called constant folding.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;     &lt;span class="c1"&gt;-- as written&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;         &lt;span class="c1"&gt;-- after constant folding&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;2 + 2&lt;/code&gt; is a constant with no reason to be re-added for each row, so it is reduced to &lt;code&gt;4&lt;/code&gt; ahead of time. There are cases where the result is determined even with non-constant parts mixed in. For example, if some view or condition expands into a form like &lt;code&gt;WHERE is_vip OR true&lt;/code&gt; (is_vip being a boolean column for whether a customer is a VIP), it is always true whether is_vip is true or false, so it reduces to &lt;code&gt;WHERE true&lt;/code&gt; and the condition itself disappears. Conversely, if &lt;code&gt;AND false&lt;/code&gt; creeps in, that condition is always false and gets tidied away entirely.&lt;/p&gt;

&lt;p&gt;That said, not every function is computed ahead of time. A function that can return a different value each time it is called must not be precomputed. &lt;code&gt;nextval()&lt;/code&gt;, which advances a sequence every time it is called, is the classic example. PostgreSQL precomputes a function only when it is marked immutable (the property of always returning the same output for the same input); otherwise it reduces only the arguments as far as possible and leaves the function call itself for execution time.&lt;/p&gt;

&lt;p&gt;This step has two effects. One is that the executor stops repeating the same computation for every row. Instead of adding &lt;code&gt;2 + 2&lt;/code&gt; a million times across a million rows, it reduces it to &lt;code&gt;4&lt;/code&gt; once when the plan is built. The other is that with expressions simplified, the transformations that follow recognize conditions more easily. For instance, the equivalence class we will see at the end of this section looks for "column = constant" conditions like &lt;code&gt;region_id = 5&lt;/code&gt; to spread to other tables, and if the expression sits unreduced as &lt;code&gt;region_id = 2 + 3&lt;/code&gt;, it fails to recognize that as a constant comparison and misses the chance to spread it. Constant folding has to reduce &lt;code&gt;2 + 3&lt;/code&gt; to &lt;code&gt;5&lt;/code&gt; ahead of time for the next step to work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Turn an outer join into a simpler inner join
&lt;/h2&gt;

&lt;p&gt;An outer join costs more than an inner join. First, the difference between the two. An inner join produces only rows that have a match in both tables. An outer join, specifically a LEFT JOIN, keeps every row of the left table and, where there is no match on the right, manufactures a row with the right columns filled with NULL. This NULL-filling adds work and narrows the planner's freedom to reorder joins. Yet there are cases where a LEFT JOIN the author wrote produces exactly the same result as an inner join. The planner finds those cases and turns the outer join into a simpler inner join. This does not mean an inner join ranks above an outer join; it means simplifying toward the side without the extra work of NULL-filling.&lt;/p&gt;

&lt;p&gt;The key to the decision is the WHERE condition applied to the finished join. In SQL, WHERE applies to the entire result after FROM and JOIN have been processed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LEFT JOIN was used to keep every customer. But the join result has &lt;code&gt;WHERE o.amount &amp;gt; 100&lt;/code&gt; on it. A row where orders is NULL-filled because there is no matching order has &lt;code&gt;o.amount&lt;/code&gt; as NULL, and &lt;code&gt;NULL &amp;gt; 100&lt;/code&gt; is not true, so that row is dropped by WHERE. In other words, the NULL-filled rows cannot survive into the final result anyway. So there is no reason to do the NULL-filling in the first place. The planner turns this LEFT JOIN into an inner join. Conceptually the query above becomes this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For this reasoning to hold, the WHERE condition has to be the kind that "can never be true once NULL comes in." &lt;code&gt;o.amount &amp;gt; 100&lt;/code&gt; is such a condition. If amount is NULL the comparison cannot be true, so the NULL-filled rows are guaranteed to be filtered out. (PostgreSQL calls an operator that cannot be true on NULL input a strict operator. Most comparison operators, &lt;code&gt;=&lt;/code&gt;, &lt;code&gt;&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;&lt;/code&gt;, and so on, qualify.) If the condition were &lt;code&gt;o.customer_id IS NULL&lt;/code&gt; instead, it is the opposite. &lt;code&gt;customer_id&lt;/code&gt; is the join key, so it is NULL only in rows that were NULL-filled for lack of a match. Letting only those rows through means "select only customers with no orders," which the planner recognizes as an anti-join. Like the semi-join, the anti-join is not SQL syntax but a kind of join internal to the planner.&lt;/p&gt;

&lt;p&gt;The reason simplifying outer joins matters is that the fewer outer joins there are, the more freely the planner can shuffle the join order.&lt;/p&gt;

&lt;h2&gt;
  
  
  Push each condition down to the table it references
&lt;/h2&gt;

&lt;p&gt;When you write several conditions joined by AND in a WHERE clause, they sit bundled together at the top of the query in the SQL text. But each of those conditions actually touches a different table. Some look at only one table, some at two together. The planner pushes each condition down toward the table it actually references, so it applies right where that table is read. This transformation is called predicate pushdown, and internally PostgreSQL calls it qual distribution. qual is short for qualification, PostgreSQL's shorthand for the filter conditions that go into a WHERE or ON.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;c.region_id = 5&lt;/code&gt; references only customers. Treat this condition literally as "a filter on the joined result," and it filters by region only after orders and customers are fully joined. The rows to be discarded were dragged all the way to the join. Instead the planner pushes this condition down to where customers is read. Conceptually it becomes this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;region_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "after" above does not mean the tree actually turns into a subquery; it renders in SQL form the fact that &lt;code&gt;region_id = 5&lt;/code&gt; is applied first where customers is read. customers narrows to only the customers in region 5 first, and only those few customers join with orders.&lt;/p&gt;

&lt;p&gt;The difference is clear in numbers. If there are 100,000 customers nationwide and 5,000 in region 5, then without pushing the condition down, all 100,000 customers are joined with orders and then filtered. Push it down, and only 5,000 enter the join. A join costs less the smaller its input, so the same result comes at far less cost.&lt;/p&gt;

&lt;p&gt;As a result of this distribution, each table's workspace (&lt;code&gt;RelOptInfo&lt;/code&gt;) accumulates a list of conditions that apply only to that table. Each condition goes into the list wrapped in an object called a &lt;code&gt;RestrictInfo&lt;/code&gt;, whose additional contents the earlier section on the path tree covered. A join condition that references two tables together is handled not at one table but at the join, and among those, an equality join condition like &lt;code&gt;o.customer_id = c.id&lt;/code&gt; gets one more level of special handling. That handling is the very last transformation we will see next.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pull out missing conditions by grouping equal values together
&lt;/h2&gt;

&lt;p&gt;The last transformation fills in missing conditions by grouping equality conditions together. When several &lt;code&gt;=&lt;/code&gt; conditions hang off a WHERE clause, the planner gathers them into a group of values known to be equal to each other. This group is called an equivalence class.&lt;/p&gt;

&lt;p&gt;The principle is simple. If &lt;code&gt;a = b&lt;/code&gt; and &lt;code&gt;b = c&lt;/code&gt;, then &lt;code&gt;a = c&lt;/code&gt; is also true. The planner can pull out this &lt;code&gt;a = c&lt;/code&gt;, which is nowhere in the SQL, and use it as a new join condition.&lt;/p&gt;

&lt;p&gt;Why this is useful shows up in join order.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;regions&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region_id&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- conditions written by the author: orders-customers, customers-regions&lt;/span&gt;
&lt;span class="c1"&gt;-- condition pulled out by the planner: o.region_id = r.id  (orders-regions)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The author wrote only the orders-customers condition and the customers-regions condition. There is no condition directly linking orders and regions. If the planner used only the written conditions, then trying to join orders and regions first, with no condition linking them, the only thing it could make is a Cartesian product. A Cartesian product is a join that produces every combination of rows from two tables, so 100,000 orders and 100 regions yield ten million rows. Almost always a disaster, so the planner discards that join order. But if the equivalence class pulls out &lt;code&gt;o.region_id = r.id&lt;/code&gt;, a direct join between orders and regions becomes possible. A join order that would have been discarded comes back as a candidate.&lt;/p&gt;

&lt;p&gt;Something stronger happens when a constant joins the group.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region_id&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- group: {o.region_id, c.region_id, 5}&lt;/span&gt;
&lt;span class="c1"&gt;-- pulled out: o.region_id = 5,  c.region_id = 5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the group becomes &lt;code&gt;{o.region_id, c.region_id, 5}&lt;/code&gt;, the planner applies &lt;code&gt;o.region_id = 5&lt;/code&gt; and &lt;code&gt;c.region_id = 5&lt;/code&gt; right where each table is read. &lt;code&gt;o.region_id = c.region_id&lt;/code&gt; was originally a condition you could only check by joining the two tables, but after the transitivity, each table is filtered by &lt;code&gt;= 5&lt;/code&gt; ahead of time. The predicate pushdown from the previous section ends up applying to both tables on its own. There is no longer any need to check &lt;code&gt;o.region_id = c.region_id&lt;/code&gt; separately at the join.&lt;/p&gt;

&lt;p&gt;This transitivity does not hold everywhere, though. Below an outer join, the NULL-filling makes a derivation like &lt;code&gt;= 5&lt;/code&gt; unsafe. So the planner applies transitivity only within a group of tables inner-joined to each other, and handles the region an outer join creates separately.&lt;/p&gt;

&lt;p&gt;Beyond these five transformations, there are smaller tasks bundled into the same preprocessing step. Turning a MERGE command into a join form, inlining function-form FROM entries into the body where possible, and flattening a simple UNION ALL into one. They all share the same goal of producing a shape the planner can work with.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means in practice
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;First, "why is this condition I wrote here applied over there" is because of preprocessing.&lt;/strong&gt; In EXPLAIN you sometimes see a condition you wrote in WHERE show up in an odd place. It is attached to a Filter or Index Cond on a table scan node rather than at a join, or a condition you never wrote appears. That is the result of predicate pushdown pulling the condition down to a table scan, or an equivalence class pulling out a condition that was not in the SQL. When you read the per-node Filter and Index Cond in EXPLAIN, knowing that the position is not where you wrote it but where the planner pushed or derived it makes the plan much easier to read.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second, wrapping something in a subquery does not change performance.&lt;/strong&gt; Whether you wrap for readability or write it flat, a simple subquery gets merged into the same query by flattening, so the plan is the same. However, a subquery with aggregation, DISTINCT, LIMIT, or ordering is not merged, creating a boundary across which the planner cannot weigh the inside and outside together. If a subquery boundary in a performance-critical query seems to have trapped the plan unintentionally, suspect something inside that blocks flattening. A DISTINCT habitually attached with no effect on the result, or an unnecessary LIMIT, is a common culprit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third, an equality join condition fills in by transitivity even if you write only one side.&lt;/strong&gt; Writing just &lt;code&gt;o.region_id = c.region_id AND c.region_id = r.id&lt;/code&gt; lets the planner pull out &lt;code&gt;o.region_id = r.id&lt;/code&gt;, so you do not need to spell out a join condition for every pair of tables. But this is limited to equality (&lt;code&gt;=&lt;/code&gt;) conditions. Range conditions (&lt;code&gt;o.amount &amp;lt; c.credit&lt;/code&gt;) and inequalities are not propagated. If a join is not linked by equality and a Cartesian product creeps into the plan and slows the query down, that is a sign of a structure transitivity cannot fill. The fix is to link the missing join path directly with an equality condition.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>internals</category>
      <category>planner</category>
    </item>
    <item>
      <title>1.4.8 Statistics: pg_statistic and Selectivity</title>
      <dc:creator>JoongHyuk Shin</dc:creator>
      <pubDate>Sun, 21 Jun 2026 09:08:11 +0000</pubDate>
      <link>https://dev.to/joonghyukshin/148-statistics-pgstatistic-and-selectivity-nm2</link>
      <guid>https://dev.to/joonghyukshin/148-statistics-pgstatistic-and-selectivity-nm2</guid>
      <description>&lt;p&gt;From 1.4.2 through 1.4.5, one number showed up every time the planner computed a cost. Weighing a sequential scan against an index scan, deciding which join method is cheaper, gauging intermediate result sizes while reordering joins: underneath all of it sits an estimate of "how many rows pass this condition." That number is called cardinality.&lt;/p&gt;

&lt;p&gt;So how does the planner know it? It cannot actually count how many of 100,000 rows &lt;code&gt;WHERE country = 'KR'&lt;/code&gt; keeps, not at planning time. Scanning the whole table just to build a plan would already be running the query once. So the planner consults statistics it prepared in advance. Where those statistics live, in what shape, and how cardinality comes out of them is the starting point of the whole cost calculation. If the start is off, every cost and join order downstream is off with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Selectivity Is the Surviving Fraction
&lt;/h2&gt;

&lt;p&gt;The unit the planner works in is not an absolute row count but a fraction. Selectivity, the fraction of all rows that a WHERE clause keeps (between 0 and 1), is that unit. A selectivity of 0.5 means half the rows pass the condition; 0.001 means one row in a thousand survives.&lt;/p&gt;

&lt;p&gt;Cardinality follows directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;estimated rows = selectivity × input rows
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;rows=...&lt;/code&gt; that &lt;code&gt;EXPLAIN&lt;/code&gt; prints is exactly the result of this multiplication. On a 100,000-row table, a condition with selectivity 0.5 shows up as &lt;code&gt;rows=50000&lt;/code&gt;. Every cost calculation the planner does ultimately takes this estimated row count as input. One wrong selectivity throws off the whole cost comparison.&lt;/p&gt;

&lt;p&gt;That leaves one question. What does the planner look at to estimate selectivity?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Planner's Source: pg_statistic
&lt;/h2&gt;

&lt;p&gt;The answer is &lt;code&gt;pg_statistic&lt;/code&gt;, the system catalog that holds per-column statistics. Statistics for each column of a table are stored here. The catalog itself is locked away from ordinary users, so you look at it through &lt;code&gt;pg_stats&lt;/code&gt;, a view that exposes the statistics in human-readable form.&lt;/p&gt;

&lt;p&gt;The statistics stored for one column fall into two kinds. One is a set of fixed fields present regardless of column type; the other is a set of slots filled according to the shape of the data. The fixed fields hold the fraction of NULLs, the number of distinct values, and the average byte width. The slots hold statistics of differing kinds, such as a list of common values or a histogram of the value distribution. The statistics for one column look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pg_statistic (one column)
├─ fixed fields (every column)
│    null_frac     fraction of NULL rows
│    n_distinct    number of distinct values
│    avg_width     average byte width of a value
└─ slots (filled by data shape)
     MCV           common values and their frequencies
     histogram     boundaries that split the distribution
     correlation   value order vs physical storage order
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why a slot structure? Because different data types call for different statistics. A scalar value like an integer or a date benefits from a distribution histogram, but an array column needs something else, like its "most common elements." So PostgreSQL stores, in each slot, a kind code marking what sort of statistic lives there, and code reading the statistics searches by kind code rather than by slot number.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F24fg3o8if3pius5acuow.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F24fg3o8if3pius5acuow.png" alt="How each kind of statistic maps to a query shape" width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Four Statistics Behind Selectivity Estimation
&lt;/h2&gt;

&lt;p&gt;Four kinds of statistics do the actual work in selectivity estimation. Real data makes it click. I set up a 100,000-row table &lt;code&gt;stat_demo&lt;/code&gt;. The &lt;code&gt;country&lt;/code&gt; column has a skewed distribution (KR is half, US a fifth, JP a tenth, and the remaining 20% is spread over a few hundred rarely-used two-letter country codes); &lt;code&gt;amount&lt;/code&gt; is a value spread evenly from 1 to 100,000; and &lt;code&gt;note&lt;/code&gt; is half NULL. After running the statistics-collection command &lt;code&gt;ANALYZE&lt;/code&gt; once, querying the &lt;code&gt;country&lt;/code&gt; column's statistics from &lt;code&gt;pg_stats&lt;/code&gt; gives this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;null_frac&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_distinct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;most_common_vals&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;most_common_freqs&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stats&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;tablename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'stat_demo'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;attname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'country'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;null_frac         | 0
n_distinct        | 598
most_common_vals  | {KR, US, JP, FO, ...}
most_common_freqs | {0.5032, 0.1982, 0.1013, 0.00067, ...}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The four statistics divide up the work like this.&lt;/p&gt;

&lt;p&gt;First, &lt;code&gt;null_frac&lt;/code&gt; is the fraction of rows that are NULL. &lt;code&gt;country&lt;/code&gt; has no NULLs, so it is 0; but the &lt;code&gt;note&lt;/code&gt; column, half of which is NULL, would read about 0.5. The selectivity of &lt;code&gt;WHERE note IS NULL&lt;/code&gt; is exactly that value, which on 100,000 rows estimates about 50,000 rows.&lt;/p&gt;

&lt;p&gt;Second, &lt;code&gt;n_distinct&lt;/code&gt; is how many distinct values there are. If an order-status column held only the three values 'paid', 'shipping', 'delivered', this would be 3. For &lt;code&gt;country&lt;/code&gt; it came out as 598. This value serves as the denominator when estimating a value that is not common in an equality condition, because if values are spread evenly, one value takes up roughly &lt;code&gt;1/n_distinct&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This value is read differently depending on its sign. When positive, it is the count of distinct values, as just described. When negative, it means a multiple of the row count. Querying the &lt;code&gt;amount&lt;/code&gt; column gives -0.53346, which means the number of distinct values is 0.53346 times the row count, that is, about 53,000 on 100,000 rows. The reason for also allowing a negative form is that a table's row count can be updated more often than the statistics are. Pinning the count as an absolute number would make it drift as rows grow, but storing it as a ratio lets the estimate move with the row count.&lt;/p&gt;

&lt;p&gt;Third is &lt;code&gt;most_common_vals&lt;/code&gt; (the most common values, MCV for short). For &lt;code&gt;country&lt;/code&gt; the MCV came out as &lt;code&gt;{KR, US, JP, FO, ...}&lt;/code&gt;. The leading &lt;code&gt;KR&lt;/code&gt;, &lt;code&gt;US&lt;/code&gt;, &lt;code&gt;JP&lt;/code&gt; are the genuinely common values, taking up half, a fifth, and a tenth of the data. The &lt;code&gt;FO&lt;/code&gt; at the end is different. It is one of the rarely-used codes that fill the remaining 20%; those rare codes all have a similar frequency (about 30 rows each), but in drawing the random sample &lt;code&gt;FO&lt;/code&gt; happened to be picked most often among them and slipped onto the end of the MCV list. The fraction each value takes up sits right beside it in &lt;code&gt;most_common_freqs&lt;/code&gt;, in the same order. KR is 0.5032, US is 0.1982. An equality condition on a common value uses this frequency directly.&lt;/p&gt;

&lt;p&gt;Fourth, the histogram is a list of boundary values that split the value distribution into buckets. It does not appear on a categorical value like &lt;code&gt;country&lt;/code&gt;; it appears on a column you can line up by magnitude, like &lt;code&gt;amount&lt;/code&gt;. Querying &lt;code&gt;amount&lt;/code&gt; gives these boundaries in &lt;code&gt;histogram_bounds&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{7, 1049, 2049, 3114, 4167, 5139, ... , 99008, 99995}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With 101 boundaries there are 100 buckets between them. ANALYZE picks the boundaries so each bucket holds 1% of the rows (1,000 rows here). Because &lt;code&gt;amount&lt;/code&gt; is spread evenly, the boundaries land at roughly 1,000 apart. A range condition like &lt;code&gt;amount &amp;gt; 90000&lt;/code&gt; follows this list of boundaries.&lt;/p&gt;

&lt;p&gt;There is a reason MCV and the histogram are kept apart. A few values that are skewed and frequent behave differently from the rest, which are scattered at similar frequencies. The few frequent values are best remembered one frequency at a time, while the rest are best lumped into buckets. So a column can carry both an MCV and a histogram, and in that case the histogram describes only the distribution left over after the MCV values are removed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Equality: Common Values vs Rare Values
&lt;/h2&gt;

&lt;p&gt;The function that computes the selectivity of an equality condition &lt;code&gt;col = constant&lt;/code&gt; (&lt;code&gt;var_eq_const&lt;/code&gt;) treats common values and rare values differently. Same equality, but the estimation path forks.&lt;/p&gt;

&lt;p&gt;If the constant is in the MCV list, the story is simple. ANALYZE already measured that value's frequency, so it is used as is. For &lt;code&gt;WHERE country = 'KR'&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;country='KR'   -&amp;gt; rows=50323   (MCV frequency 0.5032 × 100000)
country='US'   -&amp;gt; rows=19820   (MCV frequency 0.1982 × 100000)
country='JP'   -&amp;gt; rows=10133   (MCV frequency 0.1013 × 100000)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The estimated row count is exactly proportional to the frequency. The more common the value, the more directly the statistics remember its frequency, so the estimate is most trustworthy.&lt;/p&gt;

&lt;p&gt;If the constant is not in the MCV, the estimate takes one more step. How common it is, the planner does not know; but it does know the value sits among "everything that is not a common value." So it subtracts the sum of the common values' frequencies and the NULL fraction from the whole to get "the fraction the rest occupies," and divides that by the number of distinct values left after the common ones. It assumes the remaining values are all spread evenly at similar frequencies. For instance, if the common values take up 80% of the table and the remaining 20% is split among 100 distinct rare values, one rare value's selectivity is taken as 0.2 ÷ 100 = 0.002.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;country='AAA' -&amp;gt; rows=33   (a value not in the MCV)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tracing where this 33 comes from shows the logic of the estimate directly. The frequencies in the MCV add up to about 0.807. Most of it is taken by the common values KR, US, JP (0.5032 + 0.1982 + 0.1013 = 0.8027), with the rest of the rare MCV values adding a little more. There are no NULLs, so the remaining fraction is 1 - 0.807 ≈ 0.193. The number of distinct values left after the common ones is 598 total minus the 9 in the MCV, which is 589. So the selectivity is 0.193 ÷ 589 ≈ 0.00033, and multiplied by 100,000 it comes to about 33. That matches the 33 EXPLAIN printed.&lt;/p&gt;

&lt;p&gt;Here an inherent limit of statistics-based estimation surfaces. &lt;code&gt;AAA&lt;/code&gt; is in fact a value with not a single row in this table. Yet the estimate is 33. The planner does not know the value really is absent. It only has the information "not in the MCV," so it treats a value that does not exist and a value that exists rarely exactly the same, both at 33. Statistics are a summary drawn from a sample, so for a value the summary never captured, the planner can only fill in with the even-spread assumption.&lt;/p&gt;

&lt;p&gt;When there are no statistics at all (a fresh table that has never been analyzed, say), the estimate gets cruder still. The default selectivity for equality is hardwired at 0.005, that is, one in 200. This is the same as assuming there are about 200 distinct values. It is the last-resort guess for when there is nothing to go on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ranges Use the Histogram, ANDs Multiply
&lt;/h2&gt;

&lt;p&gt;A range condition like &lt;code&gt;col &amp;gt; value&lt;/code&gt; or &lt;code&gt;col &amp;lt; value&lt;/code&gt; uses the histogram seen earlier. It finds which bucket the constant falls into, and converts how far into that bucket it sits into a fraction. Since &lt;code&gt;amount&lt;/code&gt;'s boundaries were about 1,000 apart, for &lt;code&gt;amount &amp;gt; 90000&lt;/code&gt; the value 90000 sits just below the eleventh boundary from the end, leaving about 10 buckets above it. That is roughly the top 10%.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;amount &amp;gt; 90000              -&amp;gt; rows=10007   (top ~10%)
amount BETWEEN 25000 AND 75000 -&amp;gt; rows=50071 (middle ~50%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The condition asking for the top 10% estimates about 10,000 rows, and the one asking for the middle half about 50,000. With an even distribution, the fraction and the row count line up naturally.&lt;/p&gt;

&lt;p&gt;When several conditions are joined with AND, the planner multiplies each one's selectivity. Putting &lt;code&gt;country = 'KR'&lt;/code&gt; (0.5032) and &lt;code&gt;amount &amp;gt; 90000&lt;/code&gt; (about 0.10) together gives 0.5032 × 0.10 ≈ 0.050, about 5,000 rows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;country='KR' AND amount&amp;gt;90000 -&amp;gt; rows=5036   (actual 4938)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The estimate and reality nearly agree. But this multiplication hides an assumption: that the two conditions are independent. The example above was accurate because &lt;code&gt;country&lt;/code&gt; and &lt;code&gt;amount&lt;/code&gt; were built to be genuinely unrelated in the synthetic data. In real data, two columns are often entangled, and that is where this multiplication breaks down.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Columns Are Correlated: Extended Statistics
&lt;/h2&gt;

&lt;p&gt;Think of city and zip code. Once the zip code is fixed, the city is effectively determined. Put a condition like &lt;code&gt;city = 'Seoul' AND zipcode = '06236'&lt;/code&gt;, and the planner simply multiplies the two selectivities. But almost every row with zip code 06236 is already in Seoul, so &lt;code&gt;city = 'Seoul'&lt;/code&gt; filters out essentially nothing beyond what the zip code condition already removed. The multiplication, treating the two as unrelated, shaves the fraction twice anyway. The result is a row count far below reality, an underestimate.&lt;/p&gt;

&lt;p&gt;Every statistic seen so far is about a single column, so it cannot catch this entanglement. So PostgreSQL separately offers extended statistics, which collect the correlation between several columns as a group. With &lt;code&gt;CREATE STATISTICS&lt;/code&gt; you name a group of columns, and ANALYZE measures that group's functional dependencies or its group-wise distinct count. The planner then knows &lt;code&gt;city&lt;/code&gt; and &lt;code&gt;zipcode&lt;/code&gt; are effectively one lump and applies the fraction only once.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;STATISTICS&lt;/span&gt; &lt;span class="n"&gt;addr_stat&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dependencies&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;zipcode&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;addresses&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ANALYZE&lt;/span&gt; &lt;span class="n"&gt;addresses&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Extended statistics objects have one interesting property. Unlike an index, they are designed as independent objects, not tied to a particular table.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This independence also shows up in ownership. An index, no matter who creates it, takes its owner from the table's owner; but the owner of a statistics object made with &lt;code&gt;CREATE STATISTICS&lt;/code&gt; is whoever ran the CREATE statement. I saw this difference as an inconsistency, a bug, and proposed a patch to the PostgreSQL community to make a statistics object's owner follow the table's owner like an index does. My reasoning was this: unlike a view, which can be defined over several tables, statistics are built for a single table only, so like an index or a trigger they are strictly bound to that table, and ownership should follow the table. The proposal was not accepted. A statistics object having a different ownership model from an index is by design, and a PG core contributor pointed out the reason. There is a plan to eventually allow statistics defined across several tables, so they were deliberately made into independent objects not bound to any one table.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  ANALYZE: Statistics Come from a Sample
&lt;/h2&gt;

&lt;p&gt;Every statistic seen so far is filled in by &lt;code&gt;ANALYZE&lt;/code&gt;. But ANALYZE does not read the whole table. It draws a random sample (some rows pulled at random from the table) and estimates the statistics from there.&lt;/p&gt;

&lt;p&gt;The sample size is, by default, about 30,000 rows. Precisely, it is 300 times the setting that controls statistics precision (&lt;code&gt;default_statistics_target&lt;/code&gt;, default 100). What is interesting is that this size is almost independent of the total table size. Whether the table has 10,000 rows or a billion, the sample is about 30,000 rows either way.&lt;/p&gt;

&lt;p&gt;This can seem to clash with intuition. Shouldn't a bigger table call for looking at more? An opinion poll lines the intuition up. A national poll asks 1,000 to 2,000 people and produces a result with similar error whether the electorate is 10 million or 40 million. It does not scale the sample up in proportion to the population. The reliability of a sample is set by the size of the sample itself, not by what percentage of the whole it is. ANALYZE is the same. PostgreSQL's sample-size formula follows a 1998 paper (Chaudhuri, Motwani, Narasayya) on how large a sample histogram construction needs, and its conclusion too is that the sample size needed for a given precision barely grows no matter how large the table gets. So rather than growing the sample with the row count, a fixed absolute size gives enough precision.&lt;/p&gt;

&lt;p&gt;That the statistics come from a sample also means they are always an approximation. And they are a snapshot from the moment ANALYZE ran. If the table changes a lot afterward, the statistics go stale. How stale statistics wreck an estimate is clearest seen directly. Starting from 100,000 rows (KR is half) and running ANALYZE, I then inserted 100,000 more KR rows and, without running ANALYZE again, asked the same condition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;After INSERT (before ANALYZE): country='KR' estimate = 99,573   (actual 150,030)
After ANALYZE                : country='KR' estimate = 150,060   (actual 150,030)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Right after the INSERT, the planner does roughly sense that the table grew and raises the total row count to near 200,000. But KR's frequency is still stale at 0.50. The real KR share is now 75% (150,030 / 200,000), yet multiplying by 0.50 underestimates it at about 100,000 rows. Only after ANALYZE remeasures the frequency does an estimate near 150,000 come out. The crux is that selectivity is a fraction: when the fraction is stale, the estimate misses even when the total row count is right.&lt;/p&gt;

&lt;p&gt;Finally, back to why &lt;code&gt;pg_statistic&lt;/code&gt; is locked away from ordinary users. The MCV slot holds actual data values (the most common country codes, the most common salary figures, and so on). The statistics themselves expose part of the data, so an unprivileged user reading the raw catalog leaks information. So the raw &lt;code&gt;pg_statistic&lt;/code&gt; is blocked, and the human-facing &lt;code&gt;pg_stats&lt;/code&gt; view is filtered to show only statistics for columns the reader is allowed to see.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means in practice
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;First, after a big data change, running ANALYZE yourself is the safe move.&lt;/strong&gt; Right after a bulk load, a bulk delete, or a large update, the statistics are stale, and as the example above showed, selectivity being a fraction means the estimate misses even when the total row count is right. Autovacuum will eventually run ANALYZE, but there is a lag until it crosses its threshold. If you have to run a heavy query right after loading, slipping a manual &lt;code&gt;ANALYZE&lt;/code&gt; in between is the better choice. When a query suddenly turns slow, checking whether the relevant column's statistics in &lt;code&gt;pg_stats&lt;/code&gt; match reality is the starting point of diagnosis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second, a query whose estimate is badly off shows its cause when you line up EXPLAIN ANALYZE's estimated and actual row counts.&lt;/strong&gt; If it says &lt;code&gt;rows=33&lt;/code&gt; but 50,000 rows actually came out, that node's selectivity estimate is wrong. And the error does not stop there. When one table's estimate is off, the join sitting on top of it is off by more, and it amplifies as join stages stack up. So finding the node where estimate and actual diverge by more than tenfold usually points at that column's statistics or a correlation between conditions, and the starting point of a slow query is often a single column with skewed statistics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third, when estimates are inaccurate on a very large table, you can raise statistics precision per column.&lt;/strong&gt; With the sample fixed at about 30,000 rows, a large table with very many distinct values or an awkward distribution may have a sample that cannot capture the whole distribution. Raising that column's statistics target makes the sample size and the MCV and histogram sizes grow together, sharpening the estimate. For example, to raise &lt;code&gt;country&lt;/code&gt;'s target from the default 100 to 1000:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;stat_demo&lt;/span&gt; &lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;STATISTICS&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ANALYZE&lt;/span&gt; &lt;span class="n"&gt;stat_demo&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You have to run ANALYZE again after changing the setting for the new target to take effect in the statistics. But ANALYZE cost and planning time grow with it, so it is right to apply this selectively, only to columns where the estimate is actually a problem.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>internals</category>
      <category>planner</category>
    </item>
    <item>
      <title>1.4.7 Parallel Cost: When to Go Parallel and How Many Workers</title>
      <dc:creator>JoongHyuk Shin</dc:creator>
      <pubDate>Sun, 21 Jun 2026 09:07:06 +0000</pubDate>
      <link>https://dev.to/joonghyukshin/147-parallel-cost-when-to-go-parallel-and-how-many-workers-46o6</link>
      <guid>https://dev.to/joonghyukshin/147-parallel-cost-when-to-go-parallel-and-how-many-workers-46o6</guid>
      <description>&lt;p&gt;The planner weighs how to read a table, how to join, and how to aggregate by cost, and picks the cheapest candidate. Parallel query is one more decision layered on top of those candidates: whether one process handles a given scan or join alone, or several processes split the work, and if they split it, among how many. This decision is also made by the planner, not the executor. The planner costs a parallel candidate separately, puts it on the same scale as the serial one, and keeps whichever is cheaper.&lt;/p&gt;

&lt;p&gt;So understanding parallelism comes down to three numbers. &lt;em&gt;When&lt;/em&gt; does the planner consider parallel at all. If it does, &lt;em&gt;how many&lt;/em&gt; workers does it assign. And &lt;em&gt;how much&lt;/em&gt; does the cost drop once the work is split. These three numbers map directly onto the questions you hit in practice: "this table is huge, why isn't it going parallel," and "I added cores, why didn't it get that much faster." How the worker processes actually divide the pages and cooperate is the executor's job, covered in 1.5. Here we look at the numbers the planner uses to decide whether to go parallel in the first place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Small tables are never even candidates
&lt;/h2&gt;

&lt;p&gt;Parallelism is not free. Launching worker processes, setting up shared memory, and having the leader gather their results all carry a fixed cost. When a table is small, that fixed cost outweighs the gain from splitting the work, so reading it alone is actually faster. That is why the planner only considers parallelism once a table crosses a certain size.&lt;/p&gt;

&lt;p&gt;That threshold is &lt;code&gt;min_parallel_table_scan_size&lt;/code&gt;. The default is 8MB, which works out to 1024 pages (PG 18). The planner estimates how many pages a scan will read, and if that estimate falls below this value, it does not build a parallel path at all. This is usually the first reason you stare at an EXPLAIN plan and never see a &lt;code&gt;Parallel Seq Scan&lt;/code&gt;. Index scans have their own threshold, &lt;code&gt;min_parallel_index_scan_size&lt;/code&gt; (default 512KB, 64 pages). When a scan reads both the heap and an index, each threshold is checked on its own.&lt;/p&gt;

&lt;p&gt;Write queries drop out of parallel consideration regardless of size. The SELECT inside an &lt;code&gt;INSERT ... SELECT&lt;/code&gt; or an &lt;code&gt;UPDATE&lt;/code&gt; never goes parallel, no matter how heavy that SELECT is. The parallel portion has to be strictly read-only, and the reason for that restriction is tangled up with how workers share state, which 1.5 covers. Here we just note the fact: a SELECT that comes with a write is not a candidate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Worker count is set by table size
&lt;/h2&gt;

&lt;p&gt;Once the planner decides to go parallel, how many workers does it launch? PostgreSQL sets this from the logarithm of the table size, with base 3 specifically. It starts at one worker and adds one more each time the page count reaches three times the threshold. Since the threshold itself starts at &lt;code&gt;min_parallel_table_scan_size&lt;/code&gt; (8MB), the steps come out like this (PG 18 defaults).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;8MB or more: 1 worker&lt;/li&gt;
&lt;li&gt;24MB or more: 2 workers&lt;/li&gt;
&lt;li&gt;72MB or more: 3 workers&lt;/li&gt;
&lt;li&gt;216MB or more: 4 workers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each time the table triples in size, the worker count goes up by one. It does not double as the size doubles; it grows much more slowly. A 1GB table does not get dozens of workers. The marginal gain from adding workers shrinks on larger tables, so the formula is conservative from the start.&lt;/p&gt;

&lt;p&gt;There is one more cap on top of this: &lt;code&gt;max_parallel_workers_per_gather&lt;/code&gt;, which defaults to 2. Even if the log formula yields 4, this cap of 2 trims the worker count to 2. So with no special configuration, scanning a large table normally gets at most two workers, and together with the leader, which pitches in directly, three processes split the same scan. To use more cores, you have to raise this value. Setting it to 0 turns parallelism off entirely.&lt;/p&gt;

&lt;p&gt;Workers are not a per-query resource. The total number of parallel workers the whole server can run at once is bounded by another cap, &lt;code&gt;max_parallel_workers&lt;/code&gt; (default 8), so when several queries go parallel at the same time, they share this pool. If the pool runs short, the planner may have planned for 2 workers but only 1 actually launches, or none do.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parallelism divides only the CPU cost
&lt;/h2&gt;

&lt;p&gt;Once you have decided to add workers and settled on a count, how is the parallel path's cost computed? The heart of it fits in one sentence. &lt;strong&gt;Parallelism divides only the CPU cost by the worker count; it does not divide the disk cost.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A scan cost breaks into two chunks: the cost of reading pages off disk, and the CPU cost of evaluating the WHERE conditions and processing each row that comes back. When going parallel, PostgreSQL divides only the CPU cost by the worker count. The disk cost stays at its full value.&lt;/p&gt;

&lt;p&gt;There are two reasons the disk cost is not divided. First, no matter how many workers you attach, the total number of pages that must be read does not shrink. If a table is 10,000 pages, all 10,000 still have to come off disk no matter who reads them. Parallelism just lets several workers split those pages among themselves; it does not reduce the page count.&lt;/p&gt;

&lt;p&gt;So would splitting the reads make the disk deliver pages faster? Counter to intuition, it does not. The disk is a single device the workers share, and there is a ceiling on how fast that device hands out pages. On top of that, when one process reads pages in order, the OS prefetches the next ones (read-ahead), so even a single reader already gets close to that device ceiling. Adding readers therefore does not make the disk deliver data N times faster. If anything, workers jumping between different regions can look like scattered access to the disk and cost a little.&lt;/p&gt;

&lt;p&gt;PostgreSQL's cost model takes the conservative view here: it treats the disk cost as having nothing to amortize across workers and leaves it untouched. On fast SSDs parallel reads can give some benefit, but the model assumes that benefit is zero so as not to overvalue parallelism. So what parallelism actually cuts is the CPU side alone.&lt;/p&gt;

&lt;p&gt;That said, the value the CPU cost is divided by is not exactly the worker count. The leader does not just launch workers and sit idle; it runs the same scan itself and produces rows, so the number of actual hands at work is the worker count plus the leader's share. The planner estimates that share as "the leader spends 30% of its time servicing each worker." So the divisor that actually scales the CPU cost comes out like this.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1 worker: 1 + (1 − 0.3) = 1.7&lt;/li&gt;
&lt;li&gt;2 workers: 2 + (1 − 0.6) = 2.4&lt;/li&gt;
&lt;li&gt;3 workers: 3 + (1 − 0.9) = 3.1&lt;/li&gt;
&lt;li&gt;4 workers: 4 + 0 = 4.0&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As workers pile up, the leader spends more time collecting their results and doing the post-processing on top, and less time on the scan itself. So the leader's share shrinks as workers grow, and by 4 workers it is treated as zero. This leader participation is toggled by &lt;code&gt;parallel_leader_participation&lt;/code&gt;, which defaults to on.&lt;/p&gt;

&lt;p&gt;So for parallelism to cut cost meaningfully, the CPU chunk of the scan has to be large. A scan that evaluates a heavy WHERE condition on every row, or runs a large aggregate, has a big CPU chunk, so splitting it pays off. A scan that just reads a lot of pages with light per-row work has most of its cost on the disk side, so attaching workers cuts little.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fixed cost Gather adds
&lt;/h2&gt;

&lt;p&gt;A parallel path does not only have costs that shrink; it also has costs that get added. The &lt;code&gt;Gather&lt;/code&gt; node that collects the workers' results piles on two costs. One is the setup cost of launching workers and preparing shared memory; the other is the communication cost of handing each row a worker produces over to the leader.&lt;/p&gt;

&lt;p&gt;The setup cost is &lt;code&gt;parallel_setup_cost&lt;/code&gt;, which defaults to 1000. Recall that the cost of reading one page in earlier chapters was around 1, which makes 1000 a hefty fixed cost, on par with reading 1000 pages. The communication cost is &lt;code&gt;parallel_tuple_cost&lt;/code&gt;, defaulting to 0.1 per row.&lt;/p&gt;

&lt;p&gt;This setup cost is the decisive reason parallelism loses out on small queries. The fixed 1000 is laid down the same whether the table is small or large, but the CPU cost that gets cut is tiny when the table is small. On a small table the added 1000 outweighs the amount saved, so the parallel path ends up costing more than the serial one. The planner compares their total costs and keeps the cheaper one, so on a small query, even if a parallel candidate is built, it falls behind in the cost comparison. If the size threshold is the first gate, this cost comparison is the second gate above it.&lt;/p&gt;

&lt;p&gt;When result order has to be preserved, &lt;code&gt;GatherMerge&lt;/code&gt; is used instead of &lt;code&gt;Gather&lt;/code&gt;. This happens when an &lt;code&gt;ORDER BY&lt;/code&gt; means each worker's already-sorted output has to be merged without breaking the order. That merge carries the burden of continually comparing the front rows from each worker, so the planner charges &lt;code&gt;GatherMerge&lt;/code&gt; 5% more for communication than &lt;code&gt;Gather&lt;/code&gt;, plus the comparison cost of the merge sort itself. The price of keeping order is reflected directly in the cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparing serial and parallel cost in numbers
&lt;/h2&gt;

&lt;p&gt;Let us put the rules so far onto actual numbers. A sequential scan cost is the sum of a disk term and a CPU term. The disk term multiplies the cost of reading one sequential page (&lt;code&gt;seq_page_cost&lt;/code&gt;, default 1.0) by the page count, and the CPU term multiplies the cost of processing one tuple by the tuple count.&lt;/p&gt;

&lt;p&gt;Take an &lt;code&gt;accounts&lt;/code&gt; table with 1 million rows in 10,000 pages (about 80MB), scanned with &lt;code&gt;WHERE balance &amp;gt; 9000&lt;/code&gt;. Since it is a simple comparison qual, the CPU cost per tuple is &lt;code&gt;cpu_tuple_cost&lt;/code&gt; (0.01) plus the comparison cost &lt;code&gt;cpu_operator_cost&lt;/code&gt; (0.0025), for 0.0125. The cost of the serial sequential scan comes out as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;disk_run_cost = 1.0 × 10000       = 10000
cpu_run_cost  = 0.0125 × 1000000  = 12500
total         ≈ 22500
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now attach 2 workers. 80MB clears the size threshold (8MB), so parallelism becomes a candidate; the log formula yields 3 workers but the default cap of 2 trims it. The divisor for 2 workers is 2.4. The disk term stays put and only the CPU term is divided by 2.4. On top of that, the &lt;code&gt;Gather&lt;/code&gt; that collects the results adds setup of 1000 and a per-row communication cost. If 1000 rows pass the WHERE, the communication cost is &lt;code&gt;parallel_tuple_cost&lt;/code&gt; 0.1 × 1000 = 100.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;disk_run_cost = 1.0 × 10000       = 10000   (unchanged)
cpu_run_cost  = 12500 ÷ 2.4       ≈ 5208    (divided by workers)
Gather setup  = 1000
Gather comm   = 0.1 × 1000         = 100
total         ≈ 16308
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;22500 dropped to 16308, so the planner picks parallel. The roughly 6000 saved comes almost entirely from the CPU term being cut from 12500 to 5208. The disk term of 10000 holds equally on both sides, and the setup of 1000 is added only on the parallel side.&lt;/p&gt;

&lt;p&gt;The numbers flip the other way when a table barely clears the size threshold. With small page and tuple counts, the CPU term is small too, so splitting it among workers only cuts a few hundred, while the setup of 1000 is still laid down in full. The parallel total then ends up larger, so the path clears the size gate but falls behind in the cost comparison. This is the setup cost acting as that second gate, expressed in numbers.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means in practice
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;First, when a large table won't go parallel, check the size threshold and the worker caps first.&lt;/strong&gt; The check order is threefold. First, &lt;code&gt;min_parallel_table_scan_size&lt;/code&gt; (default 8MB). If the table is smaller than this, the planner never even raises parallelism as a candidate. Next, &lt;code&gt;max_parallel_workers_per_gather&lt;/code&gt; (default 2). If this is 0, parallelism is fully off, and to use more cores on an analytical query you raise it per session. Last, the overall pool, &lt;code&gt;max_parallel_workers&lt;/code&gt; (default 8). In a high-concurrency environment where this pool runs short, the query runs with fewer workers than planned. When &lt;code&gt;Workers Planned&lt;/code&gt; (the number the planner intended to launch) and &lt;code&gt;Workers Launched&lt;/code&gt; (the number that actually started) differ in &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt;, this pool shortage is exactly what happened.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second, doubling the workers does not double the speed.&lt;/strong&gt; Parallelism does not speed up in direct proportion to core count. As workers grow, the leader's share of collecting results and post-processing grows, so the effect of each additional worker tapers off, and on top of that there is the fixed setup cost. The planner's cost model itself reflects this diminishing return by shaving the leader's contribution as workers grow. So do not expect raising &lt;code&gt;max_parallel_workers_per_gather&lt;/code&gt; to speed things up by that same ratio; it is better to vary the worker count and actually measure, finding the point where the effect flattens out.&lt;br&gt;
&lt;br&gt;
&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>internals</category>
      <category>planner</category>
    </item>
    <item>
      <title>1.4.6 Aggregate Cost: Choosing HashAgg vs GroupAgg</title>
      <dc:creator>JoongHyuk Shin</dc:creator>
      <pubDate>Sun, 21 Jun 2026 09:06:01 +0000</pubDate>
      <link>https://dev.to/joonghyukshin/146-aggregate-cost-choosing-hashagg-vs-groupagg-457i</link>
      <guid>https://dev.to/joonghyukshin/146-aggregate-cost-choosing-hashagg-vs-groupagg-457i</guid>
      <description>&lt;p&gt;A query like &lt;code&gt;SELECT region, COUNT(*) FROM sales GROUP BY region&lt;/code&gt; folds many rows together, collapsing each group into a single value. This folding of many rows into one is aggregation, and &lt;code&gt;COUNT&lt;/code&gt;, &lt;code&gt;SUM&lt;/code&gt;, &lt;code&gt;AVG&lt;/code&gt; are the familiar examples. PostgreSQL handles aggregation in the execution plan with one of two nodes: &lt;code&gt;HashAggregate&lt;/code&gt; or &lt;code&gt;GroupAggregate&lt;/code&gt;. Both do the same job, grouped aggregation, but they go about it differently, and the planner picks one or the other to nail into the plan tree for a given query. When you see &lt;code&gt;HashAggregate&lt;/code&gt; in one &lt;code&gt;EXPLAIN&lt;/code&gt; and &lt;code&gt;GroupAggregate&lt;/code&gt; in another, that is the result of this choice. What decides it? How the two nodes actually run (how they accumulate state, fill the hash table, and split to disk when memory runs out) is covered in Chapter 1.5. This section looks at the step before that: how the planner compares their costs and picks one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The two strategies have the same total; they differ at startup
&lt;/h2&gt;

&lt;p&gt;The function that computes aggregate cost is &lt;code&gt;cost_agg&lt;/code&gt;. Look inside it and one surprising fact shows up: the sort-based strategy (GroupAggregate) and the hash-based one (HashAggregate) have the &lt;strong&gt;same total cost&lt;/strong&gt;. A source comment states it plainly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;in this cost model, AGG_SORTED and AGG_HASHED have
exactly the same total CPU cost, but AGG_SORTED has lower startup cost.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That makes sense. Both strategies pass every input row through the aggregate function exactly once and emit one result row per group. The total amount of work is the same, so the total cost is the same. So what is the planner comparing? The answer is the rest of that comment: &lt;strong&gt;startup cost&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;As we saw earlier, in &lt;code&gt;EXPLAIN&lt;/code&gt;'s &lt;code&gt;cost=A..B&lt;/code&gt;, the first number A is startup (the cost paid before the first result row appears) and the second number B is total (the cost through the last row). What earlier sections called "cost" and used for comparison was mostly the total, the second number. The two aggregate strategies diverge at the first number, startup.&lt;/p&gt;

&lt;p&gt;GroupAggregate works on sorted input, spots where one group ends, and emits that group's result right away. The first row comes out the instant the first group closes, so the cost to the first result is small. In the cost model, GroupAggregate's startup inherits the input's startup cost almost unchanged. HashAggregate, by contrast, cannot declare any group complete until it has pulled in the &lt;em&gt;entire&lt;/em&gt; input and finished building the hash table, so it cannot emit a single row before then. The cost model therefore loads HashAggregate's startup with the cost of reading the input to the end, that is, the input's full total cost.&lt;/p&gt;

&lt;p&gt;This difference shows up in real &lt;code&gt;EXPLAIN&lt;/code&gt; numbers. Suppose we group a 100,000-row table by one column, and that column happens to have an index, so sorted input arrives for free. Running the same aggregate query both ways gives these costs (about 1,000 groups).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GroupAggregate  (cost=0.29..4138.17 rows=1001)
  -&amp;gt;  Index Only Scan ...  (cost=0.29..3628.16 rows=100000)
HashAggregate   (cost=1943.00..1953.01 rows=1001)
  -&amp;gt;  Seq Scan ...         (cost=0.00..1443.00 rows=100000)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look at the first number, startup. GroupAggregate is &lt;code&gt;0.29&lt;/code&gt;, near zero; HashAggregate is &lt;code&gt;1943&lt;/code&gt;. The index hands over the sort for free, so GroupAggregate can emit a result the moment the first group closes, and its cost to the first result is near zero. HashAggregate has to pull in all 100,000 rows and finish the hash table before the first row appears, so the cost of reading the input to the end, &lt;code&gt;1943&lt;/code&gt;, becomes its startup outright.&lt;/p&gt;

&lt;p&gt;So where do we see that the totals are equal? Take each node's total and subtract the total of the input scan below it, leaving just what the aggregate node added. For GroupAggregate, &lt;code&gt;4138.17 - 3628.16 = 510.01&lt;/code&gt;; for HashAggregate, &lt;code&gt;1953.01 - 1443.00 = 510.01&lt;/code&gt;. The cost the aggregation itself adds is identical down to the penny. The total numbers (4138 and 1953) drift apart not because aggregation costs more in one case, but because GroupAggregate chose a more expensive index scan (3628) over the sequential scan (1443) to get sorted input. Strip away the input and the aggregate work is the same; what differs is &lt;em&gt;when&lt;/em&gt; that work is paid, which is startup.&lt;/p&gt;

&lt;p&gt;This difference actually decides the choice when there is a &lt;code&gt;LIMIT&lt;/code&gt; on top, or a parent node that needs only some of the results. For a query like &lt;code&gt;GROUP BY ... LIMIT 10&lt;/code&gt; that needs only the first few rows, a low-startup GroupAggregate can build just ten groups and stop without processing the whole input. HashAggregate cannot use this advantage: whatever the &lt;code&gt;LIMIT&lt;/code&gt;, its build phase must read the input to the end. Even when the totals are equal, the lower-startup side becomes the cheaper candidate in situations like this.&lt;/p&gt;

&lt;h2&gt;
  
  
  GroupAggregate's startup is low only when the sort is free
&lt;/h2&gt;

&lt;p&gt;But GroupAggregate's low startup hides a precondition: "sorted input." To find group boundaries by comparing adjacent rows, rows of the same group must arrive back to back, which means the input has to be sorted by the grouping key.&lt;/p&gt;

&lt;p&gt;When the input already arrives sorted, this precondition is met for free. If the grouping key has an index and the read goes through it, for instance, the sorted order follows at no extra cost. In that case GroupAggregate's startup really is low.&lt;/p&gt;

&lt;p&gt;The problem is when there is no sorted input. Then the planner places a Sort node under GroupAggregate to sort the input first. Sorting means reading the whole input and lining it up, so its entire cost goes into startup: the first group cannot be emitted until the sort finishes, and the sort cannot finish until the whole input is read. The moment a fresh sort is required, GroupAggregate's startup advantage vanishes. It becomes just like HashAggregate in having to read the input to the end before the first row, with the extra work of sorting on top.&lt;/p&gt;

&lt;p&gt;Take the same query as the sorted case above, but now without an index, so it must sort first. The costs change like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GroupAggregate  (cost=9747.82..10507.83 rows=1001)
  -&amp;gt;  Sort       (cost=9747.82..9997.82 rows=100000)
        -&amp;gt;  Seq Scan ...  (cost=0.00..1443.00 rows=100000)
HashAggregate   (cost=1943.00..1953.01 rows=1001)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GroupAggregate's startup jumped from &lt;code&gt;0.29&lt;/code&gt; to &lt;code&gt;9747&lt;/code&gt;. That is the Sort's cost &lt;code&gt;9747.82&lt;/code&gt; carried up wholesale into startup. The sort raises not just the startup but the total (&lt;code&gt;10507&lt;/code&gt;) too, putting it more than five times above HashAggregate's total (&lt;code&gt;1953&lt;/code&gt;). Unlike the sorted case where the two strategies' aggregate costs matched, the gap opens here because GroupAggregate had to buy a sort it did not have before and take on its full cost. For this query the planner picks HashAggregate without hesitation.&lt;/p&gt;

&lt;p&gt;So the first condition that decides the choice is "is the input already sorted by the grouping key?" If it is, GroupAggregate gets its low startup without paying for a sort, which is favorable; if a fresh sort is needed, HashAggregate becomes attractive by exactly that cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  HashAggregate's risk is the number of groups
&lt;/h2&gt;

&lt;p&gt;HashAggregate skips the sort but takes on a different cost. It has to hold per-group accumulators in a hash table, and as the number of groups grows, that table presses on memory.&lt;/p&gt;

&lt;p&gt;The memory a hash table may use is &lt;code&gt;work_mem&lt;/code&gt; multiplied by &lt;code&gt;hash_mem_multiplier&lt;/code&gt;. &lt;code&gt;work_mem&lt;/code&gt; is how much memory one operation like a sort or a hash can use before spilling to disk, and &lt;code&gt;hash_mem_multiplier&lt;/code&gt; is a factor that raises that limit by a multiple for hash-based operations only (default 2.0). The cost model estimates the number of groups and gauges whether the hash table fits within this limit. If it fits, it finishes in memory with no spill; if it overflows, a spill cost for processing in disk-split chunks is added.&lt;/p&gt;

&lt;p&gt;The key input here is the &lt;strong&gt;group-count estimate&lt;/strong&gt;. The planner estimates how many groups will come out from the grouping key's statistics (how that estimate is made is covered in Section 1.4.8). For a &lt;code&gt;GROUP BY region&lt;/code&gt; with only a few regions, the group count is small, the hash table is light, and HashAggregate comes out cheap. For a &lt;code&gt;SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id&lt;/code&gt; with millions of customers, the group count rivals the input size, the hash table overflows the memory limit, and a disk spill cost attaches.&lt;/p&gt;

&lt;p&gt;When a spill happens, that much cost is added to HashAggregate's total, because the overflowed rows have to be written to disk and read back later, creating disk I/O. That I/O cost is proportional to the amount moved to and from disk, and the cost model charges twice what a sort would for the same amount of disk. Even moving the same amount, HashAggregate's disk access pattern is considered less efficient than a sort's on real hardware. The background is a difference in how the two operations use disk. HashAggregate's spill scatters the overflowed rows across several partition files by the hash of the grouping key, then reads them back partition by partition; a sort, by contrast, reads its sorted runs back in order and merges them. Going back and forth across scattered files tends to be slower on disk than reading in order, so the same page count is charged more heavily for HashAggregate. That said, why it is &lt;em&gt;exactly&lt;/em&gt; twice is not a value derived from theory but an empirical fudge factor, and the source comment offers no precise rationale, calling it only a "generic penalty." In the end, for a query where many groups make a spill likely, this doubled disk cost lifts HashAggregate's total and tips the planner's scale toward the sort-based side.&lt;/p&gt;

&lt;p&gt;So the second condition is "how many groups are there?" Few groups favor HashAggregate by saving the sort; a group count rivaling the input attaches a spill cost and makes the sort-based side safer. Since the group-count estimate drives this judgment, a wrong estimate throws off the choice along with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The planner builds both and picks by cost
&lt;/h2&gt;

&lt;p&gt;The planner does not fix one of the two in advance and then cost it. At the point where grouping is handled, it builds a sort-based path and a hash-based path &lt;em&gt;separately&lt;/em&gt; (&lt;code&gt;add_paths_to_grouping_rel&lt;/code&gt;). The sort-based one is a path with GroupAggregate placed on sorted input (or on top of a Sort), and the hash-based one is a HashAggregate path. Once both are registered as candidates, the standard cost comparison from Section 1.4.1 picks the cheaper one. The startup difference, sort cost, and spill cost seen so far all enter this comparison as numbers and decide the outcome.&lt;/p&gt;

&lt;p&gt;But before costs are even compared, candidate eligibility can already differ. If an aggregate function is not in a form a hash can handle, HashAggregate never makes it onto the candidate list at all. An aggregate written with &lt;code&gt;WITHIN GROUP (ORDER BY ...)&lt;/code&gt;, like &lt;code&gt;percentile_cont&lt;/code&gt; for a median (an ordered-set aggregate), needs the values within a group sorted to produce an answer, but a hash table only gathers a group's values together without sorting them. Such aggregates leave the sort-based strategy as the only candidate.&lt;/p&gt;

&lt;p&gt;Turning &lt;code&gt;enable_hashagg&lt;/code&gt; off removes HashAggregate from the candidates. The interesting part is that, as we will see in Chapter 1.5, even after HashAggregate gained the ability to spill to disk, the planner treats this switch not as a soft condition ("use it only when it fits in memory") but as a hard off-switch. In memory or spilling to disk, off means off.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means in practice
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;First, when the aggregate strategy looks wrong, suspect statistics before indexes.&lt;/strong&gt; The planner picks between HashAggregate and GroupAggregate on a single group-count estimate. If that estimate is far off, say it expects few groups when there are really millions, it may pick HashAggregate only to watch the hash table overflow memory and spill to disk. If an aggregate query suddenly slows down, compare the estimated row count (&lt;code&gt;rows=&lt;/code&gt;) in &lt;code&gt;EXPLAIN&lt;/code&gt; against the actual count from &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt;, and if they diverge sharply, run &lt;code&gt;ANALYZE&lt;/code&gt; to refresh the statistics so the planner can see the group count correctly and reconsider the strategy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second, one index on a frequently grouped column widens the choice of aggregate strategy.&lt;/strong&gt; GroupAggregate's low startup holds only when the input arrives sorted by the grouping key. With an index on that column, the planner can get sorted input without a fresh sort and hold GroupAggregate as a cheap candidate. That sort is also reused by an &lt;code&gt;ORDER BY&lt;/code&gt; above, so a query like &lt;code&gt;GROUP BY region ORDER BY region&lt;/code&gt; finishes grouping and ordering in one go off a single index sort. Knowing which columns you frequently group and order by lets you bake into the design that an index on that column acts directly on aggregate cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third, &lt;code&gt;work_mem&lt;/code&gt; affects both which strategy the aggregate picks and how fast it finishes.&lt;/strong&gt; The cost model charges HashAggregate's spill cost based on whether the hash table fits in &lt;code&gt;work_mem&lt;/code&gt; (times &lt;code&gt;hash_mem_multiplier&lt;/code&gt;). When this value is small, a spill cost is charged heavily even at the same group count, and the planner shies away from HashAggregate; when it is large enough, the planner assumes it finishes in memory without a spill and picks HashAggregate. If a heavy aggregate query is slow from spilling to disk, you can raise &lt;code&gt;work_mem&lt;/code&gt; for that session. But &lt;code&gt;work_mem&lt;/code&gt; is memory taken per connection, and per operation within a plan, so raising it globally by a lot is risky when many connections are active. Raising it per session for heavy aggregate queries only is the safe approach.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>internals</category>
      <category>planner</category>
    </item>
  </channel>
</rss>
