<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: JoongHyuk Shin</title>
    <description>The latest articles on DEV Community by JoongHyuk Shin (@joonghyukshin).</description>
    <link>https://dev.to/joonghyukshin</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3911251%2F63b302ae-0e4c-4fa7-916b-72b2a119b393.png</url>
      <title>DEV Community: JoongHyuk Shin</title>
      <link>https://dev.to/joonghyukshin</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/joonghyukshin"/>
    <language>en</language>
    <item>
      <title>1.4.3 Cost Model: The Index Scan Formula</title>
      <dc:creator>JoongHyuk Shin</dc:creator>
      <pubDate>Thu, 04 Jun 2026 09:23:22 +0000</pubDate>
      <link>https://dev.to/joonghyukshin/143-cost-model-the-index-scan-formula-2005</link>
      <guid>https://dev.to/joonghyukshin/143-cost-model-the-index-scan-formula-2005</guid>
      <description>&lt;p&gt;In 1.4.2, the cost of a sequential scan had two parts: the disk cost of reading every page in the table, and the CPU cost of processing every row read. An index scan reads only the pages it matches, so when the result row count is low it generally reads less and costs less than a sequential scan. The key word is "generally." Even with few result rows, an index scan can lose to a sequential scan. Whether it wins depends not only on how many pages it reads, but also on how those pages are laid out on disk. This section is about how an index scan's cost is computed, and when it loses to a sequential scan.&lt;/p&gt;

&lt;p&gt;The difference between the two starts with what each one reads over the same table. A sequential scan reads every page of the table in order; an index scan uses the index to read only the pages that match.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7foaold8x96c3s46j7k6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7foaold8x96c3s46j7k6.png" alt="Sequential scan vs index scan: reading all pages of a 100-page table versus picking out a single page" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Going by the picture alone, an index scan looks unconditionally cheaper, since it reads less. But there are two traps. First, "only the matched pages" is not always a single page. The picture shows &lt;code&gt;id = 5&lt;/code&gt;, which finds one row, so one page; but when many rows match, that many pages get read. Second, the cost of reading one of those pages is higher than in a sequential scan. Because the pages sit at scattered locations the scan jumps to directly, a random access, it costs up to 4 times what a sequential access (reading neighboring pages back to back) does.&lt;/p&gt;

&lt;p&gt;So it comes down to weighing "the savings from reading fewer pages" against "the penalty of each page costing more," and which way that scale tips is what this section is about.&lt;/p&gt;

&lt;h2&gt;
  
  
  An index scan's cost is the sum of three costs
&lt;/h2&gt;

&lt;p&gt;Fetching one row through an index scan takes two disk accesses. First it descends the index structure to find where in the table the matching row lives, then it reads the table page at that location to pull out the actual row. Each of these two accesses has a cost, and adding the CPU cost of processing the fetched row makes three costs in total.&lt;/p&gt;

&lt;p&gt;The second access (reading the table page) is needed because the query asks for columns the index does not hold. If a query needs only columns that live in the index (for example, &lt;code&gt;SELECT id ... WHERE id &amp;gt; 5&lt;/code&gt; with an index on id), PostgreSQL can skip the table and read only the index, an index-only scan. The conditions under which that is possible (the visibility map) are covered in 2.3.5. This section deals with the ordinary index scan that must also read the table.&lt;/p&gt;

&lt;p&gt;The function that computes index scan cost is &lt;code&gt;cost_index&lt;/code&gt;. For the simplest case, with no join involved and no parallel scan (a mode where several worker processes split a table's pages and read them concurrently), the structure unfolds like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;total_cost = index_access_cost      (① the cost of reading the index)
           + heap_fetch_cost         (② the disk cost of pulling rows from the table)
           + cpu_cost                (③ the CPU cost of processing fetched rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each of the three, line by line:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;① Index access cost&lt;/strong&gt;: the I/O of reading the index pages while descending the structure, plus the CPU cost of examining the index entries there. This value is computed differently for each index type (btree, hash, GiST, and so on), so PostgreSQL leaves it to a per-access-method (AM) callback. We look at this cost separately at the end of the section.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;② Heap fetch cost&lt;/strong&gt;: the disk I/O of reading the table page at the location the index reported ("this row is on table page 47"). Here "heap" is PostgreSQL's term for the body where a table's rows actually live. It has nothing to do with the heap data structure from heapsort or priority queues; it means something closer to "the unsorted area where the table is stored" (as opposed to the index, a separate structure). This cost is the star of the section. Of the three, it is the term that varies most with circumstances, so whether an index scan beats a sequential scan is mostly decided here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;③ CPU cost&lt;/strong&gt;: the per-row processing cost for each row fetched from the heap. On top of the base &lt;code&gt;cpu_tuple_cost&lt;/code&gt;, it adds the cost of any condition the index could not filter and the table must recheck. For example, with &lt;code&gt;WHERE id &amp;gt; 5 AND name LIKE '%kim%'&lt;/code&gt; and an index only on id, the index filters &lt;code&gt;id &amp;gt; 5&lt;/code&gt; but cannot judge &lt;code&gt;name LIKE '%kim%'&lt;/code&gt; on its own. So &lt;code&gt;name&lt;/code&gt; has to be rechecked on each row fetched from the heap, and that recheck cost lands here. This value is multiplied by the number of rows fetched. It is the same structure as the sequential scan's CPU cost.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Seeing which stage of the actual flow each of the three costs attaches to:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmbbgn1splx02i6bjv9yh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmbbgn1splx02i6bjv9yh.png" alt="The three stages of an index scan: descend the index (B-tree) to a single leaf (①), read the one heap page it points to and fetch the row (②), then process the fetched row (③)" width="800" height="546"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;① and ③ are similar in character to the sequential scan's cost terms. The real difference is in ②, so we start there.&lt;/p&gt;

&lt;h2&gt;
  
  
  Heap fetch cost is decided by correlation
&lt;/h2&gt;

&lt;p&gt;Once the index reports a matching row's location, PostgreSQL must read the table page at that location. The key question is "how scattered are those pages on disk?" Reading neighbors in order costs &lt;code&gt;seq_page_cost&lt;/code&gt; (1.0) per page; jumping around scattered locations costs &lt;code&gt;random_page_cost&lt;/code&gt; (4.0). For the same number of pages, the cost can differ by 4 times depending on how they are laid out.&lt;/p&gt;

&lt;p&gt;This "how well the order the index points to matches the order the rows are actually stored on disk" is called correlation. First, note that the table (heap) is not itself sorted. Rows are generally piled onto disk in the order they were &lt;em&gt;inserted&lt;/em&gt;. So correlation is not about "is the table sorted" but about "how well, by accident or by design, the insertion order lines up with the index column's order." Two extremes make it clear.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Perfect correlation&lt;/strong&gt;: insertion order matches the index column's order. For example, a table where an auto-incrementing id was inserted in order has small-id rows on early pages and large-id rows on later pages. Following the id index, the table pages come out in order: page 1, page 2, page 3. Only the first page is a random access; the rest are all sequential.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero correlation&lt;/strong&gt;: insertion order is unrelated to the index column's order. A table where id values were inserted in random order, or where updates and deletes have scattered the rows, looks like this. Following the id index, the table pages come out scattered, like page 47, page 3, page 91, so every page read is a random access.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;cost_index&lt;/code&gt; computes the cost at each of these two extremes, then settles on a value between them according to the actual correlation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;min_IO_cost (perfect correlation)  = random_page_cost + (pages_fetched - 1) × seq_page_cost
max_IO_cost (zero correlation)     = pages_fetched × random_page_cost

csquared    = correlation²
heap_fetch_cost = max_IO_cost + csquared × (min_IO_cost - max_IO_cost)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reading the formula: when correlation is 0, csquared is 0 too, so &lt;code&gt;max_IO_cost&lt;/code&gt; (all random) is used as is; when correlation is 1, csquared is 1, so it becomes &lt;code&gt;min_IO_cost&lt;/code&gt; (nearly sequential). The value in between is set by the square of the correlation. The higher the correlation, the closer it gets to &lt;code&gt;min_IO_cost&lt;/code&gt;. The square is used to match the observation that even a slight drop in correlation produces quite a few page jumps in practice. A comment in the PostgreSQL source even leaves a question mark over whether this squared scheme is appropriate.&lt;/p&gt;

&lt;p&gt;One thing about the zero-correlation &lt;code&gt;pages_fetched&lt;/code&gt; (the number of table pages to read): it is not simply "the number of result rows." When rows are randomly scattered across many pages, even fetching a few rows touches many pages. PostgreSQL estimates this page count with a formula proposed by Mackert and Lohman in 1989. In short, it takes the total page count of the table, the number of rows to fetch, and the number of pages that can fit in the memory cache (&lt;code&gt;effective_cache_size&lt;/code&gt;), and produces "the actual number of disk page fetches, accounting for cache effects." The more rows fetched, the closer it converges to the table's page count.&lt;/p&gt;

&lt;h2&gt;
  
  
  Same selectivity, different cost
&lt;/h2&gt;

&lt;p&gt;To see how this computation decides actual plan selection, here is a concrete example. Suppose an &lt;code&gt;accounts&lt;/code&gt; table has 10,000 rows stored across 100 pages on disk, with a unique index on the id column.&lt;/p&gt;

&lt;p&gt;First, a &lt;strong&gt;single-row lookup&lt;/strong&gt; &lt;code&gt;WHERE id = 5&lt;/code&gt;. The selectivity (the fraction of all rows that the WHERE clause keeps) is 1/10000, and the result is 1 row. Finding one row through the index means the table page holding it is just 1 page. With only one page, correlation is irrelevant: it is one random access either way.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;② heap_fetch_cost ≈ 4.0          (1 page × random_page_cost)
① index_access_cost ≈ 4          (a few index leaf pages + tree descent)
③ cpu_cost ≈ 0.01                (process 1 row)
total ≈ 8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same query run as a sequential scan is &lt;code&gt;total ≈ 225&lt;/code&gt; (read all 100 pages and check the id of all 10,000 rows). The index scan is about 8. For a single-row lookup with an extremely low result count, the index scan is overwhelmingly cheaper. This is half the answer to the question posed earlier: when the result count is low, the index scan's heap fetch touches only a few pages, so it is cheaper than the sequential scan.&lt;/p&gt;

&lt;p&gt;But there is another half. Consider a &lt;strong&gt;range query&lt;/strong&gt; &lt;code&gt;WHERE id BETWEEN 1 AND 2000&lt;/code&gt;. Selectivity is 0.2, and the result is 2,000 rows. This time correlation decides the cost.&lt;/p&gt;

&lt;p&gt;When the table was loaded in id order so that &lt;strong&gt;correlation is near 1&lt;/strong&gt;, the 2,000 rows the index points to are clustered on the first 20 pages of the table. &lt;code&gt;min_IO_cost&lt;/code&gt; applies: only the first page is random, the other 19 are sequential.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;② heap_fetch_cost ≈ 4.0 + 19 × 1.0 = 23      (correlation ≈ 1)
③ cpu_cost ≈ 0.01 × 2000 = 20
total ≈ 23 + 20 + (index cost) ≈ 50 or so
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the same query, when the table was loaded in random order so that &lt;strong&gt;correlation is near 0&lt;/strong&gt;, those 2,000 rows are scattered across all 100 pages. Fetching 2,000 rows at random effectively touches all 100 pages of the table, each a random access.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;② heap_fetch_cost ≈ 100 × 4.0 = 400          (correlation ≈ 0)
③ cpu_cost ≈ 0.01 × 2000 = 20
total ≈ 400 + 20 + (index cost) ≈ 420 or more
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run as a sequential scan, this range query costs about 250 (&lt;code&gt;BETWEEN&lt;/code&gt; has two conditions, so each row gets one more comparison, costing slightly more CPU than the single-row lookup's 225). Compared against this 250, the same range query, depending on correlation, either wins big over the sequential scan (about 50) or loses big to it (about 420). Selectivity is 0.2 in both cases, yet the result is the opposite. This is the other half of the question. Even with good selectivity, bad correlation can make an index scan lose to a sequential scan, because reading the whole table by random access is more expensive than sweeping it once sequentially.&lt;/p&gt;

&lt;p&gt;The planner shows the result of this decision through &lt;code&gt;EXPLAIN&lt;/code&gt;. In the second, bad-correlation case, the planner compares the index scan candidate (about 420) against the sequential scan candidate (about 250) and picks the cheaper sequential scan, printing this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Seq Scan on accounts  (cost=0.00..250.00 rows=2000 width=...)
  Filter: ((id &amp;gt;= 1) AND (id &amp;lt;= 2000))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Had correlation been good, the same query would resolve to an index scan and an &lt;code&gt;Index Scan&lt;/code&gt; node would appear. The point is that the planner does not use the index just because it exists; it is the result of a cost comparison that factors in correlation.&lt;/p&gt;

&lt;p&gt;Look at a real workload and this pattern shows up often. A log-style table that accumulates in time order has naturally high correlation on its timestamp column, so time-range queries resolve nicely to index scans. A table where random updates and inserts are mixed in has low correlation, and you see the planner drop to a sequential scan even for a range query of the same selectivity. The root of the common puzzle "I added an index but it isn't being used" is usually right here.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reading the index is not free either
&lt;/h2&gt;

&lt;p&gt;So far we have focused on ② (heap fetch). Lastly, ①, the cost of reading the index itself. As mentioned, this computation is handled by a per-type callback, and most of them, btree included, go through the shared function &lt;code&gt;genericcostestimate&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Index access cost also splits into page I/O and CPU.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Index page I/O&lt;/strong&gt;: what the index mostly reads are leaf pages. A leaf page is the bottom layer of the index tree, where the actual key values and the location info ("the row for this key is on table page N") are stored. The more keys match, the more leaf pages are read, and that page count is multiplied by &lt;code&gt;random_page_cost&lt;/code&gt;. The root and middle-layer pages, being at the top of the tree, are few and used so often that they are almost always sitting in the memory cache, so the cost calculation ignores them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Index CPU&lt;/strong&gt;: for each index entry visited, it adds &lt;code&gt;cpu_index_tuple_cost&lt;/code&gt; (0.005) and the qual operator cost (&lt;code&gt;cpu_operator_cost&lt;/code&gt;, 0.0025) and multiplies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Where does the correlation value that drove ②'s heap fetch cost come from? That value too is filled in by the per-AM callback. &lt;code&gt;genericcostestimate&lt;/code&gt; here is a &lt;em&gt;shared computation routine&lt;/em&gt; not tied to any particular index type. Each AM's callback (btree, hash, GiST, and so on) calls this shared routine to compute the common part of the index cost, then adds its own type-specific values on top if needed. This shared routine leaves correlation at 0 to begin with. Being a generic computation that does not look at column statistics, it has no way to know the real correlation, so it sets the worst case (all random) as the default. Btree overwrites that with the real value: it reads the actual correlation that ANALYZE recorded in &lt;code&gt;pg_statistic&lt;/code&gt; (the catalog holding per-column statistics) and passes it along (as is for a single-column index, multiplied by 0.75 to stay conservative for a multi-column one). So ②'s correlation adjustment really works only for indexes like btree that fill in the real value. An index that leaves correlation at 0 always has its heap fetch priced at &lt;code&gt;max_IO_cost&lt;/code&gt; (all random).&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means in practice
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;First, if an index isn't being used, suspect correlation, not just selectivity.&lt;/strong&gt; A common misconception is "few matching rows always means the index is used." As shown above, for a range query that returns not-so-few rows (say 20% of the table), low correlation inflates the index scan's heap fetch into the cost of reading the whole table at random, and it loses to a sequential scan. The &lt;code&gt;correlation&lt;/code&gt; column of the &lt;code&gt;pg_stats&lt;/code&gt; view lets you check a column's correlation directly. Close to 1 favors an index scan; close to 0 hurts on range queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second, distinguish single-row lookups from range queries first, because their cost structures differ.&lt;/strong&gt; A single row (unique lookup) touches just 1 heap page, so correlation is meaningless and an index scan almost always wins. A range query touches many pages, so correlation and result row count work together. Sorting out which of the two your target query is tells you right away whether correlation is something to worry about.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third, you can revive an index scan for a frequently range-queried column by adjusting load order.&lt;/strong&gt; Because correlation is determined by physical load order. The same data and the same index can have different correlation depending on the order they are written to disk. A table that accumulates in time order has naturally high correlation on its timestamp. If you want to raise the correlation of a frequently range-queried column, the &lt;code&gt;CLUSTER&lt;/code&gt; command physically reorders the table into that index's order. Note that &lt;code&gt;CLUSTER&lt;/code&gt; only sorts once, and later updates and inserts scatter it again, so it has to be re-run periodically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fourth, tuning &lt;code&gt;random_page_cost&lt;/code&gt; on SSDs acts directly on index scans.&lt;/strong&gt; Since the heap fetch's &lt;code&gt;max_IO_cost&lt;/code&gt; is multiplied by &lt;code&gt;random_page_cost&lt;/code&gt;, lowering this value from 4.0 to 1.1 sharply drops the cost of uncorrelated index scans. Random access on an SSD is not as expensive as on an HDD, so leaving the default at 4.0 makes bad-correlation index scans look unfairly expensive and the planner often falls back to sequential scans. This is why lowering the value to revive index scan opportunities is a common tuning move in SSD-based production.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>internals</category>
      <category>performance</category>
    </item>
    <item>
      <title>1.4.2 Cost Model: Sequential Scan Formula</title>
      <dc:creator>JoongHyuk Shin</dc:creator>
      <pubDate>Wed, 03 Jun 2026 06:45:49 +0000</pubDate>
      <link>https://dev.to/joonghyukshin/142-cost-model-sequential-scan-formula-lg5</link>
      <guid>https://dev.to/joonghyukshin/142-cost-model-sequential-scan-formula-lg5</guid>
      <description>&lt;p&gt;In 1.4.1, we saw the planner build multiple candidate paths over the same table and pick the one with the lowest cost to crystallize into a plan. Each candidate is represented by the &lt;code&gt;Path&lt;/code&gt; data structure, and each &lt;code&gt;Path&lt;/code&gt; carries a single number called &lt;code&gt;total_cost&lt;/code&gt;. The planner compares candidates against this number. Where does this number come from? As the 1.4 chapter intro touched on, PostgreSQL has a &lt;em&gt;cost model&lt;/em&gt;. It converts physical cost estimates (the cost of reading one disk page, the cost of processing one tuple, and so on) into arbitrary unit numbers, then sums these units per candidate path to produce &lt;code&gt;total_cost&lt;/code&gt;. The rest of chapter 1.4 unpacks how this cost model computes costs for each kind of candidate. We start with the simplest one: a &lt;em&gt;sequential scan&lt;/em&gt;, which reads a table from its first page to its last in order, examining every row.&lt;/p&gt;

&lt;p&gt;Two units carry the cost throughout this section, and they are worth pinning down first: the page and the tuple.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;page&lt;/strong&gt; is the fixed-size block PostgreSQL uses to store and read a table on disk. The default size is 8KB. The key point is that disk I/O happens in units of pages, not individual rows. Even if you need just one row, PostgreSQL reads the entire page that holds it into memory. That is why, in the cost model, disk cost is counted as "how many pages do we read", not "how many rows".&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;tuple&lt;/strong&gt; is one row as stored inside a page. "row" and "tuple" point at nearly the same thing from different angles: a row is the &lt;em&gt;logical&lt;/em&gt; row, the thing a query returns or a WHERE clause filters, while a tuple is the &lt;em&gt;physical&lt;/em&gt; form of that row as it sits inside a disk page, PostgreSQL's term for it. The cost model bothers to distinguish them because disk cost is counted in pages while CPU cost is counted in the number of tuples processed. The internal layout of pages and tuples (how tuples are arranged within a page, and so on) is covered in detail in chapter 2; here we only carry forward that the unit of disk reads is the page and the unit of CPU processing is the tuple.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost is a ratio, not an absolute value
&lt;/h2&gt;

&lt;p&gt;PostgreSQL's &lt;code&gt;cost&lt;/code&gt; has no units. It's not seconds, not milliseconds. If someone seeing &lt;code&gt;EXPLAIN&lt;/code&gt;'s &lt;code&gt;cost=0.00..205.00&lt;/code&gt; for the first time asks "does this mean 0.2 seconds?", that intuition misses. The number itself is in arbitrary units, and meaning comes only from &lt;em&gt;ratios between candidates&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The reference point is &lt;code&gt;seq_page_cost = 1.0&lt;/code&gt;. The cost of reading one sequential page is fixed at 1, and every other cost constant is defined as a ratio to this. Current PG 18 defaults are:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;GUC&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;seq_page_cost&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1.0&lt;/td&gt;
&lt;td&gt;Cost of fetching one sequential page. &lt;strong&gt;Reference unit&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;random_page_cost&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4.0&lt;/td&gt;
&lt;td&gt;Cost of fetching one non-sequential (random) page&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cpu_tuple_cost&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;CPU cost of the executor processing one tuple&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cpu_index_tuple_cost&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.005&lt;/td&gt;
&lt;td&gt;CPU cost of processing one index tuple&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cpu_operator_cost&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.0025&lt;/td&gt;
&lt;td&gt;CPU cost of one operator or function call&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;parallel_tuple_cost&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;Cost of a parallel worker passing one tuple to the leader&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;parallel_setup_cost&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1000.0&lt;/td&gt;
&lt;td&gt;Cost of setting up shared memory at the start of a parallel scan&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two things stand out from this table. First, reading one page costs 100 times more than processing one tuple on CPU, at default values. Second, a random page costs 4 times a sequential page.&lt;/p&gt;

&lt;p&gt;What's the difference between "sequential" and "random"? Sequential page reading goes through page 1, page 2, page 3 in order, picking up adjacent pages one after another. Random page reading jumps around: page 1, then page 47, then page 12, where positions are scattered. On an HDD, the moving head takes time to seek to a different location, so random reads are much slower. On an SSD, with no moving head, the gap between the two narrows.&lt;/p&gt;

&lt;p&gt;Another fact is baked into the ratio. The OS kernel reads ahead: when it fetches a page from disk, it guesses that the next pages will be needed soon and pulls them into memory in advance. This is called read-ahead, or prefetch. Sequential reading benefits from this guess because the next page really is the next page. Random reading can't benefit because there's no way to know which page comes next. The PG default ratio &lt;code&gt;seq : random = 1 : 4&lt;/code&gt; reflects an environment of HDD plus kernel read-ahead.&lt;/p&gt;

&lt;p&gt;Why ratios instead of absolute values? Pin down absolute values, and environmental dependency gets baked into the model. Measure millisecond values on HDD, and SSD readings become inaccurate. A small table that fits entirely in memory is yet another case. Arbitrary-unit ratios keep the model environment-neutral, and operators adjust the ratios to match their environment. In SSD environments, dropping &lt;code&gt;random_page_cost = 1.1&lt;/code&gt; to nudge the planner toward index scans is a common tuning, and that's possible precisely because the model uses ratios.&lt;/p&gt;

&lt;p&gt;In other engine codebases, I've seen attempts to express cost in actual millisecond units. The result is accurate but every constant has to be re-measured whenever the environment changes. PostgreSQL's arbitrary-unit ratio model gives up that precision in exchange for &lt;em&gt;tunability&lt;/em&gt; and &lt;em&gt;environment independence&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The formula is a sum of two axes
&lt;/h2&gt;

&lt;p&gt;The function that computes sequential scan cost is &lt;code&gt;cost_seqscan&lt;/code&gt;. We start from the simplest case: no join is involved (so no variable from another table is bound to the path; this case is called a &lt;em&gt;non-parameterized path&lt;/em&gt;, and parameterized paths get full treatment in 1.4.5 on joins), and no &lt;em&gt;parallel scan&lt;/em&gt; either (a mode where multiple worker processes split up reading the table's pages concurrently). The formula for this simple case reads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;disk_run_cost = seq_page_cost × pages
cpu_run_cost  = (cpu_tuple_cost + qual.per_tuple) × tuples
              + pathtarget.per_tuple × rows

startup_cost  = qual.startup + pathtarget.startup
total_cost    = startup_cost + cpu_run_cost + disk_run_cost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Walking through the variables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;pages&lt;/code&gt;, &lt;code&gt;tuples&lt;/code&gt;: the estimated number of pages this table occupies on disk and the estimated number of rows contained in them. Where these come from is the next section.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;qual&lt;/code&gt;: the cost of evaluating the WHERE clause. PG separates qual cost into two parts. &lt;code&gt;startup&lt;/code&gt; is the one-time setup cost incurred once. &lt;code&gt;per_tuple&lt;/code&gt; is the cost of re-evaluating qual for each tuple. A simple comparison qual like &lt;code&gt;id = 5&lt;/code&gt; usually has only a small &lt;code&gt;per_tuple&lt;/code&gt; value, with &lt;code&gt;startup&lt;/code&gt; essentially zero.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pathtarget&lt;/code&gt;: the cost of evaluating the SELECT clause expressions. For &lt;code&gt;SELECT a, b * 2 FROM ...&lt;/code&gt;, this means the cost of evaluating &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;b * 2&lt;/code&gt;. Like qual, it splits into &lt;code&gt;startup&lt;/code&gt; and &lt;code&gt;per_tuple&lt;/code&gt;. With only simple column references it's near zero. Complex function calls drive &lt;code&gt;per_tuple&lt;/code&gt; up.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;rows&lt;/code&gt;: the estimated row count after applying WHERE. This value gets multiplied with &lt;code&gt;pathtarget&lt;/code&gt;. Tuples filtered out by WHERE don't need SELECT expression evaluation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Drawing the two cost axes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;total_cost
=
┌──────────── disk_run_cost ────────────┬─────────────── cpu_run_cost ───────────────┐
│  seq_page_cost  ×  pages              │ (cpu_tuple_cost + qual.per_tuple) × tuples │
│                                       │  + pathtarget.per_tuple × rows             │
└───────────────────────────────────────┴────────────────────────────────────────────┘
                                                        + startup_cost
                                                          (usually near zero)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why split into two axes? Because the two costs consume &lt;em&gt;physically different resources&lt;/em&gt;. Reading one page is disk I/O. Processing one tuple is CPU. The same row estimate carries different cost weights depending on whether the bottleneck is disk (large &lt;code&gt;pages&lt;/code&gt;) or CPU (large &lt;code&gt;tuples&lt;/code&gt;, for example a narrow table packed densely with tuples per page). Putting these in separate terms lets a single formula stay accurate across both environments.&lt;/p&gt;

&lt;p&gt;Zooming in on &lt;code&gt;cpu_run_cost&lt;/code&gt;, the two terms charge cost differently. The main term (&lt;code&gt;cpu_tuple_cost + qual.per_tuple&lt;/code&gt;) is paid once per tuple across &lt;em&gt;all tuples&lt;/em&gt;, regardless of survival. Every tuple has to be inspected to evaluate qual, which is how survival gets decided in the first place. The second term (&lt;code&gt;pathtarget.per_tuple × rows&lt;/code&gt;) is paid only on &lt;em&gt;surviving rows&lt;/em&gt;: SELECT expressions need to run only for tuples that passed WHERE. Even within CPU cost, there's a split between "scales with all tuples" and "scales with output row count".&lt;/p&gt;

&lt;p&gt;This makes it obvious why &lt;code&gt;tuples&lt;/code&gt; and &lt;code&gt;pages&lt;/code&gt; are the two most important inputs to this formula. PG estimates table size &lt;em&gt;before&lt;/em&gt; computing cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Good selectivity doesn't reduce disk cost
&lt;/h2&gt;

&lt;p&gt;Looking at the formula again, an interesting asymmetry shows up. &lt;code&gt;disk_run_cost&lt;/code&gt; multiplies &lt;code&gt;pages&lt;/code&gt;. The main term of &lt;code&gt;cpu_run_cost&lt;/code&gt; multiplies &lt;code&gt;tuples&lt;/code&gt;. The &lt;em&gt;output row count&lt;/em&gt; appears only in the second term of &lt;code&gt;cpu_run_cost&lt;/code&gt; (&lt;code&gt;pathtarget.per_tuple × rows&lt;/code&gt;). It doesn't appear in &lt;code&gt;disk_run_cost&lt;/code&gt; or the main term.&lt;/p&gt;

&lt;p&gt;What's the "output row count" here? It's the estimated number of rows that survive WHERE filtering. The fraction of rows that survive is called &lt;em&gt;selectivity&lt;/em&gt;. Where selectivity itself comes from is covered in detail in 1.4.6. Selectivity of 1/10000 means one row out of ten thousand survives.&lt;/p&gt;

&lt;p&gt;This asymmetry becomes clear with a concrete example. Suppose an &lt;code&gt;accounts&lt;/code&gt; table with 10,000 rows in 100 pages, queried with &lt;code&gt;WHERE id = 5&lt;/code&gt;. If &lt;code&gt;id&lt;/code&gt; is unique, selectivity is 1/10000 and the result row count is 1. Even so, the sequential scan cost gets computed as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;disk_run_cost = 1.0 × 100 = 100
cpu_run_cost  ≈ (0.01 + 0.0025) × 10000  +  0 × 1  ≈ 125
total_cost    ≈ 225
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Breaking down the two terms of &lt;code&gt;cpu_run_cost&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;(0.01 + 0.0025) × 10000&lt;/code&gt;: the main term. Scales with all tuples. The &lt;code&gt;0.0025&lt;/code&gt; is the cost of evaluating the &lt;code&gt;=&lt;/code&gt; operator from &lt;code&gt;id = 5&lt;/code&gt; (&lt;code&gt;cpu_operator_cost&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;0 × 1&lt;/code&gt;: the pathtarget term (&lt;code&gt;pathtarget.per_tuple × rows&lt;/code&gt;). With only simple SELECT and one surviving row, this contribution is negligible.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even though only 1 row survives, disk still pays the full 100. CPU also pays 125 because the main term scales with 10,000 tuples. EXPLAIN output shows the same pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Seq Scan on accounts  (cost=0.00..225.00 rows=1 width=...)
  Filter: (id = 5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;rows=1&lt;/code&gt; but &lt;code&gt;cost=...225.00&lt;/code&gt;: this is the asymmetry made visible.&lt;/p&gt;

&lt;p&gt;What this means in practice: a sequential scan is priced as "read every page even if only one row survives". WHERE doesn't reduce disk cost. The main CPU term scales with &lt;em&gt;all tuples&lt;/em&gt; regardless of survival. The only part that shrinks with result row count is the second term (&lt;code&gt;pathtarget.per_tuple × rows&lt;/code&gt;), and with simple column references that part is near zero, so its impact on total cost is negligible. Better selectivity doesn't make a meaningful dent in sequential scan cost.&lt;/p&gt;

&lt;p&gt;How does this asymmetry shape the planner's decisions? Compare with an index scan candidate over the same table. An index scan follows the index first to locate only the pages with matching rows, so its disk cost scales with &lt;em&gt;the number of pages it visits&lt;/em&gt;. When matching rows are sparsely scattered, only a few pages are touched. So in general, when selectivity is good (few output rows), index scan disk cost gets much smaller than sequential scan disk cost. Without this asymmetry, the planner would always favor sequential scan as cheaper, missing index scan opportunities on large tables.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where do &lt;code&gt;pages&lt;/code&gt; and &lt;code&gt;tuples&lt;/code&gt; come from?
&lt;/h2&gt;

&lt;p&gt;Where these two formula inputs come from determines the model's reliability. PG fills them in over two stages before computing cost.&lt;/p&gt;

&lt;p&gt;The first stage is &lt;em&gt;physical estimation&lt;/em&gt;. When the planner registers a table as a work target, in &lt;code&gt;get_relation_info&lt;/code&gt;, it calls &lt;code&gt;estimate_rel_size&lt;/code&gt;, which fills &lt;code&gt;baserel-&amp;gt;pages&lt;/code&gt; and &lt;code&gt;baserel-&amp;gt;tuples&lt;/code&gt;. Two data sources go into filling these:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;pg_class.relpages&lt;/code&gt;, &lt;code&gt;pg_class.reltuples&lt;/code&gt;. &lt;code&gt;pg_class&lt;/code&gt; is the system catalog holding metadata for every table. The last &lt;code&gt;ANALYZE&lt;/code&gt; (the stats-gathering command) or &lt;code&gt;VACUUM&lt;/code&gt; (which reclaims dead tuples and refreshes stats alongside) leaves these values behind.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;RelationGetNumberOfBlocks()&lt;/code&gt;: the current page count on disk, checked directly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PG doesn't use just one or the other; it combines them. &lt;code&gt;pages&lt;/code&gt; is simple. &lt;code&gt;RelationGetNumberOfBlocks()&lt;/code&gt; returns &lt;em&gt;the current page count on disk&lt;/em&gt;, used as-is. &lt;code&gt;tuples&lt;/code&gt; is trickier. If rows have been added since the last stats, using &lt;code&gt;pg_class.reltuples&lt;/code&gt; (the tuple count at stats time) directly would underestimate the truth. So PG takes the &lt;em&gt;per-page tuple density&lt;/em&gt; from the stats snapshot (&lt;code&gt;pg_class.reltuples / pg_class.relpages&lt;/code&gt;), multiplies it by &lt;em&gt;current pages&lt;/em&gt;, and gets &lt;code&gt;tuples&lt;/code&gt;. The assumption is that density stays roughly similar across pages, so a proportional growth in page count implies proportional growth in tuple count. For a brand-new table that has never been ANALYZE'd, PG estimates tuple width from attribute datatypes and derives density from that.&lt;/p&gt;

&lt;p&gt;The second stage is &lt;em&gt;logical estimation&lt;/em&gt;. Once physical estimation has filled &lt;code&gt;pages&lt;/code&gt; and &lt;code&gt;tuples&lt;/code&gt;, the planner calls &lt;code&gt;set_baserel_size_estimates&lt;/code&gt; from &lt;code&gt;set_plain_rel_size&lt;/code&gt;. That function uses &lt;code&gt;clauselist_selectivity&lt;/code&gt; to compute the WHERE clause selectivity, then writes &lt;code&gt;baserel-&amp;gt;rows = tuples × selectivity&lt;/code&gt;. How selectivity itself is computed, based on column statistics in the &lt;code&gt;pg_statistic&lt;/code&gt; catalog (histograms, most-common-values, n_distinct, and the like), is left for 1.4.6 to unpack.&lt;/p&gt;

&lt;p&gt;The point: &lt;em&gt;&lt;code&gt;pages&lt;/code&gt; and &lt;code&gt;tuples&lt;/code&gt; are estimates grounded in pg_class stats&lt;/em&gt;. If ANALYZE gets skipped and stats go stale, density gets wrong, and the whole cost comparison goes off. The classic pattern where the same SQL produces a good plan one day and a bad plan the next has its root here.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parallel and disabled_nodes: two adjustments on top of the same formula
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;cost_seqscan&lt;/code&gt; applies two more adjustments on top of the formula above.&lt;/p&gt;

&lt;p&gt;The first is the parallel scan adjustment. When multiple worker processes split up reading a table's pages (&lt;code&gt;path-&amp;gt;parallel_workers &amp;gt; 0&lt;/code&gt;), &lt;code&gt;cpu_run_cost&lt;/code&gt; gets divided by the number of workers. CPU work is distributed across workers, so the burden each worker shoulders shrinks accordingly. &lt;code&gt;disk_run_cost&lt;/code&gt; stays the same. The kernel's read-ahead already speeds up the disk side, so adding workers won't reduce I/O further; that assumption is baked in. The result row count is also re-pinned to the count one worker processes. With this adjustment, the parallel sequential scan candidate ends up with a different cost than the non-parallel one over the same table, so the two enter the comparison pool separately.&lt;/p&gt;

&lt;p&gt;The second is the disabled node adjustment. As 1.4.1 covered, when an operator tries to turn off sequential scans with &lt;code&gt;enable_seqscan = off&lt;/code&gt;, PG handles it by pinning &lt;code&gt;disabled_nodes = 1&lt;/code&gt; on the &lt;code&gt;Path&lt;/code&gt; node. The dominance comparison rule looks at &lt;em&gt;which side has fewer disabled_nodes&lt;/em&gt; before comparing cost. Older PG versions handled this differently. They added a large constant to the cost so that disabled nodes' costs became astronomical and lost out in comparison. But that inflated cost distorted comparisons among candidates elsewhere in the plan tree, so PG switched to using a separate count field. After computing the cost formula, &lt;code&gt;cost_seqscan&lt;/code&gt; writes &lt;code&gt;disabled_nodes = 0&lt;/code&gt; if &lt;code&gt;enable_seqscan&lt;/code&gt; is on, or &lt;code&gt;disabled_nodes = 1&lt;/code&gt; if it's off.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means in practice
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;First, don't attach units to the EXPLAIN cost number.&lt;/strong&gt; Reading &lt;code&gt;cost=0.00..205.00&lt;/code&gt; as "this query takes 205ms" is wrong. The unit is arbitrary, and the number means something only &lt;em&gt;in comparison with other candidates within the same EXPLAIN output&lt;/em&gt;. If candidate A has cost 200 and candidate B has cost 800, that tells you the planner picked A. It says nothing about absolute runtime. For actual execution time, look at &lt;code&gt;actual time=&lt;/code&gt; in &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second, skipping ANALYZE breaks the cost model.&lt;/strong&gt; &lt;code&gt;pages&lt;/code&gt; and &lt;code&gt;tuples&lt;/code&gt; are derived from pg_class stats, and those two variables directly determine &lt;code&gt;disk_run_cost&lt;/code&gt; and &lt;code&gt;cpu_run_cost&lt;/code&gt;. Stale stats lead to wrong density, which leads to unstable plans for the same SQL. That's why a fresh table needs ANALYZE after data load, and why ANALYZE should be re-run after bulk INSERT or DELETE. Autovacuum (the background PG facility that runs stats updates and dead tuple reclamation automatically) handles this most of the time but fires only after hitting a threshold, so before running large queries shortly after a load, running ANALYZE explicitly once is the safe move.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third, on SSD, leaving &lt;code&gt;random_page_cost&lt;/code&gt; at 4.0 makes index scans look unfairly expensive.&lt;/strong&gt; The default reflects HDD-era assumptions where random reads cost 4x sequential reads. On SSDs, random cost is nearly identical to or only slightly more than sequential. Operators commonly tune &lt;code&gt;random_page_cost = 1.1&lt;/code&gt; to nudge the planner toward index scans. PG docs also recommend narrowing the gap between the two values for cache-resident workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fourth, the cost weight differs between narrow large tables and wide small tables.&lt;/strong&gt; &lt;code&gt;disk_run_cost&lt;/code&gt; scales with &lt;code&gt;pages&lt;/code&gt;, while &lt;code&gt;cpu_run_cost&lt;/code&gt; scales with &lt;code&gt;tuples&lt;/code&gt;. A narrow table packed with many tuples per page has a high &lt;code&gt;tuples / pages&lt;/code&gt; ratio, so the CPU share dominates. A table with wide rows fitting few tuples per page has the disk share dominate. The same row count can lead the planner to different candidate choices. As a monitoring signal, looking at &lt;code&gt;pg_class.relpages&lt;/code&gt; and &lt;code&gt;reltuples&lt;/code&gt; together, and being aware of their ratio, gives intuition for cost variability.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>internals</category>
      <category>performance</category>
    </item>
    <item>
      <title>1.4.1 Path Tree vs Plan Tree: What's the Difference</title>
      <dc:creator>JoongHyuk Shin</dc:creator>
      <pubDate>Tue, 02 Jun 2026 06:34:18 +0000</pubDate>
      <link>https://dev.to/joonghyukshin/141-path-tree-vs-plan-tree-whats-the-difference-2c59</link>
      <guid>https://dev.to/joonghyukshin/141-path-tree-vs-plan-tree-whats-the-difference-2c59</guid>
      <description>&lt;p&gt;The 1.4 chapter overview already noted that the planner's internals are split into two layers, &lt;strong&gt;Path&lt;/strong&gt; and &lt;strong&gt;Plan&lt;/strong&gt;. Path is the floor where candidates compete on cost. Plan is the form the cheapest candidate takes once it has been frozen into something the executor can actually run.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Path tree holds, side by side, the different routes that lead to the same result
&lt;/h2&gt;

&lt;p&gt;The moment the planner receives a query, it sets up a separate workspace for each &lt;em&gt;rel&lt;/em&gt; the query touches. A &lt;em&gt;rel&lt;/em&gt; is the name PG uses for a set of rows. A table the query references directly, or a sub-query, or any single item appearing in &lt;code&gt;FROM&lt;/code&gt;, is called a &lt;strong&gt;base rel&lt;/strong&gt;. A rel that is constructed from other rels, like the result of joining two rels, is called a &lt;strong&gt;join rel&lt;/strong&gt;. The formal name for the workspace the planner builds for one rel is &lt;code&gt;RelOptInfo&lt;/code&gt;. In a single-table query like &lt;code&gt;SELECT * FROM users WHERE id = 1&lt;/code&gt;, one RelOptInfo is built for the base rel &lt;code&gt;users&lt;/code&gt;. In a two-table join, three RelOptInfos are built: one for each of the two base rels, and one for the join rel that combines them.&lt;/p&gt;

&lt;p&gt;Inside the workspace is a list that collects the path candidates that can be built for that rel. The list is called the &lt;code&gt;pathlist&lt;/code&gt;. For example, if &lt;code&gt;users&lt;/code&gt; has a single primary key index and no other secondary indexes, the RelOptInfo for &lt;code&gt;users&lt;/code&gt; looks roughly like this (the cost and rows numbers are arbitrary examples for illustration).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RelOptInfo (workspace for users)
├── pathlist (candidate list)
│   ├── SeqScan path        (cost: 100, rows: 1000, sort: none)
│   └── IndexScan path      (cost: 5,   rows: 1,    sort: id ascending)
├── baserestrictinfo (WHERE conditions attached to this table)
│   └── RestrictInfo: id = 1
└── (statistics, join info, etc.)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are two routes leading to the same result, and the pathlist holds both at once. Neither is correct in advance. The planner keeps both candidates around until one of them clearly &lt;em&gt;beats&lt;/em&gt; the other.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dominance check
&lt;/h3&gt;

&lt;p&gt;The pathlist does not take in every possible path. PG runs a cleanup before a new candidate is registered. This cleanup is called the &lt;strong&gt;dominance check&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The rule is simple. The check compares five items.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt;: cheaper or equal&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sort order&lt;/strong&gt;: more useful or equal (for example, if the upper stage wants the result sorted by &lt;code&gt;id&lt;/code&gt;, a path that already comes out sorted by &lt;code&gt;id&lt;/code&gt; is more useful)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Row count&lt;/strong&gt;: fewer or equal&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outer dependency&lt;/strong&gt;: depends on fewer external rels, or the same&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel safety&lt;/strong&gt;: at least as safe to run in a worker process&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the new candidate is &lt;em&gt;no worse than&lt;/em&gt; the existing one in all five items, the new candidate is said to dominate the old one. The old candidate is then yanked out of the pathlist. Conversely, if the existing candidate is no worse than the new one in all five items, the new candidate is not registered at all.&lt;/p&gt;

&lt;p&gt;If either side wins on at least one item, both survive. For example, if the SeqScan path is cheaper but the IndexScan path has a more useful sort order, one wins on cost and the other wins on sort. Neither completely dominates the other, so both stay in the pathlist. Which one gets picked in the end depends on what the upper stage wants, like whether an &lt;code&gt;ORDER BY&lt;/code&gt; is in play or a MergeJoin is sitting above.&lt;/p&gt;

&lt;p&gt;The function that wraps this check is &lt;code&gt;add_path&lt;/code&gt;. Paths that lose dominance are freed from memory right there.&lt;/p&gt;

&lt;h3&gt;
  
  
  Disabled candidates lose before cost even gets compared
&lt;/h3&gt;

&lt;p&gt;There is one more rule inside the comparison. &lt;strong&gt;Whether a candidate is disabled is checked before cost is.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What does "disabled candidate" mean here? PG has GUCs like &lt;code&gt;enable_seqscan&lt;/code&gt;, &lt;code&gt;enable_indexscan&lt;/code&gt;, &lt;code&gt;enable_nestloop&lt;/code&gt;. When an operator turns one of these to &lt;code&gt;off&lt;/code&gt;, PG does &lt;em&gt;not&lt;/em&gt; stop making that kind of path. It makes the path normally, but stamps it once with a note that says "use this only as a last resort." That stamp lives inside the path node as a counter called &lt;code&gt;disabled_nodes&lt;/code&gt;. When a seq scan path is built with &lt;code&gt;enable_seqscan = off&lt;/code&gt;, that path's &lt;code&gt;disabled_nodes&lt;/code&gt; is 1, while the normal candidates sit at 0.&lt;/p&gt;

&lt;p&gt;The dominance comparison looks at &lt;code&gt;disabled_nodes&lt;/code&gt; first. No matter what else is going on, whichever side has fewer &lt;code&gt;disabled_nodes&lt;/code&gt; wins outright. Only when they tie does the comparison move on to cost, sort order, and the rest.&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;enable_seqscan = off&lt;/code&gt;, the seq scan path gets &lt;code&gt;disabled_nodes = 1&lt;/code&gt;, and as long as even one normal candidate is alive at &lt;code&gt;disabled_nodes = 0&lt;/code&gt;, that normal candidate wins unconditionally. The seq scan almost always loses. The word &lt;em&gt;almost&lt;/em&gt; matters. If there is no other candidate, or every candidate has been bumped to &lt;code&gt;disabled_nodes = 1&lt;/code&gt; for the same reason, the seq scan rises back up. The &lt;code&gt;enable_*&lt;/code&gt; GUCs are not hard "you can never use this" switches; they are soft "if there is another way, prefer it" hints.&lt;/p&gt;

&lt;p&gt;The pathlist itself is kept sorted first by &lt;code&gt;disabled_nodes&lt;/code&gt; and then by &lt;code&gt;total_cost&lt;/code&gt;. The least disabled candidate, and among those the cheapest, is always at the front.&lt;/p&gt;

&lt;h3&gt;
  
  
  Joins
&lt;/h3&gt;

&lt;p&gt;Joins repeat the same mechanism one level up. Once the pathlists for two base rels are filled, a RelOptInfo for the join workspace is built that combines them, and one join path each for NestLoop, MergeJoin, and HashJoin is registered into that join workspace's pathlist.&lt;/p&gt;

&lt;p&gt;A join path is a single node, but it points to two children. One child path is held as &lt;code&gt;outer&lt;/code&gt;, the other as &lt;code&gt;inner&lt;/code&gt;. When joining table &lt;code&gt;a&lt;/code&gt; and table &lt;code&gt;b&lt;/code&gt;, a NestLoop join path looks roughly like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NestLoop join path (a ⨝ b)
├── outer child → IndexScan path on a
└── inner child → SeqScan path on b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a query that joins three tables &lt;code&gt;a&lt;/code&gt;, &lt;code&gt;b&lt;/code&gt;, and &lt;code&gt;c&lt;/code&gt;, join paths stack up like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HashJoin path ((a ⨝ b) ⨝ c)
├── outer child → NestLoop path (a ⨝ b)
│                ├── outer → IndexScan path on a
│                └── inner → SeqScan path on b
└── inner child → SeqScan path on c
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once paths start pointing at other paths as children, a single path sitting in the final join workspace's pathlist is itself an entire tree. When the book talks about a &lt;em&gt;Path tree&lt;/em&gt;, this shape is what it means.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Plan tree holds only the one survivor
&lt;/h2&gt;

&lt;p&gt;After the planner has filled every workspace's pathlist, what is left at the end is the top-level join workspace that produces the entire query's result. The cheapest single tree sits at the front of that workspace's pathlist. The planner picks it as &lt;code&gt;best_path&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is where the next step begins. &lt;code&gt;best_path&lt;/code&gt; is a data structure that only makes sense inside the planner. It is not a form the executor can take in and run directly. So the planner rewrites &lt;code&gt;best_path&lt;/code&gt; into a form the executor can use. The result is the &lt;strong&gt;Plan tree&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The Plan tree is a single tree. There are no multiple candidates the way there were in the Path tree. It is built in roughly the same shape as the path tree it came from. What was called &lt;code&gt;outer&lt;/code&gt; and &lt;code&gt;inner&lt;/code&gt; in the path tree is called &lt;code&gt;lefttree&lt;/code&gt; and &lt;code&gt;righttree&lt;/code&gt; in the plan tree. The names are different, but they point to the same thing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HashJoin plan        ← corresponds to HashJoin path in the path tree
├── lefttree → NestLoop plan      ← corresponds to NestLoop path
│              ├── lefttree → IndexScan plan  (a)
│              └── righttree → SeqScan plan  (b)
└── righttree → SeqScan plan      (c)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A Plan node directly holds three things.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Where rows come from&lt;/strong&gt;: pointers (&lt;code&gt;lefttree&lt;/code&gt;, &lt;code&gt;righttree&lt;/code&gt;) to the child plan nodes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What to filter by&lt;/strong&gt;: the filter conditions applied at execution time (&lt;code&gt;qual&lt;/code&gt; list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What to output&lt;/strong&gt;: the expression list of output columns (&lt;code&gt;targetlist&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The executor walks this tree one node at a time, taking the rows its children hand up and applying its own work (filter, join, output) on top. In the diagram above, when SeqScan plans pull rows up off disk, the NestLoop above takes them in and performs the &lt;code&gt;a&lt;/code&gt;-&lt;code&gt;b&lt;/code&gt; join, and the HashJoin above takes that result and joins it with &lt;code&gt;c&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Packing both into one structure means each stage drags dead fields
&lt;/h2&gt;

&lt;p&gt;A natural question shows up here. If the path tree and the plan tree are nearly the same shape, why build both? Why not bolt cost info onto a single structure, do the comparisons, and hand that structure straight to the executor?&lt;/p&gt;

&lt;p&gt;The answer is that the two stages need different information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Planning needs comparison-side info.&lt;/strong&gt; To run the dominance check, you have to know whether two candidates have the same sort order, whether they take the same external parameters, which workspace they belong to. A Path node therefore carries fields like &lt;code&gt;pathkeys&lt;/code&gt;, which expresses sort order, &lt;code&gt;param_info&lt;/code&gt;, which holds parameterization, and a back-link to its own workspace (RelOptInfo). These fields are useless the moment planning ends. The executor never compares candidates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Execution needs runtime-side info.&lt;/strong&gt; A Plan node carries the pointers it needs to find its children, the filter expressions it applies to the rows passing through, and the expressions for its output columns directly inside the node. The English field names are the ones you saw in the diagram (&lt;code&gt;lefttree&lt;/code&gt;, &lt;code&gt;righttree&lt;/code&gt;, &lt;code&gt;qual&lt;/code&gt;, &lt;code&gt;targetlist&lt;/code&gt;). These fields are useless during planning, because candidate comparison does not look at child pointers or filter conditions.&lt;/p&gt;

&lt;p&gt;The PG optimizer README sums up this asymmetry in a single sentence.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"There is pretty nearly a one-to-one correspondence between the Path and Plan trees, but Path nodes omit info that won't be needed during planning, and include info needed for planning that won't be needed by the executor."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Packing both kinds of field into one structure makes both stages heavier. Planning's comparison loop drags dead execution fields through every node, leaking cache efficiency and memory. The executor's plan nodes carry dead comparison fields around forever.&lt;/p&gt;

&lt;p&gt;There is one more reason. With two separate structures, a path that loses dominance can be freed right there. If a path were the same thing as a plan, you could not know in advance which candidate would eventually go to the executor, so you would have to keep every candidate alive to the end. With them separated, the moment planning ends you only need to convert the one &lt;code&gt;best_path&lt;/code&gt; into a plan, and you can throw the rest away. In big queries where join candidates blow up, this difference shows up directly as memory cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  create_plan freezes one best_path into a plan tree
&lt;/h2&gt;

&lt;p&gt;The conversion from path tree to plan tree happens in a function called &lt;code&gt;create_plan&lt;/code&gt;. It takes one &lt;code&gt;best_path&lt;/code&gt; and walks the path tree, rewriting each node as the corresponding plan node.&lt;/p&gt;

&lt;p&gt;The correspondence is one-to-one.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Path node&lt;/th&gt;
&lt;th&gt;Plan node&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SeqScan path&lt;/td&gt;
&lt;td&gt;SeqScan plan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IndexScan path&lt;/td&gt;
&lt;td&gt;IndexScan plan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NestPath&lt;/td&gt;
&lt;td&gt;NestLoop plan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MergePath&lt;/td&gt;
&lt;td&gt;MergeJoin plan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HashPath&lt;/td&gt;
&lt;td&gt;HashJoin plan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SortPath&lt;/td&gt;
&lt;td&gt;Sort plan&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each path carries its own kind in a &lt;code&gt;pathtype&lt;/code&gt; field, and the conversion engine switches on that value to call the right per-type conversion function.&lt;/p&gt;

&lt;p&gt;The conversion is not a plain copy. Information moves in three directions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First, cost info is carried over as is.&lt;/strong&gt; The &lt;code&gt;startup_cost&lt;/code&gt;, &lt;code&gt;total_cost&lt;/code&gt;, &lt;code&gt;rows&lt;/code&gt;, and &lt;code&gt;disabled_nodes&lt;/code&gt; that the path was holding are copied straight onto the plan. The cost numbers you see in EXPLAIN are these values brought over from the path side. The plan stage does not recompute them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second, qual is distributed out of RestrictInfo into individual plan nodes.&lt;/strong&gt; Why this distribution is needed takes a little untangling. WHERE conditions and join conditions are not pinned directly to path nodes. Instead, each workspace (RelOptInfo) holds a list of WHERE conditions attached to its own table, each join path holds a list of its own join conditions, and each condition lives inside that list as a &lt;code&gt;RestrictInfo&lt;/code&gt; object.&lt;/p&gt;

&lt;p&gt;A RestrictInfo is not a plain wrapper. Along with the condition expression itself, it carries side information that, once computed, can be reused in many places. Which rels the condition references, the estimated selectivity, the evaluation cost, whether the condition is usable as a mergejoin or hashjoin clause, and so on. In other words, &lt;em&gt;the condition expression plus a cache of the interpretation done over it&lt;/em&gt; is what makes up a RestrictInfo. When several candidate paths in the same workspace need to apply the same condition, they reuse the cached selectivity and eval cost from the RestrictInfo instead of recomputing every time. The wrapper exists not just for &lt;em&gt;sharing the condition&lt;/em&gt;, but for &lt;em&gt;reusing the interpretation&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;A Plan node, however, has to hold its own qual directly. So &lt;code&gt;create_plan&lt;/code&gt; walks the path tree in post-order and decides which RestrictInfo should descend into which plan node's qual. Given a condition like &lt;code&gt;WHERE a.x = b.x AND a.y &amp;gt; 0&lt;/code&gt;, on the NestLoop join path between &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt;, &lt;code&gt;a.x = b.x&lt;/code&gt; goes down as the join qual, and &lt;code&gt;a.y &amp;gt; 0&lt;/code&gt; goes down to the SeqScan plan on &lt;code&gt;a&lt;/code&gt; as its filter. Once distribution is finished, each plan node holds only the qual that applies at its own location.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third, only for MergeJoin children does the plan stage actually add nodes.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the general case, sort is already pinned in during the path stage. When something like &lt;code&gt;ORDER BY&lt;/code&gt; or &lt;code&gt;GROUP BY&lt;/code&gt; demands a sort, the planner inserts a &lt;code&gt;SortPath&lt;/code&gt; into the path tree as a separate path node, and conversion just transforms that SortPath into a Sort plan one-to-one.&lt;/p&gt;

&lt;p&gt;MergeJoin is the exception. Instead of placing a separate sort path on top of a child path, the MergeJoinPath carries the sort keys for each of its two children in its own fields (just the fact that "a sort is needed above this child"). At conversion time, the plan stage reads those fields and slips a Sort plan directly on top of the corresponding child plan.&lt;/p&gt;

&lt;p&gt;So plan trees are mirror images of path trees almost everywhere, with MergeJoin's child-side sort as the one exception where the plan stage actually inserts a node.&lt;/p&gt;

&lt;p&gt;When conversion is over, the plan tree has one root node with children hanging below it. The plan tree is then wrapped in PlannedStmt and handed to the executor. The detailed structure of PlannedStmt was covered in the 1.4 chapter overview.&lt;/p&gt;

&lt;p&gt;What enters the conversion stage is a single tree, and the plan tree is nearly that single tree's mirror image. Paths that lost dominance had already been freed at registration time by &lt;code&gt;add_path&lt;/code&gt;, and the other paths that survived but were not chosen as &lt;code&gt;best_path&lt;/code&gt; are cleaned up along with the planner's context the moment planning ends.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means in practice
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;First, "why did the optimizer pick this plan" cannot be answered by EXPLAIN alone.&lt;/strong&gt; The node tree EXPLAIN shows, with its SeqScan, IndexScan, NestLoop, is the plan tree, not the path tree. Only the &lt;em&gt;single survivor&lt;/em&gt; is visible. Which candidates competed in the path stage for the same query, at what costs, and which paths fell out of dominance and disappeared leave no trace on the plan tree. The actual time from EXPLAIN ANALYZE, the output from EXPLAIN (VERBOSE), the per-node JSON from EXPLAIN (COSTS OFF, FORMAT JSON) all read fields off the plan nodes. They do not show the path-stage comparison results. To answer "why was this plan chosen" you cannot just stare at one EXPLAIN. You have to vary the input statistics and the GUCs, watch how the plan shifts, and infer the path-stage comparison backwards from those shifts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second, if the plan for the same SQL has changed, something in the path-stage comparison flipped.&lt;/strong&gt; When the same query that was running fine starts coming back with a different plan and the performance shifts, the first place to suspect is the path-stage comparison. Statistics (&lt;code&gt;pg_statistic&lt;/code&gt;) may have been refreshed and selectivity estimates moved; indexes may have been added or dropped so that new candidate paths appeared or old ones disappeared; cost parameters like &lt;code&gt;random_page_cost&lt;/code&gt; may have changed, shifting the cost score of the same path. None of these change the plan tree directly. They flip dominance results inside the path stage, and once &lt;code&gt;best_path&lt;/code&gt; ends up pointing to a different path, the plan that comes out the other side comes out differently shaped. In production debugging, replacing the question "why did the plan change" with "which path's cost estimate flipped" makes the trail much shorter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third, if &lt;code&gt;enable_seqscan = off&lt;/code&gt; does not change the plan, doubt the candidate list before doubting the GUC.&lt;/strong&gt; &lt;code&gt;enable_*&lt;/code&gt; GUCs are soft hints, not hard blocks. So when no other normal candidate survives, the disabled path stays alive and ends up in the plan anyway. On small tables where no index can beat a seq scan on cost, on tables with no indexes, on situations where broken statistics have driven every index candidate to the same disabled level, the GUC looks impotent for one real reason: &lt;em&gt;no other candidate is left standing&lt;/em&gt;. The diagnostic order is (a) confirm that an index exists with &lt;code&gt;\d+ &amp;lt;table&amp;gt;&lt;/code&gt; or &lt;code&gt;pg_indexes&lt;/code&gt;, (b) check &lt;code&gt;last_analyze&lt;/code&gt; in &lt;code&gt;pg_stat_user_tables&lt;/code&gt; to make sure the statistics are alive, (c) compare the cost numbers in EXPLAIN between the chosen plan and any candidate alternatives to see what is alive and why it is still more expensive. Pushing the GUC harder is not the answer; doubting the candidate list is.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>internals</category>
      <category>planner</category>
    </item>
    <item>
      <title>1.4 Planner: Which Path to Take</title>
      <dc:creator>JoongHyuk Shin</dc:creator>
      <pubDate>Fri, 29 May 2026 10:05:31 +0000</pubDate>
      <link>https://dev.to/joonghyukshin/14-planner-which-path-to-take-1k2f</link>
      <guid>https://dev.to/joonghyukshin/14-planner-which-path-to-take-1k2f</guid>
      <description>&lt;p&gt;By the time 1.3 ends, the rewriter has let the Query tree through. Views have been expanded, RLS policies have been wedged into WHERE clauses, and missing INSERT columns have been filled with defaults. The starting point of this chapter is one fact: &lt;strong&gt;that Query tree now has "what to do" fully decided, but "how to do it" not even by a single character.&lt;/strong&gt; The stage that fills in that gap is the planner.&lt;/p&gt;

&lt;p&gt;The planner's input and output are clear. It takes one Query tree and produces one PlannedStmt. What it takes in is a tree of meaning; what it puts out is a PlannedStmt wrapping a tree of execution. The execution tree is built out of Plan nodes, and PlannedStmt bundles that tree together with the side information execution needs (for each referenced table, what lock mode to take; for each parameter slot the executor will fill at runtime, what type it is; and so on). At the moment of receiving, the tree only carries the meaning "pick the row from the users table where id is 1." At the moment of emitting, it carries a tree the executor can follow step by step, something like "go into the primary key index on users, fetch the one matching row, then output that whole row."&lt;/p&gt;

&lt;p&gt;What sets the planner decisively apart from the analyzer and the rewriter is the nature of the decision. When the analyzer resolves "which table at which OID is users," there is one correct answer. When the rewriter expands a view reference into its underlying SELECT, there is one shape the expansion takes. But the question the planner gets, "in what order, using which index, with what join method should this Query be executed," has many equally correct candidates. The way to resolve &lt;code&gt;WHERE id = 1&lt;/code&gt; is a sequential scan only if there is no index. If an index is present, an index scan joins as a candidate, and even then, depending on what the statistics say, a sequential scan can still be cheaper. One candidate has to be picked, and the basis for picking is not correctness but &lt;strong&gt;estimated cost&lt;/strong&gt;. That is why the planner carries a cost model. It scores the candidates with estimates derived from statistics and picks the cheapest path.&lt;/p&gt;

&lt;p&gt;To express this cost-based selection, the inside of the planner is split into two layers. &lt;strong&gt;Path&lt;/strong&gt; and &lt;strong&gt;Plan&lt;/strong&gt;. Path is a candidate plane that expresses "this is a possible route." For each chunk of the query, one candidate is generated per possible route, and each carries its own cost into the competition. Plan is the winning route, frozen into a concrete form the executor can actually run. Without the Path stage, there is no place to line up candidates and sort them by cost. Without the Plan stage, the executor has nothing to follow. That is why the two layers exist separately.&lt;/p&gt;

&lt;p&gt;1.4 splits into seven sections.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1.4.1 Path tree vs Plan tree: what's the difference&lt;/strong&gt;: how multiple Path candidates get built for the same Query, how one of them gets frozen into a Plan, and why the two data structures exist separately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1.4.2 Cost model: sequential scan formula&lt;/strong&gt;: how PG estimates the cost of the simplest candidate, SeqScan, and what parameters like &lt;code&gt;seq_page_cost&lt;/code&gt; and &lt;code&gt;cpu_tuple_cost&lt;/code&gt; are quantifying.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1.4.3 Cost model: index scan formula&lt;/strong&gt;: the cost formula for index scans, a step more complex than SeqScan, and how selectivity enters it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1.4.4 Join strategies: nested loop, hash, merge&lt;/strong&gt;: the three basic ways to bring two tables together, and the conditions under which each one wins.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1.4.5 Join order: dynamic programming vs GEQO&lt;/strong&gt;: as the number of tables to join grows, the possible orders explode. How PG handles that with dynamic programming and the genetic algorithm (GEQO).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1.4.6 Statistics: pg_statistic and selectivity&lt;/strong&gt;: where the accuracy of cost estimation comes from. The statistics ANALYZE builds, and how those statistics feed selectivity estimation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1.4.7 Planner hook: when it fires, how to use it&lt;/strong&gt;: the extension point PG has opened at the planner's entry. How external extensions intercept the entire plan or alter parts of the cost estimation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The planner is not a one-track machine that finds the answer; it is a tool that bets between candidates with cost estimates. The same SQL freezing into a different plan after a single index is added or a single ANALYZE runs, a join order collapsing on one statistical miss, the node tree shown in EXPLAIN output looking the way it does: all of it is the result of this bet.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>internals</category>
      <category>planner</category>
    </item>
    <item>
      <title>1.3.4 RETURNING Mapping and View-Target DML Transformation</title>
      <dc:creator>JoongHyuk Shin</dc:creator>
      <pubDate>Mon, 25 May 2026 11:02:53 +0000</pubDate>
      <link>https://dev.to/joonghyukshin/134-returning-mapping-and-view-target-dml-transformation-4ajm</link>
      <guid>https://dev.to/joonghyukshin/134-returning-mapping-and-view-target-dml-transformation-4ajm</guid>
      <description>&lt;p&gt;The previous section (1.3.3) looked at the work the rewriter does to &lt;em&gt;fill empty slots with values and expressions&lt;/em&gt; inside an INSERT/UPDATE/DELETE the user issued. Injecting defaults for omitted INSERT columns, resolving the &lt;code&gt;DEFAULT&lt;/code&gt; keyword, and merging partial UPDATE assignments all belong to that work.&lt;/p&gt;

&lt;p&gt;The other half of DML rewriting has a different flavor. It is the work of moving &lt;em&gt;what existing references point to&lt;/em&gt;. It shows up in two places. One is when the &lt;code&gt;RETURNING&lt;/code&gt; clause has to be resolved. The other is when a DML aimed at a view has to be rewritten as a DML against a base table. In both cases the expressions the user wrote stay the same; only what those expressions point to (the &lt;code&gt;varno&lt;/code&gt;/&lt;code&gt;varattno&lt;/code&gt; of &lt;code&gt;Var&lt;/code&gt; nodes) gets swapped.&lt;/p&gt;

&lt;h2&gt;
  
  
  RULE's RETURNING is reshaped to the user's request
&lt;/h2&gt;

&lt;p&gt;All three commands (INSERT/UPDATE/DELETE) can carry a &lt;code&gt;RETURNING&lt;/code&gt; clause. It is the clause that hands the affected rows back as a result. RETURNING usually flows into the planning stage exactly as the user wrote it, but when DML is issued against a table that has an INSTEAD/ALSO rule attached, the rewriter adds one more mapping step.&lt;/p&gt;

&lt;p&gt;Take a concrete scenario. Suppose an &lt;code&gt;account_summary&lt;/code&gt; view and a rule on it are defined like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;account_summary&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;RULE&lt;/span&gt; &lt;span class="n"&gt;log_summary_insert&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;account_summary&lt;/span&gt;
    &lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="k"&gt;INSTEAD&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                          &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;NEW&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;NEW&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;balance&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                          &lt;span class="n"&gt;RETURNING&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This rule is an INSTEAD rule: when the user issues an INSERT to &lt;code&gt;account_summary&lt;/code&gt;, it replaces that with an INSERT to &lt;code&gt;accounts&lt;/code&gt;. The action itself carries a RETURNING clause whose list is the three columns &lt;code&gt;id, owner, balance&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now suppose the user issues this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;account_summary&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'alice'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="n"&gt;RETURNING&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The user's requested RETURNING list is two columns, &lt;code&gt;id, balance&lt;/code&gt;. The shape differs from the rule action's RETURNING list of &lt;code&gt;id, owner, balance&lt;/code&gt;. What the rewriter does here is clear. It reshapes the rule action's RETURNING list to the shape the user asked for, &lt;code&gt;id, balance&lt;/code&gt;. The query that ends up being executed looks like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'alice'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="n"&gt;RETURNING&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The user's shape is what ends up in the result. The RETURNING &lt;code&gt;id, owner, balance&lt;/code&gt; written into the rule action itself acts only as a signal that "this rule knows how to resolve RETURNING over these columns"; the final shape follows the user's request.&lt;/p&gt;

&lt;p&gt;There are three more branches.&lt;/p&gt;

&lt;p&gt;First, the user did not write a RETURNING clause. In that case the rule action's RETURNING is simply discarded. The user said they don't want a result, so the RETURNING expressions are never evaluated.&lt;/p&gt;

&lt;p&gt;Second, RETURNING lists appear on more than one rule action. Suppose, for example, &lt;code&gt;account_summary&lt;/code&gt; has both of these rules attached.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;RULE&lt;/span&gt; &lt;span class="n"&gt;log_summary_insert&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;account_summary&lt;/span&gt;
    &lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="k"&gt;INSTEAD&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                          &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;NEW&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;NEW&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;balance&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                          &lt;span class="n"&gt;RETURNING&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;RULE&lt;/span&gt; &lt;span class="n"&gt;audit_summary_insert&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;account_summary&lt;/span&gt;
    &lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="n"&gt;ALSO&lt;/span&gt;   &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;accounts_audit&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                          &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;NEW&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;NEW&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;balance&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                          &lt;span class="n"&gt;RETURNING&lt;/span&gt; &lt;span class="n"&gt;audit_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;DO ALSO&lt;/code&gt; does not intercept the user's command; it runs as an additional action alongside. Now a single INSERT event has two actions carrying RETURNING: one for &lt;code&gt;accounts&lt;/code&gt;, the other for &lt;code&gt;accounts_audit&lt;/code&gt;. When the user asks for &lt;code&gt;RETURNING id, balance&lt;/code&gt;, there is no way to decide which action the result should come from. Because of this ambiguity, the rewriter reports an error at this point.&lt;/p&gt;

&lt;p&gt;Third, an INSTEAD rule has no RETURNING at all but the user asked for one. This is also an error. The message is "cannot perform INSERT/UPDATE/DELETE RETURNING on relation X".&lt;/p&gt;

&lt;p&gt;If we boil these three branches down to one line: the RULE system prioritizes the user's requested result shape, but the path is sound only when exactly one rule action can resolve it.&lt;/p&gt;

&lt;h2&gt;
  
  
  DML on a view is rewritten as DML on a base relation
&lt;/h2&gt;

&lt;p&gt;When an INSERT/UPDATE/DELETE targets a view and that view is auto-updatable (a view that satisfies the simple-view expansion conditions from 1.3.1), the rewriter rewrites the view-target DML as a DML against the base relation. INSERT, UPDATE, DELETE, and MERGE all go through the same function &lt;code&gt;rewriteTargetView&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Take an example first. Suppose a view and a user query like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;active_accounts&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;deleted_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;active_accounts&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;owner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'alice'&lt;/span&gt;
    &lt;span class="n"&gt;RETURNING&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The user issued a query shaped like an UPDATE against the &lt;code&gt;active_accounts&lt;/code&gt; view. The relation targeted by the UPDATE, i.e., the relation whose rows will be modified, is called the &lt;strong&gt;result relation&lt;/strong&gt; in PG terminology. Here the slot for the result relation holds &lt;code&gt;active_accounts&lt;/code&gt;. A view doesn't carry data of its own, so it cannot be updated as-is; the update has to flow through to the base table behind the view, which is &lt;code&gt;accounts&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In one line, the transformation replaces every view reference inside the query with a reference to the base relation. Broken down into steps, three things happen.&lt;/p&gt;

&lt;p&gt;First, the view RTE that sat in the result relation slot is swapped for the base relation RTE. In the example above, the slot initially held an RTE pointing to &lt;code&gt;active_accounts&lt;/code&gt;; that gets replaced with the RTE of the base table &lt;code&gt;accounts&lt;/code&gt; that the view definition points to.&lt;/p&gt;

&lt;p&gt;Second, every &lt;code&gt;Var&lt;/code&gt; node inside the query (Var in targetList, Var in RETURNING, Var in ON CONFLICT) has its attno rewritten from the view's column index to the base table's column index. One point worth pinning down here. When the user issues a DML targeting a view, the column names they write in the RETURNING clause are the view's column names. That's because when the analyzer resolves the column references in RETURNING, it looks at the catalog of the result relation, which is the view. After that, the rewriter rewrites those &lt;code&gt;Var&lt;/code&gt; nodes' attno to point at slot numbers in the base table. SQL text is not what changes; one integer field on a tree node is swapped for a different number.&lt;/p&gt;

&lt;p&gt;In our example, the view is shaped as &lt;code&gt;SELECT * FROM accounts&lt;/code&gt;, so the view's &lt;code&gt;id&lt;/code&gt; maps straight to the base's &lt;code&gt;id&lt;/code&gt;. Rewriting attno is the whole story. If the view had instead been written as &lt;code&gt;SELECT id AS account_id, owner, balance FROM accounts&lt;/code&gt;, with an alias renaming a column, the user would have written &lt;code&gt;RETURNING account_id&lt;/code&gt;, and that &lt;code&gt;Var&lt;/code&gt;'s attno would be rewritten to point at the base's &lt;code&gt;id&lt;/code&gt; slot. The &lt;em&gt;label&lt;/em&gt; on the column the user gets back stays exactly as they wrote it, i.e., the view column name (&lt;code&gt;account_id&lt;/code&gt;), and the value inside that column is read from the base's &lt;code&gt;id&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Third, if the view definition had a WHERE clause, that condition is added to the user's query. The view definition in our example is &lt;code&gt;... WHERE deleted_at IS NULL&lt;/code&gt;, and that condition defines which rows are "visible" through the view. The UPDATE the user issued must target rows that fall within the view's visibility range, so the view's WHERE is ANDed onto the user's WHERE. UPDATE and DELETE pick up the merge this way; INSERT creates new rows and has no WHERE clause of its own for the view's WHERE to attach to, so it skips this step.&lt;/p&gt;

&lt;p&gt;Conceptually, the rewriter takes the query above and produces something like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;owner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'alice'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;deleted_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
    &lt;span class="n"&gt;RETURNING&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The view definition's WHERE (&lt;code&gt;deleted_at IS NULL&lt;/code&gt;) got ANDed onto the user's WHERE (&lt;code&gt;owner = 'alice'&lt;/code&gt;), and the &lt;code&gt;Var&lt;/code&gt; nodes in RETURNING have been remapped from view column indexes to base column indexes in &lt;code&gt;accounts&lt;/code&gt;. The SQL above isn't real SQL text; it's an analogy that pictures the transformed result in SQL form. Inside the tree, the &lt;code&gt;Var&lt;/code&gt; nodes in RETURNING simply point at base column slots; the result is not re-emitted as SQL text. The user thinks they updated a view, but the base table is what's actually updated, and the view no longer appears in the post-transformation query.&lt;/p&gt;

&lt;h3&gt;
  
  
  A security_barrier view pins the view's WHERE in front of securityQuals
&lt;/h3&gt;

&lt;p&gt;If a view was created with the &lt;code&gt;security_barrier&lt;/code&gt; option, one thing changes. Instead of being ANDed onto the user's WHERE in the usual way, the view definition's WHERE is pinned at the front of the base relation RTE's &lt;code&gt;securityQuals&lt;/code&gt;. This is the same &lt;code&gt;securityQuals&lt;/code&gt; list from 1.3.2, the container that held RLS quals. Going into the same container brings the same protection effect. The planner cannot reorder a leaky function the user planted in their WHERE clause to be evaluated before the view's filter. That stops a leaky function from bypassing the view's filter to see base rows.&lt;/p&gt;

&lt;h3&gt;
  
  
  WITH CHECK OPTION stops a new row from breaking the view's visibility
&lt;/h3&gt;

&lt;p&gt;If the view carries &lt;code&gt;WITH CHECK OPTION&lt;/code&gt;, one more thing is added. It is the promise that any row coming in (or being changed) through INSERT/UPDATE must satisfy the view's WHERE again. Adding &lt;code&gt;WITH CHECK OPTION&lt;/code&gt; to the end of the view definition looks like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;active_accounts&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;deleted_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
    &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="k"&gt;CHECK&lt;/span&gt; &lt;span class="k"&gt;OPTION&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we issue an UPDATE against a view defined this way, the row produced by the UPDATE has to still satisfy &lt;code&gt;deleted_at IS NULL&lt;/code&gt;. That stops the user from setting &lt;code&gt;deleted_at&lt;/code&gt; to some timestamp via that UPDATE and shoving the row they were updating outside the view's visibility range. The rewriter turns this promise into a check entry and pins it at the front of the Query's &lt;code&gt;withCheckOptions&lt;/code&gt; list.&lt;/p&gt;

&lt;p&gt;The same container from 1.3.2 shows up again. The &lt;code&gt;withCheckOptions&lt;/code&gt; list that held RLS WITH CHECK conditions now also takes the view's CHECK OPTION in the same place. The only difference is the identifier that distinguishes the kind of check; the enforcement mechanism is the same. Even the enforcement semantics (an error is raised on violation, not a silent drop) are identical.&lt;/p&gt;

&lt;p&gt;Consider the case where RLS and view CHECK OPTION are both active on the same table/view. Say the base table &lt;code&gt;accounts&lt;/code&gt; has an RLS WITH CHECK policy saying "an INSERTed row's &lt;code&gt;owner&lt;/code&gt; must always be current_user", and the &lt;code&gt;active_accounts&lt;/code&gt; view above carries a WITH CHECK OPTION saying "the row must satisfy &lt;code&gt;deleted_at IS NULL&lt;/code&gt;". When the user issues an INSERT through &lt;code&gt;active_accounts&lt;/code&gt;, the rewriter pins both conditions onto the &lt;code&gt;withCheckOptions&lt;/code&gt; list together. The executor checks the new row against both conditions in order; a violation of either raises an error. Because they sit in the same list and are enforced the same way, the enforcement mechanism is single.&lt;/p&gt;

&lt;h3&gt;
  
  
  View transformation unwinds recursively
&lt;/h3&gt;

&lt;p&gt;When the view-target DML work finishes, the resulting query re-enters RewriteQuery recursively. The base table might still have another view attached, and if RLS is attached to the base table, it gets applied in step 2 (&lt;code&gt;fireRIRrules&lt;/code&gt;) that runs after this transformation. That's why multi-layer constructions like view-on-view, view-on-RLS, and RLS-on-view each unwind at their own layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means in practice
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;First, code that takes the RETURNING result from a view-target DML by positional index can break when the view definition changes.&lt;/strong&gt; When you issue a DML against a view with &lt;code&gt;RETURNING *&lt;/code&gt;, the result columns are labeled with the view's column names, but the data is actually read from the base table. Take a view defined like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;account_summary&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Application code issues an INSERT against this view, takes the &lt;code&gt;RETURNING *&lt;/code&gt;, and assumes the first slot is &lt;code&gt;id&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INSERT INTO account_summary (owner, balance) VALUES (%s, %s) &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RETURNING *&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;alice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fetchone&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;new_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;   &lt;span class="c1"&gt;# assumed to be id
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Later, during operation, suppose the team decides to add &lt;code&gt;created_at&lt;/code&gt; to the view and redefines the column order.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;account_summary&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the same code keeps running, &lt;code&gt;row[0]&lt;/code&gt; is now &lt;code&gt;created_at&lt;/code&gt;, and &lt;code&gt;new_id&lt;/code&gt; ends up with a timestamp. The code still compiles, no SQL error fires, so the bug is found late. Reading by column name (&lt;code&gt;row["id"]&lt;/code&gt;) lets the value find its own slot even if the view's column order shifts. On code paths that issue DML through views, standardizing on reading RETURNING results by column name is the safer choice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second, an UPDATE against a view with WITH CHECK OPTION cannot shove the very row it is updating outside the view.&lt;/strong&gt; One common trap shows up on the soft-delete pattern. Soft-delete means not actually DELETEing a row but recording a deletion timestamp in a column like &lt;code&gt;deleted_at&lt;/code&gt;, treating the row as "removed". A view that exposes only "rows not yet deleted" is then layered on top. &lt;code&gt;active_accounts&lt;/code&gt; is exactly that shape: in the base &lt;code&gt;accounts&lt;/code&gt; the row stays put, and only rows with &lt;code&gt;deleted_at IS NULL&lt;/code&gt; are exposed through the view. When application code tries to soft-delete a row through that same view, as in &lt;code&gt;UPDATE active_accounts SET deleted_at = now() WHERE id = ?&lt;/code&gt;, the row resulting from that UPDATE no longer satisfies &lt;code&gt;deleted_at IS NULL&lt;/code&gt;. The row falls outside the view's visibility range (the area of "visible rows" the view's WHERE defines). WITH CHECK OPTION blocks this with an error. The application meant to soft-delete, but the view's visibility promise stands in direct conflict with that intent, so the operation is rejected.&lt;/p&gt;

&lt;p&gt;There are two ways to resolve the conflict. One is to issue the soft-delete UPDATE directly against the base table instead of routing it through the view. The view keeps the read-side visibility job, and the soft-delete write happens on the base. The other is to intentionally drop the view's CHECK OPTION and instead enforce, at the application layer, the promise that "INSERTed/UPDATEd rows must satisfy &lt;code&gt;deleted_at IS NULL&lt;/code&gt;". The application takes on the promise the engine used to enforce.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>internals</category>
      <category>views</category>
    </item>
    <item>
      <title>1.3.3 INSERT/UPDATE/DELETE Targetlist Normalization (Default Expansion)</title>
      <dc:creator>JoongHyuk Shin</dc:creator>
      <pubDate>Fri, 22 May 2026 09:52:25 +0000</pubDate>
      <link>https://dev.to/joonghyukshin/133-insertupdatedelete-targetlist-normalization-default-expansion-31nl</link>
      <guid>https://dev.to/joonghyukshin/133-insertupdatedelete-targetlist-normalization-default-expansion-31nl</guid>
      <description>&lt;p&gt;The two rewriter tasks we have seen so far, view expansion and RLS policy injection, were unpacked under the assumption that the query is a SELECT. INSERT/UPDATE/DELETE go through one more preparatory stage on top of those. Even when the user only writes &lt;code&gt;INSERT INTO accounts (owner) VALUES ('alice')&lt;/code&gt;, what PostgreSQL ends up looking at is a query where the &lt;code&gt;id&lt;/code&gt;, &lt;code&gt;balance&lt;/code&gt;, and &lt;code&gt;created_at&lt;/code&gt; slots already contain expressions. That preparation does not exist for SELECT. The subject of this section is how that preparation works, and why it has to happen before user-defined rules get applied.&lt;/p&gt;

&lt;h2&gt;
  
  
  DML preparation has to happen before rule application
&lt;/h2&gt;

&lt;p&gt;The rewriter is divided into two phases. The first phase applies user-defined rules; the second applies view expansion and RLS to each resulting Query (&lt;code&gt;fireRIRrules&lt;/code&gt;, seen in 1.3.1 and 1.3.2). SELECT queries have essentially nothing to do in the first phase. Only INSERT/UPDATE/DELETE receive additional processing inside the first phase.&lt;/p&gt;

&lt;p&gt;The first step of that additional processing is targetlist normalization. For INSERT, the rewriter fills in default expressions for columns the user did not name (when those columns have defaults), and replaces any &lt;code&gt;DEFAULT&lt;/code&gt; keyword the user wrote with the actual expression. UPDATE goes through the same normalization function, but with less to do because there are no missing slots to fill. DELETE skips the stage entirely.&lt;/p&gt;

&lt;p&gt;Why does this normalization have to come before rule application? The answer comes from looking at how the RULE system (1.3.1) refers to columns. The &lt;code&gt;accounts&lt;/code&gt; table used throughout this section is defined as follows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;          &lt;span class="nb"&gt;integer&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;IDENTITY&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;owner&lt;/span&gt;       &lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;balance&lt;/span&gt;     &lt;span class="nb"&gt;numeric&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt;  &lt;span class="n"&gt;timestamptz&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;deleted_at&lt;/span&gt;  &lt;span class="n"&gt;timestamptz&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The last column, &lt;code&gt;deleted_at&lt;/code&gt;, has no &lt;code&gt;DEFAULT&lt;/code&gt; clause and is not an identity column. In other words, it is a column with no default. This shape is common in the soft-delete pattern, and the way this column is handled later becomes one interesting case in this section.&lt;/p&gt;

&lt;p&gt;Suppose an audit rule is attached to this table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;RULE&lt;/span&gt; &lt;span class="n"&gt;log_account_insert&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt;
    &lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="n"&gt;ALSO&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;audit_log&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;account_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;owner_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;logged_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                          &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;NEW&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="k"&gt;NEW&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The rule says "whenever someone inserts a row into accounts, also insert a row into audit_log." Inside the rule body, &lt;code&gt;NEW.id&lt;/code&gt; and &lt;code&gt;NEW.owner&lt;/code&gt; are virtual expressions that mean "the &lt;code&gt;id&lt;/code&gt; column, the &lt;code&gt;owner&lt;/code&gt; column of the row the user is about to insert." When the rule author writes these references, they make one assumption: no matter what INSERT the user throws, that row will have an &lt;code&gt;id&lt;/code&gt; slot and an &lt;code&gt;owner&lt;/code&gt; slot, with values in them.&lt;/p&gt;

&lt;p&gt;Now suppose the normalization step did not exist, and the rule received the user's original query as is. The user runs &lt;code&gt;INSERT INTO accounts (owner) VALUES ('alice')&lt;/code&gt;. The original query's targetlist contains a single entry, &lt;code&gt;owner = 'alice'&lt;/code&gt;. The &lt;code&gt;id&lt;/code&gt; is missing. When the rule fires, it looks into the trigger query's targetlist to resolve &lt;code&gt;NEW.id&lt;/code&gt;. There is no matching slot. At this point PostgreSQL's rule mapping (the precise mechanism is covered later in this section) replaces the unmatched slot with a NULL constant. So the rule itself does not error out. But what ends up in audit_log is &lt;code&gt;(NULL, 'alice', ...)&lt;/code&gt;. Meanwhile the user's base INSERT goes through the planner separately, picks up an actual integer (say 17) from the sequence into the &lt;code&gt;id&lt;/code&gt; slot, and lands &lt;code&gt;(17, 'alice', ...)&lt;/code&gt; in accounts. The audit table is supposed to track "what row got inserted," and yet it never receives the new row's id. That is not what the rule author had in mind.&lt;/p&gt;

&lt;p&gt;The opposite case: normalization has finished, then the rule fires. The rewriter has already filled the three missing columns (&lt;code&gt;id&lt;/code&gt;, &lt;code&gt;balance&lt;/code&gt;, &lt;code&gt;created_at&lt;/code&gt;) with expressions, so the trigger query the rule sees has all four columns sitting in their proper slots. &lt;code&gt;NEW.id&lt;/code&gt; points to the sequence expression now occupying that slot, and &lt;code&gt;NEW.owner&lt;/code&gt; points to &lt;code&gt;'alice'&lt;/code&gt;. The rule mapping has nothing to fall back to NULL for. The &lt;code&gt;account_id&lt;/code&gt; slot in the audit_log INSERT receives the same sequence expression, so at execution time audit_log gets a value tied to the same new row id that accounts received. The tracking the rule author originally assumed finally works.&lt;/p&gt;

&lt;p&gt;The same INSERT may have some columns explicitly given and others omitted. The rule has no way to tell. The rewriter has to run before the rule and fill in the missing columns in attno order, so that the user's query the rule sees is in the standard shape. The position of this preparation is determined by what the next stage expects as input.&lt;/p&gt;

&lt;p&gt;This is qualitatively different from what the rewriter does for SELECT. View expansion and RLS are transformations that &lt;em&gt;add conditions that were not there&lt;/em&gt; in the user's query; DML preparation is a transformation that &lt;em&gt;fills in slots that were left empty&lt;/em&gt; in the user's query. Both touch the tree, but with different motivations.&lt;/p&gt;

&lt;h2&gt;
  
  
  INSERT auto-injects default expressions into missing columns
&lt;/h2&gt;

&lt;p&gt;Suppose the user fires this query against the &lt;code&gt;accounts&lt;/code&gt; table just defined.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;owner&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'alice'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The targetList of the Query tree produced by the analyzer contains only the single entry &lt;code&gt;owner = 'alice'&lt;/code&gt;. The id, balance, created_at, and deleted_at are missing. The rewriter fills the empty slots as follows.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;id&lt;/code&gt; is defined as &lt;code&gt;GENERATED ALWAYS AS IDENTITY&lt;/code&gt;, a SQL-standard identity column. PostgreSQL creates a dedicated sequence (a catalog object called &lt;code&gt;accounts_id_seq&lt;/code&gt;) for such columns and, on every INSERT, draws the next integer from that sequence and fills the column with it. What the rewriter places into this slot is the expression node that retrieves the next value from the sequence (&lt;code&gt;NextValueExpr&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;balance&lt;/code&gt; has &lt;code&gt;DEFAULT 0&lt;/code&gt; written in the column definition. That definition is stored as an expression in the &lt;code&gt;pg_attrdef&lt;/code&gt; catalog, and the rewriter pulls the expression out of there and places it into the slot.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;created_at&lt;/code&gt; follows the same route, and the &lt;code&gt;now()&lt;/code&gt; function call expression goes into its slot.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;deleted_at&lt;/code&gt; has no default. The rewriter does not place any expression into this slot.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After three columns get filled, the targetList ends up shaped like &lt;code&gt;(NextValueExpr node, 'alice', 0, now(), empty)&lt;/code&gt; across five positions. The user's query had only one column (&lt;code&gt;owner&lt;/code&gt;), but the query PostgreSQL ends up looking at has four columns filled with expressions and one (&lt;code&gt;deleted_at&lt;/code&gt;) left as an empty slot. The decision of what to fill in each column (or whether to leave it empty) belongs to a single responsibility function (&lt;code&gt;build_column_default&lt;/code&gt;); when an expression is produced, it gets wrapped in a new TargetEntry and inserted into the targetList.&lt;/p&gt;

&lt;p&gt;The decision follows three priority steps. First, if the column is an identity column, the node that retrieves the next value from a sequence. Second, if the column definition contains &lt;code&gt;DEFAULT &amp;lt;expression&amp;gt;&lt;/code&gt;, that expression. Third, if neither applies but the column's data type itself has a default, that type default. In the example above, &lt;code&gt;id&lt;/code&gt; took the first path, while &lt;code&gt;balance&lt;/code&gt; and &lt;code&gt;created_at&lt;/code&gt; took the second. &lt;code&gt;deleted_at&lt;/code&gt; falls through all three.&lt;/p&gt;

&lt;h3&gt;
  
  
  Columns without a default: how the empty slot resolves to NULL
&lt;/h3&gt;

&lt;p&gt;When all three priority steps fail, &lt;code&gt;build_column_default&lt;/code&gt; returns NULL, and the rewriter does not create a TargetEntry for that column in the INSERT targetList. The attno slot itself exists, since the table has five columns by definition, but one of those slots (here, &lt;code&gt;deleted_at&lt;/code&gt;) is simply left empty rather than carrying a TargetEntry. This might look like it breaks the promise stated earlier ("the query the rule sees is in the standard shape with every column sitting in its slot"). If the slot is empty, surely the rule pointing to &lt;code&gt;NEW.deleted_at&lt;/code&gt; should get the wrong value or trigger an error.&lt;/p&gt;

&lt;p&gt;That is not how it works. PostgreSQL has been designed so that every stage downstream of the empty slot interprets it as NULL in a consistent way. A concrete scenario makes this clear. Suppose another rule is also attached to &lt;code&gt;accounts&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;RULE&lt;/span&gt; &lt;span class="n"&gt;forward_deleted_at&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt;
    &lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="n"&gt;ALSO&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;sync_queue&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;account_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;deleted_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                         &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;NEW&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;NEW&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;deleted_at&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This rule references &lt;code&gt;NEW.deleted_at&lt;/code&gt;. But the normalized targetList of the user's &lt;code&gt;INSERT INTO accounts (owner) VALUES ('alice')&lt;/code&gt; does not contain a TargetEntry for &lt;code&gt;deleted_at&lt;/code&gt; (no default, so no expression was placed there). When the rule mapping step looks into the trigger query's targetList for the &lt;code&gt;deleted_at&lt;/code&gt; slot, it does not find one.&lt;/p&gt;

&lt;p&gt;At this point PostgreSQL's rule mapping substitutes a NULL constant in place of the missing TargetEntry. So the &lt;code&gt;deleted_at&lt;/code&gt; slot of the sync_queue INSERT receives NULL. The rule's promise is upheld in a normal way. Pointing to &lt;code&gt;NEW.deleted_at&lt;/code&gt; simply produces NULL; there is no rule breakage, no neighboring column getting accidentally dragged into the slot.&lt;/p&gt;

&lt;p&gt;The base table side works the same way. When the planner produces the plan for the INSERT and sees no TargetEntry for &lt;code&gt;deleted_at&lt;/code&gt; in the normalized targetList, it fills that slot with a NULL constant to complete the row. The row that lands in accounts ends up as (1, 'alice', 0, now(), NULL).&lt;/p&gt;

&lt;p&gt;At both of those sites, the sync_queue INSERT produced by rule mapping and the base accounts row produced by the planner, the NULL is filled in the same way. There is no inconsistency where one site produces NULL and the other produces some other value. That is why the rewriter does not bother building an explicit NULL expression for the targetList. Inserting a NULL TargetEntry and leaving the slot empty have the same end result, so PostgreSQL chose to leave the slot empty and keep the tree lighter.&lt;/p&gt;

&lt;h2&gt;
  
  
  The &lt;code&gt;DEFAULT&lt;/code&gt; keyword arrives as a SetToDefault placeholder and gets resolved
&lt;/h2&gt;

&lt;p&gt;The previous section covered automatic filling when the user &lt;em&gt;omits&lt;/em&gt; a column. The user can also write the &lt;code&gt;DEFAULT&lt;/code&gt; keyword explicitly to say "fill this slot with the default expression."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DEFAULT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'bob'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When this SQL passes through the analyzer, each &lt;code&gt;DEFAULT&lt;/code&gt; written by the user becomes a &lt;code&gt;SetToDefault&lt;/code&gt; placeholder node in the Query tree. SetToDefault is just a temporary dummy placed where an expression node belongs; it is not an executable expression. The node definition comment in PostgreSQL says this directly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cm"&gt;/*
 * Placeholder node for a DEFAULT marker in an INSERT or UPDATE command.
 *
 * This is not an executable expression: it must be replaced by the actual
 * column default expression during rewriting.
 */&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The expression that replaces a SetToDefault is built by the same responsibility function shown earlier. The same three-step priority applies, so each &lt;code&gt;DEFAULT&lt;/code&gt; slot receives the corresponding identity / &lt;code&gt;pg_attrdef&lt;/code&gt; / type-default expression. The core mechanism is the same up to this point. What differs is &lt;em&gt;where&lt;/em&gt; the placeholder lives in the tree, and that depends on the form of the INSERT.&lt;/p&gt;

&lt;h3&gt;
  
  
  Single-row INSERT is substituted directly in the targetList
&lt;/h3&gt;

&lt;p&gt;For the single-row INSERT case above, since there is only one row, the four values (&lt;code&gt;DEFAULT&lt;/code&gt;, &lt;code&gt;'bob'&lt;/code&gt;, &lt;code&gt;DEFAULT&lt;/code&gt;, &lt;code&gt;DEFAULT&lt;/code&gt;) go straight into the Query's targetList. The three SetToDefault nodes sit at positions 1, 3, and 4 of the targetList. The rewriter walks the targetList column by column, and whenever it encounters a SetToDefault, it substitutes the expression produced by the responsibility function in place. In terms of the Query tree structure from 1.3.1, one row in the targetList represents the expression for one column, and that expression flips from SetToDefault to the actual default expression.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-row INSERT runs through a VALUES RTE in between
&lt;/h3&gt;

&lt;p&gt;For multi-row INSERT, which inserts several rows in one statement, the Query tree shape is slightly different.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DEFAULT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'bob'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="k"&gt;DEFAULT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DEFAULT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'carol'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="k"&gt;DEFAULT&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                            &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DEFAULT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'dave'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="k"&gt;DEFAULT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2026-01-01'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With three rows, the targetList cannot hold three rows at once (targetList is shaped as one expression per column). Instead, PostgreSQL bundles the rows into a separate node and attaches it to the Query tree's from-list. That node is called a &lt;strong&gt;VALUES RTE&lt;/strong&gt;. RTE is short for &lt;code&gt;RangeTblEntry&lt;/code&gt; (introduced in 1.2.3), a single data-source unit referenced from the from-list (a table, a sub-query, a function, and here, the VALUES bundle). A VALUES RTE holds a list of rows internally, and each row in turn holds a list of values. The seven SetToDefault placeholders from the INSERT above are scattered across positions 1, 3, and 4 of the three rows inside the VALUES RTE.&lt;/p&gt;

&lt;p&gt;Why does the tree shape differ for what looks like the same INSERT? A single-row INSERT is the case where producing one row &lt;em&gt;is&lt;/em&gt; the entire result of the Query, while a multi-row INSERT is closer to "produce a set of rows and pour them into the table." PostgreSQL handles the latter the same way it handles a regular sub-query. A VALUES RTE sits in the same kind of slot a SELECT's result set would, an "inline data source." That is how &lt;code&gt;INSERT INTO accounts SELECT ...&lt;/code&gt; and &lt;code&gt;INSERT INTO accounts VALUES (...), (...)&lt;/code&gt; end up sharing the same rewriter and planner code paths.&lt;/p&gt;

&lt;p&gt;The multi-row path walks each value of each row inside the VALUES RTE, and whenever it sees a SetToDefault, it substitutes the expression produced by the same responsibility function. The expression produced is identical to the single-row case; only the location it lands in differs.&lt;/p&gt;

&lt;p&gt;There is one small optimization. A VALUES RTE may carry 100 or 1,000 rows. If none of them contain a SetToDefault, the rewriter skips the expensive list rebuild entirely. Rather than reconstructing all those rows, it passes the original list through to the next stage. The reason PostgreSQL has this fast path is the practical observation that multi-row INSERT is typically used for bulk loads, where the DEFAULT keyword almost never appears.&lt;/p&gt;

&lt;h2&gt;
  
  
  UPDATE only tidies what the user wrote; DELETE leaves the targetlist alone
&lt;/h2&gt;

&lt;p&gt;UPDATE goes through the same normalization function as INSERT, but it does much less work than INSERT does. There is no need to look for missing columns to fill. The SET clause of an UPDATE only puts the columns the user explicitly listed into the targetList, and the columns the user did not list must keep the existing row's values. So there is nothing to fill in for empty slots. The two things the rewriter does for UPDATE are: first, resolve any &lt;code&gt;DEFAULT&lt;/code&gt; keyword the user wrote into the actual expression (the same SetToDefault substitution mechanism as in INSERT); second, combine separate-line updates of &lt;em&gt;parts&lt;/em&gt; of the same column into a single expression. The second item needs a brief explanation.&lt;/p&gt;

&lt;p&gt;A column in PostgreSQL can be more than a simple scalar (&lt;code&gt;int&lt;/code&gt;, &lt;code&gt;text&lt;/code&gt;, etc.); it can be a composite type or an array. You can write a SET clause that updates only &lt;em&gt;part&lt;/em&gt; of such a column. For example, if &lt;code&gt;address&lt;/code&gt; is a composite-type column shaped like &lt;code&gt;(street, city, zip)&lt;/code&gt;, you can update two different fields of the same column in one UPDATE.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt;
   &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;street&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'101 Pine St'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Seattle'&lt;/span&gt;
   &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or you can update different subscripts of an array column &lt;code&gt;tags&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt;
   &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'premium'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'verified'&lt;/span&gt;
   &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The two SET items look like the same column name (&lt;code&gt;address&lt;/code&gt;, &lt;code&gt;tags&lt;/code&gt;) appearing twice in the SQL text, but they really refer to &lt;em&gt;different parts&lt;/em&gt; of the column. PostgreSQL combines the two lines into a &lt;em&gt;single&lt;/em&gt; expression that operates on the column as a whole. &lt;code&gt;address.street = '101 Pine St'&lt;/code&gt; followed by &lt;code&gt;address.city = 'Seattle'&lt;/code&gt; becomes "take the original address, change street first, then take the result and change city" as one expression. As a result, even though the column name appears twice in the SET clause, at execution time the column is updated once.&lt;/p&gt;

&lt;p&gt;When the &lt;em&gt;entire&lt;/em&gt; column is assigned twice (&lt;code&gt;UPDATE t SET address = X, address = Y&lt;/code&gt;, replacing the whole column twice), that is no longer a partial update but a genuine conflict, and the rewriter throws a syntax error. The combining only happens when the two SET items point to &lt;em&gt;different parts&lt;/em&gt; of the same column.&lt;/p&gt;

&lt;p&gt;DELETE is simpler still. In the DML branch, DELETE has a one-line case body that reads "Nothing to do here." DELETE produces no column values. It only decides which rows to remove, so there is nothing to do to the targetList itself. None of the things we have seen so far (INSERT's default filling, the DEFAULT keyword substitution for INSERT/UPDATE, UPDATE's partial-update combining) apply to DELETE. The amount of work the rewriter does is set by how much responsibility the user's query has for producing row data.&lt;/p&gt;

&lt;p&gt;That covers the DML's targetlist normalization. We have walked through the &lt;em&gt;filling-in-empty-slots&lt;/em&gt; half of DML rewriting. The other half, &lt;em&gt;redirecting references to point at different places&lt;/em&gt;, is covered in 1.3.4 through RETURNING mapping and DML-on-view transformation.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means in practice
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;First, when an INSERT feels slow, default expressions are the first thing to suspect.&lt;/strong&gt; Even if the user does not name a column, the default expression for it still runs on every INSERT. If an expression in &lt;code&gt;pg_attrdef&lt;/code&gt; calls a function (&lt;code&gt;now()&lt;/code&gt;, &lt;code&gt;uuid_generate_v4()&lt;/code&gt;) or a sub-query, that function or sub-query runs once per missing column on every INSERT. Sequence consumption, statistics counters, side-effect function calls, all of these happen exactly as if the user had provided the value explicitly. When INSERT performance looks off, the fastest first check is to inspect what each default expression evaluates to via &lt;code&gt;\d+ &amp;lt;table&amp;gt;&lt;/code&gt;. Avoiding heavy sub-queries in default expressions follows the same reasoning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second, RULE definitions can safely reference &lt;code&gt;NEW.&amp;lt;column&amp;gt;&lt;/code&gt; even for columns without a default.&lt;/strong&gt; With a rule like the &lt;code&gt;forward_deleted_at&lt;/code&gt; example from the body text, there is no need to separately check whether the user named &lt;code&gt;deleted_at&lt;/code&gt;. If they did, the value flows through as written; if they did not, NULL flows through. Missing data is uniformly carried as NULL, so the rule author does not have to branch on every possible combination of "which columns did the user actually write." This safety guarantee shows up often in INSERT triggers and audit rules.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>internals</category>
      <category>rewriter</category>
    </item>
    <item>
      <title>1.3.2 How RLS Rewrites Queries (Mechanism)</title>
      <dc:creator>JoongHyuk Shin</dc:creator>
      <pubDate>Sun, 17 May 2026 10:53:10 +0000</pubDate>
      <link>https://dev.to/joonghyukshin/132-how-rls-rewrites-queries-mechanism-1eaa</link>
      <guid>https://dev.to/joonghyukshin/132-how-rls-rewrites-queries-mechanism-1eaa</guid>
      <description>&lt;p&gt;Once you put row-level security (RLS) on a table, queries that read from that table don't run the way the user wrote them. A single line &lt;code&gt;SELECT * FROM accounts&lt;/code&gt; comes out of the rewriter looking more like &lt;code&gt;SELECT * FROM accounts WHERE owner = current_user&lt;/code&gt;. The user never wrote a WHERE clause, but it's there. What RLS as a feature does is covered in 7.5. Here we look at what that "silently changes" actually is, inside Postgres.&lt;/p&gt;

&lt;h2&gt;
  
  
  The qual the rewriter adds
&lt;/h2&gt;

&lt;p&gt;The core of RLS is simpler than it sounds. It takes the expression written in the policy and grafts it onto the target table as a qual. That's the whole mechanism. What PG calls a qual is a shorthand for a boolean expression that filters rows, the kind of thing that lives in a WHERE clause (short for "qualification", and the same concept as what the SQL standard usually calls a predicate). If a query is the whole tree carrying SELECT/FROM/WHERE and so on, a qual is one expression node inside that tree. RLS doesn't run a separate enforcement engine. It plants the policy's expression into the query tree and deforms the tree itself.&lt;/p&gt;

&lt;p&gt;This work happens at the same site as view expansion (covered in 1.3.1). The RIR pass, driven by &lt;code&gt;fireRIRrules&lt;/code&gt;, finishes everything else first (CTE handling, view expansion, sublink recursion) and then, as the &lt;strong&gt;last&lt;/strong&gt; step, applies RLS policies. It walks the query's range table one more time and, for each RTE that points at a regular relation, attaches the policy's conditions.&lt;/p&gt;

&lt;p&gt;The place those conditions go is the &lt;code&gt;securityQuals&lt;/code&gt; field on the RTE. The comment in the PG source describes the field this way:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cm"&gt;/*
 * securityQuals is a list of security barrier quals (boolean expressions),
 * to be tested in the listed order before returning a row from the
 * relation.  It is always NIL in parser output.  Entries are added by the
 * rewriter to implement security-barrier views and/or row-level security.
 */&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things stand out. First, &lt;code&gt;securityQuals&lt;/code&gt; is filled by the rewriter, not the parser. Conditions appear on the RTE that weren't in the user's query at all. Second, RLS and security-barrier views share the same container. A security-barrier view (defined with &lt;code&gt;CREATE VIEW ... WITH (security_barrier)&lt;/code&gt;) is one whose filter conditions the planner is forbidden from reordering against the user's WHERE clauses. Its purpose is essentially the same as RLS, so PG implements both on top of the same machinery. Even though 1.3.1 showed view expansion turning a view reference into a sub-query, the conditions of a security-barrier view and the conditions of an RLS policy end up in the very same &lt;code&gt;securityQuals&lt;/code&gt; list. (The leaky view attack scenarios that security-barrier views guard against, and the operational pitfalls, are covered in 7.5.3.)&lt;/p&gt;

&lt;p&gt;RLS fills the same container the same way. A "policy" in RLS is a catalog object created by &lt;code&gt;CREATE POLICY&lt;/code&gt;. A single policy bundles five things together: a name, which commands it applies to (SELECT/INSERT/UPDATE/DELETE or ALL), which roles it applies to, the USING expression that filters existing rows, and the WITH CHECK expression that validates new rows. It's stored as one row in the &lt;code&gt;pg_policy&lt;/code&gt; catalog, and it rides along when the table's relcache is loaded. The rewriter pulls the USING and WITH CHECK expressions out of that catalog row and grafts them onto the query tree.&lt;/p&gt;

&lt;p&gt;Suppose the &lt;code&gt;accounts&lt;/code&gt; table has this policy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;POLICY&lt;/span&gt; &lt;span class="n"&gt;account_owner&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt;
    &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;owner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;current_user&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The user runs &lt;code&gt;SELECT * FROM accounts WHERE balance &amp;gt; 0&lt;/code&gt;. After the analyzer, the Query tree's RTE for &lt;code&gt;accounts&lt;/code&gt; has an empty &lt;code&gt;securityQuals&lt;/code&gt;. When the rewriter's RLS pass visits that RTE, it pulls the policy's USING expression &lt;code&gt;owner = current_user&lt;/code&gt;, copies it, fixes up the Var references so they point at the right relation, and drops it into &lt;code&gt;securityQuals&lt;/code&gt;. What the planner ends up seeing, conceptually, is something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;owner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;current_user&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;owner = current_user&lt;/code&gt; is a condition the user never wrote. The rewriter pulled it out of the policy and stuck it in.&lt;/p&gt;

&lt;p&gt;One thing to flag, though. That SQL above is a conceptual view. PG isn't actually rewriting the user's SQL text into that form. The text the user submitted stays as it was. What gets transformed is the Query tree it was parsed into. The policy condition lands on the &lt;code&gt;accounts&lt;/code&gt; RTE's &lt;code&gt;securityQuals&lt;/code&gt; field inside that tree. The SQL above is just a human-readable rendering of the tree-level change. What &lt;code&gt;pg_stat_statements&lt;/code&gt; records and what shows up in the logs is the original SQL the user typed. The RLS transformation happens at the tree level, not at the text level.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the rewriter, not somewhere else
&lt;/h2&gt;

&lt;p&gt;Doing RLS as a query transformation is itself a design choice. Row-level access control can be built in other ways. One alternative is to add a separate filter stage in the execution engine that evaluates the policy condition as each row comes off the relation. PG didn't go that route. It does the work in the rewriter, planting the condition into the query tree.&lt;/p&gt;

&lt;p&gt;The reason lies one step downstream, in the planner. Once the rewriter has folded the policy condition in as a normal qual, the planner doesn't see it as anything special. It looks like just another condition in the query. So if there's an index on the column, the planner can pick an index scan. If the condition can be pushed below a join, the planner can push it. It estimates selectivity from statistics like any other qual. The RLS condition benefits from the planner's full optimization machinery.&lt;/p&gt;

&lt;p&gt;A separate runtime filter wouldn't get any of this. The filter tends to bolt onto the tail of an already-built plan, which means the table gets fully scanned before the policy condition is applied. There's no path to index the policy condition. Putting RLS in the rewriter is the choice that keeps access control inside the part of the query the planner can optimize.&lt;/p&gt;

&lt;p&gt;There's also a reason &lt;code&gt;fireRIRrules&lt;/code&gt; saves RLS for last. A policy's condition can contain a sub-query (for example, &lt;code&gt;USING (owner IN (SELECT ...))&lt;/code&gt;), and adding such a condition means the infinite-recursion check needs to run again on the new condition. If RLS were folded into the view-expansion loop above, the tree walk for view expansion would end up visiting the sub-queries inside RLS conditions twice. Pulling RLS out into a separate pass after view expansion avoids that.&lt;/p&gt;

&lt;h2&gt;
  
  
  The qual that filters and the qual that checks
&lt;/h2&gt;

&lt;p&gt;A policy can carry two kinds of condition, and the two land in different parts of the query tree and get enforced in different ways. This distinction is the most important point in the whole RLS mechanism.&lt;/p&gt;

&lt;p&gt;The USING condition decides &lt;strong&gt;which of the existing rows in the table are visible&lt;/strong&gt;. The &lt;code&gt;owner = current_user&lt;/code&gt; from before is a USING. It goes into &lt;code&gt;securityQuals&lt;/code&gt; and behaves like a WHERE clause. Rows that don't match are simply absent from the result. They disappear silently.&lt;/p&gt;

&lt;p&gt;The WITH CHECK condition decides &lt;strong&gt;which new rows are allowed in&lt;/strong&gt;. Rows added by INSERT or UPDATE have to satisfy it. This condition doesn't go into &lt;code&gt;securityQuals&lt;/code&gt;. It goes into a separate list on the Query node called &lt;code&gt;withCheckOptions&lt;/code&gt;, and it gets enforced differently. When a row violates the condition, the row isn't quietly dropped. PG &lt;strong&gt;raises an error&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Why is one silent and the other loud? The scenarios make it obvious. Suppose the user runs &lt;code&gt;SELECT * FROM accounts&lt;/code&gt; and there happen to be 100 accounts they don't own. It's natural for the USING condition to remove those from the result. The user asked for "rows I can see," and the unseen rows being missing is the expected behavior. It's not an error.&lt;/p&gt;

&lt;p&gt;Now suppose the user runs &lt;code&gt;INSERT INTO accounts VALUES (...)&lt;/code&gt; trying to insert an account owned by someone else. If that row were silently dropped, the user would believe the INSERT succeeded while the data isn't actually there. The command reported success, but there's no result behind it. That's the worst kind of bug to leave for the application. So PG raises an error on WITH CHECK violations. The comment in rowsecurity.c spells out this intent: the check option exists specifically to make sure a policy violation is signaled as an error, because otherwise rows you were trying to add could silently disappear.&lt;/p&gt;

&lt;p&gt;So USING is a visibility filter and WITH CHECK is a write gate. One is silent, the other is loud. The same RLS policy gets unpacked into different containers and different enforcement on the read path versus the write path.&lt;/p&gt;

&lt;p&gt;There's one convenience. If a policy doesn't specify an explicit WITH CHECK, PG reuses the USING condition as the WITH CHECK. The reasoning is that "only rows you can read can be written" is a reasonable default in most cases. Specifying a separate WITH CHECK lets you write policies where the read condition and the write condition differ.&lt;/p&gt;

&lt;h2&gt;
  
  
  How policies get combined
&lt;/h2&gt;

&lt;p&gt;A single table can carry multiple policies. By what rule does the rewriter combine them into one condition?&lt;/p&gt;

&lt;p&gt;Policies come in two flavors: permissive and restrictive. A permissive policy says "allow if this condition is met," and multiple permissive policies are combined with OR. Any one of them passing is enough to make the row visible. A restrictive policy says "must satisfy this condition," and they're combined with AND. Violating any single restrictive policy hides the row. The rewriter filters the policies by command type and role, then ORs the permissive ones into a single chunk and ANDs the restrictive ones onto it.&lt;/p&gt;

&lt;p&gt;Imagine the &lt;code&gt;accounts&lt;/code&gt; table has three policies. One is a permissive policy that lets users see accounts they own, with the condition &lt;code&gt;owner = current_user&lt;/code&gt;. Another is a permissive policy that also lets users see accounts flagged as public, with &lt;code&gt;is_public = true&lt;/code&gt;. The third is a restrictive policy that blocks deleted accounts no matter what, with &lt;code&gt;deleted_at IS NULL&lt;/code&gt;. When the user runs a SELECT, the rewriter combines the three into the following condition on &lt;code&gt;accounts&lt;/code&gt;'s &lt;code&gt;securityQuals&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;owner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;current_user&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;is_public&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;deleted_at&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The two permissive conditions OR together to form an allow-channel ("either I own it, or it's public"), and the single restrictive condition ANDs on top to enforce "and also, not deleted." Adding more permissive policies widens the allow-channel, so more rows become visible. Adding more restrictive policies adds more gates that have to pass, so fewer rows become visible. The general shape is &lt;code&gt;(permissive1 OR permissive2 OR ...) AND restrictive1 AND restrictive2 AND ...&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;There's one rule worth highlighting. &lt;strong&gt;If there are no permissive policies at all, no row is visible.&lt;/strong&gt; That holds even if there are plenty of restrictive policies on the table. The reason is that permissive policies are the grounds for visibility, and restrictive policies are constraints that trim what's already allowed. With no grounds for visibility, there's nothing to trim. The rewriter handles this by putting a single &lt;code&gt;false&lt;/code&gt; constant into &lt;code&gt;securityQuals&lt;/code&gt;. The effective condition becomes &lt;code&gt;WHERE false&lt;/code&gt;, and no row passes.&lt;/p&gt;

&lt;p&gt;That's RLS's default-deny behavior. If you run &lt;code&gt;ALTER TABLE ... ENABLE ROW LEVEL SECURITY&lt;/code&gt; on a table without creating a single policy, no one (except the table owner or a user with BYPASSRLS) sees a single row. It might feel counterintuitive that not writing a WHERE clause results in zero rows, but RLS doesn't sit on the same layer as the user's WHERE. It's an access-control gate sitting above that layer. The gate's default being deny is the same fail-closed principle behind firewalls and filesystem ACLs. Zero policies means zero rules granting passage, so zero rows pass through, which is internally consistent. Turning RLS on is in itself a declaration that "anything not explicitly allowed is forbidden," and that's the safe default that prevents one forgotten policy from leaking data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security barrier: quals have an order
&lt;/h2&gt;

&lt;p&gt;That &lt;code&gt;securityQuals&lt;/code&gt; is not just a list but an &lt;strong&gt;ordered&lt;/strong&gt; list is the last piece of the RLS mechanism. The field comment said "tested in the listed order," and that order matters.&lt;/p&gt;

&lt;p&gt;Why does it matter? After the rewriter plants the policy conditions, the planner picks each element of &lt;code&gt;securityQuals&lt;/code&gt; out (in &lt;code&gt;process_security_barrier_quals&lt;/code&gt;) and moves it into the rel's qual list. As it does so, it assigns each element a distinct security_level. Earlier elements get a lower level, later elements a higher level. Conditions from the user's own WHERE clause get an even higher level than any of those.&lt;/p&gt;

&lt;p&gt;The level enforces evaluation order. Lower-level conditions have to be evaluated before higher-level ones. The concrete thing it prevents is a function the user wrote in WHERE getting evaluated before the RLS condition.&lt;/p&gt;

&lt;p&gt;Picture this query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;leaky_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;account_number&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Suppose &lt;code&gt;leaky_function&lt;/code&gt; is a function that leaks its argument value somewhere. The leak can take several forms. A PL/pgSQL function might write the argument to the server log with &lt;code&gt;RAISE NOTICE '%', $1&lt;/code&gt;. Or you can write a function that deliberately raises an error for certain argument values, like &lt;code&gt;1 / (CASE WHEN account_number = '...' THEN 0 ELSE 1 END)&lt;/code&gt; triggering a division-by-zero. When PG raises a runtime error like that, the error context includes the call site and the argument expression, so the argument value gets carried out into the user-visible error message or the server log. Any path where information about the argument escapes the function through something other than its return value makes the function leaky. (The opposite is leakproof, a function guaranteed to have no such paths.) PG groups all such functions under the term &lt;strong&gt;leaky functions&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The RLS policy is supposed to only show rows where &lt;code&gt;owner = current_user&lt;/code&gt;. But if the planner, for optimization reasons, evaluates &lt;code&gt;leaky_function&lt;/code&gt; before the RLS condition, the function receives the &lt;code&gt;account_number&lt;/code&gt; of rows the user has no right to see as its argument. Even if the policy then removes those rows from the result, the function already saw the values and leaked them, into the server log or the error message. RLS has been pierced.&lt;/p&gt;

&lt;p&gt;The security_level prevents this. The RLS condition is at a lower level, and the user's WHERE conditions are at a higher level. The planner therefore cannot push a user condition containing a non-leakproof function below the RLS condition. The RLS condition filters first, and only the rows that pass make it to the user's function. This is the substance of the name "security barrier": qual order keeps optimization from crossing the security boundary.&lt;/p&gt;

&lt;p&gt;There's one more thing about the order. The rewriter attaches the RLS conditions to the &lt;strong&gt;front&lt;/strong&gt; of &lt;code&gt;rte-&amp;gt;securityQuals&lt;/code&gt;. If a table has both an RLS policy and conditions coming in from a security-barrier view, the RLS condition on the table itself ends up at the lower level, the position that gets evaluated first. The decision is that policies bound directly to a table take precedence over conditions coming in through a view.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means in practice
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;First, the performance of queries on an RLS-protected table comes down to whether the policy condition can use an index.&lt;/strong&gt; The RLS condition isn't a separate filter, it's a regular qual merged into the query, so the planner may or may not be able to use an index for it. A simple &lt;code&gt;USING (owner = current_user)&lt;/code&gt; will use an index scan if there's one on &lt;code&gt;owner&lt;/code&gt;. A &lt;code&gt;USING (owner IN (SELECT ... FROM ...))&lt;/code&gt; with a sub-query, on the other hand, attaches that sub-query to every query against the table. If queries on an RLS table feel slow, the first step is to use &lt;code&gt;EXPLAIN&lt;/code&gt; to see how the policy condition is being executed. Policy conditions are part of index design too.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second, if an INSERT under RLS completes without error, the row actually went in.&lt;/strong&gt; WITH CHECK violations are loud, not silent. Rows blocked by the policy don't vanish behind a success response, so you can trust that on the read side. What's a separate concern is whether the application code is catching that error and handling it properly. Make sure WITH CHECK violation errors are routed through the same handling path as ordinary constraint violations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third, turning RLS on without creating a policy locks the table down entirely.&lt;/strong&gt; &lt;code&gt;ENABLE ROW LEVEL SECURITY&lt;/code&gt; is itself a default-deny declaration. If a migration enables RLS in one statement and creates policies in another, and the second statement is missing, any code deployed in between gets empty results. To make matters worse, table owners bypass RLS, so the problem won't surface in a development environment where you connect as the owner; it only shows up in production. The lesson is to put the RLS-enable statement and the policy creation in a single transaction, and to test at least once with a role that isn't the owner.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>internals</category>
      <category>security</category>
    </item>
    <item>
      <title>1.3.1 Rule System and View Expansion</title>
      <dc:creator>JoongHyuk Shin</dc:creator>
      <pubDate>Thu, 14 May 2026 10:49:29 +0000</pubDate>
      <link>https://dev.to/joonghyukshin/131-rule-system-and-view-expansion-2cg1</link>
      <guid>https://dev.to/joonghyukshin/131-rule-system-and-view-expansion-2cg1</guid>
      <description>&lt;p&gt;This section is about how PostgreSQL makes a view reference disappear. When a user writes &lt;code&gt;SELECT * FROM my_view&lt;/code&gt;, the planner no longer sees &lt;code&gt;my_view&lt;/code&gt; in that query. As soon as PostgreSQL receives the query, it moves the SELECT definition that the view name points to into that spot. Where and how that substitution happens, and how the fact that the mechanism rides on PostgreSQL's old RULE system surfaces in application code, is the subject of this section.&lt;/p&gt;

&lt;h2&gt;
  
  
  View and materialized view: what is the difference
&lt;/h2&gt;

&lt;p&gt;Views themselves are worth one paragraph of setup. PostgreSQL has two kinds of views: the plain view and the materialized view.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A plain view holds no data.&lt;/strong&gt; A view definition is a SELECT line (or a larger query), and every time you SELECT from the view, you get back the result of running that definition SELECT freshly. The view itself has no storage. Only the view definition is registered in the catalog, and each time a query comes in, that definition gets expanded to read data from the base tables. &lt;strong&gt;The subject of this section is the expansion of plain views.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A materialized view stores its data.&lt;/strong&gt; It runs the view definition's SELECT once, keeps the result on disk, and later SELECTs read that stored result directly. When the base table data changes, the materialized view's result is not refreshed automatically; the user has to explicitly run &lt;code&gt;REFRESH MATERIALIZED VIEW&lt;/code&gt; to recompute it. So the rewriter leaves materialized views alone, without expanding them.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Plain view&lt;/th&gt;
&lt;th&gt;Materialized view&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Data storage&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Yes (on disk)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Behavior on SELECT&lt;/td&gt;
&lt;td&gt;Runs definition SELECT each time&lt;/td&gt;
&lt;td&gt;Reads stored result&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;When updated&lt;/td&gt;
&lt;td&gt;Reflects base table changes immediately&lt;/td&gt;
&lt;td&gt;Updated only on &lt;code&gt;REFRESH&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rewriter handling&lt;/td&gt;
&lt;td&gt;Expanded into definition SELECT&lt;/td&gt;
&lt;td&gt;Left as is&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;With this picture in mind, it becomes clear that this section deals with how PostgreSQL moves the definition SELECT into place for plain views.&lt;/p&gt;

&lt;h2&gt;
  
  
  PostgreSQL's RULE system
&lt;/h2&gt;

&lt;p&gt;PostgreSQL has a mechanism called the RULE system. It is the system responsible for &lt;strong&gt;rewriting the query itself, right after the query arrives at the backend and before it is actually executed&lt;/strong&gt;. As noted in the 1.3 chapter introduction, the rewriter is the stage that changes the form of a query while preserving its meaning, and the RULE system is the tool that decides which rules to apply for that form change. (Transformations that change the meaning itself, such as subquery unnesting or predicate push-down, are the planner's responsibility in PostgreSQL and are covered in chapter 1.4.)&lt;/p&gt;

&lt;p&gt;A rule is a bundle of four elements.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Element&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Target relation&lt;/td&gt;
&lt;td&gt;Which table/view it applies to&lt;/td&gt;
&lt;td&gt;the &lt;code&gt;users&lt;/code&gt; table&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Event type&lt;/td&gt;
&lt;td&gt;Which command triggers it&lt;/td&gt;
&lt;td&gt;one of INSERT, UPDATE, DELETE, SELECT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Condition&lt;/td&gt;
&lt;td&gt;Under what condition it fires (optional)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;WHERE NEW.role = 'admin'&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Action&lt;/td&gt;
&lt;td&gt;What it does when fired&lt;/td&gt;
&lt;td&gt;replace the original command with another query, or run an additional query&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This bundle is stored as one row in the &lt;code&gt;pg_rewrite&lt;/code&gt; catalog. A user can create a rule directly with the &lt;code&gt;CREATE RULE&lt;/code&gt; command, and PostgreSQL also registers one rule automatically when a view is created.&lt;/p&gt;

&lt;p&gt;Comparing it to the trigger, which does a similar job, makes the position of the RULE clear. &lt;strong&gt;A trigger acts at the executor stage, hooking into each individual row.&lt;/strong&gt; This row was inserted, so add a row to the audit table, that kind of thing. &lt;strong&gt;A rule rewrites the query as a whole at the rewriter stage.&lt;/strong&gt; Query A came in, so rewrite it as query B and run that, that kind of thing. If a trigger is a "per-row side effect," a rule is a "per-query transformation."&lt;/p&gt;

&lt;p&gt;Here is one example of a rule a user might define directly. Suppose you want to record into an audit table every time an INSERT happens on the &lt;code&gt;users&lt;/code&gt; table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;RULE&lt;/span&gt; &lt;span class="n"&gt;log_user_insert&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;
    &lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="n"&gt;ALSO&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;user_audit_log&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;NEW&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'INSERT'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;DO ALSO&lt;/code&gt; means "run the original INSERT as is, and additionally run this audit INSERT too." With this rule in place, when you fire &lt;code&gt;INSERT INTO users (id, name) VALUES (1, 'alice')&lt;/code&gt;, the result of passing through the rewriter is two Query trees. One is the original INSERT into &lt;code&gt;users&lt;/code&gt;, the other is the INSERT into &lt;code&gt;user_audit_log&lt;/code&gt;. Both queries go through the planner and executor and get executed. This is what it looks like for one SQL text to fan out into multiple statements as it passes through the rewriter. (&lt;code&gt;DO INSTEAD&lt;/code&gt; replaces the original query; &lt;code&gt;DO ALSO&lt;/code&gt;, or the default, adds to the original query.)&lt;/p&gt;

&lt;p&gt;A user-defined rule is one use case of the RULE system. But the rule you encounter most often in ordinary applications is not one a user made directly. It is the &lt;strong&gt;SELECT rule that PostgreSQL registers automatically when you create a view&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  A view is a special case of the RULE system
&lt;/h2&gt;

&lt;p&gt;Creating a view in PostgreSQL means registering two things at once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One is the registration of an empty table.&lt;/strong&gt; When you run &lt;code&gt;CREATE VIEW my_view AS SELECT ...&lt;/code&gt;, PostgreSQL adds a row to &lt;code&gt;pg_class&lt;/code&gt;. It gets an OID like an ordinary table, and column definitions go into &lt;code&gt;pg_attribute&lt;/code&gt; too. The only difference is that the &lt;code&gt;relkind&lt;/code&gt; column is marked 'v' (view). Looking at just this row, a view is an empty table. No file is even created to store data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The other is the registration of a SELECT substitution promise.&lt;/strong&gt; PostgreSQL adds a row to the &lt;code&gt;pg_rewrite&lt;/code&gt; catalog, and that row is the promise: "when this view is queried with SELECT, instead of returning an empty result, run the SELECT written in the view definition and return that result." The formal name of this promise is the &lt;strong&gt;ON SELECT DO INSTEAD SELECT&lt;/strong&gt; rule. Unpacked word by word:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ON SELECT&lt;/strong&gt;: which event it fires on. When a query that reads this relation with SELECT comes in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DO INSTEAD&lt;/strong&gt;: how the original action is handled. Instead of the original action (querying this empty view as is).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SELECT ...&lt;/strong&gt;: what gets run. The SELECT written in the view definition.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Putting the three parts together gives the promise: "when this view is queried with SELECT, do not return 0 rows from the empty table; run the view definition's SELECT and return that result." PostgreSQL does not keep a view as a separate data structure; it expresses the concept of a view with one empty table and this one-line promise. Mapping it to the general rule definition from the previous section: the view's automatic rule has the view itself as target, SELECT as event, no condition, and the execution of the definition SELECT as action.&lt;/p&gt;

&lt;p&gt;This automatic SELECT rule for views has two constraints. There is exactly one ON SELECT rule per view, and that rule fires unconditionally. A single line of PostgreSQL code captures this fact directly.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"RIR" stands for "Retrieve-Instead-Retrieve", that is an ON SELECT DO INSTEAD SELECT rule (which has to be unconditional and where only one rule can exist on each relation).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The abbreviation RIR shows up all over the PostgreSQL code, and its identity is in this one line. "retrieve" is the SELECT keyword from the POSTQUEL era, and its trace remains in the abbreviation. Two historical names you will often encounter in PostgreSQL material are worth a brief note.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;POSTGRES&lt;/strong&gt; (all caps): the name of the database project started at UC Berkeley in 1986. Led by Michael Stonebraker as the follow-on research project, it meant the next generation after Ingres.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;POSTQUEL&lt;/strong&gt;: the name of the query language that the POSTGRES project used. It was designed as the successor to QUEL (Ingres's query language). It is not SQL.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When SQL was adopted in 1995, the project renamed itself to PostgreSQL and POSTQUEL disappeared, but traces of that era remain in abbreviations like RIR and in some function names.&lt;/p&gt;

&lt;h2&gt;
  
  
  The rewriter works in two steps
&lt;/h2&gt;

&lt;p&gt;The rewriter has a single entry point function, but inside it two steps run in sequence.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Step 1&lt;/strong&gt;: applying the INSERT/UPDATE/DELETE rules that the user defined with &lt;code&gt;CREATE RULE&lt;/code&gt;. The audit rule we saw earlier fires at this step. One input query can fan out into zero or several queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 2&lt;/strong&gt;: applying RIR rules. That is, firing the view's automatic ON SELECT rule to expand a view reference into its definition SELECT. It is applied to each of the queries produced by Step 1.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What the separation of the two steps means is clear. &lt;strong&gt;View expansion always happens last.&lt;/strong&gt; Whether a user-defined rule fans one query into several or changes it into something else, if a view reference remains in the result, it all gets resolved in Step 2. By the time it reaches the planner, no view reference remains anywhere.&lt;/p&gt;

&lt;p&gt;Step 2 does one thing. It walks the query's range table, and when it meets a view reference, it expands the view's definition SELECT into that spot.&lt;/p&gt;

&lt;p&gt;The expansion mechanism resolves cleanly thanks to the unification of the RTE data structure we saw in 1.2.3. In the Query that the analyzer produced, the view reference spot holds an RTE of kind RTE_RELATION (that is, an RTE pointing at an ordinary table or view). After passing through rewriter Step 2, that spot becomes kind RTE_SUBQUERY (an RTE that holds a sub-Query inside it), and the view definition's SELECT sits inside it. The item at slot N of the range table just changes kind and stays in the same spot.&lt;/p&gt;

&lt;p&gt;Suppose a view &lt;code&gt;v_active_users&lt;/code&gt; is defined as &lt;code&gt;SELECT id, name FROM users WHERE deleted_at IS NULL&lt;/code&gt;. Here is what the one line &lt;code&gt;SELECT * FROM v_active_users&lt;/code&gt; looks like before and after the rewriter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;before rewriter (analyzer output)        after rewriter (planner input)

Query                                    Query
├─ commandType = CMD_SELECT              ├─ commandType = CMD_SELECT
├─ rtable                                ├─ rtable
│   └─ RangeTblEntry                     │   └─ RangeTblEntry
│       (rtekind = RTE_RELATION,         │       (rtekind = RTE_SUBQUERY,
│        relid = OID(v_active_users))    │        subquery = Query{
│                                        │             rtable = [users RTE],
│                                        │             targetList = [id, name],
│                                        │             qual = (deleted_at IS NULL)
│                                        │        })
├─ jointree → RangeTblRef(1)             ├─ jointree → RangeTblRef(1)
└─ targetList: SELECT *                  └─ targetList: SELECT *
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Slot 1 of the range table stays in place. Only the kind changed from RTE_RELATION to RTE_SUBQUERY, and the view definition sits inside it whole. Thanks to this unification, the planner, the stage after the rewriter, does not know the concept of a view at all. The Query the planner receives is just an ordinary SELECT with a sub-query expanded into a slot. This is why, when you look at EXPLAIN output, the view name disappears and base table names show up directly.&lt;/p&gt;

&lt;p&gt;A view definition can contain another view inside it. In that case the expanded sub-Query is recursively passed to the same function to resolve the inner view too. The problem is when a view references itself (either directly or through another view). Left as is, that becomes an infinite recursion of expanding and expanding again. To prevent this, PostgreSQL carries a list of the OIDs of views currently being processed, and when it meets a view already in that list, it throws the error &lt;code&gt;infinite recursion detected in rules for relation "..."&lt;/code&gt; and aborts query execution. It chose to cut things off with a clear error rather than spin forever.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2 exceptions: materialized views and EXCLUDED
&lt;/h2&gt;

&lt;p&gt;The Step 2 view expansion logic has two explicit exceptions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First, materialized views are not expanded.&lt;/strong&gt; As we saw earlier, a materialized view reads its stored result directly, so there is no need to expand its definition on every query. Step 2 passes an RTE with relkind 'm' straight through.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second, the virtual name EXCLUDED in &lt;code&gt;INSERT ... ON CONFLICT&lt;/code&gt; is not expanded.&lt;/strong&gt; To understand EXCLUDED, we need to briefly touch PostgreSQL's UPSERT syntax.&lt;/p&gt;

&lt;p&gt;PostgreSQL has the &lt;code&gt;INSERT ... ON CONFLICT (...) DO UPDATE SET ...&lt;/code&gt; syntax. Commonly called UPSERT, it means "attempt the INSERT, but if it hits a unique constraint or similar and a conflict occurs, run an UPDATE in its place instead." For example, in a table that records a user's login count, you can handle "if it is a new user, add it; if it already exists, increment the count" in one query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;login_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'alice'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;CONFLICT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;EXCLUDED&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                   &lt;span class="c1"&gt;-- the incoming value 'alice'&lt;/span&gt;
    &lt;span class="n"&gt;login_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;login_count&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;    &lt;span class="c1"&gt;-- the existing row's value + 1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a row with &lt;code&gt;id = 1&lt;/code&gt; already exists, the INSERT conflicts, and the UPDATE in the ON CONFLICT clause runs against that row. Here the UPDATE must be able to reference both rows: &lt;strong&gt;the row already in the DB&lt;/strong&gt; (referenced as &lt;code&gt;users.column&lt;/code&gt;) and &lt;strong&gt;the row that tried to come in via INSERT but was blocked&lt;/strong&gt; (referenced as &lt;code&gt;EXCLUDED.column&lt;/code&gt;). The name EXCLUDED carries the meaning "the values excluded because of the conflict." In the example above, &lt;code&gt;EXCLUDED.name&lt;/code&gt; is 'alice', and &lt;code&gt;users.login_count&lt;/code&gt; points at the existing count stored in the DB.&lt;/p&gt;

&lt;p&gt;EXCLUDED is neither a real table nor a view. It is a virtual row name that exists only inside the UPSERT syntax. Internally PostgreSQL does represent EXCLUDED as an RTE_RELATION RTE, but the view expansion logic must not wrongly fire and try to expand this RTE (there is no definition to expand). So when the rewriter meets this RTE, it skips it with a separate branch.&lt;/p&gt;

&lt;h2&gt;
  
  
  The RULE system has many pitfalls and is not recommended
&lt;/h2&gt;

&lt;p&gt;The fact that view expansion rides on the RULE system is a trace of PostgreSQL's history. The RULE system existed from the POSTGRES era, and views were implemented as one branch of that system. The idea was that if a user directly defined INSERT/UPDATE/DELETE rules on a view with &lt;code&gt;CREATE RULE&lt;/code&gt;, that view could be used as an update target.&lt;/p&gt;

&lt;p&gt;The RULE system itself is powerful but full of pitfalls. The same rule getting evaluated multiple times, cascading happening differently than intended, permission checks behaving subtly: problems come up often. In PostgreSQL 9.3, the &lt;strong&gt;automatically updatable view&lt;/strong&gt; feature was introduced to sidestep those pitfalls. When a user fires INSERT/UPDATE/DELETE against a simple view (one that, say, just picks some columns from a single base table), PostgreSQL analyzes the view definition and automatically converts it into INSERT/UPDATE/DELETE against the base table, with no rule definition needed. For complex views that cannot meet the automatically-updatable conditions, solving it with an &lt;strong&gt;INSTEAD OF trigger&lt;/strong&gt; is the standard approach today. PostgreSQL's official documentation recommends triggers over rules for INSERT/UPDATE/DELETE.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;In many cases, tasks that could be performed by rules on
INSERT/UPDATE/DELETE are better done with triggers. Triggers are
notationally a bit more complicated, but their semantics are much
simpler to understand. Rules tend to have surprising results when
the original query contains volatile functions: volatile functions
may get executed more times than expected in the process of carrying
out the rules.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The ON SELECT rule generated automatically when a view is created is something users rarely deal with directly. PostgreSQL installs it, and the rewriter expands it. The case where a user faces the RULE system is usually when they want to make a complex view updatable that automatically updatable views cannot handle, and even then an INSTEAD OF trigger is safer and less work.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In other RDBMS engines, view expansion is usually handled as a separate stage. PostgreSQL's decision was to fold view expansion into one branch of general rule application. The result of that unification is that data structures and code paths gather in one place, but the cost is that the RULE system, an old abstraction, is exposed to users as is.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What this means in practice
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;First, when a view name disappears from EXPLAIN output, the plan is not wrong.&lt;/strong&gt; Since the view reference gets expanded into its definition SELECT at the rewriter stage, the query the planner receives has no concept of a view at all. It is normal for base table names to show up in EXPLAIN output instead of the view name you wrote. The more complex the view definition, the more you need to look at the base tables in the EXPLAIN output and map them back to the view definition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second, when making a view updatable, consider an automatically updatable view or an INSTEAD OF trigger before a RULE.&lt;/strong&gt; For a simple view that meets the automatically-updatable conditions, INSERT/UPDATE/DELETE work with no separate definition. If it cannot meet the conditions, an INSTEAD OF trigger is safer than a user-defined RULE. Defining a RULE directly is the last resort.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third, when you change a view definition with ALTER, every query that calls that view gets expanded with the new definition at its next analysis.&lt;/strong&gt; A view definition is stored as a SELECT rule in the catalog, and view expansion has the rewriter re-read that catalog entry at execution time and expand it. When the catalog changes, an invalidation message propagates to the plan cache, and on the next execution the result is planned with a fresh expansion using the new definition (see 1.2.3 plan cache). From the application side, this means you can use a view as a code abstraction. Just changing the view definition consistently updates the behavior of every query that uses that view from the next call onward.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>internals</category>
      <category>views</category>
    </item>
    <item>
      <title>1.3 Rewriter: How a Query Is Rewritten</title>
      <dc:creator>JoongHyuk Shin</dc:creator>
      <pubDate>Thu, 14 May 2026 10:49:28 +0000</pubDate>
      <link>https://dev.to/joonghyukshin/13-rewriter-how-a-query-is-rewritten-14im</link>
      <guid>https://dev.to/joonghyukshin/13-rewriter-how-a-query-is-rewritten-14im</guid>
      <description>&lt;p&gt;By the time the analysis stage from 1.2 finishes, the SQL has become a Query tree. Catalog coordinates are baked in, locks (as we saw in 1.2.2) are already held from that point, and the type of every expression is determined. The starting point of this chapter is one fact: &lt;strong&gt;that Query tree does not go straight to the planner.&lt;/strong&gt; There is one more stage in between, and that stage is the rewriter.&lt;/p&gt;

&lt;p&gt;The rewriter's input and output are the same. It takes a Query tree and produces a Query tree. The number of trees that come out can be zero, one, or several. That means a single user SQL can fan out into multiple statements as it passes through the rewriter. The key point is that the input format and the output format are identical. The planner only receives Query trees that the rewriter passed through, and it does not distinguish whether a tree is exactly what the user wrote or the result of a transformation.&lt;/p&gt;

&lt;p&gt;Why does this stage exist on its own? Analysis is the work of connecting the meaning of SQL to the catalog, and planning is the work of deciding how to execute that meaning. The transformation that sits between them, the one that &lt;strong&gt;does not change what the query means but does change the form of the query itself&lt;/strong&gt;, is the rewriter's responsibility. Expanding a view reference into its underlying SELECT, injecting the WHERE condition of a row-level security (RLS) policy into the query, filling in defaults for columns omitted from an INSERT: all of this happens here.&lt;/p&gt;

&lt;p&gt;1.3 splits into three sections.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1.3.1 Rule system and view expansion&lt;/strong&gt;: the mechanism by which a view reference gets expanded in place into a sub-query. This expansion rides on the RULE system that PostgreSQL has carried since its early days, and we cover how that fact surfaces in application code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1.3.2 How RLS rewrites queries&lt;/strong&gt;: when a row-level security policy is defined in the catalog, how the rewriter weaves that policy into the query as a WHERE condition.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1.3.3 INSERT/UPDATE/DELETE rewriting&lt;/strong&gt;: filling in default values, handling RETURNING, and the flow that turns an automatically updatable view into INSERT/UPDATE/DELETE against the view's base table.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By the end of this chapter, you should have a clear picture of where and how "the SQL I wrote" and "the SQL the planner saw" diverge. The three places they diverge are view expansion, RLS policy injection, and INSERT/UPDATE/DELETE rewriting. When EXPLAIN output or permission behavior does not match your intuition, recalling these three places is the start of an accurate trace.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>internals</category>
      <category>rewriter</category>
    </item>
    <item>
      <title>1.2.3 Query Tree Node Types: Query, RangeTblEntry, TargetEntry</title>
      <dc:creator>JoongHyuk Shin</dc:creator>
      <pubDate>Sun, 10 May 2026 08:05:36 +0000</pubDate>
      <link>https://dev.to/joonghyukshin/123-query-tree-node-types-query-rangetblentry-targetentry-3c85</link>
      <guid>https://dev.to/joonghyukshin/123-query-tree-node-types-query-rangetblentry-targetentry-3c85</guid>
      <description>&lt;p&gt;After the Analyze stage from 1.2.2, the raw parse tree has become a Query tree. This section is about the shape of that result. How it differs from a raw parse tree, what abstraction sits at its core, and what assumptions the next stages (rewriter and planner) get to start from when they receive a Query tree.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tokens descend into catalog coordinates
&lt;/h2&gt;

&lt;p&gt;A raw parse tree and a Query tree are both trees, but they sit on different planes. A raw parse tree mirrors the syntactic structure of SQL text, so its nodes look a lot like tokens. Where a table name appeared, you find a node carrying that name as a token. Where a column appeared, you find another name token. The types of expressions are not yet determined.&lt;/p&gt;

&lt;p&gt;A Query tree is what those tokens look like after they have been resolved against the catalog. The same positions hold different representations now. Where a table name was, there is now an RTE carrying that table's OID and lock mode. Where a column name was, there is now an index pair: "the M-th attribute of the N-th item in rtable". Where an operator token sat, there is now an OID pointing to the function that implements that operator. The same shape of tree carries the same kinds of information at the same positions, but the information has dropped one level down from surface SQL text into catalog coordinates.&lt;/p&gt;

&lt;p&gt;For &lt;code&gt;SELECT id, name FROM users WHERE age &amp;gt; 18&lt;/code&gt;, the contrast looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;raw parse tree (sketch)          Query tree (sketch)

SelectStmt                       Query
├─ targetList                    ├─ commandType = CMD_SELECT
│   ├─ ColumnRef "id"            ├─ rtable
│   └─ ColumnRef "name"          │   └─ RangeTblEntry
├─ fromClause                    │       (relid = 16384, RTE_RELATION)
│   └─ RangeVar "users"          ├─ jointree (FROM + WHERE)
└─ whereClause                   ├─ targetList
    └─ A_Expr                    │   ├─ TargetEntry(resno=1, expr=Var(rt=1, att=1))
        ('&amp;gt;', ColumnRef "age",   │   └─ TargetEntry(resno=2, expr=Var(rt=1, att=2))
              A_Const 18)        └─ qual: OpExpr(opno=...,
                                            args=[Var(rt=1, att=3), Const(int4=18)])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The descent into catalog coordinates is right there in the picture. Table names become OIDs, column names become pairs of (rtable index, attribute number), operator tokens become typed expression nodes. The rewriter and planner work on those coordinates and have nothing left to resolve.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;PG's decision to fold every kind of FROM item into a single enum looks unremarkable at first, but its effect compounds. If each kind of input were a different node type, the rewriter and planner would scatter kind-specific handling across every branch they touch. With one node type and an enum tag, transformations that cut across many places (think "scan every RTE and check locks", "inject an RLS policy on every RTE") fit into one place consistently.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Range table: every FROM input in one place
&lt;/h2&gt;

&lt;p&gt;The Query carries a single List that holds every item appearing in the FROM clause. PG calls this list the range table, and an item in it a RangeTblEntry. A plain table, a subquery (&lt;code&gt;FROM (SELECT ...)&lt;/code&gt;), a result of an explicit JOIN, a set-returning function call (&lt;code&gt;FROM generate_series(...)&lt;/code&gt;), an inline row set built with &lt;code&gt;VALUES (...)&lt;/code&gt;, a CTE (&lt;code&gt;WITH ... AS (...)&lt;/code&gt;). Whatever form a FROM input takes, it gets folded into a single entry in this list.&lt;/p&gt;

&lt;p&gt;The kind tag lives in one field, &lt;code&gt;rtekind&lt;/code&gt;, on the RTE. There is no separate node type per kind of input; the same &lt;code&gt;RangeTblEntry&lt;/code&gt; object carries different &lt;code&gt;rtekind&lt;/code&gt; values to distinguish them. The object type is one, and the value of &lt;code&gt;rtekind&lt;/code&gt; decides which auxiliary fields are meaningful. For a plain table the OID and lock mode fields matter; for a subquery the sub-Query field matters; and so on.&lt;/p&gt;

&lt;p&gt;You can see the payoff of this consolidation in the shape of the rewriter and planner code. Both stages, having received a &lt;code&gt;Query&lt;/code&gt;, work to a single model: "take the N items in rtable, glue them together via jointree, and evaluate targetList against the result." A subquery, a JOIN, a function call all flow through the same model. Adding a new kind of input is a matter of adding one line to RTEKind and filling in a few branches.&lt;/p&gt;

&lt;p&gt;The range table plays another role. Column references look at it. In a Query tree, a column is not a name; it is two integers, "attribute M of rtable item N". A column reference's meaning is bound to the order of items in the range table. That is why a rewriter or planner that adds, removes, or reorders rtable items must update the indices in existing column references at the same time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Target list: SELECT, INSERT, UPDATE, RETURNING all in one shape
&lt;/h2&gt;

&lt;p&gt;The other key List on a Query is the target list. The output columns of a SELECT, the values to assign in INSERT/UPDATE, and the columns returned by RETURNING all live in lists of the same shape with the same kind of node. One target list per Query for the SELECT/INSERT/UPDATE body, plus a second one for RETURNING when present.&lt;/p&gt;

&lt;p&gt;The shape is one, but its meaning shifts slightly with context. Each entry has a field called resno (short for result number, indicating where the entry lands in the result), and the interpretation of this number depends on whether you are looking at SELECT, INSERT, or UPDATE. Take this table to compare the two cases.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;    &lt;span class="nb"&gt;serial&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;-- attno 1&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;  &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                 &lt;span class="c1"&gt;-- attno 2&lt;/span&gt;
    &lt;span class="n"&gt;age&lt;/span&gt;   &lt;span class="nb"&gt;int&lt;/span&gt;                   &lt;span class="c1"&gt;-- attno 3&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For SELECT, resno is the ordinal position of the output column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This SELECT's target list has two entries. The first, holding the &lt;code&gt;name&lt;/code&gt; expression, has resno 1; the second, holding &lt;code&gt;age&lt;/code&gt;, has resno 2. They land in the first and second positions of the result set, respectively. Even though &lt;code&gt;name&lt;/code&gt; has attno 2 and &lt;code&gt;age&lt;/code&gt; has attno 3 in the table, the result positions in SELECT are independent of those attnos: they go 1, 2, in the order they appear in the SELECT list.&lt;/p&gt;

&lt;p&gt;For INSERT, resno is the attribute number of the destination column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'alice'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This INSERT's target list also has two entries, but the first's resno is not 1, the second's not 2. They are 2 and 3, the attnos of &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;age&lt;/code&gt; in the table definition. The number that lands in resno is not the order in which the columns were written, but the position of the column in the table itself. UPDATE's SET clause follows the same rule: attno in resno.&lt;/p&gt;

&lt;p&gt;The fact that resno carries different meanings in different contexts feels odd at first, but folding two meanings into one List buys clear leverage. Expression evaluation code, target list traversal, RETURNING handling all assume one node type. The branching shrinks. The context-dependent meaning is settled when the List is built, and downstream stages just read what is already there.&lt;/p&gt;

&lt;p&gt;The target list also carries items invisible to the user. When an expression appears in ORDER BY or GROUP BY but not in the SELECT list, that expression is added to the target list with a "junk" mark. Sorting and grouping need a place to evaluate the value, so the value rides along in the same List. Junk columns get evaluated but stripped from the final output. The visible result is clean from the application side, while internally a few extra columns flow alongside to make sorting work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the Query goes
&lt;/h2&gt;

&lt;p&gt;Right after a Query tree is built, the rewriter takes it. RULE expansion, view expansion, and row-level security (RLS) policy injection all happen on the Query tree at this stage. The output is still a Query tree (or a list of zero or more, when a rule expands one statement into several). Details are covered in chapter 1.3.&lt;/p&gt;

&lt;p&gt;The rewriter hands the Query off to the planner. The planner's job is to convert a Query tree into a PlannedStmt: the strategy for how this SQL will be executed. Which indexes to use, what JOIN order to take, whether to go parallel. Chapter 1.4 covers this in depth.&lt;/p&gt;

&lt;p&gt;What is interesting is that the next stage, the executor, does not look at the Query tree. The Query definition in &lt;code&gt;parsenodes.h&lt;/code&gt; says it in one line: "the Query structure is not used by the executor." The executor's input is a PlannedStmt, and the Query tree, once turned into a plan, does not flow further down. The Query tree is the common currency of parser/analyzer/rewriter/planner, and that is where its lifetime ends.&lt;/p&gt;

&lt;p&gt;This separation of stages is reflected directly in how prepared statements work. A prepared statement is a mechanism to avoid going through the parser every time the same SQL runs repeatedly, by holding the analyzed result or plan in memory. There are two layers of holding. One layer holds the Query tree (already resolved against the catalog) in memory; the layer above holds the PlannedStmt that has gone through the planner. What is commonly called the plan cache refers to these two layers together. A held Query tree lets the next execution start from the planner; a held PlannedStmt is handed straight to the executor. If the catalog changes in between, an invalidation message arrives, the held results are discarded, and analysis or planning runs again. Because the per-stage outputs of parser, analyzer, rewriter, planner, and executor are cleanly separated, deciding which layer to invalidate and where to restart is straightforward.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means in practice
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;First, EXPLAIN works in the same shape for SELECT, INSERT, UPDATE, DELETE, and MERGE.&lt;/strong&gt; Five kinds of statement live in the same Query data structure, and the planner produces a single PlannedStmt regardless of kind. From the application side, you don't have to wonder whether you need a different tool to inspect an INSERT plan than a SELECT plan. From the tooling side, monitoring tools can deal with plans through one interface without branching on statement kind.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second, INSERT/UPDATE remains safe even when only some columns are written.&lt;/strong&gt; The resno in the target list carries the attno of the destination column rather than the order in which the user wrote it. The columns the user wrote land in their attno positions; missing columns are filled in with defaults later by the rewriter, which works on the same target list (covered in 1.3.3). This is the structural reason ORM-generated INSERT/UPDATE statements work correctly even when columns are out of order or only partially specified.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third, the same SQL can produce different Query trees depending on when it was analyzed.&lt;/strong&gt; Because catalog coordinates get baked into the Query tree, the same SQL text seen at two different times against two different catalog states produces two different trees. A prepared statement that produces different plans across two connections, a plan cache that re-analyzes after a catalog invalidation, the same code pointing at different tables under different &lt;code&gt;search_path&lt;/code&gt; settings: all of these emerge on top of this fact. From the application side, this is a single-line summary of where the intuition "same text, same behavior" breaks.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>internals</category>
      <category>queryplanner</category>
    </item>
    <item>
      <title>1.2.2 Semantic Analysis: Name Resolution, Type Checking, Catalog Lookup</title>
      <dc:creator>JoongHyuk Shin</dc:creator>
      <pubDate>Thu, 07 May 2026 10:57:23 +0000</pubDate>
      <link>https://dev.to/joonghyukshin/122-semantic-analysis-name-resolution-type-checking-catalog-lookup-2699</link>
      <guid>https://dev.to/joonghyukshin/122-semantic-analysis-name-resolution-type-checking-catalog-lookup-2699</guid>
      <description>&lt;p&gt;A raw parse tree captures the structure of SQL text, but not what that structure refers to. Whether the token &lt;code&gt;users&lt;/code&gt; is a table in some schema, whether the two sides of &lt;code&gt;=&lt;/code&gt; have compatible types, whether &lt;code&gt;count&lt;/code&gt; is a function or a column. Filling in those answers and turning a raw parse tree into a meaningful query tree is the job of Semantic analysis. PG code commonly shortens it to "Analyze". The shape of the resulting node structure is covered in 1.2.3.&lt;/p&gt;

&lt;h2&gt;
  
  
  Parser and Analyzer
&lt;/h2&gt;

&lt;p&gt;A note on terminology first. In PG code, "parser" is used in two scopes. Narrowly, it refers to the lexer and grammar that turn SQL text into a raw parse tree. Broadly, it includes the next stage, Analyze, as well. Both stages live under &lt;code&gt;src/backend/parser/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This chapter splits them. The raw parser takes SQL text and produces a raw parse tree (1.2.1). The Analyzer takes a raw parse tree and produces a Query tree by consulting the catalog (this section). The two stages differ in their output shape (raw parse tree → Query tree) and in whether they touch the catalog.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the raw parse tree leaves out
&lt;/h2&gt;

&lt;p&gt;The raw parser checks only syntax. The fact that &lt;code&gt;SELECT a FROM t WHERE a = 1&lt;/code&gt; is grammatically well-formed is the limit of what a raw parse tree asserts. Whether &lt;code&gt;t&lt;/code&gt; actually exists, whether &lt;code&gt;a&lt;/code&gt; is a column of that table, whether &lt;code&gt;1&lt;/code&gt; can be coerced to the type of &lt;code&gt;a&lt;/code&gt;: the raw parse tree knows none of this.&lt;/p&gt;

&lt;p&gt;Why split into two stages? A few reasons.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Separation of concerns&lt;/strong&gt;. Mixing syntax checks with semantic checks puts too much responsibility into one stage. Splitting keeps each stage's code simple.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Raw parse must work in an aborted transaction&lt;/strong&gt;. When a transaction is in an aborted state, the catalog cannot be touched (covered at the end of 1.2.1). But commands like &lt;code&gt;COMMIT&lt;/code&gt;/&lt;code&gt;ROLLBACK&lt;/code&gt; still need to be recognized to escape that state. The split lets raw parse identify the command type without catalog access.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reuse of raw parse output&lt;/strong&gt;. A single raw parse tree can be Analyzed multiple times in different contexts (different &lt;code&gt;search_path&lt;/code&gt;, different parameter types). Prepared statements rely on this reuse pattern.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The result is a clean two-stage split. Raw parse is independent of the catalog and only checks syntax. Analyze then aggressively consults the catalog and fills in the blanks. Analyze does four things. It looks up every name in the raw parse tree against the catalog and replaces them with OIDs (object identifiers, the 32-bit integer IDs assigned to every object in the catalog). It determines the type of every expression. It inserts cast nodes wherever types do not match. It packages the result into a new tree called &lt;code&gt;Query&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Name resolution: what does each name in SQL refer to?
&lt;/h2&gt;

&lt;p&gt;Name resolution is the task of figuring out which catalog object each table, column, or function name in the SQL is pointing at.&lt;/p&gt;

&lt;p&gt;Start with table names. As Analyze reads down the FROM clause, it encounters items like &lt;code&gt;FROM users&lt;/code&gt;, looks up the &lt;code&gt;users&lt;/code&gt; table in the catalog, and obtains its OID. It then registers it in an internal list as "the Nth table appearing in this SQL's FROM". PG calls this list the range table. (Why "range"? Covered in 1.2.3.)&lt;/p&gt;

&lt;p&gt;A range table holds more than just plain tables. The FROM clause can contain subqueries (&lt;code&gt;FROM (SELECT ...)&lt;/code&gt;), JOIN results (&lt;code&gt;FROM t1 JOIN t2 ON ...&lt;/code&gt;), and function calls (&lt;code&gt;FROM generate_series(1, 10)&lt;/code&gt;). All of these are equivalent in being "an entry that produces rows", so each gets one slot in the range table. The differences between kinds are covered in 1.2.3.&lt;/p&gt;

&lt;p&gt;When the schema is not specified (&lt;code&gt;FROM users&lt;/code&gt; versus &lt;code&gt;FROM public.users&lt;/code&gt;), PG walks the connection's &lt;code&gt;search_path&lt;/code&gt; in order and picks the first schema that contains a matching table. This decision is locked in at Analyze time. So the same SQL text can refer to different tables across connections with different &lt;code&gt;search_path&lt;/code&gt; settings.&lt;/p&gt;

&lt;p&gt;Column resolution comes next. Column names appearing in WHERE, SELECT, ORDER BY, and similar positions are matched against the range table that was just built, deciding "which column of which table this name refers to". The familiar &lt;code&gt;column reference "a" is ambiguous&lt;/code&gt; error from &lt;code&gt;SELECT a FROM t1 JOIN t2 ON ...&lt;/code&gt; (when &lt;code&gt;a&lt;/code&gt; exists in both &lt;code&gt;t1&lt;/code&gt; and &lt;code&gt;t2&lt;/code&gt;) is raised at this stage.&lt;/p&gt;

&lt;p&gt;Column references can have between one and four dot-separated parts. &lt;code&gt;a&lt;/code&gt; (column only), &lt;code&gt;t.a&lt;/code&gt; (table.column), &lt;code&gt;s.t.a&lt;/code&gt; (schema.table.column), &lt;code&gt;db.s.t.a&lt;/code&gt; (database.schema.table.column). Depending on how many dots are present, different parts are interpreted differently, and the priority rules for that interpretation are baked into this stage.&lt;/p&gt;

&lt;p&gt;Function calls are slightly more involved. PG allows function overloading, so multiple versions of the same function name can be registered with different argument types. For example, &lt;code&gt;length()&lt;/code&gt; has separate versions accepting &lt;code&gt;text&lt;/code&gt; and &lt;code&gt;bytea&lt;/code&gt;, and the right one is chosen based on the argument types at the call site. Analyze gathers all candidates with the same name and picks the best match by argument-type matching. If the match is ambiguous, an ambiguous error is raised. If there is no match, &lt;code&gt;function does not exist&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Type checking and cast insertion
&lt;/h2&gt;

&lt;p&gt;In an expression tree, leaf nodes are one of three things. Constants, column references, or parameters. The constant &lt;code&gt;1&lt;/code&gt; has type integer, the column &lt;code&gt;name&lt;/code&gt; has type text, the parameter &lt;code&gt;$1&lt;/code&gt; has whatever type the caller declared. Analyze walks the expression tree from leaf to root, determining the type of every node along the way. The result type of &lt;code&gt;a + b&lt;/code&gt; is determined only after the types of &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt; are settled, and the type of &lt;code&gt;(a + b) * 2&lt;/code&gt; is determined from that result.&lt;/p&gt;

&lt;p&gt;When the two sides of an operation differ in type, an automatic cast may be inserted. In &lt;code&gt;WHERE a = '5'&lt;/code&gt; where &lt;code&gt;a&lt;/code&gt; is integer and &lt;code&gt;'5'&lt;/code&gt; is text, PG inserts a cast node converting &lt;code&gt;'5'&lt;/code&gt; to integer so the comparison is well-typed. The authoritative answer for which type-to-type conversions are possible lives in the &lt;code&gt;pg_cast&lt;/code&gt; catalog. It records the source type, target type, conversion method, and which context the cast is automatically applied in. A few representative rows look like this.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;source → target&lt;/th&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;int4&lt;/code&gt; → &lt;code&gt;int8&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Function call (&lt;code&gt;int8(int4)&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;implicit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;int8&lt;/code&gt; → &lt;code&gt;int4&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Function call (&lt;code&gt;int4(int8)&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;assignment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;int4&lt;/code&gt; → &lt;code&gt;oid&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Binary passthrough&lt;/td&gt;
&lt;td&gt;implicit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;json&lt;/code&gt; → &lt;code&gt;jsonb&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Text I/O detour&lt;/td&gt;
&lt;td&gt;assignment&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two of the columns capture the heart of how PG does casts. The method is how the source value is turned into the target type, and the context is where that cast is automatically applied.&lt;/p&gt;

&lt;p&gt;The method falls into one of three kinds.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No conversion needed because the memory representation is the same&lt;/strong&gt;. The &lt;code&gt;int4 → oid&lt;/code&gt; row is this case. Both are 32-bit integers, so the binary representation passes through. The conversion cost is essentially zero.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A cast function is called&lt;/strong&gt;. The &lt;code&gt;int4 → int8&lt;/code&gt; row is this case, with a registered conversion function &lt;code&gt;int8(int4)&lt;/code&gt; that gets called. The cost is one function call.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output as text and read it back&lt;/strong&gt;. Every type in PG has an output function that serializes its value to a string and an input function that parses a string back into a value of that type. To bridge two types that have no direct cast function, PG runs the source type's output function to produce a string, then feeds that string through the target type's input function. The &lt;code&gt;json → jsonb&lt;/code&gt; row works this way: a &lt;code&gt;json&lt;/code&gt; value is first emitted as a textual form like &lt;code&gt;{"a":1}&lt;/code&gt;, then that text is parsed by the &lt;code&gt;jsonb&lt;/code&gt; input function into the binary JSON representation. This route is inefficient and not used when a direct cast function can be defined, but it serves as a fallback for user-defined types or for bridging two types whose representations differ significantly.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If no method is registered, PG raises &lt;code&gt;cannot cast type X to type Y&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The same cast does not apply automatically in every position. PG splits cast contexts into three groups.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;implicit&lt;/strong&gt;. PG inserts the cast on its own, without a request from the user. Applied in positions like comparisons and arithmetic. The &lt;code&gt;int4 → int8&lt;/code&gt; row is implicit, so &lt;code&gt;WHERE bigint_col = int_col&lt;/code&gt; automatically promotes &lt;code&gt;int4&lt;/code&gt; to &lt;code&gt;int8&lt;/code&gt; for the comparison even though the two sides differ in type.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;assignment&lt;/strong&gt;. Applies when a value is being put into a column. Operates in INSERT, the SET of UPDATE, COPY, and similar positions. Some conversions that are not registered as implicit are registered as assignment. The &lt;code&gt;int8 → int4&lt;/code&gt; row is one such example. Narrowing a wide integer to a narrower one risks precision loss, so applying it automatically in a comparison would be dangerous, but in a position where a value is being placed into a column of a particular type, accepting that loss is closer to the user's intent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;explicit&lt;/strong&gt;. The user directly requests the cast with &lt;code&gt;CAST(col AS integer)&lt;/code&gt; or &lt;code&gt;col::integer&lt;/code&gt;. PG treats this as an unambiguous statement of intent, so even conversions not registered as implicit or assignment are allowed here.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The three contexts form a relationship in which the set of cast pairs PG applies automatically grows wider. A cast registered as implicit is also automatic in assignment positions, and of course in explicit positions too. Conversely, a cast registered only at the explicit level does not happen automatically in implicit or assignment positions; the user has to spell it out. In set notation, implicit ⊂ assignment ⊂ explicit in terms of which pairs PG converts on your behalf.&lt;/p&gt;

&lt;p&gt;This split keeps PG away from a common pitfall in other RDBMSs where implicit conversions happen too eagerly and produce surprising results. PG deliberately keeps implicit casts narrow. Before 8.3 there was an implicit cast between text and integer, but it caused too many unintended comparisons and was removed in later versions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Catalog lookups and locks
&lt;/h2&gt;

&lt;p&gt;Every decision above involves catalog lookups. An OID lookup per FROM table, an attribute lookup per column, a &lt;code&gt;pg_type&lt;/code&gt; lookup per type in the expression, a &lt;code&gt;pg_proc&lt;/code&gt; lookup per function call. Analyzing one SQL statement can trigger dozens of catalog lookups.&lt;/p&gt;

&lt;p&gt;To keep the cost from blowing up, PG caches lookup results in memory (the syscache, a system catalog cache). A second lookup with the same key in the same connection hits memory. When another connection mutates the catalog with ALTER TABLE or similar, an invalidation message is delivered to the syscache and just the affected entry is dropped.&lt;/p&gt;

&lt;p&gt;The bigger user-visible effect is not the lookup itself but its side effect. When Analyze finds a FROM table in the catalog, it also takes a lock on that table. A plain FROM in a SELECT acquires &lt;code&gt;AccessShareLock&lt;/code&gt; (the read lock). Tables locked by &lt;code&gt;FOR UPDATE/SHARE&lt;/code&gt; get &lt;code&gt;RowShareLock&lt;/code&gt;. The target of an INSERT/UPDATE/DELETE gets &lt;code&gt;RowExclusiveLock&lt;/code&gt;. By the time Analyze finishes, every table touched by the statement is locked. The exact semantics of each lock mode and the conflict matrix between them are covered in 3.5.1 (Heavyweight locks).&lt;/p&gt;

&lt;p&gt;What is interesting is that locks are taken at the Analyze stage. Other designs are possible. Some other RDBMSs delay lock acquisition until plan or execute time. PG chose to take the lock during the catalog visit instead. The effect is clear: once parse finishes, the schema is guaranteed not to change for the rest of plan and execute, on a per-statement basis. The race in which the catalog mutates under ALTER TABLE while planning is in progress, invalidating the plan, is shut out before it can happen.&lt;/p&gt;

&lt;h2&gt;
  
  
  After Analyze finishes
&lt;/h2&gt;

&lt;p&gt;Once Analyze is done, PG does two more things.&lt;/p&gt;

&lt;p&gt;The first is computing a query identifier. When monitoring tools track slow-running queries, calls like &lt;code&gt;WHERE id = 1&lt;/code&gt; and &lt;code&gt;WHERE id = 2&lt;/code&gt; should be grouped together as instances of the same pattern (&lt;code&gt;WHERE id = ?&lt;/code&gt;). Otherwise, a statistic like "this query pattern averages X ms" loses its meaning.&lt;/p&gt;

&lt;p&gt;The catch is in what key the statistics table groups by. The key here is the data-structure key, the same kind of key as in a hash table where matching keys land in the same slot and accumulate. Using the SQL text directly as a key is no good. &lt;code&gt;WHERE id = 1&lt;/code&gt; and &lt;code&gt;WHERE id = 2&lt;/code&gt; are different texts and would land in separate slots, scattering what should be one row of statistics across many. What we want is one row per pattern, not one row per call.&lt;/p&gt;

&lt;p&gt;So at the end of Analyze, PG normalizes the Query tree by stripping out the variable parts (constants and the like, with the same effect as replacing them with &lt;code&gt;?&lt;/code&gt;) and then hashes the remainder into a 64-bit identifier. Same pattern produces the same identifier; same identifier accumulates into the same statistics slot. This process is called query jumble, and the resulting identifier is the key used by monitoring tools to group queries of the same kind.&lt;/p&gt;

&lt;p&gt;The second thing is a hook where extensions plug in. PG places a post-analyze hook at the very end of Analyze, so any extension registered there has its code invoked right after each query is analyzed. Statistics-gathering extensions like &lt;code&gt;pg_stat_statements&lt;/code&gt; register on this hook to collect each query's identifier and execution time.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means in practice
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;First, the same SQL text can refer to different tables depending on the connection's &lt;code&gt;search_path&lt;/code&gt;.&lt;/strong&gt; This is a direct consequence of table name resolution being locked in at Analyze time and following &lt;code&gt;search_path&lt;/code&gt;. In container or deployment environments where the default &lt;code&gt;search_path&lt;/code&gt; happens to differ, tests pass while production reports "table not found". This is one of the first things to check when debugging such an error.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second, throwing a SELECT inside a transaction immediately locks every table the statement touches, and those locks are held until the transaction ends.&lt;/strong&gt; Lock acquisition being baked into the Analyze stage shows up directly in how applications behave. Patterns like &lt;code&gt;BEGIN; SELECT ...; (some other work)&lt;/code&gt;, where a transaction is opened early, hold &lt;code&gt;AccessShareLock&lt;/code&gt; on the SELECT's tables for the entire transaction's duration, which can block ALTER TABLE in other sessions. This is one of the reasons long-running transactions are operationally risky (other reasons are covered in chapter 3 on MVCC). Conversely, from the application's point of view, the moment a SELECT is issued, the schema becomes protected against change for the remainder of that statement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third, the unit of "query" that monitoring tools group on is not the SQL text but the normalized pattern of the analyzed result.&lt;/strong&gt; When &lt;code&gt;pg_stat_statements&lt;/code&gt; collects slow queries, the key it uses to lump same-shaped queries onto one row is the query identifier computed at the end of analysis. Statements that differ only in constant values but share the same structure accumulate into the same row. So calling &lt;code&gt;WHERE id = 1&lt;/code&gt; a thousand times and calling &lt;code&gt;WHERE id = 1&lt;/code&gt;, &lt;code&gt;WHERE id = 2&lt;/code&gt;, ..., &lt;code&gt;WHERE id = 1000&lt;/code&gt; once each look the same to the monitoring tool: a single row.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>internals</category>
      <category>sql</category>
    </item>
    <item>
      <title>1.2.1 From SQL Text to Raw Parse Tree</title>
      <dc:creator>JoongHyuk Shin</dc:creator>
      <pubDate>Wed, 06 May 2026 02:06:12 +0000</pubDate>
      <link>https://dev.to/joonghyukshin/121-from-sql-text-to-raw-parse-tree-342c</link>
      <guid>https://dev.to/joonghyukshin/121-from-sql-text-to-raw-parse-tree-342c</guid>
      <description>&lt;p&gt;A line like &lt;code&gt;SELECT name FROM users WHERE id = 1&lt;/code&gt; arrives at the backend. As we saw in 1.1.1, the backend is a child process forked by the postmaster when the client connects, dedicated to that one client. What this process first holds in its hands is just a byte array. It does not yet know where the keywords end, where the identifiers begin, or where the integer constant lives. Turning that byte array into a tree structure is the front half of the second of the five stages, namely raw parsing. This section covers only that front half. The back half (parse analysis), which consults the catalog to attach meaning, belongs to 1.2.2. (The catalog, in case you need a refresher, is the set of internal tables PostgreSQL maintains to describe itself: which tables have which columns, which functions take which argument types, and so on, all stored as rows in these tables.)&lt;/p&gt;

&lt;p&gt;The output of raw parsing is a single &lt;code&gt;RawStmt&lt;/code&gt; node per SQL string (or a List, if multiple statements are joined with &lt;code&gt;;&lt;/code&gt;). This RawStmt wraps a raw node like &lt;code&gt;SelectStmt&lt;/code&gt; for SELECT, &lt;code&gt;InsertStmt&lt;/code&gt; for INSERT, and so on. The name "raw" means it has not seen the catalog. Whether &lt;code&gt;users&lt;/code&gt; is actually an existing table, which column &lt;code&gt;id&lt;/code&gt; refers to, none of that is known yet. All that has been captured is the grammatical structure: a SELECT keyword followed by a column list, a FROM followed by an identifier, a WHERE followed by a comparison expression.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two tools dividing the work: flex and Bison
&lt;/h2&gt;

&lt;p&gt;PostgreSQL's raw parser is a collaboration of two tools. One side is the &lt;strong&gt;lexer&lt;/strong&gt; (lexical analyzer), the other is the &lt;strong&gt;grammar&lt;/strong&gt; (syntactic analyzer). The lexer slices the byte array into tokens. The grammar takes that token sequence and groups it into a tree.&lt;/p&gt;

&lt;p&gt;PostgreSQL does not write these two by hand. It uses standard generator tools brought in from outside. The lexer is &lt;strong&gt;flex&lt;/strong&gt;, the grammar is &lt;strong&gt;Bison&lt;/strong&gt;. Both are code generators. The developer writes rules, and at build time the tool reads those rules and emits actual working lexer/parser C code.&lt;/p&gt;

&lt;p&gt;For flex, a rule means "if this regex pattern comes in, emit this token." For example, "if a letter is followed by alphanumerics, emit an IDENT (identifier) token", "if only digits appear, emit an ICONST (integer constant) token." PostgreSQL keeps these rules in &lt;code&gt;scan.l&lt;/code&gt; (about 1,500 lines on the current PostgreSQL 18). At build time flex reads this file and generates the C function that does the tokenizing.&lt;/p&gt;

&lt;p&gt;For Bison, a rule means "if this token sequence appears, build this tree node." For example, "if SELECT is followed by a column list, then FROM and a table identifier, build a SelectStmt node." PostgreSQL keeps these grammar rules in &lt;code&gt;gram.y&lt;/code&gt; (about 20,000 lines on the current PostgreSQL 18). At build time Bison reads this file and generates the C parser function that groups tokens into a tree.&lt;/p&gt;

&lt;p&gt;The driver that ties these two together is the &lt;code&gt;raw_parser()&lt;/code&gt; function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;List&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="nf"&gt;raw_parser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RawParseMode&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;core_yyscan_t&lt;/span&gt; &lt;span class="n"&gt;yyscanner&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;base_yy_extra_type&lt;/span&gt; &lt;span class="n"&gt;yyextra&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt;           &lt;span class="n"&gt;yyresult&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="n"&gt;scanner_init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...);&lt;/span&gt;     &lt;span class="cm"&gt;/* flex init */&lt;/span&gt;
    &lt;span class="n"&gt;parser_init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;yyextra&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;      &lt;span class="cm"&gt;/* Bison state init */&lt;/span&gt;
    &lt;span class="n"&gt;yyresult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base_yyparse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yyscanner&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;   &lt;span class="cm"&gt;/* one cycle */&lt;/span&gt;
    &lt;span class="n"&gt;scanner_finish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yyscanner&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yyresult&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;NIL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;             &lt;span class="cm"&gt;/* syntax error */&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;yyextra&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parsetree&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="cm"&gt;/* List of RawStmt */&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;base_yyparse()&lt;/code&gt; is the actual parser function Bison generated. Inside it, whenever a token is needed, it calls the lexer and groups tokens into a tree according to grammar rules. The lexer does not run on its own. It is pulled along by the grammar.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the lexer (scan.l) sees
&lt;/h2&gt;

&lt;p&gt;What the lexer does is simple. It scans the byte array from the start, finds the longest pattern that matches one of the regex rules, and hands the corresponding token type and value to the grammar. What does "longest matching pattern" mean? When several rules can match starting at the same position, the lexer picks the one with the longer match. For example, if the input contains &lt;code&gt;&amp;gt;=&lt;/code&gt;, the lexer does not slice off &lt;code&gt;&amp;gt;&lt;/code&gt; alone. It groups &lt;code&gt;&amp;gt;=&lt;/code&gt; into a single token (such as &lt;code&gt;GREATER_EQUALS&lt;/code&gt;). &lt;code&gt;&amp;gt;&lt;/code&gt; alone would also match a comparison-operator rule, but a longer-matching rule for &lt;code&gt;&amp;gt;=&lt;/code&gt; exists, so that one wins. In lex terminology this is called longest match. Almost every lexer behaves this way. So when &lt;code&gt;SELECT name FROM users WHERE id = 1&lt;/code&gt; comes in, the lexer emits tokens in this order.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;     &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;keyword&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt;
&lt;span class="n"&gt;name&lt;/span&gt;       &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;identifier&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IDENT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;       &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;keyword&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt;
&lt;span class="n"&gt;users&lt;/span&gt;      &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;identifier&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IDENT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"users"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;      &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;keyword&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt;
&lt;span class="n"&gt;id&lt;/span&gt;         &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;identifier&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IDENT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;=&lt;/span&gt;          &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="s1"&gt;'='&lt;/span&gt;
&lt;span class="mi"&gt;1&lt;/span&gt;          &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt; &lt;span class="n"&gt;constant&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ICONST&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How does the lexer tell identifiers from keywords? Every identifier candidate (a letter followed by alphanumerics) first matches the same regex rule. After that, the matched string is checked against the keyword table by binary search. If it is in the table, it becomes a keyword token. If not, IDENT.&lt;/p&gt;

&lt;p&gt;The size and categorization of that keyword table matter. On the current PostgreSQL 18 there are 494 keywords, split into four categories.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;UNRESERVED&lt;/td&gt;
&lt;td&gt;330&lt;/td&gt;
&lt;td&gt;usable as identifier (e.g. &lt;code&gt;abort&lt;/code&gt;, &lt;code&gt;aggregate&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;COL_NAME&lt;/td&gt;
&lt;td&gt;63&lt;/td&gt;
&lt;td&gt;usable as a column name but not as a function/type name&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TYPE_FUNC_NAME&lt;/td&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;usable as a type or function name&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RESERVED&lt;/td&gt;
&lt;td&gt;78&lt;/td&gt;
&lt;td&gt;never usable as identifier (e.g. &lt;code&gt;select&lt;/code&gt;, &lt;code&gt;from&lt;/code&gt;, &lt;code&gt;where&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This split is why a query like &lt;code&gt;SELECT abort FROM ...&lt;/code&gt; works in PostgreSQL: &lt;code&gt;abort&lt;/code&gt; is UNRESERVED, so it can also be used as an identifier. By contrast, &lt;code&gt;SELECT select FROM ...&lt;/code&gt; is a syntax error. RESERVED never gives way. Compatibility differences with other RDBMSes often come from this split.&lt;/p&gt;

&lt;p&gt;There is an interesting comment at the top of scan.l: "rules in this file must be kept in sync with &lt;code&gt;src/fe_utils/psqlscan.l&lt;/code&gt; and &lt;code&gt;src/interfaces/ecpg/preproc/pgc.l&lt;/code&gt;!" In other words, the same SQL lexer rules live in three places. The psql client has its own lexer, ecpg (the embedded SQL preprocessor) has its own. The reason is that each tool has slightly different lexical needs, but the practical consequence is that changing the lexer rules in PostgreSQL means synchronizing three files.&lt;/p&gt;

&lt;h2&gt;
  
  
  The tree the grammar (gram.y) builds
&lt;/h2&gt;

&lt;p&gt;As the lexer emits tokens, the grammar takes them in order, matches them against grammar rules, and assembles a tree. It builds bottom-up. Small subtrees are grouped first, those subtrees are then collected into larger nodes, and finally a single node corresponding to the whole statement is completed. This act of "taking a sequence of tokens or subtrees and reducing them into a single parent node" is called &lt;strong&gt;reduce&lt;/strong&gt; in parser terminology. For example, if the token sequence &lt;code&gt;IDENT '=' ICONST&lt;/code&gt; matches the "binary comparison expression" rule, those three are reduced into a single subtree. That subtree is then reduced as part of a WHERE clause rule. That WHERE clause is reduced as part of a SelectStmt rule. In the end, one SelectStmt node is built and gets wrapped in a RawStmt.&lt;/p&gt;

&lt;p&gt;Here is a picture of the raw parse tree that &lt;code&gt;SELECT name FROM users WHERE id = 1&lt;/code&gt; produces. The tree has the root at the top, leaves toward the bottom.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                        RawStmt
                           │
                       SelectStmt
              ┌────────────┼─────────────┐
              │            │             │
          targetList   fromClause    whereClause
              │            │             │
            IDENT       RangeVar      A_Expr (=)
            "name"      "users"      ┌──────┴──────┐
                                     │             │
                                   IDENT         ICONST
                                   "id"             1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The RawStmt at the top is the output of the raw parser. Keywords like SELECT, FROM, WHERE are only markers that tell the grammar rule which subtree to reduce into, so they do not survive as separate tree nodes. Only meaningful values (identifiers, constants) from the input tokens remain as leaves.&lt;/p&gt;

&lt;p&gt;PostgreSQL's top rule is &lt;code&gt;stmtmulti&lt;/code&gt;. It expresses that multiple SQL statements may be joined with &lt;code&gt;;&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stmtmulti:  stmtmulti ';' toplevel_stmt
                { ... lappend($1, makeRawStmt($3, @3)); ... }
          | toplevel_stmt
                { ... list_make1(makeRawStmt($1, @1)); ... }
;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each &lt;code&gt;toplevel_stmt&lt;/code&gt; is one SQL statement, and they all get wrapped by &lt;code&gt;makeRawStmt()&lt;/code&gt; into a RawStmt. That is why the raw parser's output is always a List of &lt;code&gt;RawStmt&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Going down into &lt;code&gt;toplevel_stmt&lt;/code&gt; and then &lt;code&gt;stmt&lt;/code&gt;, you find that &lt;code&gt;stmt&lt;/code&gt; is a giant OR rule.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stmt:
        AlterEventTrigStmt
      | AlterCollationStmt
      | AlterDatabaseStmt
      | ...
      | SelectStmt
      | InsertStmt
      | ...
;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;More than 120 statement kinds are listed here on the current PostgreSQL 18, one per line. Each then expands into its own sub-rules. &lt;code&gt;SelectStmt&lt;/code&gt; alone has dozens of sub-rules, and every SELECT element such as FROM clause, WHERE clause, GROUP BY, set operation unfolds inside it like a tree. The roughly 20,000 lines of gram.y are the result of this unfolding.&lt;/p&gt;

&lt;p&gt;A new tree node is built inside the reduce action. At the moment &lt;code&gt;SELECT * FROM users&lt;/code&gt; reduces into a SelectStmt node, &lt;code&gt;makeNode(SelectStmt)&lt;/code&gt; allocates the node and fills in target list, FROM clause, WHERE, and so on. At this moment we do not know which OID of which table &lt;code&gt;users&lt;/code&gt; refers to. The identifier string &lt;code&gt;"users"&lt;/code&gt; simply lives inside a &lt;code&gt;RangeVar&lt;/code&gt; node. Catalog lookup is the next stage's job.&lt;/p&gt;

&lt;p&gt;gram.y also happens to be the file that most directly shows PostgreSQL's history of SQL compatibility changes. When a new PostgreSQL version adds a new SQL feature (such as MERGE or JSON_TABLE), the grammar rules grow accordingly, so dozens to hundreds of lines get added to gram.y each time. Following git blame is enough to see which SQL feature entered which PostgreSQL version. Its roughly 20,000-line size is the trace of three decades of SQL standard accumulation in a single file.&lt;/p&gt;

&lt;h2&gt;
  
  
  The lookahead filter: base_yylex
&lt;/h2&gt;

&lt;p&gt;There is one more detail. The parser Bison generates is &lt;strong&gt;LALR(1)&lt;/strong&gt;. Spelling out the acronym, that is Look-Ahead Left-to-right Rightmost-derivation, with 1-token lookahead. The name sounds intimidating, but the gist is simple. The parser scans input from left to right exactly once, and at each step it peeks at exactly one upcoming token ("1-token lookahead") to decide which grammar rule to apply. Being able to decide with one token of lookahead keeps the parser fast and memory-efficient. Almost every mainstream compiler/DB parser works this way.&lt;/p&gt;

&lt;p&gt;The catch is that SQL grammar has a few cases that do not fit cleanly into LALR(1). Multi-word tokens like &lt;code&gt;NULLS FIRST&lt;/code&gt; and &lt;code&gt;WITH ORDINALITY&lt;/code&gt; need to be received by the grammar as single tokens. If NULLS and FIRST were given to the grammar as separate tokens, the grammar could not decide what comes after NULLS based on a single lookahead.&lt;/p&gt;

&lt;p&gt;PostgreSQL solves this by inserting one filter layer between the lexer and the grammar. That filter is &lt;code&gt;base_yylex()&lt;/code&gt;. The actual lexer flex generates is &lt;code&gt;core_yylex()&lt;/code&gt;, but Bison never calls it directly. It always receives tokens through &lt;code&gt;base_yylex()&lt;/code&gt;. base_yylex looks at one token, and if this is a case that needs the next token pulled in and merged, it gives the grammar a single combined token. The result is that the grammar can be written cleanly as LALR(1), and the complexity of multi-word tokens is isolated inside base_yylex.&lt;/p&gt;

&lt;h2&gt;
  
  
  Never touches the catalog
&lt;/h2&gt;

&lt;p&gt;The most important constraint of the raw parser stage is that &lt;strong&gt;it never accesses the system catalog&lt;/strong&gt;. This is not just a convention but a correctness requirement.&lt;/p&gt;

&lt;p&gt;The header comment of &lt;code&gt;pg_parse_query()&lt;/code&gt; explains why.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Analysis and rewriting cannot be done in an aborted transaction, since they require access to database tables. So, we rely on the raw parser to determine whether we've seen a COMMIT or ABORT command; when we are in abort state, other commands are not processed any further than the raw parse stage.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To follow this comment, you first need to know what "the transaction is in abort state" means. Here is the most intuitive scenario.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;BEGIN&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'a'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'b'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;   &lt;span class="c1"&gt;-- unique violation!&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The moment the third line violates the unique constraint and errors out, PostgreSQL marks this transaction as "already broken." That is the abort state. The next line, &lt;code&gt;SELECT * FROM users;&lt;/code&gt;, is grammatically fine and the table exists in the catalog, but PostgreSQL refuses to execute it. Instead, every subsequent SQL gets the same one-line error.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ERROR:  current transaction is aborted, commands ignored until end of transaction block
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the situation in &lt;code&gt;psql&lt;/code&gt; where, after one query inside a transaction fails, every following query is rejected with the same message. To unlock it, the client must explicitly send &lt;code&gt;ROLLBACK&lt;/code&gt; (or &lt;code&gt;COMMIT&lt;/code&gt;, which in abort state has the same effect as rollback). In other words, the only SQL that PostgreSQL effectively accepts in abort state is ROLLBACK.&lt;/p&gt;

&lt;p&gt;Now the relationship between the raw parser and the catalog matters. ROLLBACK still arrives as SQL text, so it has to be parsed somehow. But in abort state, even catalog reads are blocked. What if the raw parser depended on the catalog? Parsing ROLLBACK itself would then error out, and the client would have no way to unwind the transaction. Killing and reopening the connection would be the only escape.&lt;/p&gt;

&lt;p&gt;The design choice that avoids this from the start is the raw parser's catalog ban. The raw parser must work in any transaction state, so it never touches the catalog. All identifier-level meaning analysis is deferred to the next stage (parse analysis). 1.2.2 covers that story.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means in practice
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;First, "parsing is fast" really means raw parsing is fast.&lt;/strong&gt; PostgreSQL's raw parser is pure string work without catalog lookup, so it is very fast. But the bulk of what a prepared statement (1.1.2) caches is not raw parsing. It is &lt;strong&gt;the next stage (parse analysis)&lt;/strong&gt;. Parse analysis is where catalog lookups, function-overload resolution, and type checking happen. When people talk about the per-query parse cost of the simple query protocol being a burden, they almost always mean parse analysis cost, not raw parse cost. To diagnose accurately, separate which stage's cost you are looking at.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second, keyword categories are a hidden trap in PostgreSQL compatibility and migration.&lt;/strong&gt; A word that other RDBMSes happily allow as a column or function name may be RESERVED in PostgreSQL. If your migration tool does not automatically wrap such names in quotes (&lt;code&gt;"name"&lt;/code&gt;), you get a syntax error. Conversely, identifiers PostgreSQL allows freely (such as &lt;code&gt;abort&lt;/code&gt;) may be blocked in another DB. When reviewing a migration, comparing the keyword tables on both sides is the safe move.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third, the raw parser only catches syntax errors.&lt;/strong&gt; Errors like "table does not exist", "function signature does not match", "column is ambiguous" are all thrown by the next stage, parse analysis. So when a query has both a syntax error and a semantic error, the syntax error is reported first. The semantic error only surfaces once syntax is clean. If your error message suddenly changes during debugging, that may be a signal that the syntax issue cleared and you are now seeing the next stage's error.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>internals</category>
      <category>parser</category>
    </item>
  </channel>
</rss>
