1.5.1 The Volcano Iterator Model

#postgres #database #internals #executor

In 1.5 we saw the big picture: the executor pulls results up one row at a time. That pulling structure ties together nodes that do completely different jobs, SeqScan, HashJoin, Sort, into a single execution loop. Nodes that do such different things all run in one loop because every node follows the same interface, and that interface also explains what the actual rows and loops in EXPLAIN ANALYZE are.

Every node follows the same interface

The executor tree has many kinds of nodes. SeqScan sweeps a table, IndexScan follows an index, HashJoin and NestLoop bring two inputs together, Sort orders rows, HashAgg groups and counts. They all do different things. Yet the executor binds them with a single promise. A node is called like a function, and each call hands back exactly one result row. Call the same node again and it gives you the row right after the one it last returned; when there are no more rows, it returns an empty result to signal the end. The unit that travels between nodes one at a time is the tuple.

Inside PG this "next row, please" request is a single function called ExecProcNode. Most nodes (PlanState) such as scans, joins, and sorts carry this function, so the caller can invoke ExecProcNode and get the next tuple without knowing whether the other side is a SeqScan or a HashJoin. The calling code does not have to differ per node type. Some nodes (Sort among them) do need to take in all of their input before they can emit the first row. Those nodes still send their output one row at a time; the only difference is that the first row does not come out until all the input has been gathered.

The name for fitting every operator to the same interface this way is the Volcano model. Goetz Graefe laid it out in 1994: every operator is shaped into three operations, open, next, and close. In PG, open is ExecInitNode, which prepares a node; next is ExecProcNode, which emits the next tuple; close is ExecEndNode, which tears it down. Match those three and any operator slots into the tree.

Why the uniform interface matters becomes clear if you imagine the opposite. If each node had its own output convention, you would need different glue code to put Sort on top of HashJoin versus Sort on top of SeqScan. With N node types the combinations explode. Unified under one next call, Sort only needs to know "pull one row at a time from whatever child it has," and a new operator only needs one next function implemented to sit on top of any existing node. The executor can assemble nodes in the exact shape of the plan tree because of this promise.

One next call goes down to the leaf and back up

A node's next cannot make a tuple on its own, because it needs input. So it calls its child's next, takes one row in, and works it into one result row of its own. That call goes to the child's child, then further down, until it reaches the scan node at the leaf of the tree. The scan node has no child, so it reads a row directly from disk.

Take SELECT * FROM orders WHERE amount > 100 LIMIT 5. The execution tree for this query is two nodes: Limit on top, SeqScan below. The WHERE condition does not become a separate node; it is evaluated as a filter inside the SeqScan node. In EXPLAIN you see Filter: (amount > 100) attached to the SeqScan, with no separate filter node. When Limit's next is called, Limit calls SeqScan's next. SeqScan reads one row from a page, checks right there whether amount exceeds 100, passes it up if it does, and discards it and reads the next row if it does not. When a passing row reaches Limit, Limit sends it to the caller (the client, or the target table of an INSERT). And once Limit has received five rows, it stops calling SeqScan's next.

The key is that the flow runs in two directions. The request (next) goes top to bottom, and the tuple goes bottom to top. Rows are not pushed up from below on their own; only when something above pulls a row does exactly that much get drawn up from below. To generalize the way SeqScan just checked amount, the internal procedure a scan node follows to make one row is this: it fetches one raw tuple through the access method, checks the qual (a filter expression like a WHERE condition, the same idea as the SQL standard's predicate), and if the tuple passes, runs a projection to keep only the needed columns and returns it. A raw tuple that fails the qual is discarded and the next one is fetched. So a single next on a scan node may internally sweep several raw tuples until it finds "the next row that passes the condition."

Why pull one row at a time

Compared with building the whole result and stacking it up, the benefits of pull come out.

First, memory stays constant regardless of result size, and no intermediate result is written to disk. Flowing one row at a time, a node only needs to hold the single tuple it is processing. Whether a query returns ten million rows or ten, the number of tuples any node holds at once is one, and a node reuses its single slot (TupleTableSlot) instead of allocating a fresh tuple. A row that SeqScan read and passed through the filter also rises all the way to the caller in one go, through the nodes above, so there is no need to build an intermediate table between stages. This processing, where one tuple passes through several nodes in a row, is called pipelining. (A node like Sort or HashJoin that must gather all of its input does use memory inside itself, but that is that node's own business and is different from the whole tree stacking up the result.)

Second, stop at the top and the bottom stops too. For a query with LIMIT 5, the topmost Limit node stops calling its child's next after it has received five rows. Then the nodes below it also do only as much work as it took to push up those five rows, and stop. Even if the table has ten million rows, SeqScan reads only as many pages as it takes to pass five rows up. With the build-it-all approach, LIMIT or not, you would compute ten million rows first and then cut to five.

That said, passing tuples one at a time (tuple-at-a-time) is not every database's answer. A function call happens every time a node is called, and when a single query sweeps hundreds of millions of rows, that call cost alone adds up enough to matter. So engines built mainly for analytical queries took a different road. Analytical databases like ClickHouse and DuckDB use vectorized execution, where each call passes not one row but a batch of thousands. DuckDB processes about 2,048 rows at a time, ClickHouse up to 65,536 rows in one chunk, which divides the call cost across the row count and lets the CPU's SIMD instructions (one instruction computing several values at once) process the batch together. On analytical workloads this difference reaches several times. PostgreSQL stays with tuple-at-a-time because it is a general-purpose engine handling both OLTP queries that touch few rows and large analytical queries in one engine. Flowing one row at a time fits patterns that stop early, like the LIMIT we just saw, and fits transaction processing that runs row by row.

A cursor: the client pulls one row at a time

This one-row-at-a-time structure is not confined to the executor's insides. Receiving a large result all at once swells the client's memory by the size of the result, because SELECTing millions of rows dumps all of those millions onto the client. The channel that lets you receive that result in smaller pieces is the cursor. When the client opens a cursor with DECLARE CURSOR and takes a few rows at a time with FETCH, the executor calls next from the top of the tree by that many each time a FETCH arrives. It does not build the whole result up front; it pulls only as much as the FETCH asks for, then and there.

So even a result of millions of rows, taken through a cursor 1,000 at a time, keeps the amount the client holds at once at 1,000. And while you receive through the cursor in pieces, the executor keeps that execution tree and transaction snapshot held as is, so FETCHing across several calls still reads consistently against the data as of when the cursor was opened.

Scan a node more than once and loops climbs

So far the picture has been each node swept once, start to finish. In a nested loop join it is different. Every time a row is pulled from the outer input, the entire inner input must be swept again from the start. If the outer has 1,000 rows, the inner is scanned from the beginning 1,000 times. Resetting a node back to its starting state like this is called a rescan, and PG handles it with ExecReScan.

This rescan count surfaces as loops in EXPLAIN ANALYZE output. The executor gathers statistics while a node runs: each time one scan (one loop) finishes, it accumulates the tuples that loop produced and bumps the loop count by one. So the two numbers are defined like this.

loops: how many times the node was scanned from the start. For a nested loop inner, this equals the outer row count. A node scanned only once has loops of 1.
actual rows: the average tuples produced per loop. The total tuples across all loops, divided by loops.

Here is the point people often get wrong. actual rows is the per-loop average, not the total of tuples the node produced. For example, if a nested loop inner IndexScan shows rows=1.00 loops=1000, it means the node was scanned 1,000 times and produced 1 row on average each time, so the total tuples it actually produced is 1 times 1,000, about 1,000 rows. When you see a node with a large loops, do not read actual rows alone and conclude "this node only touches one row." You have to multiply rows by loops to get the scale of what the node actually did.

Nested Loop  (actual rows=1000.00 loops=1)
  ->  Seq Scan on orders  (actual rows=1000.00 loops=1)
  ->  Index Scan on customers  (actual rows=1.00 loops=1000)

In the output above, the inner Index Scan's loops=1000 means the outer orders had 1,000 rows, so the customers index was re-traversed 1,000 times, and rows=1.00 means each of those 1,000 found 1 row on average. The product, about 1,000 index lookups, is what this node actually did.

What this means in practice

First, LIMIT can stop early only when there is no Sort. A LIMIT after ORDER BY cannot emit its first row until the Sort node has received all input and finished sorting, so even if the top stops early, everything below the Sort has already been read. Which plans can stop early and which cannot is decided by the node types.

Second, holding a cursor open for a long time carries a cost. If you leave a cursor open for a long time because receiving one row at a time feels light, the snapshot survives that much longer, and this leads to the side effects of a long-running transaction (vacuum delay among them, covered in Chapter 3). Close a cursor as soon as you are done reading, and if you are not going to receive the result to the end, do not stretch the transaction out.