1.5.3 Join Nodes: NestLoop, HashJoin, MergeJoin

#postgres #database #internals #executor

A scan node sits at the leaf of the tree and pulls rows from a single table. A join node sits in the middle and brings together the rows that its two children send up. It takes one row from users, one row from orders, checks whether they belong to the same user, and if they match, emits the combined row. PostgreSQL has three nodes for this one job: NestLoop, HashJoin, and MergeJoin. The reason a single task splits into three nodes is much like the reason scans did. There is more than one way to find matching pairs from two inputs, and which way is cheapest depends on the size of the inputs and the shape of the join condition.

Deciding which way is cheapest, by costing the alternatives, was the planner's job in an earlier chapter. This section looks at what those three nodes actually do when they execute. Given the same two tables, the three find matches in completely different ways, and that difference in approach is exactly what tells them apart.

How the three nodes route requests

All three join nodes are internal nodes with two children. One child is called the outer, the other the inner. All three run on the Volcano model's pull framework: when the parent asks for the next row, the join node takes rows from its two children, builds one matched row, and sends it up. The only difference is the order and manner in which it routes pull requests to its two children.

NestLoop pulls the inner from the start all over again for each outer row it receives. HashJoin slurps the inner in one pass to build an index in memory, then takes outer rows one at a time and probes that index. MergeJoin, on the assumption that both sides are sorted in the same order, advances both sides one step at a time in lockstep.

NestLoop: rescan the inner for every outer row

The simplest method is NestLoop. As the name says, the loops are nested. The outer loop takes one row from the outer; the inner loop scans the inner from beginning to end, looking for inner rows that match that outer row. When the inner is exhausted, the outer loop moves to its next row and the inner loop scans the inner from the start again. If the outer has N rows and the inner has M rows, the node performs N×M match comparisons.

At runtime the node keeps only two pieces of state: whether the inner scan for the current outer row is finished so the next outer row is needed, and whether the current outer row has matched the inner at least once. That second flag exists for outer joins like LEFT JOIN. A LEFT JOIN must emit even an outer row that never finds a match, filling the inner side with NULLs, and that NULL-filled row should be produced only when the inner has been fully scanned without a single match. So the node remembers whether a match has happened. A plain INNER JOIN, which simply discards an unmatched outer row, has no need for this flag. When the next outer row is needed, the node tells the inner child to start over, and that rewind resets the inner subtree so it flows again from its first row.

Judging by the number N×M alone, NestLoop looks like a textbook case of inefficiency. Yet there is a decisive reason PostgreSQL picks it often: when the inner can use an index, the inner loop's "scan the whole inner" turns into "fetch just one row by index." The key is that when the outer loop takes an outer row, it passes that row's join key value down to the inner. The inner plugs that value into its index condition and, instead of scanning the whole table, goes straight to the matching rows through the index.

SELECT *
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.created_at > '2026-01-01';

Suppose customers.id has a primary key index. When the planner picks NestLoop here, the outer loop pulls qualifying orders rows one at a time and passes each row's customer_id down to the inner. The inner does not scan customers from the start; it resolves WHERE c.id = <that customer_id> through the index and fetches just that one customer. So even with ten thousand outer rows, the inner's real work is "ten thousand index lookups," not "ten thousand full table scans." In EXPLAIN, when an Index Scan sits under the inner side of a Nested Loop and its index condition references an outer column, this is the shape you are seeing.

A second strength of NestLoop follows from this: it does not care about the shape of the join condition. All NestLoop does is pair an outer row with an inner row and evaluate the join condition, so the condition can be an equality (=), an inequality (<), a range, or a function call. The HashJoin and MergeJoin we will see next work only when the join condition is an equality; NestLoop has no such restriction. A join that connects two tables with an inequality, or matches them through an arbitrary function like a distance calculation, has no option other than NestLoop.

The trouble is when the inner cannot use an index and there are many matches. Scanning the entire inner for every outer row makes N×M a reality, and with even tens of thousands of outer rows the comparison count climbs into the billions. When the planner misjudges this boundary and picks NestLoop, you get the classic incident of a single query spinning the CPU without end.

HashJoin: build a hash table from the inner, look up from the outer

HashJoin's idea is not to rescan the inner over and over, but to read it just once and build a fast index in memory. That index is the hash table.

HashJoin runs in two phases. First, in the build phase, it pulls the entire inner and builds a hash table keyed on the join key. It runs the key through a hash function to decide which bucket it lands in, and collects inner rows with the same hash value into the same bucket. This phase slurps the inner all at once: unlike other nodes that emit one row at a time, it consumes the entire inner to complete the table before moving on. Then, in the probe phase, it takes outer rows one at a time, runs each row's join key through the same hash function to find its bucket, and compares only against the inner rows in that bucket.

This asymmetry is the heart of HashJoin. The inner is read exactly once to become an index, and after that only the outer flows. It is the opposite of NestLoop, which reads the inner as many times as there are outer rows. Processing one outer row compares not against the whole inner but against the few inner rows in the same bucket, so in a well-distributed hash table the comparisons per outer row drop to a handful.

In return, HashJoin carries two restrictions NestLoop does not. First, it works only for equi-joins. Hashing guarantees "equal values land in the same bucket," but it tells you nothing about "greater than" or "less than." So an equality condition like a.x = b.y can be resolved, but an inequality like a.x < b.y cannot be narrowed by hashing. Second, the hash table occupies memory. As long as the inner is small enough to fit in memory there is no problem, but once the inner exceeds the memory limit, the story changes.

That memory limit is not work_mem itself but work_mem multiplied by hash_mem_multiplier (default 2). In other words, with the default setting the hash table can use up to twice work_mem. If, during the build phase, the hash table looks like it will not fit within this limit, PostgreSQL splits the inner and outer into several batches by the hash value of the join key in advance. It loads one batch into memory to finish build and probe, then writes the remaining batches to temporary files on disk and brings them back one at a time. The Buckets: ... Batches: ... Memory Usage: ... printed on the Hash node in EXPLAIN ANALYZE shows this state. Batches of 1 means the hash table fit entirely in memory; 2 or more means it was split out to disk, adding that much temporary-file I/O. A large Batches is a signal that work_mem is short for this join.

PostgreSQL adds one more optimization here, for the case where the data is not evenly spread but skewed toward particular values. Picture joining an orders table to customers where one large customer accounts for a sizable share of all orders. In that case a single customer's id appears unusually often in the outer's join key. PostgreSQL can know from statistics which values appear this often (the most common values), and it pulls the inner rows for those popular values into a dedicated space that is never spilled to a disk batch. This dedicated space is called the skew bucket: skew means the distribution leans to one side, and this is a separate bucket for those leaning values. Those values will be probed frequently from the outer, and if their inner rows were sitting in a disk batch, each probe would have to read them back from disk. Holding them in memory in the skew bucket eliminates that repeated disk read.

MergeJoin: advance two sorted inputs in lockstep

The third method starts from the assumption that both inputs are already sorted by the join key. If both are sorted lists, you can match them by walking down both at once, one step at a time, without indexing one side wholesale or rescanning it. It is like laying two sorted name lists side by side and walking down to find the same name.

It begins with a read position on each side. It compares the join keys at the two positions: if they are equal, that is a match, so it emits the combined row; if they differ, it advances only the position on the smaller side by one. A smaller value means that row's match has already passed or will never come, so it is safe to drop it and move on. Since the two positions only ever move in one direction until each reaches its end, the join finishes after reading each input exactly once. Unlike NestLoop, which reads the inner repeatedly, and HashJoin, which loads the whole inner into memory, MergeJoin streams both sides exactly once.

For MergeJoin to work, both sides must be sorted, and that sorting is not free. If both inputs happen to come up already sorted because they were read through a join-key index, the merge can begin right away at no sorting cost; otherwise a Sort node is placed under one or both sides to sort the input first, and only then does the merge begin. In EXPLAIN, whether a Sort hangs under the Merge Join or a sorted Index Scan feeds straight in tells you whether that sorting cost was paid.

A tricky case slips in here: when the join key has duplicate values. If the outer has the key 5 twice and the inner has 5 three times, the first outer 5 must match all three inner 5 rows, and the second outer 5 must also match those same three inner 5 rows. But by the time the first outer 5 is processed, the inner's read position has already moved past the three 5 rows to the next value. To match the second outer 5, the inner position must be moved back to where 5 began.

This rewinding is called mark and restore. A mark is placed at the position where a run of equal values begins on the inner, and when the same value appears again on the outer, the inner position is restored to that mark. So although we said MergeJoin reads both sides only once, more precisely a stretch of the inner can be rewound and reread because of duplicate values on the outer. If the join key is nearly unique on both sides, this rewinding barely happens and the join finishes most cleanly; if there are many duplicates, that much rewinding cost is added.

One more thing: mark and restore work only if the inner child can accept the request "remember this position and go back to it." Not every execution node can return to an arbitrary past position. A node that streams sort output or another join's result one row at a time has no way back to a row it has already passed. In that case PostgreSQL inserts a Materialize node above the inner. Materialize is a node that stacks up the rows the inner sends, in memory (spilling to disk if it overflows). Because mark and restore happen on this stacked-up copy rather than on the inner original, the copy can be walked back and forth freely. The Materialize you see under a Merge Join in EXPLAIN is that mechanism.