JoongHyuk Shin

Posted on Jun 21

1.4.4 Join Strategies: Nested Loop, Hash, Merge

#postgres #database #internals #planner

The costs in 1.4.2 and 1.4.3 were about how to read a single table. A join combines two tables, and PostgreSQL has three ways to do it: nested loop join, hash join, and merge join. None of the three is always best. Their cost shapes differ, so the winner flips depending on whether the inputs are small or large, already sorted, and whether the join condition is an equality.

This section walks through what each method does, where its cost comes from, and what the planner looks at to pick one.

Nested Loop Join

The simplest method. It walks one side (the outer) one row at a time and, for each row, scans the whole of the other side (the inner) to find matching pairs. All three algorithms split the two inputs into an outer and an inner, but the name fits nested loop most literally.

The cost falls right out of this shape. Say the outer has N rows and the inner has M rows. Without an index, scanning the inner from start to finish for every outer row means examining N times M pairs. In big-O terms that is O(N×M), and it grows quadratically when both sides get large. An outer of 10,000 and an inner of 10,000 means 100 million pairs.

But things change completely when the inner can be reached through an index. This section uses two tables for its examples: customers (about 20,000 rows) and orders (about 200,000 rows), where orders points to its customer through a customer_id column that has an index. Consider a query joining a few small customers against the large orders.

EXPLAIN SELECT c.name, o.id, o.amount
FROM customers c JOIN orders o ON o.customer_id = c.id
WHERE c.id <= 3;

 Nested Loop  (cost=4.66..134.16 rows=30 width=19)
   ->  Index Scan using customers_pkey on customers c  (cost=0.29..8.34 rows=3 width=13)
         Index Cond: (id <= 3)
   ->  Bitmap Heap Scan on orders o  (cost=4.37..41.84 rows=10 width=14)
         Recheck Cond: (c.id = customer_id)
         ->  Bitmap Index Scan on orders_customer_id_idx  (cost=0.00..4.37 rows=10 width=0)
               Index Cond: (customer_id = c.id)

There is a lot of information here on first sight, but three things matter right now. First, the top node is Nested Loop itself: the planner picked this join method. Second, the two indented blocks below it are the outer and inner inputs. customers, on top, is the outer; orders, below, is the inner. Third, cost= and rows= are the cost and row-count estimates seen earlier. The Bitmap Heap Scan and Bitmap Index Scan on the inner side are one of several ways to reach the inner through an index; the kinds of scan nodes and how they work are covered separately in 1.5. For now, all you need to see is that the inner reaches only the matching rows through the index.

The key is that the inner, orders, does not get scanned in full. Each outer row's value (c.id) is passed into the inner's index condition (customer_id = c.id), so only matching rows are pulled directly through the index. This is exactly the parameterized index scan seen earlier (a scan that uses the value handed down by the outer as its index key). With only 3 outer rows, the inner index is probed three times, and the total cost of 134 is almost nothing. In big-O terms, one index lookup is roughly O(log M), so the whole thing drops to O(N × log M). When N (the outer) is small, it is close to linear in the outer row count.

The multiplicative cost turning into "one index lookup per small outer row" is the lifeblood of nested loop. A small outer with an index-reachable inner makes it the fastest of the three; both sides large with no index makes it the slowest.

Nested loop has one more ability the other two lack: it handles arbitrary join conditions. A non-equality join (o.amount < c.credit) or a join wrapped in a function, anything that is not an equality ('='), cannot be handled by the hash join or merge join we are about to see. Nested loop simply evaluates the condition on every pair of rows, so the condition can be anything. That is why, when a non-equality join shows up, the only option left to the planner is nested loop.

Hash Join

Hash join builds a hash table in memory from the smaller side, then streams the larger side one row at a time and looks up only the matching bucket (a slot in the hash table) to find pairs.

EXPLAIN SELECT o.id, o.amount, c.name
FROM orders o JOIN customers c ON o.customer_id = c.id;

 Hash Join  (cost=578.00..4377.11 rows=200000 width=19)
   Hash Cond: (o.customer_id = c.id)
   ->  Seq Scan on orders o  (cost=0.00..3274.00 rows=200000 width=14)
   ->  Hash  (cost=328.00..328.00 rows=20000 width=13)
         ->  Seq Scan on customers c  (cost=0.00..328.00 rows=20000 width=13)

The thing under the Hash node is customers (20,000 rows). The planner picked the smaller of the two as the build side, because a smaller hash table fits in memory better. The large orders (200,000 rows) is scanned with Seq Scan and probed. Since no sorting is needed at all, hash join usually wins on an equality join where both tables are large and neither is sorted. Its complexity is O(M) for the build plus O(N) for the probe, which is O(N+M) together. Addition rather than multiplication is where it parts ways with nested loop.

Hash join's weakness is memory. When the built hash table exceeds work_mem (the working-memory limit one operation in a query may use, 4MB by default), it spills to disk and the advantage erodes. Running the same join with work_mem cut to 64kB shows the traces.

SET work_mem = '64kB';
EXPLAIN (ANALYZE) SELECT o.id, c.name
FROM orders o JOIN customers c ON o.customer_id = c.id;

 Hash Join  (actual rows=200000.00 loops=1)
   ...
   Buffers: shared hit=1402, temp read=679 written=679

temp read=679 written=679 is the trace of the hash table spilling to temporary disk files because it did not all fit in memory. The more batches there are, the more this disk I/O adds to the cost, eroding hash join's advantage.

Hash join has one restriction: it only does equality ('=') joins. "Equal keys land in the same bucket" is hash's premise, so it cannot be used for non-equality conditions. Different keys scatter into different buckets, leaving no way to find "greater-or-equal" matches.

Merge Join

Merge join first sorts both tables by the join key (the columns used in the join condition that links the two tables, here o.customer_id and c.id), then advances the two sorted inputs one step at a time to stitch matches together. It scans each side only once, so it is very efficient as long as things are sorted.

The catch is that "as long as things are sorted." If the inputs are not sorted, both sides have to be sorted, and that sort cost goes into startup cost (the cost paid before the first result comes out). Sorting is not cheap, so a merge join that has to sort from scratch tends to lose to hash join. In big-O terms, when both sides are already sorted it scans each once for O(N+M), but when a fresh sort is needed the sort cost O(N log N + M log M) dominates.

But there are cases where no fresh sort is needed. A B-tree index hands back its results already sorted by key. When both tables have an index on the join key, the planner can skip the sort and feed the index order straight into the merge.

EXPLAIN SELECT o.id, c.name
FROM orders o JOIN customers c ON o.id = c.id;

 Merge Join  (cost=0.85..1475.69 rows=20000 width=13)
   Merge Cond: (o.id = c.id)
   ->  Index Only Scan using orders_pkey on orders o  (cost=0.42..5204.42 rows=200000 width=4)
   ->  Index Scan using customers_pkey on customers c  (cost=0.29..659.29 rows=20000 width=13)

This query joins the two tables on their primary keys (id). Both sides are read through PK indexes, so the inputs are already in id order, and there is no Sort node anywhere in the plan. Startup is 0.85, almost 0. The planner picked merge over hash for this join: with no fresh sort needed, merge wins.

Merge join is restricted to equality ('=') joins, same as hash join, but for a different reason. Merge join advances two pointers over the sorted tables looking for "the same key value." If one side's value is smaller it advances that side, and when the two values are equal it emits a pair. This advancing only fits the equality condition of "values are equal." A non-equality like o.amount < c.credit has one row matching an entire contiguous range on the other side, which a simple merge advancing one step at a time cannot handle. So non-equality joins can use neither hash join nor merge join, leaving only nested loop, which evaluates every pair of rows. That is why nested loop was called the only option for non-equality joins earlier.

How the Planner Chooses Among the Three

Pulling the trade-offs into one table. N is the outer, M is the inner row count.

Method	Complexity	Favorable case	Join condition
nested loop	O(N×M), O(N log M) with indexed inner	small outer + indexed inner	any (equality and non-equality)
hash join	O(N+M)	large equality join (neither side sorted)	equality ('=') only
merge join	O(N+M) sorted, O(N log N + M log M) needs sort	both sides already sorted	equality ('=') only

What this means is that no single method always wins; the advantage shifts with the shape of the inputs. With an indexed inner and a small outer, nested loop's O(N log M) is lowest; with both sides large and unsorted, hash join's O(N+M) leads; with both already sorted, merge join wins at O(N+M) with no sort cost.

So PostgreSQL does not hard-code this ordering into if-else rules. Every time it joins two tables, it builds all three candidates, prices each, and keeps only the cheapest after the dominance check (the comparison where a better candidate pushes out a worse one).

Why this matters shows in how easily a fixed rule like "a small table means nested loop" breaks. Take the earlier query joining customers and orders, and just change the WHERE range to grow the outer.

-- WHERE c.id <= 3    (outer 3 rows)
 Nested Loop  (cost=4.66..134.16 ...)

-- WHERE c.id <= 300  (outer 300 rows)
 Hash Join  (cost=18.29..3817.40 ...)

The query shape is the same, but an outer of 3 rows gets nested loop while 300 rows gets hash join. As the outer grows, the number of inner index lookups (which is the outer row count) grows with it, raising nested loop's cost, until at some point the "build once, probe once" of hash becomes cheaper. The planner computes that crossover point by cost and switches over. It is not a rule set in stone but a calculation redone every time the input changes.

There are settings that turn off a join method: enable_nestloop, enable_mergejoin, enable_hashjoin. The names make them look like switches that forbid a join, but turning one to false does not make the candidate disappear. Instead the candidate gets a disabled mark and falls to the back in the cost comparison. When there is no other option at all, even a turned-off method ends up being used.

The classic "no other option" case is the full outer join. A full outer join cannot be implemented as a nested loop. Nested loop's asymmetric structure of walking the outer and searching the inner gives it no way to track rows on the inner side that never matched. It can emit the outer's unmatched rows the way a left join does, but a full outer join must emit the unmatched rows on both sides. So a full outer join is implemented only as a merge join or hash join, both of which scan both tables to the end; on such a query, the planner has to use one of those even with enable_mergejoin or enable_hashjoin off. In the end these settings are closer to diagnostic tools than to operational tuning switches. The application section returns to this.

Filtering Candidates Early with a Lower Bound

The planner does not compute a candidate path's cost all the way through in one go; it splits the work into two phases. First, in a preliminary phase, it quickly computes a lower bound on the cost for each candidate path. The lower bound is a value the path's actual cost, when the path is actually run, cannot drop below.

The way the lower bound is computed is simple. It adds up only the costs certain to be paid to read and process the inputs: for nested loop, the cost of reading both inputs; for a merge join that needs sorting, the sort cost on top; for hash join, the cost of building the hash table from the inner. Then it deliberately skips the single most expensive part: the cost of checking, for every pair of rows, whether the join condition actually matches, which grows in proportion to the number of row pairs and is heavy to compute. The reason this value cannot exceed the actual cost is that the skipped condition-evaluation cost can never be negative. The actual cost is just this value plus that condition-evaluation cost, so it can never be smaller. That is why this value is a lower bound.

This lower bound is used to compare against candidates already found. Suppose the planner is working through several candidates for joining two tables, and the cheapest found so far is a hash join at a cost of 4377. The next candidate it examines is a merge join, and before going into the precise calculation it computes the lower bound, which comes out to 12000. That 12000 is already larger than 4377. The merge join's actual cost cannot drop below its lower bound (that is the definition), so it is at least 12000, definitely more expensive than the hash join's 4377. So the merge join is discarded without ever doing the precise calculation.

Conversely, suppose the next candidate, a nested loop, comes out with a lower bound of 3000. 3000 is less than the current best of 4377. Here the actual cost still has a chance of being cheaper than 4377, so it cannot be discarded. Only the nested loop is passed to the second phase, where the precise cost including the join-condition evaluation is computed. If that precise cost comes out at 3500, the nested loop becomes the new best; if 4500, the existing hash join stays best.

The reason for splitting into two phases is that join candidates explode. With several tables, the join order (covered in 1.4.5) branches many ways, and at each step of each order, the three algorithms times the outer/inner placement times the several scan methods each input has all multiply together. Computing every one of these candidates at full precision from the start would slow the planner itself down. Knocking most of them out early with a cheap lower bound leaves the expensive precise calculation only for the few genuinely competing for the win.

What This Means in Practice

First, the most common culprit in a slow join query is the planner picking a nested loop when the real outer turns out to be far larger than estimated. The planner uses statistics (covered in 1.4.8) to estimate the outer's row count and decides "it's small, so nested loop is cheap." But when the statistics are stale, or correlated conditions stack up and underestimate, the outer at execution time becomes tens of thousands of rows, and the inner index gets probed that many times, running slow. The diagnosis is EXPLAIN ANALYZE. This command attaches actual execution results rather than estimates to each node; among them, actual rows is the number of rows the node actually emitted, and loops is how many times the node was executed. A nested loop's inner runs once per outer row, so loops is the outer row count. When the estimated rows (rows=) and actual rows diverge badly and the inner's loops is more inflated than expected, this is the pattern.

Second, a setting like enable_nestloop is a diagnostic tool, not an operational setting. On a slow query, turning hash off with SET enable_hashjoin = off reveals, by cost, what the planner picks as the next-best and how much more expensive it is. Take the query above joining orders and customers on customer_id (the one the planner chose hash join for) and turn hash off.

SET enable_hashjoin = off;
EXPLAIN SELECT o.id, o.amount, c.name
FROM orders o JOIN customers c ON o.customer_id = c.id;

 Merge Join  (cost=0.58..12217.42 rows=200000 width=19)
   Merge Cond: (o.customer_id = c.id)
   ->  Index Scan using orders_customer_id_idx on orders o  (cost=0.29..9008.17 rows=200000 width=14)
   ->  Index Scan using customers_pkey on customers c  (cost=0.29..659.29 rows=20000 width=13)

The next-best comes out as a merge join, and the cost jumps from 4377 (hash) to 12217 (merge), nearly threefold. It reads orders in customer_id index order to get the sort, but that index scan itself is expensive enough that the total exceeds hash. That number is why the planner chose hash in the first place. Once the diagnosis is done, revert it. Pinning it in postgresql.conf ruins the plans of every other query in an attempt to fix this one.

Third, once a hash join is confirmed to have spilled to disk batches for lack of work_mem (the temp read/written and Batches seen earlier), raising the limit with SET work_mem in that query's session can cut the batches. But work_mem is memory taken per connection, and per sort and hash operation within a query, so raising it globally and carelessly can exhaust memory when there are many concurrent connections. Adjusting it per problem query is safer.