PostgreSQL has three join algorithms. The planner picks between them for every join in every query, driven by several things at once: the estimated sizes of the two inputs, whether they arrive already sorted on the join key, the type of join (inner vs left/semi/anti), which operators are mergejoinable or hashjoinable, whether a hash table will fit in work_mem, and the cost parameters that weigh I/O against CPU. Get the decision right and a three-way join across millions of rows runs in tens of milliseconds. Get it wrong — usually by encouraging a Nested Loop on two large unsorted inputs — and the same query takes minutes.
This article is the third in the Complete Guide to PostgreSQL SQL Query Analysis & Optimization series. We assume the reader can read EXPLAIN output and is familiar with the indexing vocabulary. The running dataset is the same Neon Postgres 17.8 database used throughout the series: 500,000-row sim_bp_orders, 1,000,000-row sim_bp_order_items, 200,000-row sim_bp_users.
We'll cover how each of the three join strategies works, when the planner picks each, what indexes each one wants, and how to read multi-way joins.
Nested Loop — small outer, indexed inner
Nested Loop is the simplest strategy: for each row on the outer side, scan the inner side for matches. Without any index on the inner side, this is a full scan per outer row — O(outer × inner) — and catastrophic for two large tables. With an index on the inner side's join key, each "scan" of the inner is a handful of page reads (a btree descent plus a heap fetch for any columns not in the index), so the total cost is outer-rows × random-I/O-per-probe rather than a polynomial blowup. When the outer side is small and the inner has an index, Nested Loop is nearly unbeatable.
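The asymmetry between the unindexed and indexed cases is easy to see outside the database. This is a toy model, not PostgreSQL's executor; the dict stands in for the inner side's btree index:

```python
# Toy model of Nested Loop join: the inner side is visited once per outer row.
outer = [(1, "pending"), (2, "pending")]      # (order_id, status)
inner = [(1, 2), (1, 3), (2, 1), (3, 5)]      # (order_id, quantity)

# Without an index: every outer row scans the whole inner side,
# O(len(outer) * len(inner)) comparisons.
naive = [(o, i) for o in outer for i in inner if i[0] == o[0]]

# With an "index" (dict keyed on the join column): one cheap probe per outer row.
index = {}
for row in inner:
    index.setdefault(row[0], []).append(row)
indexed = [(o, i) for o in outer for i in index.get(o[0], [])]

assert naive == indexed   # same result, very different cost profile
```

Same output either way; the only thing that changes is how much work each outer row triggers on the inner side.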
Here's a three-way join that the planner executes as a tower of Nested Loops. The query is "twenty recent pending orders with the user's email and the items in each order":
SELECT u.email, o.order_id, oi.quantity, oi.unit_price_cents
FROM sim_bp_users u
JOIN sim_bp_orders o ON o.user_id = u.user_id
JOIN sim_bp_order_items oi ON oi.order_id = o.order_id
WHERE o.status = 'pending' AND u.status = 'active'
ORDER BY o.created_at DESC
LIMIT 20;
Limit (cost=1.28..18.06 rows=20 width=41) (actual time=5.098..43.038 rows=20 loops=1)
Buffers: shared hit=96 read=45
-> Nested Loop (cost=1.28..159741.54 rows=190376 width=41)
(actual time=5.097..43.027 rows=20 loops=1)
-> Nested Loop (cost=0.85..78058.04 rows=95188 width=33)
(actual time=2.949..10.878 rows=9 loops=1)
Inner Unique: true
-> Index Scan Backward using idx_sim_bp_orders_created_at on sim_bp_orders o
(cost=0.42..30949.29 rows=100300 width=16)
(actual time=1.566..2.610 rows=9 loops=1)
Filter: ((o.status)::text = 'pending'::text)
Rows Removed by Filter: 44
-> Memoize (cost=0.43..0.55 rows=1 width=25)
(actual time=0.916..0.916 rows=1 loops=9)
Cache Key: o.user_id
Cache Mode: logical
Hits: 0 Misses: 9 Evictions: 0 Overflows: 0 Memory Usage: 2kB
-> Index Scan using sim_bp_users_pkey on sim_bp_users u
(cost=0.42..0.54 rows=1 width=25)
(actual time=0.846..0.846 rows=1 loops=9)
Index Cond: (u.user_id = o.user_id)
Filter: ((u.status)::text = 'active'::text)
-> Index Scan using idx_sim_bp_order_items_order_id on sim_bp_order_items oi
(cost=0.42..0.83 rows=3 width=12)
(actual time=2.418..3.567 rows=2 loops=9)
Index Cond: (oi.order_id = o.order_id)
Execution Time: 43.129 ms
43 ms for a three-way join across 200k × 500k × 1M rows is good. The plan is a tower of two Nested Loops — the inner one joins orders and users, the outer one joins that intermediate result with order items. Read it top-down:
- The `Index Scan Backward` on `sim_bp_orders.created_at` walks the index in reverse — newest first — looking for pending orders. `rows=9 loops=1` means the outer driver produced nine orders before the whole pipeline had enough downstream rows to satisfy `LIMIT 20`. Forty-four rows were read and filtered out as non-pending along the way.
- For each of those nine orders, a `Memoize → Index Scan on sim_bp_users_pkey` looks up the user. Memoize is a PostgreSQL 14+ cache that short-circuits the inner scan when the same key appears repeatedly; here the nine orders happen to be from nine different users, so it's effectively nine primary-key lookups with no cache hits.
- For each matching `(order, user)` pair, the `Index Scan using idx_sim_bp_order_items_order_id` returns an average of two to three line items per order (`rows=2 loops=9`). The `LIMIT 20` applies to the final joined row count, so the executor stops as soon as 20 `(order, user, item)` tuples have been produced — roughly the point where 9 orders × ~2 items each ≈ 20 rows.
This is the Nested Loop success case: the outer driver returns a tiny number of rows thanks to the LIMIT + ordered index, and every inner lookup is an indexed point query. Without the LIMIT, the planner would likely pick a very different strategy — possibly a Hash Join cascade — because it would have to produce tens of thousands of rows instead of twenty.
The Nested Loop failure mode
The same strategy is a disaster when the outer side is large. Consider "count the items across all pending orders," which must process 100,000 pending orders:
SELECT count(*)
FROM sim_bp_orders o
JOIN sim_bp_order_items oi ON oi.order_id = o.order_id
WHERE o.status = 'pending';
If we force the planner to use a Nested Loop (by disabling hash and merge joins), the result is telling:
Aggregate (actual time=1621.494..1621.495 rows=1 loops=1)
Buffers: shared hit=398994 read=2894
-> Nested Loop (actual time=6.422..1606.338 rows=200535 loops=1)
-> Index Only Scan on sim_bp_orders o
(actual time=4.859..123.354 rows=100252 loops=1)
-> Index Only Scan on sim_bp_order_items oi
(actual time=0.013..0.014 rows=2 loops=100252)
Index Cond: (oi.order_id = o.order_id)
Execution Time: 1621.525 ms
1.6 seconds for the same result the planner produces in 1.2 seconds via a Parallel Hash Join (next section). More interestingly, the Buffers line shows 398,994 pages hit — that's from 100,252 inner-index probes, each one re-traversing the btree descent of idx_sim_bp_order_items_order_id. Many of those probes hit the same upper index pages over and over (that's why it's mostly hit, not read), but it's still enormous repeated page traffic that dominates CPU even when the data is fully cached. Under concurrency, other queries would find their own working set evicted from shared_buffers to make room.
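The buffer arithmetic backs this up. A quick back-of-envelope check (the per-probe page count is inferred from the plan's own numbers, not something EXPLAIN reports directly, and the total also includes the outer scan's pages):

```python
# Each inner probe re-descends the btree: root + internal + leaf pages,
# plus any visibility-map / heap pages the index-only scan still touches.
probes = 100_252                    # loops on the inner Index Only Scan
total_buffers = 398_994 + 2_894    # shared hit + read from the Buffers line

pages_per_probe = total_buffers / probes
print(round(pages_per_probe, 1))   # -> 4.0
```

Roughly four pages per probe, a hundred thousand times over: that is the repeated-traversal tax the Hash Join avoids by reading each page once.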
The MyDBA analyzer rule nested_loop_large is specifically for this failure mode: it fires when a Nested Loop has Plan Rows > 1000 on the outer side and Plan Rows > 100 on the inner side. At those sizes the Nested Loop is almost always the wrong strategy.
Hash Join — larger sides, unsorted input
Hash Join works in two phases:
- Build phase. Read the smaller side in full, building an in-memory hash table keyed by the join column(s). This happens inside the `Hash` node you see in the plan.
- Probe phase. Stream the larger side through the hash table, emitting matched rows as they come.
Hash Join doesn't care whether the inputs are sorted, which makes it the fallback when Merge Join isn't available. It wants the build side to fit in work_mem; if it doesn't, the join spills: PostgreSQL partitions both sides by the join key and processes one pair of partitions at a time. Spilling is visible in the plan as Batches > 1 on the Hash or Hash Join node, and the MyDBA analyzer rule hash_batches_spill fires on it.
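The two phases map onto a few lines of Python. This is a single-batch sketch; a real Hash Join also handles multi-batch spills, NULL semantics, and non-inner join types:

```python
# Build phase: hash the smaller (filtered) side on the join key.
orders = [(101, "pending"), (102, "pending")]        # build side
items  = [(101, 2), (101, 3), (102, 1), (999, 4)]    # larger probe side

hash_table = {}
for order in orders:                  # one full pass over the build side
    hash_table.setdefault(order[0], []).append(order)

# Probe phase: stream the larger side through, emitting matches as they come.
result = [(o, it) for it in items for o in hash_table.get(it[0], [])]
assert len(result) == 3               # item 999 has no matching order
```

Note that neither input needs to be sorted, and the probe side is never materialised: rows flow through once.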
Here's the same count query the planner actually chose — a Parallel Hash Join:
Finalize Aggregate (actual time=1196.234..1199.894 rows=1 loops=1)
Buffers: shared hit=3827 read=6356
-> Gather (Workers Planned: 2, Workers Launched: 2)
-> Partial Aggregate (actual time=1179.014..1179.016 rows=1 loops=3)
-> Parallel Hash Join
(actual time=170.554..1143.676 rows=333333 loops=3)
Hash Cond: (oi.order_id = o.order_id)
-> Parallel Seq Scan on sim_bp_order_items oi
(actual time=1.589..703.241 rows=333333 loops=3)
-> Parallel Hash
Buckets: 524288 Batches: 1 Memory Usage: 23712kB
-> Parallel Seq Scan on sim_bp_orders o
(actual time=0.009..38.403 rows=166667 loops=3)
Filter: ((o.status)::text = 'pending'::text)
Execution Time: 1199.945 ms
1.2 seconds, 10,183 buffer pages touched — about 40× fewer than the forced Nested Loop. The planner built the hash table from sim_bp_orders (the smaller filtered side, 100k pending rows) and probed it with sim_bp_order_items. Batches: 1 means the hash table fit in work_mem entirely, so there was no spill.
Note the Parallel Seq Scan on both sides. That is not a planner mistake — when you're going to read every pending row anyway, a sequential scan is cheaper than an indexed scan because it avoids random I/O and plays nicely with read-ahead. Hash Join is perfectly happy to consume an unsorted stream.
The Parallel Hash Join is a newer variant (PostgreSQL 11+) where workers collaborate to build one shared hash table and then probe it in parallel. Under the hood, Parallel Hash coordinates the build; each worker contributes to it and then proceeds to scan its share of the probe side. This is why you see Workers Planned: 2, Workers Launched: 2 at the top and three loops in each node (one leader + two workers).
When Hash Join is suboptimal
Three cases:
Build side too large. If even the smaller input is several multiples of work_mem, hash-join spilling degrades performance sharply. The fix is either to raise work_mem (per-session, not cluster-wide), or to steer the planner toward a different strategy by creating an index. hash_batches_spill flags this in the analyzer output.
Probe side is tiny. If one input is five rows and the other is fifty million, Nested Loop into an indexed inner is cheaper than building any hash table. PostgreSQL's cost model handles this case correctly most of the time.
Both inputs already sorted. If both sides come out of index scans that produce rows in join-key order, Merge Join is strictly cheaper because it skips the hash build. The planner usually figures this out on its own when it sees the access paths.
Merge Join — both sides sorted
Merge Join walks two pre-sorted inputs in parallel, pairing rows with matching keys in a single pass. It's optimal when both inputs are already sorted on the join key — typically because both are served from index scans on the join column, or because the query itself requires an ORDER BY that aligns with the join key.
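The single-pass walk looks like this. It is a simplified sketch: duplicate keys on both sides simultaneously require the mark/restore logic the real executor implements, which is elided here:

```python
def merge_join(left, right, key=lambda r: r[0]):
    """Pair rows from two inputs that are pre-sorted on the join key."""
    out, j = [], 0
    for l in left:
        # Advance the right cursor past keys smaller than the current left key.
        while j < len(right) and key(right[j]) < key(l):
            j += 1
        # Emit every right row matching this key (without losing our place).
        k = j
        while k < len(right) and key(right[k]) == key(l):
            out.append((l, right[k]))
            k += 1
    return out

orders = [(1, "a"), (2, "b"), (4, "c")]        # sorted on order_id (unique)
items  = [(1, 10), (2, 20), (2, 21), (3, 30)]  # sorted on order_id
assert merge_join(orders, items) == [
    ((1, "a"), (1, 10)), ((2, "b"), (2, 20)), ((2, "b"), (2, 21))]
```

Each input is consumed exactly once, in order: no hash table to build, no repeated index descents.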
The planner picks Merge Join less often than you might expect, because:
- If one side is small and the other has an index on the join key, Nested Loop is usually cheaper.
- If neither side is sorted and both are large, Hash Join wins — sorting both sides just to merge them is rarely cost-effective.
- Merge Join's sweet spot is two large pre-sorted streams, which is often a signal that a materialised view or a pre-joined table would be cheaper still.
A canonical Merge Join shape:
SELECT o.order_id, oi.quantity
FROM sim_bp_orders o
JOIN sim_bp_order_items oi ON oi.order_id = o.order_id
ORDER BY o.order_id;
If both tables have indexes on order_id (they do — the primary key on orders and idx_sim_bp_order_items_order_id) and the ORDER BY forces ordered output, the planner may produce something like:
Merge Join
Merge Cond: (o.order_id = oi.order_id)
-> Index Scan using sim_bp_orders_pkey on sim_bp_orders o
-> Index Scan using idx_sim_bp_order_items_order_id on sim_bp_order_items oi
Single pass through both indexes, no hash build, no random access. When the prerequisites are met — both sides produced in join-key order — Merge Join is the cheapest option by a wide margin.
In practice you'll see Merge Join most often on joins with explicit ordering, or in the middle of larger plans where the planner noticed that an upstream node was already producing sorted output.
How the planner chooses
PostgreSQL's planner is cost-based. For each join, it enumerates the plausible strategies (Nested Loop, Hash Join, Merge Join, and each direction for each — which side is inner, which is outer) and picks the lowest-cost option. The cost model incorporates:
- Estimated row counts from both sides (crucially — if these are wrong, everything downstream is wrong).
- Whether each side has a useful index on the join column.
- Current `work_mem` — the planner knows whether a hash table will fit or whether it'll have to plan a spill.
- Whether inputs are already sorted (from index scans or prior sort nodes).
- The cost parameters: `random_page_cost`, `seq_page_cost`, `cpu_tuple_cost`, etc.
The single biggest cause of wrong-strategy joins is bad row estimates. If the planner thinks a side will produce 15 rows and it actually produces 150,000, it might pick a Nested Loop (optimal for 15) when a Hash Join (optimal for 150,000) would be 100× faster. The MyDBA analyzer rule row_estimate_inaccurate fires when the actual-to-estimated ratio exceeds 10× in either direction, and the fix is almost always ANALYZE on the affected table, or extended statistics if the bad estimate comes from a correlation the planner doesn't know about.
The second biggest cause is stale column statistics on correlated predicates. The planner assumes predicates are independent — if WHERE tenant_id = 7 AND region = 'eu' implies a much narrower row set than P(tenant_id=7) × P(region='eu'), the planner will underestimate and pick the wrong join strategy. Extended statistics (CREATE STATISTICS ... ON tenant_id, region FROM ...) are the specific fix.
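The underestimate is simple arithmetic. Suppose 1% of rows match tenant_id = 7 and 2% match region = 'eu', but every tenant-7 row happens to be in 'eu' (hypothetical numbers for illustration):

```python
n = 1_000_000
rows_tenant = 10_000    # 1% of rows have tenant_id = 7
rows_region = 20_000    # 2% of rows have region = 'eu'

# Independence assumption: the planner multiplies the selectivities.
estimated = rows_tenant * rows_region // n    # 200 rows

# Fully correlated reality: every tenant-7 row is in 'eu',
# so the narrower predicate alone determines the row count.
actual = rows_tenant                          # 10,000 rows

assert actual // estimated == 50              # a 50x underestimate
```

A 50× underestimate is more than enough to flip the planner from a Hash Join to a Nested Loop that performs terribly at the real cardinality.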
Join order: how PostgreSQL decides what to join first
In a three-way join A ⨝ B ⨝ C, there are several possible orders: (A ⨝ B) ⨝ C, A ⨝ (B ⨝ C), and if the join conditions allow it, (A ⨝ C) ⨝ B. For a fourth table you get a lot more permutations. PostgreSQL's planner searches through them.
The heuristic is: do the most selective joins first, so the intermediate result is as small as possible. A join that filters rows_A × rows_B down to 100 rows should happen before a join that would blow the intermediate to millions.
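The effect of join order on intermediate size is visible with made-up cardinalities (hypothetical estimates, not planner output):

```python
# Estimated output rows for each pairwise join in A JOIN B JOIN C.
intermediate = {
    ("A", "B"): 100,          # highly selective join: tiny intermediate
    ("B", "C"): 2_000_000,    # non-selective join: huge intermediate
}

# A crude cost proxy: the intermediate result each order must carry
# into the second join.
cost_ab_first = intermediate[("A", "B")]   # 100 rows then joined with C
cost_bc_first = intermediate[("B", "C")]   # 2M rows then joined with A

assert cost_ab_first < cost_bc_first       # (A JOIN B) JOIN C wins easily
```

The real planner's cost model is far richer than this, but the shape of the decision is the same: keep the intermediate results small.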
For queries with fewer than 12 tables, PostgreSQL uses dynamic programming to enumerate orders exhaustively. For 12+ tables, the planner switches to the Genetic Query Optimizer (GEQO) which uses heuristic search — sometimes producing non-optimal plans on complex joins. If you have a very wide query (12+ tables, complex conditions), tune geqo_threshold and from_collapse_limit or consider rewriting with explicit CTEs to split the problem.
A few practical levers when the planner picks a wrong join order:
- Add or fix indexes. A missing index on a join column often drives the planner to avoid that join until later, resulting in large intermediates. Indexing fixes it.
- `ANALYZE` recently. Stale row counts → bad estimates → bad join orders. Autovacuum handles this for active tables, but statistics are often out of date after a bulk load.
- Extended statistics. For correlated join keys, `CREATE STATISTICS` on the correlated columns.
- Rewriting to constrain the planner. `STRAIGHT_JOIN` doesn't exist in PostgreSQL, but you can force the order by using explicit `JOIN` syntax and setting `join_collapse_limit = 1`. Use sparingly — the cost model is usually right.
When join strategy doesn't matter — and what does
Sometimes the join strategy is correct and the query is still slow. The real costs are upstream:
- A slow sub-query or CTE feeding the join. The join isn't the problem; its input is. Diagnose by looking at the actual timing of each side.
- An expensive filter that prevents index use. If one side of the join is doing a sequential scan because of a non-sargable WHERE clause, the join strategy can't save you. See WHERE Clause Optimisation.
- Over-wide projections. `SELECT *` on a 400-column table passed through a join is expensive in row width; projecting only the columns you need tightens the whole pipeline.
When reading a multi-way join plan, resist the urge to focus on the outermost join. Instead, scan the leaves of the plan tree for the biggest actual rows × loops node — that's where the time is actually going.
Quick reference
| Outer size | Inner size | Inner indexed? | Inputs sorted? | Strategy |
|---|---|---|---|---|
| Small (≤1K) | Any | Yes | — | Nested Loop |
| Medium | Large | Yes | — | Nested Loop or Hash |
| Large | Large | — | Both | Merge Join |
| Large | Large | — | No | Hash Join (may spill) |
| Large | Large (build side > work_mem) | — | No | Hash Join with spill — raise work_mem or add an index |
A plan shape that should always prompt investigation:
- Nested Loop with outer rows > 1,000 and no Memoize cache → fires `nested_loop_large`.
- Hash or Hash Join with `Batches > 1` → fires `hash_batches_spill`; either raise `work_mem` or add an index so the planner can avoid the hash join.
- Any join where `row_estimate_inaccurate` fires on either side — fix statistics first, then re-examine the join.
Next steps
Joins are the category most affected by the quality of your WHERE clauses. The next article in the series covers WHERE Clause Optimisation — sargability, composite-index column ordering, and the operators that silently disable indexes. If your joins look right but the inputs to them are slow, that's almost always where the fix lives.
For the subquery/CTE patterns that sometimes appear in place of explicit joins (EXISTS, correlated subqueries, LATERAL), see Subquery & CTE Optimisation.
#postgres #performance #database #sql
Originally published at mydba.dev/blog/postgres-join-optimization.