JoongHyuk Shin

Posted on Jun 21

1.4.5 Join Order: Dynamic Programming vs GEQO

#postgres #database #internals #planner

Section 1.4.4 covered the three ways to combine two tables. But once a query joins three or more tables, another question appears: which two do you combine first, and what do you attach next? This order is not a matter of taste. It changes the size of the intermediate results, and that size drives the total cost.

The catch is that the number of possible orders explodes as tables are added. So the planner picks one of two search strategies depending on how many tables there are. When the count is modest, it uses dynamic programming to consider every order exhaustively and find the optimal one. When there are too many for that, it uses a genetic algorithm to quickly settle on a decent order. This section walks through how each works and when the planner switches from one to the other.

Join order drives the cost

Take three tables: customers (about 20,000 rows), orders (about 200,000 rows), and order_items (about 600,000 rows). Suppose a query pulls a small slice of Korean customers together with their orders and order items.

SELECT c.name, o.amount, i.product
FROM order_items i
JOIN orders o ON i.order_id = o.id
JOIN customers c ON o.customer_id = c.id
WHERE c.country = 'KR' AND c.id <= 50;

The WHERE clause filters only one side, customers, hard. Say 17 customers have an id of 50 or less and a country of KR. Starting from those 17, you fetch only a few hundred of their orders through an index, then follow the order items through an index too. The intermediate result stays small at every step.

Now suppose instead you follow the order written in FROM: combine the two biggest tables, order_items (600K) and orders (200K), first. That join inflates the intermediate result to hundreds of thousands of rows, and only after building that huge intermediate do you finally narrow it down to 17 customers. You do almost all the work, then throw it away at the end.

The planner normally picks the first plan. But if you force the order to the second, you can see directly what it costs. Setting join_collapse_limit (covered below) to 1 leaves the planner no choice but to follow the order written in FROM.

-- planner free to choose the order
 Nested Loop  (cost=1.01..823.63 rows=506 ...)
   ->  Nested Loop  (cost=0.58..720.87 rows=169 ...)
         ->  Index Scan on customers c  (rows=17)   -- starts from 17
         ->  Index Scan on orders o
   ->  Index Scan on order_items i

-- FROM order forced (join_collapse_limit = 1)
 Hash Join  (cost=3964.56..12647.70 rows=506 ...)
   ->  Hash Join  (rows=249200)                     -- big two joined first, 250K intermediate
         ->  Seq Scan on order_items i
         ->  Seq Scan on orders o
   ->  Index Scan on customers c

Same query, same result, and the cost jumps from 823 to 12,647, roughly 15 times. The only difference is which pair was combined first. Combine the small side first and keep intermediates small, and it is cheap; combine the big side first and build a huge intermediate, and the cost climbs. That is what it means for join order to drive the cost.

Why the candidates explode

If order matters, it seems like the planner should just try every order. For a handful of tables, it does exactly that. The problem is that the number of orders grows frighteningly fast with the table count.

For n tables, even counting only the simple shape that chains tables one after another (a left-deep tree, built by attaching one table at a time on the left), there are n! (n factorial) orders. Four tables give 24, but ten tables give more than 3.6 million. And the planner does not look only at single-chain shapes. It also considers bushy trees, where (A join B) and (C join D) are built separately and then joined, so the real count is larger still.

One more thing multiplies on top of this. As section 1.4.4 showed, each join pair has three methods (nested loop, hash, merge), and each input can be scanned several ways. So every single order carries its own combination of implementations. Computing all orders with all implementations precisely from start to finish would make the planner itself take longer than running the query.

So the key question becomes: how do you find a near-best order without examining every exploding candidate? PostgreSQL's answer is dynamic programming.

Dynamic programming: build from the small pieces

Dynamic programming (storing the answers to small subproblems and reusing them for larger ones) is built on never repeating the same computation.

Applied to join order search, it works like this. Suppose you join four tables t1, t2, t3, t4 connected in a neighbor chain. The set {t2, t3}, two tables combined, shows up again and again on the way to larger joins. Add t1 to {t2, t3} and you get {t1, t2, t3}; add t4 instead and you get {t2, t3, t4}, and both paths use the same {t2, t3} as their starting point. Recomputing the cheapest way to build {t2, t3} on every path would mean doing the same work many times over. So the best {t2, t3} is computed once and stored, then pulled out whenever that set is needed instead of being recomputed.

PostgreSQL stores this by level. Level 1 is single tables, level 2 is pairs, level 3 is triples, and so on. It fills level 1 first, combines level 1 to make level 2, combines level 1 and level 2 to make level 3. At each level, for a given set of tables, it keeps only the cheapest of the various ways to build it. Climbing one step at a time, it finishes when the final level holds the single set of all tables joined.

How the join clauses are wired cuts down the sets built at each level. If t1, t2, t3, t4 form a chain with join conditions only on neighboring pairs, that is t1-t2, t2-t3, t3-t4, then the only meaningful level-2 sets are the ones directly linked by a join clause: {t1,t2}, {t2,t3}, {t3,t4}.

It is worth pinning down why {t1,t3} is missing. t1 and t3 are not neighbors, so no join condition directly links them. To combine two tables with no condition, you have to pair every row on one side with every row on the other. A join like this, multiplying all rows with no condition, is called a cartesian product. If t1 and t3 have 10,000 rows each, the result is 100 million rows, a huge intermediate that is obviously going to be thrown away later. The planner does not bother building such a useless set; it builds only sets directly linked by a join clause.

There is one caveat, though. If the join conditions equate the same value, as in t1.x = t2.x and t2.x = t3.x, PostgreSQL derives t1.x = t3.x from them automatically. A group of columns tied together by the same value like this is called an equivalence class. In that case a join condition does arise between t1 and t3 as well, and {t1,t3} is no longer a cartesian product. So saying {t1,t3} is dropped assumes t1 and t3 are linked by different keys with no shared value. The customers, orders, order_items chain from earlier is exactly that kind of chain. customers hangs off orders by customer_id, and order_items hangs off orders by order_id; because these are different keys, no condition is derived between customers and order_items. So the set that joins those two ends directly would be a cartesian product and is never built.

At level 3, one more table is attached to these level-2 sets to make {t1,t2,t3} and {t2,t3,t4}, and at level 4 they merge into {t1,t2,t3,t4}.

Because the best of each set is computed once and stored per level, this finishes with far less work than computing every possible order separately from scratch. And since every valid combination is still examined at each level, the result is the same optimal plan you would get from trying every order one by one. Guaranteeing the optimum at a low cost is the reason for using dynamic programming.

When there are too many tables: GEQO

The number of sets dynamic programming stores grows with the number of table subsets it can build. In the worst case, where every table is wired to every other by a join condition, the number of buildable subsets reaches 2^n (n tables have 2^n subsets). When the join conditions are sparse, as in the chain join above where sets like {t1,t3} drop out, the number actually built is far smaller. Either way, though, once the table count grows large enough the scale becomes unmanageable. 2^n is 1,024 at 10 tables but passes a million at 20. So dynamic programming, too, eventually explodes when there are enough tables and the joins are dense.

PostgreSQL draws a line here. When the number of tables whose order must be decided at once is geqo_threshold (default 12) or more, it sets dynamic programming aside and switches to GEQO. GEQO is genetic query optimization, a join order search that uses a genetic algorithm. The source comments liken the problem to a constrained Traveling Salesman Problem. Just as that problem looks for the shortest route visiting each city once, join order looks for the cheapest sequence in which to weave the tables together. The traveling salesman problem is a classic hard problem where the number of possible routes explodes as cities are added, making an exact answer hard to find quickly. So instead of an exact search over every route, various approximation techniques that quickly close in on a good-enough answer have been studied. The genetic algorithm is one of them, and GEQO borrows that approximation for the join order problem.

A genetic algorithm does not compute the answer from start to finish; it mimics biological evolution to inch toward better answers. It first builds several random join-order candidates into a population (pool). How good each candidate is gets scored by the estimated cost of joining in that order; the lower the cost, the fitter the individual. Then over successive generations, it picks two strong candidates and mixes their orders to produce a child (crossover), occasionally twists part of one at random (mutation), and lets new candidates push out the weaker existing ones. It repeats this for a set number of generations and reports the best order it found.

One thing to be clear about: what GEQO narrows is the search space of join orders, not the implementation method of each join. Whatever order it tries, the individual joins inside it still consider nested loop, hash, and merge and pick the cheapest, exactly as dynamic programming does.

What this trade buys and gives up shows in the planning time. Here is a 12-table join planned both ways in the same environment.

-- dynamic programming (geqo = off)
 Planning Time: ~500 to 1000 ms      total cost 346.50

-- GEQO (geqo = on)
 Planning Time: ~50 ms               total cost 346.50

Planning time dropped to between a tenth and a twentieth. The interesting part is that in this example the cost of the plan GEQO found (346.50) is identical to what dynamic programming found. It reached the same optimum without examining every order. That is not always guaranteed, though. Because GEQO does not scan the whole space, an unlucky run can settle on a higher-cost order. So the accurate phrasing is "GEQO can be worse," not "GEQO is worse." For a huge join of 12 or more tables where the planning time itself becomes a problem, accepting a slight quality risk to finish planning fast is the reasonable call, and that judgment is what this threshold encodes.

Exhaustive search when there are few tables, and a switch to approximation or heuristics past a certain count, is not unique to PostgreSQL. Working on join order search in other engines, I ran into the same fork in the road. What differs is only where the switchover threshold sits, and whether the approximation is a genetic algorithm or some other method. Some engines set a time budget for planning instead of a table count as the threshold. Within that budget they search the order as best they can, and when the time runs out they stop with the best order found so far. The approaches differ, but the premise is the same in every engine: an exhaustive search over orders inevitably collapses at a certain scale.

The knob that sets the search set: collapse limit

So far I have lumped everything under "joining n tables," but this n is not always equal to the number of tables written in the SQL. That is because a setting separately limits the size of the table set the planner shuffles at once.

Before searching the order, the planner first tries to flatten the join structure written in the SQL into a single flat list of tables. For example, write a JOIN b JOIN c and it treats this as "one set of three tables, a, b, c, whose order may be shuffled freely." Only after flattening it this way can dynamic programming or GEQO change the order at will inside it. This flattening is the collapse in the setting names, the act of collapsing a nested JOIN structure into one list.

join_collapse_limit (default 8) sets the maximum size of this flattened set. If the explicit JOINs chain up to 8 tables, they all flatten into one set whose order is searched freely. From the 9th onward, flattening stops. The tables left unflattened cannot be merged into one set, so the order written in the SQL is frozen and the planner cannot change it.

What this means is clearest with a real query. Look at this query joining nine tables with JOIN.

SELECT * FROM t1
JOIN t2 ON t1.id = t2.id
JOIN t3 ON t2.id = t3.id
JOIN t4 ON t3.id = t4.id
JOIN t5 ON t4.id = t5.id
JOIN t6 ON t5.id = t6.id
JOIN t7 ON t6.id = t7.id
JOIN t8 ON t7.id = t8.id
JOIN t9 ON t8.id = t9.id;

Explicit JOINs bind from the left, so this parses as ((((((((t1 JOIN t2) JOIN t3) JOIN t4) ... ) JOIN t8) JOIN t9). The planner flattens this from the inside out, and since join_collapse_limit is 8, it hits the limit the moment it flattens t1 through t8, eight tables. Flattening the ninth, t9, would make the set 9 and exceed the limit, so it stops without flattening t9. The picture the planner then sees is two pieces: {the t1..t8 set} JOIN t9. The eight tables t1 through t8 search their order freely within one set to find the optimum, but t9 is fixed to attach last, after that set is fully built. The planner never tries inserting t9 somewhere among t1 through t8. At the point where the limit was exceeded, the order written in the SQL is frozen.

The size of this flattened set is exactly the table count that went into dynamic programming and GEQO in the previous section, and the number compared against geqo_threshold (12).

from_collapse_limit (default 8) does the same thing for subqueries. When a subquery sits inside a FROM clause, it limits in the same way whether that subquery is pulled up into the outer query and merged into one set.

A subtle interaction arises here. geqo_threshold is 12, but join_collapse_limit is 8. Even if you write a lot of explicit JOINs, a set is cut off at 8, so even writing more than 12 tables may never reach the GEQO threshold of 12. To get to GEQO, you usually have to raise join_collapse_limit as well. Raising join_collapse_limit to 12 in the previous section's 12-table experiment was precisely to flatten the 12 into one set and compare dynamic programming and GEQO under the same conditions.

Set join_collapse_limit to 1 at the extreme, and since only one table fits in a set, no join flattens at all. The planner cannot change the order at all and follows the order written in FROM and JOIN literally. Inflating the 3-table join cost by 15 times earlier was exactly this setting. It turns off order search entirely, taking the planner's most powerful weapon away.

What this means in practice

First, join order is almost always best left to the planner, and lowering join_collapse_limit is a last resort. Sometimes, thinking "I want to force the optimal order I know," people set join_collapse_limit to 1, but this turns off the planner's cost-based order search entirely. As shown above, the same query's cost can jump 15 times. The planner reads the statistics, judges which side is smaller, and combines the small side first; an order frozen by hand stays frozen even when the data changes. Before touching this setting, confirm with EXPLAIN that the planner is actually choosing the wrong order.

Second, if planning time suddenly grows on a large join of 12 or more tables, suspect GEQO and the collapse limit together. Query processing splits into planning and execution, and EXPLAIN (ANALYZE)'s Planning Time shows the planning time separately from execution. If that value is abnormally large, it is a sign that a large join is being handled by dynamic programming rather than GEQO. Either geqo is off, or geqo_threshold has been raised above the default so that even a large set falls short of the threshold and dynamic programming scans the huge space. One easily confused point: raising join_collapse_limit does not usually make planning slower. Once the set grows to geqo_threshold (12) or more, as long as geqo is on it actually switches to GEQO and planning speeds up. In a measurement, the same 14-table join took about 10 seconds with dynamic programming but dropped to about 60 milliseconds once it switched to GEQO. On the other hand, if GEQO is on and the same query produces a different plan on every run, remember that GEQO is not an exhaustive search over all orders but a randomized approximation.

Third, when a join is slow, first tell apart whether the culprit is the order or the method. Faced with a slow join query, it is easy to look first at the join method from section 1.4.4 (nested loop or hash), but the real cause is often the order. In EXPLAIN, if the estimated row count (rows=) at the intermediate nodes keeps inflating as you go up, that is a sign the plan is built in an order that produces large intermediates. Had it combined the small side first to shrink intermediates, the row count would stay small going up. When the order is the problem, it is usually because stale statistics made the planner misjudge which side is smaller, so refreshing the statistics with ANALYZE comes before forcing the order by hand. How statistics produce row estimates is covered in 1.4.8.