JoongHyuk Shin

Posted on Jun 2 • Edited on Jun 21

1.4.1 Path Tree vs Plan Tree: What's the Difference

#postgres #database #internals #planner

The 1.4 chapter overview already noted that the planner's internals are split into two layers, Path and Plan. Path is the floor where candidates compete on cost. Plan is the form the cheapest candidate takes once it has been frozen into something the executor can actually run. This section looks at what each of the two layers holds, why all of it isn't packed into a single structure instead of being split apart, and how one layer is handed off to the next.

A Path tree holds, side by side, the different routes that lead to the same result

The moment the planner receives a query, it sets up a separate workspace for each rel the query touches. A rel is the name PG uses for a set of rows. A table the query references directly, or a sub-query, or any single item appearing in FROM, is called a base rel. A rel that is constructed from other rels, like the result of joining two rels, is called a join rel. The formal name for the workspace the planner builds for one rel is RelOptInfo. In a single-table query like SELECT * FROM users WHERE id = 1, one RelOptInfo is built for the base rel users. In a two-table join, three RelOptInfos are built: one for each of the two base rels, and one for the join rel that combines them.

Inside the workspace is a list that collects the path candidates that can be built for that rel. The list is called the pathlist. For example, if users has a single primary key index and no other secondary indexes, the RelOptInfo for users looks roughly like this (the cost and rows numbers are arbitrary examples for illustration).

RelOptInfo (workspace for users)
├── pathlist (candidate list)
│   ├── SeqScan path        (cost: 100, rows: 1000, sort: none)
│   └── IndexScan path      (cost: 5,   rows: 1,    sort: id ascending)
├── baserestrictinfo (WHERE conditions attached to this table)
│   └── RestrictInfo: id = 1
└── (statistics, join info, etc.)

There are two routes leading to the same result, and the pathlist holds both at once. Neither is correct in advance. The planner keeps both candidates around until one of them clearly beats the other.

Dominance check

The pathlist does not take in every possible path. PG runs a cleanup before a new candidate is registered. This cleanup is called the dominance check.

The rule is simple. The check compares five items.

Cost: cheaper or equal
Sort order: more useful or equal (for example, if the upper stage wants the result sorted by id, a path that already comes out sorted by id is more useful)
Row count: fewer or equal
Outer dependency: depends on fewer external rels, or the same
Parallel safety: at least as safe to run in a worker process

If the new candidate is better than or equal to the existing one in all five items, that is, if it is no worse than the existing one in all five, the new candidate is said to dominate the old one. The old candidate is then yanked out of the pathlist. Conversely, if the existing candidate is no worse than the new one in all five items, the new candidate is not registered at all.

If either side wins on at least one item, both survive. For example, if the SeqScan path is cheaper but the IndexScan path has a more useful sort order, one wins on cost and the other wins on sort. Neither completely dominates the other, so both stay in the pathlist. Which one gets picked in the end depends on what the upper stage wants, like whether an ORDER BY is in play or a MergeJoin is sitting above.

The function that wraps this check is add_path. Paths that lose dominance are freed from memory right there.

Disabled candidates lose before cost even gets compared

There is one more rule inside the comparison. Whether a candidate is disabled is checked before cost is.

What does "disabled candidate" mean here? PG has GUCs like enable_seqscan, enable_indexscan, enable_nestloop. When an operator turns one of these to off, PG does not stop making that kind of path. It makes the path normally, but stamps it once with a note that says "use this only as a last resort." That stamp lives inside the path node as a counter called disabled_nodes. When a seq scan path is built with enable_seqscan = off, that path's disabled_nodes is 1, while the normal candidates sit at 0.

The dominance comparison looks at disabled_nodes first. No matter what else is going on, whichever side has fewer disabled_nodes wins outright. Only when they tie does the comparison move on to cost, sort order, and the rest.

Here is how this plays out in real operation. With enable_seqscan = off, the seq scan path gets disabled_nodes = 1, and as long as even one normal candidate is alive at disabled_nodes = 0, that normal candidate wins unconditionally. The seq scan almost always loses. The word almost matters. If there is no other candidate, or every candidate has been bumped to disabled_nodes = 1 for the same reason, the seq scan rises back up. The enable_* GUCs are not hard "you can never use this" switches; they are soft "if there is another way, prefer it" hints.

The pathlist itself is kept sorted first by disabled_nodes and then by total_cost. The least disabled candidate, and among those the cheapest, is always at the front.

Joins

Joins repeat the same mechanism one level up. Once the pathlists for two base rels are filled, a RelOptInfo for the join workspace is built that combines them, and one join path each for NestLoop, MergeJoin, and HashJoin is registered into that join workspace's pathlist.

A join path is a single node, but it points to two children. One child path is held as outer, the other as inner. When joining table a and table b, a NestLoop join path looks roughly like this.

NestLoop join path (a ⨝ b)
├── outer child → IndexScan path on a
└── inner child → SeqScan path on b

In a query that joins three tables a, b, and c, join paths stack up like this.

HashJoin path ((a ⨝ b) ⨝ c)
├── outer child → NestLoop path (a ⨝ b)
│                ├── outer → IndexScan path on a
│                └── inner → SeqScan path on b
└── inner child → SeqScan path on c

Once paths start pointing at other paths as children, a single path sitting in the final join workspace's pathlist is itself an entire tree. When the book talks about a Path tree, this shape is what it means.

A Plan tree holds only the one survivor

After the planner has filled every workspace's pathlist, what is left at the end is the top-level join workspace that produces the entire query's result. The cheapest single tree sits at the front of that workspace's pathlist. The planner picks it as best_path.

This is where the next step begins. best_path is a data structure that only makes sense inside the planner. It is not a form the executor can take in and run directly. So the planner rewrites best_path into a form the executor can use. The result is the Plan tree.

The Plan tree is a single tree. There are no multiple candidates the way there were in the Path tree. It is built in roughly the same shape as the path tree it came from. What was called outer and inner in the path tree is called lefttree and righttree in the plan tree. The names are different, but they point to the same thing.

HashJoin plan        ← corresponds to HashJoin path in the path tree
├── lefttree → NestLoop plan      ← corresponds to NestLoop path
│              ├── lefttree → IndexScan plan  (a)
│              └── righttree → SeqScan plan  (b)
└── righttree → SeqScan plan      (c)

A Plan node directly holds three things.

Where rows come from: pointers (lefttree, righttree) to the child plan nodes.
What to filter by: the filter conditions applied at execution time (qual list).
What to output: the expression list of output columns (targetlist).

The executor walks this tree one node at a time, taking the rows its children hand up and applying its own work (filter, join, output) on top. In the diagram above, when SeqScan plans pull rows up off disk, the NestLoop above takes them in and performs the a-b join, and the HashJoin above takes that result and joins it with c.

Packing both into one structure means each stage drags dead fields

A natural question shows up here. If the path tree and the plan tree are nearly the same shape, why build both? Why not bolt cost info onto a single structure, do the comparisons, and hand that structure straight to the executor?

The answer is that the two stages need different information.

Planning needs comparison-side info. To run the dominance check, you have to know whether two candidates have the same sort order, whether they take the same external parameters, which workspace they belong to. A Path node therefore carries fields like pathkeys, which expresses sort order, param_info, which holds parameterization, and a back-link to its own workspace (RelOptInfo). These fields are useless the moment planning ends. The executor never compares candidates.

Execution needs runtime-side info. A Plan node carries the pointers it needs to find its children, the filter expressions it applies to the rows passing through, and the expressions for its output columns directly inside the node. The English field names are the ones you saw in the diagram (lefttree, righttree, qual, targetlist). These fields are useless during planning, because candidate comparison does not look at child pointers or filter conditions.

The PG optimizer README sums up this asymmetry in a single sentence.

"There is pretty nearly a one-to-one correspondence between the Path and Plan trees, but Path nodes omit info that won't be needed during planning, and include info needed for planning that won't be needed by the executor."

Packing both kinds of field into one structure makes both stages heavier. Planning's comparison loop drags dead execution fields through every node, leaking cache efficiency and memory. The executor's plan nodes carry dead comparison fields around forever.

There is one more reason. With two separate structures, a path that loses dominance can be freed right there. If a path were the same thing as a plan, you could not know in advance which candidate would eventually go to the executor, so you would have to keep every candidate alive to the end. With them separated, the moment planning ends you only need to convert the one best_path into a plan, and you can throw the rest away. In big queries where join candidates blow up, this difference shows up directly as memory cost.

create_plan freezes one best_path into a plan tree

The conversion from path tree to plan tree happens in a function called create_plan. It takes one best_path and walks the path tree, rewriting each node as the corresponding plan node.

The correspondence is one-to-one.

Path node	Plan node
SeqScan path	SeqScan plan
IndexScan path	IndexScan plan
NestPath	NestLoop plan
MergePath	MergeJoin plan
HashPath	HashJoin plan
SortPath	Sort plan

Each path carries its own kind in a pathtype field, and the conversion engine switches on that value to call the right per-type conversion function.

The conversion is not a plain copy. Information moves in three directions.

First, cost info is carried over as is. The startup_cost, total_cost, rows, and disabled_nodes that the path was holding are copied straight onto the plan. The cost numbers you see in EXPLAIN are these values brought over from the path side. The plan stage does not recompute them.

Second, qual is distributed out of RestrictInfo into individual plan nodes. Why this distribution is needed takes a little untangling. WHERE conditions and join conditions are not pinned directly to path nodes. Instead, each workspace (RelOptInfo) holds a list of WHERE conditions attached to its own table, each join path holds a list of its own join conditions, and each condition lives inside that list as a RestrictInfo object.

A RestrictInfo is not a plain wrapper. Along with the condition expression itself, it carries side information that, once computed, can be reused in many places. Which rels the condition references, the estimated selectivity, the evaluation cost, whether the condition is usable as a mergejoin or hashjoin clause, and so on. In other words, the condition expression plus a cache of the interpretation done over it is what makes up a RestrictInfo. When several candidate paths in the same workspace need to apply the same condition, they reuse the cached selectivity and eval cost from the RestrictInfo instead of recomputing every time. The wrapper exists not just for sharing the condition, but for reusing the interpretation.

A Plan node, however, has to hold its own qual directly. So create_plan walks the path tree in post-order and decides which RestrictInfo should descend into which plan node's qual. Given a condition like WHERE a.x = b.x AND a.y > 0, on the NestLoop join path between a and b, a.x = b.x goes down as the join qual, and a.y > 0 goes down to the SeqScan plan on a as its filter. Once distribution is finished, each plan node holds only the qual that applies at its own location.

Third, only for MergeJoin children does the plan stage actually add nodes.

In the general case, sort is already pinned in during the path stage. When something like ORDER BY or GROUP BY demands a sort, the planner inserts a SortPath into the path tree as a separate path node, and conversion just transforms that SortPath into a Sort plan one-to-one.

MergeJoin is the exception. Instead of placing a separate sort path on top of a child path, the MergeJoinPath carries the sort keys for each of its two children in its own fields (just the fact that "a sort is needed above this child"). At conversion time, the plan stage reads those fields and slips a Sort plan directly on top of the corresponding child plan.

So plan trees are mirror images of path trees almost everywhere, with MergeJoin's child-side sort as the one exception where the plan stage actually inserts a node.

When conversion is over, the plan tree has one root node with children hanging below it. The plan tree is then wrapped in PlannedStmt and handed to the executor. The detailed structure of PlannedStmt was covered in the 1.4 chapter overview.

What enters the conversion stage is a single tree, and the plan tree is nearly that single tree's mirror image. Paths that lost dominance had already been freed at registration time by add_path, and the other paths that survived but were not chosen as best_path are cleaned up along with the planner's context the moment planning ends.

What this means in practice

First, "why did the optimizer pick this plan" cannot be answered by EXPLAIN alone. The node tree EXPLAIN shows, with its SeqScan, IndexScan, NestLoop, is the plan tree, not the path tree. Only the single survivor is visible. Which candidates competed in the path stage for the same query, at what costs, and which paths fell out of dominance and disappeared leave no trace on the plan tree. The actual time from EXPLAIN ANALYZE, the output from EXPLAIN (VERBOSE), the per-node JSON from EXPLAIN (COSTS OFF, FORMAT JSON) all read fields off the plan nodes. They do not show the path-stage comparison results. To answer "why was this plan chosen" you cannot just stare at one EXPLAIN. You have to vary the input statistics and the GUCs, watch how the plan shifts, and infer the path-stage comparison backwards from those shifts.

Second, if the plan for the same SQL has changed, something in the path-stage comparison flipped. When the same query that was running fine starts coming back with a different plan and the performance shifts, the first place to suspect is the path-stage comparison. Statistics (pg_statistic) may have been refreshed and selectivity estimates moved; indexes may have been added or dropped so that new candidate paths appeared or old ones disappeared; cost parameters like random_page_cost may have changed, shifting the cost score of the same path. None of these change the plan tree directly. They flip dominance results inside the path stage, and once best_path ends up pointing to a different path, the plan that comes out the other side comes out differently shaped. In production debugging, replacing the question "why did the plan change" with "which path's cost estimate flipped" makes the trail much shorter.

Third, the diagnostic order when enable_seqscan = off does not change the plan. As seen earlier, enable_* GUCs are soft hints, so when no other normal candidate survives, the disabled path stays alive and ends up in the plan anyway. That is why, before pushing the GUC harder, you check whether a candidate is even alive. (a) Confirm that an index exists first, with \d+ <table> or pg_indexes. (b) Check last_analyze in pg_stat_user_tables to see whether the statistics are alive. (c) Compare the cost numbers in EXPLAIN against the other plan candidates to see which candidate is alive and why its cost is higher.