JoongHyuk Shin

Posted on Jun 21

1.4.9 Planner Preprocessing: Subquery Pull-up, Predicate Pushdown, Equivalence Classes

#postgres #database #internals #planner

So far we have seen the planner build candidate paths and rank them by cost, choose join methods and join orders, and estimate those costs from statistics. But before any of that cost comparison begins, the planner does something else first. It takes the incoming Query tree and rewrites it once into a shape that is easier to optimize. That rewriting is preprocessing.

In processing order, preprocessing is the planner's very first step. Yet why it pays off only becomes clear once you understand cost and joins. Why it is cheap to shrink the rows entering a join is explained by join cost, and why it matters to manufacture a missing join condition to avoid a Cartesian product is explained by join order. That is why a step that runs first in processing is the last one we look at in this book.

Preprocessing rewrites the Query tree before cost comparison starts

The planner's entry point runs preprocessing once for each Query it receives. The function that gathers this work is subquery_planner. The name makes it sound like it only deals with subqueries, but it actually does everything that should happen exactly once per Query object. Only after that preparation finishes does the cost-based candidate comparison begin.

One thing is worth pinning down here. The rewriter from chapter 1.3 also rewrites the Query tree, but its purpose is different from this preprocessing. The rewriter expands views into their contents, injects RLS policies as conditions, and fills in default values for columns omitted from an INSERT. In other words, it settles what the query actually means. Preprocessing does not touch that settled meaning at all. It leaves the rows that will come out exactly as they are, and rearranges only the shape of the tree so the same result can be produced more cheaply. If the rewriter finishes deciding "what to do," preprocessing is the first trimming of "how to do it."

Why rewrite at all? The Query tree the rewriter hands over still carries the shape the SQL author wrote. People write for readability, not for the planner's convenience. For clarity they wrap parts of the query in subqueries and pile several conditions into one WHERE clause. Leave that shape alone, and the cost and join comparison never sees the better candidates. A condition that only needs to filter one table tags along all the way to the join and inflates the join's input, and with no join condition spelled out in the SQL, the only join order available between two tables is a Cartesian product. Preprocessing tidies up this shape so the cost comparison gets to see all its candidates. The five sections that follow look at that tidying one piece at a time.

Pull subqueries up into the parent query

The first thing it does is pull subqueries in the FROM clause up into the parent query. Merging a subquery into the parent as a single level is called flattening, or subquery pull-up, in the sense of lifting the subquery upward. It runs in the opposite direction from the predicate pushdown we will see later, which pushes conditions down. Take this query as an example.

SELECT *
FROM (SELECT * FROM customers WHERE status = 'active') c
JOIN orders o ON c.id = o.customer_id;

The author wrapped the part that selects only active customers in a subquery to keep it readable. If the planner leaves this subquery as is, selecting active customers and joining with orders split into two stages. The inner subquery first produces an intermediate result of "active customers," and the outer query joins that result with orders. The problem is that because these two stages are processed separately, the planner cannot put customers and orders together into its join-order search. customers is trapped inside the subquery while orders sits outside, so the two belong to different processing units.

Flattening removes this boundary. It pulls the subquery up into the parent query, conceptually turning the query above into this shape.

SELECT *
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE c.status = 'active';

The SQL above is not text that actually gets re-emitted; it just renders the result of the tree transformation in SQL form. Now customers and orders are two tables in the same query, and the planner can put both into its join-order search to compare which one to read first and how to join them. The WHERE status = 'active' condition also moves outward and becomes a target for the predicate pushdown we will see shortly.

Not every subquery can be merged. If the subquery contains aggregation (GROUP BY or aggregate functions), window functions, DISTINCT, LIMIT, ordering, or HAVING, it is not merged. These features need the subquery computed as one closed result for its meaning to be preserved. For instance, if the subquery computes per-customer order totals with GROUP BY, mixing and merging it with the outer table breaks the aggregation grouping. So the planner only pulls up simple subqueries that produce the same result either way, and plans the rest as closed units on their own.

IN and EXISTS subqueries that appear in a WHERE clause or a JOIN condition are handled similarly. Conditions like WHERE customer_id IN (SELECT ...) or WHERE EXISTS (SELECT ...) are converted into joins when possible. Converting them lets the planner pick an order and method like any other join and optimize it, instead of checking the subquery separately for each outer row. This internal form of an IN or EXISTS turned into a join is called a semi-join, and it is not SQL syntax. The author does not write SEMI JOIN; the planner turns the IN or EXISTS into a kind of join internally.

Compute the expressions that can be reduced to constants ahead of time

Next the planner scans the expressions in the query and computes ahead of time the parts that can be computed early. This is called constant folding.

WHERE amount > 2 + 2     -- as written
WHERE amount > 4         -- after constant folding

2 + 2 is a constant with no reason to be re-added for each row, so it is reduced to 4 ahead of time. There are cases where the result is determined even with non-constant parts mixed in. For example, if some view or condition expands into a form like WHERE is_vip OR true (is_vip being a boolean column for whether a customer is a VIP), it is always true whether is_vip is true or false, so it reduces to WHERE true and the condition itself disappears. Conversely, if AND false creeps in, that condition is always false and gets tidied away entirely.

That said, not every function is computed ahead of time. A function that can return a different value each time it is called must not be precomputed. nextval(), which advances a sequence every time it is called, is the classic example. PostgreSQL precomputes a function only when it is marked immutable (the property of always returning the same output for the same input); otherwise it reduces only the arguments as far as possible and leaves the function call itself for execution time.

This step has two effects. One is that the executor stops repeating the same computation for every row. Instead of adding 2 + 2 a million times across a million rows, it reduces it to 4 once when the plan is built. The other is that with expressions simplified, the transformations that follow recognize conditions more easily. For instance, the equivalence class we will see at the end of this section looks for "column = constant" conditions like region_id = 5 to spread to other tables, and if the expression sits unreduced as region_id = 2 + 3, it fails to recognize that as a constant comparison and misses the chance to spread it. Constant folding has to reduce 2 + 3 to 5 ahead of time for the next step to work.

Turn an outer join into a simpler inner join

An outer join costs more than an inner join. First, the difference between the two. An inner join produces only rows that have a match in both tables. An outer join, specifically a LEFT JOIN, keeps every row of the left table and, where there is no match on the right, manufactures a row with the right columns filled with NULL. This NULL-filling adds work and narrows the planner's freedom to reorder joins. Yet there are cases where a LEFT JOIN the author wrote produces exactly the same result as an inner join. The planner finds those cases and turns the outer join into a simpler inner join. This does not mean an inner join ranks above an outer join; it means simplifying toward the side without the extra work of NULL-filling.

The key to the decision is the WHERE condition applied to the finished join. In SQL, WHERE applies to the entire result after FROM and JOIN have been processed.

SELECT *
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
WHERE o.amount > 100;

The LEFT JOIN was used to keep every customer. But the join result has WHERE o.amount > 100 on it. A row where orders is NULL-filled because there is no matching order has o.amount as NULL, and NULL > 100 is not true, so that row is dropped by WHERE. In other words, the NULL-filled rows cannot survive into the final result anyway. So there is no reason to do the NULL-filling in the first place. The planner turns this LEFT JOIN into an inner join. Conceptually the query above becomes this.

SELECT *
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE o.amount > 100;

For this reasoning to hold, the WHERE condition has to be the kind that "can never be true once NULL comes in." o.amount > 100 is such a condition. If amount is NULL the comparison cannot be true, so the NULL-filled rows are guaranteed to be filtered out. (PostgreSQL calls an operator that cannot be true on NULL input a strict operator. Most comparison operators, =, >, <, and so on, qualify.) If the condition were o.customer_id IS NULL instead, it is the opposite. customer_id is the join key, so it is NULL only in rows that were NULL-filled for lack of a match. Letting only those rows through means "select only customers with no orders," which the planner recognizes as an anti-join. Like the semi-join, the anti-join is not SQL syntax but a kind of join internal to the planner.

The reason simplifying outer joins matters is that the fewer outer joins there are, the more freely the planner can shuffle the join order.

Push each condition down to the table it references

When you write several conditions joined by AND in a WHERE clause, they sit bundled together at the top of the query in the SQL text. But each of those conditions actually touches a different table. Some look at only one table, some at two together. The planner pushes each condition down toward the table it actually references, so it applies right where that table is read. This transformation is called predicate pushdown, and internally PostgreSQL calls it qual distribution. qual is short for qualification, PostgreSQL's shorthand for the filter conditions that go into a WHERE or ON.

SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.region_id = 5;

c.region_id = 5 references only customers. Treat this condition literally as "a filter on the joined result," and it filters by region only after orders and customers are fully joined. The rows to be discarded were dragged all the way to the join. Instead the planner pushes this condition down to where customers is read. Conceptually it becomes this.

SELECT *
FROM orders o
JOIN (SELECT * FROM customers WHERE region_id = 5) c ON o.customer_id = c.id;

The "after" above does not mean the tree actually turns into a subquery; it renders in SQL form the fact that region_id = 5 is applied first where customers is read. customers narrows to only the customers in region 5 first, and only those few customers join with orders.

The difference is clear in numbers. If there are 100,000 customers nationwide and 5,000 in region 5, then without pushing the condition down, all 100,000 customers are joined with orders and then filtered. Push it down, and only 5,000 enter the join. A join costs less the smaller its input, so the same result comes at far less cost.

As a result of this distribution, each table's workspace (RelOptInfo) accumulates a list of conditions that apply only to that table. Each condition goes into the list wrapped in an object called a RestrictInfo, whose additional contents the earlier section on the path tree covered. A join condition that references two tables together is handled not at one table but at the join, and among those, an equality join condition like o.customer_id = c.id gets one more level of special handling. That handling is the very last transformation we will see next.

Pull out missing conditions by grouping equal values together

The last transformation fills in missing conditions by grouping equality conditions together. When several = conditions hang off a WHERE clause, the planner gathers them into a group of values known to be equal to each other. This group is called an equivalence class.

The principle is simple. If a = b and b = c, then a = c is also true. The planner can pull out this a = c, which is nowhere in the SQL, and use it as a new join condition.

Why this is useful shows up in join order.

SELECT *
FROM orders o, customers c, regions r
WHERE o.region_id = c.region_id
  AND c.region_id = r.id;
-- conditions written by the author: orders-customers, customers-regions
-- condition pulled out by the planner: o.region_id = r.id  (orders-regions)

The author wrote only the orders-customers condition and the customers-regions condition. There is no condition directly linking orders and regions. If the planner used only the written conditions, then trying to join orders and regions first, with no condition linking them, the only thing it could make is a Cartesian product. A Cartesian product is a join that produces every combination of rows from two tables, so 100,000 orders and 100 regions yield ten million rows. Almost always a disaster, so the planner discards that join order. But if the equivalence class pulls out o.region_id = r.id, a direct join between orders and regions becomes possible. A join order that would have been discarded comes back as a candidate.

Something stronger happens when a constant joins the group.

SELECT *
FROM orders o, customers c
WHERE o.region_id = c.region_id
  AND c.region_id = 5;
-- group: {o.region_id, c.region_id, 5}
-- pulled out: o.region_id = 5,  c.region_id = 5

Once the group becomes {o.region_id, c.region_id, 5}, the planner applies o.region_id = 5 and c.region_id = 5 right where each table is read. o.region_id = c.region_id was originally a condition you could only check by joining the two tables, but after the transitivity, each table is filtered by = 5 ahead of time. The predicate pushdown from the previous section ends up applying to both tables on its own. There is no longer any need to check o.region_id = c.region_id separately at the join.

This transitivity does not hold everywhere, though. Below an outer join, the NULL-filling makes a derivation like = 5 unsafe. So the planner applies transitivity only within a group of tables inner-joined to each other, and handles the region an outer join creates separately.

Beyond these five transformations, there are smaller tasks bundled into the same preprocessing step. Turning a MERGE command into a join form, inlining function-form FROM entries into the body where possible, and flattening a simple UNION ALL into one. They all share the same goal of producing a shape the planner can work with.

What this means in practice

First, "why is this condition I wrote here applied over there" is because of preprocessing. In EXPLAIN you sometimes see a condition you wrote in WHERE show up in an odd place. It is attached to a Filter or Index Cond on a table scan node rather than at a join, or a condition you never wrote appears. That is the result of predicate pushdown pulling the condition down to a table scan, or an equivalence class pulling out a condition that was not in the SQL. When you read the per-node Filter and Index Cond in EXPLAIN, knowing that the position is not where you wrote it but where the planner pushed or derived it makes the plan much easier to read.

Second, wrapping something in a subquery does not change performance. Whether you wrap for readability or write it flat, a simple subquery gets merged into the same query by flattening, so the plan is the same. However, a subquery with aggregation, DISTINCT, LIMIT, or ordering is not merged, creating a boundary across which the planner cannot weigh the inside and outside together. If a subquery boundary in a performance-critical query seems to have trapped the plan unintentionally, suspect something inside that blocks flattening. A DISTINCT habitually attached with no effect on the result, or an unnecessary LIMIT, is a common culprit.

Third, an equality join condition fills in by transitivity even if you write only one side. Writing just o.region_id = c.region_id AND c.region_id = r.id lets the planner pull out o.region_id = r.id, so you do not need to spell out a join condition for every pair of tables. But this is limited to equality (=) conditions. Range conditions (o.amount < c.credit) and inequalities are not propagated. If a join is not linked by equality and a Cartesian product creeps into the plan and slows the query down, that is a sign of a structure transitivity cannot fill. The fix is to link the missing join path directly with an equality condition.