Every SELECT in PostgreSQL is made of smaller SELECTs, even when it doesn't look that way. WHERE col IN (SELECT ...), WHERE EXISTS (SELECT ...), (SELECT count(*) FROM ... WHERE ...) in the column list, WITH x AS (SELECT ...) — these look syntactically different but all get rewritten into plan nodes at plan time. Which plan node the planner chooses determines whether your query runs in three milliseconds or three seconds, and the rules are different for each pattern.
This is part of the Complete Guide to PostgreSQL SQL Query Analysis & Optimization. Assumes you can read EXPLAIN output and are familiar with how the planner chooses join strategies. Running dataset: 500k-row sim_bp_orders, 200k-row sim_bp_users, on Neon Postgres 17.8.
We'll cover: scalar and existence subqueries (SubPlan, EXISTS, IN), when correlated subqueries should be rewritten as joins, how CTEs are executed on modern PostgreSQL, when to use MATERIALIZED vs NOT MATERIALIZED, LATERAL joins, and recursive CTEs.
Scalar correlated subqueries — the SubPlan trap
A scalar subquery in the column list is the easiest way to accidentally write an O(n²) query:
SELECT u.user_id,
u.email,
(SELECT count(*)
FROM sim_bp_orders o
WHERE o.user_id = u.user_id
AND o.status = 'pending') AS pending_count
FROM sim_bp_users u
WHERE u.status = 'active'
LIMIT 100;
The query reads naturally: "for each active user, count their pending orders." The plan is what that description implies:
Limit (cost=0.42..1642.83 rows=100 width=33) (actual time=0.088..3.438 rows=100 loops=1)
Buffers: shared hit=565 read=3
-> Index Scan using sim_bp_users_pkey on sim_bp_users u
(cost=0.42..3118066 rows=189807 width=33)
(actual time=0.087..3.433 rows=100 loops=1)
Filter: ((u.status)::text = 'active'::text)
SubPlan 1
-> Aggregate (cost=16.24..16.25 rows=1 width=8)
(actual time=0.033..0.033 rows=1 loops=100)
-> Bitmap Heap Scan on sim_bp_orders o
(actual time=0.032..0.033 rows=0 loops=100)
Recheck Cond: (o.user_id = u.user_id)
Filter: ((o.status)::text = 'pending'::text)
-> Bitmap Index Scan on idx_sim_bp_orders_user_id
(actual time=0.029..0.029 rows=3 loops=100)
Index Cond: (o.user_id = u.user_id)
Execution Time: 3.444 ms
Two signals. First, the SubPlan 1 node is inside the outer index scan — it runs once per outer row. actual time=0.033..0.033 rows=1 loops=100 tells you the subquery was executed 100 times (once per user returned). With LIMIT 100 it's cheap; without the limit, it would run 200,000 times and that's six seconds of just-subquery time before any other work.
Second, SubPlan N in a plan is a heads-up that the query is executing per-outer-row work, which is almost always worth rewriting — either as an aggregating JOIN or a correlated aggregate pushed into a LATERAL. Both rewrites scale better as the outer set grows.
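One way to sketch the aggregating-JOIN rewrite (same tables as above; note that with a tight outer LIMIT the SubPlan version can still win, but this shape scales as the outer set grows):

```sql
SELECT u.user_id,
       u.email,
       COALESCE(p.pending_count, 0) AS pending_count
FROM sim_bp_users u
LEFT JOIN (
    -- Aggregate once over all pending orders, not once per user.
    SELECT user_id, count(*) AS pending_count
    FROM sim_bp_orders
    WHERE status = 'pending'
    GROUP BY user_id
) p ON p.user_id = u.user_id
WHERE u.status = 'active'
LIMIT 100;
```

The LEFT JOIN plus COALESCE preserves users with zero pending orders, matching the scalar subquery's behaviour of returning 0 rather than dropping the row.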
EXISTS, IN, and JOIN — three ways to express "filter by related rows"
For the "find rows that have at least one related row" pattern, SQL offers three syntactic choices. They don't all produce the same plan.
Rewriting the earlier query as an EXISTS — asking a boolean question, "find users who have at least one pending order":
SELECT u.user_id, u.email
FROM sim_bp_users u
WHERE u.status = 'active'
AND EXISTS (
SELECT 1 FROM sim_bp_orders o
WHERE o.user_id = u.user_id
AND o.status = 'pending'
)
LIMIT 100;
Limit (cost=0.85..137.98 rows=100 width=25) (actual time=0.089..3.238 rows=100 loops=1)
Buffers: shared hit=726 read=1
-> Merge Semi Join (actual time=0.088..3.234 rows=100 loops=1)
Merge Cond: (u.user_id = o.user_id)
-> Index Scan using sim_bp_users_pkey on sim_bp_users u
Filter: ((u.status)::text = 'active'::text)
-> Index Scan using idx_sim_bp_orders_user_id on sim_bp_orders o
Filter: ((o.status)::text = 'pending'::text)
Rows Removed by Filter: 586
Execution Time: 3.240 ms
The planner picked a Merge Semi Join — stops at the first match per outer row. That's exactly what EXISTS semantics require. Both sides come in user_id-ordered streams (left from the users primary-key btree; right from idx_sim_bp_orders_user_id with status='pending' as a filter), and the merge walks them in lockstep. No per-outer-row SubPlan, no re-execution. The planner doesn't always pick Merge Semi Join — a Nested Loop Semi Join with an index probe is also common, especially with a tight outer LIMIT. Both shapes scale linearly; the SubPlan pattern was quadratic.
IN (SELECT ...) is a third way. Most of the time PostgreSQL treats WHERE col IN (SELECT ...) and WHERE EXISTS (SELECT ... WHERE ... = col) identically, producing the same plan. Two gotchas:
- `NOT IN` with nullable columns is not equivalent to `NOT EXISTS`. If any value in the inner set is NULL, `NOT IN` returns unknown (effectively no rows). Always prefer `NOT EXISTS` unless you've proven the column is NOT NULL.
- `IN` on a literal list (`WHERE id IN (1, 2, 3)`) is a different beast — syntactic sugar for `= ANY (ARRAY[1,2,3])`, nothing to do with subqueries.
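A minimal demonstration of the `NOT IN` null trap, runnable in any psql session without the article's tables:

```sql
-- One NULL in the inner set makes NOT IN return nothing:
SELECT 1 AS got_row
WHERE 2 NOT IN (SELECT unnest(ARRAY[1, NULL]));
-- zero rows: 2 <> NULL is unknown, so the whole predicate is unknown

-- NOT EXISTS is unaffected by NULLs in the inner set:
SELECT 1 AS got_row
WHERE NOT EXISTS (
    SELECT 1 FROM unnest(ARRAY[1, NULL]) AS v WHERE v = 2
);
-- one row
```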
An explicit JOIN works too, but duplicates outer rows for each matching inner row:
SELECT DISTINCT u.user_id, u.email
FROM sim_bp_users u
JOIN sim_bp_orders o ON o.user_id = u.user_id
WHERE u.status = 'active' AND o.status = 'pending';
The DISTINCT is required because a user with five pending orders would appear five times. Usually slower than EXISTS (produces all matching rows then distincts them down), and you have to remember the DISTINCT. Use EXISTS for existence questions, JOIN for data you actually want from the related table.
Rule of thumb:
- Count of related rows → aggregating subquery or aggregating JOIN with GROUP BY.
- Existence → `EXISTS`. (Non-existence → `NOT EXISTS`.)
- Data from related rows → regular JOIN.
- First/last/top-N of related rows per outer → LATERAL (below).
LATERAL — top-N per group without window functions
A LATERAL join lets a subquery on the right side of a FROM reference columns from the left side: "for each row on the left, evaluate this subquery with those columns bound, and join the result." The SQL-standard way to express "the latest order per customer," "the most recent status message per ticket" — any top-N per outer group.
SELECT u.user_id, u.email,
latest.order_id,
latest.total_amount_cents
FROM sim_bp_users u
CROSS JOIN LATERAL (
SELECT order_id, total_amount_cents
FROM sim_bp_orders o
WHERE o.user_id = u.user_id
ORDER BY o.created_at DESC
LIMIT 1
) latest
WHERE u.status = 'active'
LIMIT 50;
"For each active user, return their most recent order."
Nested Loop (actual time=0.016..0.452 rows=50 loops=1)
Buffers: shared hit=314
-> Index Scan using sim_bp_users_pkey on sim_bp_users u
Filter: ((u.status)::text = 'active'::text)
-> Subquery Scan on latest (actual time=0.008..0.008 rows=1 loops=54)
-> Sort (actual time=0.007..0.007 rows=1 loops=54)
Sort Key: o.created_at DESC
Sort Method: quicksort Memory: 25kB
-> Bitmap Heap Scan on sim_bp_orders o
(actual time=0.003..0.006 rows=3 loops=54)
Recheck Cond: (o.user_id = u.user_id)
Execution Time: 0.462 ms
0.46 ms. The planner ran the lateral subquery 54 times: one per candidate user, with four users yielding no order (and therefore dropped by the CROSS JOIN) before the outer LIMIT 50 was satisfied. Each lateral execution was a cheap bitmap index scan plus a tiny sort bounded by LIMIT 1.
The window-function equivalent — ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY created_at DESC) with an outer WHERE rn = 1 — often produces a worse plan on PostgreSQL when only the top 1 or 2 per group are needed, because it computes row numbers for every row before filtering. LATERAL with a LIMIT inside lets the planner stop early.
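For comparison, a sketch of the window-function formulation of "latest order per user" — every order gets a row number before the outer filter discards all but one per user:

```sql
SELECT user_id, order_id, total_amount_cents
FROM (
    SELECT o.user_id, o.order_id, o.total_amount_cents,
           row_number() OVER (PARTITION BY o.user_id
                              ORDER BY o.created_at DESC) AS rn
    FROM sim_bp_orders o
) ranked
WHERE rn = 1;   -- filter applied only after numbering every row
```

Nothing in this shape tells the planner it can stop after the first row per partition, which is why the LATERAL version with `LIMIT 1` inside usually wins for small N.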
Two practical notes:
- `CROSS JOIN LATERAL` vs `LEFT JOIN LATERAL`. `CROSS JOIN LATERAL` drops outer rows where the subquery returns nothing; `LEFT JOIN LATERAL ... ON TRUE` preserves them with NULLs. Swapping them changes results silently.
- Indexes matter more here than anywhere else. The subquery runs per outer row, so any table scan inside it is multiplied by the outer row count. The lateral on `sim_bp_orders.user_id` was quick because `idx_sim_bp_orders_user_id` exists; without it, every lateral execution would be a sequential scan over all 500k orders.
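If the zero-order users should stay in the result, the `LEFT JOIN LATERAL` variant of the earlier query looks like this (order columns come back NULL for users without orders):

```sql
SELECT u.user_id, u.email,
       latest.order_id,
       latest.total_amount_cents
FROM sim_bp_users u
LEFT JOIN LATERAL (
    SELECT order_id, total_amount_cents
    FROM sim_bp_orders o
    WHERE o.user_id = u.user_id
    ORDER BY o.created_at DESC
    LIMIT 1
) latest ON TRUE          -- ON TRUE: the lateral correlation is the real join condition
WHERE u.status = 'active'
LIMIT 50;
```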
CTEs — materialised by default no longer
Before PostgreSQL 12, every WITH clause was an optimisation fence: the CTE was computed in full into a tuplestore, and the planner could not push predicates from the outer query into it. People used this intentionally (the "CTE trick" to force materialisation), but it also silently hurt a lot of queries.
PostgreSQL 12 reversed the default. Now a non-recursive CTE that is referenced exactly once, contains no data-modifying statement, and calls no volatile functions is inlined — the planner treats it like a subquery, and predicate pushdown works as expected. CTEs referenced multiple times, recursive CTEs, and CTEs containing INSERT/UPDATE/DELETE are still materialised.
Two keywords override the default:
- `WITH foo AS NOT MATERIALIZED (...)` — force inlining even if referenced multiple times.
- `WITH foo AS MATERIALIZED (...)` — force materialisation even if referenced only once.
Typical cases:
-- Inlined by default — works like a subquery, predicates push in.
WITH recent_pending AS (
SELECT order_id, user_id, created_at
FROM sim_bp_orders
WHERE status = 'pending'
)
SELECT rp.order_id, u.email
FROM recent_pending rp
JOIN sim_bp_users u ON u.user_id = rp.user_id
WHERE rp.created_at > now() - interval '7 days';
The created_at > now() - interval '7 days' filter is pushed into the CTE, so the combined filter (status = 'pending' AND created_at > ...) can use a single index scan rather than materialising all pending orders first.
-- Expensive aggregation referenced twice — worth materialising once.
WITH user_totals AS MATERIALIZED (
SELECT user_id, sum(total_amount_cents) AS total
FROM sim_bp_orders
GROUP BY user_id
)
SELECT u.email, ut.total
FROM sim_bp_users u
JOIN user_totals ut ON ut.user_id = u.user_id
WHERE ut.total > 1000000
UNION ALL
SELECT u.email, 0
FROM sim_bp_users u
WHERE NOT EXISTS (SELECT 1 FROM user_totals WHERE user_id = u.user_id);
Without MATERIALIZED, the aggregation runs twice (once per reference). With it, it runs once and both references read from the materialised temp table.
Recursive CTEs
Recursive CTEs are for hierarchical data: trees, graphs, transitive closures, category parents, reporting chains.
WITH RECURSIVE employee_tree AS (
-- Base case: root of the tree
SELECT employee_id, manager_id, name, 1 AS depth
FROM employees
WHERE manager_id IS NULL
UNION ALL
-- Recursive step: children of previously-found rows
SELECT e.employee_id, e.manager_id, e.name, et.depth + 1
FROM employees e
JOIN employee_tree et ON et.employee_id = e.manager_id
)
SELECT * FROM employee_tree;
PostgreSQL computes the base case, then repeatedly applies the recursive step to previously-produced rows until no new rows are generated. Two practical concerns:
- No termination guarantee. PostgreSQL forbids referencing the CTE inside the non-recursive (base) term, but nothing stops the recursive step from producing new rows forever — a cycle in the data (an employee who is their own transitive manager) loops until the query is killed. Use a `depth < N` guard when testing.
- Index the join column. Each iteration joins the working set of newly-produced rows against the source table — without an index on `employees.manager_id`, every iteration is a sequential scan.
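Applied to the tree query above, a depth guard is one extra predicate in the recursive step (10 levels here is an arbitrary testing bound):

```sql
WITH RECURSIVE employee_tree AS (
    SELECT employee_id, manager_id, name, 1 AS depth
    FROM employees
    WHERE manager_id IS NULL
    UNION ALL
    SELECT e.employee_id, e.manager_id, e.name, et.depth + 1
    FROM employees e
    JOIN employee_tree et ON et.employee_id = e.manager_id
    WHERE et.depth < 10       -- hard stop: no row deeper than 10 levels
)
SELECT * FROM employee_tree;
```

With a cycle in the data, the unguarded version never terminates; the guarded one returns (possibly repeated) rows down to the depth bound and stops.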
For transitive-closure queries (shortest paths, graph traversals), recursive CTEs work but scale poorly beyond a few tens of thousands of rows. For heavier graph workloads, look at dedicated extensions or materialised adjacency tables.
Subqueries in the FROM clause
SELECT ... FROM (SELECT ...) AS sub is semantically just a derived table. The planner has always pulled these up into the outer query where it can (subject to from_collapse_limit), pushing predicates in — no PG 12 caveat needed, unlike CTEs.
One case where FROM subqueries matter: forcing an expensive expression to be evaluated once per row rather than twice. If you have SELECT ..., f(x) AS computed_val FROM t WHERE f(x) > 10, PostgreSQL evaluates f(x) separately for the filter and the projection. Wrapping the call in a FROM subquery helps only if the subquery isn't flattened back into the outer query — add OFFSET 0 (an optimisation fence) or use a MATERIALIZED CTE to guarantee single evaluation; marking f as STABLE does not, by itself, deduplicate the calls.
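A sketch of that pattern using OFFSET 0 as the fence (f and t are the hypothetical expensive function and table from the text):

```sql
-- OFFSET 0 stops the planner from flattening the subquery and
-- re-expanding the expression, so f(x) is computed once per row:
SELECT x, computed_val
FROM (
    SELECT x, f(x) AS computed_val
    FROM t
    OFFSET 0
) sub
WHERE computed_val > 10;
```

The fence is deliberate pessimisation: use it only when f is genuinely expensive enough to outweigh whatever pushdown the planner loses.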
Practical rules
- SubPlan in the plan output → consider rewriting as a JOIN or LATERAL.
- EXISTS / IN / JOIN+DISTINCT → default to EXISTS for boolean questions; it's usually clearest and gets the best plan on PostgreSQL.
- NOT IN on a nullable column → almost always a bug. Use NOT EXISTS.
- CTE used once → inlined by default in PG 12+. Don't wrap something in a CTE hoping to force materialisation; use `MATERIALIZED` explicitly.
- CTE used multiple times with expensive aggregation → `MATERIALIZED` wins.
- Top-N per group → LATERAL with `LIMIT` inside. Cleaner plan than window functions for small N.
- Recursive traversals → `WITH RECURSIVE`, but index the join column and put a depth guard on anything you're not sure terminates.
Next in the series: WHERE Clause Optimisation — sargability, composite-index column ordering, and the operators that silently disable indexes.
#postgres #performance #database #sql
Canonical version: https://mydba.dev/blog/postgres-subquery-cte-optimization