Franck Pachot for AWS Heroes

Unnesting Scalar Subqueries into Left Outer Joins in SQL

Relational databases treat your query as a declarative description of the desired result and select the most efficient execution plan. They may rewrite the query—for example, by transforming subqueries into joins and vice versa—so the database, not the developer, manages this complexity.

Historically, PostgreSQL has had fewer planner transformations than many commercial databases. As an open-source project, it favors simpler code that encourages contributions and good SQL design, while commercial vendors can justify more complex planning logic when it helps revenue-generating customers reduce their optimization effort. PostgreSQL does not maintain a global shared plan cache, so most queries are planned on every execution, which encourages keeping planning overhead low. The exceptions are prepared statements, functions, and other situations where a generic plan can be retained and reused. Transformations open more access paths, potentially improving execution plans, but at the cost of higher planning time.
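
For illustration, a prepared statement is one of the few places where PostgreSQL can reuse a plan rather than rebuild it on every execution (a minimal sketch, using the inner_table created below):

-- After a few executions, PostgreSQL may switch a prepared statement
-- to a cached generic plan, amortizing the planning cost
PREPARE avg_b (int) AS
  SELECT AVG(b) FROM inner_table WHERE a = $1;

EXECUTE avg_b(42);  -- first executions use custom plans
EXECUTE avg_b(43);  -- later executions may reuse a generic plan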

AWS recently added this kind of transformation to Amazon Aurora PostgreSQL (APG) to optimize queries with a correlated scalar subquery in the projection. The pattern looks like this (the same query as in the AWS example, against the tables created below):
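
SELECT outer_table.a, outer_table.b,
       (SELECT AVG(inner_table.b)
          FROM inner_table
         WHERE inner_table.a = outer_table.a)  -- correlated scalar subquery
FROM outer_table;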

Without this transformation, projection expressions are evaluated per row, which at best yields nested-loop–like performance. By pushing the subquery down so it runs before the projection, more efficient join algorithms can be used. Amazon Aurora is a proprietary fork of PostgreSQL, and its improvements are not contributed upstream, so this feature is not available in community PostgreSQL. Nonetheless, examining the transformation Aurora implements can inspire similar manual rewrites. I’ll use the example from the AWS blog post to show how to rewrite the query more efficiently. It is important to examine the transformation carefully to ensure it does not change the results.

Here is the DDL and DML to create the same example as AWS's article:

-- Clean up from previous runs

DROP TABLE IF EXISTS outer_table;
DROP TABLE IF EXISTS inner_table;

-- Outer table (like customers)

CREATE TABLE outer_table (
    id  SERIAL PRIMARY KEY,
    a   INT,
    b   INT
);

-- Inner table (like orders)

CREATE TABLE inner_table (
    id  SERIAL PRIMARY KEY,
    a   INT,
    b   INT
);

-- Insert data into inner_table:
--   10,000 rows, 'a' cycles from 1..100, 'b' random 0–999

INSERT INTO inner_table (a, b)
SELECT
    1 + mod(gs - 1, 100),
    floor(random() * 1000)::int
FROM generate_series(1, 10000) AS gs;

-- Insert data into outer_table:
--   First 25,000 rows: 'a' cycles from 1..100

INSERT INTO outer_table (a, b)
SELECT
    1 + mod(gs - 1, 100),
    floor(random() * 1000)::int
FROM generate_series(1, 25000) AS gs;
--   Next 25,000 rows: unique 'a' = 101..25100
INSERT INTO outer_table (a, b)
SELECT
    gs + 100,
    floor(random() * 1000)::int
FROM generate_series(1, 25000) AS gs;
-- Gather statistics
ANALYZE outer_table;
ANALYZE inner_table;
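Before looking at plans, a quick sanity check on the distribution helps: only the first hundred values of a have matching rows in inner_table, so half of the outer rows will find no match. These counts follow directly from the DDL above:

-- Optional sanity check of the generated data
SELECT count(*) FROM outer_table;   -- 50000
SELECT count(*) FROM inner_table;   -- 10000
-- outer rows with a = 101..25100 have no match in inner_table (a = 1..100)
SELECT count(*) FROM outer_table o
WHERE NOT EXISTS (SELECT 1 FROM inner_table i WHERE i.a = o.a);  -- 25000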

To test it, I set up an Aurora Serverless database, enabled the transformation parameter, and ran the query, displaying its execution plan:

SET apg_enable_correlated_scalar_transform = on;

explain (analyze , verbose, costs off)
SELECT outer_table.a, outer_table.b
       , (SELECT AVG(inner_table.b)
          FROM inner_table
          WHERE inner_table.a = outer_table.a
) FROM outer_table
;
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Hash Left Join (actual time=4.904..15.740 rows=50000 loops=1)
   Output: outer_table.a, outer_table.b, (avg(inner_table.b))
   Inner Unique: true
   Hash Cond: (outer_table.a = inner_table.a)
   ->  Seq Scan on public.outer_table (actual time=0.016..2.968 rows=50000 loops=1)
         Output: outer_table.id, outer_table.a, outer_table.b
   ->  Hash (actual time=2.985..2.986 rows=100 loops=1)
         Output: (avg(inner_table.b)), inner_table.a
         Buckets: 1024  Batches: 1  Memory Usage: 13kB
         ->  HashAggregate (actual time=2.930..2.960 rows=100 loops=1)
               Output: avg(inner_table.b), inner_table.a
               Group Key: inner_table.a
               Batches: 1  Memory Usage: 32kB
               ->  Seq Scan on public.inner_table (actual time=0.016..0.637 rows=10000 loops=1)
                     Output: inner_table.id, inner_table.a, inner_table.b
 Query Identifier: -2382945993278526738
 Planning Time: 2.439 ms
 Execution Time: 23.322 ms
(18 rows)

I've added the verbose option to explain because it is important to see all elements of the rewritten query. With the transformation, the correlated (SELECT AVG(inner_table.b) FROM ... WHERE ...) has become a plain (avg(inner_table.b)) projected over a Hash Left Join to the inner table aggregated by the join key (the HashAggregate node).

It is the equivalent of:

explain (analyze , verbose, costs off )
SELECT outer_table.a,  outer_table.b
       , agg.avg_b
FROM outer_table
LEFT JOIN (
    SELECT a, AVG(b) AS avg_b
    FROM inner_table
    GROUP BY a
) AS agg
ON outer_table.a = agg.a
;
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Hash Left Join (actual time=4.469..16.534 rows=50000 loops=1)
   Output: outer_table.a, outer_table.b, agg.avg_b
   Inner Unique: true
   Hash Cond: (outer_table.a = agg.a)
   ->  Seq Scan on public.outer_table (actual time=0.011..3.124 rows=50000 loops=1)
         Output: outer_table.id, outer_table.a, outer_table.b
   ->  Hash (actual time=3.804..3.806 rows=100 loops=1)
         Output: agg.avg_b, agg.a
         Buckets: 1024  Batches: 1  Memory Usage: 13kB
         ->  Subquery Scan on agg (actual time=3.733..3.778 rows=100 loops=1)
               Output: agg.avg_b, agg.a
               ->  HashAggregate (actual time=3.732..3.765 rows=100 loops=1)
                     Output: inner_table.a, avg(inner_table.b)
                     Group Key: inner_table.a
                     Batches: 1  Memory Usage: 32kB
                     ->  Seq Scan on public.inner_table (actual time=0.004..0.668 rows=10000 loops=1)
                           Output: inner_table.id, inner_table.a, inner_table.b
 Query Identifier: -3523129028670016640
 Planning Time: 1.361 ms
 Execution Time: 19.674 ms
(20 rows)

This looks simple, but SQL can be subtle: not all aggregate functions have the same semantics when there are no input rows. AVG() over no rows returns NULL, which is also what the left join produces for an unmatched key, so the rewrite is equivalent. With COUNT() it is different: the correlated scalar subquery always returns a row, with 0 when nothing matches, while the grouped subquery produces no row for that key, so the left join would yield NULL instead of 0.
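
You can see the mismatch directly with a key that exists only in the outer side (a = 101 has no match in inner_table in this dataset; this is a sketch of the naive rewrite without the fix):

-- The correlated COUNT() returns 0, but the naive join rewrite returns NULL
SELECT (SELECT COUNT(i.b)
          FROM inner_table i
         WHERE i.a = o.a) AS scalar_count,   -- 0
       agg.cnt_b          AS join_count      -- NULL
FROM (SELECT 101 AS a) AS o
LEFT JOIN (
    SELECT a, COUNT(b) AS cnt_b
    FROM inner_table
    GROUP BY a
) AS agg ON o.a = agg.a;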

I test the automatic transformation on Aurora with a COUNT():

explain (analyze , verbose, costs off)
SELECT outer_table.a, outer_table.b
       , (SELECT COUNT(inner_table.b)
          FROM inner_table
          WHERE inner_table.a = outer_table.a
) FROM outer_table
;
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Hash Left Join (actual time=2.319..13.332 rows=50000 loops=1)
   Output: outer_table.a, outer_table.b, COALESCE((count(inner_table.b)), 0)
   Inner Unique: true
   Hash Cond: (outer_table.a = inner_table.a)
   ->  Seq Scan on public.outer_table (actual time=0.012..3.003 rows=50000 loops=1)
         Output: outer_table.id, outer_table.a, outer_table.b
   ->  Hash (actual time=2.302..2.304 rows=100 loops=1)
         Output: (count(inner_table.b)), inner_table.a
         Buckets: 1024  Batches: 1  Memory Usage: 13kB
         ->  HashAggregate (actual time=2.255..2.268 rows=100 loops=1)
               Output: count(inner_table.b), inner_table.a
               Group Key: inner_table.a
               Batches: 1  Memory Usage: 24kB
               ->  Seq Scan on public.inner_table (actual time=0.003..0.640 rows=10000 loops=1)
                     Output: inner_table.id, inner_table.a, inner_table.b
 Query Identifier: 6903753335662751945
 Planning Time: 1.267 ms
 Execution Time: 15.219 ms
(18 rows)

Now the VERBOSE option shows COALESCE((count(inner_table.b)), 0), which turns a NULL into a zero to match the COUNT() semantics.

Here is the equivalent query if you want to do the transformation manually:

explain (analyze , verbose )
SELECT outer_table.a,  outer_table.b
       , COALESCE(agg.cnt_b, 0) AS cnt_b
FROM outer_table
LEFT JOIN (
    SELECT a, COUNT(b) AS cnt_b
    FROM inner_table
    GROUP BY a
) AS agg
ON outer_table.a = agg.a;

                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=208.25..1110.53 rows=50000 width=16) (actual time=2.556..77.785 rows=50000 loops=1)
   Output: outer_table.a, outer_table.b, COALESCE(agg.cnt_b, '0'::bigint)
   Inner Unique: true
   Hash Cond: (outer_table.a = agg.a)
   ->  Seq Scan on public.outer_table  (cost=0.00..771.00 rows=50000 width=8) (actual time=0.012..3.183 rows=50000 loops=1)
         Output: outer_table.id, outer_table.a, outer_table.b
   ->  Hash  (cost=207.00..207.00 rows=100 width=12) (actual time=2.532..2.534 rows=100 loops=1)
         Output: agg.cnt_b, agg.a
         Buckets: 1024  Batches: 1  Memory Usage: 13kB
         ->  Subquery Scan on agg  (cost=205.00..207.00 rows=100 width=12) (actual time=2.485..2.509 rows=100 loops=1)
               Output: agg.cnt_b, agg.a
               ->  HashAggregate  (cost=205.00..206.00 rows=100 width=12) (actual time=2.484..2.497 rows=100 loops=1)
                     Output: inner_table.a, count(inner_table.b)
                     Group Key: inner_table.a
                     Batches: 1  Memory Usage: 24kB
                     ->  Seq Scan on public.inner_table  (cost=0.00..155.00 rows=10000 width=8) (actual time=0.006..0.679 rows=10000 loops=1)
                           Output: inner_table.id, inner_table.a, inner_table.b
 Query Identifier: 4982770911819576582
 Planning Time: 0.151 ms
 Execution Time: 80.622 ms
(20 rows)

You need COALESCE with COUNT in the manual join rewrite because of how SQL aggregates behave: unlike other aggregates, COUNT returns 0—not NULL—when there are no rows.
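
A one-liner makes this behavior easy to remember. Aggregating over an empty input set (standard PostgreSQL behavior) still returns one row, and only COUNT() is non-NULL:

-- Aggregates over zero rows: only COUNT() returns a non-NULL result
SELECT COUNT(*) AS count_star,  -- 0
       COUNT(b) AS count_b,     -- 0
       SUM(b)   AS sum_b,       -- NULL
       AVG(b)   AS avg_b,       -- NULL
       MIN(b)   AS min_b        -- NULL
FROM inner_table
WHERE false;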

Aurora PostgreSQL’s apg_enable_correlated_scalar_transform shows how the planner can rewrite correlated subqueries in the SELECT list into join-based aggregates, replacing per-row subqueries with set-based aggregation and hash joins for much better performance.

Even if you don’t use Aurora in production, you can spin up a small Aurora Serverless instance to validate your manual rewrites. Enable the parameter and compare the execution plan to your PostgreSQL version. The plans should match, with one structural difference: the manual version’s grouped subquery is wrapped in a Subquery Scan, which is expected in hand-written SQL.

For aggregates like AVG(), the rewrite preserves semantics with no extra changes. For COUNT(), wrap the join output in COALESCE(..., 0) so it behaves like the original correlated subquery when no rows are found.

By understanding these transformations and their semantics, you can reproduce Aurora’s optimization in upstream PostgreSQL and gain similar performance benefits while keeping full control over correctness.
