In the first SQL article, you met the basics: tables, data types, simple SELECTs, and filters. Now you’re ready for the point where “one query” quietly turns into “three nested queries and a headache.”
This is where subqueries and Common Table Expressions (CTEs) earn their keep. They let you think in layers, express complex logic clearly, and avoid turning every report into a wall of SQL noise.
In this follow‑up, we’ll stay practical and Postgres‑friendly. You’ll see where subqueries and CTEs fit, how they differ, and how to refactor queries when they start to sag under their own weight. The goal is that, as you read each example, you can mentally picture what rows are being produced at each step.
Subqueries: Queries Inside Queries
A subquery is just a query inside another query. The outer query asks a question, and the inner query helps answer part of it.
You’ll see subqueries most often in:
-
SELECT– to compute a single value per row. -
FROM– as an inline view (a “temporary table” for that query). -
WHERE/HAVING– to filter based on another result set. -
JOINconditions – for advanced patterns or cleaner predicates.
Conceptually, the database plans everything together, but you can think of the subquery as producing a value or set of values that the outer query then uses.
Common subquery shapes
Let’s walk through the main types. For each one, ask yourself two questions:
- What does the inner query return (shape: one row, many rows, one column, many columns)?
- How does the outer query use those rows?
Scalar subquery
A scalar subquery returns exactly one row and one column. It behaves like a single value you can plug into an expression.
SELECT
c.customer_id,
c.first_name,
c.last_name,
(
SELECT AVG(total_amount)
FROM assignment.sales
) AS avg_order_value
FROM assignment.customers c;
Here’s how to read this:
- Inner query:
SELECT AVG(total_amount) FROM assignment.salesscans thesalestable and returns one number: the average order value across all customers. - Outer query: for each row in
customers, we just attach that same single value asavg_order_value.
So the result is: one row per customer, plus a column showing the global average order value for comparison.
Single‑row subquery
A single‑row subquery returns one row (maybe multiple columns), which you can compare as a unit.
SELECT *
FROM assignment.products p
WHERE (p.category, p.price) = (
SELECT category, MAX(price)
FROM assignment.products
GROUP BY category
ORDER BY MAX(price) DESC
LIMIT 1
);
Step by step:
- The inner query groups products by category, computes
MAX(price)per category, orders by that max price descending, and keeps only the first row withLIMIT 1. - That gives you something like
(category = 'Electronics', price = 1500.00). - The outer query then finds any product row whose
(category, price)pair exactly matches that tuple. In practice, you’ll get the single priciest product in the dataset.
Multi‑row subquery with IN
Here the subquery returns many rows, and the outer query checks if a value is in that set.
-- Customers who have ever placed an order
SELECT *
FROM assignment.customers c
WHERE c.customer_id IN (
SELECT DISTINCT customer_id
FROM assignment.sales
);
Think of the inner query as producing a list like [1, 2, 4, 7, 9, ...] of customer_ids that appear in the sales table. The outer query then says: “give me all customers whose customer_id is in that list.”
Correlated subquery
A correlated subquery depends on columns from the outer query, so it is logically evaluated per row of the outer query.
-- Per product, compute total quantity via a correlated subquery
SELECT
p.product_id,
p.product_name,
(
SELECT SUM(s.quantity_sold)
FROM assignment.sales s
WHERE s.product_id = p.product_id
) AS total_quantity_sold
FROM assignment.products p;
Here:
- For each row in
products(outer query), Postgres plugs that row’sproduct_idinto the inner query. - The inner query filters
salesdown to that product and sums itsquantity_sold. - The result is one row per product with a per‑product total quantity.
On small tables this is fine. On big ones, “run this inner aggregate for each product” can become expensive, which is why we sometimes refactor to joins or CTEs.
Inline view (subquery in FROM)
An inline view is just a subquery in the FROM clause. You can treat its output like a temporary table with a name.
SELECT
dep.category,
dep.total_revenue
FROM (
SELECT
p.category,
SUM(s.total_amount) AS total_revenue
FROM assignment.products p
JOIN assignment.sales s
ON s.product_id = p.product_id
GROUP BY p.category
) AS dep
WHERE dep.total_revenue > 10000;
Mentally:
- The inner query computes one row per category with
total_revenue. - We give that result the alias
dep. - The outer query simply filters that temporary table to categories over a threshold.
This is the same idea you’ll see later with CTEs, just without the WITH keyword.
When Subqueries Shine (and When They Hurt)
Subqueries are great tools when:
- You want existence checks:
WHERE EXISTS (SELECT 1 FROM ... )to answer “does this related row exist?” - You need single‑value lookups: comparing against global
MAX(),AVG(), or a config row. - You prefer contained filtering logic:
WHERE x IN (SELECT ...)instead of more joins.
For example, a simple existence check:
SELECT c.customer_id, c.first_name
FROM assignment.customers c
WHERE EXISTS (
SELECT 1
FROM assignment.sales s
WHERE s.customer_id = c.customer_id
);
The inner query returns at least one row if the customer has any sales. EXISTS doesn’t care what the row contains, just whether it exists.
But there are downsides:
- Correlated subqueries can be slow, because they are logically evaluated per outer row. For a million products, that can mean a million inner aggregations.
- Deep nesting makes queries hard to read and debug.
- Refactoring is painful when logic is split across many layers of parentheses.
As a heuristic: use subqueries for small, sharp problems. When logic grows or the data set is large, move toward CTEs or joins.
CTEs: Naming Your Thought Process
A CTE (Common Table Expression) is a named subquery defined at the top of your statement using WITH. It only exists for that one query, but it gives you a clear label for each step.
Basic pattern:
WITH customer_totals AS (
SELECT
customer_id,
SUM(total_amount) AS total_spent
FROM assignment.sales
GROUP BY customer_id
)
SELECT
c.customer_id,
c.first_name,
ct.total_spent
FROM assignment.customers c
JOIN customer_totals ct
ON ct.customer_id = c.customer_id
WHERE ct.total_spent > 1000;
Read it like this:
-
customer_totalsCTE: “Build a small table that has one row per customer with theirtotal_spent.” - Main query: “Join that table to
customersand filter ontotal_spent.”
Instead of one big tangled query, you now have a clear two‑step story.
Chained / multiple CTEs
You can define several CTEs, where later ones depend on earlier ones. This feels a lot like building a small data pipeline.
WITH base_sales AS (
SELECT *
FROM assignment.sales
WHERE sale_date >= DATE '2023-01-01'
),
product_totals AS (
SELECT
product_id,
SUM(total_amount) AS total_revenue
FROM base_sales
GROUP BY product_id
)
SELECT *
FROM product_totals
WHERE total_revenue > 2000;
Here:
-
base_salesisolates “sales from 2023 onward.” -
product_totalsaggregates only that subset. - The final query filters heavy hitters.
Each step is small and named, so it’s easy to tweak just one without touching the others.
Recursive CTE (hierarchies)
Recursive CTEs are designed for hierarchical data: org charts, category trees, folder structures.
-- employees(emp_id, manager_id, full_name)
WITH RECURSIVE manager_chain AS (
-- Anchor: start from the employee we care about
SELECT
e.emp_id,
e.manager_id,
e.full_name,
0 AS depth
FROM employees e
WHERE e.emp_id = 42
UNION ALL
-- Recursive step: move one level up each time
SELECT
m.emp_id,
m.manager_id,
m.full_name,
mc.depth + 1 AS depth
FROM employees m
JOIN manager_chain mc
ON m.emp_id = mc.manager_id
)
SELECT *
FROM manager_chain
ORDER BY depth;
Walkthrough:
- The anchor query grabs the starting employee.
- The recursive query says: “Given everyone we already know about in
manager_chain, join to their managers and add them to the chain.” - This continues until there’s no higher manager.
You end up with a neat ordered list from “employee” up to “CEO.”
MATERIALIZED vs NOT MATERIALIZED (Postgres)
Postgres can treat CTEs in two ways:
- Materialized: run the CTE once, store the result (in memory or on disk), then reuse it.
- Not materialized / inlined: conceptually splice the CTE into the main query so the planner can push filters and joins through it.
You can nudge Postgres:
WITH cte AS MATERIALIZED (
SELECT * FROM big_expensive_view
),
filtered AS (
SELECT * FROM cte WHERE important_flag = true
)
SELECT * FROM filtered;
Use MATERIALIZED when:
- The CTE is expensive but reused multiple times.
- You want an optimization fence (no predicate pushdown) for predictable behavior.
Otherwise, let Postgres inline or explicitly use NOT MATERIALIZED and let the planner be clever.
Subqueries vs CTEs: How to Choose
You can often write the same query with either subqueries or CTEs. Choosing is about clarity and performance.
Readability & maintainability
- Subqueries are great for tiny, local bits of logic.
- CTEs shine when you can describe your query as a sequence of steps.
Performance & planner behavior
- Non‑correlated subqueries often optimize similarly to joins.
- Correlated subqueries can explode in cost on large tables.
- CTEs may be inlined or materialized depending on hints and Postgres version; always check the plan.
Reuse & provenance
- If you copy‑paste the same subquery twice, that’s a good signal it wants to be a CTE.
- CTEs give you one place to change logic while keeping intent obvious: “these rows came from
customer_totals.”
When to reach for JOINs / LATERAL
If your subquery is really “find related rows,” a JOIN is usually clearer. JOIN LATERAL is ideal when you want “top N per row” or “latest row per group.”
Concrete Refactors (Postgres)
Let’s refactor some common patterns so you can see where subqueries, CTEs, and LATERAL each feel natural.
1. Correlated COUNT → CTE + JOIN
Correlated version:
SELECT
c.customer_id,
c.first_name,
(
SELECT COUNT(*)
FROM assignment.sales s
WHERE s.customer_id = c.customer_id
) AS order_count
FROM assignment.customers c;
For each customer row, the inner query counts matching sales rows. On a small dataset, no problem. On a large one, this is “run COUNT again and again.”
CTE + JOIN version:
WITH order_counts AS (
SELECT
customer_id,
COUNT(*) AS order_count
FROM assignment.sales
GROUP BY customer_id
)
SELECT
c.customer_id,
c.first_name,
COALESCE(oc.order_count, 0) AS order_count
FROM assignment.customers c
LEFT JOIN order_counts oc
ON oc.customer_id = c.customer_id;
Now:
-
order_countscomputes every customer’s count once. - The final query simply joins that small aggregated table to
customers. - Customers with no sales come out as
NULL, which we turn into0withCOALESCE.
2. “Latest row per group” with LATERAL
Correlated style:
SELECT
c.customer_id,
c.first_name,
(
SELECT s.sale_date
FROM assignment.sales s
WHERE s.customer_id = c.customer_id
ORDER BY s.sale_date DESC
LIMIT 1
) AS last_order_date
FROM assignment.customers c;
Again, one inner query per customer. Let’s make the intent more explicit with LATERAL.
LATERAL refactor:
SELECT
c.customer_id,
c.first_name,
s_last.sale_date AS last_order_date
FROM assignment.customers c
LEFT JOIN LATERAL (
SELECT sale_date
FROM assignment.sales s
WHERE s.customer_id = c.customer_id
ORDER BY sale_date DESC
LIMIT 1
) AS s_last ON TRUE;
LATERAL means: “for each row of customers, run this subquery with access to that row’s columns.”
- The inner block still does
ORDER BY ... LIMIT 1, but it’s clearer that we’re pairing one latest sale row per customer. - We can easily extend it to return more columns (amount, product, etc.) from the latest sale.
3. Recursive CTE for an org chart
Revisiting our recursive example, now you can see how the pieces fit:
- Anchor: define the starting point.
- Recursive step: define how to move to the “next level.”
- Final SELECT: decide what to display.
WITH RECURSIVE manager_chain AS (
-- Start from a specific employee
SELECT
e.emp_id,
e.manager_id,
e.full_name,
0 AS depth
FROM employees e
WHERE e.emp_id = 42
UNION ALL
-- Then move up the chain to each manager
SELECT
m.emp_id,
m.manager_id,
m.full_name,
mc.depth + 1 AS depth
FROM employees m
JOIN manager_chain mc
ON m.emp_id = mc.manager_id
)
SELECT *
FROM manager_chain
ORDER BY depth;
If you print the results, you’ll literally see: depth = 0 for the employee, 1 for their manager, 2 for the manager’s manager, and so on. That step‑by‑step mental picture is what makes recursive CTEs click.
Practical Checklist Before You Ship That Query
- Can I name this step? If you can describe a piece of logic (“active 2023 sales”), that’s a good candidate for a CTE.
-
Is this subquery correlated? If it references outer columns, run
EXPLAIN ANALYZEand see if it’s doing repeated work. -
Are join/filter columns indexed? Especially
customer_id,product_id, and date columns used in subqueries or CTEs. - Am I repeating logic? Repeated subqueries usually want to be a CTE or an inline view.
- Do I really need MATERIALIZED? Force it only when you want “compute once, reuse many times” and you accept the temp cost.
- Can a JOIN / LATERAL express this more clearly? If yes, it often improves both readability and performance.
Wrapping Up
In the first article, getting a SELECT to run was the win. As your queries grow, the real win is writing SQL that your future self can understand at a glance and your database can execute efficiently.
Subqueries and CTEs are how you get there: they let you structure your thinking, name your steps, and refactor as your logic evolves. The best way to internalize them now is to pick one of your existing “medium‑ugly” queries, rewrite it with a couple of CTEs, and see how much easier it is to read, explain, and tune.
This article was written to help data engineers, from early-career to mid-level, master how and when to use subqueries, CTEs or other alternatives. As always, practice makes perfect, so go ahead: break some eggs! The article was submitted in fulfilment of a LuxDevHQ Cohort 7 DataEngieering assignment ©adev3loper
Top comments (0)