Mungai M.

Posted on Apr 19

Level Up Your SQL: Subqueries & CTEs in the Real World

#sql #dataengineering #database #assignment

In the first SQL article, you met the basics: tables, data types, simple SELECTs, and filters. Now you’re ready for the point where “one query” quietly turns into “three nested queries and a headache.”

This is where subqueries and Common Table Expressions (CTEs) earn their keep. They let you think in layers, express complex logic clearly, and avoid turning every report into a wall of SQL noise.

In this follow‑up, we’ll stay practical and Postgres‑friendly. You’ll see where subqueries and CTEs fit, how they differ, and how to refactor queries when they start to sag under their own weight. The goal is that, as you read each example, you can mentally picture what rows are being produced at each step.

Subqueries: Queries Inside Queries

A subquery is just a query inside another query. The outer query asks a question, and the inner query helps answer part of it.

You’ll see subqueries most often in:

SELECT – to compute a single value per row.
FROM – as an inline view (a “temporary table” for that query).
WHERE / HAVING – to filter based on another result set.
JOIN conditions – for advanced patterns or cleaner predicates.

Conceptually, the database plans everything together, but you can think of the subquery as producing a value or set of values that the outer query then uses.

Common subquery shapes

Let’s walk through the main types. For each one, ask yourself two questions:

What does the inner query return (shape: one row, many rows, one column, many columns)?
How does the outer query use those rows?

Scalar subquery

A scalar subquery returns exactly one row and one column. It behaves like a single value you can plug into an expression.

SELECT
  c.customer_id,
  c.first_name,
  c.last_name,
  (
    SELECT AVG(total_amount)
    FROM assignment.sales
  ) AS avg_order_value
FROM assignment.customers c;

Here’s how to read this:

Inner query: SELECT AVG(total_amount) FROM assignment.sales scans the sales table and returns one number: the average order value across all customers.
Outer query: for each row in customers, we just attach that same single value as avg_order_value.

So the result is: one row per customer, plus a column showing the global average order value for comparison.

Single‑row subquery

A single‑row subquery returns one row (maybe multiple columns), which you can compare as a unit.

SELECT *
FROM assignment.products p
WHERE (p.category, p.price) = (
  SELECT category, MAX(price)
  FROM assignment.products
  GROUP BY category
  ORDER BY MAX(price) DESC
  LIMIT 1
);

Step by step:

The inner query groups products by category, computes MAX(price) per category, orders by that max price descending, and keeps only the first row with LIMIT 1.
That gives you something like (category = 'Electronics', price = 1500.00).
The outer query then finds any product row whose (category, price) pair exactly matches that tuple. In practice, you’ll get the single priciest product in the dataset.

Multi‑row subquery with `IN`

Here the subquery returns many rows, and the outer query checks if a value is in that set.

-- Customers who have ever placed an order
SELECT *
FROM assignment.customers c
WHERE c.customer_id IN (
  SELECT DISTINCT customer_id
  FROM assignment.sales
);

Think of the inner query as producing a list like [1, 2, 4, 7, 9, ...] of customer_ids that appear in the sales table. The outer query then says: “give me all customers whose customer_id is in that list.”

Correlated subquery

A correlated subquery depends on columns from the outer query, so it is logically evaluated per row of the outer query.

-- Per product, compute total quantity via a correlated subquery
SELECT
  p.product_id,
  p.product_name,
  (
    SELECT SUM(s.quantity_sold)
    FROM assignment.sales s
    WHERE s.product_id = p.product_id
  ) AS total_quantity_sold
FROM assignment.products p;

Here:

For each row in products (outer query), Postgres plugs that row’s product_id into the inner query.
The inner query filters sales down to that product and sums its quantity_sold.
The result is one row per product with a per‑product total quantity.

On small tables this is fine. On big ones, “run this inner aggregate for each product” can become expensive, which is why we sometimes refactor to joins or CTEs.

Inline view (subquery in `FROM`)

An inline view is just a subquery in the FROM clause. You can treat its output like a temporary table with a name.

SELECT
  dep.category,
  dep.total_revenue
FROM (
  SELECT
    p.category,
    SUM(s.total_amount) AS total_revenue
  FROM assignment.products p
  JOIN assignment.sales s
    ON s.product_id = p.product_id
  GROUP BY p.category
) AS dep
WHERE dep.total_revenue > 10000;

Mentally:

The inner query computes one row per category with total_revenue.
We give that result the alias dep.
The outer query simply filters that temporary table to categories over a threshold.

This is the same idea you’ll see later with CTEs, just without the WITH keyword.

When Subqueries Shine (and When They Hurt)

Subqueries are great tools when:

You want existence checks: WHERE EXISTS (SELECT 1 FROM ... ) to answer “does this related row exist?”
You need single‑value lookups: comparing against global MAX(), AVG(), or a config row.
You prefer contained filtering logic: WHERE x IN (SELECT ...) instead of more joins.

For example, a simple existence check:

SELECT c.customer_id, c.first_name
FROM assignment.customers c
WHERE EXISTS (
  SELECT 1
  FROM assignment.sales s
  WHERE s.customer_id = c.customer_id
);

The inner query returns at least one row if the customer has any sales. EXISTS doesn’t care what the row contains, just whether it exists.

But there are downsides:

Correlated subqueries can be slow, because they are logically evaluated per outer row. For a million products, that can mean a million inner aggregations.
Deep nesting makes queries hard to read and debug.
Refactoring is painful when logic is split across many layers of parentheses.

As a heuristic: use subqueries for small, sharp problems. When logic grows or the data set is large, move toward CTEs or joins.

CTEs: Naming Your Thought Process

A CTE (Common Table Expression) is a named subquery defined at the top of your statement using WITH. It only exists for that one query, but it gives you a clear label for each step.

Basic pattern:

WITH customer_totals AS (
  SELECT
    customer_id,
    SUM(total_amount) AS total_spent
  FROM assignment.sales
  GROUP BY customer_id
)
SELECT
  c.customer_id,
  c.first_name,
  ct.total_spent
FROM assignment.customers c
JOIN customer_totals ct
  ON ct.customer_id = c.customer_id
WHERE ct.total_spent > 1000;

Read it like this:

customer_totals CTE: “Build a small table that has one row per customer with their total_spent.”
Main query: “Join that table to customers and filter on total_spent.”

Instead of one big tangled query, you now have a clear two‑step story.

Chained / multiple CTEs

You can define several CTEs, where later ones depend on earlier ones. This feels a lot like building a small data pipeline.

WITH base_sales AS (
  SELECT *
  FROM assignment.sales
  WHERE sale_date >= DATE '2023-01-01'
),
product_totals AS (
  SELECT
    product_id,
    SUM(total_amount) AS total_revenue
  FROM base_sales
  GROUP BY product_id
)
SELECT *
FROM product_totals
WHERE total_revenue > 2000;

Here:

base_sales isolates “sales from 2023 onward.”
product_totals aggregates only that subset.
The final query filters heavy hitters.

Each step is small and named, so it’s easy to tweak just one without touching the others.

Recursive CTE (hierarchies)

Recursive CTEs are designed for hierarchical data: org charts, category trees, folder structures.

-- employees(emp_id, manager_id, full_name)
WITH RECURSIVE manager_chain AS (
  -- Anchor: start from the employee we care about
  SELECT
    e.emp_id,
    e.manager_id,
    e.full_name,
    0 AS depth
  FROM employees e
  WHERE e.emp_id = 42

  UNION ALL

  -- Recursive step: move one level up each time
  SELECT
    m.emp_id,
    m.manager_id,
    m.full_name,
    mc.depth + 1 AS depth
  FROM employees m
  JOIN manager_chain mc
    ON m.emp_id = mc.manager_id
)
SELECT *
FROM manager_chain
ORDER BY depth;

Walkthrough:

The anchor query grabs the starting employee.
The recursive query says: “Given everyone we already know about in manager_chain, join to their managers and add them to the chain.”
This continues until there’s no higher manager.

You end up with a neat ordered list from “employee” up to “CEO.”

MATERIALIZED vs NOT MATERIALIZED (Postgres)

Postgres can treat CTEs in two ways:

Materialized: run the CTE once, store the result (in memory or on disk), then reuse it.
Not materialized / inlined: conceptually splice the CTE into the main query so the planner can push filters and joins through it.

You can nudge Postgres:

WITH cte AS MATERIALIZED (
  SELECT * FROM big_expensive_view
),
filtered AS (
  SELECT * FROM cte WHERE important_flag = true
)
SELECT * FROM filtered;

Use MATERIALIZED when:

The CTE is expensive but reused multiple times.
You want an optimization fence (no predicate pushdown) for predictable behavior.

Otherwise, let Postgres inline or explicitly use NOT MATERIALIZED and let the planner be clever.

Subqueries vs CTEs: How to Choose

You can often write the same query with either subqueries or CTEs. Choosing is about clarity and performance.

Readability & maintainability

Subqueries are great for tiny, local bits of logic.
CTEs shine when you can describe your query as a sequence of steps.

Performance & planner behavior

Non‑correlated subqueries often optimize similarly to joins.
Correlated subqueries can explode in cost on large tables.
CTEs may be inlined or materialized depending on hints and Postgres version; always check the plan.

Reuse & provenance

If you copy‑paste the same subquery twice, that’s a good signal it wants to be a CTE.
CTEs give you one place to change logic while keeping intent obvious: “these rows came from customer_totals.”

When to reach for JOINs / LATERAL

If your subquery is really “find related rows,” a JOIN is usually clearer. JOIN LATERAL is ideal when you want “top N per row” or “latest row per group.”

Concrete Refactors (Postgres)

Let’s refactor some common patterns so you can see where subqueries, CTEs, and LATERAL each feel natural.

1. Correlated COUNT → CTE + JOIN

Correlated version:

SELECT
  c.customer_id,
  c.first_name,
  (
    SELECT COUNT(*)
    FROM assignment.sales s
    WHERE s.customer_id = c.customer_id
  ) AS order_count
FROM assignment.customers c;

For each customer row, the inner query counts matching sales rows. On a small dataset, no problem. On a large one, this is “run COUNT again and again.”

CTE + JOIN version:

WITH order_counts AS (
  SELECT
    customer_id,
    COUNT(*) AS order_count
  FROM assignment.sales
  GROUP BY customer_id
)
SELECT
  c.customer_id,
  c.first_name,
  COALESCE(oc.order_count, 0) AS order_count
FROM assignment.customers c
LEFT JOIN order_counts oc
  ON oc.customer_id = c.customer_id;

Now:

order_counts computes every customer’s count once.
The final query simply joins that small aggregated table to customers.
Customers with no sales come out as NULL, which we turn into 0 with COALESCE.

2. “Latest row per group” with LATERAL

Correlated style:

SELECT
  c.customer_id,
  c.first_name,
  (
    SELECT s.sale_date
    FROM assignment.sales s
    WHERE s.customer_id = c.customer_id
    ORDER BY s.sale_date DESC
    LIMIT 1
  ) AS last_order_date
FROM assignment.customers c;

Again, one inner query per customer. Let’s make the intent more explicit with LATERAL.

LATERAL refactor:

SELECT
  c.customer_id,
  c.first_name,
  s_last.sale_date AS last_order_date
FROM assignment.customers c
LEFT JOIN LATERAL (
  SELECT sale_date
  FROM assignment.sales s
  WHERE s.customer_id = c.customer_id
  ORDER BY sale_date DESC
  LIMIT 1
) AS s_last ON TRUE;

LATERAL means: “for each row of customers, run this subquery with access to that row’s columns.”

The inner block still does ORDER BY ... LIMIT 1, but it’s clearer that we’re pairing one latest sale row per customer.
We can easily extend it to return more columns (amount, product, etc.) from the latest sale.

3. Recursive CTE for an org chart

Revisiting our recursive example, now you can see how the pieces fit:

Anchor: define the starting point.
Recursive step: define how to move to the “next level.”
Final SELECT: decide what to display.

WITH RECURSIVE manager_chain AS (
  -- Start from a specific employee
  SELECT
    e.emp_id,
    e.manager_id,
    e.full_name,
    0 AS depth
  FROM employees e
  WHERE e.emp_id = 42

  UNION ALL

  -- Then move up the chain to each manager
  SELECT
    m.emp_id,
    m.manager_id,
    m.full_name,
    mc.depth + 1 AS depth
  FROM employees m
  JOIN manager_chain mc
    ON m.emp_id = mc.manager_id
)
SELECT *
FROM manager_chain
ORDER BY depth;

If you print the results, you’ll literally see: depth = 0 for the employee, 1 for their manager, 2 for the manager’s manager, and so on. That step‑by‑step mental picture is what makes recursive CTEs click.

Practical Checklist Before You Ship That Query

Can I name this step? If you can describe a piece of logic (“active 2023 sales”), that’s a good candidate for a CTE.
Is this subquery correlated? If it references outer columns, run EXPLAIN ANALYZE and see if it’s doing repeated work.
Are join/filter columns indexed? Especially customer_id, product_id, and date columns used in subqueries or CTEs.
Am I repeating logic? Repeated subqueries usually want to be a CTE or an inline view.
Do I really need MATERIALIZED? Force it only when you want “compute once, reuse many times” and you accept the temp cost.
Can a JOIN / LATERAL express this more clearly? If yes, it often improves both readability and performance.

Wrapping Up

In the first article, getting a SELECT to run was the win. As your queries grow, the real win is writing SQL that your future self can understand at a glance and your database can execute efficiently.

Subqueries and CTEs are how you get there: they let you structure your thinking, name your steps, and refactor as your logic evolves. The best way to internalize them now is to pick one of your existing “medium‑ugly” queries, rewrite it with a couple of CTEs, and see how much easier it is to read, explain, and tune.

This article was written to help data engineers, from early-career to mid-level, master how and when to use subqueries, CTEs or other alternatives. As always, practice makes perfect, so go ahead: break some eggs! The article was submitted in fulfilment of a LuxDevHQ Cohort 7 DataEngieering assignment ©adev3loper

DEV Community

Level Up Your SQL: Subqueries & CTEs in the Real World

Subqueries: Queries Inside Queries

Common subquery shapes

Scalar subquery

Single‑row subquery

Multi‑row subquery with `IN`

Correlated subquery

Inline view (subquery in `FROM`)

When Subqueries Shine (and When They Hurt)

CTEs: Naming Your Thought Process

Chained / multiple CTEs

Recursive CTE (hierarchies)

MATERIALIZED vs NOT MATERIALIZED (Postgres)

Subqueries vs CTEs: How to Choose

Concrete Refactors (Postgres)

1. Correlated COUNT → CTE + JOIN

2. “Latest row per group” with LATERAL

3. Recursive CTE for an org chart

Practical Checklist Before You Ship That Query

Wrapping Up

Top comments (0)

Subqueries: Queries Inside Queries

Common subquery shapes

Scalar subquery

Single‑row subquery

Multi‑row subquery with IN

Correlated subquery

Inline view (subquery in FROM)

When Subqueries Shine (and When They Hurt)

CTEs: Naming Your Thought Process

Chained / multiple CTEs

Recursive CTE (hierarchies)

MATERIALIZED vs NOT MATERIALIZED (Postgres)

Subqueries vs CTEs: How to Choose

Concrete Refactors (Postgres)

1. Correlated COUNT → CTE + JOIN

2. “Latest row per group” with LATERAL

3. Recursive CTE for an org chart

Practical Checklist Before You Ship That Query

Wrapping Up

Multi‑row subquery with `IN`

Inline view (subquery in `FROM`)