DEV Community

Mohamed Idris
Mohamed Idris

Posted on

Part 10: Subqueries and CTEs

Part of the "SQL: Zero to Ninja" series, written for junior web developers.

Your boss asks: "Show me every user who spent more than the average order." You freeze. To filter by the average, you first need to calculate the average. But that is a whole query on its own. Can you put a query inside another query? Yes. That little trick is called a subquery, and it is about to make you dangerous.

The idea in one line

A subquery is a query inside another query (the inner one runs first and hands its answer to the outer one), and a CTE is just a way to give that inner query a clear name so the whole thing reads like plain English.

The metaphor: a recipe with sub-steps

Think of cooking pasta with sauce. You do not make it all in one motion. You have a sub-step: "first, make the sauce." Once the sauce is ready, you use it in the main dish.

A subquery is that sub-step happening inside the recipe, sometimes a bit buried in the instructions. A CTE is when you write "Sauce" at the top of your prep board with its own little recipe, then just say "add the Sauce" later. Same cooking, way easier to read.

Subquery #1: in WHERE with IN

You want every user who has placed at least one order. The order user ids live in the orders table. So first grab all those ids (the sub-step), then keep users whose id is in that list.

SELECT name
FROM users
WHERE id IN (
  SELECT user_id        -- the sub-step: all user ids that have an order
  FROM orders
);
Enter fullscreen mode Exit fullscreen mode

Read it out loud: "Give me users whose id appears in the list of user_ids from orders." The inner query runs first and produces a list. IN checks membership against that list.

Subquery #2: a scalar subquery (a single value)

Now the boss's question: users who spent more than the average order total. A subquery that returns exactly one value (one row, one column) is called a scalar subquery. You can drop it right where a number would go.

SELECT name, total
FROM orders
JOIN users ON users.id = orders.user_id
WHERE total > (
  SELECT AVG(total)     -- one single number, the average order total
  FROM orders
);
Enter fullscreen mode Exit fullscreen mode

The inner SELECT AVG(total) returns one number, say 45.00. SQL swaps it in, so the line becomes WHERE total > 45.00. The sauce is made, now use it.

Subquery #3: in FROM (a derived table)

You can also use a subquery as if it were a temporary table. This is called a derived table. Say you want the count of orders per user, then only the busy users:

SELECT user_id, order_count
FROM (
  SELECT user_id, COUNT(*) AS order_count   -- a mini-table built on the fly
  FROM orders
  GROUP BY user_id
) AS user_orders
WHERE order_count > 3;
Enter fullscreen mode Exit fullscreen mode

The inner query builds a little table with two columns. The outer query then treats it like any normal table and filters it. Note the AS user_orders: a derived table needs a name.

A quick word on correlated subqueries

Most subqueries run once. A correlated subquery is different: it points back at the outer query, so it runs again for every single row. See the o2.user_id = o1.user_id link below:

SELECT o1.id, o1.total
FROM orders o1
WHERE o1.total > (
  SELECT AVG(o2.total)
  FROM orders o2
  WHERE o2.user_id = o1.user_id   -- depends on the current outer row
);
Enter fullscreen mode Exit fullscreen mode

This finds orders bigger than that user's own average. Handy, but careful: if you have a million rows, that inner query runs a million times. It can get slow.

CTEs: the same thing, but readable

When subqueries pile up, your query turns into a nest of parentheses that nobody wants to read. A CTE (Common Table Expression) uses the WITH keyword to name a sub-step up front, like writing "Sauce" on your prep board.

Here is a nested mess:

-- Hard to read: where does one query end and the next begin?
SELECT name
FROM users
WHERE id IN (
  SELECT user_id
  FROM orders
  WHERE total > (SELECT AVG(total) FROM orders)
);
Enter fullscreen mode Exit fullscreen mode

Now the same logic with CTEs. Read it top to bottom like a story:

WITH avg_order AS (
  SELECT AVG(total) AS amount FROM orders
),
big_spenders AS (
  SELECT DISTINCT user_id
  FROM orders, avg_order
  WHERE total > avg_order.amount
)
SELECT name
FROM users
WHERE id IN (SELECT user_id FROM big_spenders);
Enter fullscreen mode Exit fullscreen mode

First we name the average avg_order. Then we name the users who beat it big_spenders. Then the final query just reads "give me the names of the big spenders." Each step has a clear label.

When to use which

  • Subquery: quick, one-off, used in one spot. A simple IN or scalar check.
  • CTE: when the query gets hard to read, or when you want to reuse the same sub-result more than once. A CTE can be referenced again and again. A buried subquery has to be copy-pasted.

Both give the same answers. The difference is how kind you are being to the next person who reads your code (often future you).

Gotchas

  1. Scalar subquery that returns more than one row. If you write WHERE total > (SELECT total FROM orders) and that inner query returns many rows, the database errors out. A > wants one value. Use AVG, MAX, or add a LIMIT 1.
  2. Correlated subqueries can be slow. They run per row. If a query crawls, check whether the inner part secretly depends on the outer row, and consider a JOIN or CTE instead.
  3. Forgetting to name a derived table. A subquery in FROM must have an alias (AS user_orders), or SQL complains.

Recap

  • A subquery is a query inside a query. The inner one runs first.
  • IN subquery: filter by a list of values from another table.
  • Scalar subquery: returns one value, drop it where a number goes.
  • Derived table: a subquery in FROM, used like a temporary table (needs a name).
  • Correlated subquery: runs per row, can be slow.
  • A CTE (WITH name AS (...)) names a sub-step so the query reads top to bottom.

Your turn

Write a query that finds every product that has never been ordered. Hint: the ordered product ids live in order_items, and you want products whose id is not in that list. Try it first as a subquery, then rewrite it as a CTE. If you can explain why the CTE reads easier, you have got it.

That wraps the core querying skills. Next up: Part 11: Indexes, where we make all these queries run fast.

Top comments (1)

Collapse
 
edriso profile image
Mohamed Idris

Part 10 Practice: Subqueries and CTEs

Let's put queries inside queries, then clean them up. These use our shared schema: users, orders, products, order_items. Try each before looking.


1. IN subquery

Find the names of all users who have placed at least one order.

Solution

SELECT name
FROM users
WHERE id IN (
  SELECT user_id
  FROM orders
);
Enter fullscreen mode Exit fullscreen mode

Why: the inner query builds a list of every user_id that has an order. IN keeps the users whose id shows up in that list.


2. NOT IN subquery (find what is missing)

Find every product that has never been ordered.

Solution

SELECT name
FROM products
WHERE id NOT IN (
  SELECT product_id
  FROM order_items
);
Enter fullscreen mode Exit fullscreen mode

Why: order_items holds every product that was ordered. NOT IN flips it: keep products whose id is not in that list. (Heads up: if product_id can be NULL, NOT IN gets weird. Cleaner schemas avoid that.)


3. Scalar subquery (compare to the average)

Find all orders whose total is greater than the average order total.

Solution

SELECT id, total
FROM orders
WHERE total > (
  SELECT AVG(total)
  FROM orders
);
Enter fullscreen mode Exit fullscreen mode

Why: the inner query returns one number, the average. SQL swaps it in, so you compare each order's total against that single value.


4. Subquery in FROM (derived table)

Show each user_id along with how many orders they have, but only for users with more than 2 orders.

Solution

SELECT user_id, order_count
FROM (
  SELECT user_id, COUNT(*) AS order_count
  FROM orders
  GROUP BY user_id
) AS counts
WHERE order_count > 2;
Enter fullscreen mode Exit fullscreen mode

Why: you cannot filter on COUNT(*) in a plain WHERE. So you build the counts as a derived table first, name it counts, then filter that. (A HAVING clause also works, this shows the derived-table way.)


5. Rewrite as a CTE

Take this nested query and rewrite it using a CTE so it reads cleanly.

SELECT name
FROM users
WHERE id IN (
  SELECT user_id
  FROM orders
  WHERE total > (SELECT AVG(total) FROM orders)
);
Enter fullscreen mode Exit fullscreen mode

Solution

WITH avg_order AS (
  SELECT AVG(total) AS amount FROM orders
),
big_spenders AS (
  SELECT DISTINCT user_id
  FROM orders, avg_order
  WHERE total > avg_order.amount
)
SELECT name
FROM users
WHERE id IN (SELECT user_id FROM big_spenders);
Enter fullscreen mode Exit fullscreen mode

Why: each step gets a clear name (avg_order, then big_spenders), so you read it top to bottom instead of untangling nested parentheses.


6. CTE you reuse twice

Using a CTE, find users who placed more orders than the average number of orders per user. (Hint: name the per-user counts once, then use them twice.)

Solution

WITH order_counts AS (
  SELECT user_id, COUNT(*) AS cnt
  FROM orders
  GROUP BY user_id
)
SELECT user_id, cnt
FROM order_counts
WHERE cnt > (SELECT AVG(cnt) FROM order_counts);
Enter fullscreen mode Exit fullscreen mode

Why: order_counts is referenced twice, once for the rows and once to compute the average. With a plain subquery you would have to write that counting logic twice. The CTE lets you name it once and reuse it.


Great job. If you can take a tangled nested query and turn it into a clean WITH block, your future self (and your teammates) will thank you. On to Part 11: Indexes.