Joan Wambui

Posted on Apr 21

CTEs, Subqueries, and Query Optimisation in SQL

#database #tutorial #performance #sql

If you have been writing SQL for a while, you have almost certainly run into a moment where a query starts to feel unwieldy. You need to filter based on an aggregation, or you need to reference a result you just computed, and suddenly your query is nested three levels deep and impossible to read. That is where CTEs and subqueries come in. Both solve the same core problem: referencing intermediate results within a larger query. But they do it differently, and knowing when to reach for each one will make you a sharper, more deliberate SQL writer.

Subqueries

A subquery is a query written inside another query. It is enclosed in parentheses and can appear in several places: inside a WHERE clause, inside a FROM clause, or inside a SELECT clause.

Subquery in a WHERE clause

This is the most common use. You filter the outer query based on a result computed by the inner query.

SELECT product_name, price
FROM products
WHERE price > (
  SELECT AVG(price)
  FROM products
);

The inner query runs first, computes the average price across all products, and hands that single value to the outer query. The outer query then filters using it.

Subquery in a FROM clause

Here, the inner query acts as a temporary table that the outer query reads from. This is sometimes called a derived table.

SELECT customer_name, total_spent
FROM (
  SELECT customer_id, SUM(total_amount) AS total_spent
  FROM sales
  GROUP BY customer_id
) AS customer_totals
WHERE total_spent > 5000;

Notice the alias customer_totals after the closing parenthesis. This is required in most databases when you use a subquery in a FROM clause.

Correlated subquery

A correlated subquery is one that references a column from the outer query. It re-executes for every row the outer query processes.

SELECT product_name, price, category
FROM products p
WHERE price > (
  SELECT AVG(price)
  FROM products
  WHERE category = p.category
);

Here, the inner query cannot run independently. It depends on p.category from the outer query, so it runs once per row. This is powerful but can be slow on large tables.

Common Table Expressions (CTEs)

A Common Table Expression, defined with the WITH keyword, lets you name an intermediate result and reference it by that name later in the same query. Think of it as a named, temporary result set that exists only for the duration of the query.

WITH customer_spending AS (
  SELECT customer_id, SUM(total_amount) AS total_spent
  FROM sales
  GROUP BY customer_id
)
SELECT c.first_name, c.last_name, cs.total_spent
FROM customers c
INNER JOIN customer_spending cs ON c.customer_id = cs.customer_id
WHERE cs.total_spent > 5000;

The WITH block defines customer_spending. The main query below it then joins against it as if it were a real table.

Chaining multiple CTEs

One of the most useful features of CTEs is that you can define several of them in sequence, and each one can reference the ones defined before it.

WITH product_sales AS (
  SELECT product_id, SUM(quantity_sold) AS total_qty
  FROM sales
  GROUP BY product_id
),
avg_sales AS (
  SELECT AVG(total_qty) AS avg_qty
  FROM product_sales
)
SELECT p.product_name, ps.total_qty
FROM products p
INNER JOIN product_sales ps ON p.product_id = ps.product_id
WHERE ps.total_qty > (SELECT avg_qty FROM avg_sales);

avg_sales is built directly on top of product_sales. This kind of step-by-step construction would be much harder to express clearly with nested subqueries.

Recursive CTEs

CTEs also support recursion, which is useful for querying hierarchical data such as org charts, file structures, or category trees. The CTE references itself until a termination condition is met.

WITH RECURSIVE employee_hierarchy AS (
  SELECT employee_id, name, manager_id, 1 AS level
  FROM employees
  WHERE manager_id IS NULL

  UNION ALL

  SELECT e.employee_id, e.name, e.manager_id, eh.level + 1
  FROM employees e
  INNER JOIN employee_hierarchy eh ON e.manager_id = eh.employee_id
)
SELECT * FROM employee_hierarchy ORDER BY level;

Recursive CTEs are one of the few things CTEs can do that subqueries fundamentally cannot replicate cleanly.

CTEs vs Subqueries: Key Differences

Understanding the difference between the two goes beyond syntax. They reflect a different way of structuring your thinking.

Readability

Subqueries, especially nested ones, are read from the inside out. The deeper the nesting, the harder it is to follow. CTEs read from top to bottom, like steps in a recipe. When a query has multiple intermediate stages, CTEs are almost always clearer.

Reusability within a query

A subquery can only be used in the one place where it is written. If you need the same intermediate result in two different parts of your query, you have to write the subquery twice. A CTE can be referenced multiple times within the same query after it is defined once.

WITH high_value_sales AS (
  SELECT customer_id, SUM(total_amount) AS total_spent
  FROM sales
  GROUP BY customer_id
  HAVING SUM(total_amount) > 10000
)
SELECT COUNT(*) AS total_high_value_customers FROM high_value_sales
UNION ALL
SELECT SUM(total_spent) AS combined_revenue FROM high_value_sales;

Performance

This is more nuanced. In most modern databases like PostgreSQL, MySQL 8+, and SQL Server, the optimiser treats CTEs and subqueries similarly in many cases. However, some databases materialise CTEs (compute and store the result once), while subqueries may be inlined and optimised as part of the surrounding query. In PostgreSQL prior to version 12, CTEs were always materialised, which could actually make them slower in some situations. From PostgreSQL 12 onward, the optimiser can choose whether to materialise or inline a CTE. The practical takeaway is: do not choose one over the other for performance reasons without testing on your specific database version.

Correlated logic

Correlated subqueries are uniquely suited to row-by-row comparisons, like checking whether a value exceeds an average within its own group. CTEs cannot be correlated in the same way because they are evaluated independently of the outer query.

When to use each

Use a subquery when:

The logic is simple and self-contained
You need correlated row-by-row comparisons
The intermediate result is only needed in one place
You want to keep the query concise for a single-step filter

Use a CTE when:

You have multiple intermediate steps
You need to reference the same intermediate result more than once
You are working with recursive or hierarchical data
Readability and maintainability matter (which is most of the time)

Optimising SQL Queries

Writing a query that returns the right answer is the first goal. Writing one that does it efficiently is the second. Here are the most impactful methods for optimising SQL queries.

1. Use indexes wisely

An index is a data structure that allows the database to find rows much faster than scanning the entire table. Always ensure that columns used frequently in WHERE, JOIN, ORDER BY, and GROUP BY clauses are indexed.

-- Without an index on sale_date, this scans every row
SELECT * FROM sales WHERE sale_date = '2023-06-01';

-- Creating an index
CREATE INDEX idx_sales_date ON sales(sale_date);

Be careful not to over-index. Every index adds overhead to INSERT, UPDATE, and DELETE operations because the index must be updated alongside the data.

2. Avoid SELECT *

Selecting all columns forces the database to read and transfer every column, even those you do not need. Always specify only the columns your query requires.

-- Avoid
SELECT * FROM customers;

-- Preferred
SELECT customer_id, first_name, last_name FROM customers;

This reduces the amount of data read from disk and transferred across the network.

3. Filter early

The further upstream you apply filters, the less data each subsequent step has to process. In joins, filter the data before joining where possible.

-- Less efficient: joins all sales, then filters
SELECT c.first_name, s.total_amount
FROM customers c
INNER JOIN sales s ON c.customer_id = s.customer_id
WHERE EXTRACT(YEAR FROM s.sale_date) = 2023;

-- More efficient: pre-filter in a subquery or CTE
WITH sales_2023 AS (
  SELECT customer_id, total_amount
  FROM sales
  WHERE EXTRACT(YEAR FROM sale_date) = 2023
)
SELECT c.first_name, s.total_amount
FROM customers c
INNER JOIN sales_2023 s ON c.customer_id = s.customer_id;

4. Avoid functions on indexed columns in WHERE clauses

Wrapping a column in a function in a WHERE clause prevents the database from using any index on that column. The function has to be applied to every row before filtering can happen.

-- This cannot use an index on sale_date
WHERE EXTRACT(YEAR FROM sale_date) = 2023

-- This can use a range index on sale_date
WHERE sale_date BETWEEN '2023-01-01' AND '2023-12-31'

5. Use LIMIT when exploring data

When you are investigating data or developing a query, add LIMIT to avoid accidentally scanning millions of rows.

SELECT * FROM sales ORDER BY sale_date DESC LIMIT 100;

Remove it only when you need the full result set.

6. Prefer EXISTS over IN for large subquery results

When checking whether a related row exists, EXISTS stops as soon as it finds the first match. IN collects all results from the subquery first and then checks membership. On large datasets, EXISTS is typically faster.

-- Using IN
SELECT first_name, last_name
FROM customers
WHERE customer_id IN (SELECT customer_id FROM sales);

-- Using EXISTS (generally faster on large tables)
SELECT first_name, last_name
FROM customers c
WHERE EXISTS (
  SELECT 1 FROM sales s WHERE s.customer_id = c.customer_id
);

7. Use EXPLAIN / EXPLAIN ANALYZE

Before and after any optimisation, use your database's query plan tool to understand what is actually happening. In PostgreSQL:

EXPLAIN ANALYZE
SELECT p.product_name, SUM(s.total_amount)
FROM products p
INNER JOIN sales s ON p.product_id = s.product_id
GROUP BY p.product_name;

The output shows whether indexes are being used, how many rows are being scanned, and where the most time is spent. This is the single most reliable way to confirm whether an optimisation is actually helping.

8. Avoid correlated subqueries on large tables

As noted earlier, a correlated subquery runs once per row in the outer query. On a table with 100,000 rows, that means 100,000 subquery executions. Wherever possible, rewrite correlated subqueries as joins or CTEs.

-- Correlated subquery: runs once per customer row
SELECT first_name,
  (SELECT SUM(total_amount) FROM sales WHERE customer_id = c.customer_id) AS total_spent
FROM customers c;

-- Equivalent join: far more efficient
SELECT c.first_name, SUM(s.total_amount) AS total_spent
FROM customers c
LEFT JOIN sales s ON c.customer_id = s.customer_id
GROUP BY c.customer_id, c.first_name;

9. Use appropriate JOIN types

Inner joins are generally faster than outer joins because they return fewer rows and allow the optimiser more flexibility. Use LEFT JOIN or RIGHT JOIN only when you genuinely need to retain unmatched rows.

10. Aggregate before joining where possible

If you need to join an aggregated result to another table, aggregate first and then join, rather than joining first and then aggregating. This reduces the number of rows the join has to process.

-- Join first, aggregate after (more rows to process during join)
SELECT c.first_name, SUM(s.total_amount)
FROM customers c
INNER JOIN sales s ON c.customer_id = s.customer_id
GROUP BY c.customer_id, c.first_name;

-- Aggregate first, then join (fewer rows in the join)
WITH customer_totals AS (
  SELECT customer_id, SUM(total_amount) AS total_spent
  FROM sales
  GROUP BY customer_id
)
SELECT c.first_name, ct.total_spent
FROM customers c
INNER JOIN customer_totals ct ON c.customer_id = ct.customer_id;

Putting It All Together

CTEs and subqueries are not competing tools; they are complementary. Subqueries are precise and compact for simple, single-use logic. CTEs shine when a query has multiple steps, when the same intermediate result is needed more than once, or when the goal is to write something a colleague can read and understand without guessing. Query optimisation, meanwhile, is not about memorising a checklist; it is about understanding what the database is doing and removing unnecessary work at every stage. The combination of clean, readable structure and deliberate performance thinking is what separates functional SQL from genuinely good SQL.

DEV Community