RaggedGent-io

Posted on Apr 26 • Edited on Apr 27

SQL Subqueries vs CTEs: A Complete Guide for Data Analysts

#sql #datascience #database #tutorial

SQL Subqueries vs CTEs: A Complete Guide for Data Analysts

If you've been writing SQL for a while, you've probably run into situations where a single SELECT statement just doesn't cut it. You need to filter based on aggregated data, break a complex query into logical steps, or reuse a derived dataset multiple times in the same query.

That's where subqueries and CTEs (Common Table Expressions) come in.

In this article, I'll walk you through both concepts clearly what they are, the different types, when to use each, and how they compare in performance and readability. By the end, you'll know exactly which tool to reach for depending on the situation.

What Is a Subquery?

A subquery (also called an inner query or nested query) is a SELECT statement written inside another SQL statement. The outer query uses the result of the inner query as part of its logic.

Think of it like a function call: the inner query runs first, returns a result, and the outer query uses that result.

Basic Example

Suppose you have an orders table and you want to find all customers who placed orders above the average order value:

SELECT customer_id, order_value
FROM orders
WHERE order_value > (
    SELECT AVG(order_value)
    FROM orders
);

Here, SELECT AVG(order_value) FROM orders is the subquery. It computes the average, and the outer query filters rows against it.

Types of Subqueries

Subqueries come in several forms depending on where they appear and how they relate to the outer query.

1. Scalar Subquery

Returns a single value (one row, one column). Used in SELECT, WHERE, or HAVING clauses.

SELECT 
    product_name,
    price,
    (SELECT AVG(price) FROM products) AS avg_price
FROM products;

This adds the overall average price as a column alongside each product row.

2. Column Subquery (Multi-Row Subquery)

Returns one column with multiple rows. Typically used with IN, ANY, or ALL.

SELECT customer_name
FROM customers
WHERE customer_id IN (
    SELECT customer_id
    FROM orders
    WHERE order_date >= '2024-01-01'
);

This returns the names of customers who placed at least one order in 2024.

3. Row Subquery

Returns one row with multiple columns. Useful for row-level comparisons.

SELECT *
FROM employees
WHERE (department_id, salary) = (
    SELECT department_id, MAX(salary)
    FROM employees
    WHERE department_id = 3
    GROUP BY department_id
);

4. Table Subquery (Derived Table)

Returns a full result set (multiple rows and columns) used in the FROM clause. Often called a derived table.

SELECT dept_summary.department, dept_summary.avg_salary
FROM (
    SELECT department, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department
) AS dept_summary
WHERE dept_summary.avg_salary > 50000;

The inner query builds a temporary table, and the outer query filters it further.

5. Correlated Subquery

This is the most powerful and most expensive type. A correlated subquery references a column from the outer query, meaning it runs once per row of the outer query.

SELECT e.employee_name, e.salary, e.department
FROM employees e
WHERE e.salary > (
    SELECT AVG(salary)
    FROM employees
    WHERE department = e.department
);

Here, e.department is from the outer query. For every employee row, the subquery recalculates the department average. This is powerful but can be slow on large datasets.

When Should You Use Subqueries?

Subqueries are the right tool when:

You need a quick, one-off filter based on an aggregated value (e.g., above average)
The nested logic is simple and short keeping it inline is readable enough
You're using IN, EXISTS, or NOT EXISTS to filter against another table's result
The subquery is not reused elsewhere in the same query
You're writing in an environment that doesn't support CTEs (older MySQL versions, for example)

Use EXISTS instead of IN for large datasets it short-circuits as soon as a match is found:

-- Faster than IN for large tables
SELECT customer_name
FROM customers c
WHERE EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.customer_id
);

What Are CTEs (Common Table Expressions)?

A CTE is a named temporary result set defined using the WITH keyword at the beginning of a query. You can think of it like creating a named "view" that only lives for the duration of that single query.

Basic Syntax

WITH cte_name AS (
    -- Your inner query here
    SELECT column1, column2
    FROM some_table
    WHERE condition
)
SELECT *
FROM cte_name;

Example

WITH high_value_orders AS (
    SELECT customer_id, SUM(order_value) AS total_spent
    FROM orders
    GROUP BY customer_id
    HAVING SUM(order_value) > 10000
)
SELECT c.customer_name, h.total_spent
FROM customers c
JOIN high_value_orders h ON c.customer_id = h.customer_id;

The CTE high_value_orders is defined once and then used in the main query like a regular table.

Types and Use Cases of CTEs

1. Simple (Non-Recursive) CTE

The standard form: it defines a named result set used once or multiple times in the main query.

Use case: Cleaning up complex FROM clause logic, improving readability.

WITH active_products AS (
    SELECT product_id, product_name, price
    FROM products
    WHERE is_active = TRUE
)
SELECT * FROM active_products WHERE price < 500;

2. Multiple CTEs

You can chain multiple CTEs in a single WITH block, separated by commas.

Use case: Breaking a complex query into clear, logical steps.

WITH 
sales_2024 AS (
    SELECT product_id, SUM(quantity) AS total_units
    FROM sales
    WHERE YEAR(sale_date) = 2024
    GROUP BY product_id
),
product_info AS (
    SELECT product_id, product_name, category
    FROM products
    WHERE is_active = TRUE
)
SELECT p.product_name, p.category, s.total_units
FROM product_info p
JOIN sales_2024 s ON p.product_id = s.product_id
ORDER BY s.total_units DESC;

Each CTE builds on the previous ones, making the logic easy to follow step by step.

3. Recursive CTE

A recursive CTE is one that references itself. It's used to traverse hierarchical or tree-structured data like org charts, category trees, or file directories.

Structure:

WITH RECURSIVE cte_name AS (
    -- Anchor member (base case)
    SELECT ...
    FROM table
    WHERE condition

    UNION ALL

    -- Recursive member (references the CTE itself)
    SELECT ...
    FROM table
    JOIN cte_name ON ...
)
SELECT * FROM cte_name;

Use case: Traversing an employee hierarchy

WITH RECURSIVE org_chart AS (
    -- Start with the CEO (no manager)
    SELECT employee_id, employee_name, manager_id, 1 AS level
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    -- Add each employee's direct reports
    SELECT e.employee_id, e.employee_name, e.manager_id, oc.level + 1
    FROM employees e
    JOIN org_chart oc ON e.manager_id = oc.employee_id
)
SELECT * FROM org_chart ORDER BY level;

This walks down the entire management tree level by level, something impossible to do cleanly with a regular subquery.

Subqueries vs CTEs: A Clear Comparison

Factor	Subquery	CTE
Readability	Can get deeply nested and hard to follow	Highly readable; named blocks read like prose
Reusability	Must be duplicated if needed more than once	Defined once, referenced many times
Recursion	Not supported	Supported (Recursive CTEs)
Debugging	Hard; inner queries cannot be isolated	Easy; each CTE can be run independently
Performance	Varies; correlated subqueries can be slow	Generally similar; optimizer often treats them the same
Scope	Exists inline within the query	Scoped to the entire query block
Compatibility	Supported in virtually all SQL versions	Requires SQL:1999+ (most modern databases support it)

Performance: Does It Actually Matter?

This is a common question and the honest answer is: it depends on your database engine.

In most modern databases (PostgreSQL, SQL Server, BigQuery, MySQL 8+), the query optimizer often treats CTEs and subqueries equivalently. It may inline a CTE and execute it the same way it would a derived table subquery.

However, there are important nuances:

PostgreSQL pre-v12 treated CTEs as optimization fences they were always materialized (computed once and stored), which could either help or hurt performance depending on the query. From v12 onwards, the optimizer decides whether to inline or materialize.
Correlated subqueries are almost always slower than their CTE equivalent because they execute repeatedly per row.
Multiple references to the same subquery are inefficient. A CTE avoids this by computing the result once.

Rule of thumb: Don't optimize prematurely. Write for readability first using CTEs. If you have a performance problem, measure with EXPLAIN and optimize from there.

Readability: Where CTEs Win

Compare these two approaches to the same problem finding the top customers per region:

With a nested subquery:

SELECT region, customer_name, total_spent
FROM (
    SELECT c.region, c.customer_name, SUM(o.order_value) AS total_spent,
           RANK() OVER (PARTITION BY c.region ORDER BY SUM(o.order_value) DESC) AS rnk
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.region, c.customer_name
) ranked
WHERE rnk = 1;

With a CTE:

WITH customer_totals AS (
    SELECT c.region, c.customer_name, SUM(o.order_value) AS total_spent
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.region, c.customer_name
),
ranked_customers AS (
    SELECT *, RANK() OVER (PARTITION BY region ORDER BY total_spent DESC) AS rnk
    FROM customer_totals
)
SELECT region, customer_name, total_spent
FROM ranked_customers
WHERE rnk = 1;

Both return the same result. But the CTE version is self-documenting each step has a name that explains what it does. Any teammate (or future you) can read it top to bottom like a story.

When to Use Each: A Decision Guide

Use a subquery when:

The logic is simple and short (a single filter or lookup)
It's only used once and doesn't need a name
You're using EXISTS, NOT EXISTS, IN, or NOT IN
You want maximum compatibility across older SQL environments

Use a CTE when:

The logic is complex or involves multiple steps
You need to reference the same derived dataset more than once
You want your query to be easily readable and debuggable
You're working with hierarchical data (recursive CTE)
You're collaborating with a team and code clarity matters

Final Thoughts

Subqueries and CTEs both solve the same core problem: letting you build queries on top of queries. The difference lies in how you express that logic.

Subqueries are concise and widely supported, making them great for simple, inline filtering. CTEs are structured and readable, making them essential for anything complex, multi-step, or hierarchical.

As you grow in your SQL journey, you'll find yourself naturally reaching for CTEs more often, not just because they're more readable, but because they make your thinking explicit. A well-named CTE is documentation built right into your query.

The best SQL isn't just correct, it is clear.

If this helped you, drop a like or leave a comment below. I'd love to know how you use subqueries and CTEs in your own work, Thank you.

DEV Community

SQL Subqueries vs CTEs: A Complete Guide for Data Analysts

SQL Subqueries vs CTEs: A Complete Guide for Data Analysts

What Is a Subquery?

Basic Example

Types of Subqueries

1. Scalar Subquery

2. Column Subquery (Multi-Row Subquery)

3. Row Subquery

4. Table Subquery (Derived Table)

5. Correlated Subquery

When Should You Use Subqueries?

What Are CTEs (Common Table Expressions)?

Basic Syntax

Example

Types and Use Cases of CTEs

1. Simple (Non-Recursive) CTE

2. Multiple CTEs

3. Recursive CTE

Subqueries vs CTEs: A Clear Comparison

Performance: Does It Actually Matter?

Readability: Where CTEs Win

When to Use Each: A Decision Guide

Final Thoughts

Top comments (0)