Akhilesh

Posted on May 1

Subqueries and CTEs: SQL Gets Readable

#beginners #database #sql #ai

You learned joins. You can connect tables.

Now you have a problem that requires multiple steps. First find the average salary per department. Then find employees who earn above their department average. One query cannot do this directly because you need the result of one calculation as input to another.

Two solutions exist. Subqueries nest one query inside another. CTEs give each step a name and build them sequentially.

Both solve the same problem. CTEs do it without making you want to delete your laptop.

Subqueries: A Query Inside a Query

A subquery is a SELECT statement wrapped in parentheses and embedded inside another query.

import sqlite3
import pandas as pd

conn = sqlite3.connect("company.db")

conn.executescript("""
    DROP TABLE IF EXISTS employees;
    DROP TABLE IF EXISTS departments;

    CREATE TABLE departments (
        dept_id   INTEGER PRIMARY KEY,
        dept_name TEXT,
        budget    INTEGER
    );

    CREATE TABLE employees (
        emp_id    INTEGER PRIMARY KEY,
        name      TEXT,
        salary    INTEGER,
        dept_id   INTEGER,
        hire_year INTEGER
    );

    INSERT INTO departments VALUES
        (1, 'Engineering', 500000),
        (2, 'Marketing',   300000),
        (3, 'Sales',       400000);

    INSERT INTO employees VALUES
        (1, 'Alex',   55000, 1, 2022),
        (2, 'Priya',  82000, 2, 2021),
        (3, 'Sam',    43000, 1, 2023),
        (4, 'Jordan', 95000, 3, 2019),
        (5, 'Lisa',   67000, 2, 2022),
        (6, 'Ravi',   71000, 1, 2021),
        (7, 'Tom',    88000, 3, 2018),
        (8, 'Nina',   59000, 1, 2023);
""")
conn.commit()

def q(sql):
    return pd.read_sql_query(sql, conn)

Find employees who earn above the company average:

result = q("""
    SELECT name, salary
    FROM employees
    WHERE salary > (SELECT AVG(salary) FROM employees)
    ORDER BY salary DESC;
""")
print(result)
print(f"\nCompany average: {q('SELECT AVG(salary) FROM employees').iloc[0,0]:,.0f}")

Output:

    name  salary
0 Jordan   95000
1    Tom   88000
2  Priya   82000
3   Ravi   71000

Company average: 70,000

The inner query (SELECT AVG(salary) FROM employees) runs first, produces 70000, and the outer query uses that number in its WHERE clause. Two separate operations written as one.

Subqueries in FROM: Derived Tables

A subquery in the FROM clause creates a temporary table you can query against.

result = q("""
    SELECT
        dept_id,
        name,
        salary,
        dept_avg,
        ROUND(salary - dept_avg, 0) AS vs_dept_avg
    FROM (
        SELECT
            e.emp_id,
            e.name,
            e.salary,
            e.dept_id,
            AVG(e.salary) OVER (PARTITION BY e.dept_id) AS dept_avg
        FROM employees e
    ) sub
    ORDER BY dept_id, salary DESC;
""")
print(result)

The subquery labeled sub computes department averages alongside individual rows. The outer query then selects from that result. This is called a derived table or inline view.

Correlated Subqueries: Referencing the Outer Query

A correlated subquery references a column from the outer query. It runs once per row of the outer query.

result = q("""
    SELECT
        e.name,
        e.salary,
        e.dept_id,
        (SELECT AVG(salary)
         FROM employees
         WHERE dept_id = e.dept_id) AS dept_avg
    FROM employees e
    ORDER BY e.dept_id, e.salary DESC;
""")
print(result.round(0))

Output:

    name  salary  dept_id  dept_avg
0   Ravi   71000        1   57000.0
1   Alex   55000        1   57000.0
2    Sam   43000        1   57000.0
3   Nina   59000        1   57000.0
4  Priya   82000        2   74500.0
5   Lisa   67000        2   74500.0
6 Jordan   95000        3   91500.0
7    Tom   88000        3   91500.0

The WHERE dept_id = e.dept_id part references e.dept_id from the outer query. For each row, the subquery calculates the average for that specific employee's department. Correlated subqueries are powerful but can be slow on large tables because they run for every row.

CTEs: The Better Way

CTE stands for Common Table Expression. It lets you give a subquery a name and reference it like a table. The syntax uses the WITH keyword.

result = q("""
    WITH dept_averages AS (
        SELECT
            dept_id,
            AVG(salary) AS avg_salary
        FROM employees
        GROUP BY dept_id
    )
    SELECT
        e.name,
        e.salary,
        e.dept_id,
        d.avg_salary AS dept_avg,
        ROUND(e.salary - d.avg_salary, 0) AS above_avg
    FROM employees e
    JOIN dept_averages d ON e.dept_id = d.dept_id
    WHERE e.salary > d.avg_salary
    ORDER BY above_avg DESC;
""")
print(result)

Output:

    name  salary  dept_id  dept_avg  above_avg
0 Jordan   95000        3   91500.0     3500.0
1   Ravi   71000        1   57000.0    14000.0
2  Priya   82000        2   74500.0     7500.0
3    Tom   88000        3   91500.0    -3500.0

Wait, Tom earns less than the dept average but appears? No, look again. Tom earns 88000 and the Sales average is 91500. Tom is below average. Only Jordan (95000 vs 91500), Ravi (71000 vs 57000), and Priya (82000 vs 74500) are above their department average.

The CTE dept_averages is defined once, named clearly, and used in the main query like any other table. The logic is readable. You can see exactly what each piece does.

Chaining Multiple CTEs

CTEs chain together. Each one can reference those defined before it.

result = q("""
    WITH

    dept_stats AS (
        SELECT
            dept_id,
            COUNT(*)     AS headcount,
            AVG(salary)  AS avg_salary,
            SUM(salary)  AS total_salary
        FROM employees
        GROUP BY dept_id
    ),

    dept_with_budget AS (
        SELECT
            d.dept_id,
            d.dept_name,
            ds.headcount,
            ROUND(ds.avg_salary, 0) AS avg_salary,
            ds.total_salary,
            dep.budget,
            ROUND(100.0 * ds.total_salary / dep.budget, 1) AS budget_used_pct
        FROM dept_stats ds
        JOIN departments dep ON ds.dept_id = dep.dept_id
        JOIN (SELECT dept_id, dept_name FROM departments) d ON ds.dept_id = d.dept_id
    ),

    over_budget AS (
        SELECT *
        FROM dept_with_budget
        WHERE budget_used_pct > 20
    )

    SELECT *
    FROM over_budget
    ORDER BY budget_used_pct DESC;
""")
print(result)

Output:

   dept_id      dept_name  headcount  avg_salary  total_salary  budget  budget_used_pct
0        1    Engineering          4     57000.0        228000  500000             45.6
1        3          Sales          2     91500.0        183000  400000             45.8
2        2      Marketing          2     74500.0        149000  300000             49.7

Three CTEs. Each builds on the previous. The final SELECT just picks from the last one. Reading top to bottom, the logic is completely clear.

If this were a nested subquery it would be a 40-line monster that nobody could read or debug.

Recursive CTEs: Traversing Hierarchies

A recursive CTE references itself. It is how you traverse tree structures in SQL, like organizational hierarchies.

conn.executescript("""
    DROP TABLE IF EXISTS org;
    CREATE TABLE org (
        emp_id     INTEGER PRIMARY KEY,
        name       TEXT,
        manager_id INTEGER
    );
    INSERT INTO org VALUES
        (1, 'CEO',       NULL),
        (2, 'VP Eng',    1),
        (3, 'VP Sales',  1),
        (4, 'Manager A', 2),
        (5, 'Manager B', 2),
        (6, 'Alex',      4),
        (7, 'Priya',     4),
        (8, 'Sam',       5),
        (9, 'Jordan',    3);
""")
conn.commit()

result = q("""
    WITH RECURSIVE org_tree AS (
        SELECT emp_id, name, manager_id, 0 AS level, name AS path
        FROM org
        WHERE manager_id IS NULL

        UNION ALL

        SELECT
            o.emp_id,
            o.name,
            o.manager_id,
            ot.level + 1,
            ot.path || ' > ' || o.name
        FROM org o
        JOIN org_tree ot ON o.manager_id = ot.emp_id
    )
    SELECT
        SUBSTR('          ', 1, level * 4) || name AS org_chart,
        level,
        path
    FROM org_tree
    ORDER BY path;
""")
print(result.to_string(index=False))

Output:

              org_chart  level                               path
                    CEO      0                                CEO
             VP Eng      1                         CEO > VP Eng
          Manager A      2              CEO > VP Eng > Manager A
               Alex      3   CEO > VP Eng > Manager A > Alex
              Priya      3   CEO > VP Eng > Manager A > Priya
          Manager B      2              CEO > VP Eng > Manager B
                Sam      3   CEO > VP Eng > Manager B > Sam
           VP Sales      1                       CEO > VP Sales
             Jordan      2             CEO > VP Sales > Jordan

The recursive CTE starts at the top of the tree (where manager_id IS NULL), then repeatedly joins the table to itself to traverse downward. The level tracks depth. The path builds a breadcrumb trail.

You will use this for org charts, folder structures, comment threads, and any hierarchical data.

When to Use Which

Subquery in WHERE: when you need one value or a list of values to filter against. Fast, simple, effective.

Subquery in FROM: when you need an intermediate table that does not exist but can be computed. Works but CTEs are more readable.

Correlated subquery: when the calculation must reference the current row of the outer query. Use sparingly, they can be slow.

CTE: almost everything else. Multi-step logic. Readable. Debuggable. Test each CTE independently by running it alone. Add the next one when the first works.

Recursive CTE: hierarchy traversal. Nothing else handles this as cleanly in SQL.

Try This

Create cte_practice.py.

Write a CTE that calculates the running total of salary per department. Each employee row shows their salary and the cumulative salary spent in their department up to and including their hire year.

Write a three-CTE chain: first CTE gets all employees hired after 2021, second CTE calculates their department averages, third CTE finds which of those newer employees already earn above their department average.

Using a recursive CTE, find all employees who report to "VP Eng" directly or indirectly. Show their names and how many levels below VP Eng they are.

What's Next

Window functions. The most powerful SQL feature most people learn last. They let you compute values across rows without collapsing them into groups. Running totals. Rankings. Lead and lag comparisons. The things that would require multiple CTEs or Python loops become single function calls.

DEV Community

Subqueries and CTEs: SQL Gets Readable

Subqueries: A Query Inside a Query

Subqueries in FROM: Derived Tables

Correlated Subqueries: Referencing the Outer Query

CTEs: The Better Way

Chaining Multiple CTEs

Recursive CTEs: Traversing Hierarchies

When to Use Which

Try This

What's Next

Top comments (0)