In this article we will cover:
- Advanced Aggregations
- Advanced Set Operations
- Window Functions
- CTEs and Query Structuring
- Subqueries
- EXISTS vs NOT EXISTS
- Query Optimization
Most real-world insights come from combining multiple SQL techniques.
While a basic understanding of SQL e.g. selecting columns, filtering rows, and joining tables is essential for anyone working with data, mastering advanced SQL techniques is what truly separates a new data analyst from an expert data analyst. These advanced skills unlock deeper insights, help solve complex business problems, and ultimately enable more confident, data-driven decision-making.
If you have spent any time working with data, you are probably already familiar with the fundamentals of SQL: pulling rows from a table (SELECT), filtering results using conditions (WHERE), and perhaps joining two tables together (JOIN). That foundation is important.
But the moment you start working as a data analyst, the questions change. Suddenly you're asked things like:
Which customers made a purchase this month but haven’t returned since?
For each sales representative, what was their running total of revenue by the end of every quarter?
Which product categories fall within the top 10% of sales performance?
Simple SQL isn't enough anymore. That’s where advanced SQL techniques come in.
1. Advanced Aggregations: CASE WHEN, FILTER, and GROUPING SETS
Aggregation in SQL goes far beyond COUNT() and SUM().
Business reporting often requires conditional aggregation, multiple grouping levels, and flexible summarization.
Three tools make this much easier: CASE WHEN, FILTER & GROUPING SETS
a. CASE WHEN is used for conditional logic inside a query. It allows you to compute conditional aggregates — for example, counting only orders above a certain value, or separating revenue by customer tier within the same row:
Think of it as SQL’s version of an
if/elsestatement.
Example: Categorize customers based on spending.
SELECT customer_name, total_spent,
CASE WHEN total_spent > 1000 THEN 'High Value'
WHEN total_spent > 500 THEN 'Medium Value'
ELSE 'Low Value'
END AS customer_category
FROM customers;
Result:
customer total_spent category
Alice 1200 High Value
Brian 650 Medium Value
John 200 Low Value
b. FILTER -is a cleaner alternative to conditional aggregation.
Instead of writing CASE WHEN, you attach a filter directly to the aggregate.
Example
SELECT COUNT(*) FILTER (WHERE region = 'Kenya') AS kenya_customers,
COUNT(*) FILTER (WHERE region = 'Uganda') AS uganda_customers
FROM customers;
c. GROUPING SETS - This is more advanced and extremely powerful.
It allows you to calculate multiple GROUP BY aggregations in one query.
Normally this requires three queries + UNION.
Example
region product sales
Kenya Laptop 200
Kenya Phone 300
Uganda Laptop 150
Suppose you want: Sales by region, Sales by product and the Grand total
Normally you'd write 3 queries.
But with GROUPING SETS:
SELECT region, product, SUM(sales) AS total_sales
FROM sales
GROUP BY GROUPING SETS ((region), (product),());
Result:
region product total_sales
Kenya NULL 500
Uganda NULL 150
This kind of aggregation is frequently requested by business stakeholders who want a single summary table rather than multiple separate queries.
2. Advanced Set Operations (UNION, UNION ALL, INTERSECT, EXCEPT)
Set operations combine the results of two SELECT queries.
They require:
- Same number of columns
- Compatible data types
- Same column order
A. UNION/UNION ALL
It combines results from two queries and removes duplicates automatically
Example Tables
Table: customers_2024
customer_id name
1 Alice
2 Brian
3 John
Table: customers_2025
customer_id name
3 John
4 Mary
5 James
QUERY
SELECT name FROM customers_2024
UNION
SELECT name FROM customers_2025;
RESULT
name
Alice
Brian
John
Mary
James
Think of them as working with result sets, not tables directly.
NB/ John appears in both tables
UNION removes the duplicate
UNION ALL keeps all the duplicate values and is faster
B. INTERSECT
This returns only the rows that exist in both queries
When to Use INTERSECT
- Identifying returning customers
- Matching common records between systems
- Data validation
Example use case:
Customers who bought in both 2024 and 2025. Using the same tables:
SELECT name FROM customers_2024
INTERSECT
SELECT name FROM customers_2025;
Result
name
John
C. EXCEPT
It returns rows from the first query that do NOT exist in the second query.
Think of it as: “Show me what's in A but not in B”
SELECT name FROM customers_2024
EXCEPT
SELECT name FROM customers_2025;
Result
name
Alice
Brian
When to Use EXCEPT
- Finding new customers
- Detecting missing records
- Comparing two datasets
- Data auditing
Question: When would you use EXCEPT instead of LEFT JOIN?
I use EXCEPT when I want to compare two result sets directly and identify records present in one dataset but not another, especially for data validation tasks.
3. Window Functions
Window functions are one of the most powerful tools in SQL analytics.
They allow you to perform calculations across rows while still keeping every row in the result.
Example problem: You want to see each employee's sales alongside their department's total sales.
A normal GROUP BY would collapse rows.
Window functions avoid that.
The syntax centers on the OVER() clause, which defines the 'window' — the set of rows the function considers:
SELECT employee_name, department, sales_amount,
SUM(sales_amount) OVER (PARTITION BY department) AS dept_total,
ROUND(sales_amount * 100.0 / SUM(sales_amount)
OVER (PARTITION BY department), 2)
AS pct_of_dept
FROM employee_sales;
Here, PARTITION BY department tells SQL to calculate the sum separately for each department.
Every row for the Marketing team gets Marketing's total; every row for Engineering gets Engineering's total. No rows are removed; no separate subquery is needed.
Window functions become even more powerful with ranking.
The ROW_NUMBER (), RANK (), and DENSE_RANK () functions let you rank records within groups.
- ROW_NUMBER() - gives every row a unique number. Even if two rows have the same value, they still get different numbers.
- RANK() - gives the same rank to tied values, but it skips numbers after ties.
- DENSE_RANK() - also gives the same rank for ties, but does NOT skip numbers.
Example table: employees
name salary
Alice 9000
Bob 8000
Carol 8000
David 7000
Query 1: Row_number()
SELECT name, salary,
ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num
FROM employees;
Result
name salary row_num
Alice 9000 1
Bob 8000 2
Carol 8000 3
David 7000 4
NB/
Bob and Carol have the same salary, but they still get different row numbers
So, ROW_NUMBER() forces each row to be unique.
Query 2: RANK()
SELECT name, salary,
RANK() OVER (ORDER BY salary DESC) AS rank_num
FROM employees;
Result
name salary rank_num
Alice 9000 1
Bob 8000 2
Carol 8000 2
David 7000 4
NB/
Bob and Carol tie at rank 2,then SQL skips rank 3 and goes to 4.
Query 3: DENSE_RANK()
SELECT name, salary,
DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rank
FROM employees;
Result
name salary dense_rank
Alice 9000 1
Bob 8000 2
Carol 8000 2
David 7000 3
NB/ No number is skipped.
This is just one example of the power of window functions. They can be used for other analytical tasks, such as:
- Ranking: NTILE () to find the top N products by sales in each category.
Lead/Lag Analysis: LEAD (), LAG () to compare a value with the value from a subsequent or preceding row, for example, to calculate month-over-month growth.
Example
SELECT month, revenue, revenue - LAG(revenue) OVER (ORDER BY month) AS growth
FROM monthly_sales;Running Totals: SUM() with an appropriate frame clause to calculate cumulative sales over time.
A classic business use case is identifying the top-performing product in each category
SELECT *
FROM (SELECT product_name, category, revenue,
RANK() OVER (PARTITION BY category ORDER BY revenue DESC) AS rnk
FROM product_sales)
WHERE rnk = 1;
4. Common Table Expressions (CTEs)
A CTE is a temporary, named result set that you can reference within a subsequent SELECT, INSERT, UPDATE, or DELETE statement.
Think of it as creating a temporary, virtual table that exists only for the duration of your query.
CTEs provide a way to break down a complex query into logical, readable steps. By giving a name to each step of your analysis, you can make your code self-documenting and easier to debug.
Understanding the Syntax
You start with the WITH keyword, followed by the name of the CTE, and then the AS keyword with the query that defines the CTE enclosed in parentheses.
Business Case Scenario: Analyzing Customer Order Data
Let's say you are an analyst for an online retailer, and you have been tasked with identifying the top 5 customers by total spending in the "Kenyan" region for the year 2025.
Attempting to do this with nested subqueries would result in a hard-to-read query. Instead, we can use CTEs to break down the problem into clear, logical steps.
Assume you have the following tables:
Table 1: customers
`customer_id customer_name region
- John Smith Kenya
- Jane Doe Uganda`
Table 2: orders
order_id customer_id order_date order_total
101 1 2025-01-15 250.00
102 2 2025-02-20 150.75
Here is how you can solve this problem using CTEs:
WITH orders_2025 AS (
--Step 1: Filter orders from the year 2025--
SELECT order_id, customer_id, order_total
FROM orders
WHERE EXTRACT(YEAR FROM order_date) = 2025),
kenya_customer_orders AS (
--Step 2 & 3: Join with customers and filter for Kenyan region--
SELECT c.customer_id, c.customer_name, o.order_total
FROM orders_2025 AS o
JOIN customers AS c ON o.customer_id = c.customer_id
WHERE c.region = 'Kenya'),
customer_total_spending AS (
--Step 4: Calculate total spending for each customer--
SELECT customer_id, customer_name,
SUM(order_total) AS total_spending
FROM kenya_customer_orders
GROUP BY customer_id, customer_name)
_Step 5: Rank customers and select the top 5_
SELECT customer_name, total_spending
FROM customer_total_spending
ORDER BY total_spending DESC
LIMIT 5;
As you can see, the query is much more readable and self-explanatory. Each CTE has a clear purpose, and the final SELECT statement is simple and easy to understand. If you needed to debug this query, you could easily test each CTE independently to verify its output
CTEs also support recursion, which is useful for hierarchical data structures like org charts, product categories with parent-child relationships, or network graphs.
Example Table
CEO
├─ Alice
│ ├─ Carol
│ └─ David
└─ Bob
└─ Emma
Question: Find all employees under a specific manager.
employee_id name manager_id
1 CEO NULL
2 Alice 1
3 Bob 1
4 Carol 2
5 David 2
6 Emma 3
WITH RECURSIVE employee_tree AS (
-- start from manager
SELECT employee_id, name, manager_id
FROM employees
WHERE employee_id = 1
UNION ALL
-- find subordinates
SELECT e.employee_id, e.name, e.manager_id
FROM employees e
JOIN employee_tree et
ON e.manager_id = et.employee_id)
SELECT *
FROM employee_tree;
Example: Sales Analytics Query
Imagine this table:
order_id customer region order_total order_date
1 Alice Kenya 500 2025-01-05
2 Brian Kenya 700 2025-02-02
3 James Uganda 300 2025-02-15
4 Alice Kenya 400 2025-03-01
Goal:
Total spending per customer
Kenyan vs non-Kenyan totals
Rank customers by spending
--Step 1 — CTE (clean and organize the orders from 2025) --
WITH customer_orders AS (
SELECT customer, region, order_total
FROM orders
WHERE EXTRACT(YEAR FROM order_date) = 2025)
--Step 2 — FILTER (conditional aggregation)--
SELECT customer, SUM(order_total) AS total_spent,
SUM(order_total) FILTER (WHERE region = 'Kenya') AS kenya_spending,
SUM(order_total) FILTER (WHERE region != 'Kenya') AS international_spending
--Without FILTER, you'd need multiple CASE WHEN statements.--
--Step 3 — Window function (ranking)--
RANK() OVER (ORDER BY SUM(order_total) DESC) AS spending_rank
--This ranks customers based on their total spending.--
--Final Combined Query--
WITH customer_orders AS (
SELECT customer, region, order_total
FROM orders
WHERE EXTRACT(YEAR FROM order_date) = 2025)
SELECT customer, SUM(order_total) AS total_spent,
SUM(order_total) FILTER (WHERE region = 'Kenya') AS kenya_spending,
SUM(order_total) FILTER (WHERE region != 'Kenya') AS international_spending,
RANK() OVER (ORDER BY SUM(order_total) DESC) AS spending_rank
FROM customer_orders
GROUP BY customer;
Instead of writing 5 queries, you get everything in one query.
Mental model - The flow usually looks like this:
CTE
↓
Aggregate with FILTER
↓
Analyze with WINDOW FUNCTIONS
↓
Final output
5. Subquery
This is simply a query inside another query.
It’s like saying: First calculate this… then use that result to answer the main question.
They can be used in:
- SELECT
- FROM
- WHERE
- HAVING
- CORRELATED SUBQUERIES
a. Subquery in the WHERE Clause (Most Common)
Used for filtering.
Example:
Find employees who earn more than the average salary.
SELECT name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
So first, Inner query calculates average salary. Outer query filters employees above that value.
b. Subquery in the FROM Clause (Derived Table)
Here, the subquery acts like a temporary table.
Example:
Find departments with average salary above 50,000.
SELECT department_id, avg_salary
FROM (
SELECT department_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY department_id) AS dept_avg
WHERE avg_salary > 50000;
c. Subquery in the SELECT Clause
Used to return a calculated value per row.
Example:
Show each employee with the company average salary.
SELECT name, salary,
(SELECT AVG(salary) FROM employees) AS company_avg
FROM employees;
*d. Subquery in the HAVING Clause *
It is used when you want to filter aggregated results using another query.
WHERE filters rows before aggregation
HAVING filters groups after aggregation
So a subquery in HAVING usually compares one group's aggregate with another aggregate value.
SELECT department_id, AVG(salary) AS dept_avg
FROM employees
GROUP BY department_id
HAVING AVG(salary) > (
SELECT AVG(salary)
FROM employees);
Real Business Case: Find products whose total sales are greater than the average product sales.
SELECT product_id, SUM(sales) AS total_sales
FROM orders
GROUP BY product_id
HAVING SUM(sales) > (SELECT AVG(product_sales)
FROM(SELECT SUM(sales) AS product_sales
FROM orders
GROUP BY product_id) t);
e. Correlated Subqueries
A correlated subquery depends on the outer query. It references a column from the outer query and runs once per row.
Example:
Find employees who earn more than their department average.
SELECT e1.name, e1.salary, e1.department_id
FROM employees e1
WHERE salary > (
SELECT AVG(e2.salary)
FROM employees e2
WHERE e2.department_id = e1.department_id
);
6. Finding Relationships with EXISTS and NOT EXISTS
EXISTS checks if at least one row exists. It ignores the actual column values.
Use the 2 when:
- Checking relationships
- Finding missing records
- Data validation
- Complex filtering
- Working with large datasets
- Avoiding NULL issues
Example
Table 1: customers
customer_id customer_name
1 Alice
2 Bob
3 Carol
4 David
Table 2: orders
order_id customer_id amount
101 1 500
102 1 300
103 2 700
Now let’s use them.
i. EXISTS
Find customers who have placed at least one order.
SELECT c.customer_id, c.customer_name
FROM customers c
WHERE EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id);
Result:
customer_id customer_name
1 Alice
2 Bob
ii. NOT EXISTS
Now let’s find customers who NEVER placed an order.
SELECT c.customer_id, c.customer_name
FROM customers c
WHERE NOT EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id);
NOTE BETTER:
- The select 1 in both just stands in as a placeholder, you can replace it with any value. It does NOT read the column
- NOT EXISTS (recommended) OR LEFT JOIN + IS NULL - give same result
7. Query Optimization
Writing SQL that produces the correct answer is one thing.
Writing SQL that runs efficiently on millions of rows is another skill entirely.
Query optimization is the process of writing SQL in a way that runs faster, uses less memory, and scales better on large data.
Key techniques:
If you frequently filter or join on a column, it’s likely a good candidate.
Example:
SELECT *
FROM orders o
JOIN customers c
ON o.customer_id = c.customer_id;
customer_id should likely be indexed.
How Databases Actually Work
When you run a query:
- SQL parses it (checks syntax).
- The query optimizer analyzes it.
- It chooses the best execution plan.
- Then it runs it.
The optimizer decides things like:
- Should it use an index?
- Should it scan the whole table?
- Which join order is best?
- Should it use a hash join or nested loop?
You don’t see this — but it’s happening.
Key principles include:
- Use EXPLAIN or EXPLAIN ANALYZE This reveals how the database executes your query.
2. Use indexes on frequently filtered columns
Indexes are like a book’s table of contents. Without index, database scans every row (slow) but with index, database jumps directly to matching rows (fast).
3. Avoid SELECT * in production queries
Just select necessary columns because less data transferred and less memory used hence faster execution
4. Filter early using WHERE clauses
Filtering early reduces the number of rows processed.
- Avoid functions on indexed columns
Example:
Works but not as good:
WHERE YEAR(order_date) = 2024
Better:
WHERE order_date >= '2024-01-01'
AND order_date < '2025-01-01'
NB/ Indexes are NOT helpful for:
- Small tables
- Queries returning most of the table
- Columns with very low uniqueness (like gender: M/F)
- Heavy writes (because indexes slow down inserts/updates)
Discussion
What SQL concept took you the longest to understand?
- Window functions?
- CTEs?
- Subqueries?
- Something else?
Let me know.
Top comments (1)
That’s a great refresher Isika 👌 Window functions and CTEs are indeed the kind of things you think you know, until a clean example proves there are still a trick or two to pick up… 🙂