Isika Millicent

Posted on Mar 6

Advanced SQL Techniques Every Data Analyst Should Know

#sql #dataanalytics #database #beginners

In this article we will cover:

Advanced Aggregations
Advanced Set Operations
Window Functions
CTEs and Query Structuring
Subqueries
EXISTS vs NOT EXISTS
Query Optimization

Most real-world insights come from combining multiple SQL techniques.

While a basic understanding of SQL e.g. selecting columns, filtering rows, and joining tables is essential for anyone working with data, mastering advanced SQL techniques is what truly separates a new data analyst from an expert data analyst. These advanced skills unlock deeper insights, help solve complex business problems, and ultimately enable more confident, data-driven decision-making.

If you have spent any time working with data, you are probably already familiar with the fundamentals of SQL: pulling rows from a table (SELECT), filtering results using conditions (WHERE), and perhaps joining two tables together (JOIN). That foundation is important.

But the moment you start working as a data analyst, the questions change. Suddenly you're asked things like:

Which customers made a purchase this month but haven’t returned since?
For each sales representative, what was their running total of revenue by the end of every quarter?
Which product categories fall within the top 10% of sales performance?

Simple SQL isn't enough anymore. That’s where advanced SQL techniques come in.

1. Advanced Aggregations: CASE WHEN, FILTER, and GROUPING SETS

Aggregation in SQL goes far beyond COUNT() and SUM().
Business reporting often requires conditional aggregation, multiple grouping levels, and flexible summarization.
Three tools make this much easier: CASE WHEN, FILTER & GROUPING SETS

a. CASE WHEN is used for conditional logic inside a query. It allows you to compute conditional aggregates — for example, counting only orders above a certain value, or separating revenue by customer tier within the same row:

Think of it as SQL’s version of an if/else statement.

Example: Categorize customers based on spending.

SELECT customer_name, total_spent, CASE WHEN total_spent > 1000 THEN 'High Value' WHEN total_spent > 500 THEN 'Medium Value' ELSE 'Low Value' END AS customer_category FROM customers;

Result:

customer total_spent category Alice 1200 High Value Brian 650 Medium Value John 200 Low Value

b. FILTER -is a cleaner alternative to conditional aggregation.
Instead of writing CASE WHEN, you attach a filter directly to the aggregate.

Example
SELECT COUNT(*) FILTER (WHERE region = 'Kenya') AS kenya_customers, COUNT(*) FILTER (WHERE region = 'Uganda') AS uganda_customers FROM customers;

c. GROUPING SETS - This is more advanced and extremely powerful.
It allows you to calculate multiple GROUP BY aggregations in one query.
Normally this requires three queries + UNION.

Example
region product sales Kenya Laptop 200 Kenya Phone 300 Uganda Laptop 150

Suppose you want: Sales by region, Sales by product and the Grand total
Normally you'd write 3 queries.

But with GROUPING SETS:

SELECT region, product, SUM(sales) AS total_sales FROM sales GROUP BY GROUPING SETS ((region), (product),());

Result:

region product total_sales Kenya NULL 500 Uganda NULL 150

This kind of aggregation is frequently requested by business stakeholders who want a single summary table rather than multiple separate queries.

2. Advanced Set Operations (UNION, UNION ALL, INTERSECT, EXCEPT)

Set operations combine the results of two SELECT queries.
They require:

Same number of columns
Compatible data types
Same column order

A. UNION/UNION ALL
It combines results from two queries and removes duplicates automatically

Example Tables

Table: customers_2024
customer_id name
1 Alice
2 Brian
3 John

Table: customers_2025
customer_id name
3 John
4 Mary
5 James

QUERY
SELECT name FROM customers_2024 UNION SELECT name FROM customers_2025;

RESULT
name
Alice
Brian
John
Mary
James

Think of them as working with result sets, not tables directly.
NB/ John appears in both tables
UNION removes the duplicate
UNION ALL keeps all the duplicate values and is faster

B. INTERSECT
This returns only the rows that exist in both queries

When to Use INTERSECT

Identifying returning customers
Matching common records between systems
Data validation

Example use case:
Customers who bought in both 2024 and 2025. Using the same tables:

SELECT name FROM customers_2024 INTERSECT SELECT name FROM customers_2025;

Result
name
John

C. EXCEPT
It returns rows from the first query that do NOT exist in the second query.

Think of it as: “Show me what's in A but not in B”

SELECT name FROM customers_2024 EXCEPT SELECT name FROM customers_2025;

Result
name
Alice
Brian

When to Use EXCEPT

Finding new customers
Detecting missing records
Comparing two datasets
Data auditing

Question: When would you use EXCEPT instead of LEFT JOIN?
I use EXCEPT when I want to compare two result sets directly and identify records present in one dataset but not another, especially for data validation tasks.

3. Window Functions

Window functions are one of the most powerful tools in SQL analytics.
They allow you to perform calculations across rows while still keeping every row in the result.

Example problem: You want to see each employee's sales alongside their department's total sales.

A normal GROUP BY would collapse rows.

Window functions avoid that.

The syntax centers on the OVER() clause, which defines the 'window' — the set of rows the function considers:

SELECT employee_name, department, sales_amount, SUM(sales_amount) OVER (PARTITION BY department) AS dept_total, ROUND(sales_amount * 100.0 / SUM(sales_amount) OVER (PARTITION BY department), 2) AS pct_of_dept FROM employee_sales;

Here, PARTITION BY department tells SQL to calculate the sum separately for each department.
Every row for the Marketing team gets Marketing's total; every row for Engineering gets Engineering's total. No rows are removed; no separate subquery is needed.

Window functions become even more powerful with ranking.

The ROW_NUMBER (), RANK (), and DENSE_RANK () functions let you rank records within groups.

ROW_NUMBER() - gives every row a unique number. Even if two rows have the same value, they still get different numbers.
RANK() - gives the same rank to tied values, but it skips numbers after ties.
DENSE_RANK() - also gives the same rank for ties, but does NOT skip numbers.

Example table: employees
name salary Alice 9000 Bob 8000 Carol 8000 David 7000

Query 1: Row_number()
SELECT name, salary, ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num FROM employees;

Result
name salary row_num Alice 9000 1 Bob 8000 2 Carol 8000 3 David 7000 4

NB/
Bob and Carol have the same salary, but they still get different row numbers
So, ROW_NUMBER() forces each row to be unique.

Query 2: RANK()
SELECT name, salary, RANK() OVER (ORDER BY salary DESC) AS rank_num FROM employees;

Result
name salary rank_num Alice 9000 1 Bob 8000 2 Carol 8000 2 David 7000 4

NB/
Bob and Carol tie at rank 2,then SQL skips rank 3 and goes to 4.

Query 3: DENSE_RANK()
SELECT name, salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rank FROM employees;

Result
name salary dense_rank Alice 9000 1 Bob 8000 2 Carol 8000 2 David 7000 3

NB/ No number is skipped.

This is just one example of the power of window functions. They can be used for other analytical tasks, such as:

Ranking: NTILE () to find the top N products by sales in each category.
Lead/Lag Analysis: LEAD (), LAG () to compare a value with the value from a subsequent or preceding row, for example, to calculate month-over-month growth.
Example
SELECT month, revenue, revenue - LAG(revenue) OVER (ORDER BY month) AS growth FROM monthly_sales;
Running Totals: SUM() with an appropriate frame clause to calculate cumulative sales over time.

A classic business use case is identifying the top-performing product in each category

SELECT * FROM (SELECT product_name, category, revenue, RANK() OVER (PARTITION BY category ORDER BY revenue DESC) AS rnk FROM product_sales) WHERE rnk = 1;

4. Common Table Expressions (CTEs)

A CTE is a temporary, named result set that you can reference within a subsequent SELECT, INSERT, UPDATE, or DELETE statement.

Think of it as creating a temporary, virtual table that exists only for the duration of your query.

CTEs provide a way to break down a complex query into logical, readable steps. By giving a name to each step of your analysis, you can make your code self-documenting and easier to debug.

Understanding the Syntax
You start with the WITH keyword, followed by the name of the CTE, and then the AS keyword with the query that defines the CTE enclosed in parentheses.

Business Case Scenario: Analyzing Customer Order Data

Let's say you are an analyst for an online retailer, and you have been tasked with identifying the top 5 customers by total spending in the "Kenyan" region for the year 2025.

Attempting to do this with nested subqueries would result in a hard-to-read query. Instead, we can use CTEs to break down the problem into clear, logical steps.

Assume you have the following tables:

Table 1: customers
`customer_id customer_name region

John Smith Kenya
Jane Doe Uganda`

Table 2: orders
order_id customer_id order_date order_total 101 1 2025-01-15 250.00 102 2 2025-02-20 150.75

Here is how you can solve this problem using CTEs:

WITH orders_2025 AS (
    --Step 1: Filter orders from the year 2025--
    SELECT order_id, customer_id, order_total
    FROM orders
    WHERE EXTRACT(YEAR FROM order_date) = 2025),

kenya_customer_orders AS (
    --Step 2 & 3: Join with customers and filter for Kenyan region--
    SELECT c.customer_id, c.customer_name, o.order_total
    FROM orders_2025 AS o
    JOIN customers AS c ON o.customer_id = c.customer_id
    WHERE c.region = 'Kenya'),

customer_total_spending AS (
     --Step 4: Calculate total spending for each customer--
    SELECT customer_id, customer_name, 
    SUM(order_total) AS total_spending
    FROM kenya_customer_orders
    GROUP BY customer_id, customer_name)
 _Step 5: Rank customers and select the top 5_
SELECT customer_name, total_spending
FROM customer_total_spending
ORDER BY total_spending DESC
LIMIT 5;

As you can see, the query is much more readable and self-explanatory. Each CTE has a clear purpose, and the final SELECT statement is simple and easy to understand. If you needed to debug this query, you could easily test each CTE independently to verify its output

CTEs also support recursion, which is useful for hierarchical data structures like org charts, product categories with parent-child relationships, or network graphs.

Example Table
CEO
├─ Alice
│ ├─ Carol
│ └─ David
└─ Bob
└─ Emma

Question: Find all employees under a specific manager.

employee_id name manager_id
1 CEO NULL
2 Alice 1
3 Bob 1
4 Carol 2
5 David 2
6 Emma 3

WITH RECURSIVE employee_tree AS (
    -- start from manager
SELECT employee_id, name, manager_id
FROM employees
WHERE employee_id = 1
    UNION ALL
    -- find subordinates
SELECT e.employee_id, e.name, e.manager_id
FROM employees e
JOIN employee_tree et
ON e.manager_id = et.employee_id)

SELECT * FROM employee_tree;

Example: Sales Analytics Query

Imagine this table:

order_id customer region order_total order_date 1 Alice Kenya 500 2025-01-05 2 Brian Kenya 700 2025-02-02 3 James Uganda 300 2025-02-15 4 Alice Kenya 400 2025-03-01

Goal:
Total spending per customer
Kenyan vs non-Kenyan totals
Rank customers by spending

--Step 1 — CTE (clean and organize the orders from 2025) --

WITH customer_orders AS (
SELECT customer, region, order_total
FROM orders
WHERE EXTRACT(YEAR FROM order_date) = 2025)

--Step 2 — FILTER (conditional aggregation)--

SELECT customer, SUM(order_total) AS total_spent,
SUM(order_total) FILTER (WHERE region = 'Kenya') AS kenya_spending,
SUM(order_total) FILTER (WHERE region != 'Kenya') AS international_spending

--Without FILTER, you'd need multiple CASE WHEN statements.--
--Step 3 — Window function (ranking)--
RANK() OVER (ORDER BY SUM(order_total) DESC) AS spending_rank

--This ranks customers based on their total spending.--

--Final Combined Query--

WITH customer_orders AS (
SELECT customer, region, order_total
FROM orders
WHERE EXTRACT(YEAR FROM order_date) = 2025)

SELECT customer, SUM(order_total) AS total_spent,
SUM(order_total) FILTER (WHERE region = 'Kenya') AS kenya_spending,
SUM(order_total) FILTER (WHERE region != 'Kenya') AS international_spending,
RANK() OVER (ORDER BY SUM(order_total) DESC) AS spending_rank

FROM customer_orders
GROUP BY customer;

Instead of writing 5 queries, you get everything in one query.

Mental model - The flow usually looks like this:
CTE ↓ Aggregate with FILTER ↓ Analyze with WINDOW FUNCTIONS ↓ Final output

5. Subquery

This is simply a query inside another query.

It’s like saying: First calculate this… then use that result to answer the main question.
They can be used in:

SELECT

FROM

WHERE

HAVING

CORRELATED SUBQUERIES

a. Subquery in the WHERE Clause (Most Common)
Used for filtering.

Example:
Find employees who earn more than the average salary.

SELECT name, salary FROM employees WHERE salary > (SELECT AVG(salary) FROM employees);

So first, Inner query calculates average salary. Outer query filters employees above that value.

b. Subquery in the FROM Clause (Derived Table)
Here, the subquery acts like a temporary table.

Example:

Find departments with average salary above 50,000.

SELECT department_id, avg_salary FROM ( SELECT department_id, AVG(salary) AS avg_salary FROM employees GROUP BY department_id) AS dept_avg WHERE avg_salary > 50000;

c. Subquery in the SELECT Clause
Used to return a calculated value per row.

Example:
Show each employee with the company average salary.
SELECT name, salary, (SELECT AVG(salary) FROM employees) AS company_avg FROM employees;

*d. Subquery in the HAVING Clause *
It is used when you want to filter aggregated results using another query.

WHERE filters rows before aggregation
HAVING filters groups after aggregation

So a subquery in HAVING usually compares one group's aggregate with another aggregate value.

SELECT department_id, AVG(salary) AS dept_avg FROM employees GROUP BY department_id HAVING AVG(salary) > ( SELECT AVG(salary) FROM employees);

Real Business Case: Find products whose total sales are greater than the average product sales.

SELECT product_id, SUM(sales) AS total_sales FROM orders GROUP BY product_id HAVING SUM(sales) > (SELECT AVG(product_sales) FROM(SELECT SUM(sales) AS product_sales FROM orders GROUP BY product_id) t);

e. Correlated Subqueries
A correlated subquery depends on the outer query. It references a column from the outer query and runs once per row.

Example:
Find employees who earn more than their department average.

SELECT e1.name, e1.salary, e1.department_id FROM employees e1 WHERE salary > ( SELECT AVG(e2.salary) FROM employees e2 WHERE e2.department_id = e1.department_id );

6. Finding Relationships with EXISTS and NOT EXISTS

EXISTS checks if at least one row exists. It ignores the actual column values.

Use the 2 when:

Checking relationships
Finding missing records
Data validation
Complex filtering
Working with large datasets
Avoiding NULL issues

Example
Table 1: customers
customer_id customer_name 1 Alice 2 Bob 3 Carol 4 David

Table 2: orders

order_id customer_id amount 101 1 500 102 1 300 103 2 700

Now let’s use them.
i. EXISTS
Find customers who have placed at least one order.

SELECT c.customer_id, c.customer_name FROM customers c WHERE EXISTS ( SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id);

Result:
customer_id customer_name 1 Alice 2 Bob

ii. NOT EXISTS
Now let’s find customers who NEVER placed an order.

SELECT c.customer_id, c.customer_name FROM customers c WHERE NOT EXISTS ( SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id);

NOTE BETTER:

The select 1 in both just stands in as a placeholder, you can replace it with any value. It does NOT read the column
NOT EXISTS (recommended) OR LEFT JOIN + IS NULL - give same result

7. Query Optimization

Writing SQL that produces the correct answer is one thing.
Writing SQL that runs efficiently on millions of rows is another skill entirely.

Query optimization is the process of writing SQL in a way that runs faster, uses less memory, and scales better on large data.

Key techniques:

If you frequently filter or join on a column, it’s likely a good candidate.
Example:

SELECT * FROM orders o JOIN customers c ON o.customer_id = c.customer_id;

customer_id should likely be indexed.

How Databases Actually Work
When you run a query:

SQL parses it (checks syntax).
The query optimizer analyzes it.
It chooses the best execution plan.
Then it runs it.

The optimizer decides things like:

Should it use an index?
Should it scan the whole table?
Which join order is best?
Should it use a hash join or nested loop?

You don’t see this — but it’s happening.

Key principles include:

Use EXPLAIN or EXPLAIN ANALYZE This reveals how the database executes your query.

2. Use indexes on frequently filtered columns
Indexes are like a book’s table of contents. Without index, database scans every row (slow) but with index, database jumps directly to matching rows (fast).

3. Avoid SELECT * in production queries
Just select necessary columns because less data transferred and less memory used hence faster execution

4. Filter early using WHERE clauses
Filtering early reduces the number of rows processed.

Avoid functions on indexed columns

Example:
Works but not as good:
WHERE YEAR(order_date) = 2024

Better:
WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01'

NB/ Indexes are NOT helpful for:

Small tables
Queries returning most of the table
Columns with very low uniqueness (like gender: M/F)
Heavy writes (because indexes slow down inserts/updates)

Discussion

What SQL concept took you the longest to understand?

Window functions?
CTEs?
Subqueries?
Something else?

Let me know.

Top comments (2)

👨‍💻Pierre-Henry ✨ • Mar 6 • Edited

That’s a great refresher Isika 👌 Window functions and CTEs are indeed the kind of things you think you know, until a clean example proves there are still a trick or two to pick up… 🙂

Isika Millicent • Mar 8

Thank you! 😊 I completely agree, window functions and CTEs are funny like that, you use them often enough to feel comfortable, but every once in a while, a simple example makes you realize there’s still a new trick hiding in there. Happy you enjoyed the refresher!