John Wakaba

Posted on Apr 9

Advanced SQL Techniques Every Data Analyst Should Know

#data #sql #postgressql #analytics

You can write a SELECT statement. You can JOIN tables and slap on a WHERE clause. But somewhere between "I know SQL" and "I really know SQL" lies a gap that separates analysts who get things done from analysts who get things done fast, elegantly, and correctly.

This article covers the techniques that live in that gap.

1. Window Functions

Most analysts discover GROUP BY early and lean on it forever. Window functions do something fundamentally different — they let you compute aggregates without collapsing rows.

SELECT
  employee_id,
  department,
  salary,
  AVG(salary) OVER (PARTITION BY department) AS dept_avg,
  salary - AVG(salary) OVER (PARTITION BY department) AS diff_from_avg
FROM employees;

You get one row per employee, but each row carries its department's average alongside it. No subquery. No self-join. No mess.

Running totals and moving averages

SELECT
  order_date,
  revenue,
  SUM(revenue) OVER (ORDER BY order_date) AS running_total,
  AVG(revenue) OVER (
    ORDER BY order_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS rolling_7day_avg
FROM daily_sales;

The ROWS BETWEEN clause is where window functions get powerful. You can define exactly which rows belong to each window frame — preceding rows, following rows, or any combination.

Ranking without ties headaches

SELECT
  product_name,
  category,
  revenue,
  RANK()       OVER (PARTITION BY category ORDER BY revenue DESC) AS rank_with_gaps,
  DENSE_RANK() OVER (PARTITION BY category ORDER BY revenue DESC) AS rank_no_gaps,
  ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) AS row_num
FROM products;

RANK() skips numbers after ties. DENSE_RANK() doesn't. ROW_NUMBER() ignores ties entirely and just counts. Know which one you actually need before reaching for the first one.

2. CTEs

Common Table Expressions (CTEs) don't make your query faster (usually), but they make it more readable — and readable queries are maintainable queries.

WITH monthly_revenue AS (
  SELECT
    DATE_TRUNC('month', order_date) AS month,
    SUM(total_amount) AS revenue
  FROM orders
  GROUP BY 1
),
revenue_growth AS (
  SELECT
    month,
    revenue,
    LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue,
    ROUND(
      100.0 * (revenue - LAG(revenue) OVER (ORDER BY month))
           / LAG(revenue) OVER (ORDER BY month),
      2
    ) AS mom_growth_pct
  FROM monthly_revenue
)
SELECT *
FROM revenue_growth
WHERE mom_growth_pct IS NOT NULL
ORDER BY month;

Each CTE is a named, composable step. You can read this top-to-bottom and understand exactly what's happening. Compare that to a nested subquery version and you'll never go back.

Recursive CTEs for hierarchical data

When you have org charts, category trees, or any parent-child relationship, recursive CTEs are the right tool:

WITH RECURSIVE org_chart AS (
  -- Base case: top-level managers
  SELECT
    employee_id,
    name,
    manager_id,
    0 AS depth,
    name AS path
  FROM employees
  WHERE manager_id IS NULL

  UNION ALL

  -- Recursive case: direct reports
  SELECT
    e.employee_id,
    e.name,
    e.manager_id,
    oc.depth + 1,
    oc.path || ' > ' || e.name
  FROM employees e
  INNER JOIN org_chart oc ON e.manager_id = oc.employee_id
)
SELECT * FROM org_chart ORDER BY path;

This walks the entire hierarchy in a single query, no matter how deep it goes.

3. Advanced Aggregations: GROUPING SETS, ROLLUP, and CUBE

Say goodbye to UNION ALL chains for multi-level summaries.

SELECT
  region,
  product_category,
  SUM(revenue) AS total_revenue
FROM sales
GROUP BY GROUPING SETS (
  (region, product_category),  -- subtotals by region + category
  (region),                    -- subtotals by region only
  (product_category),          -- subtotals by category only
  ()                           -- grand total
);

ROLLUP is a shorthand when your groupings have a natural hierarchy:

GROUP BY ROLLUP (year, quarter, month)
-- Produces: (year, quarter, month), (year, quarter), (year), ()

CUBE generates all possible combinations. Useful for cross-dimensional analysis, but be careful — it grows exponentially with the number of dimensions.

4. The FILTER Clause: Conditional Aggregation Without CASE

Most people do conditional aggregation like this:

SELECT
  SUM(CASE WHEN status = 'completed' THEN amount ELSE 0 END) AS completed_revenue,
  SUM(CASE WHEN status = 'refunded'  THEN amount ELSE 0 END) AS refunded_revenue
FROM orders;

There's a cleaner way:

SELECT
  SUM(amount) FILTER (WHERE status = 'completed') AS completed_revenue,
  SUM(amount) FILTER (WHERE status = 'refunded')  AS refunded_revenue
FROM orders;

The FILTER clause attaches directly to the aggregate function. It's not just aesthetically cleaner — it makes the intent unmistakably clear, and it works with any aggregate function including COUNT, AVG, STRING_AGG, and window functions.

5. LATERAL Joins: Correlated Subqueries That Scale

A LATERAL join lets a subquery in the FROM clause reference columns from tables that appear earlier in the same FROM clause. Think of it as a for each row, compute this operation.

SELECT
  c.customer_id,
  c.name,
  recent.order_date,
  recent.amount
FROM customers c
CROSS JOIN LATERAL (
  SELECT order_date, amount
  FROM orders o
  WHERE o.customer_id = c.customer_id
  ORDER BY order_date DESC
  LIMIT 3
) recent;

This fetches the 3 most recent orders per customer — something that's awkward with a window function and impossible with a regular join. Lateral joins shine for top-N-per-group patterns.

6. String Aggregation and Array Operations

Real-world data is messy. Sometimes you need to collapse multiple rows into a single delimited string, or work with arrays directly.

-- Collapse tags into a comma-separated list per article
SELECT
  article_id,
  STRING_AGG(tag, ', ' ORDER BY tag) AS tags
FROM article_tags
GROUP BY article_id;

-- PostgreSQL: aggregate into an actual array
SELECT
  user_id,
  ARRAY_AGG(DISTINCT product_id ORDER BY product_id) AS purchased_products
FROM purchases
GROUP BY user_id;

And once you have arrays, you can query into them:

SELECT user_id
FROM user_preferences
WHERE 'dark_mode' = ANY(feature_flags);

7. Query Optimization: Reading EXPLAIN Output

Fast queries aren't magic — they're the result of understanding what the database is actually doing.

EXPLAIN ANALYZE
SELECT c.name, COUNT(o.order_id)
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name;

The output tells you:

Seq Scan vs Index Scan: sequential scans on large tables are a red flag
Hash Join vs Nested Loop: hash joins are usually better for large datasets; nested loops for small ones
Actual rows vs estimated rows: large discrepancies mean stale statistics — run ANALYZE
Rows removed by filter: indexes on the right columns eliminate these entirely

A few high-impact habits:

Index foreign keys (the database often won't do this automatically)
Avoid functions on indexed columns in WHERE clauses — WHERE YEAR(created_at) = 2024 can't use an index on created_at, but WHERE created_at >= '2024-01-01' can
Use LIMIT with OFFSET carefully — large offsets scan and discard rows; keyset pagination is faster for deep pages

8. Date and Time Manipulation

Time-series analysis is central to most analyst work, and doing it well requires a solid command of date functions.

-- Cohort analysis: group users by signup month
SELECT
  DATE_TRUNC('month', signup_date) AS cohort_month,
  DATE_TRUNC('month', activity_date) AS activity_month,
  DATE_PART('month', AGE(activity_date, signup_date)) AS months_since_signup,
  COUNT(DISTINCT user_id) AS active_users
FROM user_activity
GROUP BY 1, 2, 3
ORDER BY 1, 3;

-- Generate a complete date spine (no gaps even if data is missing)
SELECT date::date
FROM GENERATE_SERIES(
  '2024-01-01'::date,
  '2024-12-31'::date,
  '1 day'::interval
) AS gs(date);

The date spine pattern is essential for time-series work — join it against your data and missing dates appear as NULLs rather than disappearing from your results entirely.

9. NULL Handling

NULL is not zero. NULL is not an empty string. NULL is the absence of a value, and it propagates in ways that catch everyone out at some point.

-- This looks right but silently excludes NULLs from the average
SELECT AVG(response_time) FROM requests;
-- NULL values are ignored by AVG — this may be what you want, or may not be

-- Be explicit:
SELECT
  AVG(response_time) AS avg_excluding_nulls,
  AVG(COALESCE(response_time, 0)) AS avg_treating_null_as_zero,
  COUNT(*) AS total_rows,
  COUNT(response_time) AS non_null_rows
FROM requests;

And the classic NULL comparison mistake:

-- This never returns rows where manager_id is NULL
WHERE manager_id != 5

-- You need:
WHERE manager_id != 5 OR manager_id IS NULL

NULLIF is your friend for division-by-zero protection:

SELECT revenue / NULLIF(sessions, 0) AS revenue_per_session
FROM traffic_data;

When sessions is 0, NULLIF returns NULL, and dividing by NULL yields NULL — no error, no corrupt data.

10. Pivot Tables with CASE and FILTER

SQL doesn't have a native PIVOT keyword in most databases, but you can build pivot tables with conditional aggregation:

SELECT
  product_category,
  SUM(amount) FILTER (WHERE EXTRACT(YEAR FROM order_date) = 2022) AS revenue_2022,
  SUM(amount) FILTER (WHERE EXTRACT(YEAR FROM order_date) = 2023) AS revenue_2023,
  SUM(amount) FILTER (WHERE EXTRACT(YEAR FROM order_date) = 2024) AS revenue_2024,
  ROUND(
    100.0 * (
      SUM(amount) FILTER (WHERE EXTRACT(YEAR FROM order_date) = 2024) -
      SUM(amount) FILTER (WHERE EXTRACT(YEAR FROM order_date) = 2022)
    ) / NULLIF(SUM(amount) FILTER (WHERE EXTRACT(YEAR FROM order_date) = 2022), 0),
    1
  ) AS pct_change_2022_to_2024
FROM orders
GROUP BY product_category
ORDER BY revenue_2024 DESC NULLS LAST;

Conclusion

These techniques aren't separate tricks to memorize — they combine. A typical advanced query might use a CTE to build a clean base dataset, window functions to compute ranks and running totals, FILTER for conditional aggregation, and LATERAL to pull related records. The result is a single, readable, performant query that would have taken three separate queries and some Python glue code to produce otherwise.

DEV Community