DEV Community

Cover image for SQL Aggregate Functions Deep Dive: SUM, AVG, MIN, MAX, COUNT(DISTINCT)
Gowtham Potureddi
Gowtham Potureddi

Posted on

SQL Aggregate Functions Deep Dive: SUM, AVG, MIN, MAX, COUNT(DISTINCT)

sql aggregate functions are the most-tested SQL pattern in data engineering interviews because every analytics question — from "how many distinct users converted last week" to "show me revenue by region with subtotals" — ultimately reduces to a COUNT, SUM, AVG, MIN, MAX, or COUNT(DISTINCT) call inside a GROUP BY. The five core aggregates look trivial on paper, but they hide four cliffs that trip junior engineers: silent sql count NULL-handling bugs, sql count distinct memory blow-ups on large cardinalities, BIGINT overflow in long-running SUMs, and the lexicographic-vs-chronological surprise in MIN/MAX.

This guide walks through the five aggregate families that show up most often in count distinct sql questions, count of count sql patterns, and GROUP BY extensions at FAANG, Stripe, DoorDash, and analytics-heavy shops. Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works. By the end you'll be able to defend COUNT(*) FILTER (WHERE …) over SUM(CASE WHEN …), choose between exact COUNT(DISTINCT) and APPROX_COUNT_DISTINCT on a HyperLogLog sketch, and ship a single-pass ROLLUP query that produces subtotals plus a grand total — the exact answers that senior interviewers reward.

PipeCode blog header for a SQL aggregate functions deep dive — bold white headline 'SQL Aggregates · Deep Dive' with subtitle 'SUM · AVG · MIN · MAX · COUNT(DISTINCT)' and a stylised Σ symbol with five aggregate pills on a dark gradient with purple, green, orange, and blue accents and a small pipecode.ai attribution.

When you want hands-on reps immediately after reading, browse aggregation practice library →, drill GROUP BY and HAVING problems →, and rehearse on SQL practice for any topic →.


On this page


1. Why aggregates are the most-tested SQL pattern

Aggregates are the single most-asked SQL primitive in data engineering loops

The one-sentence invariant: an aggregate function reduces a column of N rows down to one scalar value, optionally per GROUP BY key, and the entire family — COUNT, SUM, AVG, MIN, MAX, COUNT(DISTINCT), plus the advanced extensions — shares one set of NULL-handling rules and one set of grouping rules. Once you internalise that mental model — reduction first, NULLs ignored except for COUNT(*), grouping is just "apply the reduction once per key" — most sql aggregate functions interview questions become a sequence of trade-off discussions.

Why every interview opens with aggregates in five bullets.

  • Coverage. Every analytics question is some shape of "count this, sum that, by these keys." Aggregates are the shortest path from raw rows to a business metric.
  • NULL discipline. Aggregates are the single best probe for whether a candidate understands SQL's three-valued logic. COUNT(*) ≠ COUNT(col) is the most common silent bug in production reports.
  • Cost intuition. SUM is O(N) streaming, COUNT(DISTINCT) is O(distinct cardinality) memory — interviewers test whether you reach for APPROX_COUNT_DISTINCT when the cardinality blows past memory.
  • Composability. Aggregates compose with GROUP BY, HAVING, window functions, and CTEs to form ~80% of analytics SQL.
  • Dialect drift. AVG on integers returns NUMERIC in Postgres but DOUBLE in MySQL; FILTER (WHERE …) is ANSI but unsupported in MySQL ≤ 8. Knowing which dialect you're in is a senior signal.

The two traps every interviewer probes.

  • NULL handling. COUNT(*) counts every row, NULLs included. COUNT(col) ignores NULLs. SUM, AVG, MIN, MAX all ignore NULLs but return NULL when the input is empty (not zero — a frequent reporting bug).
  • COUNT(*) vs COUNT(col) misuse. Junior engineers default to COUNT(*) everywhere; senior engineers reach for COUNT(col) or COUNT(DISTINCT col) the instant the column may be NULL or non-unique.

What interviewers listen for.

  • Do you mention NULL-handling the first time COUNT appears? — senior signal.
  • Do you reach for COUNT(DISTINCT) vs COUNT(*) based on the semantic question? — required answer.
  • Do you mention APPROX_COUNT_DISTINCT + HyperLogLog when cardinality is large? — senior signal.
  • Do you know FILTER (WHERE …) as the modern replacement for SUM(CASE WHEN …)? — current-default signal.
  • Do you protect divisions with NULLIF(denom, 0) and empty groups with COALESCE(SUM(...), 0)? — production-ready signal.
  • Do you reach for ROLLUP instead of UNION ALL of two queries for "details + grand total"? — senior signal.

The 2026 reality (SQL aggregates in modern engines).

  • ANSI FILTER (WHERE …) is the standard conditional-aggregate idiom. Postgres has had it since 9.4, Snowflake since 2018, SQL Server since 2022, DuckDB and Spark SQL natively. MySQL ≤ 8 still requires SUM(CASE WHEN …).
  • APPROX_COUNT_DISTINCT is native in Snowflake, BigQuery, and Spark; Postgres needs the hll or postgresql-hll extension; MySQL still has no built-in. HyperLogLog++ is the algorithm everyone implements.
  • GROUPING SETS, ROLLUP, CUBE are ANSI and supported by every major warehouse. MySQL only ships WITH ROLLUP; BigQuery ships ROLLUP and GROUPING SETS but not CUBE.
  • PIVOT / UNPIVOT clauses exist in Snowflake, SQL Server, and Oracle but are syntactic sugar around GROUP BY + conditional aggregates — the portable answer is still SUM(amount) FILTER (WHERE region = 'X').

Worked example — the one-row trap every junior misses

Detailed explanation. A classic interview probe: "Here's a 5-row table where one amount is NULL. What do COUNT(*), COUNT(amount), SUM(amount), and AVG(amount) return — and which one will surprise a junior engineer?" The probe is testing whether you remember that AVG divides by COUNT(amount), not by COUNT(*).

Question. Given the orders table below with five rows and one NULL amount, predict the exact result of each of the four aggregates, and explain why AVG and SUM/COUNT(*) diverge.

Input.

order_id amount
1 100
2 200
3 NULL
4 300
5 400

Code.

SELECT
  COUNT(*)            AS rows_total,
  COUNT(amount)       AS rows_with_amount,
  SUM(amount)         AS sum_amount,
  AVG(amount)         AS avg_amount_correct,
  SUM(amount) * 1.0 / COUNT(*)  AS avg_amount_wrong
FROM orders;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. COUNT(*) walks every row regardless of value — 5 rows in, 5 out. NULL rows still count.
  2. COUNT(amount) walks every row but skips the ones where amount IS NULL — 4 rows in, 4 out.
  3. SUM(amount) adds 100 + 200 + 300 + 400 = 1000. The NULL row is silently ignored — SUM treats NULL as "absent value," not zero.
  4. AVG(amount) is defined as SUM(amount) / COUNT(amount) = 1000 / 4 = 250. Postgres returns NUMERIC; MySQL returns DOUBLE; both equal 250 here.
  5. The "wrong" formula uses COUNT(*) in the denominator: 1000 / 5 = 200. This is what you get if you reimplement AVG by hand and forget the NULL rule.

Output.

rows_total rows_with_amount sum_amount avg_amount_correct avg_amount_wrong
5 4 1000 250 200

Rule of thumb. Whenever a junior asks "why is AVG higher than SUM / number of rows?" the answer is always the same: AVG divides by COUNT(col), not COUNT(*). If you want NULL rows to drag the average down, replace them with zero first using COALESCE(amount, 0).

SQL interview question on COUNT vs COUNT(col) NULL handling

A senior interviewer often frames this as: "We migrated a Postgres table and our daily-active-users dashboard dropped 8% overnight. The query is SELECT COUNT(user_id) FROM events GROUP BY day. What's the silent bug?" The expected answer surfaces the COUNT(col) ignores NULL behaviour, then walks the fix.

Solution Using COUNT(*) for "rows" vs COUNT(col) for "non-null rows"

-- Two different questions, two different answers
SELECT
  day,
  COUNT(*)                                AS events_per_day,        -- every event row
  COUNT(user_id)                          AS events_with_user_id,   -- excludes anonymous
  COUNT(DISTINCT user_id)                 AS distinct_users         -- the DAU answer
FROM events
GROUP BY day
ORDER BY day;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

day row_id user_id event
2026-05-01 1 42 login
2026-05-01 2 NULL anon_view
2026-05-01 3 42 click
2026-05-01 4 99 login
2026-05-01 5 NULL anon_view

Walking the aggregate one row at a time:

  1. COUNT(*) increments on every row → 5.
  2. COUNT(user_id) skips rows 2 and 5 because user_id IS NULL → 3.
  3. COUNT(DISTINCT user_id) adds only previously-unseen non-null values: {42, 99} → 2.
  4. The DAU dashboard wanted COUNT(DISTINCT user_id) = 2 but the new query returned COUNT(user_id) = 3 — which counts user 42's two events as two users.
  5. After the migration, anonymous events stopped being dropped at the ingest layer, so a flood of NULL user_id rows arrived. COUNT(user_id) already ignored them, so it looked stable — but COUNT(DISTINCT user_id) would have moved if the actual user set changed.

Output:

metric value what it counts
COUNT(*) 5 total rows (including anonymous)
COUNT(user_id) 3 rows with a non-null user (still over-counts repeats)
COUNT(DISTINCT user_id) 2 unique users — the DAU answer

Why this works — concept by concept:

  • COUNT(*) — counts every row in the input, ignoring nothing. The only aggregate that does not have a NULL-skip rule. Use it for "how many records exist."
  • COUNT(col) — counts rows where col IS NOT NULL. Silently drops NULL rows. Use it for "how many records have a value."
  • COUNT(DISTINCT col) — counts unique non-null values. Use it for "how many distinct entities show up." This is almost always the right shape for DAU/MAU/unique-X metrics.
  • NULL semantics — SQL's three-valued logic means NULL = NULL is itself NULL (not TRUE), which is why every aggregate except COUNT(*) has to encode the "skip NULL" rule explicitly.
  • CostCOUNT(*) and COUNT(col) are O(N) streaming, no extra memory. COUNT(DISTINCT col) is O(distinct cardinality) memory because it must keep a hash of seen values.

SQL
Topic — aggregation
Aggregation problems (COUNT, SUM, AVG)

Practice →


2. COUNT, SUM, AVG anatomy and NULL handling

sql count, SUM, and AVG share one rule — every aggregate except COUNT(*) ignores NULL

The mental model in one line: COUNT(*) counts rows; everything else counts or sums values, skipping NULLs. Once you say "NULL is silently filtered before the reduction runs," every sql count and sql count distinct interview question collapses to obvious answers — and you stop being surprised when SUM on an all-NULL column returns NULL instead of 0.

Visual diagram of COUNT, SUM, AVG NULL-handling — a small input table with one NULL row, then three side-by-side aggregate cards showing how each function handles NULLs (COUNT(*) includes, COUNT(col) ignores, SUM ignores, AVG = SUM/COUNT(col)); on a light PipeCode card.

The seven rules every interview probes.

  • COUNT(*) — counts every row, NULLs included. The only aggregate that does not have a NULL-skip rule.
  • COUNT(col) — counts rows where col IS NOT NULL. Use for "rows with a value."
  • COUNT(DISTINCT col) — counts unique non-NULL values. Use for "unique entities."
  • SUM(col) — adds non-NULL values; returns NULL if every input value is NULL (not 0).
  • AVG(col) — defined as SUM(col) / COUNT(col). Skips NULLs in both halves; returns NULL on empty input.
  • MIN(col) / MAX(col) — return the smallest / largest non-NULL value; NULL ignored.
  • Empty group rule. Every aggregate on an empty group returns NULL, except COUNT(*) and COUNT(col) which return 0.

The BIGINT overflow trap.

  • SUM on an INT column in MySQL returns a BIGINT, but SUM on a BIGINT column has no native promotion — at 9.2 × 10^18 the sum wraps or errors depending on dialect.
  • Postgres SUM(bigint) promotes to NUMERIC automatically — safer, but slower for very-large datasets.
  • Snowflake/BigQuery use 128-bit NUMERIC arithmetic for SUM by default. Overflow is essentially never seen in practice.
  • The interview answer: cast to a wider type before summing if you're not sure: SUM(amount::numeric) (Postgres) or SUM(CAST(amount AS DECIMAL(38,2))) (Snowflake/Oracle/SQL Server).

The AVG dialect drift.

  • AVG(int_col) in Postgres returns NUMERIC — exact, slower, often surprises beginners (AVG(int) = 2.50000000000000000000).
  • AVG(int_col) in MySQL returns DOUBLE — fast, lossy, can show 2.5 or 2.4999999999999... depending on values.
  • AVG(int_col) in Snowflake/BigQuery returns FLOAT (BigQuery) or NUMBER(38, 6) (Snowflake).
  • Defensive coding. AVG(amount::numeric) or SUM(amount) * 1.0 / NULLIF(COUNT(amount), 0) if you need explicit control over the result type.

The SUM(NULL) vs SUM(all NULLs) rule.

  • SUM(NULL) (literal NULL passed in) → NULL.
  • SUM(col) where every row of col is NULL → NULL (not 0).
  • SUM(col) where some rows are NULL → sum of the non-NULL rows. NULL is treated as "absent value."
  • If your report needs zero on empty input, wrap with COALESCE(SUM(col), 0).

Worked example — daily revenue with conditional aggregates

Detailed explanation. A classic analytics problem: compute daily revenue, plus a separate "paid revenue" column that only counts orders where status = 'paid'. The naive answer is SUM(CASE WHEN status = 'paid' THEN amount ELSE 0 END); the modern answer is SUM(amount) FILTER (WHERE status = 'paid').

Question. Given a 6-row orders table with three statuses (paid, pending, refunded), write a single query that returns total revenue, paid revenue, and refunded revenue per day. Use ANSI FILTER if available; show the portable CASE WHEN form for MySQL/Snowflake legacy code.

Input.

order_id order_day amount status
1 2026-05-01 100 paid
2 2026-05-01 200 paid
3 2026-05-01 50 refunded
4 2026-05-02 300 paid
5 2026-05-02 NULL pending
6 2026-05-02 400 paid

Code.

-- Postgres / Snowflake / SQL Server 2022+: ANSI FILTER
SELECT
  order_day,
  SUM(amount)                              AS total_revenue,
  SUM(amount) FILTER (WHERE status='paid')     AS paid_revenue,
  SUM(amount) FILTER (WHERE status='refunded') AS refunded_revenue,
  COUNT(*)   FILTER (WHERE status='paid')      AS paid_orders
FROM orders
GROUP BY order_day
ORDER BY order_day;

-- Portable form (works on MySQL ≤ 8, older Snowflake, all engines)
SELECT
  order_day,
  SUM(amount)                                       AS total_revenue,
  SUM(CASE WHEN status='paid'     THEN amount END)  AS paid_revenue,
  SUM(CASE WHEN status='refunded' THEN amount END)  AS refunded_revenue,
  SUM(CASE WHEN status='paid'     THEN 1 ELSE 0 END) AS paid_orders
FROM orders
GROUP BY order_day
ORDER BY order_day;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. SUM(amount) walks every row in the group and adds non-NULL amounts: day 1 → 100+200+50 = 350; day 2 → 300+400 = 700 (NULL row skipped).
  2. SUM(amount) FILTER (WHERE status='paid') evaluates the predicate per row; only rows where the predicate is TRUE contribute. Day 1 → 100+200 = 300; day 2 → 300+400 = 700.
  3. The portable CASE WHEN form returns NULL for non-matching rows; SUM then skips them — same answer as FILTER.
  4. COUNT(*) FILTER (WHERE status='paid') is equivalent to SUM(CASE WHEN status='paid' THEN 1 ELSE 0 END) but reads more clearly and runs slightly faster (no intermediate 1/0 materialisation).
  5. The pending order on day 2 has NULL amount — it contributes to COUNT(*) but not to any SUM. If the dashboard says "5 orders, $400 revenue" the missing $400 is a NULL in the data, not a bug in the query.

Output.

order_day total_revenue paid_revenue refunded_revenue paid_orders
2026-05-01 350 300 50 2
2026-05-02 700 700 NULL 2

Rule of thumb. Reach for FILTER (WHERE …) whenever your engine supports it — it's clearer, faster, and explicitly ANSI. Fall back to SUM(CASE WHEN …) only on MySQL or legacy code. The two are semantically identical but FILTER reads as "aggregate this subset" instead of "sum a conditional value."

SQL interview question on conditional aggregation

A staff-level interviewer might frame this as: "Write one query that returns, per region, total revenue, the percentage of orders that were refunded, and the average order value of paid orders only." It tests FILTER (or CASE WHEN), NULL-safe division, and grouping in a single pass.

Solution Using FILTER (WHERE …) with NULLIF for safe division

SELECT
  region,
  SUM(amount)                                      AS total_revenue,
  COUNT(*)                                         AS total_orders,
  COUNT(*) FILTER (WHERE status = 'refunded') * 1.0
    / NULLIF(COUNT(*), 0)                          AS refund_rate,
  AVG(amount) FILTER (WHERE status = 'paid')       AS avg_paid_order_value
FROM orders
GROUP BY region
ORDER BY region;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

region order_id amount status
EU 1 100 paid
EU 2 200 paid
EU 3 50 refunded
US 4 300 paid
US 5 400 paid
US 6 100 refunded
US 7 NULL pending

Walking the aggregate per region:

  1. EU. Three orders. SUM(amount) = 350. COUNT(*) = 3. Refunded count = 1; refund_rate = 1/3 ≈ 0.333. AVG(amount) FILTER (WHERE status='paid') averages {100, 200} = 150.
  2. US. Four orders. SUM(amount) = 300+400+100 = 800 (NULL skipped). COUNT(*) = 4. Refunded count = 1; refund_rate = 1/4 = 0.25. Paid AVG averages {300, 400} = 350.
  3. NULLIF(COUNT(*), 0) is the divide-by-zero guard — if COUNT(*) = 0, NULLIF returns NULL and the whole division returns NULL instead of erroring. Critical for empty-group safety.
  4. The pending US row contributes to COUNT(*) (4 orders) and to the refund-rate denominator, but does not contribute to any SUM or AVG because its amount is NULL.

Output:

region total_revenue total_orders refund_rate avg_paid_order_value
EU 350 3 0.333 150.0
US 800 4 0.250 350.0

Why this works — concept by concept:

  • FILTER (WHERE …) — applies the predicate before the reduction. COUNT(*) FILTER (WHERE x) is the cleanest way to write a conditional count; SUM(amount) FILTER (WHERE x) the cleanest conditional sum. Available in Postgres, Snowflake, SQL Server 2022+, DuckDB, Spark SQL.
  • NULLIF(denom, 0) — the standard divide-by-zero guard. NULLIF(x, y) returns NULL when x = y, otherwise x. SQL propagates NULL through arithmetic, so the whole expression yields NULL — which most reporting layers render as "no data" rather than erroring.
  • Multiplying by 1.0 — coerces integer division to floating-point. Without it, 1 / 3 in Postgres returns 0 (integer division). The cleanest portable trick.
  • AVG ignores NULL automaticallyAVG(amount) FILTER (WHERE status='paid') averages only the paid rows whose amount is non-NULL. You don't need an extra IS NOT NULL guard.
  • Single-pass GROUP BY — the engine walks the table once per region, accumulating all four metrics into a tiny state. No subqueries, no JOIN, no WITH. The most efficient shape for analytics.
  • Cost — O(N) scan with O(distinct regions) memory. Index on region lets the optimiser do a sort-aggregate; otherwise a hash-aggregate with one slot per region.

SQL
Topic — conditional aggregation
Conditional aggregation problems (FILTER, CASE WHEN)

Practice →

Worked example — SUM and the empty-group surprise

Detailed explanation. A subtle bug that ships to production is SUM(amount) returning NULL on an empty group when the dashboard expected 0. The fix is COALESCE(SUM(amount), 0) at the SELECT level, but only if you understand why SUM of nothing is NULL, not zero. The other half of the bug is that GROUP BY region on a fact table only returns regions that appear in the fact table — empty regions silently drop out of the report.

Question. Compute the total revenue per region for the last 7 days. Some regions had zero orders this week. Write the query so empty regions show 0, not NULL — and so empty regions still appear as a row in the output.

Input.

region order_id amount order_day
EU 1 100 2026-05-25
EU 2 200 2026-05-26
US 3 300 2026-05-27
APAC (no orders this week)

Code.

-- Broken: APAC row is dropped entirely because no orders exist
SELECT region, SUM(amount) AS revenue
FROM orders
WHERE order_day >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY region;

-- Fixed: anchor on a regions dimension, LEFT JOIN to orders, COALESCE the SUM
SELECT
  r.region,
  COALESCE(SUM(o.amount), 0) AS revenue
FROM regions r
LEFT JOIN orders o
  ON o.region = r.region
 AND o.order_day >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY r.region
ORDER BY r.region;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The broken query scans orders, groups by region, and produces one output row per region that appears in the orders table. APAC never appears, so APAC is silently missing.
  2. The fixed query anchors on the regions dimension table — every region exists there exactly once.
  3. The LEFT JOIN carries every region through even when no matching order exists. For APAC, o.amount IS NULL for every joined row.
  4. SUM(NULL) returns NULL on an empty group; COALESCE(SUM(o.amount), 0) substitutes 0 for the NULL result.
  5. The predicate o.order_day >= CURRENT_DATE - INTERVAL '7 days' lives inside the JOIN clause, not the WHERE clause. If it lived in WHERE, it would filter out the NULL-extended row for APAC and turn the LEFT JOIN back into an INNER JOIN — a classic anti-pattern.

Output (fixed).

region revenue
APAC 0
EU 300
US 300

Rule of thumb. Whenever a dashboard's zero-revenue regions go missing, the bug is almost always a GROUP BY on a fact table without anchoring on a dimension table. Use LEFT JOIN dim ... ON ... plus COALESCE(SUM(...), 0) and the report will show zeroes instead of dropping rows.


3. MIN and MAX over strings, dates, and NULLs

MIN/MAX never coerce types — feed them comparable values, get comparable results

The mental model in one line: MIN and MAX apply the column's natural ordering — lexicographic for strings, chronological for dates, numeric for numbers — and ignore NULL the same way every other aggregate does. The three edge cases interviewers probe are string collation surprises, the "matching row" trap (MIN(price) doesn't pair with the row's product_name for free), and NULL semantics (NULL is not "smallest" or "largest" — it's just absent).

Visual diagram of MIN/MAX edge cases — three side-by-side panels for strings (lexicographic + collation), dates (chronological), and NULLs (ignored); each panel has a small example table and a one-line annotation; on a light PipeCode card.

The five rules every MIN/MAX interview probes.

  • Strings. Lexicographic order: 'Alice' < 'Bob' < 'Charlie'. Collation determines whether 'a' < 'A' or 'a' > 'A'.
  • Dates / timestamps. Chronological order. MIN(date) is the earliest; MAX(date) is the latest. Works on DATE, TIMESTAMP, TIMESTAMPTZ.
  • Numerics. Standard < and >. No surprises.
  • NULLs. Skipped. MIN({10, NULL, 50, 30}) = 10, not NULL. If every value is NULL, MIN returns NULL.
  • Mixed types. SQL refuses to compare incompatible types — MIN on a column with mixed INT and VARCHAR values requires an explicit cast.

Idiomatic uses every analytics SQL writer should know.

  • MIN(created_at) → first-seen. First time a user appeared, first order in a region, first event of a session.
  • MAX(updated_at) → last-touched. Most recent update — pairs with WHERE updated_at >= last_run for incremental loads.
  • MIN(login_date) per user_id = cohort assignment.
  • MAX(price) - MIN(price) per product = price range — MIN/MAX compose to range.
  • MIN(score)/MAX(score) with GROUP BY give per-group bookends; for "the row with the min," see the matching-row trap below.

Collation is the trap that bites cross-region migrations.

  • Postgres default LC_COLLATE = en_US.UTF-8: 'a' < 'B' (case-insensitive sort puts lowercase before uppercase in some collations).
  • MySQL utf8mb4_general_ci: case-insensitive — 'Alice' = 'alice' for = and the same row order.
  • Snowflake: binary-collation by default — 'B' < 'a' because uppercase ASCII codepoints are smaller than lowercase.
  • The fix: explicit COLLATE clause, or normalise to a single case with LOWER(col) before the aggregate.

The "matching row" trap (most asked variant).

  • Wrong. SELECT MIN(price), product_name FROM products — engines that allow this (MySQL with ONLY_FULL_GROUP_BY=off) silently pair the minimum price with an arbitrary product name from the group.
  • Right (Postgres / Snowflake). SELECT product_name FROM (SELECT product_name, price, ROW_NUMBER() OVER (ORDER BY price) AS rn FROM products) WHERE rn = 1.
  • Right (Oracle). SELECT MIN(price) KEEP (DENSE_RANK FIRST ORDER BY price), MAX(product_name) KEEP (DENSE_RANK FIRST ORDER BY price) FROM products.
  • Right (any dialect). WITH ranked AS (SELECT *, ROW_NUMBER() OVER (ORDER BY price) rn FROM products) SELECT * FROM ranked WHERE rn = 1.

Worked example — first-seen and last-seen per user

Detailed explanation. A canonical analytics question: for each user, return the date of their first event and the date of their most recent event. The shape is MIN(event_date) + MAX(event_date) per user — a single GROUP BY query, no joins, no window functions needed.

Question. Given an events table with user_id and event_date, write a query that returns one row per user with their first-seen date, last-seen date, and the number of distinct active days. Order the output by last-seen descending.

Input.

event_id user_id event_date
1 42 2026-04-01
2 42 2026-04-15
3 42 2026-05-20
4 99 2026-04-10
5 99 2026-04-10
6 99 2026-05-25
7 7 2026-05-01

Code.

SELECT
  user_id,
  MIN(event_date)             AS first_seen,
  MAX(event_date)             AS last_seen,
  COUNT(DISTINCT event_date)  AS active_days,
  MAX(event_date) - MIN(event_date) AS span_days
FROM events
GROUP BY user_id
ORDER BY last_seen DESC;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The engine groups by user_id. User 42 has 3 rows; user 99 has 3 rows; user 7 has 1 row.
  2. Per group, MIN(event_date) returns the chronologically earliest date; MAX(event_date) returns the latest.
  3. COUNT(DISTINCT event_date) walks the group, hashing each date and skipping duplicates. User 99 has two events on 2026-04-10 and one on 2026-05-25 — distinct count = 2.
  4. MAX(event_date) - MIN(event_date) is date arithmetic. In Postgres this returns an interval (or int days); in Snowflake / BigQuery you'd use DATEDIFF(day, MIN(...), MAX(...)).
  5. ORDER BY last_seen DESC sorts the result rows; this happens after the grouping completes, so it sees one row per user.

Output.

user_id first_seen last_seen active_days span_days
99 2026-04-10 2026-05-25 2 45
42 2026-04-01 2026-05-20 3 49
7 2026-05-01 2026-05-01 1 0

Rule of thumb. Whenever a question reads "first-seen," "last-seen," "first time a user did X," or "most recent update," the answer is MIN(timestamp) or MAX(timestamp) per group. If you also need the matching row's other columns (the product name when price is min, the country when login is first), reach for ROW_NUMBER() OVER (ORDER BY ...)MIN alone can't carry sibling columns.

SQL interview question on the matching-row trap

A staff-level interviewer might frame this as: "For each product category, return the product with the lowest price, along with its name and SKU. Don't pair the wrong product name with the minimum price." It tests whether the candidate reaches for ROW_NUMBER() instead of a naive MIN(price) + MIN(name) chain.

Solution Using ROW_NUMBER() OVER (PARTITION BY … ORDER BY …)

WITH ranked AS (
  SELECT
    category,
    product_name,
    sku,
    price,
    ROW_NUMBER() OVER (
      PARTITION BY category
      ORDER BY price ASC, sku ASC          -- deterministic tie-break
    ) AS rn
  FROM products
  WHERE price IS NOT NULL
)
SELECT category, product_name, sku, price
FROM ranked
WHERE rn = 1
ORDER BY category;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

category product_name sku price
Electronics Phone X E-101 699
Electronics Phone Y E-104 499
Electronics Phone Z E-110 499
Books SQL Deep Dive B-2001 39
Books Algorithms 4e B-2010 79

Walking the window:

  1. The ROW_NUMBER() window is partitioned by category and ordered by price ASC. Within each category, the lowest-price row gets rn = 1.
  2. Electronics. Two products tie at price 499. The tie-break ORDER BY price ASC, sku ASC makes E-104 win because 'E-104' < 'E-110' lexicographically. rn values: E-104 → 1, E-110 → 2, E-101 → 3.
  3. Books. No ties. B-2001 (price 39) → rn = 1. B-2010 (price 79) → rn = 2.
  4. The outer filter keeps only rows where rn = 1 → one row per category.
  5. The WHERE price IS NOT NULL guard inside the CTE prevents NULL prices from being silently sorted to the bottom (Postgres) or top (Snowflake) of the order.

Output:

category product_name sku price
Books SQL Deep Dive B-2001 39
Electronics Phone Y E-104 499

Why this works — concept by concept:

  • ROW_NUMBER() vs MIN()MIN(price) reduces N rows to one scalar; you lose the matching row's other columns. ROW_NUMBER() tags every row with a rank, so you can filter to rn = 1 and keep every column from the winning row.
  • PARTITION BY for per-groupPARTITION BY category resets the row-number counter at each new category. Equivalent to "do this min-finding once per group."
  • Deterministic tie-breakORDER BY price ASC, sku ASC ensures the same row wins every time the query runs. Without the secondary key, ties are broken arbitrarily, which makes debugging downstream reports painful.
  • NULL guard at the CTE — explicitly excluding NULL prices keeps the row-number ordering deterministic and avoids "the cheapest product is NULL" surprises.
  • Alternative: KEEP DENSE_RANK FIRST (Oracle)MIN(product_name) KEEP (DENSE_RANK FIRST ORDER BY price) is a one-line idiom in Oracle and DB2 that does the same job. Not portable, but very clean if you're locked into Oracle.
  • Cost — one window pass plus the filter; O(N log N) due to the sort. If the data is already clustered by (category, price), the optimiser does a merge instead of a sort.

SQL
Topic — window functions
Window function problems (ROW_NUMBER, RANK)

Practice →

Worked example — MIN and MAX over string columns with collation

Detailed explanation. A subtle interview probe is "what does MAX(username) return on a mixed-case column?" — because the answer depends on collation. In Postgres with en_US.UTF-8, 'alice' > 'Bob' (lowercase is "larger" in some collations). In Snowflake's default binary collation, 'A' < 'a' because uppercase ASCII codepoints are smaller. The interview test isn't whether you've memorised collations — it's whether you reach for LOWER(col) defensively when the answer matters.

Question. Given a users table with mixed-case username values, predict the result of MAX(username) in three engines: Postgres default, MySQL utf8mb4_general_ci, and Snowflake default. Then write a portable query that always returns the same answer.

Input.

user_id username
1 Alice
2 bob
3 Charlie
4 alice
5 BOB

Code.

-- Native MAX (engine-dependent result)
SELECT MAX(username) AS max_user FROM users;

-- Portable: normalise to a single case first
SELECT MAX(LOWER(username)) AS max_user_normalised FROM users;

-- Explicit collation (Postgres / SQL Server)
SELECT MAX(username COLLATE "C") AS max_user_binary FROM users;   -- Postgres
-- SELECT MAX(username COLLATE Latin1_General_BIN) FROM users;    -- SQL Server
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Postgres en_US.UTF-8. Locale-aware sort: 'Alice' < 'alice' < 'bob' < 'BOB' < 'Charlie' < 'charlie' (collation considers locale-specific rules). MAX returns the largest value in this ordering — typically 'charlie' or 'Charlie' depending on the precise locale rules.
  2. MySQL utf8mb4_general_ci. Case-insensitive: 'Alice' = 'alice' and 'Bob' = 'bob' = 'BOB'. The engine still has to break ties for MAX; it picks an arbitrary case-variant. Result: one of 'Charlie', 'charlie', or 'CHARLIE'.
  3. Snowflake binary collation. Strict ASCII order: uppercase (codepoints 65–90) sorts before lowercase (codepoints 97–122). So 'BOB' < 'Charlie' < 'alice' < 'bob'. MAX returns 'bob'.
  4. MAX(LOWER(username)) normalises every value to lowercase before the aggregate. Every engine agrees: the result is 'charlie'. The downside is you lose the original casing of the matching row.
  5. COLLATE "C" (Postgres) or Latin1_General_BIN (SQL Server) forces binary comparison. Every engine that supports the clause now agrees with Snowflake's default behaviour.

Output (engine-by-engine).

Engine MAX(username) MAX(LOWER(username))
Postgres (en_US.UTF-8) charlie (locale-aware) charlie
MySQL (utf8mb4_general_ci) any of Charlie / charlie / CHARLIE charlie
Snowflake (binary) bob charlie

Rule of thumb. When the answer to MAX(string_col) matters for business logic, normalise with LOWER(col) (or UPPER(col)) before the aggregate. If you need the original casing back, wrap with ROW_NUMBER() OVER (ORDER BY LOWER(username) DESC) and filter to rn = 1 — that returns the matching row in its original case.


4. COUNT(DISTINCT) and APPROX_COUNT_DISTINCT

count distinct sql is the most expensive aggregate — pick exact or approximate based on your latency budget

The mental model in one line: COUNT(DISTINCT col) must remember every distinct value it has seen, so its memory is O(distinct cardinality); APPROX_COUNT_DISTINCT(col) keeps a fixed-size HyperLogLog sketch and trades ~1.6% error for ~100× speed-up. Once you internalise that trade-off — exact vs approximate, memory vs error — every sql count distinct interview question collapses to a single question: "How much error can you tolerate?"

Visual comparison of COUNT(DISTINCT) vs APPROX_COUNT_DISTINCT — left side shows exact COUNT(DISTINCT) with a hash-table icon and 'O(distinct cardinality) memory' chip; right side shows APPROX_COUNT_DISTINCT with a HyperLogLog visual (log buckets) and 'O(log log n) memory · ~1.6% error' chip; on a light PipeCode card.

Exact COUNT(DISTINCT col) in five bullets.

  • What it does. Keeps a hash set of every distinct value seen; returns the set's size at the end.
  • Memory cost. O(distinct cardinality). 1 billion distinct strings ≈ 30–80 GB of RAM depending on string length.
  • CPU cost. O(N) walks + O(N) hashes; the hash set itself amortises O(1) per insert.
  • When it spills. When the distinct set exceeds available memory, the engine spills to disk. Query time goes from seconds to minutes.
  • When to use it. When the result feeds a financial report, a compliance metric, or any case where "exact" is non-negotiable.

Approximate APPROX_COUNT_DISTINCT(col) in five bullets.

  • What it does. Keeps a HyperLogLog sketch (a small array of "register" buckets, each storing a leading-zero count of hashed values).
  • Memory cost. O(log log n) — typically ~1.5 KB for the standard 14-bit HLL++ register count, regardless of dataset size.
  • CPU cost. O(N) walks + O(N) hashes; no hash-set growth, no spill.
  • Error bound. ~1.6% standard error with the default register count (Snowflake, BigQuery, Spark). Configurable down to ~0.5% with more registers.
  • When to use it. Dashboards, MAU/DAU metrics, ad analytics, anything where "approximate-within-2%" is acceptable and you need sub-second responses on billion-row tables.

Dialect coverage.

  • Snowflake. APPROX_COUNT_DISTINCT(col) native. ~1.6% error. Also HLL_ESTIMATE, HLL_COMBINE for sketch composition.
  • BigQuery. APPROX_COUNT_DISTINCT(col) native. Same HLL++ implementation. Also HLL_COUNT.INIT, HLL_COUNT.EXTRACT, HLL_COUNT.MERGE for sketch composition across daily partitions.
  • Spark SQL / Databricks. approx_count_distinct(col, rsd) where rsd is the target relative standard deviation.
  • Postgres. No native function — install the hll extension or postgresql-hll.
  • MySQL. No native function — fall back to exact COUNT(DISTINCT).

Multi-column COUNT(DISTINCT a, b).

  • ANSI SQL: COUNT(DISTINCT a, b) — counts distinct tuples of (a, b). Supported in Postgres, Snowflake, BigQuery, Oracle.
  • MySQL: COUNT(DISTINCT a, b) supported.
  • SQL Server: not supported — use COUNT(DISTINCT CONCAT(a, '|', b)) or COUNT(DISTINCT CAST(a AS VARCHAR) + '|' + CAST(b AS VARCHAR)).
  • Beware the NULL rule: if either a or b is NULL, the tuple is excluded.

HyperLogLog++ in one paragraph.

  • The HLL data structure is an array of m = 2^p "register" buckets (p = 14 is the common default → 16,384 registers, ~1.5 KB total).
  • Each input value is hashed to a 64-bit integer. The first p bits select a register; the rank of the leading-zero run in the remaining bits is stored in that register (running max).
  • The estimator inverts the register state to produce a cardinality estimate. Standard error ≈ 1.04 / sqrt(m) ≈ 0.81% for the default; engines often report ~1.6% to leave headroom for highly-skewed inputs.
  • Sketches composeMERGE(sketch_day_1, sketch_day_2) produces a sketch for the union. Critical for incremental dashboards that pre-aggregate per day and merge across windows.
  • Snowflake and BigQuery expose this composition via HLL_COMBINE / HLL_COUNT.MERGE. Postgres + hll extension exposes hll_union_agg.

Worked example — distinct values vs DISTINCT in SELECT

Detailed explanation. Junior engineers often confuse three shapes that look similar but answer different questions: SELECT DISTINCT col FROM t, SELECT COUNT(DISTINCT col) FROM t, and SELECT col, COUNT(*) FROM t GROUP BY col. Each has its place; mixing them up produces silent bugs.

Question. Given an events table with 100 rows and user_id values that include duplicates and NULLs, predict the row count of each of these three queries and explain when you'd use which.

Input (summary).

Property Value
Total rows 100
Distinct non-NULL user_id 30
Rows with user_id IS NULL 5

Code.

-- Query A — every distinct value (one row per user, NULL collapsed to one row)
SELECT DISTINCT user_id FROM events;

-- Query B — scalar count of distinct non-NULL values
SELECT COUNT(DISTINCT user_id) FROM events;

-- Query C — distinct values plus how many times each appears
SELECT user_id, COUNT(*) AS row_count
FROM events
GROUP BY user_id
ORDER BY row_count DESC;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Query A. Returns 31 rows: 30 distinct non-NULL user_id values + 1 row for NULL. SELECT DISTINCT treats NULL as a distinct value and emits it once (unlike COUNT(DISTINCT)).
  2. Query B. Returns 1 row with a single scalar = 30. NULL is excluded because COUNT(DISTINCT) ignores NULLs just like every other aggregate.
  3. Query C. Returns 31 rows (same count as A) but each row carries the per-value frequency. NULL appears once with row_count = 5. This is the most common shape for "show me the distribution of user_id values."
  4. The three queries scan the same table and use the same hash-deduplication machinery internally; they just expose the result differently.
  5. The interview probe: "If DISTINCT keeps NULL but COUNT(DISTINCT) drops it, why?" — because DISTINCT is a row-level operation (NULL is a value like any other for row deduplication) while COUNT(DISTINCT) is an aggregate (every aggregate except COUNT(*) ignores NULLs).

Output (row counts).

Query Row count Includes NULL?
SELECT DISTINCT user_id 31 yes (one NULL row)
SELECT COUNT(DISTINCT user_id) 1 row, value 30 no (NULL excluded)
SELECT user_id, COUNT(*) GROUP BY user_id 31 yes (NULL row with row_count = 5)

Rule of thumb. Use SELECT DISTINCT col when you want the list. Use COUNT(DISTINCT col) when you want the scalar count of unique non-NULL values. Use GROUP BY col when you want the list and the per-value count (a frequency table). All three use the same de-duplication machinery — they're three views into the same operation.

Worked example — daily active users at 10B-row scale

Detailed explanation. A common analytics-engineering problem: compute DAU on an events table with 10 billion rows and ~1 billion distinct users per day. Exact COUNT(DISTINCT user_id) runs for minutes and may spill; APPROX_COUNT_DISTINCT runs in seconds with ~1.6% error.

Question. Compare the cost and accuracy of exact vs approximate distinct counts on a 10-billion-row events table with ~1 billion distinct users. Show the Snowflake query, the BigQuery query, and the Spark SQL query.

Input (table profile).

Attribute Value
Table events
Total rows 10,000,000,000
Distinct users ~1,000,000,000
Avg user_id length 16 bytes

Code.

-- Snowflake / BigQuery (HLL++ ~1.6% error)
SELECT
  event_day,
  COUNT(DISTINCT user_id)         AS dau_exact,
  APPROX_COUNT_DISTINCT(user_id)  AS dau_approx
FROM events
GROUP BY event_day
ORDER BY event_day;

-- Spark SQL — tighter error bound (~1.0%)
SELECT
  event_day,
  COUNT(DISTINCT user_id)              AS dau_exact,
  approx_count_distinct(user_id, 0.01) AS dau_approx_tight
FROM events
GROUP BY event_day;

-- Postgres with the hll extension
SELECT
  event_day,
  COUNT(DISTINCT user_id)               AS dau_exact,
  hll_cardinality(
    hll_add_agg(hll_hash_text(user_id))
  )                                     AS dau_approx
FROM events
GROUP BY event_day;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Exact COUNT(DISTINCT user_id) per event_day requires the engine to keep a hash of every distinct user_id seen so far in that group. For 1B distinct users × 16 bytes ≈ 16 GB hash + overhead → spills on most warehouses.
  2. APPROX_COUNT_DISTINCT(user_id) keeps a HyperLogLog sketch of ~1.5 KB per group regardless of cardinality. Memory is constant; CPU is one hash per row plus a tiny register update.
  3. Each row is hashed; the hash's leading-zero count selects a register; the register stores the running max of leading-zero counts. The HLL estimator inverts the register state to produce an approximate cardinality.
  4. The HLL++ implementation in Snowflake/BigQuery uses 14 bits of the hash for register selection (2^14 = 16,384 registers); standard error ≈ 1.04 / sqrt(16,384) ≈ 0.81% (Snowflake reports ~1.6% to leave headroom for skewed inputs).
  5. The Spark approx_count_distinct(col, rsd=0.01) form lets you target 1% error explicitly by increasing the register count to ~64K — about 4× the memory of the default.

Output (illustrative timing on a typical Snowflake X-Large warehouse).

Metric COUNT(DISTINCT user_id) APPROX_COUNT_DISTINCT(user_id)
Walltime ~120 s (with disk spill) ~8 s
Memory ~16 GB hash set ~1.5 KB sketch
Error exact (0%) ~1.6% standard
Cost (credits) high ~5–10× cheaper
Result on 1B users 1,000,000,000 1,016,432,118 (≈ +1.6%)

Rule of thumb. Default to APPROX_COUNT_DISTINCT for every dashboard, BI tile, and analytics-engineering metric. Fall back to exact COUNT(DISTINCT) only for financial reporting, fraud queries, or anywhere the result is shown to an external auditor.

SQL interview question on multi-column distinct counts

A senior interviewer might frame this as: "Count distinct (user_id, day) pairs in the last 30 days — i.e. how many user-active-day combinations did we record? — and then count the distinct days each user was active." It tests the COUNT(DISTINCT a, b) tuple form, NULL handling, and the count-of-count pattern (count of count sql) at the same time.

Solution Using COUNT(DISTINCT user_id, event_day) plus a count-of-count CTE

WITH user_day AS (
  SELECT
    user_id,
    event_day,
    COUNT(*) AS events_in_day
  FROM events
  WHERE event_day >= CURRENT_DATE - INTERVAL '30 days'
    AND user_id IS NOT NULL
  GROUP BY user_id, event_day
),
per_user AS (
  SELECT
    user_id,
    COUNT(*)         AS active_days,
    SUM(events_in_day) AS total_events
  FROM user_day
  GROUP BY user_id
)
SELECT
  COUNT(*)                          AS total_users,
  COUNT(DISTINCT user_id)            AS distinct_users,  -- == total_users here
  SUM(active_days)                   AS user_active_day_pairs,
  COUNT(*) FILTER (WHERE active_days = 1)  AS users_active_1_day,
  COUNT(*) FILTER (WHERE active_days >= 7) AS users_active_7plus_days
FROM per_user;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Walking the layered CTE on a tiny 30-day window with three users:

user_id event_day events_in_day
42 2026-05-01 5
42 2026-05-02 3
42 2026-05-15 7
99 2026-05-10 2
7 2026-05-01 1
7 2026-05-02 1
7 2026-05-03 1
7 2026-05-04 1
7 2026-05-05 1
7 2026-05-06 1
7 2026-05-07 1
7 2026-05-08 1
  1. user_day CTE. Groups events by (user_id, event_day). Output is one row per (user, day) tuple with the per-day event count.
  2. per_user CTE. Groups user_day by user_id. COUNT(*) here equals active_days — distinct days the user was active.
  3. Final SELECT. Now operating on per_user (one row per user) — this is the count-of-count layer.
  4. SUM(active_days) over per_user reproduces COUNT(DISTINCT user_id, event_day) on the original table — 12 user-active-day pairs.
  5. COUNT(*) FILTER (WHERE active_days = 1) counts users active on exactly one day; FILTER (WHERE active_days >= 7) counts users active on 7+ days (a "power-user" cohort).
  6. The WHERE user_id IS NOT NULL guard prevents anonymous event rows from polluting the counts.

Output:

total_users distinct_users user_active_day_pairs users_active_1_day users_active_7plus_days
3 3 12 1 1

Why this works — concept by concept:

  • COUNT(DISTINCT a, b) — counts distinct tuples. Equivalent to SUM(COUNT(*)) over GROUP BY a, b — which is exactly what the layered CTE computes.
  • Count-of-count patternGROUP BY user_id, day once (inner), then GROUP BY user_id again (outer). The outer group's COUNT(*) answers "how many distinct days per user." Apply once more and you get cohort buckets ("how many users have N active days") — the count of count sql shape interviewers love.
  • Layered CTE for readability — splitting user_day and per_user makes the query easier to debug and lets you EXPLAIN each layer in isolation. Most query optimisers will fuse the layers automatically.
  • FILTER (WHERE …) — replaces SUM(CASE WHEN active_days = 1 THEN 1 ELSE 0 END). Cleaner, ANSI-standard, slightly faster.
  • NULL guard at the sourceWHERE user_id IS NOT NULL ensures anonymous events don't contaminate the distinct-pair count. The tuple (NULL, day) would already be excluded by COUNT(DISTINCT), but being explicit makes intent clear.
  • Cost — two hash-aggregates: O(N) → O(distinct (user, day) pairs) → O(distinct users). On 10B rows with 1B distinct users, the inner aggregate dominates. Replace COUNT(DISTINCT) with APPROX_COUNT_DISTINCT if the dashboard tolerates 1.6% error.

SQL
Topic — distinct
DISTINCT and COUNT(DISTINCT) problems

Practice →

Worked example — count of count (frequency histogram)

Detailed explanation. A common analytics-engineering question is "how many users had exactly 1, 2, 3, … N orders this month?" — a histogram of the per-user order count. The shape is a classic count of count sql pattern: aggregate once to get the per-user count, then aggregate again to bucket the counts. Interviewers love it because the naive answer ("SELECT COUNT(orders) GROUP BY user_id, then GROUP BY count") only works once you understand that aggregates can be re-aggregated.

Question. Given an orders table with customer_id and order_id, produce a frequency histogram: one row per "number of orders this month" with the count of customers in that bucket. Cap the histogram at "10+" to avoid a long tail.

Input.

customer_id order_id
c_001 o_1
c_001 o_2
c_001 o_3
c_002 o_4
c_002 o_5
c_003 o_6
c_004 o_7
c_004 o_8
c_004 o_9
c_004 o_10
c_004 o_11

Code.

WITH per_customer AS (
  SELECT customer_id, COUNT(*) AS orders_count
  FROM orders
  WHERE order_day >= DATE_TRUNC('month', CURRENT_DATE)
  GROUP BY customer_id
),
bucketed AS (
  SELECT
    CASE WHEN orders_count >= 10 THEN '10+'
         ELSE CAST(orders_count AS VARCHAR) END AS bucket,
    COUNT(*) AS customers_in_bucket
  FROM per_customer
  GROUP BY 1
)
SELECT bucket, customers_in_bucket
FROM bucketed
ORDER BY
  CASE WHEN bucket = '10+' THEN 99 ELSE CAST(bucket AS INTEGER) END;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. per_customer CTE. Groups orders by customer_id; produces one row per customer with their orders_count. c_001 → 3, c_002 → 2, c_003 → 1, c_004 → 5.
  2. bucketed CTE. Re-aggregates the previous result. The CASE collapses any count ≥ 10 into the '10+' bucket; counts 1–9 are kept as their own bucket.
  3. The outer SELECT orders the result: numeric buckets ascending, then '10+' last. The trick is the CASE in ORDER BY'10+' would sort as a string between '1' and '2' otherwise.
  4. This is the textbook count-of-count pattern: COUNT(*) inside a group, then COUNT(*) over the resulting per-group rows.
  5. Notice that the inner CTE could also use COUNT(DISTINCT order_id) if the source table had duplicate order rows — same logic, slightly more defensive.

Output.

bucket customers_in_bucket
1 1
2 1
3 1
5 1
10+ 0

Rule of thumb. Whenever a question asks "how many users had X items, Y times" or "show me a histogram of …", the answer is a layered CTE: aggregate once to get per-entity counts, then aggregate again to bucket the counts. The pattern composes cleanly with FILTER (WHERE …) for cohorts and with window functions for percentiles.


5. Advanced aggregates — GROUPING SETS, ROLLUP, CUBE, FILTER

One query, multiple grouping levels — collapse three reports into one with GROUPING SETS

The mental model in one line: GROUPING SETS lets you specify multiple group-by combinations in a single query; ROLLUP is the hierarchical-subtotals shortcut; CUBE is the all-combinations shortcut; FILTER (WHERE …) is the conditional-aggregate shortcut. All four extensions exist because writing three UNION ALL-joined GROUP BY queries is verbose, slow (three scans of the same table), and error-prone.

Visual diagram of advanced GROUP BY extensions — four cards (GROUPING SETS, ROLLUP, CUBE, FILTER) with a mini-table illustration of which group keys each generates; a small annotation about replacing SUM(CASE WHEN ...) with FILTER (WHERE ...); on a light PipeCode card.

The four extensions in one bullet each.

  • GROUPING SETS ((a, b), (a), ()) — explicit list of grouping combinations. Most flexible; verbose for many combinations.
  • ROLLUP(a, b, c) — hierarchical subtotals: (a,b,c) → (a,b) → (a) → (). Cleanest way to ship "month + quarter + year + grand total" in one query.
  • CUBE(a, b) — all combinations: (a,b) → (a) → (b) → (). 2^N grouping sets for N keys — explodes fast, use sparingly.
  • FILTER (WHERE cond) — conditional aggregate. Replaces SUM(CASE WHEN cond THEN x ELSE 0 END) with SUM(x) FILTER (WHERE cond). ANSI, supported in Postgres / Snowflake / SQL Server 2022+ / DuckDB / Spark SQL.

Dialect coverage matrix.

Feature Postgres Snowflake BigQuery MySQL ≤ 8 SQL Server
GROUPING SETS yes yes yes (via WITH ROLLUP workaround) no yes
ROLLUP yes yes yes WITH ROLLUP only yes
CUBE yes yes no no yes
FILTER (WHERE …) yes yes no no 2022+
GROUPING(col) yes yes yes no yes

The GROUPING(col) function.

  • Returns 1 if the column is "grouped out" (subtotal row), 0 if it contributes to the row's grouping. Essential when paired with ROLLUP or CUBE to label subtotal rows in the output ("All regions" instead of NULL).
  • Idiomatic use: CASE WHEN GROUPING(region) = 1 THEN 'All regions' ELSE region END AS region_label.

Why FILTER (WHERE …) is strictly better than SUM(CASE WHEN …).

  • Clearer intent. "Sum amount, filtered to paid rows" reads top-to-bottom; the CASE form requires the reader to mentally fold the predicate inside the SUM.
  • Slightly faster. The engine can push the predicate down before the SUM accumulator runs; the CASE form materialises NULL or 0 per row first.
  • Composes with COUNT. COUNT(*) FILTER (WHERE x) is the clearest conditional count; the CASE form is SUM(CASE WHEN x THEN 1 ELSE 0 END) — readable but noisier.
  • Composes with window functions. COUNT(*) FILTER (WHERE x) OVER (PARTITION BY y) works in Postgres; the CASE form does too but reads worse.

Worked example — quarterly revenue with subtotals and a grand total

Detailed explanation. A classic finance-reporting question: produce a single result set with revenue by year × quarter, subtotals by year, and one grand-total row. The shape is GROUP BY ROLLUP(year, quarter), with GROUPING(quarter) and GROUPING(year) to label the subtotal rows.

Question. Given a sales table with year, quarter, and revenue, write a single query that returns four levels: each (year, quarter) bucket, a subtotal per year, and a grand total at the bottom. Use labels instead of NULLs in the subtotal rows.

Input.

year quarter revenue
2025 Q1 100
2025 Q2 150
2025 Q3 200
2025 Q4 250
2026 Q1 300
2026 Q2 350

Code.

SELECT
  CASE WHEN GROUPING(year)    = 1 THEN 'ALL YEARS'
       ELSE CAST(year AS VARCHAR) END     AS year_label,
  CASE WHEN GROUPING(quarter) = 1 THEN
        CASE WHEN GROUPING(year) = 1 THEN 'GRAND TOTAL' ELSE 'YEAR TOTAL' END
       ELSE quarter END                   AS quarter_label,
  SUM(revenue)                            AS revenue,
  GROUPING(year)                          AS g_year,
  GROUPING(quarter)                       AS g_quarter
FROM sales
GROUP BY ROLLUP(year, quarter)
ORDER BY g_year, year, g_quarter, quarter;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. ROLLUP(year, quarter) is equivalent to GROUPING SETS ((year, quarter), (year), ()) — three grouping levels generated automatically.
  2. For each grouping level the engine produces one row per distinct group plus the NULL placeholder for "rolled-up" columns. (year, quarter) produces 6 rows; (year) produces 2 rows (one per distinct year); () produces 1 grand-total row.
  3. GROUPING(quarter) returns 1 when quarter is rolled up (i.e. the row is a yearly subtotal or the grand total). GROUPING(year) returns 1 only on the grand-total row.
  4. The labelled CASE translates GROUPING(quarter) = 1 rows into 'YEAR TOTAL' (or 'GRAND TOTAL' when both year and quarter are rolled up) for human-friendly output.
  5. ORDER BY g_year, year, g_quarter, quarter keeps the subtotals immediately under their parent group and the grand total at the bottom — the natural reading order for a finance report.

Output.

year_label quarter_label revenue g_year g_quarter
2025 Q1 100 0 0
2025 Q2 150 0 0
2025 Q3 200 0 0
2025 Q4 250 0 0
2025 YEAR TOTAL 700 0 1
2026 Q1 300 0 0
2026 Q2 350 0 0
2026 YEAR TOTAL 650 0 1
ALL YEARS GRAND TOTAL 1350 1 1

Rule of thumb. When the report needs "details + subtotals + grand total," reach for ROLLUP. When the levels aren't strictly hierarchical (e.g. you want region totals AND channel totals AND the grand total, but not the (region, channel) cross), reach for explicit GROUPING SETS. Reserve CUBE for OLAP cubes where you genuinely need every combination.

SQL interview question on conditional aggregation across many slices

A staff-level interviewer might frame this as: "Compute a single report row per region with seven metrics: total revenue, paid revenue, refunded revenue, AOV, refund rate, distinct customers, and approximate distinct sessions. Do it in one pass." It blends FILTER, NULL-safe math, exact and approximate distinct counts, and a clean GROUP BY — exactly the shape staff interviewers reward.

Solution Using FILTER (WHERE …) for every conditional metric in one pass

SELECT
  region,
  SUM(amount)                                              AS total_revenue,
  SUM(amount) FILTER (WHERE status = 'paid')               AS paid_revenue,
  SUM(amount) FILTER (WHERE status = 'refunded')           AS refunded_revenue,
  AVG(amount) FILTER (WHERE status = 'paid')               AS avg_paid_order_value,
  COUNT(*) FILTER (WHERE status = 'refunded') * 1.0
    / NULLIF(COUNT(*), 0)                                  AS refund_rate,
  COUNT(DISTINCT customer_id)                              AS distinct_customers,
  APPROX_COUNT_DISTINCT(session_id)                        AS distinct_sessions_approx
FROM orders
WHERE order_day >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY ROLLUP(region)
ORDER BY GROUPING(region), region;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

region order_id customer_id session_id amount status
EU 1 c_001 s_aa 100 paid
EU 2 c_001 s_aa 200 paid
EU 3 c_002 s_bb 50 refunded
US 4 c_003 s_cc 300 paid
US 5 c_003 s_cc 400 paid
US 6 c_004 s_dd 100 refunded
US 7 c_005 s_ee NULL pending

Walking the aggregate per region:

  1. EU. 3 orders, 2 distinct customers (c_001, c_002), 2 distinct sessions. SUM(amount) = 350. paid_revenue = 300. refunded_revenue = 50. avg_paid_order_value = 150. refund_rate = 1/3 ≈ 0.333.
  2. US. 4 orders, 3 distinct customers (c_003, c_004, c_005), 3 distinct sessions. SUM(amount) = 800 (NULL skipped). paid_revenue = 700. refunded_revenue = 100. avg_paid_order_value = 350. refund_rate = 1/4 = 0.25.
  3. Grand total (rolled-up region). 7 orders, 5 distinct customers, 5 distinct sessions. SUM(amount) = 1150. paid_revenue = 1000. refunded_revenue = 150.
  4. APPROX_COUNT_DISTINCT(session_id) keeps a HyperLogLog sketch per group; tiny memory regardless of session cardinality. On the test data it returns the exact value because the input is small enough that HLL is essentially perfect.
  5. ORDER BY GROUPING(region), region puts the detail rows first (g=0) and the grand-total row last (g=1) — natural for a downstream BI tile.

Output:

region total_revenue paid_revenue refunded_revenue avg_paid_order_value refund_rate distinct_customers distinct_sessions_approx
EU 350 300 50 150.0 0.333 2 2
US 800 700 100 350.0 0.250 3 3
NULL 1150 1000 150 250.0 0.286 5 5

Why this works — concept by concept:

  • FILTER (WHERE …) for every conditional metric — one accumulator per metric; engine walks the table once. The CASE-WHEN alternative would produce identical results but with noisier code.
  • ROLLUP(region) — adds the grand-total row automatically. Without it, you'd need a UNION ALL of two queries or a separate "all regions" CTE — both slower and more error-prone.
  • Exact COUNT(DISTINCT) for customers, APPROX_ for sessions — match the precision budget to the metric. Customer counts often feed billing or audit reports (exact); session counts feed dashboards (approximate is fine).
  • NULLIF guard — keeps the refund-rate division safe when a region has zero orders.
  • Single-pass aggregate — the entire result set comes from one scan of orders filtered to the last 30 days. The engine builds a hash-aggregate with one slot per region plus one for the rolled-up grand total.
  • Cost — O(N) scan with O(distinct regions + HLL sketch size) memory per group. The HLL sketches are tiny (~1.5 KB each), so memory is dominated by the exact-distinct customer set.

SQL
Topic — group-by
GROUP BY problems (subtotals, ROLLUP, CUBE)

Practice →

Worked example — explicit GROUPING SETS for non-hierarchical reports

Detailed explanation. ROLLUP assumes a strict hierarchy: (year, quarter, month) rolls up to (year, quarter) then (year) then grand total. But real reports often need non-hierarchical slices — "show me totals by region, totals by channel, and a grand total, but don't show me the (region, channel) cross." That's the use case for explicit GROUPING SETS.

Question. Given orders(region, channel, amount), produce three sets of totals in a single query: revenue per region, revenue per channel, and a grand total. The output should label each row's grouping level clearly so the BI tool can render the three sections separately.

Input.

region channel amount
EU web 100
EU mobile 150
US web 200
US mobile 300
APAC web 50

Code.

SELECT
  COALESCE(region,  'ALL REGIONS')  AS region,
  COALESCE(channel, 'ALL CHANNELS') AS channel,
  CASE
    WHEN GROUPING(region) = 0 AND GROUPING(channel) = 1 THEN 'by region'
    WHEN GROUPING(region) = 1 AND GROUPING(channel) = 0 THEN 'by channel'
    ELSE 'grand total'
  END AS grouping_level,
  SUM(amount) AS revenue
FROM orders
GROUP BY GROUPING SETS (
  (region),
  (channel),
  ()
)
ORDER BY GROUPING(region), region, GROUPING(channel), channel;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. GROUPING SETS ((region), (channel), ()) tells the engine to compute three independent aggregations and UNION the results into one output. No (region, channel) cross is produced.
  2. Per grouping set, the engine fills the columns that aren't grouped on with NULL. (region) rows have channel = NULL; (channel) rows have region = NULL; the () row has both NULL.
  3. COALESCE(region, 'ALL REGIONS') rewrites the NULL placeholder into a human label. This is the standard idiom because GROUPING(col) and COALESCE are both ANSI-portable.
  4. The grouping_level column uses GROUPING(region) and GROUPING(channel) to label each row — essential when the BI tool needs to render the three sections in separate UI cards.
  5. The ORDER BY puts region-level totals first, then channel-level totals, then the grand total. Stable, deterministic order regardless of input row order.

Output.

region channel grouping_level revenue
APAC ALL CHANNELS by region 50
EU ALL CHANNELS by region 250
US ALL CHANNELS by region 500
ALL REGIONS mobile by channel 450
ALL REGIONS web by channel 350
ALL REGIONS ALL CHANNELS grand total 800

Rule of thumb. Use ROLLUP when the levels are strictly hierarchical (year → quarter → month). Use explicit GROUPING SETS when the slices are non-hierarchical (region totals + channel totals, but no cross). Use CUBE only when you genuinely want every combination — and remember it's 2^N grouping sets for N keys, so CUBE(a, b, c, d) is 16 grouping sets in one query.


Cheat sheet — aggregate idioms you should memorise

  • Conditional count. COUNT(*) FILTER (WHERE x > 0) — ANSI; portable fallback is SUM(CASE WHEN x > 0 THEN 1 ELSE 0 END).
  • Conditional sum. SUM(amount) FILTER (WHERE status = 'paid') — clearest paid-vs-refunded shape.
  • NULL-safe divide. SUM(num) * 1.0 / NULLIF(SUM(denom), 0) — never crashes on empty groups.
  • Empty-group zero. COALESCE(SUM(amount), 0) — turns NULL into 0 when the group is empty.
  • First-seen. MIN(created_at) per group. Last-touched. MAX(updated_at) per group.
  • First / last value with row context. ROW_NUMBER() OVER (PARTITION BY group ORDER BY ts) + filter WHERE rn = 1 — keeps every column of the matching row.
  • Distinct days active. COUNT(DISTINCT event_day) per user_id — natural cohort metric.
  • Distinct tuples. COUNT(DISTINCT user_id, event_day) (Postgres / Snowflake / BigQuery / Oracle); fallback COUNT(DISTINCT CONCAT(user_id, '|', event_day)) on SQL Server.
  • Approximate distinct. APPROX_COUNT_DISTINCT(col) — Snowflake / BigQuery; approx_count_distinct(col, rsd) — Spark; hll_cardinality(hll_add_agg(hll_hash_text(col))) — Postgres + hll extension.
  • Median. PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY x) — Postgres / Snowflake / Oracle.
  • Percentile. PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY latency_ms) — p95 latency in one line.
  • Mode (most-frequent). MODE() WITHIN GROUP (ORDER BY x) — Postgres; (SELECT x FROM t GROUP BY x ORDER BY COUNT(*) DESC LIMIT 1) — portable fallback.
  • Range. MAX(x) - MIN(x) — works on numerics and dates.
  • Stddev / variance. STDDEV_SAMP(x), VAR_SAMP(x) — sample stats; STDDEV_POP, VAR_POP for population stats.
  • String aggregation. STRING_AGG(name, ', ' ORDER BY name) (Postgres / SQL Server); LISTAGG(name, ', ') WITHIN GROUP (ORDER BY name) (Snowflake / Oracle); GROUP_CONCAT(name SEPARATOR ', ') (MySQL).
  • Array aggregation. ARRAY_AGG(value ORDER BY ts) (Postgres / Snowflake / BigQuery) — emit a sorted array per group.
  • Subtotals + grand total. GROUP BY ROLLUP(year, quarter) — one query, four grouping levels.
  • All combinations. GROUP BY CUBE(region, channel) — explodes to 2^N grouping sets; use sparingly.
  • Explicit groupings. GROUP BY GROUPING SETS ((region), (channel), ()) — region totals, channel totals, grand total, no cross.
  • Pivot via FILTER. SUM(amount) FILTER (WHERE region = 'EU') AS eu_revenue, SUM(amount) FILTER (WHERE region = 'US') AS us_revenue — explicit pivot without PIVOT clause.
  • Running totals. SUM(amount) OVER (PARTITION BY user_id ORDER BY day) — aggregate-as-window, no GROUP BY collapse.
  • Defensive cast for big SUMs. SUM(amount::numeric) (Postgres) or SUM(CAST(amount AS DECIMAL(38, 2))) (Snowflake / Oracle) — avoids BIGINT overflow on long-running aggregates.
  • HAVING vs WHERE. HAVING SUM(amount) > 1000 filters after the aggregate; WHERE amount > 1000 filters before. Don't mix them up.
  • Anchored zero rows. LEFT JOIN dim + COALESCE(SUM(...), 0) — empty groups show as zero, not missing.

Frequently asked questions

COUNT(*) vs COUNT(1) vs COUNT(col) — which is faster?

COUNT(*) and COUNT(1) are identical in every modern engine — Postgres, MySQL 8+, Snowflake, BigQuery, SQL Server, and Spark SQL all optimise both to the same execution plan. The "COUNT(1) is faster" myth comes from very old MySQL versions and Oracle 8 trivia that no longer applies. COUNT(col) is genuinely different: it scans the column to check for NULLs, so it is marginally slower on wide tables, and it returns a different number when NULLs are present. Use COUNT(*) when you want "rows," COUNT(col) when you want "rows with a value," and COUNT(DISTINCT col) when you want "unique values."

Does COUNT(DISTINCT) use an index?

Usually no — COUNT(DISTINCT col) requires building a hash of every distinct value, which the optimiser builds by scanning the column. A B-tree index on col can occasionally help if the engine supports an "index-only scan" of the sorted index leaves (Postgres can do this for some shapes), but the typical execution plan is a sequential scan plus a hash-aggregate. The fix when COUNT(DISTINCT) is too slow is rarely "add an index" — it's "switch to APPROX_COUNT_DISTINCT" or "pre-aggregate to a daily snapshot table."

How accurate is APPROX_COUNT_DISTINCT?

The HyperLogLog++ algorithm used by Snowflake, BigQuery, and Spark targets ~1.6% standard error with the default 14-bit register count (16,384 registers, ~1.5 KB memory). That means on a true distinct count of 1,000,000,000 the approximate answer is typically within ±16 million. The error is symmetric (it can over- or under-count) and unbiased in expectation. You can tune it tighter — Spark's approx_count_distinct(col, rsd=0.01) targets 1% error at ~4× memory — but for almost every dashboard the default is fine, and the speed-up vs exact COUNT(DISTINCT) is often 50–200×.

Can I aggregate over window functions?

Sort of — you can nest a window function inside a subquery and then aggregate over the result, but you cannot put a window function directly inside an aggregate function (SUM(ROW_NUMBER() OVER (...)) is illegal). The idiomatic pattern is SELECT SUM(rn) FROM (SELECT *, ROW_NUMBER() OVER (...) AS rn FROM t) sub. Aggregates can also be used as window functions — SUM(amount) OVER (PARTITION BY user_id ORDER BY day) produces a running total per user — but those are window-aggregates, not regular aggregates with a window inside.

AVG ignores NULLs but COUNT(*) counts them — why is the rule asymmetric?

COUNT(*) is defined as "the cardinality of the input rows" — it has nothing to do with any column's value, so NULLs don't apply. Every other aggregate is defined as "the reduction of non-NULL values in col" — they operate on the column, and SQL's three-valued logic treats NULL as "absent value," so it's excluded from the reduction. The practical implication: AVG(amount) divides by COUNT(amount), not by COUNT(*). If you want NULL rows to drag the average down (treating them as zero), wrap the column: AVG(COALESCE(amount, 0)).

GROUP BY vs PARTITION BY — when to use each?

GROUP BY collapses the input into one output row per group; PARTITION BY (inside an OVER window) keeps every input row and tags each one with a window-aggregate value. Use GROUP BY for "show me one row per region with total revenue"; use PARTITION BY for "show me every order row along with its region's total revenue next to it." They compose well: a GROUP BY query can sit inside a CTE, and a window function can run over that CTE's output. The classic interview probe is "how would you write a query that returns each order plus its rank within the region by amount?" — answer: SELECT *, ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC) FROM orders, not a GROUP BY.

Why does SUM on an empty group return NULL instead of 0?

SQL aggregates are defined as "the reduction of non-NULL values," and the reduction of zero values is undefined — SQL chooses to encode that as NULL. Only COUNT(*) and COUNT(col) return 0 on an empty input because their domain is row counts, where 0 is a legitimate cardinality. If your dashboard needs zero on empty groups, wrap with COALESCE(SUM(amount), 0). The most common production bug from this rule is dashboards that lose rows entirely: when you GROUP BY region on a fact table, regions with zero orders disappear from the result, not just their revenue. The fix is to anchor on a dimension table with LEFT JOIN and COALESCE the aggregate.

What is the difference between HAVING and WHERE in aggregates?

WHERE filters rows before the aggregate runs; HAVING filters groups after the aggregate runs. WHERE amount > 100 keeps only rows where the individual order is above 100, then aggregates. HAVING SUM(amount) > 1000 aggregates first, then keeps only groups where the total is above 1000. The two are not interchangeable — WHERE SUM(amount) > 1000 is illegal because SUM isn't defined per row. The portable rule: put scalar predicates in WHERE, aggregate predicates in HAVING. Modern engines optimise WHERE predicates by pushing them into the scan, while HAVING predicates have to wait for the aggregate to complete — so anything that can go in WHERE should.

Practice on PipeCode

Pipecode.ai is Leetcode for Data Engineering — every aggregate concept above ships with hands-on practice rooms where you write the COUNT, SUM, and ROLLUP queries against real schemas and get scored in real time. Start with the aggregation library and work outward; PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine.

Practice aggregation now →
GROUP BY drills →

Top comments (0)