sql aggregate functions are the most-tested SQL pattern in data engineering interviews because every analytics question — from "how many distinct users converted last week" to "show me revenue by region with subtotals" — ultimately reduces to a COUNT, SUM, AVG, MIN, MAX, or COUNT(DISTINCT) call inside a GROUP BY. The five core aggregates look trivial on paper, but they hide four cliffs that trip junior engineers: silent sql count NULL-handling bugs, sql count distinct memory blow-ups on large cardinalities, BIGINT overflow in long-running SUMs, and the lexicographic-vs-chronological surprise in MIN/MAX.
This guide walks through the five aggregate families that show up most often in count distinct sql questions, count of count sql patterns, and GROUP BY extensions at FAANG, Stripe, DoorDash, and analytics-heavy shops. Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works. By the end you'll be able to defend COUNT(*) FILTER (WHERE …) over SUM(CASE WHEN …), choose between exact COUNT(DISTINCT) and APPROX_COUNT_DISTINCT on a HyperLogLog sketch, and ship a single-pass ROLLUP query that produces subtotals plus a grand total — the exact answers that senior interviewers reward.
When you want hands-on reps immediately after reading, browse aggregation practice library →, drill GROUP BY and HAVING problems →, and rehearse on SQL practice for any topic →.
On this page
- Why aggregates are the most-tested SQL pattern
- COUNT, SUM, AVG anatomy and NULL handling
- MIN and MAX over strings, dates, and NULLs
- COUNT(DISTINCT) and APPROX_COUNT_DISTINCT
- Advanced aggregates — GROUPING SETS, ROLLUP, CUBE, FILTER
- Cheat sheet — aggregate idioms you should memorise
- Frequently asked questions
- Practice on PipeCode
1. Why aggregates are the most-tested SQL pattern
Aggregates are the single most-asked SQL primitive in data engineering loops
The one-sentence invariant: an aggregate function reduces a column of N rows down to one scalar value, optionally per GROUP BY key, and the entire family — COUNT, SUM, AVG, MIN, MAX, COUNT(DISTINCT), plus the advanced extensions — shares one set of NULL-handling rules and one set of grouping rules. Once you internalise that mental model — reduction first, NULLs ignored except for COUNT(*), grouping is just "apply the reduction once per key" — most sql aggregate functions interview questions become a sequence of trade-off discussions.
Why every interview opens with aggregates in five bullets.
- Coverage. Every analytics question is some shape of "count this, sum that, by these keys." Aggregates are the shortest path from raw rows to a business metric.
-
NULL discipline. Aggregates are the single best probe for whether a candidate understands SQL's three-valued logic.
COUNT(*) ≠ COUNT(col)is the most common silent bug in production reports. -
Cost intuition.
SUMis O(N) streaming,COUNT(DISTINCT)is O(distinct cardinality) memory — interviewers test whether you reach forAPPROX_COUNT_DISTINCTwhen the cardinality blows past memory. -
Composability. Aggregates compose with
GROUP BY,HAVING, window functions, and CTEs to form ~80% of analytics SQL. -
Dialect drift. AVG on integers returns NUMERIC in Postgres but DOUBLE in MySQL;
FILTER (WHERE …)is ANSI but unsupported in MySQL ≤ 8. Knowing which dialect you're in is a senior signal.
The two traps every interviewer probes.
-
NULL handling.
COUNT(*)counts every row, NULLs included.COUNT(col)ignores NULLs.SUM,AVG,MIN,MAXall ignore NULLs but returnNULLwhen the input is empty (not zero — a frequent reporting bug). -
COUNT(*) vs COUNT(col) misuse. Junior engineers default to
COUNT(*)everywhere; senior engineers reach forCOUNT(col)orCOUNT(DISTINCT col)the instant the column may be NULL or non-unique.
What interviewers listen for.
- Do you mention NULL-handling the first time
COUNTappears? — senior signal. - Do you reach for
COUNT(DISTINCT)vsCOUNT(*)based on the semantic question? — required answer. - Do you mention
APPROX_COUNT_DISTINCT+ HyperLogLog when cardinality is large? — senior signal. - Do you know
FILTER (WHERE …)as the modern replacement forSUM(CASE WHEN …)? — current-default signal. - Do you protect divisions with
NULLIF(denom, 0)and empty groups withCOALESCE(SUM(...), 0)? — production-ready signal. - Do you reach for
ROLLUPinstead ofUNION ALLof two queries for "details + grand total"? — senior signal.
The 2026 reality (SQL aggregates in modern engines).
-
ANSI
FILTER (WHERE …)is the standard conditional-aggregate idiom. Postgres has had it since 9.4, Snowflake since 2018, SQL Server since 2022, DuckDB and Spark SQL natively. MySQL ≤ 8 still requiresSUM(CASE WHEN …). -
APPROX_COUNT_DISTINCTis native in Snowflake, BigQuery, and Spark; Postgres needs thehllorpostgresql-hllextension; MySQL still has no built-in. HyperLogLog++ is the algorithm everyone implements. -
GROUPING SETS,ROLLUP,CUBEare ANSI and supported by every major warehouse. MySQL only shipsWITH ROLLUP; BigQuery shipsROLLUPandGROUPING SETSbut notCUBE. -
PIVOT/UNPIVOTclauses exist in Snowflake, SQL Server, and Oracle but are syntactic sugar aroundGROUP BY+ conditional aggregates — the portable answer is stillSUM(amount) FILTER (WHERE region = 'X').
Worked example — the one-row trap every junior misses
Detailed explanation. A classic interview probe: "Here's a 5-row table where one amount is NULL. What do COUNT(*), COUNT(amount), SUM(amount), and AVG(amount) return — and which one will surprise a junior engineer?" The probe is testing whether you remember that AVG divides by COUNT(amount), not by COUNT(*).
Question. Given the orders table below with five rows and one NULL amount, predict the exact result of each of the four aggregates, and explain why AVG and SUM/COUNT(*) diverge.
Input.
| order_id | amount |
|---|---|
| 1 | 100 |
| 2 | 200 |
| 3 | NULL |
| 4 | 300 |
| 5 | 400 |
Code.
SELECT
COUNT(*) AS rows_total,
COUNT(amount) AS rows_with_amount,
SUM(amount) AS sum_amount,
AVG(amount) AS avg_amount_correct,
SUM(amount) * 1.0 / COUNT(*) AS avg_amount_wrong
FROM orders;
Step-by-step explanation.
-
COUNT(*)walks every row regardless of value — 5 rows in, 5 out. NULL rows still count. -
COUNT(amount)walks every row but skips the ones whereamount IS NULL— 4 rows in, 4 out. -
SUM(amount)adds 100 + 200 + 300 + 400 = 1000. The NULL row is silently ignored —SUMtreats NULL as "absent value," not zero. -
AVG(amount)is defined asSUM(amount) / COUNT(amount)= 1000 / 4 = 250. Postgres returns NUMERIC; MySQL returns DOUBLE; both equal 250 here. - The "wrong" formula uses
COUNT(*)in the denominator: 1000 / 5 = 200. This is what you get if you reimplement AVG by hand and forget the NULL rule.
Output.
| rows_total | rows_with_amount | sum_amount | avg_amount_correct | avg_amount_wrong |
|---|---|---|---|---|
| 5 | 4 | 1000 | 250 | 200 |
Rule of thumb. Whenever a junior asks "why is AVG higher than SUM / number of rows?" the answer is always the same: AVG divides by COUNT(col), not COUNT(*). If you want NULL rows to drag the average down, replace them with zero first using COALESCE(amount, 0).
SQL interview question on COUNT vs COUNT(col) NULL handling
A senior interviewer often frames this as: "We migrated a Postgres table and our daily-active-users dashboard dropped 8% overnight. The query is SELECT COUNT(user_id) FROM events GROUP BY day. What's the silent bug?" The expected answer surfaces the COUNT(col) ignores NULL behaviour, then walks the fix.
Solution Using COUNT(*) for "rows" vs COUNT(col) for "non-null rows"
-- Two different questions, two different answers
SELECT
day,
COUNT(*) AS events_per_day, -- every event row
COUNT(user_id) AS events_with_user_id, -- excludes anonymous
COUNT(DISTINCT user_id) AS distinct_users -- the DAU answer
FROM events
GROUP BY day
ORDER BY day;
Step-by-step trace.
| day | row_id | user_id | event |
|---|---|---|---|
| 2026-05-01 | 1 | 42 | login |
| 2026-05-01 | 2 | NULL | anon_view |
| 2026-05-01 | 3 | 42 | click |
| 2026-05-01 | 4 | 99 | login |
| 2026-05-01 | 5 | NULL | anon_view |
Walking the aggregate one row at a time:
-
COUNT(*)increments on every row → 5. -
COUNT(user_id)skips rows 2 and 5 becauseuser_id IS NULL→ 3. -
COUNT(DISTINCT user_id)adds only previously-unseen non-null values: {42, 99} → 2. - The DAU dashboard wanted
COUNT(DISTINCT user_id)= 2 but the new query returnedCOUNT(user_id)= 3 — which counts user 42's two events as two users. - After the migration, anonymous events stopped being dropped at the ingest layer, so a flood of NULL
user_idrows arrived.COUNT(user_id)already ignored them, so it looked stable — butCOUNT(DISTINCT user_id)would have moved if the actual user set changed.
Output:
| metric | value | what it counts |
|---|---|---|
COUNT(*) |
5 | total rows (including anonymous) |
COUNT(user_id) |
3 | rows with a non-null user (still over-counts repeats) |
COUNT(DISTINCT user_id) |
2 | unique users — the DAU answer |
Why this works — concept by concept:
- COUNT(*) — counts every row in the input, ignoring nothing. The only aggregate that does not have a NULL-skip rule. Use it for "how many records exist."
-
COUNT(col) — counts rows where
col IS NOT NULL. Silently drops NULL rows. Use it for "how many records have a value." - COUNT(DISTINCT col) — counts unique non-null values. Use it for "how many distinct entities show up." This is almost always the right shape for DAU/MAU/unique-X metrics.
-
NULL semantics — SQL's three-valued logic means
NULL = NULLis itself NULL (not TRUE), which is why every aggregate exceptCOUNT(*)has to encode the "skip NULL" rule explicitly. -
Cost —
COUNT(*)andCOUNT(col)are O(N) streaming, no extra memory.COUNT(DISTINCT col)is O(distinct cardinality) memory because it must keep a hash of seen values.
SQL
Topic — aggregation
Aggregation problems (COUNT, SUM, AVG)
2. COUNT, SUM, AVG anatomy and NULL handling
sql count, SUM, and AVG share one rule — every aggregate except COUNT(*) ignores NULL
The mental model in one line: COUNT(*) counts rows; everything else counts or sums values, skipping NULLs. Once you say "NULL is silently filtered before the reduction runs," every sql count and sql count distinct interview question collapses to obvious answers — and you stop being surprised when SUM on an all-NULL column returns NULL instead of 0.
The seven rules every interview probes.
-
COUNT(*)— counts every row, NULLs included. The only aggregate that does not have a NULL-skip rule. -
COUNT(col)— counts rows wherecol IS NOT NULL. Use for "rows with a value." -
COUNT(DISTINCT col)— counts unique non-NULL values. Use for "unique entities." -
SUM(col)— adds non-NULL values; returnsNULLif every input value is NULL (not 0). -
AVG(col)— defined asSUM(col) / COUNT(col). Skips NULLs in both halves; returns NULL on empty input. -
MIN(col)/MAX(col)— return the smallest / largest non-NULL value; NULL ignored. -
Empty group rule. Every aggregate on an empty group returns NULL, except
COUNT(*)andCOUNT(col)which return 0.
The BIGINT overflow trap.
-
SUMon anINTcolumn in MySQL returns aBIGINT, butSUMon aBIGINTcolumn has no native promotion — at 9.2 × 10^18 the sum wraps or errors depending on dialect. - Postgres
SUM(bigint)promotes toNUMERICautomatically — safer, but slower for very-large datasets. - Snowflake/BigQuery use 128-bit
NUMERICarithmetic forSUMby default. Overflow is essentially never seen in practice. - The interview answer: cast to a wider type before summing if you're not sure:
SUM(amount::numeric)(Postgres) orSUM(CAST(amount AS DECIMAL(38,2)))(Snowflake/Oracle/SQL Server).
The AVG dialect drift.
-
AVG(int_col)in Postgres returnsNUMERIC— exact, slower, often surprises beginners (AVG(int) = 2.50000000000000000000). -
AVG(int_col)in MySQL returnsDOUBLE— fast, lossy, can show2.5or2.4999999999999...depending on values. -
AVG(int_col)in Snowflake/BigQuery returnsFLOAT(BigQuery) orNUMBER(38, 6)(Snowflake). -
Defensive coding.
AVG(amount::numeric)orSUM(amount) * 1.0 / NULLIF(COUNT(amount), 0)if you need explicit control over the result type.
The SUM(NULL) vs SUM(all NULLs) rule.
-
SUM(NULL)(literal NULL passed in) →NULL. -
SUM(col)where every row ofcolis NULL →NULL(not 0). -
SUM(col)where some rows are NULL → sum of the non-NULL rows. NULL is treated as "absent value." - If your report needs zero on empty input, wrap with
COALESCE(SUM(col), 0).
Worked example — daily revenue with conditional aggregates
Detailed explanation. A classic analytics problem: compute daily revenue, plus a separate "paid revenue" column that only counts orders where status = 'paid'. The naive answer is SUM(CASE WHEN status = 'paid' THEN amount ELSE 0 END); the modern answer is SUM(amount) FILTER (WHERE status = 'paid').
Question. Given a 6-row orders table with three statuses (paid, pending, refunded), write a single query that returns total revenue, paid revenue, and refunded revenue per day. Use ANSI FILTER if available; show the portable CASE WHEN form for MySQL/Snowflake legacy code.
Input.
| order_id | order_day | amount | status |
|---|---|---|---|
| 1 | 2026-05-01 | 100 | paid |
| 2 | 2026-05-01 | 200 | paid |
| 3 | 2026-05-01 | 50 | refunded |
| 4 | 2026-05-02 | 300 | paid |
| 5 | 2026-05-02 | NULL | pending |
| 6 | 2026-05-02 | 400 | paid |
Code.
-- Postgres / Snowflake / SQL Server 2022+: ANSI FILTER
SELECT
order_day,
SUM(amount) AS total_revenue,
SUM(amount) FILTER (WHERE status='paid') AS paid_revenue,
SUM(amount) FILTER (WHERE status='refunded') AS refunded_revenue,
COUNT(*) FILTER (WHERE status='paid') AS paid_orders
FROM orders
GROUP BY order_day
ORDER BY order_day;
-- Portable form (works on MySQL ≤ 8, older Snowflake, all engines)
SELECT
order_day,
SUM(amount) AS total_revenue,
SUM(CASE WHEN status='paid' THEN amount END) AS paid_revenue,
SUM(CASE WHEN status='refunded' THEN amount END) AS refunded_revenue,
SUM(CASE WHEN status='paid' THEN 1 ELSE 0 END) AS paid_orders
FROM orders
GROUP BY order_day
ORDER BY order_day;
Step-by-step explanation.
-
SUM(amount)walks every row in the group and adds non-NULL amounts: day 1 → 100+200+50 = 350; day 2 → 300+400 = 700 (NULL row skipped). -
SUM(amount) FILTER (WHERE status='paid')evaluates the predicate per row; only rows where the predicate is TRUE contribute. Day 1 → 100+200 = 300; day 2 → 300+400 = 700. - The portable
CASE WHENform returnsNULLfor non-matching rows;SUMthen skips them — same answer asFILTER. -
COUNT(*) FILTER (WHERE status='paid')is equivalent toSUM(CASE WHEN status='paid' THEN 1 ELSE 0 END)but reads more clearly and runs slightly faster (no intermediate1/0materialisation). - The pending order on day 2 has NULL amount — it contributes to
COUNT(*)but not to anySUM. If the dashboard says "5 orders, $400 revenue" the missing $400 is a NULL in the data, not a bug in the query.
Output.
| order_day | total_revenue | paid_revenue | refunded_revenue | paid_orders |
|---|---|---|---|---|
| 2026-05-01 | 350 | 300 | 50 | 2 |
| 2026-05-02 | 700 | 700 | NULL | 2 |
Rule of thumb. Reach for FILTER (WHERE …) whenever your engine supports it — it's clearer, faster, and explicitly ANSI. Fall back to SUM(CASE WHEN …) only on MySQL or legacy code. The two are semantically identical but FILTER reads as "aggregate this subset" instead of "sum a conditional value."
SQL interview question on conditional aggregation
A staff-level interviewer might frame this as: "Write one query that returns, per region, total revenue, the percentage of orders that were refunded, and the average order value of paid orders only." It tests FILTER (or CASE WHEN), NULL-safe division, and grouping in a single pass.
Solution Using FILTER (WHERE …) with NULLIF for safe division
SELECT
region,
SUM(amount) AS total_revenue,
COUNT(*) AS total_orders,
COUNT(*) FILTER (WHERE status = 'refunded') * 1.0
/ NULLIF(COUNT(*), 0) AS refund_rate,
AVG(amount) FILTER (WHERE status = 'paid') AS avg_paid_order_value
FROM orders
GROUP BY region
ORDER BY region;
Step-by-step trace.
| region | order_id | amount | status |
|---|---|---|---|
| EU | 1 | 100 | paid |
| EU | 2 | 200 | paid |
| EU | 3 | 50 | refunded |
| US | 4 | 300 | paid |
| US | 5 | 400 | paid |
| US | 6 | 100 | refunded |
| US | 7 | NULL | pending |
Walking the aggregate per region:
-
EU. Three orders.
SUM(amount)= 350.COUNT(*)= 3. Refunded count = 1; refund_rate = 1/3 ≈ 0.333.AVG(amount) FILTER (WHERE status='paid')averages {100, 200} = 150. -
US. Four orders.
SUM(amount)= 300+400+100 = 800 (NULL skipped).COUNT(*)= 4. Refunded count = 1; refund_rate = 1/4 = 0.25. Paid AVG averages {300, 400} = 350. -
NULLIF(COUNT(*), 0)is the divide-by-zero guard — ifCOUNT(*) = 0,NULLIFreturns NULL and the whole division returns NULL instead of erroring. Critical for empty-group safety. - The pending US row contributes to
COUNT(*)(4 orders) and to the refund-rate denominator, but does not contribute to any SUM or AVG because its amount is NULL.
Output:
| region | total_revenue | total_orders | refund_rate | avg_paid_order_value |
|---|---|---|---|---|
| EU | 350 | 3 | 0.333 | 150.0 |
| US | 800 | 4 | 0.250 | 350.0 |
Why this works — concept by concept:
-
FILTER (WHERE …) — applies the predicate before the reduction.
COUNT(*) FILTER (WHERE x)is the cleanest way to write a conditional count;SUM(amount) FILTER (WHERE x)the cleanest conditional sum. Available in Postgres, Snowflake, SQL Server 2022+, DuckDB, Spark SQL. -
NULLIF(denom, 0) — the standard divide-by-zero guard.
NULLIF(x, y)returns NULL whenx = y, otherwisex. SQL propagates NULL through arithmetic, so the whole expression yields NULL — which most reporting layers render as "no data" rather than erroring. -
Multiplying by
1.0— coerces integer division to floating-point. Without it,1 / 3in Postgres returns0(integer division). The cleanest portable trick. -
AVG ignores NULL automatically —
AVG(amount) FILTER (WHERE status='paid')averages only the paid rows whose amount is non-NULL. You don't need an extraIS NOT NULLguard. -
Single-pass GROUP BY — the engine walks the table once per region, accumulating all four metrics into a tiny state. No subqueries, no
JOIN, noWITH. The most efficient shape for analytics. -
Cost — O(N) scan with O(distinct regions) memory. Index on
regionlets the optimiser do a sort-aggregate; otherwise a hash-aggregate with one slot per region.
SQL
Topic — conditional aggregation
Conditional aggregation problems (FILTER, CASE WHEN)
Worked example — SUM and the empty-group surprise
Detailed explanation. A subtle bug that ships to production is SUM(amount) returning NULL on an empty group when the dashboard expected 0. The fix is COALESCE(SUM(amount), 0) at the SELECT level, but only if you understand why SUM of nothing is NULL, not zero. The other half of the bug is that GROUP BY region on a fact table only returns regions that appear in the fact table — empty regions silently drop out of the report.
Question. Compute the total revenue per region for the last 7 days. Some regions had zero orders this week. Write the query so empty regions show 0, not NULL — and so empty regions still appear as a row in the output.
Input.
| region | order_id | amount | order_day |
|---|---|---|---|
| EU | 1 | 100 | 2026-05-25 |
| EU | 2 | 200 | 2026-05-26 |
| US | 3 | 300 | 2026-05-27 |
| APAC | (no orders this week) | — | — |
Code.
-- Broken: APAC row is dropped entirely because no orders exist
SELECT region, SUM(amount) AS revenue
FROM orders
WHERE order_day >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY region;
-- Fixed: anchor on a regions dimension, LEFT JOIN to orders, COALESCE the SUM
SELECT
r.region,
COALESCE(SUM(o.amount), 0) AS revenue
FROM regions r
LEFT JOIN orders o
ON o.region = r.region
AND o.order_day >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY r.region
ORDER BY r.region;
Step-by-step explanation.
- The broken query scans
orders, groups byregion, and produces one output row per region that appears in the orders table. APAC never appears, so APAC is silently missing. - The fixed query anchors on the
regionsdimension table — every region exists there exactly once. - The
LEFT JOINcarries every region through even when no matching order exists. For APAC,o.amount IS NULLfor every joined row. -
SUM(NULL)returns NULL on an empty group;COALESCE(SUM(o.amount), 0)substitutes 0 for the NULL result. - The predicate
o.order_day >= CURRENT_DATE - INTERVAL '7 days'lives inside the JOIN clause, not the WHERE clause. If it lived in WHERE, it would filter out the NULL-extended row for APAC and turn the LEFT JOIN back into an INNER JOIN — a classic anti-pattern.
Output (fixed).
| region | revenue |
|---|---|
| APAC | 0 |
| EU | 300 |
| US | 300 |
Rule of thumb. Whenever a dashboard's zero-revenue regions go missing, the bug is almost always a GROUP BY on a fact table without anchoring on a dimension table. Use LEFT JOIN dim ... ON ... plus COALESCE(SUM(...), 0) and the report will show zeroes instead of dropping rows.
3. MIN and MAX over strings, dates, and NULLs
MIN/MAX never coerce types — feed them comparable values, get comparable results
The mental model in one line: MIN and MAX apply the column's natural ordering — lexicographic for strings, chronological for dates, numeric for numbers — and ignore NULL the same way every other aggregate does. The three edge cases interviewers probe are string collation surprises, the "matching row" trap (MIN(price) doesn't pair with the row's product_name for free), and NULL semantics (NULL is not "smallest" or "largest" — it's just absent).
The five rules every MIN/MAX interview probes.
-
Strings. Lexicographic order:
'Alice' < 'Bob' < 'Charlie'. Collation determines whether'a' < 'A'or'a' > 'A'. -
Dates / timestamps. Chronological order.
MIN(date)is the earliest;MAX(date)is the latest. Works onDATE,TIMESTAMP,TIMESTAMPTZ. -
Numerics. Standard
<and>. No surprises. -
NULLs. Skipped.
MIN({10, NULL, 50, 30})= 10, not NULL. If every value is NULL,MINreturns NULL. -
Mixed types. SQL refuses to compare incompatible types —
MINon a column with mixedINTandVARCHARvalues requires an explicit cast.
Idiomatic uses every analytics SQL writer should know.
-
MIN(created_at) → first-seen. First time a user appeared, first order in a region, first event of a session. -
MAX(updated_at) → last-touched. Most recent update — pairs withWHERE updated_at >= last_runfor incremental loads. -
MIN(login_date)peruser_id= cohort assignment. -
MAX(price) - MIN(price)perproduct= price range —MIN/MAXcompose to range. -
MIN(score)/MAX(score)withGROUP BYgive per-group bookends; for "the row with the min," see the matching-row trap below.
Collation is the trap that bites cross-region migrations.
- Postgres default
LC_COLLATE = en_US.UTF-8:'a' < 'B'(case-insensitive sort puts lowercase before uppercase in some collations). - MySQL
utf8mb4_general_ci: case-insensitive —'Alice' = 'alice'for=and the same row order. - Snowflake: binary-collation by default —
'B' < 'a'because uppercase ASCII codepoints are smaller than lowercase. - The fix: explicit
COLLATEclause, or normalise to a single case withLOWER(col)before the aggregate.
The "matching row" trap (most asked variant).
-
Wrong.
SELECT MIN(price), product_name FROM products— engines that allow this (MySQL withONLY_FULL_GROUP_BY=off) silently pair the minimum price with an arbitrary product name from the group. -
Right (Postgres / Snowflake).
SELECT product_name FROM (SELECT product_name, price, ROW_NUMBER() OVER (ORDER BY price) AS rn FROM products) WHERE rn = 1. -
Right (Oracle).
SELECT MIN(price) KEEP (DENSE_RANK FIRST ORDER BY price), MAX(product_name) KEEP (DENSE_RANK FIRST ORDER BY price) FROM products. -
Right (any dialect).
WITH ranked AS (SELECT *, ROW_NUMBER() OVER (ORDER BY price) rn FROM products) SELECT * FROM ranked WHERE rn = 1.
Worked example — first-seen and last-seen per user
Detailed explanation. A canonical analytics question: for each user, return the date of their first event and the date of their most recent event. The shape is MIN(event_date) + MAX(event_date) per user — a single GROUP BY query, no joins, no window functions needed.
Question. Given an events table with user_id and event_date, write a query that returns one row per user with their first-seen date, last-seen date, and the number of distinct active days. Order the output by last-seen descending.
Input.
| event_id | user_id | event_date |
|---|---|---|
| 1 | 42 | 2026-04-01 |
| 2 | 42 | 2026-04-15 |
| 3 | 42 | 2026-05-20 |
| 4 | 99 | 2026-04-10 |
| 5 | 99 | 2026-04-10 |
| 6 | 99 | 2026-05-25 |
| 7 | 7 | 2026-05-01 |
Code.
SELECT
user_id,
MIN(event_date) AS first_seen,
MAX(event_date) AS last_seen,
COUNT(DISTINCT event_date) AS active_days,
MAX(event_date) - MIN(event_date) AS span_days
FROM events
GROUP BY user_id
ORDER BY last_seen DESC;
Step-by-step explanation.
- The engine groups by
user_id. User 42 has 3 rows; user 99 has 3 rows; user 7 has 1 row. - Per group,
MIN(event_date)returns the chronologically earliest date;MAX(event_date)returns the latest. -
COUNT(DISTINCT event_date)walks the group, hashing each date and skipping duplicates. User 99 has two events on2026-04-10and one on2026-05-25— distinct count = 2. -
MAX(event_date) - MIN(event_date)is date arithmetic. In Postgres this returns aninterval(orintdays); in Snowflake / BigQuery you'd useDATEDIFF(day, MIN(...), MAX(...)). -
ORDER BY last_seen DESCsorts the result rows; this happens after the grouping completes, so it sees one row per user.
Output.
| user_id | first_seen | last_seen | active_days | span_days |
|---|---|---|---|---|
| 99 | 2026-04-10 | 2026-05-25 | 2 | 45 |
| 42 | 2026-04-01 | 2026-05-20 | 3 | 49 |
| 7 | 2026-05-01 | 2026-05-01 | 1 | 0 |
Rule of thumb. Whenever a question reads "first-seen," "last-seen," "first time a user did X," or "most recent update," the answer is MIN(timestamp) or MAX(timestamp) per group. If you also need the matching row's other columns (the product name when price is min, the country when login is first), reach for ROW_NUMBER() OVER (ORDER BY ...) — MIN alone can't carry sibling columns.
SQL interview question on the matching-row trap
A staff-level interviewer might frame this as: "For each product category, return the product with the lowest price, along with its name and SKU. Don't pair the wrong product name with the minimum price." It tests whether the candidate reaches for ROW_NUMBER() instead of a naive MIN(price) + MIN(name) chain.
Solution Using ROW_NUMBER() OVER (PARTITION BY … ORDER BY …)
WITH ranked AS (
SELECT
category,
product_name,
sku,
price,
ROW_NUMBER() OVER (
PARTITION BY category
ORDER BY price ASC, sku ASC -- deterministic tie-break
) AS rn
FROM products
WHERE price IS NOT NULL
)
SELECT category, product_name, sku, price
FROM ranked
WHERE rn = 1
ORDER BY category;
Step-by-step trace.
| category | product_name | sku | price |
|---|---|---|---|
| Electronics | Phone X | E-101 | 699 |
| Electronics | Phone Y | E-104 | 499 |
| Electronics | Phone Z | E-110 | 499 |
| Books | SQL Deep Dive | B-2001 | 39 |
| Books | Algorithms 4e | B-2010 | 79 |
Walking the window:
- The
ROW_NUMBER()window is partitioned bycategoryand ordered byprice ASC. Within each category, the lowest-price row getsrn = 1. -
Electronics. Two products tie at price 499. The tie-break
ORDER BY price ASC, sku ASCmakesE-104win because'E-104' < 'E-110'lexicographically.rnvalues: E-104 → 1, E-110 → 2, E-101 → 3. -
Books. No ties.
B-2001(price 39) →rn = 1.B-2010(price 79) →rn = 2. - The outer filter keeps only rows where
rn = 1→ one row per category. - The
WHERE price IS NOT NULLguard inside the CTE prevents NULL prices from being silently sorted to the bottom (Postgres) or top (Snowflake) of the order.
Output:
| category | product_name | sku | price |
|---|---|---|---|
| Books | SQL Deep Dive | B-2001 | 39 |
| Electronics | Phone Y | E-104 | 499 |
Why this works — concept by concept:
-
ROW_NUMBER() vs MIN() —
MIN(price)reduces N rows to one scalar; you lose the matching row's other columns.ROW_NUMBER()tags every row with a rank, so you can filter torn = 1and keep every column from the winning row. -
PARTITION BY for per-group —
PARTITION BY categoryresets the row-number counter at each new category. Equivalent to "do this min-finding once per group." -
Deterministic tie-break —
ORDER BY price ASC, sku ASCensures the same row wins every time the query runs. Without the secondary key, ties are broken arbitrarily, which makes debugging downstream reports painful. - NULL guard at the CTE — explicitly excluding NULL prices keeps the row-number ordering deterministic and avoids "the cheapest product is NULL" surprises.
-
Alternative: KEEP DENSE_RANK FIRST (Oracle) —
MIN(product_name) KEEP (DENSE_RANK FIRST ORDER BY price)is a one-line idiom in Oracle and DB2 that does the same job. Not portable, but very clean if you're locked into Oracle. -
Cost — one window pass plus the filter; O(N log N) due to the sort. If the data is already clustered by
(category, price), the optimiser does a merge instead of a sort.
SQL
Topic — window functions
Window function problems (ROW_NUMBER, RANK)
Worked example — MIN and MAX over string columns with collation
Detailed explanation. A subtle interview probe is "what does MAX(username) return on a mixed-case column?" — because the answer depends on collation. In Postgres with en_US.UTF-8, 'alice' > 'Bob' (lowercase is "larger" in some collations). In Snowflake's default binary collation, 'A' < 'a' because uppercase ASCII codepoints are smaller. The interview test isn't whether you've memorised collations — it's whether you reach for LOWER(col) defensively when the answer matters.
Question. Given a users table with mixed-case username values, predict the result of MAX(username) in three engines: Postgres default, MySQL utf8mb4_general_ci, and Snowflake default. Then write a portable query that always returns the same answer.
Input.
| user_id | username |
|---|---|
| 1 | Alice |
| 2 | bob |
| 3 | Charlie |
| 4 | alice |
| 5 | BOB |
Code.
-- Native MAX (engine-dependent result)
SELECT MAX(username) AS max_user FROM users;
-- Portable: normalise to a single case first
SELECT MAX(LOWER(username)) AS max_user_normalised FROM users;
-- Explicit collation (Postgres / SQL Server)
SELECT MAX(username COLLATE "C") AS max_user_binary FROM users; -- Postgres
-- SELECT MAX(username COLLATE Latin1_General_BIN) FROM users; -- SQL Server
Step-by-step explanation.
-
Postgres
en_US.UTF-8. Locale-aware sort:'Alice' < 'alice' < 'bob' < 'BOB' < 'Charlie' < 'charlie'(collation considers locale-specific rules).MAXreturns the largest value in this ordering — typically'charlie'or'Charlie'depending on the precise locale rules. -
MySQL
utf8mb4_general_ci. Case-insensitive:'Alice' = 'alice'and'Bob' = 'bob' = 'BOB'. The engine still has to break ties forMAX; it picks an arbitrary case-variant. Result: one of'Charlie','charlie', or'CHARLIE'. -
Snowflake binary collation. Strict ASCII order: uppercase (codepoints 65–90) sorts before lowercase (codepoints 97–122). So
'BOB' < 'Charlie' < 'alice' < 'bob'.MAXreturns'bob'. -
MAX(LOWER(username))normalises every value to lowercase before the aggregate. Every engine agrees: the result is'charlie'. The downside is you lose the original casing of the matching row. -
COLLATE "C"(Postgres) orLatin1_General_BIN(SQL Server) forces binary comparison. Every engine that supports the clause now agrees with Snowflake's default behaviour.
Output (engine-by-engine).
| Engine | MAX(username) |
MAX(LOWER(username)) |
|---|---|---|
| Postgres (en_US.UTF-8) |
charlie (locale-aware) |
charlie |
| MySQL (utf8mb4_general_ci) | any of Charlie / charlie / CHARLIE
|
charlie |
| Snowflake (binary) | bob |
charlie |
Rule of thumb. When the answer to MAX(string_col) matters for business logic, normalise with LOWER(col) (or UPPER(col)) before the aggregate. If you need the original casing back, wrap with ROW_NUMBER() OVER (ORDER BY LOWER(username) DESC) and filter to rn = 1 — that returns the matching row in its original case.
4. COUNT(DISTINCT) and APPROX_COUNT_DISTINCT
count distinct sql is the most expensive aggregate — pick exact or approximate based on your latency budget
The mental model in one line: COUNT(DISTINCT col) must remember every distinct value it has seen, so its memory is O(distinct cardinality); APPROX_COUNT_DISTINCT(col) keeps a fixed-size HyperLogLog sketch and trades ~1.6% error for ~100× speed-up. Once you internalise that trade-off — exact vs approximate, memory vs error — every sql count distinct interview question collapses to a single question: "How much error can you tolerate?"
Exact COUNT(DISTINCT col) in five bullets.
- What it does. Keeps a hash set of every distinct value seen; returns the set's size at the end.
- Memory cost. O(distinct cardinality). 1 billion distinct strings ≈ 30–80 GB of RAM depending on string length.
- CPU cost. O(N) walks + O(N) hashes; the hash set itself amortises O(1) per insert.
- When it spills. When the distinct set exceeds available memory, the engine spills to disk. Query time goes from seconds to minutes.
- When to use it. When the result feeds a financial report, a compliance metric, or any case where "exact" is non-negotiable.
Approximate APPROX_COUNT_DISTINCT(col) in five bullets.
- What it does. Keeps a HyperLogLog sketch (a small array of "register" buckets, each storing a leading-zero count of hashed values).
- Memory cost. O(log log n) — typically ~1.5 KB for the standard 14-bit HLL++ register count, regardless of dataset size.
- CPU cost. O(N) walks + O(N) hashes; no hash-set growth, no spill.
- Error bound. ~1.6% standard error with the default register count (Snowflake, BigQuery, Spark). Configurable down to ~0.5% with more registers.
- When to use it. Dashboards, MAU/DAU metrics, ad analytics, anything where "approximate-within-2%" is acceptable and you need sub-second responses on billion-row tables.
Dialect coverage.
-
Snowflake.
APPROX_COUNT_DISTINCT(col)native. ~1.6% error. AlsoHLL_ESTIMATE,HLL_COMBINEfor sketch composition. -
BigQuery.
APPROX_COUNT_DISTINCT(col)native. Same HLL++ implementation. AlsoHLL_COUNT.INIT,HLL_COUNT.EXTRACT,HLL_COUNT.MERGEfor sketch composition across daily partitions. -
Spark SQL / Databricks.
approx_count_distinct(col, rsd)wherersdis the target relative standard deviation. -
Postgres. No native function — install the
hllextension orpostgresql-hll. -
MySQL. No native function — fall back to exact
COUNT(DISTINCT).
Multi-column COUNT(DISTINCT a, b).
- ANSI SQL:
COUNT(DISTINCT a, b)— counts distinct tuples of(a, b). Supported in Postgres, Snowflake, BigQuery, Oracle. - MySQL:
COUNT(DISTINCT a, b)supported. - SQL Server: not supported — use
COUNT(DISTINCT CONCAT(a, '|', b))orCOUNT(DISTINCT CAST(a AS VARCHAR) + '|' + CAST(b AS VARCHAR)). - Beware the NULL rule: if either
aorbis NULL, the tuple is excluded.
HyperLogLog++ in one paragraph.
- The HLL data structure is an array of
m = 2^p"register" buckets (p = 14is the common default → 16,384 registers, ~1.5 KB total). - Each input value is hashed to a 64-bit integer. The first
pbits select a register; the rank of the leading-zero run in the remaining bits is stored in that register (running max). - The estimator inverts the register state to produce a cardinality estimate. Standard error ≈
1.04 / sqrt(m)≈ 0.81% for the default; engines often report ~1.6% to leave headroom for highly-skewed inputs. - Sketches compose —
MERGE(sketch_day_1, sketch_day_2)produces a sketch for the union. Critical for incremental dashboards that pre-aggregate per day and merge across windows. - Snowflake and BigQuery expose this composition via
HLL_COMBINE/HLL_COUNT.MERGE. Postgres +hllextension exposeshll_union_agg.
Worked example — distinct values vs DISTINCT in SELECT
Detailed explanation. Junior engineers often confuse three shapes that look similar but answer different questions: SELECT DISTINCT col FROM t, SELECT COUNT(DISTINCT col) FROM t, and SELECT col, COUNT(*) FROM t GROUP BY col. Each has its place; mixing them up produces silent bugs.
Question. Given an events table with 100 rows and user_id values that include duplicates and NULLs, predict the row count of each of these three queries and explain when you'd use which.
Input (summary).
| Property | Value |
|---|---|
| Total rows | 100 |
Distinct non-NULL user_id
|
30 |
Rows with user_id IS NULL
|
5 |
Code.
-- Query A — every distinct value (one row per user, NULL collapsed to one row)
SELECT DISTINCT user_id FROM events;
-- Query B — scalar count of distinct non-NULL values
SELECT COUNT(DISTINCT user_id) FROM events;
-- Query C — distinct values plus how many times each appears
SELECT user_id, COUNT(*) AS row_count
FROM events
GROUP BY user_id
ORDER BY row_count DESC;
Step-by-step explanation.
-
Query A. Returns 31 rows: 30 distinct non-NULL
user_idvalues + 1 row for NULL.SELECT DISTINCTtreats NULL as a distinct value and emits it once (unlikeCOUNT(DISTINCT)). -
Query B. Returns 1 row with a single scalar = 30. NULL is excluded because
COUNT(DISTINCT)ignores NULLs just like every other aggregate. -
Query C. Returns 31 rows (same count as A) but each row carries the per-value frequency. NULL appears once with
row_count = 5. This is the most common shape for "show me the distribution of user_id values." - The three queries scan the same table and use the same hash-deduplication machinery internally; they just expose the result differently.
- The interview probe: "If
DISTINCTkeeps NULL butCOUNT(DISTINCT)drops it, why?" — becauseDISTINCTis a row-level operation (NULL is a value like any other for row deduplication) whileCOUNT(DISTINCT)is an aggregate (every aggregate exceptCOUNT(*)ignores NULLs).
Output (row counts).
| Query | Row count | Includes NULL? |
|---|---|---|
SELECT DISTINCT user_id |
31 | yes (one NULL row) |
SELECT COUNT(DISTINCT user_id) |
1 row, value 30 | no (NULL excluded) |
SELECT user_id, COUNT(*) GROUP BY user_id |
31 | yes (NULL row with row_count = 5) |
Rule of thumb. Use SELECT DISTINCT col when you want the list. Use COUNT(DISTINCT col) when you want the scalar count of unique non-NULL values. Use GROUP BY col when you want the list and the per-value count (a frequency table). All three use the same de-duplication machinery — they're three views into the same operation.
Worked example — daily active users at 10B-row scale
Detailed explanation. A common analytics-engineering problem: compute DAU on an events table with 10 billion rows and ~1 billion distinct users per day. Exact COUNT(DISTINCT user_id) runs for minutes and may spill; APPROX_COUNT_DISTINCT runs in seconds with ~1.6% error.
Question. Compare the cost and accuracy of exact vs approximate distinct counts on a 10-billion-row events table with ~1 billion distinct users. Show the Snowflake query, the BigQuery query, and the Spark SQL query.
Input (table profile).
| Attribute | Value |
|---|---|
| Table | events |
| Total rows | 10,000,000,000 |
| Distinct users | ~1,000,000,000 |
| Avg user_id length | 16 bytes |
Code.
-- Snowflake / BigQuery (HLL++ ~1.6% error)
SELECT
event_day,
COUNT(DISTINCT user_id) AS dau_exact,
APPROX_COUNT_DISTINCT(user_id) AS dau_approx
FROM events
GROUP BY event_day
ORDER BY event_day;
-- Spark SQL — tighter error bound (~1.0%)
SELECT
event_day,
COUNT(DISTINCT user_id) AS dau_exact,
approx_count_distinct(user_id, 0.01) AS dau_approx_tight
FROM events
GROUP BY event_day;
-- Postgres with the hll extension
SELECT
event_day,
COUNT(DISTINCT user_id) AS dau_exact,
hll_cardinality(
hll_add_agg(hll_hash_text(user_id))
) AS dau_approx
FROM events
GROUP BY event_day;
Step-by-step explanation.
- Exact
COUNT(DISTINCT user_id)perevent_dayrequires the engine to keep a hash of every distinctuser_idseen so far in that group. For 1B distinct users × 16 bytes ≈ 16 GB hash + overhead → spills on most warehouses. -
APPROX_COUNT_DISTINCT(user_id)keeps a HyperLogLog sketch of ~1.5 KB per group regardless of cardinality. Memory is constant; CPU is one hash per row plus a tiny register update. - Each row is hashed; the hash's leading-zero count selects a register; the register stores the running max of leading-zero counts. The HLL estimator inverts the register state to produce an approximate cardinality.
- The HLL++ implementation in Snowflake/BigQuery uses 14 bits of the hash for register selection (2^14 = 16,384 registers); standard error ≈ 1.04 / sqrt(16,384) ≈ 0.81% (Snowflake reports ~1.6% to leave headroom for skewed inputs).
- The Spark
approx_count_distinct(col, rsd=0.01)form lets you target 1% error explicitly by increasing the register count to ~64K — about 4× the memory of the default.
Output (illustrative timing on a typical Snowflake X-Large warehouse).
| Metric | COUNT(DISTINCT user_id) |
APPROX_COUNT_DISTINCT(user_id) |
|---|---|---|
| Walltime | ~120 s (with disk spill) | ~8 s |
| Memory | ~16 GB hash set | ~1.5 KB sketch |
| Error | exact (0%) | ~1.6% standard |
| Cost (credits) | high | ~5–10× cheaper |
| Result on 1B users | 1,000,000,000 | 1,016,432,118 (≈ +1.6%) |
Rule of thumb. Default to APPROX_COUNT_DISTINCT for every dashboard, BI tile, and analytics-engineering metric. Fall back to exact COUNT(DISTINCT) only for financial reporting, fraud queries, or anywhere the result is shown to an external auditor.
SQL interview question on multi-column distinct counts
A senior interviewer might frame this as: "Count distinct (user_id, day) pairs in the last 30 days — i.e. how many user-active-day combinations did we record? — and then count the distinct days each user was active." It tests the COUNT(DISTINCT a, b) tuple form, NULL handling, and the count-of-count pattern (count of count sql) at the same time.
Solution Using COUNT(DISTINCT user_id, event_day) plus a count-of-count CTE
WITH user_day AS (
SELECT
user_id,
event_day,
COUNT(*) AS events_in_day
FROM events
WHERE event_day >= CURRENT_DATE - INTERVAL '30 days'
AND user_id IS NOT NULL
GROUP BY user_id, event_day
),
per_user AS (
SELECT
user_id,
COUNT(*) AS active_days,
SUM(events_in_day) AS total_events
FROM user_day
GROUP BY user_id
)
SELECT
COUNT(*) AS total_users,
COUNT(DISTINCT user_id) AS distinct_users, -- == total_users here
SUM(active_days) AS user_active_day_pairs,
COUNT(*) FILTER (WHERE active_days = 1) AS users_active_1_day,
COUNT(*) FILTER (WHERE active_days >= 7) AS users_active_7plus_days
FROM per_user;
Step-by-step trace.
Walking the layered CTE on a tiny 30-day window with three users:
| user_id | event_day | events_in_day |
|---|---|---|
| 42 | 2026-05-01 | 5 |
| 42 | 2026-05-02 | 3 |
| 42 | 2026-05-15 | 7 |
| 99 | 2026-05-10 | 2 |
| 7 | 2026-05-01 | 1 |
| 7 | 2026-05-02 | 1 |
| 7 | 2026-05-03 | 1 |
| 7 | 2026-05-04 | 1 |
| 7 | 2026-05-05 | 1 |
| 7 | 2026-05-06 | 1 |
| 7 | 2026-05-07 | 1 |
| 7 | 2026-05-08 | 1 |
-
user_dayCTE. Groups events by(user_id, event_day). Output is one row per (user, day) tuple with the per-day event count. -
per_userCTE. Groupsuser_daybyuser_id.COUNT(*)here equalsactive_days— distinct days the user was active. -
Final SELECT. Now operating on
per_user(one row per user) — this is the count-of-count layer. -
SUM(active_days)overper_userreproducesCOUNT(DISTINCT user_id, event_day)on the original table — 12 user-active-day pairs. -
COUNT(*) FILTER (WHERE active_days = 1)counts users active on exactly one day;FILTER (WHERE active_days >= 7)counts users active on 7+ days (a "power-user" cohort). - The
WHERE user_id IS NOT NULLguard prevents anonymous event rows from polluting the counts.
Output:
| total_users | distinct_users | user_active_day_pairs | users_active_1_day | users_active_7plus_days |
|---|---|---|---|---|
| 3 | 3 | 12 | 1 | 1 |
Why this works — concept by concept:
-
COUNT(DISTINCT a, b) — counts distinct tuples. Equivalent to
SUM(COUNT(*))overGROUP BY a, b— which is exactly what the layered CTE computes. -
Count-of-count pattern —
GROUP BY user_id, dayonce (inner), thenGROUP BY user_idagain (outer). The outer group'sCOUNT(*)answers "how many distinct days per user." Apply once more and you get cohort buckets ("how many users have N active days") — thecount of count sqlshape interviewers love. -
Layered CTE for readability — splitting
user_dayandper_usermakes the query easier to debug and lets you EXPLAIN each layer in isolation. Most query optimisers will fuse the layers automatically. -
FILTER (WHERE …) — replaces
SUM(CASE WHEN active_days = 1 THEN 1 ELSE 0 END). Cleaner, ANSI-standard, slightly faster. -
NULL guard at the source —
WHERE user_id IS NOT NULLensures anonymous events don't contaminate the distinct-pair count. The tuple(NULL, day)would already be excluded byCOUNT(DISTINCT), but being explicit makes intent clear. -
Cost — two hash-aggregates: O(N) → O(distinct (user, day) pairs) → O(distinct users). On 10B rows with 1B distinct users, the inner aggregate dominates. Replace
COUNT(DISTINCT)withAPPROX_COUNT_DISTINCTif the dashboard tolerates 1.6% error.
SQL
Topic — distinct
DISTINCT and COUNT(DISTINCT) problems
Worked example — count of count (frequency histogram)
Detailed explanation. A common analytics-engineering question is "how many users had exactly 1, 2, 3, … N orders this month?" — a histogram of the per-user order count. The shape is a classic count of count sql pattern: aggregate once to get the per-user count, then aggregate again to bucket the counts. Interviewers love it because the naive answer ("SELECT COUNT(orders) GROUP BY user_id, then GROUP BY count") only works once you understand that aggregates can be re-aggregated.
Question. Given an orders table with customer_id and order_id, produce a frequency histogram: one row per "number of orders this month" with the count of customers in that bucket. Cap the histogram at "10+" to avoid a long tail.
Input.
| customer_id | order_id |
|---|---|
| c_001 | o_1 |
| c_001 | o_2 |
| c_001 | o_3 |
| c_002 | o_4 |
| c_002 | o_5 |
| c_003 | o_6 |
| c_004 | o_7 |
| c_004 | o_8 |
| c_004 | o_9 |
| c_004 | o_10 |
| c_004 | o_11 |
Code.
WITH per_customer AS (
SELECT customer_id, COUNT(*) AS orders_count
FROM orders
WHERE order_day >= DATE_TRUNC('month', CURRENT_DATE)
GROUP BY customer_id
),
bucketed AS (
SELECT
CASE WHEN orders_count >= 10 THEN '10+'
ELSE CAST(orders_count AS VARCHAR) END AS bucket,
COUNT(*) AS customers_in_bucket
FROM per_customer
GROUP BY 1
)
SELECT bucket, customers_in_bucket
FROM bucketed
ORDER BY
CASE WHEN bucket = '10+' THEN 99 ELSE CAST(bucket AS INTEGER) END;
Step-by-step explanation.
-
per_customerCTE. Groups orders bycustomer_id; produces one row per customer with theirorders_count. c_001 → 3, c_002 → 2, c_003 → 1, c_004 → 5. -
bucketedCTE. Re-aggregates the previous result. TheCASEcollapses any count ≥ 10 into the'10+'bucket; counts 1–9 are kept as their own bucket. - The outer
SELECTorders the result: numeric buckets ascending, then'10+'last. The trick is theCASEinORDER BY—'10+'would sort as a string between'1'and'2'otherwise. - This is the textbook count-of-count pattern:
COUNT(*)inside a group, thenCOUNT(*)over the resulting per-group rows. - Notice that the inner CTE could also use
COUNT(DISTINCT order_id)if the source table had duplicate order rows — same logic, slightly more defensive.
Output.
| bucket | customers_in_bucket |
|---|---|
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 5 | 1 |
| 10+ | 0 |
Rule of thumb. Whenever a question asks "how many users had X items, Y times" or "show me a histogram of …", the answer is a layered CTE: aggregate once to get per-entity counts, then aggregate again to bucket the counts. The pattern composes cleanly with FILTER (WHERE …) for cohorts and with window functions for percentiles.
5. Advanced aggregates — GROUPING SETS, ROLLUP, CUBE, FILTER
One query, multiple grouping levels — collapse three reports into one with GROUPING SETS
The mental model in one line: GROUPING SETS lets you specify multiple group-by combinations in a single query; ROLLUP is the hierarchical-subtotals shortcut; CUBE is the all-combinations shortcut; FILTER (WHERE …) is the conditional-aggregate shortcut. All four extensions exist because writing three UNION ALL-joined GROUP BY queries is verbose, slow (three scans of the same table), and error-prone.
The four extensions in one bullet each.
-
GROUPING SETS ((a, b), (a), ())— explicit list of grouping combinations. Most flexible; verbose for many combinations. -
ROLLUP(a, b, c)— hierarchical subtotals:(a,b,c) → (a,b) → (a) → (). Cleanest way to ship "month + quarter + year + grand total" in one query. -
CUBE(a, b)— all combinations:(a,b) → (a) → (b) → (). 2^N grouping sets for N keys — explodes fast, use sparingly. -
FILTER (WHERE cond)— conditional aggregate. ReplacesSUM(CASE WHEN cond THEN x ELSE 0 END)withSUM(x) FILTER (WHERE cond). ANSI, supported in Postgres / Snowflake / SQL Server 2022+ / DuckDB / Spark SQL.
Dialect coverage matrix.
| Feature | Postgres | Snowflake | BigQuery | MySQL ≤ 8 | SQL Server |
|---|---|---|---|---|---|
GROUPING SETS |
yes | yes | yes (via WITH ROLLUP workaround) |
no | yes |
ROLLUP |
yes | yes | yes |
WITH ROLLUP only |
yes |
CUBE |
yes | yes | no | no | yes |
FILTER (WHERE …) |
yes | yes | no | no | 2022+ |
GROUPING(col) |
yes | yes | yes | no | yes |
The GROUPING(col) function.
- Returns 1 if the column is "grouped out" (subtotal row), 0 if it contributes to the row's grouping. Essential when paired with
ROLLUPorCUBEto label subtotal rows in the output ("All regions" instead of NULL). - Idiomatic use:
CASE WHEN GROUPING(region) = 1 THEN 'All regions' ELSE region END AS region_label.
Why FILTER (WHERE …) is strictly better than SUM(CASE WHEN …).
- Clearer intent. "Sum amount, filtered to paid rows" reads top-to-bottom; the CASE form requires the reader to mentally fold the predicate inside the SUM.
- Slightly faster. The engine can push the predicate down before the SUM accumulator runs; the CASE form materialises NULL or 0 per row first.
-
Composes with COUNT.
COUNT(*) FILTER (WHERE x)is the clearest conditional count; the CASE form isSUM(CASE WHEN x THEN 1 ELSE 0 END)— readable but noisier. -
Composes with window functions.
COUNT(*) FILTER (WHERE x) OVER (PARTITION BY y)works in Postgres; the CASE form does too but reads worse.
Worked example — quarterly revenue with subtotals and a grand total
Detailed explanation. A classic finance-reporting question: produce a single result set with revenue by year × quarter, subtotals by year, and one grand-total row. The shape is GROUP BY ROLLUP(year, quarter), with GROUPING(quarter) and GROUPING(year) to label the subtotal rows.
Question. Given a sales table with year, quarter, and revenue, write a single query that returns four levels: each (year, quarter) bucket, a subtotal per year, and a grand total at the bottom. Use labels instead of NULLs in the subtotal rows.
Input.
| year | quarter | revenue |
|---|---|---|
| 2025 | Q1 | 100 |
| 2025 | Q2 | 150 |
| 2025 | Q3 | 200 |
| 2025 | Q4 | 250 |
| 2026 | Q1 | 300 |
| 2026 | Q2 | 350 |
Code.
SELECT
CASE WHEN GROUPING(year) = 1 THEN 'ALL YEARS'
ELSE CAST(year AS VARCHAR) END AS year_label,
CASE WHEN GROUPING(quarter) = 1 THEN
CASE WHEN GROUPING(year) = 1 THEN 'GRAND TOTAL' ELSE 'YEAR TOTAL' END
ELSE quarter END AS quarter_label,
SUM(revenue) AS revenue,
GROUPING(year) AS g_year,
GROUPING(quarter) AS g_quarter
FROM sales
GROUP BY ROLLUP(year, quarter)
ORDER BY g_year, year, g_quarter, quarter;
Step-by-step explanation.
-
ROLLUP(year, quarter)is equivalent toGROUPING SETS ((year, quarter), (year), ())— three grouping levels generated automatically. - For each grouping level the engine produces one row per distinct group plus the NULL placeholder for "rolled-up" columns.
(year, quarter)produces 6 rows;(year)produces 2 rows (one per distinct year);()produces 1 grand-total row. -
GROUPING(quarter)returns 1 whenquarteris rolled up (i.e. the row is a yearly subtotal or the grand total).GROUPING(year)returns 1 only on the grand-total row. - The labelled
CASEtranslatesGROUPING(quarter) = 1rows into'YEAR TOTAL'(or'GRAND TOTAL'when both year and quarter are rolled up) for human-friendly output. -
ORDER BY g_year, year, g_quarter, quarterkeeps the subtotals immediately under their parent group and the grand total at the bottom — the natural reading order for a finance report.
Output.
| year_label | quarter_label | revenue | g_year | g_quarter |
|---|---|---|---|---|
| 2025 | Q1 | 100 | 0 | 0 |
| 2025 | Q2 | 150 | 0 | 0 |
| 2025 | Q3 | 200 | 0 | 0 |
| 2025 | Q4 | 250 | 0 | 0 |
| 2025 | YEAR TOTAL | 700 | 0 | 1 |
| 2026 | Q1 | 300 | 0 | 0 |
| 2026 | Q2 | 350 | 0 | 0 |
| 2026 | YEAR TOTAL | 650 | 0 | 1 |
| ALL YEARS | GRAND TOTAL | 1350 | 1 | 1 |
Rule of thumb. When the report needs "details + subtotals + grand total," reach for ROLLUP. When the levels aren't strictly hierarchical (e.g. you want region totals AND channel totals AND the grand total, but not the (region, channel) cross), reach for explicit GROUPING SETS. Reserve CUBE for OLAP cubes where you genuinely need every combination.
SQL interview question on conditional aggregation across many slices
A staff-level interviewer might frame this as: "Compute a single report row per region with seven metrics: total revenue, paid revenue, refunded revenue, AOV, refund rate, distinct customers, and approximate distinct sessions. Do it in one pass." It blends FILTER, NULL-safe math, exact and approximate distinct counts, and a clean GROUP BY — exactly the shape staff interviewers reward.
Solution Using FILTER (WHERE …) for every conditional metric in one pass
SELECT
region,
SUM(amount) AS total_revenue,
SUM(amount) FILTER (WHERE status = 'paid') AS paid_revenue,
SUM(amount) FILTER (WHERE status = 'refunded') AS refunded_revenue,
AVG(amount) FILTER (WHERE status = 'paid') AS avg_paid_order_value,
COUNT(*) FILTER (WHERE status = 'refunded') * 1.0
/ NULLIF(COUNT(*), 0) AS refund_rate,
COUNT(DISTINCT customer_id) AS distinct_customers,
APPROX_COUNT_DISTINCT(session_id) AS distinct_sessions_approx
FROM orders
WHERE order_day >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY ROLLUP(region)
ORDER BY GROUPING(region), region;
Step-by-step trace.
| region | order_id | customer_id | session_id | amount | status |
|---|---|---|---|---|---|
| EU | 1 | c_001 | s_aa | 100 | paid |
| EU | 2 | c_001 | s_aa | 200 | paid |
| EU | 3 | c_002 | s_bb | 50 | refunded |
| US | 4 | c_003 | s_cc | 300 | paid |
| US | 5 | c_003 | s_cc | 400 | paid |
| US | 6 | c_004 | s_dd | 100 | refunded |
| US | 7 | c_005 | s_ee | NULL | pending |
Walking the aggregate per region:
-
EU. 3 orders, 2 distinct customers (c_001, c_002), 2 distinct sessions.
SUM(amount)= 350.paid_revenue= 300.refunded_revenue= 50.avg_paid_order_value= 150.refund_rate= 1/3 ≈ 0.333. -
US. 4 orders, 3 distinct customers (c_003, c_004, c_005), 3 distinct sessions.
SUM(amount)= 800 (NULL skipped).paid_revenue= 700.refunded_revenue= 100.avg_paid_order_value= 350.refund_rate= 1/4 = 0.25. -
Grand total (rolled-up region). 7 orders, 5 distinct customers, 5 distinct sessions.
SUM(amount)= 1150.paid_revenue= 1000.refunded_revenue= 150. -
APPROX_COUNT_DISTINCT(session_id)keeps a HyperLogLog sketch per group; tiny memory regardless of session cardinality. On the test data it returns the exact value because the input is small enough that HLL is essentially perfect. -
ORDER BY GROUPING(region), regionputs the detail rows first (g=0) and the grand-total row last (g=1) — natural for a downstream BI tile.
Output:
| region | total_revenue | paid_revenue | refunded_revenue | avg_paid_order_value | refund_rate | distinct_customers | distinct_sessions_approx |
|---|---|---|---|---|---|---|---|
| EU | 350 | 300 | 50 | 150.0 | 0.333 | 2 | 2 |
| US | 800 | 700 | 100 | 350.0 | 0.250 | 3 | 3 |
| NULL | 1150 | 1000 | 150 | 250.0 | 0.286 | 5 | 5 |
Why this works — concept by concept:
- FILTER (WHERE …) for every conditional metric — one accumulator per metric; engine walks the table once. The CASE-WHEN alternative would produce identical results but with noisier code.
-
ROLLUP(region) — adds the grand-total row automatically. Without it, you'd need a
UNION ALLof two queries or a separate "all regions" CTE — both slower and more error-prone. - Exact COUNT(DISTINCT) for customers, APPROX_ for sessions — match the precision budget to the metric. Customer counts often feed billing or audit reports (exact); session counts feed dashboards (approximate is fine).
- NULLIF guard — keeps the refund-rate division safe when a region has zero orders.
-
Single-pass aggregate — the entire result set comes from one scan of
ordersfiltered to the last 30 days. The engine builds a hash-aggregate with one slot per region plus one for the rolled-up grand total. - Cost — O(N) scan with O(distinct regions + HLL sketch size) memory per group. The HLL sketches are tiny (~1.5 KB each), so memory is dominated by the exact-distinct customer set.
SQL
Topic — group-by
GROUP BY problems (subtotals, ROLLUP, CUBE)
Worked example — explicit GROUPING SETS for non-hierarchical reports
Detailed explanation. ROLLUP assumes a strict hierarchy: (year, quarter, month) rolls up to (year, quarter) then (year) then grand total. But real reports often need non-hierarchical slices — "show me totals by region, totals by channel, and a grand total, but don't show me the (region, channel) cross." That's the use case for explicit GROUPING SETS.
Question. Given orders(region, channel, amount), produce three sets of totals in a single query: revenue per region, revenue per channel, and a grand total. The output should label each row's grouping level clearly so the BI tool can render the three sections separately.
Input.
| region | channel | amount |
|---|---|---|
| EU | web | 100 |
| EU | mobile | 150 |
| US | web | 200 |
| US | mobile | 300 |
| APAC | web | 50 |
Code.
SELECT
COALESCE(region, 'ALL REGIONS') AS region,
COALESCE(channel, 'ALL CHANNELS') AS channel,
CASE
WHEN GROUPING(region) = 0 AND GROUPING(channel) = 1 THEN 'by region'
WHEN GROUPING(region) = 1 AND GROUPING(channel) = 0 THEN 'by channel'
ELSE 'grand total'
END AS grouping_level,
SUM(amount) AS revenue
FROM orders
GROUP BY GROUPING SETS (
(region),
(channel),
()
)
ORDER BY GROUPING(region), region, GROUPING(channel), channel;
Step-by-step explanation.
-
GROUPING SETS ((region), (channel), ())tells the engine to compute three independent aggregations andUNIONthe results into one output. No(region, channel)cross is produced. - Per grouping set, the engine fills the columns that aren't grouped on with NULL.
(region)rows havechannel = NULL;(channel)rows haveregion = NULL; the()row has both NULL. -
COALESCE(region, 'ALL REGIONS')rewrites the NULL placeholder into a human label. This is the standard idiom becauseGROUPING(col)andCOALESCEare both ANSI-portable. - The
grouping_levelcolumn usesGROUPING(region)andGROUPING(channel)to label each row — essential when the BI tool needs to render the three sections in separate UI cards. - The
ORDER BYputs region-level totals first, then channel-level totals, then the grand total. Stable, deterministic order regardless of input row order.
Output.
| region | channel | grouping_level | revenue |
|---|---|---|---|
| APAC | ALL CHANNELS | by region | 50 |
| EU | ALL CHANNELS | by region | 250 |
| US | ALL CHANNELS | by region | 500 |
| ALL REGIONS | mobile | by channel | 450 |
| ALL REGIONS | web | by channel | 350 |
| ALL REGIONS | ALL CHANNELS | grand total | 800 |
Rule of thumb. Use ROLLUP when the levels are strictly hierarchical (year → quarter → month). Use explicit GROUPING SETS when the slices are non-hierarchical (region totals + channel totals, but no cross). Use CUBE only when you genuinely want every combination — and remember it's 2^N grouping sets for N keys, so CUBE(a, b, c, d) is 16 grouping sets in one query.
Cheat sheet — aggregate idioms you should memorise
-
Conditional count.
COUNT(*) FILTER (WHERE x > 0)— ANSI; portable fallback isSUM(CASE WHEN x > 0 THEN 1 ELSE 0 END). -
Conditional sum.
SUM(amount) FILTER (WHERE status = 'paid')— clearest paid-vs-refunded shape. -
NULL-safe divide.
SUM(num) * 1.0 / NULLIF(SUM(denom), 0)— never crashes on empty groups. -
Empty-group zero.
COALESCE(SUM(amount), 0)— turns NULL into 0 when the group is empty. -
First-seen.
MIN(created_at)per group. Last-touched.MAX(updated_at)per group. -
First / last value with row context.
ROW_NUMBER() OVER (PARTITION BY group ORDER BY ts)+ filterWHERE rn = 1— keeps every column of the matching row. -
Distinct days active.
COUNT(DISTINCT event_day)peruser_id— natural cohort metric. -
Distinct tuples.
COUNT(DISTINCT user_id, event_day)(Postgres / Snowflake / BigQuery / Oracle); fallbackCOUNT(DISTINCT CONCAT(user_id, '|', event_day))on SQL Server. -
Approximate distinct.
APPROX_COUNT_DISTINCT(col)— Snowflake / BigQuery;approx_count_distinct(col, rsd)— Spark;hll_cardinality(hll_add_agg(hll_hash_text(col)))— Postgres + hll extension. -
Median.
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY x)— Postgres / Snowflake / Oracle. -
Percentile.
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY latency_ms)— p95 latency in one line. -
Mode (most-frequent).
MODE() WITHIN GROUP (ORDER BY x)— Postgres;(SELECT x FROM t GROUP BY x ORDER BY COUNT(*) DESC LIMIT 1)— portable fallback. -
Range.
MAX(x) - MIN(x)— works on numerics and dates. -
Stddev / variance.
STDDEV_SAMP(x),VAR_SAMP(x)— sample stats;STDDEV_POP,VAR_POPfor population stats. -
String aggregation.
STRING_AGG(name, ', ' ORDER BY name)(Postgres / SQL Server);LISTAGG(name, ', ') WITHIN GROUP (ORDER BY name)(Snowflake / Oracle);GROUP_CONCAT(name SEPARATOR ', ')(MySQL). -
Array aggregation.
ARRAY_AGG(value ORDER BY ts)(Postgres / Snowflake / BigQuery) — emit a sorted array per group. -
Subtotals + grand total.
GROUP BY ROLLUP(year, quarter)— one query, four grouping levels. -
All combinations.
GROUP BY CUBE(region, channel)— explodes to 2^N grouping sets; use sparingly. -
Explicit groupings.
GROUP BY GROUPING SETS ((region), (channel), ())— region totals, channel totals, grand total, no cross. -
Pivot via FILTER.
SUM(amount) FILTER (WHERE region = 'EU') AS eu_revenue, SUM(amount) FILTER (WHERE region = 'US') AS us_revenue— explicit pivot withoutPIVOTclause. -
Running totals.
SUM(amount) OVER (PARTITION BY user_id ORDER BY day)— aggregate-as-window, no GROUP BY collapse. -
Defensive cast for big SUMs.
SUM(amount::numeric)(Postgres) orSUM(CAST(amount AS DECIMAL(38, 2)))(Snowflake / Oracle) — avoids BIGINT overflow on long-running aggregates. -
HAVING vs WHERE.
HAVING SUM(amount) > 1000filters after the aggregate;WHERE amount > 1000filters before. Don't mix them up. -
Anchored zero rows.
LEFT JOIN dim+COALESCE(SUM(...), 0)— empty groups show as zero, not missing.
Frequently asked questions
COUNT(*) vs COUNT(1) vs COUNT(col) — which is faster?
COUNT(*) and COUNT(1) are identical in every modern engine — Postgres, MySQL 8+, Snowflake, BigQuery, SQL Server, and Spark SQL all optimise both to the same execution plan. The "COUNT(1) is faster" myth comes from very old MySQL versions and Oracle 8 trivia that no longer applies. COUNT(col) is genuinely different: it scans the column to check for NULLs, so it is marginally slower on wide tables, and it returns a different number when NULLs are present. Use COUNT(*) when you want "rows," COUNT(col) when you want "rows with a value," and COUNT(DISTINCT col) when you want "unique values."
Does COUNT(DISTINCT) use an index?
Usually no — COUNT(DISTINCT col) requires building a hash of every distinct value, which the optimiser builds by scanning the column. A B-tree index on col can occasionally help if the engine supports an "index-only scan" of the sorted index leaves (Postgres can do this for some shapes), but the typical execution plan is a sequential scan plus a hash-aggregate. The fix when COUNT(DISTINCT) is too slow is rarely "add an index" — it's "switch to APPROX_COUNT_DISTINCT" or "pre-aggregate to a daily snapshot table."
How accurate is APPROX_COUNT_DISTINCT?
The HyperLogLog++ algorithm used by Snowflake, BigQuery, and Spark targets ~1.6% standard error with the default 14-bit register count (16,384 registers, ~1.5 KB memory). That means on a true distinct count of 1,000,000,000 the approximate answer is typically within ±16 million. The error is symmetric (it can over- or under-count) and unbiased in expectation. You can tune it tighter — Spark's approx_count_distinct(col, rsd=0.01) targets 1% error at ~4× memory — but for almost every dashboard the default is fine, and the speed-up vs exact COUNT(DISTINCT) is often 50–200×.
Can I aggregate over window functions?
Sort of — you can nest a window function inside a subquery and then aggregate over the result, but you cannot put a window function directly inside an aggregate function (SUM(ROW_NUMBER() OVER (...)) is illegal). The idiomatic pattern is SELECT SUM(rn) FROM (SELECT *, ROW_NUMBER() OVER (...) AS rn FROM t) sub. Aggregates can also be used as window functions — SUM(amount) OVER (PARTITION BY user_id ORDER BY day) produces a running total per user — but those are window-aggregates, not regular aggregates with a window inside.
AVG ignores NULLs but COUNT(*) counts them — why is the rule asymmetric?
COUNT(*) is defined as "the cardinality of the input rows" — it has nothing to do with any column's value, so NULLs don't apply. Every other aggregate is defined as "the reduction of non-NULL values in col" — they operate on the column, and SQL's three-valued logic treats NULL as "absent value," so it's excluded from the reduction. The practical implication: AVG(amount) divides by COUNT(amount), not by COUNT(*). If you want NULL rows to drag the average down (treating them as zero), wrap the column: AVG(COALESCE(amount, 0)).
GROUP BY vs PARTITION BY — when to use each?
GROUP BY collapses the input into one output row per group; PARTITION BY (inside an OVER window) keeps every input row and tags each one with a window-aggregate value. Use GROUP BY for "show me one row per region with total revenue"; use PARTITION BY for "show me every order row along with its region's total revenue next to it." They compose well: a GROUP BY query can sit inside a CTE, and a window function can run over that CTE's output. The classic interview probe is "how would you write a query that returns each order plus its rank within the region by amount?" — answer: SELECT *, ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC) FROM orders, not a GROUP BY.
Why does SUM on an empty group return NULL instead of 0?
SQL aggregates are defined as "the reduction of non-NULL values," and the reduction of zero values is undefined — SQL chooses to encode that as NULL. Only COUNT(*) and COUNT(col) return 0 on an empty input because their domain is row counts, where 0 is a legitimate cardinality. If your dashboard needs zero on empty groups, wrap with COALESCE(SUM(amount), 0). The most common production bug from this rule is dashboards that lose rows entirely: when you GROUP BY region on a fact table, regions with zero orders disappear from the result, not just their revenue. The fix is to anchor on a dimension table with LEFT JOIN and COALESCE the aggregate.
What is the difference between HAVING and WHERE in aggregates?
WHERE filters rows before the aggregate runs; HAVING filters groups after the aggregate runs. WHERE amount > 100 keeps only rows where the individual order is above 100, then aggregates. HAVING SUM(amount) > 1000 aggregates first, then keeps only groups where the total is above 1000. The two are not interchangeable — WHERE SUM(amount) > 1000 is illegal because SUM isn't defined per row. The portable rule: put scalar predicates in WHERE, aggregate predicates in HAVING. Modern engines optimise WHERE predicates by pushing them into the scan, while HAVING predicates have to wait for the aggregate to complete — so anything that can go in WHERE should.
Practice on PipeCode
- Drill the aggregation practice library → for end-to-end COUNT, SUM, AVG, and conditional-aggregate questions.
- Rehearse GROUP BY and HAVING drills → for subtotals, ROLLUP, and HAVING-vs-WHERE traps.
- Sharpen DISTINCT and COUNT(DISTINCT) problems → when the interviewer wants exact-vs-approximate distinct counts.
- Stack window function questions → for the matching-row trap and running-totals patterns that pair with aggregates.
- Work through conditional-aggregation drills → to internalise FILTER vs CASE WHEN.
- Lock in SQL practice across every topic → for breadth across joins, subqueries, and aggregates.
- For the broader interview surface, read top data engineering interview questions →.
- Stack the prerequisites with the only 5 skills you need to become a data engineer →.
- Reinforce the relational fundamentals with the SQL for data engineering interviews course →.
- Pair this guide with SQL data types deep dive → for the types that the aggregates above promote and overflow against.
Pipecode.ai is Leetcode for Data Engineering — every aggregate concept above ships with hands-on practice rooms where you write the COUNT, SUM, and ROLLUP queries against real schemas and get scored in real time. Start with the aggregation library and work outward; PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine.





Top comments (0)