GROUP BY in SQL collapses many detail rows into one row per distinct key combination, sql aggregate functions (COUNT, SUM, AVG, MIN, MAX) compute a summary value per group, and the HAVING clause filters those groups after the collapse — while WHERE filters raw rows before. That single sentence answers most of the sql interview questions in the group by sql / sql group by cluster, and getting the having clause sql semantics right separates candidates who pass screening from candidates who don't.
This guide walks through every clause in the GROUP BY / HAVING pipeline that reviewers love to test in data engineering interview questions: the row-collapse mental model, the full aggregate functions in sql catalogue, the canonical having vs where clause in sql execution-order trap, group by multiple columns sql with multi-key grain, conditional aggregation tied to Blog72, advanced rollup cube grouping sets sql subtotals, and the seven interview gotchas (fan-out, non-aggregate columns, NULL groups) that fail most candidates. Every section ends as sql interview questions with answers: a runnable PostgreSQL query, a traced execution, an output table, and a concept-by-concept why this works breakdown — the exact shape sql for data engineers rounds reward when aggregation in sql comes up.
When you want hands-on reps immediately after reading, browse GROUP BY SQL practice →, drill the HAVING clause SQL lane →, sharpen aggregation SQL drills →, rehearse conditional aggregation problems →, or widen coverage on the full SQL practice library →.
On this page
- Why GROUP BY and HAVING matter in data engineering interviews
- Anatomy of GROUP BY — the row-collapse mental model
- SQL aggregate functions — COUNT, SUM, AVG, MIN, MAX, COUNT DISTINCT
- WHERE vs HAVING — the execution-order trap
- GROUP BY with conditional aggregation — SUM(CASE WHEN …) revisited
- GROUP BY ROLLUP, CUBE, and GROUPING SETS — multi-level subtotals
- GROUP BY interview gotchas — fan-out, non-aggregate columns, NULL groups
- Choosing where to filter (cheat sheet)
- Frequently asked questions
- Practice on PipeCode
1. Why GROUP BY and HAVING matter in data engineering interviews
group by sql collapses; aggregates summarise; having sql filters the groups
The one-sentence invariant interviewers chase: group by sql collapses rows that share the same key values into one row per group; aggregate functions compute a summary value per group; having sql filters those groups after the collapse, while WHERE filters rows *before*. If you can say that aloud and pick the right clause for each predicate, you have already cleared the bar most candidates fail to clear when sql group by having shows up on a whiteboard.
What GROUP BY does at a glance.
- Definition — collapses rows that share the listed column values into one output row per distinct combination.
-
Required syntax —
SELECT <cols> FROM <table> GROUP BY <col1>, <col2>, …. -
Output grain — exactly one row per distinct combination of the
GROUP BYcolumns. -
Non-aggregate column rule — every column in
SELECTmust either appear inGROUP BYor be wrapped inside an aggregate; ANSI SQL enforces this and modernmysql group bydoes too onceONLY_FULL_GROUP_BYis on.
What HAVING does at a glance.
- Definition — filters the grouped rows; usually references aggregate functions.
-
Required syntax —
… GROUP BY <cols> HAVING <predicate>. -
HAVINGcan reference aggregates —WHEREcannot. -
Both
WHEREandHAVINGcan coexist in one query; the canonical interview shape isWHEREfor row filters,GROUP BY, thenHAVINGfor group filters. - The phrase
having by in sql(a common misnomer for theHAVINGclause) refers to this same construct.
The execution pipeline you must memorise.
- The SQL clauses run in this order regardless of how you write them:
-
FROM/JOIN— assemble the row source. -
WHERE— pre-filter the detail rows. -
GROUP BY— collapse rows into one row per group. -
HAVING— filter the grouped rows. - window functions — decorate the grouped rows.
-
SELECT— project the final columns (including aggregate aliases). -
ORDER BY— sort the final output. -
LIMIT— truncate.
-
-
Consequence #1 —
WHEREcannot reference an aggregate (aggregates don't exist yet). -
Consequence #2 —
HAVINGcan reference aggregates and is the right place forSUM(x) > 1000-style predicates. -
Consequence #3 — the
SELECTalias is NOT visible inWHERE/GROUP BY/HAVING(alias does not yet exist).
Why interviewers love this combo.
- Tests row-grain awareness — knowing that
GROUP BY regionproduces one row per region. - Tests aggregate semantics — understanding
COUNT(*)vsCOUNT(col)vsCOUNT(DISTINCT col). - Tests execution-order understanding — picking
WHEREvsHAVINGcorrectly. - Tests fan-out discipline — knowing not to aggregate before deduping a one-to-many join.
What interviewers listen for.
- Do you name the grain of the output explicitly before writing the query? — "this produces one row per
(region, channel)combination." - Do you put row-level predicates in
WHEREand aggregate predicates inHAVING? — the canonical separation. - Do you know the execution order when challenged on
WHEREvsHAVING? — senior signal. - Do you reach for the right aggregate for the question? —
SUMfor totals,COUNT(*)for row counts,COUNT(DISTINCT col)for unique counts,AVGfor means.
Worked example — orders rolled up per region
Question. From orders(order_id, region, amount), return per region: the number of orders and the total revenue. Order the result by total revenue descending.
Input. orders slice.
| order_id | region | amount |
|---|---|---|
| 1 | US | 100 |
| 2 | US | 150 |
| 3 | EU | 80 |
| 4 | US | 200 |
| 5 | EU | 120 |
Code.
SELECT region,
COUNT(*) AS order_count,
SUM(amount) AS total_revenue
FROM orders
GROUP BY region
ORDER BY total_revenue DESC;
Step-by-step explanation.
-
FROM ordersreads every row. -
GROUP BY regioncollapses rows that share the same region into one group. -
COUNT(*)counts the rows inside each group;SUM(amount)totals the amounts. - The output grain is exactly one row per distinct region.
-
ORDER BY total_revenue DESCreferences theSELECTalias — legal becauseORDER BYruns afterSELECT.
Output.
| region | order_count | total_revenue |
|---|---|---|
| US | 3 | 450 |
| EU | 2 | 200 |
Rule of thumb: when the question asks "per X", write GROUP BY X first, then pick aggregates for the metrics — that's the heart of sql group by and the cleanest way to learn aggregation in sql.
SQL
Topic — group-by
GROUP BY SQL drills
SQL
Topic — having-clause
HAVING clause SQL problems
2. Anatomy of GROUP BY — the row-collapse mental model
what does group by do in sql — collapses key-sharing rows into one row per group
The mental model: think of GROUP BY as a row-collapse operator. You start with many detail rows; you end with one row per distinct value of the GROUP BY columns, and aggregate functions populate the summary columns. That's how does group by work in sql in one sentence.
what does group by do in sql — the one-line answer.
- Collapses rows that share the listed column values into one output row per distinct combination.
- Aggregate functions populate the non-grouped output columns.
-
Detail-level information is lost unless preserved by an aggregate (
MIN,MAX,STRING_AGG, etc.). - The output grain equals the
GROUP BYcolumn combination.
Single column GROUP BY.
-
Shape —
GROUP BY <single_column>. -
Output grain — one row per distinct value of
<single_column>. - Example use cases — orders per region, signups per day, revenue per product.
-
NULL is a group — all rows where the column is
NULLform their own group.
group by multiple columns sql — multi-key grouping.
-
Shape —
GROUP BY <col1>, <col2>, <col3>. - Output grain — one row per distinct combination of all listed columns.
-
Order in
GROUP BYdoes not change the result — only the planner's sort/hash choice. -
Positional shorthand —
GROUP BY 1, 2refers to the first and secondSELECTcolumns (works in Postgres, MySQL, Snowflake, BigQuery; not SQL Server). -
Example —
GROUP BY region, channelproduces one row per(region, channel)pair.
GROUP BY on expressions.
-
GROUP BYaccepts any scalar expression — not just bare columns. -
Date truncation —
GROUP BY DATE_TRUNC('month', sale_date)producessql group by monthshape (one row per calendar month). -
CASE expressions —
GROUP BY CASE WHEN amount >= 100 THEN 'big' ELSE 'small' ENDgroups by a derived bucket. -
GROUP BY 1positional — handy for terse queries; less safe than explicit column names in long-lived code. -
group by in postgresqlquirk — Postgres lets youGROUP BYon aSELECTalias in some versions, unlike standard SQL.
The non-aggregate column rule.
-
Every non-aggregate
SELECTcolumn must appear inGROUP BY— ANSI SQL requirement. - MySQL pre-8.0 historically allowed silently picking an arbitrary value from the group (a notorious bug source).
-
Modern
mysql group byenforces this whensql_modeincludesONLY_FULL_GROUP_BY(the default since MySQL 5.7.5). -
Workaround — wrap the offending column in
MIN(col),MAX(col), orANY_VALUE(col)to satisfy the rule.
NULL is a valid group key.
-
Rows with
NULLin aGROUP BYcolumn form their own group together. -
NULL = NULLisUNKNOWNin normal predicates, but forGROUP BYpurposesNULLis treated as a single distinct value. -
One
NULLgroup perNULLcolumn combination —(NULL, 'US')is distinct from('CA', 'US')and from(NULL, NULL).
Worked example — sql group by month for time-series rollup
Question. From sales(sale_date, amount), compute total revenue per calendar month for 2026.
Input. sales slice.
| sale_date | amount |
|---|---|
| 2026-01-05 | 100 |
| 2026-01-20 | 200 |
| 2026-02-03 | 150 |
| 2026-02-18 | 250 |
| 2026-03-10 | 175 |
Code.
SELECT DATE_TRUNC('month', sale_date)::date AS month,
SUM(amount) AS monthly_revenue
FROM sales
WHERE sale_date >= '2026-01-01'
AND sale_date < '2027-01-01'
GROUP BY DATE_TRUNC('month', sale_date)
ORDER BY 1;
Step-by-step explanation.
-
WHERE sale_date BETWEEN …pre-filters detail rows before grouping (cheaper). -
DATE_TRUNC('month', sale_date)rounds each timestamp down to the first of its month. -
GROUP BY DATE_TRUNC('month', sale_date)collapses rows that share the same month. -
SUM(amount)totals the amounts per month. -
ORDER BY 1sorts by the first column (the truncated month).
Output.
| month | monthly_revenue |
|---|---|
| 2026-01-01 | 300 |
| 2026-02-01 | 400 |
| 2026-03-01 | 175 |
Rule of thumb: whenever the question says "by month" / "by quarter" / "by week", reach for DATE_TRUNC (Postgres) or the equivalent (DATE_FORMAT in MySQL, DATETRUNC in SQL Server, TRUNC in Oracle) and put it inside GROUP BY.
SQL
Topic — group-by
Multi-column GROUP BY drills
SQL
Topic — aggregation
Aggregation SQL library
3. SQL aggregate functions — COUNT, SUM, AVG, MIN, MAX, COUNT DISTINCT
The sql aggregate functions catalogue — five core plus the COUNT-DISTINCT family
The sql aggregate functions you actually use in interviews fit on one screen: COUNT, SUM, AVG, MIN, MAX, plus the COUNT(DISTINCT col) variant and the string/array aggregates. Knowing the NULL semantics of each is what separates juniors from seniors.
The aggregate functions in sql catalogue at a glance.
-
COUNT(*)— counts every row in the group, including rows with NULLs in any column. -
COUNT(col)— counts rows wherecol IS NOT NULL; skips NULL rows silently. -
COUNT(DISTINCT col)— counts unique non-NULL values ofcol; thecount distinct group by sqlinterview classic. -
SUM(col)— adds non-NULL numeric values; returnsNULLwhen every value in the group isNULL. -
AVG(col)— arithmetic mean of non-NULL values; ignores NULL. -
MIN(col)/MAX(col)— smallest / largest non-NULL value across the group. -
STRING_AGG(col, delim)— concatenates non-NULL values into one string per group (Postgres / SQL Server). Dialect aliases:GROUP_CONCAT(MySQL),LISTAGG(Snowflake / Oracle). -
ARRAY_AGG(col)— collects values into an array per group (Postgres / BigQuery / Snowflake). -
Statistical aggregates —
STDDEV,VARIANCE,PERCENTILE_CONT,PERCENTILE_DISC(dialect-dependent).
COUNT(*) vs COUNT(col) vs COUNT(DISTINCT col) — the sql count group by interview test.
-
COUNT(*)counts rows; a row contributes1even if every column isNULL. Use this for "how many rows landed in this group". -
COUNT(col)counts rows wherecolis not NULL. Use this for "how many of these rows had a recordedcol". -
COUNT(DISTINCT col)counts distinct non-NULL values ofcol. Use this for "how many unique X showed up". -
Common bug — using
COUNT(col)when you meantCOUNT(*), and silently undercounting rows wherecolis NULL. -
Postgres / Snowflake — both fully support
COUNT(DISTINCT). BigQuery legacy — usesAPPROX_COUNT_DISTINCT; modern BigQuery supports exactCOUNT(DISTINCT).
SUM semantics — the sql sum group by pattern.
-
Adds non-NULL values per group; treats
NULLas zero in the sense of "skip it", not "add zero". -
Empty group returns
NULL, not0— guard withCOALESCE(SUM(x), 0)if the downstream consumer expects zero. -
Overflow — large
SUMs onINTEGERcolumns can overflow; cast toBIGINTorNUMERICif needed. -
Combined with
GROUP BY— the canonical patternSELECT region, SUM(amount) FROM orders GROUP BY region.
AVG semantics.
-
Arithmetic mean of non-NULL values per group;
(SUM / COUNT(col)). -
Empty group returns
NULL. -
Integer division trap —
AVGon an integer column returns an integer in some dialects (MySQL); cast toDECIMALorNUMERICfor fractional means:AVG(col::numeric).
MIN / MAX semantics.
- Smallest / largest non-NULL value across the group.
- Works on any orderable type — numbers, strings (lexicographic), dates, timestamps.
-
Empty group returns
NULL. -
Useful pattern —
MIN(created_at)to find each group's earliest record;MAX(updated_at)for the latest.
String aggregates — joining rows into one string per group.
-
PostgreSQL / SQL Server —
STRING_AGG(col, ', '). AddORDER BYinside for sorted concatenation:STRING_AGG(name, ', ' ORDER BY name). -
MySQL —
GROUP_CONCAT(col SEPARATOR ', ' ORDER BY col). Has a default length limit (group_concat_max_len); raise it for long results. -
Snowflake / Oracle —
LISTAGG(col, ', ') WITHIN GROUP (ORDER BY col). - Use case — "comma-separated list of customer names per region for the sales team".
Array aggregates.
-
Postgres / BigQuery / Snowflake —
ARRAY_AGG(col ORDER BY col). - Returns an array column — useful for JSON exports or downstream UNNEST operations.
-
ARRAY_AGG(DISTINCT col)— deduplicate before collecting. -
FILTER (WHERE …)clause works insideARRAY_AGGtoo (Postgres 9.4+).
SQL interview question — per-region rollup with five aggregates
Assume orders(order_id, region, customer_id, amount). Return per region: total orders (COUNT(*)), total revenue (SUM(amount)), average order value (AVG(amount)), smallest and largest order amounts (MIN / MAX), and the count of distinct customers (COUNT(DISTINCT customer_id)).
Solution Using GROUP BY + multiple aggregates
Code solution.
SELECT region,
COUNT(*) AS total_orders,
SUM(amount) AS total_revenue,
ROUND(AVG(amount)::numeric, 2) AS avg_order_value,
MIN(amount) AS min_order,
MAX(amount) AS max_order,
COUNT(DISTINCT customer_id) AS distinct_customers
FROM orders
GROUP BY region
ORDER BY total_revenue DESC;
Step-by-step trace.
| step | relation | outcome |
|---|---|---|
| 1 | Scan orders
|
every order row read |
| 2 | GROUP BY region |
hash aggregate; one row per region |
| 3 | Six aggregate calls evaluated per group | each KPI computed in the same pass |
| 4 | ROUND(AVG(amount)::numeric, 2) |
cast to numeric to avoid integer-division surprises; round to 2 decimal places |
| 5 | COUNT(DISTINCT customer_id) |
deduplicates customers per region before counting |
| 6 | Outer ORDER BY total_revenue DESC
|
best-performing regions first |
Output:
| region | total_orders | total_revenue | avg_order_value | min_order | max_order | distinct_customers |
|---|---|---|---|---|---|---|
| US | 142 | 12,500 | 88.03 | 5 | 1,200 | 89 |
| EU | 98 | 8,750 | 89.29 | 8 | 950 | 64 |
| APAC | 64 | 5,400 | 84.38 | 10 | 700 | 42 |
Why this works — concept by concept:
- GROUP BY region — defines the output grain as one row per region; every aggregate is computed within each region partition.
- COUNT(*) counts rows — orders per region, regardless of NULLs.
-
SUM(amount) totals values — every aggregate-as-window aggregation pattern in
sql sum group byfollows this shape. -
AVG(amount) cast to numeric — without the cast,
AVGon an integer column returns truncated integers in MySQL and older Postgres versions; rounding to 2 decimal places normalises the output. -
COUNT(DISTINCT customer_id) deduplicates — one customer with 5 orders contributes 1 to
distinct_customersand 5 tototal_orders; the difference is exactly thecount distinct group by sqlinterview probe. -
Single sequential scan — the planner reads
ordersonce and computes all six aggregates in the same hash aggregate node; far cheaper than six separate filtered queries. -
Cost —
Θ(n)for the scan plusΘ(n)for the hash aggregate keyed onregion; the six aggregate calls are a constant per-row factor.
SQL
Topic — aggregation
Aggregate-function SQL drills
SQL
Topic — aggregations
Multi-aggregate practice
4. WHERE vs HAVING — the execution-order trap
having vs where clause in sql — WHERE filters rows; HAVING filters groups
The canonical execution-order trap: WHERE filters detail rows before GROUP BY collapses them; HAVING filters the collapsed groups after. Get this wrong and your query either errors (WHERE cannot reference aggregates) or returns the right rows from a costlier plan (using HAVING for row-level filters). This is the textbook difference between where and having in sql interview question.
having vs where clause in sql — the headline difference.
-
WHEREruns beforeGROUP BY— it filters raw detail rows; aggregates do not yet exist. -
HAVINGruns afterGROUP BY— it filters the grouped rows; aggregates are fully computed. -
Both clauses can coexist in the same query; the canonical shape is
WHEREfor row predicates,GROUP BY,HAVINGfor aggregate predicates. - The
sql where vs having/where vs having sqlquestion is essentially this one sentence.
The execution order (memorise this).
-
FROM/JOIN→WHERE→GROUP BY→HAVING→ window functions →SELECT→DISTINCT→ORDER BY→LIMIT. -
Implication for
WHERE— it sees every raw column but cannot see aggregates (they don't exist yet). -
Implication for
HAVING— it sees the grouped rows; references to non-grouped columns are illegal (the column doesn't have a single value within the group). -
Implication for aliases — a
SELECTalias cannot be referenced inside the same query'sWHERE,GROUP BY, orHAVING; reference is legal inORDER BYonly.
WHERE cannot reference aggregates.
-
Illegal —
WHERE SUM(amount) > 1000— aggregates evaluate afterGROUP BY, afterWHEREhas already run. - Error message — most engines say "aggregate functions are not allowed in WHERE".
-
Correct fix — move the aggregate predicate to
HAVING:HAVING SUM(amount) > 1000.
HAVING should not reference non-aggregates that are also not grouped.
- Works in some dialects — MySQL is lenient; ANSI SQL is strict.
-
Defeats the purpose — a row-level predicate in
HAVINGruns after the engine has already grouped every row; the planner cannot push it down to filter raw rows earlier. -
Cost — slower plan for the same result. Put row-level filters in
WHEREfor planner efficiency.
difference between where and having in sql — quick decision test.
-
Predicate references an aggregate (
SUM,COUNT,AVG, …) → useHAVING. -
Predicate references only raw columns → use
WHERE. -
Both kinds of predicates → use both clauses;
WHEREfor the row-level ones,HAVINGfor the aggregate ones. -
The
sql having clauseis exclusively for aggregates — interview-grade answer.
Both WHERE and HAVING in one query — the canonical shape.
SELECT region,
SUM(amount) AS total
FROM orders
WHERE order_date >= '2026-01-01' -- pre-filter raw rows
AND status <> 'refund' -- row-level predicate stays in WHERE
GROUP BY region
HAVING SUM(amount) > 10000 -- aggregate predicate goes in HAVING
ORDER BY total DESC;
-
WHEREtrims detail rows before the planner builds the hash aggregate — cheaper. -
GROUP BYcollapses by region. -
HAVINGdrops regions whose total revenue is below10000. -
ORDER BYsorts the surviving regions.
SQL interview question — regions with > $10k 2026 revenue, excluding refunds
Assume orders(order_id, region, order_date, status, amount). Return regions whose total 2026 revenue exceeds $10,000, ignoring refunded orders. Sort by descending revenue.
Solution Using WHERE + GROUP BY + HAVING
Code solution.
SELECT region,
SUM(amount) AS total_revenue
FROM orders
WHERE order_date >= '2026-01-01'
AND order_date < '2027-01-01'
AND status <> 'refund'
GROUP BY region
HAVING SUM(amount) > 10000
ORDER BY total_revenue DESC;
Step-by-step trace.
| step | relation | outcome |
|---|---|---|
| 1 | Scan orders
|
every row read |
| 2 | WHERE order_date BETWEEN '2026-01-01' AND '2027-01-01' AND status <> 'refund' |
drops 2025 orders and refunds before grouping (cheap) |
| 3 | GROUP BY region |
hash aggregate over the surviving rows |
| 4 |
SUM(amount) evaluated per region |
one total per region |
| 5 | HAVING SUM(amount) > 10000 |
drops regions whose total is at or below the threshold |
| 6 | ORDER BY total_revenue DESC |
sorts the survivors |
Output:
| region | total_revenue |
|---|---|
| US | 12,500 |
| EU | 11,200 |
(APAC drops out because its 2026 non-refund revenue is below $10k.)
Why this works — concept by concept:
-
WHERE pre-filters detail rows — the date and status predicates run before
GROUP BY, so the planner builds a smaller hash aggregate; this is the textbookwhere vs having sqlefficiency lesson. - GROUP BY region defines the grain — one row per region in the intermediate result.
-
SUM(amount) computes the per-region total — the aggregate that the
HAVINGclause will then evaluate. -
HAVING filters groups, not rows —
HAVING SUM(amount) > 10000cannot be written inWHEREbecause the aggregate doesn't exist there. -
The
sql having clauseexists specifically for this — it is the only clause where you can predicate on aggregates of the currentGROUP BY. -
Cost —
Θ(n)scan with predicate pushdown thanks toWHERE, thenΘ(k)hash aggregate wherekis the count of surviving rows; theHAVINGfilter isΘ(g)wheregis the number of groups.
SQL
Topic — having-clause
HAVING clause SQL drills
SQL
Topic — aggregation
WHERE + GROUP BY + HAVING patterns
5. GROUP BY with conditional aggregation — SUM(CASE WHEN …) revisited
The bridge to Blog72 — conditional aggregation IS an aggregate inside a GROUP BY
The conditional aggregation pattern from Blog72 — SUM(CASE WHEN cond THEN value ELSE 0 END) — is just an aggregate inside a GROUP BY query. When sql group by having problems ask for "paid vs refunded revenue per region in one query," the answer combines the two ideas: GROUP BY defines the grain, conditional aggregation defines the metrics.
The bridge to Blog72.
-
SUM(CASE WHEN cond THEN val ELSE 0 END)sumsvalonly for rows wherecondis true — see Blog72 §4 for the full anatomy. -
Inside a
GROUP BYquery — one such expression per metric column; all share the sameGROUP BYgrain. -
The
ELSE 0rule — missing it makesSUMignore NULL and silently undercount. -
COUNT(CASE WHEN cond THEN 1 END)andSUM(CASE WHEN cond THEN 1 ELSE 0 END)are equivalent for counting.
Why this matters in interviews.
-
Combines two fluencies at once —
GROUP BYgrain awareness AND branched aggregate logic. - One-pass multi-KPI reports — paid / refunded / pending revenue per region in a single grouped query.
-
No PIVOT operator needed — works on every dialect that supports
CASE. -
Tests for the
ELSE 0discipline — the silent-undercount NULL trap from Blog72.
The canonical pattern — one SUM(CASE WHEN …) per metric, all under one GROUP BY.
SELECT region,
SUM(CASE WHEN status = 'paid' THEN amount ELSE 0 END) AS paid_revenue,
SUM(CASE WHEN status = 'refund' THEN amount ELSE 0 END) AS refunded_amount,
SUM(CASE WHEN status = 'pending' THEN amount ELSE 0 END) AS pending_amount,
SUM(CASE WHEN status = 'paid' THEN 1 ELSE 0 END) AS paid_orders
FROM orders
GROUP BY region
ORDER BY paid_revenue DESC;
Pair with HAVING — filter on any of the conditional aggregates.
-
Filter regions with at least one refund —
HAVING SUM(CASE WHEN status = 'refund' THEN 1 ELSE 0 END) > 0. -
Filter regions with paid revenue > $10k —
HAVING SUM(CASE WHEN status = 'paid' THEN amount ELSE 0 END) > 10000. -
Multi-condition existence —
HAVING SUM(CASE WHEN tier = 'free' THEN 1 ELSE 0 END) > 0 AND SUM(CASE WHEN tier = 'paid' THEN 1 ELSE 0 END) > 0keeps groups that have both free and paid users.
Worked example — paid / refunded / pending revenue per region
Question. From orders(order_id, region, status, amount), compute per region: paid revenue, refunded amount, pending amount, and count of paid orders, all in one grouped query.
Code.
SELECT region,
SUM(CASE WHEN status = 'paid' THEN amount ELSE 0 END) AS paid_revenue,
SUM(CASE WHEN status = 'refund' THEN amount ELSE 0 END) AS refunded_amount,
SUM(CASE WHEN status = 'pending' THEN amount ELSE 0 END) AS pending_amount,
SUM(CASE WHEN status = 'paid' THEN 1 ELSE 0 END) AS paid_orders
FROM orders
GROUP BY region;
Output.
| region | paid_revenue | refunded_amount | pending_amount | paid_orders |
|---|---|---|---|---|
| US | 11,200 | 1,300 | 0 | 142 |
| EU | 7,900 | 850 | 200 | 98 |
| APAC | 5,100 | 300 | 0 | 64 |
Rule of thumb: any time the question is "split metric X by category Y in the same grouped query", combine GROUP BY with SUM(CASE WHEN …) — that's the bridge between this blog and Blog72.
SQL
Topic — conditional-aggregation
Conditional aggregation drills
SQL
Topic — data-aggregation
Data aggregation library
6. GROUP BY ROLLUP, CUBE, and GROUPING SETS — multi-level subtotals
rollup cube grouping sets sql — one query, multiple grain levels
Once you understand single and multi-column GROUP BY, the next layer is producing multiple grain levels in one query. GROUPING SETS, ROLLUP, and CUBE extend GROUP BY to emit subtotals and grand totals without UNION-ing many separate queries.
The three advanced extensions.
-
GROUPING SETS ((a, b), (a), ())— explicit list of group combinations; total flexibility. -
ROLLUP(a, b, c)— hierarchical subtotals from finest grain to grand total; producesn+1grain levels forncolumns. -
CUBE(a, b, c)— every combination of the listed columns; produces2^ngrain levels.
GROUPING SETS — fully explicit.
-
Shape —
GROUP BY GROUPING SETS ((col1, col2), (col1), (col2), ()). -
What it does — runs the query once for each parenthesised tuple and
UNION ALLs the results. -
()is the empty grouping — produces one row with everything aggregated (grand total). -
Use it when — you need a custom mix of grain levels that doesn't match
ROLLUPorCUBE.
ROLLUP — hierarchical subtotals.
-
Shape —
GROUP BY ROLLUP(a, b, c). -
Equivalent to —
GROUPING SETS ((a, b, c), (a, b), (a), ()). -
Grain count —
n+1levels forncolumns: finest grain, peel off the rightmost column each step, plus the grand total. - Use it for — year → quarter → month subtotals with a grand total; geography → country → region rollups.
CUBE — combinatorial subtotals.
-
Shape —
GROUP BY CUBE(a, b, c). -
Equivalent to —
GROUPING SETS ((a, b, c), (a, b), (a, c), (b, c), (a), (b), (c), ()). -
Grain count —
2^nlevels forncolumns; explodes fast. - Use it for — exploratory OLAP analysis where every combination matters; pivot reports with marginal totals.
GROUPING(col) helper.
-
Returns
1when the column was rolled-up (the row's column value isNULLas part of the rollup). -
Returns
0when the row has a real value for that column. -
Use it to label rows —
CASE WHEN GROUPING(quarter) = 1 THEN 'year_total' ELSE quarter::text END.
Use cases.
- Finance subtotals — year → quarter → month → grand total (ROLLUP).
- Pivot reports with marginal totals — every row + every column + the corner cell (CUBE).
- OLAP dashboards — pre-computed grain levels for fast drill-down.
- Dashboard exports — a single query produces the body + the subtotal lines + the grand total.
Dialect support.
-
PostgreSQL, SQL Server, Oracle, BigQuery, Snowflake — all support
ROLLUP,CUBE,GROUPING SETSin the standard syntax. -
MySQL — supports the legacy
WITH ROLLUPsyntax (GROUP BY a, b WITH ROLLUP); notCUBEorGROUPING SETS. -
SQL Server limit — caps at 32 expressions per
ROLLUP/CUBE.
Worked example — revenue by (year, quarter) with year subtotals and grand total
Question. From orders(order_date, amount), compute total revenue at three grain levels in one query: per (year, quarter), per (year), and the grand total.
Code.
SELECT EXTRACT(YEAR FROM order_date) AS year,
EXTRACT(QUARTER FROM order_date) AS quarter,
SUM(amount) AS revenue
FROM orders
GROUP BY ROLLUP(
EXTRACT(YEAR FROM order_date),
EXTRACT(QUARTER FROM order_date)
)
ORDER BY year NULLS LAST, quarter NULLS LAST;
Output.
| year | quarter | revenue |
|---|---|---|
| 2026 | 1 | 5,200 |
| 2026 | 2 | 6,800 |
| 2026 | 3 | 7,400 |
| 2026 | 4 | 8,100 |
| 2026 | NULL | 27,500 |
| NULL | NULL | 27,500 |
- Rows 1–4 are the per-quarter detail.
- Row 5 (
quarter = NULL) is the year subtotal for 2026. - Row 6 (both NULL) is the grand total.
Rule of thumb: if the report needs subtotals plus a grand total in one query, reach for ROLLUP; if it needs every combination, reach for CUBE; if it needs an irregular custom mix, reach for GROUPING SETS.
SQL
Topic — aggregations
Multi-level GROUP BY drills
SQL
Topic — data-aggregation
ROLLUP / CUBE / GROUPING SETS patterns
7. GROUP BY interview gotchas — fan-out, non-aggregate columns, NULL groups
The six bugs interviewers test most often
GROUP BY has a small surface area but a long tail of edge cases that fail candidates. These are the six gotchas reviewers test most often.
Fan-out after a JOIN.
- The bug — joining a one-to-many relationship and aggregating afterwards multiplies the SUM.
-
Example —
orders LEFT JOIN order_itemsthenSUM(orders.amount)triples the amount when each order has three items. - Defence — pre-aggregate inside a CTE to the parent grain before joining; or join after aggregating.
-
State the grain aloud — "one row per
(order_id, item_id)" — every reviewer expects this.
Non-aggregate column error.
-
The bug —
SELECT region, customer_id, SUM(amount) FROM orders GROUP BY region—customer_idis neither inGROUP BYnor an aggregate. - ANSI SQL raises an error.
-
mysql group bypre-8.0 silently picked an arbitrarycustomer_idper group (notorious data-quality bug source). -
Modern MySQL with
ONLY_FULL_GROUP_BYenforces the rule. -
Fix — wrap in
MIN(customer_id),MAX(customer_id),ANY_VALUE(customer_id), or add toGROUP BY.
COUNT(*) vs COUNT(col) confusion.
-
The bug — using
COUNT(col)to count rows whencolis sometimes NULL; the count silently drops the NULL rows. -
Subtle — query "looks right" and returns a plausible number; only an audit against
COUNT(*)catches it. -
Fix — use
COUNT(*)for row counts; useCOUNT(col)only when you mean "rows wherecolis not NULL"; useCOUNT(DISTINCT col)for unique values. -
The
count distinct group by sqltest — interviewer hands you a table with NULLs and asks "how many customers per region?"; correct answer isCOUNT(DISTINCT customer_id), notCOUNT(*)orCOUNT(customer_id)if you want unique customers.
SUM over empty group returns NULL, not 0.
-
The bug —
LEFT JOINproduces empty groups;SUMover them returnsNULL; downstream consumers expect0. -
Symptom — dashboards render "—" or blank instead of
0; downstream arithmetic produces unexpected NULLs. -
Fix —
COALESCE(SUM(x), 0)— pairs cleanly with the safe-arithmetic patterns from Blog72.
NULL group key behaviour.
-
The behaviour — rows with
NULLin aGROUP BYcolumn form their own group (one group for all NULL rows together). -
Subtle — when downstream code joins back on the group key,
NULL = NULLisUNKNOWNand the NULL group row never matches. -
Fix —
COALESCE(group_col, 'UNKNOWN')if you want NULLs to display under a labelled bucket.
GROUP BY alias rules and sql group by month patterns.
-
The rule — a
SELECTalias is NOT visible inside the same query'sGROUP BY(logical processing order). -
Workaround #1 — repeat the expression in
GROUP BY:SELECT DATE_TRUNC('month', t) AS month … GROUP BY DATE_TRUNC('month', t). -
Workaround #2 — positional shorthand:
GROUP BY 1refers to the firstSELECTcolumn. - Workaround #3 — wrap in a CTE: define the alias in the inner query, reference it in the outer.
-
The
sql group by monthpattern ships with all three workarounds in production code.
SQL
Topic — group-by
GROUP BY gotcha drills
SQL
Topic — having-clause
HAVING gotcha drills
Choosing where to filter (cheat sheet)
A one-screen cheat sheet for using GROUP BY and HAVING in SQL — pick the clause that matches your predicate type, then copy the corresponding shape into your query.
| Predicate references … | Use clause … | Why |
|---|---|---|
| Only raw row columns (date, status, user_id) | WHERE |
Runs before GROUP BY; planner can push down |
An aggregate (SUM, COUNT, AVG, MIN, MAX) |
HAVING |
Runs after GROUP BY; aggregates exist by then |
| Both raw columns AND aggregates |
WHERE + HAVING
|
Separate them; row filters in WHERE, group filters in HAVING
|
| A derived bucket label (CASE expression) |
WHERE or GROUP BY
|
Repeat the CASE expression — alias not visible in WHERE / GROUP BY
|
| A row that should be counted only when condition holds | SUM(CASE WHEN cond THEN 1 ELSE 0 END) |
Conditional aggregation pattern from Blog72 |
| A specific group should be excluded by an aggregate threshold | HAVING <aggregate predicate> |
E.g. HAVING COUNT(*) >= 10 for "groups with ≥ 10 rows" |
| The output sort |
ORDER BY (after SELECT) |
ORDER BY can reference the SELECT alias |
Frequently asked questions
What does GROUP BY do in SQL?
GROUP BY collapses rows that share the same key values into one row per distinct combination of the listed columns, and aggregate functions populate the summary columns. The output grain equals the GROUP BY column combination — GROUP BY region returns one row per region, GROUP BY region, channel returns one row per (region, channel) pair. Every non-aggregate column in the SELECT list must appear in GROUP BY (ANSI rule, modern MySQL with ONLY_FULL_GROUP_BY enforces this), and rows with NULL in a GROUP BY column form their own group together.
What's the difference between WHERE and HAVING in SQL?
WHERE filters rows before GROUP BY collapses them; HAVING filters groups after the collapse. The execution order is FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY, so WHERE cannot reference aggregates (they don't exist yet), and HAVING is the only place you can predicate on SUM, COUNT, AVG, etc. against the current GROUP BY. The canonical interview answer to difference between where and having in sql is: row-level predicates go in WHERE (cheaper, planner can push them down); aggregate predicates go in HAVING (no other clause can reference them).
How do I use GROUP BY and HAVING together in SQL?
Use them together in the canonical shape: WHERE for row-level filters, GROUP BY for the grain, then HAVING for aggregate predicates. For example, "find regions with 2026 revenue over $10k, excluding refunds" becomes WHERE order_date >= '2026-01-01' AND status <> 'refund' GROUP BY region HAVING SUM(amount) > 10000. The group by and having in sql pattern lets you express "filter rows, then group, then filter groups" without breaking the query into multiple steps. Reviewers test this specifically because it conflates two filtering concepts that share a clause name in many languages but are distinct in SQL.
What's the difference between COUNT(*), COUNT(col), and COUNT(DISTINCT col)?
COUNT(*) counts every row in the group, including rows with NULLs in any column. COUNT(col) counts rows where col is not NULL; it silently skips NULL rows. COUNT(DISTINCT col) counts unique non-NULL values of col. The sql count group by interview question typically wants COUNT(*) for "how many rows landed in this group" and COUNT(DISTINCT customer_id) for "how many unique customers". A common bug is using COUNT(col) when you meant COUNT(*) and silently undercounting rows where col is NULL — audit against COUNT(*) to catch this.
What are ROLLUP, CUBE, and GROUPING SETS in SQL?
The rollup cube grouping sets sql family extends GROUP BY to produce multiple grain levels in one query. GROUPING SETS ((a, b), (a), ()) runs the query once per parenthesised tuple and UNION ALLs the results. ROLLUP(a, b, c) produces hierarchical subtotals — finest grain, then peel off the rightmost column each step, plus the grand total — n+1 levels for n columns. CUBE(a, b, c) produces every combination — 2^n levels. Use ROLLUP for finance subtotals (year → quarter → month → grand total), CUBE for exploratory OLAP, and GROUPING SETS for custom mixes. Supported by Postgres, SQL Server, Oracle, BigQuery, Snowflake; MySQL has legacy WITH ROLLUP.
Practice on PipeCode
PipeCode ships 450+ data-engineering interview problems — including PostgreSQL-first SQL practice keyed to GROUP BY, HAVING, aggregate functions, conditional aggregation, and the rollup cube grouping sets sql family. Whether you are drilling sql interview questions with answers for sql for data engineers loops or grinding through data engineering interview questions end-to-end, the practice library mirrors the same row-collapse mental model this guide teaches.
Kick off via Explore practice →; drill the dedicated GROUP BY SQL lane →; fan out into the HAVING clause SQL lane →; reinforce aggregation SQL drills →; rehearse conditional aggregation problems →; widen coverage on the full SQL practice library →.





Top comments (0)