Gowtham Potureddi

Posted on May 22

GROUP BY and HAVING in SQL: Aggregation Patterns for Interviews

#python #sql #interview #dataengineering

GROUP BY in SQL collapses many detail rows into one row per distinct key combination, sql aggregate functions (COUNT, SUM, AVG, MIN, MAX) compute a summary value per group, and the HAVING clause filters those groups after the collapse — while WHERE filters raw rows before. That single sentence answers most of the sql interview questions in the group by sql / sql group by cluster, and getting the having clause sql semantics right separates candidates who pass screening from candidates who don't.

This guide walks through every clause in the GROUP BY / HAVING pipeline that reviewers love to test in data engineering interview questions: the row-collapse mental model, the full aggregate functions in sql catalogue, the canonical having vs where clause in sql execution-order trap, group by multiple columns sql with multi-key grain, conditional aggregation tied to Blog72, advanced rollup cube grouping sets sql subtotals, and the seven interview gotchas (fan-out, non-aggregate columns, NULL groups) that fail most candidates. Every section ends as sql interview questions with answers: a runnable PostgreSQL query, a traced execution, an output table, and a concept-by-concept why this works breakdown — the exact shape sql for data engineers rounds reward when aggregation in sql comes up.

When you want hands-on reps immediately after reading, browse GROUP BY SQL practice →, drill the HAVING clause SQL lane →, sharpen aggregation SQL drills →, rehearse conditional aggregation problems →, or widen coverage on the full SQL practice library →.

On this page

Why GROUP BY and HAVING matter in data engineering interviews
Anatomy of GROUP BY — the row-collapse mental model
SQL aggregate functions — COUNT, SUM, AVG, MIN, MAX, COUNT DISTINCT
WHERE vs HAVING — the execution-order trap
GROUP BY with conditional aggregation — SUM(CASE WHEN …) revisited
GROUP BY ROLLUP, CUBE, and GROUPING SETS — multi-level subtotals
GROUP BY interview gotchas — fan-out, non-aggregate columns, NULL groups
Choosing where to filter (cheat sheet)
Frequently asked questions
Practice on PipeCode

1. Why GROUP BY and HAVING matter in data engineering interviews

`group by sql` collapses; aggregates summarise; `having sql` filters the groups

The one-sentence invariant interviewers chase: group by sql collapses rows that share the same key values into one row per group; aggregate functions compute a summary value per group; having sql filters those groups after the collapse, while WHERE filters rows *before*. If you can say that aloud and pick the right clause for each predicate, you have already cleared the bar most candidates fail to clear when sql group by having shows up on a whiteboard.

What GROUP BY does at a glance.

Definition — collapses rows that share the listed column values into one output row per distinct combination.
Required syntax — SELECT <cols> FROM <table> GROUP BY <col1>, <col2>, ….
Output grain — exactly one row per distinct combination of the GROUP BY columns.
Non-aggregate column rule — every column in SELECT must either appear in GROUP BY or be wrapped inside an aggregate; ANSI SQL enforces this and modern mysql group by does too once ONLY_FULL_GROUP_BY is on.

What HAVING does at a glance.

Definition — filters the grouped rows; usually references aggregate functions.
Required syntax — … GROUP BY <cols> HAVING <predicate>.
HAVING can reference aggregates — WHERE cannot.
Both WHERE and HAVING can coexist in one query; the canonical interview shape is WHERE for row filters, GROUP BY, then HAVING for group filters.
The phrase having by in sql (a common misnomer for the HAVING clause) refers to this same construct.

The execution pipeline you must memorise.

The SQL clauses run in this order regardless of how you write them:
- FROM / JOIN — assemble the row source.
- WHERE — pre-filter the detail rows.
- GROUP BY — collapse rows into one row per group.
- HAVING — filter the grouped rows.
- window functions — decorate the grouped rows.
- SELECT — project the final columns (including aggregate aliases).
- ORDER BY — sort the final output.
- LIMIT — truncate.
Consequence #1 — WHERE cannot reference an aggregate (aggregates don't exist yet).
Consequence #2 — HAVING can reference aggregates and is the right place for SUM(x) > 1000-style predicates.
Consequence #3 — the SELECT alias is NOT visible in WHERE / GROUP BY / HAVING (alias does not yet exist).

Why interviewers love this combo.

Tests row-grain awareness — knowing that GROUP BY region produces one row per region.
Tests aggregate semantics — understanding COUNT(*) vs COUNT(col) vs COUNT(DISTINCT col).
Tests execution-order understanding — picking WHERE vs HAVING correctly.
Tests fan-out discipline — knowing not to aggregate before deduping a one-to-many join.

What interviewers listen for.

Do you name the grain of the output explicitly before writing the query? — "this produces one row per (region, channel) combination."
Do you put row-level predicates in WHERE and aggregate predicates in HAVING? — the canonical separation.
Do you know the execution order when challenged on WHERE vs HAVING? — senior signal.
Do you reach for the right aggregate for the question? — SUM for totals, COUNT(*) for row counts, COUNT(DISTINCT col) for unique counts, AVG for means.

Worked example — orders rolled up per region

Question. From orders(order_id, region, amount), return per region: the number of orders and the total revenue. Order the result by total revenue descending.

Input. orders slice.

order_id	region	amount
1	US	100
2	US	150
3	EU	80
4	US	200
5	EU	120

Code.

SELECT region,
       COUNT(*)    AS order_count,
       SUM(amount) AS total_revenue
FROM orders
GROUP BY region
ORDER BY total_revenue DESC;

Step-by-step explanation.

FROM orders reads every row.
GROUP BY region collapses rows that share the same region into one group.
COUNT(*) counts the rows inside each group; SUM(amount) totals the amounts.
The output grain is exactly one row per distinct region.
ORDER BY total_revenue DESC references the SELECT alias — legal because ORDER BY runs after SELECT.

Output.

region	order_count	total_revenue
US	3	450
EU	2	200

Rule of thumb: when the question asks "per X", write GROUP BY X first, then pick aggregates for the metrics — that's the heart of sql group by and the cleanest way to learn aggregation in sql.

SQL
Topic — group-by
GROUP BY SQL drills

Practice →

SQL
Topic — having-clause
HAVING clause SQL problems

Practice →

2. Anatomy of GROUP BY — the row-collapse mental model

`what does group by do in sql` — collapses key-sharing rows into one row per group

The mental model: think of GROUP BY as a row-collapse operator. You start with many detail rows; you end with one row per distinct value of the GROUP BY columns, and aggregate functions populate the summary columns. That's how does group by work in sql in one sentence.

what does group by do in sql — the one-line answer.

Collapses rows that share the listed column values into one output row per distinct combination.
Aggregate functions populate the non-grouped output columns.
Detail-level information is lost unless preserved by an aggregate (MIN, MAX, STRING_AGG, etc.).
The output grain equals the GROUP BY column combination.

Single column GROUP BY.

Shape — GROUP BY <single_column>.
Output grain — one row per distinct value of <single_column>.
Example use cases — orders per region, signups per day, revenue per product.
NULL is a group — all rows where the column is NULL form their own group.

group by multiple columns sql — multi-key grouping.

Shape — GROUP BY <col1>, <col2>, <col3>.
Output grain — one row per distinct combination of all listed columns.
Order in GROUP BY does not change the result — only the planner's sort/hash choice.
Positional shorthand — GROUP BY 1, 2 refers to the first and second SELECT columns (works in Postgres, MySQL, Snowflake, BigQuery; not SQL Server).
Example — GROUP BY region, channel produces one row per (region, channel) pair.

GROUP BY on expressions.

GROUP BY accepts any scalar expression — not just bare columns.
Date truncation — GROUP BY DATE_TRUNC('month', sale_date) produces sql group by month shape (one row per calendar month).
CASE expressions — GROUP BY CASE WHEN amount >= 100 THEN 'big' ELSE 'small' END groups by a derived bucket.
GROUP BY 1 positional — handy for terse queries; less safe than explicit column names in long-lived code.
group by in postgresql quirk — Postgres lets you GROUP BY on a SELECT alias in some versions, unlike standard SQL.

The non-aggregate column rule.

Every non-aggregate SELECT column must appear in GROUP BY — ANSI SQL requirement.
MySQL pre-8.0 historically allowed silently picking an arbitrary value from the group (a notorious bug source).
Modern mysql group by enforces this when sql_mode includes ONLY_FULL_GROUP_BY (the default since MySQL 5.7.5).
Workaround — wrap the offending column in MIN(col), MAX(col), or ANY_VALUE(col) to satisfy the rule.

NULL is a valid group key.

Rows with NULL in a GROUP BY column form their own group together.
NULL = NULL is UNKNOWN in normal predicates, but for GROUP BY purposes NULL is treated as a single distinct value.
One NULL group per NULL column combination — (NULL, 'US') is distinct from ('CA', 'US') and from (NULL, NULL).

Worked example — `sql group by month` for time-series rollup

Question. From sales(sale_date, amount), compute total revenue per calendar month for 2026.

Input. sales slice.

sale_date	amount
2026-01-05	100
2026-01-20	200
2026-02-03	150
2026-02-18	250
2026-03-10	175

Code.

SELECT DATE_TRUNC('month', sale_date)::date AS month,
       SUM(amount)                          AS monthly_revenue
FROM sales
WHERE sale_date >= '2026-01-01'
  AND sale_date <  '2027-01-01'
GROUP BY DATE_TRUNC('month', sale_date)
ORDER BY 1;

Step-by-step explanation.

WHERE sale_date BETWEEN … pre-filters detail rows before grouping (cheaper).
DATE_TRUNC('month', sale_date) rounds each timestamp down to the first of its month.
GROUP BY DATE_TRUNC('month', sale_date) collapses rows that share the same month.
SUM(amount) totals the amounts per month.
ORDER BY 1 sorts by the first column (the truncated month).

Output.

month	monthly_revenue
2026-01-01	300
2026-02-01	400
2026-03-01	175

Rule of thumb: whenever the question says "by month" / "by quarter" / "by week", reach for DATE_TRUNC (Postgres) or the equivalent (DATE_FORMAT in MySQL, DATETRUNC in SQL Server, TRUNC in Oracle) and put it inside GROUP BY.

SQL
Topic — group-by
Multi-column GROUP BY drills

Practice →

SQL
Topic — aggregation
Aggregation SQL library

Practice →

3. SQL aggregate functions — COUNT, SUM, AVG, MIN, MAX, COUNT DISTINCT

The `sql aggregate functions` catalogue — five core plus the COUNT-DISTINCT family

The sql aggregate functions you actually use in interviews fit on one screen: COUNT, SUM, AVG, MIN, MAX, plus the COUNT(DISTINCT col) variant and the string/array aggregates. Knowing the NULL semantics of each is what separates juniors from seniors.

The aggregate functions in sql catalogue at a glance.

COUNT(*) — counts every row in the group, including rows with NULLs in any column.
COUNT(col) — counts rows where col IS NOT NULL; skips NULL rows silently.
COUNT(DISTINCT col) — counts unique non-NULL values of col; the count distinct group by sql interview classic.
SUM(col) — adds non-NULL numeric values; returns NULL when every value in the group is NULL.
AVG(col) — arithmetic mean of non-NULL values; ignores NULL.
MIN(col) / MAX(col) — smallest / largest non-NULL value across the group.
STRING_AGG(col, delim) — concatenates non-NULL values into one string per group (Postgres / SQL Server). Dialect aliases: GROUP_CONCAT (MySQL), LISTAGG (Snowflake / Oracle).
ARRAY_AGG(col) — collects values into an array per group (Postgres / BigQuery / Snowflake).
Statistical aggregates — STDDEV, VARIANCE, PERCENTILE_CONT, PERCENTILE_DISC (dialect-dependent).

COUNT(*) vs COUNT(col) vs COUNT(DISTINCT col) — the sql count group by interview test.

COUNT(*) counts rows; a row contributes 1 even if every column is NULL. Use this for "how many rows landed in this group".
COUNT(col) counts rows where col is not NULL. Use this for "how many of these rows had a recorded col".
COUNT(DISTINCT col) counts distinct non-NULL values of col. Use this for "how many unique X showed up".
Common bug — using COUNT(col) when you meant COUNT(*), and silently undercounting rows where col is NULL.
Postgres / Snowflake — both fully support COUNT(DISTINCT). BigQuery legacy — uses APPROX_COUNT_DISTINCT; modern BigQuery supports exact COUNT(DISTINCT).

SUM semantics — the sql sum group by pattern.

Adds non-NULL values per group; treats NULL as zero in the sense of "skip it", not "add zero".
Empty group returns NULL, not 0 — guard with COALESCE(SUM(x), 0) if the downstream consumer expects zero.
Overflow — large SUMs on INTEGER columns can overflow; cast to BIGINT or NUMERIC if needed.
Combined with GROUP BY — the canonical pattern SELECT region, SUM(amount) FROM orders GROUP BY region.

AVG semantics.

Arithmetic mean of non-NULL values per group; (SUM / COUNT(col)).
Empty group returns NULL.
Integer division trap — AVG on an integer column returns an integer in some dialects (MySQL); cast to DECIMAL or NUMERIC for fractional means: AVG(col::numeric).

MIN / MAX semantics.

Smallest / largest non-NULL value across the group.
Works on any orderable type — numbers, strings (lexicographic), dates, timestamps.
Empty group returns NULL.
Useful pattern — MIN(created_at) to find each group's earliest record; MAX(updated_at) for the latest.

String aggregates — joining rows into one string per group.

PostgreSQL / SQL Server — STRING_AGG(col, ', '). Add ORDER BY inside for sorted concatenation: STRING_AGG(name, ', ' ORDER BY name).
MySQL — GROUP_CONCAT(col SEPARATOR ', ' ORDER BY col). Has a default length limit (group_concat_max_len); raise it for long results.
Snowflake / Oracle — LISTAGG(col, ', ') WITHIN GROUP (ORDER BY col).
Use case — "comma-separated list of customer names per region for the sales team".

Array aggregates.

Postgres / BigQuery / Snowflake — ARRAY_AGG(col ORDER BY col).
Returns an array column — useful for JSON exports or downstream UNNEST operations.
ARRAY_AGG(DISTINCT col) — deduplicate before collecting.
FILTER (WHERE …) clause works inside ARRAY_AGG too (Postgres 9.4+).

SQL interview question — per-region rollup with five aggregates

Assume orders(order_id, region, customer_id, amount). Return per region: total orders (COUNT(*)), total revenue (SUM(amount)), average order value (AVG(amount)), smallest and largest order amounts (MIN / MAX), and the count of distinct customers (COUNT(DISTINCT customer_id)).

Solution Using GROUP BY + multiple aggregates

Code solution.

SELECT region,
       COUNT(*)                       AS total_orders,
       SUM(amount)                    AS total_revenue,
       ROUND(AVG(amount)::numeric, 2) AS avg_order_value,
       MIN(amount)                    AS min_order,
       MAX(amount)                    AS max_order,
       COUNT(DISTINCT customer_id)    AS distinct_customers
FROM orders
GROUP BY region
ORDER BY total_revenue DESC;

Step-by-step trace.

step	relation	outcome
1	Scan `orders`	every order row read
2	`GROUP BY region`	hash aggregate; one row per region
3	Six aggregate calls evaluated per group	each KPI computed in the same pass
4	`ROUND(AVG(amount)::numeric, 2)`	cast to numeric to avoid integer-division surprises; round to 2 decimal places
5	`COUNT(DISTINCT customer_id)`	deduplicates customers per region before counting
6	Outer `ORDER BY total_revenue DESC`	best-performing regions first

Output:

region	total_orders	total_revenue	avg_order_value	min_order	max_order	distinct_customers
US	142	12,500	88.03	5	1,200	89
EU	98	8,750	89.29	8	950	64
APAC	64	5,400	84.38	10	700	42

Why this works — concept by concept:

GROUP BY region — defines the output grain as one row per region; every aggregate is computed within each region partition.
COUNT(*) counts rows — orders per region, regardless of NULLs.
SUM(amount) totals values — every aggregate-as-window aggregation pattern in sql sum group by follows this shape.
AVG(amount) cast to numeric — without the cast, AVG on an integer column returns truncated integers in MySQL and older Postgres versions; rounding to 2 decimal places normalises the output.
COUNT(DISTINCT customer_id) deduplicates — one customer with 5 orders contributes 1 to distinct_customers and 5 to total_orders; the difference is exactly the count distinct group by sql interview probe.
Single sequential scan — the planner reads orders once and computes all six aggregates in the same hash aggregate node; far cheaper than six separate filtered queries.
Cost — Θ(n) for the scan plus Θ(n) for the hash aggregate keyed on region; the six aggregate calls are a constant per-row factor.

SQL
Topic — aggregation
Aggregate-function SQL drills

Practice →

SQL
Topic — aggregations
Multi-aggregate practice

Practice →

4. WHERE vs HAVING — the execution-order trap

`having vs where clause in sql` — `WHERE` filters rows; `HAVING` filters groups

The canonical execution-order trap: WHERE filters detail rows before GROUP BY collapses them; HAVING filters the collapsed groups after. Get this wrong and your query either errors (WHERE cannot reference aggregates) or returns the right rows from a costlier plan (using HAVING for row-level filters). This is the textbook difference between where and having in sql interview question.

having vs where clause in sql — the headline difference.

WHERE runs before GROUP BY — it filters raw detail rows; aggregates do not yet exist.
HAVING runs after GROUP BY — it filters the grouped rows; aggregates are fully computed.
Both clauses can coexist in the same query; the canonical shape is WHERE for row predicates, GROUP BY, HAVING for aggregate predicates.
The sql where vs having / where vs having sql question is essentially this one sentence.

The execution order (memorise this).

FROM / JOIN → WHERE → GROUP BY → HAVING → window functions → SELECT → DISTINCT → ORDER BY → LIMIT.
Implication for WHERE — it sees every raw column but cannot see aggregates (they don't exist yet).
Implication for HAVING — it sees the grouped rows; references to non-grouped columns are illegal (the column doesn't have a single value within the group).
Implication for aliases — a SELECT alias cannot be referenced inside the same query's WHERE, GROUP BY, or HAVING; reference is legal in ORDER BY only.

WHERE cannot reference aggregates.

Illegal — WHERE SUM(amount) > 1000 — aggregates evaluate after GROUP BY, after WHERE has already run.
Error message — most engines say "aggregate functions are not allowed in WHERE".
Correct fix — move the aggregate predicate to HAVING: HAVING SUM(amount) > 1000.

HAVING should not reference non-aggregates that are also not grouped.

Works in some dialects — MySQL is lenient; ANSI SQL is strict.
Defeats the purpose — a row-level predicate in HAVING runs after the engine has already grouped every row; the planner cannot push it down to filter raw rows earlier.
Cost — slower plan for the same result. Put row-level filters in WHERE for planner efficiency.

difference between where and having in sql — quick decision test.

Predicate references an aggregate (SUM, COUNT, AVG, …) → use HAVING.
Predicate references only raw columns → use WHERE.
Both kinds of predicates → use both clauses; WHERE for the row-level ones, HAVING for the aggregate ones.
The sql having clause is exclusively for aggregates — interview-grade answer.

Both WHERE and HAVING in one query — the canonical shape.

SELECT region,
       SUM(amount) AS total
FROM orders
WHERE order_date >= '2026-01-01'   -- pre-filter raw rows
  AND status     <> 'refund'        -- row-level predicate stays in WHERE
GROUP BY region
HAVING SUM(amount) > 10000          -- aggregate predicate goes in HAVING
ORDER BY total DESC;

WHERE trims detail rows before the planner builds the hash aggregate — cheaper.
GROUP BY collapses by region.
HAVING drops regions whose total revenue is below 10000.
ORDER BY sorts the surviving regions.

SQL interview question — regions with > $10k 2026 revenue, excluding refunds

Assume orders(order_id, region, order_date, status, amount). Return regions whose total 2026 revenue exceeds $10,000, ignoring refunded orders. Sort by descending revenue.

Solution Using WHERE + GROUP BY + HAVING

Code solution.

SELECT region,
       SUM(amount) AS total_revenue
FROM orders
WHERE order_date >= '2026-01-01'
  AND order_date <  '2027-01-01'
  AND status     <> 'refund'
GROUP BY region
HAVING SUM(amount) > 10000
ORDER BY total_revenue DESC;

Step-by-step trace.

step	relation	outcome
1	Scan `orders`	every row read
2	`WHERE order_date BETWEEN '2026-01-01' AND '2027-01-01' AND status <> 'refund'`	drops 2025 orders and refunds before grouping (cheap)
3	`GROUP BY region`	hash aggregate over the surviving rows
4	`SUM(amount)` evaluated per region	one total per region
5	`HAVING SUM(amount) > 10000`	drops regions whose total is at or below the threshold
6	`ORDER BY total_revenue DESC`	sorts the survivors

Output:

region	total_revenue
US	12,500
EU	11,200

(APAC drops out because its 2026 non-refund revenue is below $10k.)

Why this works — concept by concept:

WHERE pre-filters detail rows — the date and status predicates run before GROUP BY, so the planner builds a smaller hash aggregate; this is the textbook where vs having sql efficiency lesson.
GROUP BY region defines the grain — one row per region in the intermediate result.
SUM(amount) computes the per-region total — the aggregate that the HAVING clause will then evaluate.
HAVING filters groups, not rows — HAVING SUM(amount) > 10000 cannot be written in WHERE because the aggregate doesn't exist there.
The sql having clause exists specifically for this — it is the only clause where you can predicate on aggregates of the current GROUP BY.
Cost — Θ(n) scan with predicate pushdown thanks to WHERE, then Θ(k) hash aggregate where k is the count of surviving rows; the HAVING filter is Θ(g) where g is the number of groups.

SQL
Topic — having-clause
HAVING clause SQL drills

Practice →

SQL
Topic — aggregation
WHERE + GROUP BY + HAVING patterns

Practice →

5. GROUP BY with conditional aggregation — SUM(CASE WHEN …) revisited

The bridge to Blog72 — conditional aggregation IS an aggregate inside a GROUP BY

The conditional aggregation pattern from Blog72 — SUM(CASE WHEN cond THEN value ELSE 0 END) — is just an aggregate inside a GROUP BY query. When sql group by having problems ask for "paid vs refunded revenue per region in one query," the answer combines the two ideas: GROUP BY defines the grain, conditional aggregation defines the metrics.

The bridge to Blog72.

SUM(CASE WHEN cond THEN val ELSE 0 END) sums val only for rows where cond is true — see Blog72 §4 for the full anatomy.
Inside a GROUP BY query — one such expression per metric column; all share the same GROUP BY grain.
The ELSE 0 rule — missing it makes SUM ignore NULL and silently undercount.
COUNT(CASE WHEN cond THEN 1 END) and SUM(CASE WHEN cond THEN 1 ELSE 0 END) are equivalent for counting.

Why this matters in interviews.

Combines two fluencies at once — GROUP BY grain awareness AND branched aggregate logic.
One-pass multi-KPI reports — paid / refunded / pending revenue per region in a single grouped query.
No PIVOT operator needed — works on every dialect that supports CASE.
Tests for the ELSE 0 discipline — the silent-undercount NULL trap from Blog72.

The canonical pattern — one SUM(CASE WHEN …) per metric, all under one GROUP BY.

SELECT region,
       SUM(CASE WHEN status = 'paid'    THEN amount ELSE 0 END) AS paid_revenue,
       SUM(CASE WHEN status = 'refund'  THEN amount ELSE 0 END) AS refunded_amount,
       SUM(CASE WHEN status = 'pending' THEN amount ELSE 0 END) AS pending_amount,
       SUM(CASE WHEN status = 'paid'    THEN 1      ELSE 0 END) AS paid_orders
FROM orders
GROUP BY region
ORDER BY paid_revenue DESC;

Pair with HAVING — filter on any of the conditional aggregates.

Filter regions with at least one refund — HAVING SUM(CASE WHEN status = 'refund' THEN 1 ELSE 0 END) > 0.
Filter regions with paid revenue > $10k — HAVING SUM(CASE WHEN status = 'paid' THEN amount ELSE 0 END) > 10000.
Multi-condition existence — HAVING SUM(CASE WHEN tier = 'free' THEN 1 ELSE 0 END) > 0 AND SUM(CASE WHEN tier = 'paid' THEN 1 ELSE 0 END) > 0 keeps groups that have both free and paid users.

Worked example — paid / refunded / pending revenue per region

Question. From orders(order_id, region, status, amount), compute per region: paid revenue, refunded amount, pending amount, and count of paid orders, all in one grouped query.

Code.

SELECT region,
       SUM(CASE WHEN status = 'paid'    THEN amount ELSE 0 END) AS paid_revenue,
       SUM(CASE WHEN status = 'refund'  THEN amount ELSE 0 END) AS refunded_amount,
       SUM(CASE WHEN status = 'pending' THEN amount ELSE 0 END) AS pending_amount,
       SUM(CASE WHEN status = 'paid'    THEN 1      ELSE 0 END) AS paid_orders
FROM orders
GROUP BY region;

Output.

region	paid_revenue	refunded_amount	pending_amount	paid_orders
US	11,200	1,300	0	142
EU	7,900	850	200	98
APAC	5,100	300	0	64

Rule of thumb: any time the question is "split metric X by category Y in the same grouped query", combine GROUP BY with SUM(CASE WHEN …) — that's the bridge between this blog and Blog72.

SQL
Topic — conditional-aggregation
Conditional aggregation drills

Practice →

SQL
Topic — data-aggregation
Data aggregation library

Practice →

6. GROUP BY ROLLUP, CUBE, and GROUPING SETS — multi-level subtotals

`rollup cube grouping sets sql` — one query, multiple grain levels

Once you understand single and multi-column GROUP BY, the next layer is producing multiple grain levels in one query. GROUPING SETS, ROLLUP, and CUBE extend GROUP BY to emit subtotals and grand totals without UNION-ing many separate queries.

The three advanced extensions.

GROUPING SETS ((a, b), (a), ()) — explicit list of group combinations; total flexibility.
ROLLUP(a, b, c) — hierarchical subtotals from finest grain to grand total; produces n+1 grain levels for n columns.
CUBE(a, b, c) — every combination of the listed columns; produces 2^n grain levels.

GROUPING SETS — fully explicit.

Shape — GROUP BY GROUPING SETS ((col1, col2), (col1), (col2), ()).
What it does — runs the query once for each parenthesised tuple and UNION ALLs the results.
() is the empty grouping — produces one row with everything aggregated (grand total).
Use it when — you need a custom mix of grain levels that doesn't match ROLLUP or CUBE.

ROLLUP — hierarchical subtotals.

Shape — GROUP BY ROLLUP(a, b, c).
Equivalent to — GROUPING SETS ((a, b, c), (a, b), (a), ()).
Grain count — n+1 levels for n columns: finest grain, peel off the rightmost column each step, plus the grand total.
Use it for — year → quarter → month subtotals with a grand total; geography → country → region rollups.

CUBE — combinatorial subtotals.

Shape — GROUP BY CUBE(a, b, c).
Equivalent to — GROUPING SETS ((a, b, c), (a, b), (a, c), (b, c), (a), (b), (c), ()).
Grain count — 2^n levels for n columns; explodes fast.
Use it for — exploratory OLAP analysis where every combination matters; pivot reports with marginal totals.

GROUPING(col) helper.

Returns 1 when the column was rolled-up (the row's column value is NULL as part of the rollup).
Returns 0 when the row has a real value for that column.
Use it to label rows — CASE WHEN GROUPING(quarter) = 1 THEN 'year_total' ELSE quarter::text END.

Use cases.

Finance subtotals — year → quarter → month → grand total (ROLLUP).
Pivot reports with marginal totals — every row + every column + the corner cell (CUBE).
OLAP dashboards — pre-computed grain levels for fast drill-down.
Dashboard exports — a single query produces the body + the subtotal lines + the grand total.

Dialect support.

PostgreSQL, SQL Server, Oracle, BigQuery, Snowflake — all support ROLLUP, CUBE, GROUPING SETS in the standard syntax.
MySQL — supports the legacy WITH ROLLUP syntax (GROUP BY a, b WITH ROLLUP); not CUBE or GROUPING SETS.
SQL Server limit — caps at 32 expressions per ROLLUP / CUBE.

Worked example — revenue by (year, quarter) with year subtotals and grand total

Question. From orders(order_date, amount), compute total revenue at three grain levels in one query: per (year, quarter), per (year), and the grand total.

Code.

SELECT EXTRACT(YEAR FROM order_date)    AS year,
       EXTRACT(QUARTER FROM order_date) AS quarter,
       SUM(amount)                      AS revenue
FROM orders
GROUP BY ROLLUP(
    EXTRACT(YEAR FROM order_date),
    EXTRACT(QUARTER FROM order_date)
)
ORDER BY year NULLS LAST, quarter NULLS LAST;

Output.

year	quarter	revenue
2026	1	5,200
2026	2	6,800
2026	3	7,400
2026	4	8,100
2026	NULL	27,500
NULL	NULL	27,500

Rows 1–4 are the per-quarter detail.
Row 5 (quarter = NULL) is the year subtotal for 2026.
Row 6 (both NULL) is the grand total.

Rule of thumb: if the report needs subtotals plus a grand total in one query, reach for ROLLUP; if it needs every combination, reach for CUBE; if it needs an irregular custom mix, reach for GROUPING SETS.

SQL
Topic — aggregations
Multi-level GROUP BY drills

Practice →

SQL
Topic — data-aggregation
ROLLUP / CUBE / GROUPING SETS patterns

Practice →

7. GROUP BY interview gotchas — fan-out, non-aggregate columns, NULL groups

The six bugs interviewers test most often

GROUP BY has a small surface area but a long tail of edge cases that fail candidates. These are the six gotchas reviewers test most often.

Fan-out after a JOIN.

The bug — joining a one-to-many relationship and aggregating afterwards multiplies the SUM.
Example — orders LEFT JOIN order_items then SUM(orders.amount) triples the amount when each order has three items.
Defence — pre-aggregate inside a CTE to the parent grain before joining; or join after aggregating.
State the grain aloud — "one row per (order_id, item_id)" — every reviewer expects this.

Non-aggregate column error.

The bug — SELECT region, customer_id, SUM(amount) FROM orders GROUP BY region — customer_id is neither in GROUP BY nor an aggregate.
ANSI SQL raises an error.
mysql group by pre-8.0 silently picked an arbitrary customer_id per group (notorious data-quality bug source).
Modern MySQL with ONLY_FULL_GROUP_BY enforces the rule.
Fix — wrap in MIN(customer_id), MAX(customer_id), ANY_VALUE(customer_id), or add to GROUP BY.

COUNT(*) vs COUNT(col) confusion.

The bug — using COUNT(col) to count rows when col is sometimes NULL; the count silently drops the NULL rows.
Subtle — query "looks right" and returns a plausible number; only an audit against COUNT(*) catches it.
Fix — use COUNT(*) for row counts; use COUNT(col) only when you mean "rows where col is not NULL"; use COUNT(DISTINCT col) for unique values.
The count distinct group by sql test — interviewer hands you a table with NULLs and asks "how many customers per region?"; correct answer is COUNT(DISTINCT customer_id), not COUNT(*) or COUNT(customer_id) if you want unique customers.

SUM over empty group returns NULL, not 0.

The bug — LEFT JOIN produces empty groups; SUM over them returns NULL; downstream consumers expect 0.
Symptom — dashboards render "—" or blank instead of 0; downstream arithmetic produces unexpected NULLs.
Fix — COALESCE(SUM(x), 0) — pairs cleanly with the safe-arithmetic patterns from Blog72.

NULL group key behaviour.

The behaviour — rows with NULL in a GROUP BY column form their own group (one group for all NULL rows together).
Subtle — when downstream code joins back on the group key, NULL = NULL is UNKNOWN and the NULL group row never matches.
Fix — COALESCE(group_col, 'UNKNOWN') if you want NULLs to display under a labelled bucket.

GROUP BY alias rules and sql group by month patterns.

The rule — a SELECT alias is NOT visible inside the same query's GROUP BY (logical processing order).
Workaround #1 — repeat the expression in GROUP BY: SELECT DATE_TRUNC('month', t) AS month … GROUP BY DATE_TRUNC('month', t).
Workaround #2 — positional shorthand: GROUP BY 1 refers to the first SELECT column.
Workaround #3 — wrap in a CTE: define the alias in the inner query, reference it in the outer.
The sql group by month pattern ships with all three workarounds in production code.

SQL
Topic — group-by
GROUP BY gotcha drills

Practice →

SQL
Topic — having-clause
HAVING gotcha drills

Practice →

Choosing where to filter (cheat sheet)

A one-screen cheat sheet for using GROUP BY and HAVING in SQL — pick the clause that matches your predicate type, then copy the corresponding shape into your query.

Predicate references …	Use clause …	Why
Only raw row columns (date, status, user_id)	`WHERE`	Runs before `GROUP BY`; planner can push down
An aggregate (`SUM`, `COUNT`, `AVG`, `MIN`, `MAX`)	`HAVING`	Runs after `GROUP BY`; aggregates exist by then
Both raw columns AND aggregates	`WHERE` + `HAVING`	Separate them; row filters in `WHERE`, group filters in `HAVING`
A derived bucket label (CASE expression)	`WHERE` or `GROUP BY`	Repeat the CASE expression — alias not visible in `WHERE` / `GROUP BY`
A row that should be counted only when condition holds	`SUM(CASE WHEN cond THEN 1 ELSE 0 END)`	Conditional aggregation pattern from Blog72
A specific group should be excluded by an aggregate threshold	`HAVING <aggregate predicate>`	E.g. `HAVING COUNT(*) >= 10` for "groups with ≥ 10 rows"
The output sort	`ORDER BY` (after `SELECT`)	`ORDER BY` can reference the `SELECT` alias

Frequently asked questions

What does GROUP BY do in SQL?

GROUP BY collapses rows that share the same key values into one row per distinct combination of the listed columns, and aggregate functions populate the summary columns. The output grain equals the GROUP BY column combination — GROUP BY region returns one row per region, GROUP BY region, channel returns one row per (region, channel) pair. Every non-aggregate column in the SELECT list must appear in GROUP BY (ANSI rule, modern MySQL with ONLY_FULL_GROUP_BY enforces this), and rows with NULL in a GROUP BY column form their own group together.

What's the difference between WHERE and HAVING in SQL?

WHERE filters rows before GROUP BY collapses them; HAVING filters groups after the collapse. The execution order is FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY, so WHERE cannot reference aggregates (they don't exist yet), and HAVING is the only place you can predicate on SUM, COUNT, AVG, etc. against the current GROUP BY. The canonical interview answer to difference between where and having in sql is: row-level predicates go in WHERE (cheaper, planner can push them down); aggregate predicates go in HAVING (no other clause can reference them).

How do I use GROUP BY and HAVING together in SQL?

Use them together in the canonical shape: WHERE for row-level filters, GROUP BY for the grain, then HAVING for aggregate predicates. For example, "find regions with 2026 revenue over $10k, excluding refunds" becomes WHERE order_date >= '2026-01-01' AND status <> 'refund' GROUP BY region HAVING SUM(amount) > 10000. The group by and having in sql pattern lets you express "filter rows, then group, then filter groups" without breaking the query into multiple steps. Reviewers test this specifically because it conflates two filtering concepts that share a clause name in many languages but are distinct in SQL.

What's the difference between COUNT(*), COUNT(col), and COUNT(DISTINCT col)?

COUNT(*) counts every row in the group, including rows with NULLs in any column. COUNT(col) counts rows where col is not NULL; it silently skips NULL rows. COUNT(DISTINCT col) counts unique non-NULL values of col. The sql count group by interview question typically wants COUNT(*) for "how many rows landed in this group" and COUNT(DISTINCT customer_id) for "how many unique customers". A common bug is using COUNT(col) when you meant COUNT(*) and silently undercounting rows where col is NULL — audit against COUNT(*) to catch this.

What are ROLLUP, CUBE, and GROUPING SETS in SQL?

The rollup cube grouping sets sql family extends GROUP BY to produce multiple grain levels in one query. GROUPING SETS ((a, b), (a), ()) runs the query once per parenthesised tuple and UNION ALLs the results. ROLLUP(a, b, c) produces hierarchical subtotals — finest grain, then peel off the rightmost column each step, plus the grand total — n+1 levels for n columns. CUBE(a, b, c) produces every combination — 2^n levels. Use ROLLUP for finance subtotals (year → quarter → month → grand total), CUBE for exploratory OLAP, and GROUPING SETS for custom mixes. Supported by Postgres, SQL Server, Oracle, BigQuery, Snowflake; MySQL has legacy WITH ROLLUP.

Practice on PipeCode

PipeCode ships 450+ data-engineering interview problems — including PostgreSQL-first SQL practice keyed to GROUP BY, HAVING, aggregate functions, conditional aggregation, and the rollup cube grouping sets sql family. Whether you are drilling sql interview questions with answers for sql for data engineers loops or grinding through data engineering interview questions end-to-end, the practice library mirrors the same row-collapse mental model this guide teaches.

Kick off via Explore practice →; drill the dedicated GROUP BY SQL lane →; fan out into the HAVING clause SQL lane →; reinforce aggregation SQL drills →; rehearse conditional aggregation problems →; widen coverage on the full SQL practice library →.

DEV Community

GROUP BY and HAVING in SQL: Aggregation Patterns for Interviews

1. Why GROUP BY and HAVING matter in data engineering interviews

`group by sql` collapses; aggregates summarise; `having sql` filters the groups

Worked example — orders rolled up per region

2. Anatomy of GROUP BY — the row-collapse mental model

`what does group by do in sql` — collapses key-sharing rows into one row per group

Worked example — `sql group by month` for time-series rollup

3. SQL aggregate functions — COUNT, SUM, AVG, MIN, MAX, COUNT DISTINCT

The `sql aggregate functions` catalogue — five core plus the COUNT-DISTINCT family

SQL interview question — per-region rollup with five aggregates

Solution Using GROUP BY + multiple aggregates

4. WHERE vs HAVING — the execution-order trap

`having vs where clause in sql` — `WHERE` filters rows; `HAVING` filters groups

SQL interview question — regions with > $10k 2026 revenue, excluding refunds

Solution Using WHERE + GROUP BY + HAVING

5. GROUP BY with conditional aggregation — SUM(CASE WHEN …) revisited

The bridge to Blog72 — conditional aggregation IS an aggregate inside a GROUP BY

Worked example — paid / refunded / pending revenue per region

6. GROUP BY ROLLUP, CUBE, and GROUPING SETS — multi-level subtotals

`rollup cube grouping sets sql` — one query, multiple grain levels

Worked example — revenue by (year, quarter) with year subtotals and grand total

7. GROUP BY interview gotchas — fan-out, non-aggregate columns, NULL groups

The six bugs interviewers test most often

Choosing where to filter (cheat sheet)

Frequently asked questions

What does GROUP BY do in SQL?

What's the difference between WHERE and HAVING in SQL?

How do I use GROUP BY and HAVING together in SQL?

What's the difference between COUNT(*), COUNT(col), and COUNT(DISTINCT col)?

What are ROLLUP, CUBE, and GROUPING SETS in SQL?

Practice on PipeCode

Top comments (0)

1. Why GROUP BY and HAVING matter in data engineering interviews

group by sql collapses; aggregates summarise; having sql filters the groups

Worked example — orders rolled up per region

2. Anatomy of GROUP BY — the row-collapse mental model

what does group by do in sql — collapses key-sharing rows into one row per group

Worked example — sql group by month for time-series rollup

3. SQL aggregate functions — COUNT, SUM, AVG, MIN, MAX, COUNT DISTINCT

The sql aggregate functions catalogue — five core plus the COUNT-DISTINCT family

SQL interview question — per-region rollup with five aggregates

Solution Using GROUP BY + multiple aggregates

4. WHERE vs HAVING — the execution-order trap

having vs where clause in sql — WHERE filters rows; HAVING filters groups

SQL interview question — regions with > $10k 2026 revenue, excluding refunds

Solution Using WHERE + GROUP BY + HAVING

5. GROUP BY with conditional aggregation — SUM(CASE WHEN …) revisited

The bridge to Blog72 — conditional aggregation IS an aggregate inside a GROUP BY

Worked example — paid / refunded / pending revenue per region

6. GROUP BY ROLLUP, CUBE, and GROUPING SETS — multi-level subtotals

rollup cube grouping sets sql — one query, multiple grain levels

Worked example — revenue by (year, quarter) with year subtotals and grand total

7. GROUP BY interview gotchas — fan-out, non-aggregate columns, NULL groups

The six bugs interviewers test most often

Choosing where to filter (cheat sheet)

Frequently asked questions

What does GROUP BY do in SQL?

What's the difference between WHERE and HAVING in SQL?

How do I use GROUP BY and HAVING together in SQL?

What's the difference between COUNT(*), COUNT(col), and COUNT(DISTINCT col)?

What are ROLLUP, CUBE, and GROUPING SETS in SQL?

Practice on PipeCode

`group by sql` collapses; aggregates summarise; `having sql` filters the groups

`what does group by do in sql` — collapses key-sharing rows into one row per group

Worked example — `sql group by month` for time-series rollup

The `sql aggregate functions` catalogue — five core plus the COUNT-DISTINCT family

`having vs where clause in sql` — `WHERE` filters rows; `HAVING` filters groups

`rollup cube grouping sets sql` — one query, multiple grain levels