DEV Community

Cover image for GROUP BY and HAVING in SQL: Aggregation Patterns for Interviews
Gowtham Potureddi
Gowtham Potureddi

Posted on

GROUP BY and HAVING in SQL: Aggregation Patterns for Interviews

GROUP BY in SQL collapses many detail rows into one row per distinct key combination, sql aggregate functions (COUNT, SUM, AVG, MIN, MAX) compute a summary value per group, and the HAVING clause filters those groups after the collapse — while WHERE filters raw rows before. That single sentence answers most of the sql interview questions in the group by sql / sql group by cluster, and getting the having clause sql semantics right separates candidates who pass screening from candidates who don't.

This guide walks through every clause in the GROUP BY / HAVING pipeline that reviewers love to test in data engineering interview questions: the row-collapse mental model, the full aggregate functions in sql catalogue, the canonical having vs where clause in sql execution-order trap, group by multiple columns sql with multi-key grain, conditional aggregation tied to Blog72, advanced rollup cube grouping sets sql subtotals, and the seven interview gotchas (fan-out, non-aggregate columns, NULL groups) that fail most candidates. Every section ends as sql interview questions with answers: a runnable PostgreSQL query, a traced execution, an output table, and a concept-by-concept why this works breakdown — the exact shape sql for data engineers rounds reward when aggregation in sql comes up.

PipeCode blog header for a SQL GROUP BY and HAVING tutorial — bold white headline 'GROUP BY · HAVING' with subtitle 'aggregation patterns for interviews' and a minimal collapse diagram on a dark gradient with purple, green, and orange accents and a small pipecode.ai attribution.

When you want hands-on reps immediately after reading, browse GROUP BY SQL practice →, drill the HAVING clause SQL lane →, sharpen aggregation SQL drills →, rehearse conditional aggregation problems →, or widen coverage on the full SQL practice library →.


On this page


1. Why GROUP BY and HAVING matter in data engineering interviews

group by sql collapses; aggregates summarise; having sql filters the groups

The one-sentence invariant interviewers chase: group by sql collapses rows that share the same key values into one row per group; aggregate functions compute a summary value per group; having sql filters those groups after the collapse, while WHERE filters rows *before*. If you can say that aloud and pick the right clause for each predicate, you have already cleared the bar most candidates fail to clear when sql group by having shows up on a whiteboard.

What GROUP BY does at a glance.

  • Definition — collapses rows that share the listed column values into one output row per distinct combination.
  • Required syntaxSELECT <cols> FROM <table> GROUP BY <col1>, <col2>, ….
  • Output grain — exactly one row per distinct combination of the GROUP BY columns.
  • Non-aggregate column rule — every column in SELECT must either appear in GROUP BY or be wrapped inside an aggregate; ANSI SQL enforces this and modern mysql group by does too once ONLY_FULL_GROUP_BY is on.

What HAVING does at a glance.

  • Definition — filters the grouped rows; usually references aggregate functions.
  • Required syntax… GROUP BY <cols> HAVING <predicate>.
  • HAVING can reference aggregatesWHERE cannot.
  • Both WHERE and HAVING can coexist in one query; the canonical interview shape is WHERE for row filters, GROUP BY, then HAVING for group filters.
  • The phrase having by in sql (a common misnomer for the HAVING clause) refers to this same construct.

The execution pipeline you must memorise.

  • The SQL clauses run in this order regardless of how you write them:
    • FROM / JOIN — assemble the row source.
    • WHERE — pre-filter the detail rows.
    • GROUP BY — collapse rows into one row per group.
    • HAVING — filter the grouped rows.
    • window functions — decorate the grouped rows.
    • SELECT — project the final columns (including aggregate aliases).
    • ORDER BY — sort the final output.
    • LIMIT — truncate.
  • Consequence #1WHERE cannot reference an aggregate (aggregates don't exist yet).
  • Consequence #2HAVING can reference aggregates and is the right place for SUM(x) > 1000-style predicates.
  • Consequence #3 — the SELECT alias is NOT visible in WHERE / GROUP BY / HAVING (alias does not yet exist).

Why interviewers love this combo.

  • Tests row-grain awareness — knowing that GROUP BY region produces one row per region.
  • Tests aggregate semantics — understanding COUNT(*) vs COUNT(col) vs COUNT(DISTINCT col).
  • Tests execution-order understanding — picking WHERE vs HAVING correctly.
  • Tests fan-out discipline — knowing not to aggregate before deduping a one-to-many join.

What interviewers listen for.

  • Do you name the grain of the output explicitly before writing the query? — "this produces one row per (region, channel) combination."
  • Do you put row-level predicates in WHERE and aggregate predicates in HAVING? — the canonical separation.
  • Do you know the execution order when challenged on WHERE vs HAVING? — senior signal.
  • Do you reach for the right aggregate for the question? — SUM for totals, COUNT(*) for row counts, COUNT(DISTINCT col) for unique counts, AVG for means.

Worked example — orders rolled up per region

Question. From orders(order_id, region, amount), return per region: the number of orders and the total revenue. Order the result by total revenue descending.

Input. orders slice.

order_id region amount
1 US 100
2 US 150
3 EU 80
4 US 200
5 EU 120

Code.

SELECT region,
       COUNT(*)    AS order_count,
       SUM(amount) AS total_revenue
FROM orders
GROUP BY region
ORDER BY total_revenue DESC;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. FROM orders reads every row.
  2. GROUP BY region collapses rows that share the same region into one group.
  3. COUNT(*) counts the rows inside each group; SUM(amount) totals the amounts.
  4. The output grain is exactly one row per distinct region.
  5. ORDER BY total_revenue DESC references the SELECT alias — legal because ORDER BY runs after SELECT.

Output.

region order_count total_revenue
US 3 450
EU 2 200

Rule of thumb: when the question asks "per X", write GROUP BY X first, then pick aggregates for the metrics — that's the heart of sql group by and the cleanest way to learn aggregation in sql.

SQL
Topic — group-by
GROUP BY SQL drills

Practice →

SQL
Topic — having-clause
HAVING clause SQL problems

Practice →


2. Anatomy of GROUP BY — the row-collapse mental model

Diagram of the GROUP BY collapse — multiple detail rows of (region, amount) on the left collapsing via a purple arrow into one row per region on the right, with the aggregate value labelled on each group, on a light PipeCode card.

what does group by do in sql — collapses key-sharing rows into one row per group

The mental model: think of GROUP BY as a row-collapse operator. You start with many detail rows; you end with one row per distinct value of the GROUP BY columns, and aggregate functions populate the summary columns. That's how does group by work in sql in one sentence.

what does group by do in sql — the one-line answer.

  • Collapses rows that share the listed column values into one output row per distinct combination.
  • Aggregate functions populate the non-grouped output columns.
  • Detail-level information is lost unless preserved by an aggregate (MIN, MAX, STRING_AGG, etc.).
  • The output grain equals the GROUP BY column combination.

Single column GROUP BY.

  • ShapeGROUP BY <single_column>.
  • Output grain — one row per distinct value of <single_column>.
  • Example use cases — orders per region, signups per day, revenue per product.
  • NULL is a group — all rows where the column is NULL form their own group.

group by multiple columns sql — multi-key grouping.

  • ShapeGROUP BY <col1>, <col2>, <col3>.
  • Output grain — one row per distinct combination of all listed columns.
  • Order in GROUP BY does not change the result — only the planner's sort/hash choice.
  • Positional shorthandGROUP BY 1, 2 refers to the first and second SELECT columns (works in Postgres, MySQL, Snowflake, BigQuery; not SQL Server).
  • ExampleGROUP BY region, channel produces one row per (region, channel) pair.

GROUP BY on expressions.

  • GROUP BY accepts any scalar expression — not just bare columns.
  • Date truncationGROUP BY DATE_TRUNC('month', sale_date) produces sql group by month shape (one row per calendar month).
  • CASE expressionsGROUP BY CASE WHEN amount >= 100 THEN 'big' ELSE 'small' END groups by a derived bucket.
  • GROUP BY 1 positional — handy for terse queries; less safe than explicit column names in long-lived code.
  • group by in postgresql quirk — Postgres lets you GROUP BY on a SELECT alias in some versions, unlike standard SQL.

The non-aggregate column rule.

  • Every non-aggregate SELECT column must appear in GROUP BY — ANSI SQL requirement.
  • MySQL pre-8.0 historically allowed silently picking an arbitrary value from the group (a notorious bug source).
  • Modern mysql group by enforces this when sql_mode includes ONLY_FULL_GROUP_BY (the default since MySQL 5.7.5).
  • Workaround — wrap the offending column in MIN(col), MAX(col), or ANY_VALUE(col) to satisfy the rule.

NULL is a valid group key.

  • Rows with NULL in a GROUP BY column form their own group together.
  • NULL = NULL is UNKNOWN in normal predicates, but for GROUP BY purposes NULL is treated as a single distinct value.
  • One NULL group per NULL column combination(NULL, 'US') is distinct from ('CA', 'US') and from (NULL, NULL).

Worked example — sql group by month for time-series rollup

Question. From sales(sale_date, amount), compute total revenue per calendar month for 2026.

Input. sales slice.

sale_date amount
2026-01-05 100
2026-01-20 200
2026-02-03 150
2026-02-18 250
2026-03-10 175

Code.

SELECT DATE_TRUNC('month', sale_date)::date AS month,
       SUM(amount)                          AS monthly_revenue
FROM sales
WHERE sale_date >= '2026-01-01'
  AND sale_date <  '2027-01-01'
GROUP BY DATE_TRUNC('month', sale_date)
ORDER BY 1;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. WHERE sale_date BETWEEN … pre-filters detail rows before grouping (cheaper).
  2. DATE_TRUNC('month', sale_date) rounds each timestamp down to the first of its month.
  3. GROUP BY DATE_TRUNC('month', sale_date) collapses rows that share the same month.
  4. SUM(amount) totals the amounts per month.
  5. ORDER BY 1 sorts by the first column (the truncated month).

Output.

month monthly_revenue
2026-01-01 300
2026-02-01 400
2026-03-01 175

Rule of thumb: whenever the question says "by month" / "by quarter" / "by week", reach for DATE_TRUNC (Postgres) or the equivalent (DATE_FORMAT in MySQL, DATETRUNC in SQL Server, TRUNC in Oracle) and put it inside GROUP BY.

SQL
Topic — group-by
Multi-column GROUP BY drills

Practice →

SQL
Topic — aggregation
Aggregation SQL library

Practice →


3. SQL aggregate functions — COUNT, SUM, AVG, MIN, MAX, COUNT DISTINCT

Diagram of the SQL aggregate functions overview — five cards in a row labelled COUNT, SUM, AVG, MIN, MAX with sample input rows and the per-group output value highlighted, on a light PipeCode card.

The sql aggregate functions catalogue — five core plus the COUNT-DISTINCT family

The sql aggregate functions you actually use in interviews fit on one screen: COUNT, SUM, AVG, MIN, MAX, plus the COUNT(DISTINCT col) variant and the string/array aggregates. Knowing the NULL semantics of each is what separates juniors from seniors.

The aggregate functions in sql catalogue at a glance.

  • COUNT(*) — counts every row in the group, including rows with NULLs in any column.
  • COUNT(col) — counts rows where col IS NOT NULL; skips NULL rows silently.
  • COUNT(DISTINCT col) — counts unique non-NULL values of col; the count distinct group by sql interview classic.
  • SUM(col) — adds non-NULL numeric values; returns NULL when every value in the group is NULL.
  • AVG(col) — arithmetic mean of non-NULL values; ignores NULL.
  • MIN(col) / MAX(col) — smallest / largest non-NULL value across the group.
  • STRING_AGG(col, delim) — concatenates non-NULL values into one string per group (Postgres / SQL Server). Dialect aliases: GROUP_CONCAT (MySQL), LISTAGG (Snowflake / Oracle).
  • ARRAY_AGG(col) — collects values into an array per group (Postgres / BigQuery / Snowflake).
  • Statistical aggregatesSTDDEV, VARIANCE, PERCENTILE_CONT, PERCENTILE_DISC (dialect-dependent).

COUNT(*) vs COUNT(col) vs COUNT(DISTINCT col) — the sql count group by interview test.

  • COUNT(*) counts rows; a row contributes 1 even if every column is NULL. Use this for "how many rows landed in this group".
  • COUNT(col) counts rows where col is not NULL. Use this for "how many of these rows had a recorded col".
  • COUNT(DISTINCT col) counts distinct non-NULL values of col. Use this for "how many unique X showed up".
  • Common bug — using COUNT(col) when you meant COUNT(*), and silently undercounting rows where col is NULL.
  • Postgres / Snowflake — both fully support COUNT(DISTINCT). BigQuery legacy — uses APPROX_COUNT_DISTINCT; modern BigQuery supports exact COUNT(DISTINCT).

SUM semantics — the sql sum group by pattern.

  • Adds non-NULL values per group; treats NULL as zero in the sense of "skip it", not "add zero".
  • Empty group returns NULL, not 0 — guard with COALESCE(SUM(x), 0) if the downstream consumer expects zero.
  • Overflow — large SUMs on INTEGER columns can overflow; cast to BIGINT or NUMERIC if needed.
  • Combined with GROUP BY — the canonical pattern SELECT region, SUM(amount) FROM orders GROUP BY region.

AVG semantics.

  • Arithmetic mean of non-NULL values per group; (SUM / COUNT(col)).
  • Empty group returns NULL.
  • Integer division trapAVG on an integer column returns an integer in some dialects (MySQL); cast to DECIMAL or NUMERIC for fractional means: AVG(col::numeric).

MIN / MAX semantics.

  • Smallest / largest non-NULL value across the group.
  • Works on any orderable type — numbers, strings (lexicographic), dates, timestamps.
  • Empty group returns NULL.
  • Useful patternMIN(created_at) to find each group's earliest record; MAX(updated_at) for the latest.

String aggregates — joining rows into one string per group.

  • PostgreSQL / SQL ServerSTRING_AGG(col, ', '). Add ORDER BY inside for sorted concatenation: STRING_AGG(name, ', ' ORDER BY name).
  • MySQLGROUP_CONCAT(col SEPARATOR ', ' ORDER BY col). Has a default length limit (group_concat_max_len); raise it for long results.
  • Snowflake / OracleLISTAGG(col, ', ') WITHIN GROUP (ORDER BY col).
  • Use case — "comma-separated list of customer names per region for the sales team".

Array aggregates.

  • Postgres / BigQuery / SnowflakeARRAY_AGG(col ORDER BY col).
  • Returns an array column — useful for JSON exports or downstream UNNEST operations.
  • ARRAY_AGG(DISTINCT col) — deduplicate before collecting.
  • FILTER (WHERE …) clause works inside ARRAY_AGG too (Postgres 9.4+).

SQL interview question — per-region rollup with five aggregates

Assume orders(order_id, region, customer_id, amount). Return per region: total orders (COUNT(*)), total revenue (SUM(amount)), average order value (AVG(amount)), smallest and largest order amounts (MIN / MAX), and the count of distinct customers (COUNT(DISTINCT customer_id)).

Solution Using GROUP BY + multiple aggregates

Code solution.

SELECT region,
       COUNT(*)                       AS total_orders,
       SUM(amount)                    AS total_revenue,
       ROUND(AVG(amount)::numeric, 2) AS avg_order_value,
       MIN(amount)                    AS min_order,
       MAX(amount)                    AS max_order,
       COUNT(DISTINCT customer_id)    AS distinct_customers
FROM orders
GROUP BY region
ORDER BY total_revenue DESC;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

step relation outcome
1 Scan orders every order row read
2 GROUP BY region hash aggregate; one row per region
3 Six aggregate calls evaluated per group each KPI computed in the same pass
4 ROUND(AVG(amount)::numeric, 2) cast to numeric to avoid integer-division surprises; round to 2 decimal places
5 COUNT(DISTINCT customer_id) deduplicates customers per region before counting
6 Outer ORDER BY total_revenue DESC best-performing regions first

Output:

region total_orders total_revenue avg_order_value min_order max_order distinct_customers
US 142 12,500 88.03 5 1,200 89
EU 98 8,750 89.29 8 950 64
APAC 64 5,400 84.38 10 700 42

Why this works — concept by concept:

  • GROUP BY region — defines the output grain as one row per region; every aggregate is computed within each region partition.
  • COUNT(*) counts rows — orders per region, regardless of NULLs.
  • SUM(amount) totals values — every aggregate-as-window aggregation pattern in sql sum group by follows this shape.
  • AVG(amount) cast to numeric — without the cast, AVG on an integer column returns truncated integers in MySQL and older Postgres versions; rounding to 2 decimal places normalises the output.
  • COUNT(DISTINCT customer_id) deduplicates — one customer with 5 orders contributes 1 to distinct_customers and 5 to total_orders; the difference is exactly the count distinct group by sql interview probe.
  • Single sequential scan — the planner reads orders once and computes all six aggregates in the same hash aggregate node; far cheaper than six separate filtered queries.
  • CostΘ(n) for the scan plus Θ(n) for the hash aggregate keyed on region; the six aggregate calls are a constant per-row factor.

SQL
Topic — aggregation
Aggregate-function SQL drills

Practice →

SQL
Topic — aggregations
Multi-aggregate practice

Practice →


4. WHERE vs HAVING — the execution-order trap

Diagram of the SQL execution pipeline — FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY — with WHERE highlighted in green (pre-grouping row filter) and HAVING highlighted in orange (post-grouping group filter), on a light PipeCode card.

having vs where clause in sqlWHERE filters rows; HAVING filters groups

The canonical execution-order trap: WHERE filters detail rows before GROUP BY collapses them; HAVING filters the collapsed groups after. Get this wrong and your query either errors (WHERE cannot reference aggregates) or returns the right rows from a costlier plan (using HAVING for row-level filters). This is the textbook difference between where and having in sql interview question.

having vs where clause in sql — the headline difference.

  • WHERE runs before GROUP BY — it filters raw detail rows; aggregates do not yet exist.
  • HAVING runs after GROUP BY — it filters the grouped rows; aggregates are fully computed.
  • Both clauses can coexist in the same query; the canonical shape is WHERE for row predicates, GROUP BY, HAVING for aggregate predicates.
  • The sql where vs having / where vs having sql question is essentially this one sentence.

The execution order (memorise this).

  • FROM / JOINWHEREGROUP BYHAVING → window functions → SELECTDISTINCTORDER BYLIMIT.
  • Implication for WHERE — it sees every raw column but cannot see aggregates (they don't exist yet).
  • Implication for HAVING — it sees the grouped rows; references to non-grouped columns are illegal (the column doesn't have a single value within the group).
  • Implication for aliases — a SELECT alias cannot be referenced inside the same query's WHERE, GROUP BY, or HAVING; reference is legal in ORDER BY only.

WHERE cannot reference aggregates.

  • IllegalWHERE SUM(amount) > 1000 — aggregates evaluate after GROUP BY, after WHERE has already run.
  • Error message — most engines say "aggregate functions are not allowed in WHERE".
  • Correct fix — move the aggregate predicate to HAVING: HAVING SUM(amount) > 1000.

HAVING should not reference non-aggregates that are also not grouped.

  • Works in some dialects — MySQL is lenient; ANSI SQL is strict.
  • Defeats the purpose — a row-level predicate in HAVING runs after the engine has already grouped every row; the planner cannot push it down to filter raw rows earlier.
  • Cost — slower plan for the same result. Put row-level filters in WHERE for planner efficiency.

difference between where and having in sql — quick decision test.

  • Predicate references an aggregate (SUM, COUNT, AVG, …) → use HAVING.
  • Predicate references only raw columns → use WHERE.
  • Both kinds of predicates → use both clauses; WHERE for the row-level ones, HAVING for the aggregate ones.
  • The sql having clause is exclusively for aggregates — interview-grade answer.

Both WHERE and HAVING in one query — the canonical shape.

SELECT region,
       SUM(amount) AS total
FROM orders
WHERE order_date >= '2026-01-01'   -- pre-filter raw rows
  AND status     <> 'refund'        -- row-level predicate stays in WHERE
GROUP BY region
HAVING SUM(amount) > 10000          -- aggregate predicate goes in HAVING
ORDER BY total DESC;
Enter fullscreen mode Exit fullscreen mode
  • WHERE trims detail rows before the planner builds the hash aggregate — cheaper.
  • GROUP BY collapses by region.
  • HAVING drops regions whose total revenue is below 10000.
  • ORDER BY sorts the surviving regions.

SQL interview question — regions with > $10k 2026 revenue, excluding refunds

Assume orders(order_id, region, order_date, status, amount). Return regions whose total 2026 revenue exceeds $10,000, ignoring refunded orders. Sort by descending revenue.

Solution Using WHERE + GROUP BY + HAVING

Code solution.

SELECT region,
       SUM(amount) AS total_revenue
FROM orders
WHERE order_date >= '2026-01-01'
  AND order_date <  '2027-01-01'
  AND status     <> 'refund'
GROUP BY region
HAVING SUM(amount) > 10000
ORDER BY total_revenue DESC;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

step relation outcome
1 Scan orders every row read
2 WHERE order_date BETWEEN '2026-01-01' AND '2027-01-01' AND status <> 'refund' drops 2025 orders and refunds before grouping (cheap)
3 GROUP BY region hash aggregate over the surviving rows
4 SUM(amount) evaluated per region one total per region
5 HAVING SUM(amount) > 10000 drops regions whose total is at or below the threshold
6 ORDER BY total_revenue DESC sorts the survivors

Output:

region total_revenue
US 12,500
EU 11,200

(APAC drops out because its 2026 non-refund revenue is below $10k.)

Why this works — concept by concept:

  • WHERE pre-filters detail rows — the date and status predicates run before GROUP BY, so the planner builds a smaller hash aggregate; this is the textbook where vs having sql efficiency lesson.
  • GROUP BY region defines the grain — one row per region in the intermediate result.
  • SUM(amount) computes the per-region total — the aggregate that the HAVING clause will then evaluate.
  • HAVING filters groups, not rowsHAVING SUM(amount) > 10000 cannot be written in WHERE because the aggregate doesn't exist there.
  • The sql having clause exists specifically for this — it is the only clause where you can predicate on aggregates of the current GROUP BY.
  • CostΘ(n) scan with predicate pushdown thanks to WHERE, then Θ(k) hash aggregate where k is the count of surviving rows; the HAVING filter is Θ(g) where g is the number of groups.

SQL
Topic — having-clause
HAVING clause SQL drills

Practice →

SQL
Topic — aggregation
WHERE + GROUP BY + HAVING patterns

Practice →


5. GROUP BY with conditional aggregation — SUM(CASE WHEN …) revisited

The bridge to Blog72 — conditional aggregation IS an aggregate inside a GROUP BY

The conditional aggregation pattern from Blog72 — SUM(CASE WHEN cond THEN value ELSE 0 END) — is just an aggregate inside a GROUP BY query. When sql group by having problems ask for "paid vs refunded revenue per region in one query," the answer combines the two ideas: GROUP BY defines the grain, conditional aggregation defines the metrics.

The bridge to Blog72.

  • SUM(CASE WHEN cond THEN val ELSE 0 END) sums val only for rows where cond is true — see Blog72 §4 for the full anatomy.
  • Inside a GROUP BY query — one such expression per metric column; all share the same GROUP BY grain.
  • The ELSE 0 rule — missing it makes SUM ignore NULL and silently undercount.
  • COUNT(CASE WHEN cond THEN 1 END) and SUM(CASE WHEN cond THEN 1 ELSE 0 END) are equivalent for counting.

Why this matters in interviews.

  • Combines two fluencies at onceGROUP BY grain awareness AND branched aggregate logic.
  • One-pass multi-KPI reports — paid / refunded / pending revenue per region in a single grouped query.
  • No PIVOT operator needed — works on every dialect that supports CASE.
  • Tests for the ELSE 0 discipline — the silent-undercount NULL trap from Blog72.

The canonical pattern — one SUM(CASE WHEN …) per metric, all under one GROUP BY.

SELECT region,
       SUM(CASE WHEN status = 'paid'    THEN amount ELSE 0 END) AS paid_revenue,
       SUM(CASE WHEN status = 'refund'  THEN amount ELSE 0 END) AS refunded_amount,
       SUM(CASE WHEN status = 'pending' THEN amount ELSE 0 END) AS pending_amount,
       SUM(CASE WHEN status = 'paid'    THEN 1      ELSE 0 END) AS paid_orders
FROM orders
GROUP BY region
ORDER BY paid_revenue DESC;
Enter fullscreen mode Exit fullscreen mode

Pair with HAVING — filter on any of the conditional aggregates.

  • Filter regions with at least one refundHAVING SUM(CASE WHEN status = 'refund' THEN 1 ELSE 0 END) > 0.
  • Filter regions with paid revenue > $10kHAVING SUM(CASE WHEN status = 'paid' THEN amount ELSE 0 END) > 10000.
  • Multi-condition existenceHAVING SUM(CASE WHEN tier = 'free' THEN 1 ELSE 0 END) > 0 AND SUM(CASE WHEN tier = 'paid' THEN 1 ELSE 0 END) > 0 keeps groups that have both free and paid users.

Worked example — paid / refunded / pending revenue per region

Question. From orders(order_id, region, status, amount), compute per region: paid revenue, refunded amount, pending amount, and count of paid orders, all in one grouped query.

Code.

SELECT region,
       SUM(CASE WHEN status = 'paid'    THEN amount ELSE 0 END) AS paid_revenue,
       SUM(CASE WHEN status = 'refund'  THEN amount ELSE 0 END) AS refunded_amount,
       SUM(CASE WHEN status = 'pending' THEN amount ELSE 0 END) AS pending_amount,
       SUM(CASE WHEN status = 'paid'    THEN 1      ELSE 0 END) AS paid_orders
FROM orders
GROUP BY region;
Enter fullscreen mode Exit fullscreen mode

Output.

region paid_revenue refunded_amount pending_amount paid_orders
US 11,200 1,300 0 142
EU 7,900 850 200 98
APAC 5,100 300 0 64

Rule of thumb: any time the question is "split metric X by category Y in the same grouped query", combine GROUP BY with SUM(CASE WHEN …) — that's the bridge between this blog and Blog72.

SQL
Topic — conditional-aggregation
Conditional aggregation drills

Practice →

SQL
Topic — data-aggregation
Data aggregation library

Practice →


6. GROUP BY ROLLUP, CUBE, and GROUPING SETS — multi-level subtotals

Diagram of multi-level GROUP BY — a hierarchical pyramid showing (year, quarter, month) at the bottom rolling up to (year, quarter), (year), and the grand total at the top via GROUP BY ROLLUP, on a light PipeCode card.

rollup cube grouping sets sql — one query, multiple grain levels

Once you understand single and multi-column GROUP BY, the next layer is producing multiple grain levels in one query. GROUPING SETS, ROLLUP, and CUBE extend GROUP BY to emit subtotals and grand totals without UNION-ing many separate queries.

The three advanced extensions.

  • GROUPING SETS ((a, b), (a), ()) — explicit list of group combinations; total flexibility.
  • ROLLUP(a, b, c) — hierarchical subtotals from finest grain to grand total; produces n+1 grain levels for n columns.
  • CUBE(a, b, c) — every combination of the listed columns; produces 2^n grain levels.

GROUPING SETS — fully explicit.

  • ShapeGROUP BY GROUPING SETS ((col1, col2), (col1), (col2), ()).
  • What it does — runs the query once for each parenthesised tuple and UNION ALLs the results.
  • () is the empty grouping — produces one row with everything aggregated (grand total).
  • Use it when — you need a custom mix of grain levels that doesn't match ROLLUP or CUBE.

ROLLUP — hierarchical subtotals.

  • ShapeGROUP BY ROLLUP(a, b, c).
  • Equivalent toGROUPING SETS ((a, b, c), (a, b), (a), ()).
  • Grain countn+1 levels for n columns: finest grain, peel off the rightmost column each step, plus the grand total.
  • Use it for — year → quarter → month subtotals with a grand total; geography → country → region rollups.

CUBE — combinatorial subtotals.

  • ShapeGROUP BY CUBE(a, b, c).
  • Equivalent toGROUPING SETS ((a, b, c), (a, b), (a, c), (b, c), (a), (b), (c), ()).
  • Grain count2^n levels for n columns; explodes fast.
  • Use it for — exploratory OLAP analysis where every combination matters; pivot reports with marginal totals.

GROUPING(col) helper.

  • Returns 1 when the column was rolled-up (the row's column value is NULL as part of the rollup).
  • Returns 0 when the row has a real value for that column.
  • Use it to label rowsCASE WHEN GROUPING(quarter) = 1 THEN 'year_total' ELSE quarter::text END.

Use cases.

  • Finance subtotals — year → quarter → month → grand total (ROLLUP).
  • Pivot reports with marginal totals — every row + every column + the corner cell (CUBE).
  • OLAP dashboards — pre-computed grain levels for fast drill-down.
  • Dashboard exports — a single query produces the body + the subtotal lines + the grand total.

Dialect support.

  • PostgreSQL, SQL Server, Oracle, BigQuery, Snowflake — all support ROLLUP, CUBE, GROUPING SETS in the standard syntax.
  • MySQL — supports the legacy WITH ROLLUP syntax (GROUP BY a, b WITH ROLLUP); not CUBE or GROUPING SETS.
  • SQL Server limit — caps at 32 expressions per ROLLUP / CUBE.

Worked example — revenue by (year, quarter) with year subtotals and grand total

Question. From orders(order_date, amount), compute total revenue at three grain levels in one query: per (year, quarter), per (year), and the grand total.

Code.

SELECT EXTRACT(YEAR FROM order_date)    AS year,
       EXTRACT(QUARTER FROM order_date) AS quarter,
       SUM(amount)                      AS revenue
FROM orders
GROUP BY ROLLUP(
    EXTRACT(YEAR FROM order_date),
    EXTRACT(QUARTER FROM order_date)
)
ORDER BY year NULLS LAST, quarter NULLS LAST;
Enter fullscreen mode Exit fullscreen mode

Output.

year quarter revenue
2026 1 5,200
2026 2 6,800
2026 3 7,400
2026 4 8,100
2026 NULL 27,500
NULL NULL 27,500
  • Rows 1–4 are the per-quarter detail.
  • Row 5 (quarter = NULL) is the year subtotal for 2026.
  • Row 6 (both NULL) is the grand total.

Rule of thumb: if the report needs subtotals plus a grand total in one query, reach for ROLLUP; if it needs every combination, reach for CUBE; if it needs an irregular custom mix, reach for GROUPING SETS.

SQL
Topic — aggregations
Multi-level GROUP BY drills

Practice →

SQL
Topic — data-aggregation
ROLLUP / CUBE / GROUPING SETS patterns

Practice →


7. GROUP BY interview gotchas — fan-out, non-aggregate columns, NULL groups

The six bugs interviewers test most often

GROUP BY has a small surface area but a long tail of edge cases that fail candidates. These are the six gotchas reviewers test most often.

Fan-out after a JOIN.

  • The bug — joining a one-to-many relationship and aggregating afterwards multiplies the SUM.
  • Exampleorders LEFT JOIN order_items then SUM(orders.amount) triples the amount when each order has three items.
  • Defence — pre-aggregate inside a CTE to the parent grain before joining; or join after aggregating.
  • State the grain aloud — "one row per (order_id, item_id)" — every reviewer expects this.

Non-aggregate column error.

  • The bugSELECT region, customer_id, SUM(amount) FROM orders GROUP BY regioncustomer_id is neither in GROUP BY nor an aggregate.
  • ANSI SQL raises an error.
  • mysql group by pre-8.0 silently picked an arbitrary customer_id per group (notorious data-quality bug source).
  • Modern MySQL with ONLY_FULL_GROUP_BY enforces the rule.
  • Fix — wrap in MIN(customer_id), MAX(customer_id), ANY_VALUE(customer_id), or add to GROUP BY.

COUNT(*) vs COUNT(col) confusion.

  • The bug — using COUNT(col) to count rows when col is sometimes NULL; the count silently drops the NULL rows.
  • Subtle — query "looks right" and returns a plausible number; only an audit against COUNT(*) catches it.
  • Fix — use COUNT(*) for row counts; use COUNT(col) only when you mean "rows where col is not NULL"; use COUNT(DISTINCT col) for unique values.
  • The count distinct group by sql test — interviewer hands you a table with NULLs and asks "how many customers per region?"; correct answer is COUNT(DISTINCT customer_id), not COUNT(*) or COUNT(customer_id) if you want unique customers.

SUM over empty group returns NULL, not 0.

  • The bugLEFT JOIN produces empty groups; SUM over them returns NULL; downstream consumers expect 0.
  • Symptom — dashboards render "—" or blank instead of 0; downstream arithmetic produces unexpected NULLs.
  • FixCOALESCE(SUM(x), 0) — pairs cleanly with the safe-arithmetic patterns from Blog72.

NULL group key behaviour.

  • The behaviour — rows with NULL in a GROUP BY column form their own group (one group for all NULL rows together).
  • Subtle — when downstream code joins back on the group key, NULL = NULL is UNKNOWN and the NULL group row never matches.
  • FixCOALESCE(group_col, 'UNKNOWN') if you want NULLs to display under a labelled bucket.

GROUP BY alias rules and sql group by month patterns.

  • The rule — a SELECT alias is NOT visible inside the same query's GROUP BY (logical processing order).
  • Workaround #1 — repeat the expression in GROUP BY: SELECT DATE_TRUNC('month', t) AS month … GROUP BY DATE_TRUNC('month', t).
  • Workaround #2 — positional shorthand: GROUP BY 1 refers to the first SELECT column.
  • Workaround #3 — wrap in a CTE: define the alias in the inner query, reference it in the outer.
  • The sql group by month pattern ships with all three workarounds in production code.

SQL
Topic — group-by
GROUP BY gotcha drills

Practice →

SQL
Topic — having-clause
HAVING gotcha drills

Practice →


Choosing where to filter (cheat sheet)

A one-screen cheat sheet for using GROUP BY and HAVING in SQL — pick the clause that matches your predicate type, then copy the corresponding shape into your query.

Predicate references … Use clause … Why
Only raw row columns (date, status, user_id) WHERE Runs before GROUP BY; planner can push down
An aggregate (SUM, COUNT, AVG, MIN, MAX) HAVING Runs after GROUP BY; aggregates exist by then
Both raw columns AND aggregates WHERE + HAVING Separate them; row filters in WHERE, group filters in HAVING
A derived bucket label (CASE expression) WHERE or GROUP BY Repeat the CASE expression — alias not visible in WHERE / GROUP BY
A row that should be counted only when condition holds SUM(CASE WHEN cond THEN 1 ELSE 0 END) Conditional aggregation pattern from Blog72
A specific group should be excluded by an aggregate threshold HAVING <aggregate predicate> E.g. HAVING COUNT(*) >= 10 for "groups with ≥ 10 rows"
The output sort ORDER BY (after SELECT) ORDER BY can reference the SELECT alias

Frequently asked questions

What does GROUP BY do in SQL?

GROUP BY collapses rows that share the same key values into one row per distinct combination of the listed columns, and aggregate functions populate the summary columns. The output grain equals the GROUP BY column combination — GROUP BY region returns one row per region, GROUP BY region, channel returns one row per (region, channel) pair. Every non-aggregate column in the SELECT list must appear in GROUP BY (ANSI rule, modern MySQL with ONLY_FULL_GROUP_BY enforces this), and rows with NULL in a GROUP BY column form their own group together.

What's the difference between WHERE and HAVING in SQL?

WHERE filters rows before GROUP BY collapses them; HAVING filters groups after the collapse. The execution order is FROMWHEREGROUP BYHAVINGSELECTORDER BY, so WHERE cannot reference aggregates (they don't exist yet), and HAVING is the only place you can predicate on SUM, COUNT, AVG, etc. against the current GROUP BY. The canonical interview answer to difference between where and having in sql is: row-level predicates go in WHERE (cheaper, planner can push them down); aggregate predicates go in HAVING (no other clause can reference them).

How do I use GROUP BY and HAVING together in SQL?

Use them together in the canonical shape: WHERE for row-level filters, GROUP BY for the grain, then HAVING for aggregate predicates. For example, "find regions with 2026 revenue over $10k, excluding refunds" becomes WHERE order_date >= '2026-01-01' AND status <> 'refund' GROUP BY region HAVING SUM(amount) > 10000. The group by and having in sql pattern lets you express "filter rows, then group, then filter groups" without breaking the query into multiple steps. Reviewers test this specifically because it conflates two filtering concepts that share a clause name in many languages but are distinct in SQL.

What's the difference between COUNT(*), COUNT(col), and COUNT(DISTINCT col)?

COUNT(*) counts every row in the group, including rows with NULLs in any column. COUNT(col) counts rows where col is not NULL; it silently skips NULL rows. COUNT(DISTINCT col) counts unique non-NULL values of col. The sql count group by interview question typically wants COUNT(*) for "how many rows landed in this group" and COUNT(DISTINCT customer_id) for "how many unique customers". A common bug is using COUNT(col) when you meant COUNT(*) and silently undercounting rows where col is NULL — audit against COUNT(*) to catch this.

What are ROLLUP, CUBE, and GROUPING SETS in SQL?

The rollup cube grouping sets sql family extends GROUP BY to produce multiple grain levels in one query. GROUPING SETS ((a, b), (a), ()) runs the query once per parenthesised tuple and UNION ALLs the results. ROLLUP(a, b, c) produces hierarchical subtotals — finest grain, then peel off the rightmost column each step, plus the grand total — n+1 levels for n columns. CUBE(a, b, c) produces every combination — 2^n levels. Use ROLLUP for finance subtotals (year → quarter → month → grand total), CUBE for exploratory OLAP, and GROUPING SETS for custom mixes. Supported by Postgres, SQL Server, Oracle, BigQuery, Snowflake; MySQL has legacy WITH ROLLUP.


Practice on PipeCode

PipeCode ships 450+ data-engineering interview problems — including PostgreSQL-first SQL practice keyed to GROUP BY, HAVING, aggregate functions, conditional aggregation, and the rollup cube grouping sets sql family. Whether you are drilling sql interview questions with answers for sql for data engineers loops or grinding through data engineering interview questions end-to-end, the practice library mirrors the same row-collapse mental model this guide teaches.

Kick off via Explore practice →; drill the dedicated GROUP BY SQL lane →; fan out into the HAVING clause SQL lane →; reinforce aggregation SQL drills →; rehearse conditional aggregation problems →; widen coverage on the full SQL practice library →.

Top comments (0)