Gowtham Potureddi

Hyper Data Engineering Interview Questions

Hyper data engineering interview questions are SQL end-to-end with a sharp retail-analytics edge: six SQL primitives (aggregation with ORDER BY ... LIMIT 1 for the most-expensive product, DATE_TRUNC('month', sale_date) with SUM for monthly revenue rollups, multi-table JOIN plus GROUP BY for state-level sales volume, ROW_NUMBER() OVER (PARTITION BY state ORDER BY units_sold DESC) for the best-selling product per state, HAVING COUNT(DISTINCT state) = total_states for nationwide-popularity completeness checks, and LAG(units) OVER (PARTITION BY category ORDER BY week) for week-over-week category growth tracking). The framings are retail data—SKU pricing, monthly revenue, state-level rankings, top-sellers per state, nationwide-availability checks, and weekly category growth deltas.

This guide walks through the six topic clusters Hyper actually tests, each with a detailed topic explanation, per-sub-topic explanation with a worked example and its solution, and an interview-style problem with a full solution that explains why it works. The mix matches the curated 6-problem Hyper set (3 easy, 2 medium, 1 hard)—a SQL-only hub that rewards aggregation discipline, partitioned ROW_NUMBER, and LAG-driven weekly-growth fluency. There is no Python in this loop; every minute of prep should go to SQL.



Top Hyper data engineering interview topics

From the Hyper data engineering practice set, the six numbered sections below follow this topic map (one row per H2):

| # | Topic (sections 1–6) | Why it shows up at Hyper |
|---|---|---|
| 1 | SQL aggregation and ORDER BY for the most expensive product | Most Expensive Product Sold — MAX(price) versus ORDER BY price DESC LIMIT 1 and the row-vs-value distinction. |
| 2 | SQL date truncation and SUM for monthly revenue rollups | Total Revenue by Month Report — DATE_TRUNC('month', sale_date) plus SUM(amount) plus ORDER BY month. |
| 3 | SQL multi-table JOINs for state-level sales volume | State with Highest Sales Volume — JOIN customers to bring state into a sales-keyed query, then GROUP BY state + ORDER BY total DESC LIMIT 1. |
| 4 | SQL window functions for best-selling product per state | Best-Selling Product by State — ROW_NUMBER() OVER (PARTITION BY state ORDER BY units_sold DESC) then WHERE rn = 1. |
| 5 | SQL JOINs with HAVING for nationwide-popularity products | Products with Nationwide Popularity — HAVING COUNT(DISTINCT state) = (SELECT COUNT(DISTINCT state) FROM customers) completeness check. |
| 6 | SQL window functions for weekly category growth tracking | Weekly Category Growth Tracker (HARD) — LAG(units) OVER (PARTITION BY category ORDER BY week) plus (curr − prev) * 100.0 / NULLIF(prev, 0). |

Retail-analytics framing rule: Hyper's prompts span pure retail SQL—product pricing, monthly revenue, state-level rankings, top-seller-per-state, nationwide-popularity completeness, weekly category growth. The interviewer is grading whether you map each business framing to the right SQL primitive: most-expensive → ORDER BY DESC LIMIT 1; monthly revenue → DATE_TRUNC('month', ...) + SUM; per-state totals → JOIN + GROUP BY + ORDER BY DESC LIMIT 1; per-state top-N → partitioned ROW_NUMBER; nationwide completeness → HAVING COUNT(DISTINCT state) = total; WoW growth → LAG over week-truncated partitions. State the mapping out loud.


1. SQL Aggregation and ORDER BY for the Most Expensive Product

Aggregation and ordered-row pickers in SQL for data engineering

"Find the most expensive product sold" is the canonical aggregation interview prompt for retail. The mental model: MAX(price) returns one number; ORDER BY price DESC LIMIT 1 returns one full row. Pick the form that matches what you actually need to emit. The MAX form is one aggregate; the ORDER BY form is one sort plus a LIMIT. Both are O(N) in row count but they pay different costs for different downstream needs.

Pro tip: Interviewers love asking "what if two products tie for the highest price?" The MAX(price) form silently collapses ties to one value; ORDER BY price DESC LIMIT 1 arbitrarily picks one row. State the tie-break explicitly: ORDER BY price DESC, name ASC LIMIT 1 or move to DENSE_RANK() OVER (ORDER BY price DESC) = 1 if the prompt wants all top-priced products.

MAX(price) returns the value, not the row

MAX(price) is a scalar aggregate—one column, one row, one value. The invariant: MAX answers "what is the highest price?" but cannot tell you which product had it without an additional self-join or subquery filter. If the question is "the highest price," MAX is correct in one line. If the question is "the product with the highest price," MAX alone is wrong.

  • SELECT MAX(price) FROM products — one row, column max.
  • SELECT * FROM products WHERE price = (SELECT MAX(price) FROM products) — full-row recovery via scalar subquery.
  • Tied tops — the subquery form returns all tied rows; the MAX form silently collapses to one number.
  • Empty table — MAX returns NULL; the subquery form returns zero rows.

Worked example. Five products. Use MAX to get the price; use the subquery form to get the row.

| product_id | name | price |
|---|---|---|
| 1 | Notebook | 14.50 |
| 2 | Laptop | 1299.00 |
| 3 | Mouse | 25.00 |
| 4 | Monitor | 459.00 |
| 5 | Keyboard | 89.00 |

Worked-example solution.

```sql
-- Just the value
SELECT MAX(price) AS max_price FROM products;
-- max_price = 1299.00

-- Full row, by scalar subquery
SELECT product_id, name, price
FROM products
WHERE price = (SELECT MAX(price) FROM products);
-- (2, Laptop, 1299.00)
```

Rule of thumb: if you only need a number, use MAX. If you need attributes alongside it (name, id, category), reach for the ORDER BY ... LIMIT or scalar-subquery form.

ORDER BY price DESC LIMIT 1 returns the full row

The ORDER BY ... LIMIT invariant: sort the table, then keep the top row. Most engines don't materialize the full sort—they use a top-N index pushdown when there's an index on the sort column, or a streaming top-K sort otherwise. Either way, the result shape is "one row with every selected column."

  • SELECT * FROM products ORDER BY price DESC LIMIT 1 — full row of the most expensive product.
  • SELECT name, price FROM products ORDER BY price DESC LIMIT 1 — pick specific columns.
  • OFFSET 1 — second-most expensive; generalizes to N-th.
  • Index helps — CREATE INDEX ON products(price DESC) makes the lookup near-O(log N).

Worked example. Same products table. Pull the top row, then the second-highest using OFFSET.

| query | result |
|---|---|
| ORDER BY price DESC LIMIT 1 | (2, Laptop, 1299.00) |
| ORDER BY price DESC LIMIT 1 OFFSET 1 | (4, Monitor, 459.00) |
| ORDER BY price DESC LIMIT 1 OFFSET 2 | (5, Keyboard, 89.00) |

Worked-example solution.

```sql
SELECT product_id, name, price
FROM products
ORDER BY price DESC
LIMIT 1;
-- (2, Laptop, 1299.00)

SELECT product_id, name, price
FROM products
ORDER BY price DESC
LIMIT 1 OFFSET 1;
-- (4, Monitor, 459.00)
```

Rule of thumb: ORDER BY ... LIMIT is the right primitive when the answer is "the row with the X-th biggest value of this column."

Tie-breaker: ORDER BY price DESC, name ASC LIMIT 1

The tie-break invariant: when two rows tie on the sort key, the result is non-deterministic unless you add a secondary key. Adding , name ASC (or any other unique-or-near-unique column) makes the result reproducible. Without a tie-break, your CI tests will flake when input rows ship in different orders.

  • ORDER BY price DESC, name ASC LIMIT 1 — deterministic top-priced product.
  • DENSE_RANK() OVER (ORDER BY price DESC) = 1 — return all tied top-priced products.
  • ROW_NUMBER() OVER (ORDER BY price DESC) = 1 — return one tied product (engine picks).
  • Always add a unique tie-break — even a synthetic id column works.

Worked example. Two products tie at 459.00.

| product_id | name | price |
|---|---|---|
| 4 | Monitor | 459.00 |
| 6 | Tablet | 459.00 |

| query | result |
|---|---|
| ORDER BY price DESC LIMIT 1 | non-deterministic (4 or 6) |
| ORDER BY price DESC, name ASC LIMIT 1 | (4, Monitor, 459.00) |
| DENSE_RANK() OVER (ORDER BY price DESC) = 1 | both 4 and 6 |

Worked-example solution.

```sql
-- Deterministic single row
SELECT product_id, name, price
FROM products
ORDER BY price DESC, name ASC
LIMIT 1;

-- All tied top-priced rows
SELECT product_id, name, price
FROM (
  SELECT product_id, name, price,
         DENSE_RANK() OVER (ORDER BY price DESC) AS dr
  FROM products
) t
WHERE dr = 1;
```

Rule of thumb: every LIMIT 1 after ORDER BY should have a tie-break column listed second, or you'll regret it on every retest.

Common beginner mistakes

  • Returning MAX(price) and the row's name in the same query without GROUP BY—a SQL error in strict mode.
  • Trying WHERE price = MAX(price)—aggregates aren't allowed in WHERE; use a scalar subquery or HAVING (see the sketch after this list).
  • Forgetting LIMIT 1—returns the entire sorted table.
  • Sorting by a different column than the prompt asked—ORDER BY price defaults to ascending; the most expensive needs DESC.
  • Skipping the tie-break and surprising the test harness with non-deterministic output.
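
To make the WHERE-aggregate mistake concrete, here is a minimal sketch against the same products table, showing the illegal form next to its scalar-subquery fix:

```sql
-- Illegal: an aggregate cannot appear in WHERE; the engine rejects it.
-- SELECT * FROM products WHERE price = MAX(price);

-- Fix: compute the aggregate in a scalar subquery first.
SELECT product_id, name, price
FROM products
WHERE price = (SELECT MAX(price) FROM products);
```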

SQL Interview Question on the Most Expensive Product

Table products(product_id INT, name TEXT, price NUMERIC). Return the single row for the most expensive product, with columns product_id, name, price. Break ties by alphabetical name ascending.

Solution Using ORDER BY price DESC, name ASC LIMIT 1

```sql
SELECT product_id, name, price
FROM products
ORDER BY price DESC, name ASC
LIMIT 1;
```

Step-by-step trace (input: 5 rows below):

| product_id | name | price |
|---|---|---|
| 1 | Notebook | 14.50 |
| 2 | Laptop | 1299.00 |
| 3 | Mouse | 25.00 |
| 4 | Monitor | 459.00 |
| 5 | Keyboard | 89.00 |

  1. Scan products — read all five rows into the sort operator.
  2. Sort by price DESC — order: Laptop 1299, Monitor 459, Keyboard 89, Mouse 25, Notebook 14.50.
  3. Apply secondary name ASC — no ties exist, so secondary order is a no-op here. (If two rows tied at 459, alphabetic ascending would deterministically pick the alphabetically-earlier name.)
  4. LIMIT 1 — keep only the first row from the sorted output.
  5. Project columns — emit product_id = 2, name = 'Laptop', price = 1299.00.

Output:

| product_id | name | price |
|---|---|---|
| 2 | Laptop | 1299.00 |

Why this works — concept by concept:

  • ORDER BY price DESC — sorts by the target column descending; the highest price ends up first.
  • Secondary name ASC tie-break — guarantees deterministic output when two products share the top price; without it, the result depends on storage order and can flake across runs.
  • LIMIT 1 — keeps exactly one row, which is precisely what the prompt requires.
  • Top-N pushdown — most engines don't materialize the full sort; they stream the top-K rows, so this is closer to O(N) than O(N log N) in practice.
  • Whole-row recovery — unlike MAX(price), this form lets us project the product's name and id without a self-join or scalar subquery.
  • Cost — one scan of products, partial top-K sort → O(N) time, O(1) extra memory beyond the streaming top-K buffer.

Practice → SQL aggregation problems · SQL sorting problems


2. SQL Date Truncation and SUM for Monthly Revenue Rollups

Date truncation and aggregation for time-series rollups in SQL for data engineering

"Build a total-revenue-by-month report" is the canonical time-series rollup interview prompt. The mental model: DATE_TRUNC('month', sale_date) collapses every day in a month to the same YYYY-MM-01 value, then SUM(amount) GROUP BY <truncated date> rolls revenue up to a single row per month. The same shape powers daily / weekly / quarterly rollups—swap the truncation argument and the grouping key follows.

Pro tip: Always cast or pin the truncated value to DATE (not TIMESTAMP) before grouping; mixing time zones across a TIMESTAMP WITH TIME ZONE column and a DATE_TRUNC call will silently bucket rows wrong at month boundaries. DATE_TRUNC('month', sale_date)::date is the safe form on PostgreSQL.

DATE_TRUNC('month', sale_date) collapses days to month-start

DATE_TRUNC rounds a date or timestamp down to the nearest unit. The invariant: for any sale_date in March 2026, DATE_TRUNC('month', sale_date) returns 2026-03-01. Every March row maps to the same bucket key.

  • DATE_TRUNC('month', d) — first day of the month containing d.
  • DATE_TRUNC('week', d) — first day of ISO week (Monday on PostgreSQL; engine-specific on others).
  • DATE_TRUNC('day', d) — strip time-of-day; useful when sale_date is a TIMESTAMP.
  • DATE_TRUNC('quarter', d) — first day of the quarter; pairs cleanly with quarterly revenue reports.

Worked example. Three sales rows in March + two in April. After DATE_TRUNC('month', sale_date):

| sale_id | sale_date | month_bucket |
|---|---|---|
| 1 | 2026-03-04 | 2026-03-01 |
| 2 | 2026-03-19 | 2026-03-01 |
| 3 | 2026-03-31 | 2026-03-01 |
| 4 | 2026-04-02 | 2026-04-01 |
| 5 | 2026-04-25 | 2026-04-01 |

Worked-example solution.

```sql
SELECT
  sale_id,
  sale_date,
  DATE_TRUNC('month', sale_date)::date AS month_bucket
FROM sales
ORDER BY sale_date;
```

Rule of thumb: whenever the prompt says "by month / by week / by quarter," reach for DATE_TRUNC before you reach for EXTRACT(MONTH FROM ...) — DATE_TRUNC keeps the year, so March 2025 and March 2026 stay distinct.

SUM(amount) over the month bucket

The aggregate invariant: SUM(amount) GROUP BY DATE_TRUNC('month', sale_date) produces one row per month with the total revenue. Every monthly aggregate report follows this shape; only the truncation unit and aggregate function change.

  • SUM(amount) — totals the column.
  • COUNT(*) — row count instead of total.
  • AVG(amount) — average instead of total.
  • COUNT(DISTINCT customer_id) — unique customers instead of revenue.

Worked example. Monthly revenue from the 5 rows above with amount values 50, 30, 20, 40, 60.

| month_bucket | rows | SUM(amount) |
|---|---|---|
| 2026-03-01 | 3 | 100.00 |
| 2026-04-01 | 2 | 100.00 |

Worked-example solution.

```sql
SELECT
  DATE_TRUNC('month', sale_date)::date AS month,
  SUM(amount) AS total_revenue
FROM sales
GROUP BY DATE_TRUNC('month', sale_date)
ORDER BY month;
```

Rule of thumb: make the GROUP BY expression match the SELECT expression exactly; engines disagree on whether aliases or positional references like GROUP BY 1 are allowed, so repeating the expression is the portable form.

ORDER BY month for time-sorted output

The output invariant: GROUP BY does not guarantee output order. For a time-series report, append ORDER BY month to surface chronological output. Without it, downstream charting tools see arbitrary order.

  • ORDER BY month — ascending chronological.
  • ORDER BY month DESC — most recent first; common for ops dashboards.
  • ORDER BY month + densification — for "include zero-revenue months," left-join against a generated calendar table.
  • No ORDER BY after GROUP BY — flaky output across engines and partition orderings.

Worked example. Add a January 2026 row with revenue 75. Output should now span Jan, Mar, Apr—but Feb is correctly absent (no sales).

| month | total_revenue |
|---|---|
| 2026-01-01 | 75.00 |
| 2026-03-01 | 100.00 |
| 2026-04-01 | 100.00 |

Worked-example solution.

```sql
SELECT
  DATE_TRUNC('month', sale_date)::date AS month,
  SUM(amount) AS total_revenue
FROM sales
GROUP BY DATE_TRUNC('month', sale_date)
ORDER BY month ASC;
```

Rule of thumb: a time-series report without ORDER BY is a bug waiting for a code review; always sort the output explicitly.

Common beginner mistakes

  • Using EXTRACT(MONTH FROM sale_date)—collapses every March across years into one bucket, losing the year.
  • Forgetting ::date cast—on TIMESTAMP WITH TIME ZONE columns, the result drifts at month boundaries by your session's offset.
  • Aliasing the truncated column as month and then GROUP BY month (engine-dependent—Postgres allows it, others don't); safer to repeat the expression.
  • Skipping ORDER BY—non-deterministic chronological output.
  • Densifying with LEFT JOIN generate_series(...) ON DATE_TRUNC = ...—correct if the prompt asks for zero-revenue months, but check first; the canonical answer is sparse. A sketch follows this list.
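
Here is that densification sketch, PostgreSQL-flavored and with illustrative date bounds (in practice derive them from MIN/MAX of sale_date):

```sql
-- Densified monthly report: months with no sales show 0 instead of vanishing.
-- generate_series is PostgreSQL-specific; the bounds below are illustrative.
SELECT
  cal.month::date AS month,
  COALESCE(SUM(s.amount), 0) AS total_revenue
FROM generate_series(DATE '2026-01-01', DATE '2026-04-01', INTERVAL '1 month') AS cal(month)
LEFT JOIN sales s
  ON DATE_TRUNC('month', s.sale_date) = cal.month
GROUP BY cal.month
ORDER BY cal.month;
```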

SQL Interview Question on Total Revenue by Month

Table sales(sale_id INT, sale_date DATE, amount NUMERIC). Return one row per month with the total revenue. Columns: month (first day of the month), total_revenue. Sort ascending by month.

Solution Using DATE_TRUNC and SUM

```sql
SELECT
  DATE_TRUNC('month', sale_date)::date AS month,
  SUM(amount) AS total_revenue
FROM sales
GROUP BY DATE_TRUNC('month', sale_date)
ORDER BY month ASC;
```

Step-by-step trace (input: 7 sales rows across 3 months):

| sale_id | sale_date | amount |
|---|---|---|
| 1 | 2026-01-15 | 75.00 |
| 2 | 2026-03-04 | 50.00 |
| 3 | 2026-03-19 | 30.00 |
| 4 | 2026-03-31 | 20.00 |
| 5 | 2026-04-02 | 40.00 |
| 6 | 2026-04-25 | 60.00 |
| 7 | 2026-04-30 | 25.00 |

  1. Scan sales — read all 7 rows.
  2. Compute DATE_TRUNC('month', sale_date) — January rows → 2026-01-01; March rows → 2026-03-01; April rows → 2026-04-01.
  3. GROUP BY DATE_TRUNC('month', sale_date) — three groups: Jan (1 row), Mar (3 rows), Apr (3 rows).
  4. SUM(amount) per group — Jan = 75; Mar = 50 + 30 + 20 = 100; Apr = 40 + 60 + 25 = 125.
  5. Project — month = 2026-01-01 / 2026-03-01 / 2026-04-01; total_revenue = 75 / 100 / 125.
  6. ORDER BY month ASC — sort the three rows ascending; Feb is correctly absent.

Output:

| month | total_revenue |
|---|---|
| 2026-01-01 | 75.00 |
| 2026-03-01 | 100.00 |
| 2026-04-01 | 125.00 |

Why this works — concept by concept:

  • DATE_TRUNC('month', sale_date) — collapses every day in a calendar month to the same first-of-month key, giving a stable group key for the aggregate.
  • ::date cast — strips any time / timezone component so the bucket key is a clean DATE value, avoiding boundary drift.
  • SUM(amount) — sums revenue within each month bucket; the canonical aggregate for monetary rollups.
  • GROUP BY matches the projected expression — safest across engines; some accept GROUP BY month (the alias) but not all.
  • Sparse-by-default — months with no sales rows are absent from the output; correct unless the prompt asks for zero-revenue rows via a calendar join.
  • ORDER BY month ASC — gives the consumer a chronological time-series; no client-side sort needed.
  • Cost — one scan of sales, one hash aggregate → O(N) time, O(M) space where M is the number of distinct months.

Practice → SQL date-functions problems · SQL date-arithmetic problems


3. SQL Multi-Table JOINs for State-Level Sales Volume

Multi-table JOINs and grouped aggregates in SQL for data engineering

"Find the state with the highest sales volume" is the canonical multi-table JOIN interview prompt. The mental model: the state column lives on customers; the amount column lives on sales; one is keyed by customer_id. You JOIN to bring state into a sales-keyed query, then GROUP BY state, sum amounts, and pick the top row. This is the bread-and-butter pattern for any cross-table rollup—orders by region, traffic by referrer, revenue by plan tier.

Pro tip: Always state the JOIN type out loud—INNER JOIN for "sales whose customer exists" (the canonical answer), LEFT JOIN if you're worried about orphaned customer_ids in sales. Hyper's prompts assume referential integrity, so INNER JOIN is the expected answer.

Inner JOIN: bring state into a sales-keyed query

INNER JOIN is the structural primitive: for every sales row, find the matching customers row by id and emit a combined row with both sides' columns. The invariant: the join key must be on the matching column from each table (here, sales.customer_id = customers.id).

  • INNER JOIN customers ON sales.customer_id = customers.id — combines rows that have a match on both sides.
  • LEFT JOIN — keeps sales rows that have no matching customer (rare, but useful for audit).
  • USING (customer_id) — terser when both columns share the same name.
  • Filter columns from both tables — you can SELECT customers.state, sales.amount after the join.

Worked example. Three sales joined to two customers.

sales:

| sale_id | customer_id | amount |
|---|---|---|
| 1 | 100 | 50 |
| 2 | 100 | 30 |
| 3 | 200 | 80 |

customers:

| id | name | state |
|---|---|---|
| 100 | Alice | CA |
| 200 | Bob | TX |

After INNER JOIN:

| sale_id | customer_id | amount | name | state |
|---|---|---|---|---|
| 1 | 100 | 50 | Alice | CA |
| 2 | 100 | 30 | Alice | CA |
| 3 | 200 | 80 | Bob | TX |

Worked-example solution.

```sql
SELECT s.sale_id, s.amount, c.state
FROM sales s
INNER JOIN customers c ON c.id = s.customer_id;
```

Rule of thumb: always alias your tables (s, c) — verbose sales.customer_id = customers.id gets unreadable in 4-table joins.

GROUP BY state + SUM(amount) for per-state totals

The aggregation invariant: after the JOIN, GROUP BY customers.state collapses every joined row into one row per state with SUM(amount). The state column drives the bucketing; amount feeds the aggregate.

  • GROUP BY c.state — one row per state.
  • SUM(s.amount) — total revenue per state.
  • COUNT(*) — number of sales per state (different metric).
  • COUNT(DISTINCT s.customer_id) — distinct customers per state.

Worked example. Continuing the previous join, group by state.

| state | SUM(amount) |
|---|---|
| CA | 80 |
| TX | 80 |

Worked-example solution.

```sql
SELECT c.state, SUM(s.amount) AS total_revenue
FROM sales s
INNER JOIN customers c ON c.id = s.customer_id
GROUP BY c.state
ORDER BY total_revenue DESC;
```

Rule of thumb: the GROUP BY key must include every selected column that is not behind an aggregate; otherwise strict-mode SQL will reject the query.

Top-state pick: ORDER BY total DESC LIMIT 1

The pick invariant: after GROUP BY, the row count is one per state; ORDER BY total DESC LIMIT 1 returns the top-revenue state. This is the same ORDER BY ... LIMIT 1 pattern from §1, just applied to an aggregated result instead of a raw table.

  • ORDER BY total_revenue DESC — descending revenue.
  • LIMIT 1 — top state only.
  • Tie-break — ORDER BY total_revenue DESC, c.state ASC LIMIT 1 if two states tie.
  • All-ties variant — WITH t AS (...) SELECT * FROM t WHERE total_revenue = (SELECT MAX(total_revenue) FROM t).

Worked example. Add a third customer in CA with amount = 25.

| state | SUM(amount) | rank |
|---|---|---|
| CA | 105 | 1 |
| TX | 80 | 2 |

After LIMIT 1: just (CA, 105).

Worked-example solution.

```sql
SELECT c.state, SUM(s.amount) AS total_revenue
FROM sales s
INNER JOIN customers c ON c.id = s.customer_id
GROUP BY c.state
ORDER BY total_revenue DESC, c.state ASC
LIMIT 1;
```

Rule of thumb: if the prompt says "the state with…" (singular), LIMIT 1 is the contract; if it says "the top three states," use LIMIT 3 and consider switching to RANK().
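
For the top-three variant, a sketch of the RANK() form (same sales and customers tables) that also keeps any states tied at the cut line:

```sql
-- Top three states by revenue; RANK() keeps states tied at the boundary.
WITH totals AS (
  SELECT c.state, SUM(s.amount) AS total_revenue
  FROM sales s
  INNER JOIN customers c ON c.id = s.customer_id
  GROUP BY c.state
)
SELECT state, total_revenue
FROM (
  SELECT state, total_revenue,
         RANK() OVER (ORDER BY total_revenue DESC) AS rk
  FROM totals
) ranked
WHERE rk <= 3
ORDER BY total_revenue DESC, state ASC;
```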

Common beginner mistakes

  • Forgetting the JOIN entirely and trying to GROUP BY state from sales—the state column doesn't exist on sales.
  • Joining sales.customer_id = customers.customer_id when the customer table's PK is id—a hard error if the column doesn't exist, or a silent zero-row result if a stray column does.
  • GROUP BY on customer_id instead of state—per-customer totals, not per-state.
  • Skipping ORDER BY before LIMIT—non-deterministic top row.
  • LEFT JOIN when the prompt assumes integrity—introduces NULL state rows that the aggregate then groups under one phantom bucket.

SQL Interview Question on State with Highest Sales Volume

Tables sales(sale_id INT, customer_id INT, amount NUMERIC) and customers(id INT, name TEXT, state TEXT). Return the single state with the highest total sales revenue and the revenue itself. Columns: state, total_revenue. Break ties alphabetically.

Solution Using INNER JOIN, GROUP BY, and LIMIT 1

```sql
SELECT
  c.state,
  SUM(s.amount) AS total_revenue
FROM sales s
INNER JOIN customers c ON c.id = s.customer_id
GROUP BY c.state
ORDER BY total_revenue DESC, c.state ASC
LIMIT 1;
```

Step-by-step trace (input: 4 sales × 3 customers):

sales:

| sale_id | customer_id | amount |
|---|---|---|
| 1 | 100 | 50 |
| 2 | 100 | 30 |
| 3 | 200 | 80 |
| 4 | 300 | 25 |

customers:

| id | name | state |
|---|---|---|
| 100 | Alice | CA |
| 200 | Bob | TX |
| 300 | Carol | CA |

  1. Scan + JOIN — every sales row is matched to its customers row via customer_id = id. Result: 4 joined rows (3 carrying state CA, 1 carrying TX).
  2. Project relevant columns — c.state, s.amount. The 4 joined rows: (CA, 50), (CA, 30), (TX, 80), (CA, 25).
  3. GROUP BY c.state — collapse to 2 groups: CA (3 rows), TX (1 row).
  4. SUM(s.amount) per group — CA = 50 + 30 + 25 = 105; TX = 80.
  5. ORDER BY total_revenue DESC, c.state ASC — primary descending revenue puts CA (105) before TX (80); secondary tie-break is unused here but would protect against future ties.
  6. LIMIT 1 — keep CA.
  7. Project — emit (CA, 105.00).

Output:

| state | total_revenue |
|---|---|
| CA | 105.00 |

Why this works — concept by concept:

  • INNER JOIN — pulls the state column out of customers and onto every matching sales row; the join is the only way to express "sales rolled up by customer attribute" in one query.
  • ON c.id = s.customer_id — the join key uses each table's actual primary / foreign key columns; mismatching them (e.g. s.id) silently returns wrong or zero rows.
  • GROUP BY c.state — bucket every joined row by the state column; SUM(s.amount) then totals revenue within each bucket.
  • SUM(s.amount) — the canonical aggregate for revenue rollups; counting rows (COUNT(*)) would answer a different question.
  • ORDER BY total_revenue DESC — sort the aggregate result descending so the top-revenue state ends up first.
  • Secondary c.state ASC tie-break — guarantees deterministic output if two states tie at the top.
  • LIMIT 1 — caps the result to one state, matching the prompt's singular contract.
  • Cost — one scan of sales, one indexed lookup per row into customers (or one hash join), one hash aggregate → O(N + M) time where N = sales rows and M = customers rows.

Practice → SQL JOIN problems · Hyper JOIN problems


4. SQL Window Functions for Best-Selling Product Per State

Partitioned ROW_NUMBER for top-N-per-group queries in SQL for data engineering

"Find the best-selling product in each state" is the canonical top-N-per-group window-function prompt. The mental model: partition the rows by state, rank within each partition by units_sold descending, then keep rows where the rank is 1. This is the structural pattern for any "top-K per group" question—best-seller per region, highest-paid employee per department, longest trip per driver. Replace LIMIT 1-after-ORDER BY with ROW_NUMBER() once you need the top per every group, not just one global top.

[Diagram: a sales table partitioned by state; ROW_NUMBER ranks within each partition; a WHERE rn = 1 filter produces one best-seller row per state.]

Pro tip: ROW_NUMBER(), RANK(), and DENSE_RANK() differ on ties. ROW_NUMBER breaks them arbitrarily (engine-dependent). RANK skips numbers after ties (1, 1, 3). DENSE_RANK doesn't skip (1, 1, 2). For "the best-seller per state," the prompt usually expects one row per state—ROW_NUMBER with a deterministic ORDER BY tie-break is the safe default.
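
To see the three tie behaviors side by side, a quick sketch you can run against the sales_by_state table used in this section:

```sql
-- Compare the three ranking functions on the same partition and order.
SELECT
  state, product, units_sold,
  ROW_NUMBER() OVER (PARTITION BY state ORDER BY units_sold DESC) AS row_num,    -- unique 1, 2, 3; ties broken arbitrarily
  RANK()       OVER (PARTITION BY state ORDER BY units_sold DESC) AS rnk,        -- ties share a rank, then a gap: 1, 1, 3
  DENSE_RANK() OVER (PARTITION BY state ORDER BY units_sold DESC) AS dense_rnk   -- ties share a rank, no gap: 1, 1, 2
FROM sales_by_state
ORDER BY state, units_sold DESC;
```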

PARTITION BY state for per-state windows

The window-function invariant: PARTITION BY slices the row set into independent groups; the window function then runs separately within each slice. For best-seller-per-state, the partition key is state; rankings within California don't bleed into Texas.

  • OVER (PARTITION BY state ...) — declare the partition key.
  • No PARTITION BY — the window spans the whole table (one global ranking).
  • Multiple partition keys — PARTITION BY region, state for hierarchical groups.
  • PARTITION BY is independent of GROUP BY — windows don't collapse rows; they annotate them.

Worked example. Six sales-by-state-by-product rows, partitioned by state.

| state | product | units_sold |
|---|---|---|
| CA | P1 | 120 |
| CA | P2 | 95 |
| CA | P3 | 80 |
| TX | P2 | 200 |
| TX | P4 | 150 |
| TX | P5 | 110 |

The partition splits this into two independent groups: {CA: P1, P2, P3} and {TX: P2, P4, P5}.

Worked-example solution.

```sql
-- Just attach the partition key and inspect with COUNT
SELECT
  state, product, units_sold,
  COUNT(*) OVER (PARTITION BY state) AS state_rowcount
FROM sales_by_state;
-- CA rows show state_rowcount = 3; TX rows show state_rowcount = 3
```

Rule of thumb: whenever the question says "per X" (per state, per category, per user), reach for PARTITION BY X.

ROW_NUMBER() OVER (... ORDER BY units DESC) ranks within each state

The ranking invariant: ROW_NUMBER() OVER (PARTITION BY state ORDER BY units_sold DESC) assigns 1 to the highest-selling product in each state, 2 to the second-highest, and so on. The ORDER BY inside the window is what defines "best."

  • ROW_NUMBER() — one row gets each rank; ties broken arbitrarily (or by your secondary ORDER BY key).
  • RANK() — ties share a rank; gaps follow.
  • DENSE_RANK() — ties share a rank; no gap.
  • Always add a tie-break inside ORDER BY — ORDER BY units_sold DESC, product ASC for determinism.

Worked example. Apply ROW_NUMBER() to the table above.

| state | product | units_sold | rn |
|---|---|---|---|
| CA | P1 | 120 | 1 |
| CA | P2 | 95 | 2 |
| CA | P3 | 80 | 3 |
| TX | P2 | 200 | 1 |
| TX | P4 | 150 | 2 |
| TX | P5 | 110 | 3 |

Worked-example solution.

```sql
SELECT
  state, product, units_sold,
  ROW_NUMBER() OVER (
    PARTITION BY state
    ORDER BY units_sold DESC, product ASC
  ) AS rn
FROM sales_by_state;
```

Rule of thumb: the function is ROW_NUMBER() (no argument); the magic happens entirely inside the OVER (...) clause.

Filter WHERE rn = 1 after the window

The filter invariant: window functions cannot be used in WHERE directly; wrap the windowed query in a CTE or subquery, then filter on the alias. This is a very common interview gotcha.

  • CTE form — WITH t AS (SELECT ..., ROW_NUMBER() OVER (...) AS rn FROM ...) SELECT * FROM t WHERE rn = 1.
  • Subquery form — SELECT * FROM (SELECT ..., ROW_NUMBER() OVER (...) AS rn FROM ...) sub WHERE rn = 1.
  • QUALIFY — supported on Snowflake / BigQuery / DuckDB; lets you skip the wrapper. Postgres does not support QUALIFY.
  • WHERE ROW_NUMBER() ... = 1 — illegal; engine error.

Worked example. Filter the table above to keep only rn = 1.

| state | product | units_sold |
|---|---|---|
| CA | P1 | 120 |
| TX | P2 | 200 |

Worked-example solution.

```sql
WITH ranked AS (
  SELECT
    state, product, units_sold,
    ROW_NUMBER() OVER (
      PARTITION BY state
      ORDER BY units_sold DESC, product ASC
    ) AS rn
  FROM sales_by_state
)
SELECT state, product, units_sold
FROM ranked
WHERE rn = 1
ORDER BY state;
```

Rule of thumb: "wrap in CTE, filter on the alias" is the correct ritual for every ROW_NUMBER()-based top-N-per-group query.

Common beginner mistakes

  • Trying WHERE ROW_NUMBER() OVER (...) = 1 — illegal; the engine errors.
  • Forgetting the ORDER BY inside the window — ROW_NUMBER then assigns ranks in undefined order.
  • Using RANK() and being confused when ties produce two rn = 1 rows for one state (correct! that's what RANK does — but ROW_NUMBER is what most prompts want).
  • PARTITION BY state, product — collapses each (state, product) pair into a one-row partition, every rn = 1, no useful ranking.
  • Skipping the tie-break inside the window's ORDER BY — non-deterministic best-seller output when units tie.

SQL Interview Question on the Best-Selling Product by State

Table sales_by_state(state TEXT, product TEXT, units_sold INT) with one row per (state, product) combination. Return the best-selling product in each state: the product with the highest units_sold per state. Break ties alphabetically by product. Sort output ascending by state.

Solution Using ROW_NUMBER() partitioned by state

```sql
WITH ranked AS (
  SELECT
    state,
    product,
    units_sold,
    ROW_NUMBER() OVER (
      PARTITION BY state
      ORDER BY units_sold DESC, product ASC
    ) AS rn
  FROM sales_by_state
)
SELECT state, product, units_sold
FROM ranked
WHERE rn = 1
ORDER BY state ASC;
```

Step-by-step trace (input: 7 rows, 3 states):

| state | product | units_sold |
|---|---|---|
| CA | P1 | 120 |
| CA | P2 | 95 |
| CA | P3 | 80 |
| NY | P1 | 60 |
| NY | P6 | 60 |
| TX | P2 | 200 |
| TX | P4 | 150 |

  1. Scan sales_by_state — read all 7 rows.
  2. Partition by state — three partitions: CA (3 rows), NY (2 rows), TX (2 rows).
  3. Apply ROW_NUMBER per partition — order by units_sold DESC, product ASC:
    • CA → P1 (120, rn=1), P2 (95, rn=2), P3 (80, rn=3).
    • NY → tie at 60 between P1 and P6; secondary product ASC gives P1 (rn=1), P6 (rn=2).
    • TX → P2 (200, rn=1), P4 (150, rn=2).
  4. Filter WHERE rn = 1 — keep one row per state: (CA, P1, 120), (NY, P1, 60), (TX, P2, 200).
  5. ORDER BY state ASC — alphabetic state order.

Output:

| state | product | units_sold |
|---|---|---|
| CA | P1 | 120 |
| NY | P1 | 60 |
| TX | P2 | 200 |

Why this works — concept by concept:

  • PARTITION BY state — slices the row set into independent per-state windows so each state gets its own ranking; rankings never cross state boundaries.
  • ORDER BY units_sold DESC, product ASC — primary descending units puts the bestseller at rank 1 within each partition; secondary ascending product is the tie-break (used by NY in the trace above).
  • ROW_NUMBER() — assigns a unique 1, 2, 3, … to every row in the partition; combined with our tie-break it's deterministic.
  • CTE wrapper — required because window functions cannot appear in WHERE; the CTE materializes the rank as a regular column we can then filter on.
  • WHERE rn = 1 — keeps only the top-ranked row per partition; this is the moment "rank within a group" turns into "the group's best."
  • ORDER BY state ASC — final sort over the three result rows; presents the answer in stable alphabetical order.
  • Cost — one scan of sales_by_state, one window sort partitioned by state → O(N log N) worst-case (typically O(N) with a hash-partition operator), O(N) space for the temporary annotation column.

Practice → Hyper window-function problems · SQL ranking problems


5. SQL JOINs with HAVING for Nationwide-Popularity Products

Completeness checks via HAVING and DISTINCT counts in SQL for data engineering

"Find products sold in every state" is the canonical completeness-check interview prompt. The mental model: a product is nationwide-popular iff it has been sold in **every state**. The shape: count distinct states where each product appears, then keep only products whose distinct-state count equals the total number of states. The structural primitive is HAVING COUNT(DISTINCT state) = (SELECT COUNT(DISTINCT state) FROM customers).

Pro tip: Do not hard-code the total-states number (e.g. = 50). Always pull it from a subquery so the query stays correct if the data set has fewer or more states (territories, partial coverage, future expansion).

Total-state subquery: SELECT COUNT(DISTINCT state) FROM customers

The subquery invariant: the total-states value must be computed from the same source the main query uses for state membership. If customers defines the set of states, the subquery must count distinct states from customers—not from a hard-coded list.

  • SELECT COUNT(DISTINCT state) FROM customers — scalar subquery returning one number.
  • COUNT(DISTINCT state) — counts unique non-NULL state values.
  • Inline as a scalar — usable in HAVING, WHERE, SELECT projection.
  • Materialize as a CTE — cleaner if reused multiple times in the same query.

Worked example. A customers table spanning 3 states.

| id | state |
|---|---|
| 100 | CA |
| 200 | TX |
| 300 | NY |
| 400 | CA |

COUNT(DISTINCT state) returns 3.

Worked-example solution.

```sql
SELECT COUNT(DISTINCT state) AS total_states FROM customers;
-- 3
```

Rule of thumb: always pull "total of X" values from a subquery in the same statement; never hard-code totals that depend on data.

HAVING COUNT(DISTINCT state) = total_states

The completeness invariant: after GROUP BY product, the per-product COUNT(DISTINCT state) is the number of distinct states where that product appears; equality with the total tells us the product appears in every state.

  • HAVING filters groups, not rows; WHERE filters rows pre-aggregate.
  • COUNT(DISTINCT state) is the per-product distinct-state count.
  • = total_states — strict equality; products missing one state are filtered out.
  • >= total_states — defensive; same answer in practice but documents the intent.

Worked example. Three products with sales in different state subsets.

| product | states sold in | distinct count | total | nationwide? |
|---|---|---|---|---|
| P1 | CA, TX, NY | 3 | 3 | yes |
| P2 | CA, TX | 2 | 3 | no |
| P3 | CA | 1 | 3 | no |

Worked-example solution.

```sql
WITH total_states AS (
  SELECT COUNT(DISTINCT state) AS n FROM customers
)
SELECT s.product
FROM sales s
JOIN customers c ON c.id = s.customer_id
GROUP BY s.product
HAVING COUNT(DISTINCT c.state) = (SELECT n FROM total_states);
```

Rule of thumb: HAVING is for predicates over aggregates; WHERE is for predicates over rows. The completeness check needs HAVING because it compares two aggregates.

JOIN sales → customers → group by product, filter

The composition invariant: JOIN to attach state to every sale, group by product, count distinct states per product, filter by completeness. Three steps in order; reordering breaks correctness.

  • JOIN customers ON ... — bring state onto each sale.
  • GROUP BY s.product — bucket per product.
  • COUNT(DISTINCT c.state) — distinct-states-per-product.
  • HAVING ... = total_states — keep only products in every state.

Worked example. Six sales rows, three products, three states.

| sale_id | customer_id | product | joined state |
|---|---|---|---|
| 1 | 100 | P1 | CA |
| 2 | 200 | P1 | TX |
| 3 | 300 | P1 | NY |
| 4 | 100 | P2 | CA |
| 5 | 200 | P2 | TX |
| 6 | 100 | P3 | CA |

After grouping by product: P1 sees {CA, TX, NY} → 3; P2 sees {CA, TX} → 2; P3 sees {CA} → 1. With total_states = 3, only P1 passes.

Worked-example solution.

```sql
SELECT s.product
FROM sales s
JOIN customers c ON c.id = s.customer_id
GROUP BY s.product
HAVING COUNT(DISTINCT c.state) = (
  SELECT COUNT(DISTINCT state) FROM customers
)
ORDER BY s.product;
```

Rule of thumb: the JOIN brings the dimension onto the fact; the GROUP BY collapses the fact; the HAVING filters the groups. Always in that order.

Common beginner mistakes

  • Hard-coding HAVING COUNT(DISTINCT state) = 50—silently wrong if the data has 49 or 51 states.
  • Using WHERE COUNT(DISTINCT state) = ...—aggregates aren't allowed in WHERE; engine error.
  • Counting COUNT(state) instead of COUNT(DISTINCT state)—double-counts states with multiple sales.
  • Forgetting the JOIN—state doesn't exist on sales.
  • Grouping by customer_id instead of product—per-customer state count, off-prompt.

SQL Interview Question on Products with Nationwide Popularity

Tables sales(sale_id INT, customer_id INT, product TEXT) and customers(id INT, state TEXT). Return the products that have been sold in every state present in customers. Sort ascending by product.

Solution Using JOIN, GROUP BY, and HAVING with a scalar subquery

```sql
SELECT s.product
FROM sales s
JOIN customers c ON c.id = s.customer_id
GROUP BY s.product
HAVING COUNT(DISTINCT c.state) = (
  SELECT COUNT(DISTINCT state) FROM customers
)
ORDER BY s.product ASC;
```

Step-by-step trace (input: 8 sales × 4 customers across 3 states):

sales:

| sale_id | customer_id | product |
|---|---|---|
| 1 | 100 | P1 |
| 2 | 200 | P1 |
| 3 | 300 | P1 |
| 4 | 100 | P2 |
| 5 | 200 | P2 |
| 6 | 100 | P3 |
| 7 | 300 | P3 |
| 8 | 400 | P3 |

customers:

| id | state |
|---|---|
| 100 | CA |
| 200 | TX |
| 300 | NY |
| 400 | CA |

  1. Scalar subquery — (SELECT COUNT(DISTINCT state) FROM customers) evaluates to 3.
  2. JOIN — every sale gets its customer's state. P1 sees CA, TX, NY; P2 sees CA, TX; P3 sees CA, NY, CA.
  3. GROUP BY s.product — three groups (P1, P2, P3).
  4. Per-group COUNT(DISTINCT c.state) — P1 → 3, P2 → 2, P3 → 2 (CA, NY—the second CA is a duplicate).
  5. HAVING ... = 3 — keep only P1.
  6. ORDER BY s.product ASC — single row, alphabetic order is trivially satisfied.

Output:

| product |
|---|
| P1 |

Why this works — concept by concept:

  • JOIN to attach statestate lives on customers; the join brings it onto every sale so GROUP BY product can see which states a product reached.
  • COUNT(DISTINCT c.state) — counts unique state values per product; DISTINCT is the active ingredient (without it, multiple sales per state inflate the count).
  • Scalar subquery for total_states — pulls the canonical state count from customers so the query auto-adapts as the data grows or shrinks.
  • HAVING over aggregatesHAVING is the correct clause for filtering on aggregates; WHERE would be a syntax error.
  • Equality check — strict = ensures we only keep products with full state coverage; >= would behave the same but is less idiomatic.
  • ORDER BY product ASC — deterministic output even when multiple products qualify.
  • Cost — one scan of sales, one scan of customers, one hash join, one hash aggregate, one scalar-subquery evaluation → O(N + M) time.

Practice → SQL HAVING-clause problems · SQL grouping problems


6. SQL Window Functions for Weekly Category Growth Tracking

LAG-driven window functions for week-over-week growth in SQL for data engineering

"Track week-over-week growth per category" is the Hard-tier interview prompt for retail analytics. The mental model: DATE_TRUNC('week', sale_date) collapses days to ISO-week starts; LAG(units) OVER (PARTITION BY category ORDER BY week) exposes last week's units alongside this week's; (curr − prev) * 100.0 / NULLIF(prev, 0) is the percentage formula. The same pattern handles month-over-month, day-over-day, hour-over-hour. Just swap the truncation unit.

[Diagram: a horizontal week axis showing units sold per category per week, a LAG arrow from week N to week N+1, and a (curr − prev) * 100.0 / prev formula label.]

Pro tip: Always wrap the divisor in NULLIF(prev, 0). If a category has zero units in the prior week, the unguarded formula divides by zero (an engine error). With NULLIF(prev, 0) the divisor becomes NULL, the whole expression short-circuits to NULL, and downstream consumers can treat it as "no comparable baseline." A missing prior week already yields NULL via LAG, which propagates the same way.

DATE_TRUNC('week', sale_date) for weekly bucketing

The week-bucket invariant: DATE_TRUNC('week', sale_date) returns the Monday of the ISO week containing sale_date (Postgres default). Same idea as DATE_TRUNC('month', ...) from §2; only the unit changes.

  • DATE_TRUNC('week', d) — first day of ISO week (engine-specific: Postgres = Monday; Snowflake = Sunday by default; check your engine).
  • Cast to DATE — DATE_TRUNC('week', sale_date)::date.
  • Group key — bucket all daily sales into the week's first day.
  • Year wraparound — December 30 of one year often shares the same ISO-week with January 5 of the next; that's correct ISO behavior, not a bug.

Worked example. Five sale days collapsed to two ISO weeks.

| sale_date | week_bucket |
|---|---|
| 2026-04-06 (Mon) | 2026-04-06 |
| 2026-04-09 (Thu) | 2026-04-06 |
| 2026-04-12 (Sun) | 2026-04-06 |
| 2026-04-13 (Mon) | 2026-04-13 |
| 2026-04-16 (Thu) | 2026-04-13 |

Worked-example solution.

```sql
SELECT
  sale_date,
  DATE_TRUNC('week', sale_date)::date AS week_bucket
FROM sales
ORDER BY sale_date;
```

Rule of thumb: if the prompt says "week," confirm the engine's week-start convention before writing code. Postgres, Snowflake, and BigQuery disagree.
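
A safe way to confirm the convention is a probe on a date whose weekday you know; 2026-04-09 is a Thursday, so the returned week start tells you your engine's rule:

```sql
-- Probe the week-start convention: 2026-04-09 is a Thursday.
-- Postgres (ISO weeks) returns 2026-04-06, the preceding Monday.
SELECT DATE_TRUNC('week', DATE '2026-04-09')::date AS week_start;
```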

LAG(units) OVER (PARTITION BY category ORDER BY week)

The lag invariant: LAG(col) returns the value of col from the previous row within the partition, ordered by the same key. Partition by category so previous-week lookups never cross categories; order by week so "previous" means "previous week chronologically."

  • PARTITION BY category — independent windows per category.
  • ORDER BY week — defines what "previous" means.
  • LAG(units) — returns NULL for the first row in each partition (no prior row).
  • LAG(col, n, default) — n-row lag with optional default value when no prior row exists.

Worked example. Two categories × three weeks each.

| category | week | units | lag(units) |
|---|---|---|---|
| electronics | 2026-04-06 | 100 | NULL |
| electronics | 2026-04-13 | 120 | 100 |
| electronics | 2026-04-20 | 150 | 120 |
| apparel | 2026-04-06 | 80 | NULL |
| apparel | 2026-04-13 | 85 | 80 |
| apparel | 2026-04-20 | 70 | 85 |

Worked-example solution.

```sql
WITH weekly AS (
  SELECT
    category,
    DATE_TRUNC('week', sale_date)::date AS week,
    SUM(units) AS units
  FROM sales
  GROUP BY category, DATE_TRUNC('week', sale_date)
)
SELECT
  category, week, units,
  LAG(units) OVER (PARTITION BY category ORDER BY week) AS prev_units
FROM weekly
ORDER BY category, week;
```

Rule of thumb: always pre-aggregate to the bucket grain (one row per category-week) before applying LAG; otherwise LAG jumps between intra-week rows of the same category.

WoW formula: (curr - prev) * 100.0 / NULLIF(prev, 0)

The growth invariant: (curr - prev) * 100.0 / prev is the percentage change from the prior period to the current period. Multiply by 100.0 (not 100) to force float division; wrap the divisor in NULLIF(prev, 0) to avoid divide-by-zero on the first week per category.

  • (curr - prev) * 100.0 — float multiplier; makes the result a proper percentage.
  • NULLIF(prev, 0) — returns NULL when prev = 0; the whole expression then returns NULL (no error).
  • ROUND(..., 2) — clamp to 2 decimals for readable output.
  • First-week NULL — the first week per category has no prev, so LAG returns NULL and the formula propagates NULL; correct, not a bug.

Worked example. Apply the formula to the table above.

| category | week | units | prev_units | growth_pct |
|---|---|---|---|---|
| electronics | 2026-04-06 | 100 | NULL | NULL |
| electronics | 2026-04-13 | 120 | 100 | 20.00 |
| electronics | 2026-04-20 | 150 | 120 | 25.00 |
| apparel | 2026-04-06 | 80 | NULL | NULL |
| apparel | 2026-04-13 | 85 | 80 | 6.25 |
| apparel | 2026-04-20 | 70 | 85 | -17.65 |

Worked-example solution.

```sql
WITH weekly AS (
  SELECT
    category,
    DATE_TRUNC('week', sale_date)::date AS week,
    SUM(units) AS units
  FROM sales
  GROUP BY category, DATE_TRUNC('week', sale_date)
),
with_prev AS (
  SELECT
    category, week, units,
    LAG(units) OVER (PARTITION BY category ORDER BY week) AS prev_units
  FROM weekly
)
SELECT
  category, week, units,
  ROUND(
    (units - prev_units) * 100.0 / NULLIF(prev_units, 0),
    2
  ) AS growth_pct
FROM with_prev
ORDER BY category, week;
```

Rule of thumb: growth percentages always need 100.0 (not 100), NULLIF(prev, 0), and ROUND(..., 2)—skipping any of the three breaks correctness or readability.

Common beginner mistakes

  • Forgetting NULLIF(prev, 0) — divide-by-zero on first-week-per-category or zero-baseline weeks.
  • Using 100 instead of 100.0 — integer division silently truncates the decimals, rounding small growth all the way to zero (see the sketch after this list).
  • Skipping the pre-aggregation CTE — LAG jumps over intra-week rows and the comparison becomes meaningless.
  • LAG without PARTITION BY category — comparisons cross category boundaries.
  • Returning the absolute delta instead of the percentage — common reading mistake; the prompt usually says "growth %," not "delta."
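
The 100-versus-100.0 mistake is cheap to reproduce; a sketch using the apparel numbers from the trace below, under PostgreSQL's truncating integer division:

```sql
-- Same inputs, different arithmetic: integer division truncates the decimals.
SELECT
  (85 - 80) * 100   / 80 AS int_growth,    -- 500 / 80 = 6 (truncated)
  (85 - 80) * 100.0 / 80 AS float_growth;  -- 6.25
```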

SQL Interview Question on Weekly Category Growth Tracker

Table sales(sale_id INT, sale_date DATE, category TEXT, units INT). For each (category, week) pair, return the units sold that week and the week-over-week growth percentage versus the prior week (NULL if no prior week exists for that category). Columns: category, week, units, growth_pct. Sort ascending by category, then by week. Round growth to 2 decimals.

Solution Using DATE_TRUNC, LAG, and the (curr − prev) * 100.0 / NULLIF(prev, 0) formula

```sql
WITH weekly AS (
  SELECT
    category,
    DATE_TRUNC('week', sale_date)::date AS week,
    SUM(units) AS units
  FROM sales
  GROUP BY category, DATE_TRUNC('week', sale_date)
),
with_prev AS (
  SELECT
    category,
    week,
    units,
    LAG(units) OVER (PARTITION BY category ORDER BY week) AS prev_units
  FROM weekly
)
SELECT
  category,
  week,
  units,
  ROUND(
    (units - prev_units) * 100.0 / NULLIF(prev_units, 0),
    2
  ) AS growth_pct
FROM with_prev
ORDER BY category ASC, week ASC;
```

Step-by-step trace (input: 8 sales rows across 2 categories × 3 weeks):

| sale_id | sale_date | category | units |
|---|---|---|---|
| 1 | 2026-04-06 | electronics | 100 |
| 2 | 2026-04-13 | electronics | 120 |
| 3 | 2026-04-20 | electronics | 80 |
| 4 | 2026-04-20 | electronics | 70 |
| 5 | 2026-04-06 | apparel | 80 |
| 6 | 2026-04-13 | apparel | 85 |
| 7 | 2026-04-20 | apparel | 40 |
| 8 | 2026-04-20 | apparel | 30 |

  1. Build weekly — group by (category, DATE_TRUNC('week', sale_date)) and sum units. Result:

| category | week | units |
|----------|------|------:|
| electronics | 2026-04-06 | 100 |
| electronics | 2026-04-13 | 120 |
| electronics | 2026-04-20 | 150 |
| apparel | 2026-04-06 | 80 |
| apparel | 2026-04-13 | 85 |
| apparel | 2026-04-20 | 70 |

  2. Build with_prev — apply LAG(units) OVER (PARTITION BY category ORDER BY week):

| category | week | units | prev_units |
|----------|------|------:|-----------:|
| electronics | 2026-04-06 | 100 | NULL |
| electronics | 2026-04-13 | 120 | 100 |
| electronics | 2026-04-20 | 150 | 120 |
| apparel | 2026-04-06 | 80 | NULL |
| apparel | 2026-04-13 | 85 | 80 |
| apparel | 2026-04-20 | 70 | 85 |

  3. Apply the growth formula — (units - prev_units) * 100.0 / NULLIF(prev_units, 0) rounded to 2 decimals:
    • electronics 04-06: (100 - NULL) ... → NULL.
    • electronics 04-13: (120 - 100) * 100.0 / 100 = 20.00.
    • electronics 04-20: (150 - 120) * 100.0 / 120 = 25.00.
    • apparel 04-06: NULL.
    • apparel 04-13: (85 - 80) * 100.0 / 80 = 6.25.
    • apparel 04-20: (70 - 85) * 100.0 / 85 = -17.65.
  4. ORDER BY category ASC, week ASC — alphabetic categories then chronological weeks.

Output:

| category | week | units | growth_pct |
|---|---|---|---|
| apparel | 2026-04-06 | 80 | NULL |
| apparel | 2026-04-13 | 85 | 6.25 |
| apparel | 2026-04-20 | 70 | -17.65 |
| electronics | 2026-04-06 | 100 | NULL |
| electronics | 2026-04-13 | 120 | 20.00 |
| electronics | 2026-04-20 | 150 | 25.00 |

Why this works — concept by concept:

  • DATE_TRUNC('week', sale_date) — collapses every day in an ISO week to that week's first day, giving a stable group key for weekly aggregation.
  • Pre-aggregation CTE weekly — produces one row per (category, week) pair so LAG operates at the right grain (week, not day).
  • LAG(units) OVER (PARTITION BY category ORDER BY week) — exposes the prior-week units for each category-week pair without crossing categories.
  • PARTITION BY category — keeps each category's growth series isolated; growth never leaks between categories.
  • ORDER BY week inside the window — defines "previous" as "previous week chronologically," not insertion order.
  • NULLIF(prev_units, 0) — guards divide-by-zero on first-week rows (where LAG returns NULL) and zero-baseline weeks; the entire growth expression then propagates NULL safely.
  • * 100.0 float multiplier — forces float division so the percentage is a real decimal, not integer-truncated.
  • ROUND(..., 2) — clamps the result to two decimal places for readable output.
  • Cost — one scan of sales, one hash aggregate for the CTE, one window pass with a partition sort → O(N log N) worst-case; typical O(N) with hash-partitioning.

Practice → Hyper aggregation problems · SQL date-arithmetic problems


Tips to Crack Hyper Data Engineering Interviews

SQL is the entire loop—skip Python prep

Every problem in the curated Hyper SQL practice set is SQL. There is no Python in this loop. Spending interview-prep time on Python algorithms is a misallocation; spend it on DATE_TRUNC, LAG, partitioned ROW_NUMBER, and HAVING COUNT(DISTINCT) = patterns. Expect 4–6 SQL coding problems across the loop, all framed as retail analytics.

Drill the three window-function patterns Hyper actually asks

Two of the six Hyper problems are window functions: ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) for top-N-per-group (#523), and LAG(...) over a partition for week-over-week growth (#525—the Hard-tier problem). The third pattern Hyper might combine into a follow-up is RANK() or DENSE_RANK() for "all top-tied rows." Drill the window-functions practice page until the CTE-wrapper-then-filter ritual is automatic.

Retail-analytics framing on every prompt

Frame your verbal walkthrough in retail vocabulary: SKUs, products, categories, states, monthly revenue, weekly growth, cohorts, sell-through. Interviewers test whether you can translate the abstract SQL primitive to the business question—matching the interviewer's mental model of what their team builds. The Hyper medium practice set leans on retail framings; mirror that vocabulary in your answer.

Completeness checks via HAVING are a Hyper signature

The "products sold in every state" pattern (#524) is unusual; most candidates haven't seen HAVING COUNT(DISTINCT state) = (SELECT COUNT(DISTINCT state) FROM customers) written down. Memorize it. Generalize: any "X covers all Ys" prompt becomes HAVING COUNT(DISTINCT y) = (SELECT COUNT(DISTINCT y) FROM y_table). Drill more variants on the HAVING-clause practice page.

Easy-tier discipline still matters

Three of the six Hyper problems are EASY-tier (#520, #521, #522). Easy at Hyper means: the interviewer expects clean, deterministic SQL with no off-by-ones, every ORDER BY followed by a tie-break key, every aggregate paired with the right GROUP BY. Stuttering through an EASY-tier problem signals worse than failing a HARD-tier one. The Hyper easy practice set is the right warmup.

Where to practice on PipeCode

Start with the Hyper practice page for the curated 6-problem set. After that, drill the matching topic pages: aggregation, joins, window-functions, HAVING-clause, date-functions. The interview courses page bundles the SQL course if you want a structured curriculum before company drills. For broad coverage, browse by topic.

Communication and approach under time pressure

For every prompt, narrate the structure before writing code: "I'll JOIN sales to customers to bring in state, group by product, count distinct states per product, then HAVING against the total state count." Interviewers grade process as much as the final answer. Leave 5 minutes at the end of each problem for an edge-case sweep: empty input, ties, NULL columns, divide-by-zero in growth formulas, missing prior-period rows in LAG. The most common "almost passed" failure mode is a correct happy-path query that crashes on the empty input—a 30-second sweep prevents it.


Frequently Asked Questions

What is the Hyper data engineering interview process like?

Hyper's data engineering interview opens with a recruiter screen, then a technical phone screen with one SQL coding problem, then an onsite (or virtual onsite) of four to five rounds: two to three SQL coding rounds (aggregation, joins, window functions, HAVING completeness checks), one data-modeling discussion (star schemas, retail dimensional models), and one to two behavioral rounds. Hyper does not test Python in the data engineering loop; every coding round is SQL on retail-analytics framings.

What SQL topics does Hyper test for data engineers?

Hyper SQL interviews concentrate on six primitives that correspond directly to the curated 6-problem set: aggregation with ORDER BY ... LIMIT 1 for top-N rows, DATE_TRUNC('month' / 'week', ...) for date-bucketed rollups, multi-table INNER JOIN plus GROUP BY for cross-table aggregates, ROW_NUMBER() OVER (PARTITION BY ...) for top-N-per-group, HAVING COUNT(DISTINCT) = total for completeness checks, and LAG(...) over partitioned ordered windows for period-over-period growth. Memorize these six patterns and you will recognize most Hyper prompts within the first 60 seconds.

Does Hyper test Python in data engineering interviews?

The curated Hyper practice set is 6 SQL, 0 Python—Hyper's data engineering interview is SQL-only. If a Python question appears, treat it as a different role family (data scientist, analytics engineer with Python responsibilities) rather than the canonical data-engineering loop covered in this guide. Spending prep time on Python algorithms for a Hyper data-engineering interview is a misallocation; SQL fluency is the entire signal.

How hard are Hyper data engineering interview questions?

The Hyper set is 3 easy + 2 medium + 1 hard. The Hard-tier problem (#525 Weekly Category Growth Tracker) is LAG-driven WoW growth over a partitioned, week-truncated window with NULLIF divide-by-zero handling—make-or-break for senior roles. The two medium problems test partitioned ROW_NUMBER (#523) and HAVING COUNT(DISTINCT) = completeness checks (#524); both are pattern-recognition once you have seen them. The three EASY problems are aggregation classics where interviewers expect zero hesitation and clean tie-break handling.

Are window functions and weekly growth common in Hyper interviews?

Yes—two of six problems are window-function-driven, and the Hard-tier problem is explicitly week-over-week growth with LAG. If you cannot fluently write LAG(units) OVER (PARTITION BY category ORDER BY week) and the (curr - prev) * 100.0 / NULLIF(prev, 0) growth formula, you cannot pass the Hyper loop. Drill the window-functions practice page and the Hyper window-functions topic page until the patterns are muscle memory.

How many Hyper practice problems should I solve before the interview?

Solve all 6 problems on the Hyper practice page end-to-end—untimed first, then timed at 25 minutes per problem. After that, broaden to 30 to 40 additional SQL problems spread across the matching topic pages: aggregation, joins, window-functions, HAVING-clause, date-functions, ranking, date-arithmetic. The Hyper SQL practice page and the Hyper medium practice page are the right surfaces for the final week of prep.


Start practicing Hyper data engineering problems
