Gowtham Potureddi

Posted on Jun 2

SQL Cheat Sheet: Clause Order, Joins, Aggregates, Windows (2026)

#python #sql #interview #dataengineering

A good sql cheat sheet is the single highest-leverage artefact a data engineer can keep open during interview prep — it compresses the language into the eight clauses that run, the eight joins that ship, the five aggregates that count, and the four window-function families that sort. When the screen-share clock starts and the interviewer says "write a query that ranks orders within each customer," you do not have time to scroll Stack Overflow; you need the muscle memory to type ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_ts DESC) without thinking.

This guide is the sql language cheat sheet every senior DE wishes they had at the start of their career. It is also a working sql query cheat sheet for production code reviews — the same clause-order rules that make WHERE reject a SELECT alias also make the planner prune partitions before any expensive join runs. We will walk the sql commands cheat sheet layer by layer: how a query is parsed versus executed, every join shape with a Venn diagram and a one-line snippet, the aggregate functions and their NULL pitfalls, and the window-function syntax with its frame clauses. Treat it as a sql syntax reference for interview prep, a sql reference card for the first six months on the job, and a teaching deck for whoever is mentoring you.

When you want hands-on reps immediately after each section, browse SQL practice library →, drill join problems →, and rehearse on window-function drills →.

On this page

Why every DE needs a SQL cheat sheet in 2026
The SQL clause execution order — what runs when
Joins cheat sheet — INNER, LEFT, RIGHT, FULL, SELF, ANTI, SEMI, CROSS
Aggregates and GROUP BY cheat sheet
Window functions cheat sheet
Cheat sheet — copy-paste templates
Frequently asked questions
Practice on PipeCode

1. Why every DE needs a SQL cheat sheet in 2026

A working SQL reference card is interview survival kit — and on-the-job leverage

The one-sentence invariant: SQL is still the lingua franca of data — every warehouse, every lakehouse, every streaming engine eventually exposes a SQL surface, and every DE interview tests whether you can write one in 90 seconds without an IDE. A printable sql syntax reference turns "I think I remember the syntax" into "I know the syntax cold" — which is the difference between solving the question on the whiteboard and burning four of your forty-five minutes on a missing GROUP BY.

The three muscle-memory levels every DE goes through.

Level 1 — syntax recall. You can type SELECT … FROM … WHERE … without typos. You know JOIN defaults to INNER. You know COUNT(*) is different from COUNT(col). This is the bar a junior interview tests.
Level 2 — plan reading. You can read an EXPLAIN ANALYZE and tell whether the planner picked a hash join, a nested loop, or a merge join. You can tell when a WHERE predicate was pushed down past a GROUP BY and when it wasn't. This is the bar a senior interview tests.
Level 3 — optimization. You can rewrite a correlated subquery as a join, an IN as an EXISTS, or a 5-table star join as a CTE chain — and explain why each rewrite changes the plan. This is the bar staff-level rounds test.

Why a one-page reference is so much faster than the docs.

Latency. Postgres docs are 4,500 pages. The most-used clauses live on one card. In a 45-minute interview, every second spent searching the docs is a second not spent thinking.
Coverage in the right shape. Documentation is reference order — alphabetical, exhaustive, abstract. A cheat sheet is decision order — "I need to rank within a group → here is ROW_NUMBER and the partition syntax" — which is the shape your brain queries during an interview.
Cross-dialect coverage. Postgres, MySQL, Snowflake, and BigQuery diverge on half the clauses. A good sql commands cheat sheet lists the dialect deltas inline, so you don't ship a LIMIT 10 to Snowflake where you needed LIMIT 10 to BigQuery's LIMIT 10 syntax (or TOP 10 on T-SQL).

Dialect differences that show up in interviews.

Pagination. Postgres / MySQL / Snowflake use LIMIT N OFFSET M; SQL Server uses OFFSET … FETCH NEXT; legacy T-SQL uses TOP N. BigQuery accepts both LIMIT and FETCH FIRST.
String concat. Postgres / Standard SQL: ||. MySQL: CONCAT(). SQL Server: +. Cross-dialect-safe: CONCAT() (works in all four).
Date arithmetic. Postgres: now() - interval '7 days'. MySQL: NOW() - INTERVAL 7 DAY. Snowflake / BigQuery: DATE_SUB(CURRENT_DATE, INTERVAL 7 DAY). Always say the dialect out loud during the interview — interviewers reward it.
Window functions. Standard across all four engines for ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, SUM() OVER. MySQL only added them in 8.0; BigQuery has additional analytical extensions like PERCENTILE_CONT.
QUALIFY. Snowflake / BigQuery / Databricks. Filters on a window-function result without a subquery. Postgres / MySQL still need a subquery.

The 2026 reality check.

AI assistants. They handle the boilerplate but still hallucinate joins on the wrong key and miss NULL semantics. A human cheat sheet is the verification layer.
Cross-engine workloads. A senior DE in 2026 typically writes against Snowflake, BigQuery, and Postgres in the same week. The cheat sheet is the cognitive shock absorber for dialect switching.
Interview shapes. SQL still dominates the 60-minute screen at every FAANG-tier DE loop — usually one window-function problem, one join correctness problem, and one aggregation-with-edge-cases problem.

What interviewers listen for when they ask a SQL question.

Do you ask "which dialect" before writing? — senior signal.
Do you say "GROUP BY runs before SELECT, so I can't reuse a SELECT alias here" without prompting? — clause-order mastery signal.
Do you reach for EXISTS over IN for semi-joins on nullable columns? — required answer.
Do you choose ROW_NUMBER over RANK deliberately, naming the tie-breaking semantics? — window-function fluency signal.

Worked example — the 90-second SQL whiteboard

Detailed explanation. A common opening probe is "write the SQL for the top order per customer." It tests three things at once: do you know window functions, do you handle ties correctly, and do you wrap them in a sub-query because WHERE cannot reference a window result.

Question. Given an orders(customer_id, order_id, amount, order_ts) table, write a query that returns one row per customer — the row corresponding to the largest single amount for that customer, with ties broken by the latest order_ts. Use only standard SQL.

Input.

customer_id	order_id	amount	order_ts
7	1001	120.00	2026-05-01 09:00
7	1002	250.00	2026-05-04 11:10
7	1003	250.00	2026-05-06 08:42
9	2001	80.00	2026-05-02 10:00
9	2002	95.00	2026-05-07 14:20

Code.

WITH ranked AS (
  SELECT
    customer_id,
    order_id,
    amount,
    order_ts,
    ROW_NUMBER() OVER (
      PARTITION BY customer_id
      ORDER BY amount DESC, order_ts DESC
    ) AS rn
  FROM orders
)
SELECT customer_id, order_id, amount, order_ts
FROM ranked
WHERE rn = 1;

Step-by-step explanation.

Window first. ROW_NUMBER() assigns a strictly-increasing integer per partition. PARTITION BY customer_id restarts the count at every new customer. ORDER BY amount DESC, order_ts DESC ranks by amount, with the latest timestamp winning ties.
Wrap in a CTE. WHERE runs before SELECT, so it cannot reference the alias rn directly. The CTE materialises the rank, then the outer SELECT filters on it.
Why ROW_NUMBER, not RANK. RANK would tie order 1002 and 1003 at rank 1 and return both. ROW_NUMBER gives a deterministic tie-breaker via the secondary ORDER BY.
Why no GROUP BY. A GROUP BY customer_id with MAX(amount) would collapse the rows and lose order_id and order_ts. Window functions keep every input row available.

Output.

customer_id	order_id	amount	order_ts
7	1003	250.00	2026-05-06 08:42
9	2002	95.00	2026-05-07 14:20

Rule of thumb. "Top-N per group" is always ROW_NUMBER over a partition, filtered in an outer query. Reach for it before any MAX(...) + self-join contortion.

Worked example — write the same query in four dialects

Detailed explanation. A senior interview round sometimes asks "how would this same query change if you ran it on Snowflake instead of Postgres?" — testing whether you can switch dialects on the fly. The query below — top order per customer — appears in every DE interview shop's question bank in some form, and writing it across four dialects in 60 seconds is a senior-level fluency signal.

Question. Write the "top order per customer" query in PostgreSQL (using a CTE), Snowflake (using QUALIFY), BigQuery (using QUALIFY), and MySQL 8.0 (using a CTE). Same semantics, dialect-appropriate idioms.

Input.

customer_id	order_id	amount
7	1001	120
7	1002	250
9	2001	80
9	2002	95

Code (PostgreSQL — CTE).

WITH ranked AS (
  SELECT
    customer_id,
    order_id,
    amount,
    ROW_NUMBER() OVER (
      PARTITION BY customer_id
      ORDER BY amount DESC
    ) AS rn
  FROM orders
)
SELECT customer_id, order_id, amount
FROM ranked
WHERE rn = 1;

Code (Snowflake / BigQuery — QUALIFY).

SELECT
  customer_id,
  order_id,
  amount
FROM orders
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY customer_id
  ORDER BY amount DESC
) = 1;

Code (MySQL 8.0 — CTE).

WITH ranked AS (
  SELECT
    customer_id,
    order_id,
    amount,
    ROW_NUMBER() OVER (
      PARTITION BY customer_id
      ORDER BY amount DESC
    ) AS rn
  FROM orders
)
SELECT customer_id, order_id, amount
FROM ranked
WHERE rn = 1;

Step-by-step explanation.

PostgreSQL. No QUALIFY support, so the CTE pattern is mandatory. Identical to the MySQL 8.0 syntax (MySQL caught up in 8.0).
Snowflake / BigQuery / Databricks. QUALIFY lets you filter on a window function inline. Same logical plan as the CTE version; saves one indentation level and a name.
MySQL 8.0. Window functions arrived in 8.0; before that you'd have written a self-join with MAX() — slow and verbose. Avoid MySQL 5.x in 2026 unless legacy.
Why the dialect matters. The query plan is identical across all four engines; only the syntactic sugar differs. The CTE form is the dialect-portable fallback when you don't know the engine.

Output (all four dialects).

customer_id	order_id	amount
7	1002	250
9	2002	95

Rule of thumb. Default to the CTE pattern in interviews — it works in every dialect. Mention QUALIFY as a one-line shorthand when you know the engine supports it (Snowflake / BigQuery / Databricks / Teradata). That single sentence is a senior signal that you've worked across more than one warehouse.

SQL interview question on the muscle-memory drill

A common opener at a 30-minute screen is "before we look at the data, write me the skeleton of a window-function query that finds the latest event per user." It looks trivial but it tests clause order, partition syntax, and tie-breaking — all in one ~60-second answer.

Solution Using `ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_ts DESC)`

WITH ranked_events AS (
  SELECT
    user_id,
    event_id,
    event_type,
    event_ts,
    ROW_NUMBER() OVER (
      PARTITION BY user_id
      ORDER BY event_ts DESC
    ) AS rn
  FROM events
)
SELECT user_id, event_id, event_type, event_ts
FROM ranked_events
WHERE rn = 1;

Step-by-step trace.

Input row	user_id	event_id	event_ts	Partition rank (DESC)
1	u_42	e_001	2026-05-10 09:00	rn=3
2	u_42	e_002	2026-05-11 14:30	rn=2
3	u_42	e_003	2026-05-12 18:42	rn=1
4	u_99	e_010	2026-05-08 07:00	rn=2
5	u_99	e_011	2026-05-09 11:00	rn=1

Output:

user_id	event_id	event_type	event_ts
u_42	e_003	checkout	2026-05-12 18:42
u_99	e_011	view	2026-05-09 11:00

Why this works — concept by concept:

Window function — ROW_NUMBER() enumerates rows without collapsing them. Every input row remains addressable; only the rank column is new.
PARTITION BY — user_id resets the counter at every new partition. Two u_42 rows and three u_99 rows each get their own 1..N enumeration.
ORDER BY in OVER — event_ts DESC decides which row ends up at rank 1. The window's ORDER BY is independent of any outer ORDER BY.
Outer WHERE on rn — clause order forbids referencing window aliases inside the same SELECT's WHERE. The CTE / sub-query materialises rn so the outer query can filter on it.
Cost — O(N log N) for the partition-sort; O(N) for the outer filter. Dominant term is the sort.

SQL
Topic — window functions
Window-function problems (top-N per group, latest-event-per-user)

Practice →

2. The SQL clause execution order — what runs when

Written order is not execution order — that one sentence unlocks half the SQL interview

The mental model in one line: SQL is the only language where the order you read is not the order the engine runs. You type SELECT first; the engine runs it fifth. Once you internalise the eight-step execution pipeline, half of the "why doesn't my alias work in WHERE?" questions vanish, and the other half — window-function ordering, HAVING vs WHERE, DISTINCT placement — fall out by deduction.

The written order — what you type.

SELECT — projection list, what columns you want.
FROM — base tables and joins.
WHERE — row-level filter.
GROUP BY — grouping keys.
HAVING — group-level filter.
ORDER BY — final sort.
LIMIT — top-N (and optional OFFSET).

The logical execution order — what the engine actually runs.

Step 1 — FROM + JOIN. The engine resolves the base tables and applies any join predicates. Cross joins fan out the cartesian product; inner joins prune to matches; outer joins preserve unmatched rows.
Step 2 — WHERE. Row-level predicates are applied next. This is before aggregation, so WHERE cannot reference aggregate functions or GROUP BY aliases. The optimizer often pushes WHERE predicates down past the join when it can (predicate pushdown).
Step 3 — GROUP BY. Rows are bucketed by the grouping keys. Each bucket becomes one row in the next step; non-grouped columns must be wrapped in aggregates or they will fail strict-mode SQL.
Step 4 — HAVING. Group-level filter. HAVING is the only place you can filter on aggregate results before SELECT has finished running.
Step 5 — SELECT. The projection list is computed. Aliases defined here are visible to later clauses only. Window functions evaluate here, not earlier — which is why WHERE rn = 1 always needs a subquery.
Step 6 — DISTINCT. Deduplicates the projected rows. Surprising placement: DISTINCT runs after SELECT, so SELECT DISTINCT a, b FROM t deduplicates on the projected (a, b) tuple — never on the original column count.
Step 7 — ORDER BY. The final sort. Because it runs after SELECT, it can reference SELECT aliases directly (ORDER BY rn ASC works without a subquery).
Step 8 — LIMIT / OFFSET. The very last step. Determines how many rows the client receives.

Three consequences that show up in interviews.

WHERE can't see SELECT aliases. Step 2 < step 5. If you write WHERE total > 100 where total is a SELECT alias, the engine raises column does not exist.
ORDER BY can see SELECT aliases. Step 7 > step 5. This is why ORDER BY net_revenue DESC works on the projected alias.
Window functions cannot be filtered in WHERE. Window functions evaluate in step 5; WHERE ran in step 2. Wrap in a CTE / subquery, then filter outside.

Two mental tricks that stop the confusion.

"Engines read tables before they read columns." FROM always runs first. Everything you can do — joins, filters, projections — is downstream of which tables exist.
"Aliases are born at SELECT and live downstream." If a clause runs before SELECT, it can't see the alias. If a clause runs after SELECT, it can.

Worked example — why does this query break

Detailed explanation. A junior writes a query that looks reasonable: SELECT amount * 1.1 AS taxed FROM orders WHERE taxed > 100;. The engine raises column "taxed" does not exist. Walking through the eight-step order tells you why and how to fix it.

Question. Given orders(order_id, amount), the analyst writes the query below and gets ERROR: column "taxed" does not exist. Explain why and rewrite it correctly.

Input.

order_id	amount
1001	120.00
1002	80.00
1003	95.00
1004	250.00

Code (the broken query).

SELECT amount * 1.1 AS taxed
FROM orders
WHERE taxed > 100;

Step-by-step explanation.

Step 1 — FROM orders. Engine resolves the table. So far so good.
Step 2 — WHERE taxed > 100. Engine tries to filter rows. It looks up taxed in the orders schema — taxed is not a column there. Error.
Step 5 — SELECT amount * 1.1 AS taxed. The engine would have computed the alias here, but execution never reached step 5 because step 2 already failed.
Two fixes. Either repeat the expression in WHERE, or wrap the projection in a subquery / CTE so the outer WHERE runs after the projected alias materialises.

Code (fix 1 — repeat the expression).

SELECT amount * 1.1 AS taxed
FROM orders
WHERE amount * 1.1 > 100;

Code (fix 2 — subquery so taxed exists at WHERE time).

SELECT taxed
FROM (
  SELECT amount * 1.1 AS taxed
  FROM orders
) AS t
WHERE taxed > 100;

Output.

taxed
132.00
104.50
275.00

Rule of thumb. "Can I reference an alias here?" Ask which clause is using it and where the alias was born. Aliases born in SELECT (step 5) live in ORDER BY (step 7) but not in WHERE (step 2). HAVING (step 4) can use grouping-key aliases but not SELECT aliases.

Worked example — where window functions sit in the execution order

Detailed explanation. Window functions evaluate in step 5 (during SELECT) — after GROUP BY and HAVING, before DISTINCT and ORDER BY. This is why you can't filter a window result in WHERE and why you can ORDER BY a window alias without a subquery.

Question. Given events(user_id, event_ts, page), write a single query that returns each user's third-most-recent page view, with the user's total event count as a side column — without any subquery (Snowflake / BigQuery QUALIFY).

Input.

user_id	event_ts	page
u_1	2026-05-10 08:00	/home
u_1	2026-05-11 09:00	/cart
u_1	2026-05-12 10:00	/checkout
u_1	2026-05-13 11:00	/done
u_2	2026-05-09 07:00	/home
u_2	2026-05-10 08:30	/cart
u_2	2026-05-11 12:15	/checkout
u_2	2026-05-12 13:45	/done

Code (Snowflake / BigQuery — using QUALIFY).

SELECT
  user_id,
  event_ts,
  page,
  COUNT(*)       OVER (PARTITION BY user_id) AS total_events,
  ROW_NUMBER()   OVER (PARTITION BY user_id ORDER BY event_ts DESC) AS recency_rank
FROM events
QUALIFY recency_rank = 3
ORDER BY user_id;

Step-by-step explanation.

FROM events — step 1 — table resolved.
WHERE (absent) — step 2 — every row passes.
GROUP BY (absent) — step 3 — skipped; window functions do not require GROUP BY.
HAVING (absent) — step 4 — skipped.
SELECT … OVER (…) — step 5 — every row is annotated with the window outputs. COUNT(*) OVER (PARTITION BY user_id) adds the total per user; ROW_NUMBER() ranks events from newest to oldest.
QUALIFY recency_rank = 3 — step 5b — Snowflake/BigQuery dialect extension that filters on the window result inline. Behaves like a WHERE would, if WHERE could see window aliases.
ORDER BY user_id — step 7 — final sort for human readability.

Output.

user_id	event_ts	page	total_events	recency_rank
u_1	2026-05-11 09:00	/cart	4	3
u_2	2026-05-10 08:30	/cart	4	3

Rule of thumb. On Snowflake / BigQuery / Databricks reach for QUALIFY; on Postgres / MySQL fall back to the CTE pattern (WITH ranked AS (...) SELECT … WHERE rn = 3). The mental model is identical; only the syntax sugar differs.

SQL interview question on clause execution order

A staff-level interviewer often probes with: "I've got SELECT customer_id, COUNT(*) AS n FROM orders GROUP BY customer_id HAVING n > 5 ORDER BY n DESC; — does that work in Postgres?" The answer tests whether you know that HAVING can see aliases declared in the same SELECT only on lenient dialects, and where the dialect lines fall.

Solution Using clause-order reasoning

-- Standard-SQL safe rewrite (works in every dialect)
SELECT customer_id, COUNT(*) AS n
FROM orders
GROUP BY customer_id
HAVING COUNT(*) > 5
ORDER BY n DESC;

Step-by-step trace.

Step	Clause	What runs	Visible aliases
1	`FROM orders`	table resolved	—
2	`WHERE` (absent)	skipped	—
3	`GROUP BY customer_id`	rows bucketed	`customer_id` (grouping key)
4	`HAVING COUNT(*) > 5`	group filter	`customer_id`, aggregates
5	`SELECT customer_id, COUNT(*) AS n`	projection + alias born	`customer_id`, `n`
7	`ORDER BY n DESC`	sort using SELECT alias	`customer_id`, `n`

Output:

customer_id	n
7	12
9	9
11	7

Why this works — concept by concept:

Logical execution order — GROUP BY happens before SELECT, so HAVING (which runs between GROUP BY and SELECT) must use the aggregate expression itself in standard SQL. Postgres and SQL Server enforce this strictly; MySQL and SQLite let you reference SELECT aliases in HAVING as a non-standard convenience.
SELECT aliases are born late — they are visible from SELECT onward, so ORDER BY n DESC works in every dialect.
Aggregate visibility in HAVING — HAVING is the only place to filter on aggregate functions; using WHERE COUNT(*) > 5 would fail because aggregates are not defined yet at step 2.
Cost — O(N) for the group hash; O(G log G) for the final sort where G is the number of distinct groups.

SQL
Topic — SQL
SQL clause-order problems (HAVING vs WHERE, aliases)

Practice →

3. Joins cheat sheet — INNER, LEFT, RIGHT, FULL, SELF, ANTI, SEMI, CROSS

Eight join shapes — and at least three appear in every SQL interview

The mental model in one line: a join is two tables and one predicate; the join type decides which rows survive when the predicate is or isn't met. The eight shapes below cover ~100% of interview surface. INNER and LEFT cover 80% of production code; FULL and CROSS are niche; ANTI and SEMI are the senior signals — they have no JOIN keyword in standard SQL but the planner treats them as join types.

The eight joins, one line each.

INNER JOIN — matched rows only. A INNER JOIN B ON A.id = B.id. Default when you omit the keyword.
LEFT JOIN — all of A, nullable B. A LEFT JOIN B ON A.id = B.id. Use when "every A must appear" is the requirement.
RIGHT JOIN — all of B, nullable A. Most teams ban it in code review; rewrite as B LEFT JOIN A instead.
FULL OUTER JOIN — every row from both, nulls where they don't match. Use for reconciliation queries.
SELF JOIN — table joined to itself with two aliases. Hierarchical data, pairs, lag-by-key.
ANTI JOIN — A rows that have no match in B. No JOIN keyword: LEFT JOIN ... WHERE B.id IS NULL or NOT EXISTS.
SEMI JOIN — A rows that have at least one match in B, returning each A row once. No JOIN keyword: EXISTS (preferred) or IN.
CROSS JOIN — cartesian product. A CROSS JOIN B = every row of A paired with every row of B. Use only with care.

The five join decisions every interview probes.

"INNER or LEFT?" — does the row need to survive if there is no match in B? If yes, LEFT. If no, INNER. Default reflex should be LEFT when the right side is a lookup that may be incomplete.
"IN or EXISTS?" — EXISTS is faster on large subqueries because it short-circuits at the first match. IN materialises the entire sub-result. Use EXISTS for SEMI joins on nullable columns to avoid the NULL trap.
"How do I exclude rows that match a list?" — NOT EXISTS or LEFT JOIN ... WHERE right IS NULL. Never NOT IN when the right side has NULLs — NOT IN (NULL) returns NULL, which evaluates as FALSE for every row, so the query returns zero rows.
"Self-join vs window function?" — for "previous row per key" patterns, use LAG() (window). For "every pair satisfying a predicate," use a self-join. Window is faster when the engine supports it.
"CROSS JOIN with WHERE — danger?" — A CROSS JOIN B WHERE A.id = B.id is identical to INNER JOIN, but the planner may not realise it (older engines). Always use the explicit JOIN ... ON syntax — semantics are clearer and plans more predictable.

The NULL trap that catches every junior.

LEFT JOIN B ON A.id = B.id — unmatched A rows still appear, with every B column = NULL.
WHERE B.col = 5 after a LEFT JOIN accidentally turns it into an INNER JOIN. Always put the filter into the ON clause for outer joins: LEFT JOIN B ON A.id = B.id AND B.col = 5.
Composite key joins with one nullable column — ON A.k1 = B.k1 AND A.k2 = B.k2 — produce zero rows when A.k2 is NULL. Use IS NOT DISTINCT FROM (Postgres) or (A.k2 = B.k2 OR (A.k2 IS NULL AND B.k2 IS NULL)) for NULL-safe equality.

Dialect deltas for joins.

USING (col) — implicit ON a.col = b.col. Works in Postgres / MySQL / Snowflake / BigQuery. Convenient for natural joins on identically-named columns.
LATERAL — Postgres, BigQuery (LEFT JOIN UNNEST), Snowflake. Lets the right-hand side of the join reference left-hand columns. Powerful for top-N-per-group with a sub-query.
QUALIFY — Snowflake / BigQuery / Databricks / Teradata. Filters on a window-function result inline — no subquery needed.
ANTI JOIN / SEMI JOIN keywords — Spark SQL and Databricks support the explicit keywords; Postgres, MySQL, Snowflake, and BigQuery require the EXISTS / LEFT JOIN ... IS NULL patterns.

Worked example — LEFT JOIN with a WHERE that secretly becomes INNER

Detailed explanation. This is the single most-tested join bug in interviews. The interviewer writes a LEFT JOIN, then pushes a filter on the right table into WHERE, and the unmatched A rows silently vanish. Recognising it and pulling the filter into ON is a junior-to-mid signal.

Question. A team writes SELECT u.user_id, p.plan_name FROM users u LEFT JOIN plans p ON u.user_id = p.user_id WHERE p.tier = 'pro'; and asks why churned users have disappeared from the report. Explain the bug and fix it.

Input — users and plans.

user_id	email
1	a@ex.com
2	b@ex.com
3	c@ex.com
4	d@ex.com

user_id	plan_name	tier
1	annual	pro
2	monthly	free
3	annual	pro

Code (the broken query).

SELECT u.user_id, p.plan_name
FROM users u
LEFT JOIN plans p
  ON u.user_id = p.user_id
WHERE p.tier = 'pro';

Step-by-step explanation.

The LEFT JOIN produces 4 rows — every user, with NULLs on the right for user_id=4 (no plan).
The WHERE p.tier = 'pro' filter compares NULL against 'pro'. SQL three-valued logic: NULL = 'pro' is NULL, treated as FALSE.
So user_id=4 is filtered out — converting the LEFT JOIN into an effective INNER JOIN.
user_id=2 is also dropped because tier='free' is not 'pro'.
The fix: move the tier filter into the ON clause. That keeps the join outer — unmatched rows on the right survive — and additionally enforces the tier predicate during the match.

Code (the fix).

SELECT u.user_id, p.plan_name
FROM users u
LEFT JOIN plans p
  ON u.user_id = p.user_id
  AND p.tier = 'pro';

Output (after fix).

user_id	plan_name
1	annual
2	NULL
3	annual
4	NULL

Rule of thumb. Filters on the right side of a LEFT JOIN belong in the ON, never in WHERE. Filters on the left side belong in WHERE, never in ON. Confuse them and the join semantics silently change.

Worked example — SELF JOIN for hierarchies and previous-row patterns

Detailed explanation. A self join is the same table joined to itself with two different aliases. The two muscle patterns: hierarchical lookups (employee.manager_id → employee.id) and "previous row per key" patterns (predecessor / successor pairs). The latter is now usually written with LAG() instead — but the self-join shape is still asked in interviews because it tests join reasoning without the window-function crutch.

Question. Given employees(emp_id, name, manager_id) where manager_id references emp_id, write a query that lists every employee alongside their manager's name. Employees with no manager (CEO) should still appear.

Input.

emp_id	name	manager_id
1	Alice	NULL
2	Bob	1
3	Carla	1
4	Dan	2
5	Eli	2

Code.

SELECT
  e.emp_id,
  e.name             AS employee_name,
  m.name             AS manager_name
FROM employees e
LEFT JOIN employees m
  ON e.manager_id = m.emp_id
ORDER BY e.emp_id;

Step-by-step explanation.

The employees table is referenced twice — once as e (the employee row) and once as m (the same table viewed as the manager record).
LEFT JOIN ... ON e.manager_id = m.emp_id finds the manager record for each employee. LEFT keeps Alice (CEO) in the output despite manager_id IS NULL.
m.name is the manager's display name. For Alice it is NULL because the left-join did not find a match.
The natural alternative — INNER JOIN — would have dropped Alice silently. Always reach for LEFT JOIN when "the root must appear" is a requirement.

Output.

emp_id	employee_name	manager_name
1	Alice	NULL
2	Bob	Alice
3	Carla	Alice
4	Dan	Bob
5	Eli	Bob

Rule of thumb. Self joins are the cleanest way to model parent / child relationships in a single table. For multi-level hierarchies (CEO → VP → Manager → IC), reach for a recursive CTE (WITH RECURSIVE) instead.

Worked example — CROSS JOIN to generate a date spine

Detailed explanation. CROSS JOIN produces the cartesian product of two tables — every row of A paired with every row of B. It is almost never what you want at the row level, but it is the canonical way to generate dense reporting grids — a row for every (date, user_id) combination, for example, so downstream LEFT JOIN against fact data fills the gaps.

Question. Build a daily "active or not" report for every user across a 7-day window. Some users have no events on some days; the output must still have one row per (user_id, day) with 0 for inactive days.

Input — events and users.

user_id
u_1
u_2
u_3

user_id	event_ts
u_1	2026-05-01 09:00
u_1	2026-05-02 11:00
u_2	2026-05-03 14:00

Code.

WITH date_spine AS (
  SELECT day::date AS day
  FROM generate_series(DATE '2026-05-01', DATE '2026-05-07', INTERVAL '1 day') AS s(day)
),
grid AS (
  SELECT u.user_id, d.day
  FROM users u
  CROSS JOIN date_spine d
),
daily_events AS (
  SELECT
    user_id,
    DATE(event_ts) AS day,
    COUNT(*)       AS event_count
  FROM events
  GROUP BY user_id, DATE(event_ts)
)
SELECT
  g.user_id,
  g.day,
  COALESCE(e.event_count, 0) AS event_count,
  CASE WHEN e.event_count > 0 THEN 1 ELSE 0 END AS active_flag
FROM grid g
LEFT JOIN daily_events e
  ON e.user_id = g.user_id
  AND e.day = g.day
ORDER BY g.user_id, g.day;

Step-by-step explanation.

date_spine enumerates the seven calendar days. Use generate_series (Postgres), GENERATE_DATE_ARRAY (BigQuery), or a recursive CTE (engine-neutral) for the same effect.
grid is the CROSS JOIN of users × date_spine — 3 × 7 = 21 rows, one per (user_id, day) combination.
daily_events aggregates the raw events to one row per (user_id, day) with a count.
The outer LEFT JOIN matches the grid against the actual data; missing combinations get NULL, which COALESCE converts to 0.
The result is dense — every user appears on every day, even on days with no activity. Perfect for downstream BI tools that expect a regular grid.

Output (partial).

user_id	day	event_count	active_flag
u_1	2026-05-01	1	1
u_1	2026-05-02	1	1
u_1	2026-05-03	0	0
u_1	2026-05-04	0	0
u_1	2026-05-07	0	0
u_2	2026-05-01	0	0
u_2	2026-05-03	1	1
u_3	2026-05-01	0	0
u_3	2026-05-07	0	0

Rule of thumb. CROSS JOIN is the only join where the row count multiplies. Use it deliberately to generate dense grids; never use it accidentally — a missing ON clause on an INNER JOIN turns into a CROSS JOIN and silently explodes the row count.

SQL interview question on the anti-join idiom

A common probe is "find users who have never placed an order" — the canonical anti-join question. It tests whether you reach for LEFT JOIN ... IS NULL, NOT EXISTS, or (incorrectly) NOT IN.

Solution Using `LEFT JOIN ... WHERE right IS NULL`

SELECT u.user_id, u.email
FROM users u
LEFT JOIN orders o
  ON o.user_id = u.user_id
WHERE o.user_id IS NULL;

Step-by-step trace.

u.user_id	u.email	matching o rows	After LEFT JOIN	After WHERE o.user_id IS NULL
1	a@ex.com	2 orders	2 rows with o.user_id=1	filtered out
2	b@ex.com	0 orders	1 row with o.user_id=NULL	kept
3	c@ex.com	1 order	1 row with o.user_id=3	filtered out
4	d@ex.com	0 orders	1 row with o.user_id=NULL	kept

Output:

user_id	email
2	b@ex.com
4	d@ex.com

Why this works — concept by concept:

Anti-join shape — "in A and not in B" is a left outer join with a NULL filter on the right join column. The unmatched rows have NULL on every right column; filtering on WHERE B.join_col IS NULL keeps exactly those.
Why not NOT IN — if any orders.user_id is NULL (orders with no associated user), NOT IN (NULL, 1, 3) is NULL for every row, treated as FALSE, and the query returns zero. LEFT JOIN ... IS NULL and NOT EXISTS are immune to this.
NOT EXISTS equivalent — NOT EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.user_id) produces the same result; many planners optimise it identically.
Cost — O(N + M) for a hash-anti-join when the planner picks it; O(N × M) worst case if no index supports the join.

SQL
Topic — anti-join
Anti-join problems (in-A-not-in-B, churn, missing keys)

Practice →

4. Aggregates and GROUP BY cheat sheet

Five standard aggregates, four advanced extensions, and one universal NULL rule

The mental model in one line: an aggregate function compresses many rows into one value per group; GROUP BY defines the groups; HAVING filters the groups; every standard aggregate ignores NULL except COUNT(*). Once you have that rule, the entire aggregates surface — the silent COUNT bug, the WHERE vs HAVING decision, the GROUPING SETS vs CUBE vs ROLLUP family — is mostly bookkeeping.

The five standard aggregates and what they ignore.

COUNT(*) — counts every row, including ones where every column is NULL. Returns BIGINT. Cheapest aggregate (no column read).
COUNT(col) — counts rows where col IS NOT NULL. Returns BIGINT. The silent NULL bug: COUNT(email) is not the row count; it is the non-null-email row count.
COUNT(DISTINCT col) — counts distinct non-null values. Expensive — engine must materialise a distinct set. Approximate-COUNT-DISTINCT (APPROX_COUNT_DISTINCT / COUNT(DISTINCT col) with HyperLogLog hints) is much cheaper.
SUM(col) — adds non-null values. Returns NUMERIC or BIGINT (watch overflow on summing INT columns over billions of rows — cast to BIGINT defensively).
AVG(col) — equivalent to SUM(col) / COUNT(col). Returns NUMERIC. NULLs are excluded from both numerator and denominator (so the result is the average of defined values).
MIN(col) / MAX(col) — smallest / largest non-null value. Works on numbers, dates, and strings (lexicographic).

The silent COUNT bug.

Question: "How many users do we have?" — SELECT COUNT(email) FROM users if some users have NULL email returns the wrong number.
The right answer is SELECT COUNT(*) FROM users, or SELECT COUNT(user_id) FROM users if user_id is NOT NULL.
Every senior interviewer asks this. The required reply is "COUNT(*) counts every row; COUNT(col) ignores NULLs."

WHERE vs HAVING — the two-line rule.

WHERE filters rows before aggregation. Use it to remove input rows.
HAVING filters groups after aggregation. Use it to remove groups based on aggregate values.
WHERE COUNT(*) > 5 is a syntax error — aggregates don't exist yet at step 2.
HAVING amount > 100 works only if amount is the grouping key; on a non-key column it raises (because aggregates must wrap it).

GROUPING SETS, ROLLUP, CUBE — the advanced grouping family.

GROUPING SETS ((a), (b), ()) — explicit list of grouping combinations. The empty () is the grand total. Powerful for "one query, many subtotals."
ROLLUP (year, quarter, month) — hierarchical totals: (year, quarter, month), (year, quarter), (year), (). Used for time-hierarchy summaries.
CUBE (region, product, channel) — every combination of the dimensions. 2^N groups for N dimensions — careful with cardinality.
GROUPING(col) — returns 1 if col was grouped on this row, 0 otherwise. Distinguishes "subtotal NULL" from "real data NULL."

FILTER (WHERE …) — conditional aggregates without CASE WHEN.

Standard SQL since 2003; supported by Postgres, SQLite, and Snowflake.
SUM(amount) FILTER (WHERE status = 'paid') replaces SUM(CASE WHEN status = 'paid' THEN amount ELSE 0 END).
Cleaner syntax, identical plan. Most senior reviewers prefer it.
MySQL / BigQuery don't support FILTER — fall back to CASE WHEN.

Three classic interview questions on aggregates.

"How does COUNT(*) differ from COUNT(col)?" — covered above.
"How would you compute the median in SQL?" — PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY col) (standard SQL). Pre-PERCENTILE engines use a window-function trick.
"What does GROUP BY () mean?" — single-row aggregate over the whole table (the grand total). Useful inside GROUPING SETS.

Worked example — conditional aggregates with FILTER

Detailed explanation. A common interview probe is "compute paid orders vs cancelled orders per customer in one query." You can do it with SUM(CASE WHEN ...), but FILTER (WHERE ...) is the senior-signal answer in 2026.

Question. Given orders(customer_id, amount, status) with status ∈ {'paid','cancelled','pending'}, compute per customer: total amount paid, total amount cancelled, and total amount pending — in one query.

Input.

customer_id	order_id	amount	status
7	1001	120	paid
7	1002	80	cancelled
7	1003	95	paid
7	1004	60	pending
9	2001	200	paid
9	2002	50	cancelled

Code.

SELECT
  customer_id,
  SUM(amount) FILTER (WHERE status = 'paid')      AS paid_amt,
  SUM(amount) FILTER (WHERE status = 'cancelled') AS cancelled_amt,
  SUM(amount) FILTER (WHERE status = 'pending')   AS pending_amt
FROM orders
GROUP BY customer_id;

Step-by-step explanation.

FROM orders — table resolved.
WHERE (absent) — every row passes.
GROUP BY customer_id — rows bucketed by customer.
SELECT — for each group, SUM(amount) FILTER (WHERE status = …) sums only the rows where the filter is true.
The three filters run in parallel inside the same group scan — one pass over the data, three aggregate accumulators per group.

Output.

customer_id	paid_amt	cancelled_amt	pending_amt
7	215	80	60
9	200	50	NULL

Rule of thumb. FILTER (WHERE …) is the modern conditional aggregate. Reach for it on Postgres / Snowflake / SQLite; fall back to SUM(CASE WHEN ... THEN amount ELSE 0 END) on MySQL / BigQuery.

Worked example — GROUPING SETS for one-query subtotals

Detailed explanation. A reporting team wants revenue per region, per product, and the grand total — in one query, in one round-trip to the warehouse. The wrong answer is three separate GROUP BY queries unioned together. The right answer is GROUPING SETS, which the engine evaluates in a single pass and emits all the combinations.

Question. Given sales(region, product, amount), return revenue broken down by region, by product, and as a grand total — in a single result set. Distinguish "subtotal" rows from "real data" rows.

Input.

region	product	amount
NA	A	100
NA	B	50
EU	A	80
EU	B	70
AP	A	40

Code.

SELECT
  COALESCE(region, 'ALL_REGIONS')   AS region,
  COALESCE(product, 'ALL_PRODUCTS') AS product,
  GROUPING(region)  AS is_region_subtotal,
  GROUPING(product) AS is_product_subtotal,
  SUM(amount)       AS revenue
FROM sales
GROUP BY GROUPING SETS (
  (region),
  (product),
  ()
)
ORDER BY is_region_subtotal, is_product_subtotal, region, product;

Step-by-step explanation.

GROUPING SETS ((region), (product), ()) instructs the engine to aggregate three times in one pass: once grouped by region, once grouped by product, once with no grouping (grand total).
GROUPING(region) returns 1 on rows where region was not part of the grouping (the "all regions" rows) and 0 otherwise. Same for GROUPING(product).
COALESCE replaces the NULL placeholders the engine emits for "subtotal" rows with human-readable labels.
The final ORDER BY floats the grand total to the bottom and the per-region / per-product subtotals together.

Output.

region	product	is_region_subtotal	is_product_subtotal	revenue
AP	ALL_PRODUCTS	0	1	40
EU	ALL_PRODUCTS	0	1	150
NA	ALL_PRODUCTS	0	1	150
ALL_REGIONS	A	1	0	220
ALL_REGIONS	B	1	0	120
ALL_REGIONS	ALL_PRODUCTS	1	1	340

Rule of thumb. GROUPING SETS ((a), (b), ()) is the explicit form. Reach for ROLLUP (a, b) when the dimensions form a hierarchy (e.g. year → quarter → month), and CUBE (a, b) when you need every combination. All three compile into the same one-pass execution plan.

Worked example — `HAVING` vs `WHERE` debugged the right way

Detailed explanation. A common interview trap: candidates put aggregate conditions in WHERE and row-level conditions in HAVING. Either causes confusing errors or — worse — silently incorrect results because the planner allowed it on lenient dialects (MySQL). Mastering the rule "WHERE filters rows before aggregation, HAVING filters groups after aggregation" is muscle memory you need cold.

Question. Given orders(customer_id, order_ts, amount, status), find customers who placed at least 5 paid orders in 2026. Choose the right clause for each predicate and explain why.

Input.

customer_id	order_ts	amount	status
7	2026-01-15	100	paid
7	2026-02-10	50	paid
7	2026-02-20	80	cancelled
7	2026-03-12	120	paid
7	2026-04-01	200	paid
7	2026-04-25	60	paid
9	2026-03-05	300	paid
9	2026-04-10	90	paid

Code.

SELECT
  customer_id,
  COUNT(*)              AS paid_order_count,
  SUM(amount)           AS paid_revenue
FROM orders
WHERE order_ts >= DATE '2026-01-01'
  AND order_ts <  DATE '2027-01-01'
  AND status = 'paid'
GROUP BY customer_id
HAVING COUNT(*) >= 5
ORDER BY paid_order_count DESC;

Step-by-step explanation.

Date predicate in WHERE. order_ts >= '2026-01-01' operates on a row column before aggregation runs. Putting it in HAVING would be either an error (Postgres / standard SQL) or a silent inefficiency (MySQL would aggregate the entire table and then filter).
Status predicate in WHERE. status = 'paid' is also row-level — drop cancelled / pending orders before counting. Putting it in HAVING would count cancelled orders too and apply the filter to the wrong number.
Count predicate in HAVING. COUNT(*) >= 5 is an aggregate — it only makes sense after GROUP BY. The engine raises an error if you try to use it in WHERE.
ORDER BY on SELECT alias. Works because ORDER BY runs after SELECT, so the alias paid_order_count is visible.

Output.

customer_id	paid_order_count	paid_revenue
7	5	470

Rule of thumb. Two-line decision tree: if the predicate is a column-level boolean (date, status, region), put it in WHERE. If the predicate is an aggregate (COUNT, SUM, AVG comparison), put it in HAVING. If the dialect allows both, prefer WHERE because the planner can push it down past joins and indexes.

SQL interview question on COUNT + GROUP BY edge cases

A senior interviewer probes: "Write a query that, for each product, returns total revenue, number of orders, and number of distinct customers — but only for products with at least 100 distinct customers." It tests COUNT(*) vs COUNT(DISTINCT), GROUP BY, and HAVING in one query.

Solution Using `COUNT(DISTINCT)` with `HAVING`

SELECT
  product_id,
  SUM(amount)                AS revenue,
  COUNT(*)                   AS order_count,
  COUNT(DISTINCT customer_id) AS distinct_customers
FROM orders
WHERE order_ts >= DATE '2026-01-01'
GROUP BY product_id
HAVING COUNT(DISTINCT customer_id) >= 100
ORDER BY revenue DESC;

Step-by-step trace.

Step	Clause	What it does
1	`FROM orders`	resolve table
2	`WHERE order_ts >= '2026-01-01'`	filter rows before aggregation
3	`GROUP BY product_id`	bucket by product
4	`HAVING COUNT(DISTINCT customer_id) >= 100`	drop groups below threshold
5	`SELECT ...`	compute revenue, order count, distinct customer count
7	`ORDER BY revenue DESC`	final sort

Output:

product_id	revenue	order_count	distinct_customers
P_017	482,500	4,127	1,842
P_004	391,200	3,210	1,508
P_022	312,000	2,890	1,201

Why this works — concept by concept:

COUNT(*) vs COUNT(DISTINCT col) — COUNT(*) counts rows (one per order); COUNT(DISTINCT customer_id) counts unique customers. Knowing the difference is the senior signal.
WHERE before aggregation — WHERE order_ts >= '2026-01-01' prunes rows before the group hash builds; the engine never aggregates 2024 / 2025 data unnecessarily.
HAVING on aggregate — HAVING COUNT(DISTINCT customer_id) >= 100 filters groups after aggregation. The same predicate can't go in WHERE because the aggregate doesn't exist yet at step 2.
ORDER BY on SELECT alias — ORDER BY revenue DESC works because revenue is the SELECT alias and ORDER BY runs after SELECT.
Cost — O(N) for the date filter; O(N) for the group hash; COUNT(DISTINCT) adds a per-group hash set of customer IDs, making memory O(N_groups × avg_distinct_customers).

SQL
Topic — aggregation
Aggregation problems (COUNT, GROUP BY, HAVING, conditional aggregates)

Practice →

5. Window functions cheat sheet

Four function families, one `OVER` clause, and 80% of senior SQL rounds

The mental model in one line: a window function returns one value per input row computed over a window of related rows defined by PARTITION BY, ordered by ORDER BY, and bounded by a frame clause — it does not collapse rows the way GROUP BY does. Every senior SQL interview asks at least one window-function question, and three-quarters of them reduce to "rank within a partition" or "running total."

The four function families.

Ranking. ROW_NUMBER(), RANK(), DENSE_RANK(), NTILE(n). Assign an integer per row based on the ORDER BY inside the window.
Offset. LAG(col, n, default), LEAD(col, n, default). Look backward or forward in the partition. The two extra args are offset (default 1) and default value when out of range (default NULL).
Frame-positional. FIRST_VALUE(col), LAST_VALUE(col), NTH_VALUE(col, n). Return the value at a specific position in the frame.
Aggregate-as-window. SUM() OVER, AVG() OVER, COUNT() OVER, MAX() OVER. Same aggregate functions, evaluated over the window instead of a group.

Ranking — pick the right one.

ROW_NUMBER() — strictly 1, 2, 3 with no ties. Use when you need a deterministic "the row" per partition (top-N per group).
RANK() — 1, 1, 3 — ties share a rank, next rank skips. Use when "joint first place" should be visible.
DENSE_RANK() — 1, 1, 2 — ties share a rank, next rank does not skip. Use when you want compact ranks across distinct values.
NTILE(n) — splits the partition into n roughly-equal buckets. Use for quartiles, deciles, percentile bucketing.

Offset — LAG and LEAD.

LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_ts) — previous amount in the partition. NULL on the first row.
LAG(amount, 1, 0) OVER (...) — previous amount with 0 instead of NULL on the first row. The third arg is the default.
LEAD(amount) OVER (...) — next amount. Mirror image of LAG.
Classic use: compute deltas (amount - LAG(amount) OVER (...)) without a self-join.

Frame clause anatomy.

The frame clause goes inside the OVER after ORDER BY. Default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW — surprisingly different from ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW when there are ties in the ORDER BY.
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW — strict row-count frame; running total semantics row-by-row.
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW — 7-row trailing window; rolling-7 averages.
ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING — symmetric 7-row window centred on the current row.
RANGE vs ROWS — RANGE groups ties in the ORDER BY into one logical row; ROWS counts physical rows. Use ROWS for predictability.

PARTITION BY + ORDER BY — the muscle pattern.

PARTITION BY restarts the window at every distinct value of the partition key. Cardinality of partitions × frame width = your memory cost.
ORDER BY inside OVER controls who comes first. Independent of the outer query's ORDER BY.
Both can be empty. OVER () means "the whole result set is one window" — useful for grand totals as a column on every row.

Common interview shapes — three patterns you'll see.

Top-N per group. ROW_NUMBER() OVER (PARTITION BY k ORDER BY v DESC) then WHERE rn <= N in an outer query.
Running total. SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_ts ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW).
7-day rolling average. AVG(amount) OVER (PARTITION BY user_id ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW).

QUALIFY — Snowflake / BigQuery / Databricks shortcut.

QUALIFY ROW_NUMBER() OVER (PARTITION BY k ORDER BY v) = 1 filters on a window result without a subquery.
Same semantics as WHERE would have if WHERE could see window results.
Postgres and MySQL still need the subquery / CTE pattern; in those engines, write the CTE.

Worked example — running total of daily revenue per customer

Detailed explanation. A reporting team wants daily revenue and the running total per customer in the same query. Without window functions you'd self-join or correlate; with SUM() OVER, it is two lines.

Question. Given orders(customer_id, order_ts, amount), produce one row per (customer_id, day) with the daily total and the cumulative total to date for that customer.

Input.

customer_id	order_ts	amount
7	2026-05-01	120
7	2026-05-01	30
7	2026-05-03	200
7	2026-05-04	50
9	2026-05-02	80
9	2026-05-05	95

Code.

WITH daily AS (
  SELECT
    customer_id,
    DATE(order_ts)         AS day,
    SUM(amount)            AS daily_amt
  FROM orders
  GROUP BY customer_id, DATE(order_ts)
)
SELECT
  customer_id,
  day,
  daily_amt,
  SUM(daily_amt) OVER (
    PARTITION BY customer_id
    ORDER BY day
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS running_total
FROM daily
ORDER BY customer_id, day;

Step-by-step explanation.

Inner CTE. Group rows by (customer_id, day); SUM(amount) collapses multiple same-day orders into one daily total.
Outer SELECT — daily values pass through. customer_id, day, daily_amt are projected as-is.
Outer SELECT — running total via SUM() OVER. PARTITION BY customer_id resets the cumulative sum at each new customer; ORDER BY day decides the chronological order; the explicit frame ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ensures a true row-by-row running total.
Outer ORDER BY. Sort the final output for human reading — independent of the window's internal ORDER BY.
Why the explicit frame. The default RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW would lump rows with identical day into one logical step — fine for distinct days but surprising on ties. ROWS is safer to teach and reason about.

Output.

customer_id	day	daily_amt	running_total
7	2026-05-01	150	150
7	2026-05-03	200	350
7	2026-05-04	50	400
9	2026-05-02	80	80
9	2026-05-05	95	175

Rule of thumb. Always write the frame clause explicitly for SUM() OVER and AVG() OVER once ORDER BY is present. The default frame is RANGE not ROWS — a subtle mismatch that bites you on ties.

Worked example — `LAG` and `LEAD` to compute deltas without self-joins

Detailed explanation. Before window functions existed, "delta vs previous row" required a self-join keyed on row_number or order_ts. With LAG() the query collapses to one line — and the planner often executes it in a single pass with no extra sort. LAG looks backward; LEAD looks forward; both take an optional offset and default.

Question. Given stock_prices(symbol, day, close_price), compute each row's daily change and the percentage change vs the previous trading day, per symbol.

Input.

symbol	day	close_price
ACME	2026-05-01	100.00
ACME	2026-05-02	105.00
ACME	2026-05-03	99.00
ACME	2026-05-04	110.00
ZETA	2026-05-01	50.00
ZETA	2026-05-02	52.00

Code.

SELECT
  symbol,
  day,
  close_price,
  LAG(close_price, 1) OVER (
    PARTITION BY symbol
    ORDER BY day
  ) AS prev_close,
  close_price - LAG(close_price, 1) OVER (
    PARTITION BY symbol
    ORDER BY day
  ) AS abs_change,
  ROUND(
    100.0 * (close_price - LAG(close_price, 1) OVER (PARTITION BY symbol ORDER BY day))
    / LAG(close_price, 1) OVER (PARTITION BY symbol ORDER BY day),
    2
  ) AS pct_change
FROM stock_prices
ORDER BY symbol, day;

Step-by-step explanation.

LAG(close_price, 1) OVER (PARTITION BY symbol ORDER BY day) returns the previous trading day's close_price per symbol. The first row of each partition gets NULL because there is no previous row.
close_price - LAG(...) is the absolute daily change. NULL on the first day of each symbol.
The percentage change wraps the same expression — for production code, factor it into a CTE so LAG() only appears once and is easier to read.
The outer ORDER BY symbol, day is for human reading; it does not influence the window's internal ordering.

Output.

symbol	day	close_price	prev_close	abs_change	pct_change
ACME	2026-05-01	100.00	NULL	NULL	NULL
ACME	2026-05-02	105.00	100.00	5.00	5.00
ACME	2026-05-03	99.00	105.00	-6.00	-5.71
ACME	2026-05-04	110.00	99.00	11.00	11.11
ZETA	2026-05-01	50.00	NULL	NULL	NULL
ZETA	2026-05-02	52.00	50.00	2.00	4.00

Rule of thumb. Reach for LAG() / LEAD() whenever the requirement says "previous row" or "next row" inside a key. The default offset is 1; the third argument is the default value when out of range (use LAG(col, 1, 0) if you want 0 instead of NULL on the boundary).

Worked example — `ROW_NUMBER` vs `RANK` vs `DENSE_RANK` on tied scores

Detailed explanation. Every senior SQL interview includes one ranking-function question. The trap: candidates blindly use RANK() everywhere, or default to ROW_NUMBER() without thinking. The three functions handle ties differently, and the choice depends entirely on the business semantics.

Question. Given leaderboard(player_id, score), return three rank columns — ROW_NUMBER, RANK, and DENSE_RANK — so a single result set demonstrates how each function treats tied scores.

Input.

player_id	score
p_1	100
p_2	95
p_3	95
p_4	90
p_5	85
p_6	85
p_7	80

Code.

SELECT
  player_id,
  score,
  ROW_NUMBER() OVER (ORDER BY score DESC, player_id) AS row_num,
  RANK()       OVER (ORDER BY score DESC)            AS rk,
  DENSE_RANK() OVER (ORDER BY score DESC)            AS dense_rk
FROM leaderboard
ORDER BY score DESC, player_id;

Step-by-step explanation.

ROW_NUMBER() OVER (ORDER BY score DESC, player_id) produces 1, 2, 3, … with no ties because the secondary ORDER BY player_id breaks them deterministically. Use this when you need exactly one row per "position."
RANK() OVER (ORDER BY score DESC) produces 1, 2, 2, 4, 5, 5, 7 — tied scores share a rank and the next rank skips by the tie group's size. Use this when "joint second place" must be visible.
DENSE_RANK() OVER (ORDER BY score DESC) produces 1, 2, 2, 3, 4, 4, 5 — tied scores share a rank but the next rank is contiguous. Use this when ranks should compress to "distinct values seen so far."
The outer ORDER BY score DESC, player_id matches the ROW_NUMBER's tiebreaker so the output reads top-down.

Output.

player_id	score	row_num	rk	dense_rk
p_1	100	1	1	1
p_2	95	2	2	2
p_3	95	3	2	2
p_4	90	4	4	3
p_5	85	5	5	4
p_6	85	6	5	4
p_7	80	7	7	5

Rule of thumb. "Top-3 distinct scores including ties" → DENSE_RANK() <= 3. "Top-3 positions, ties shown" → RANK() <= 3. "Top-3 rows, ties broken by a tiebreaker" → ROW_NUMBER() <= 3. The wrong choice changes how many rows you return.

SQL interview question on rolling averages

A senior probe: "Write a 7-day rolling average of revenue per user, with the value defined as 0 if fewer than 7 days are available." Tests frames, partitioning, and the rolling-window vocabulary in one go.

Solution Using `AVG() OVER (... ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)`

WITH daily AS (
  SELECT
    user_id,
    DATE(order_ts) AS day,
    SUM(amount)    AS daily_amt
  FROM orders
  GROUP BY user_id, DATE(order_ts)
)
SELECT
  user_id,
  day,
  daily_amt,
  AVG(daily_amt) OVER (
    PARTITION BY user_id
    ORDER BY day
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS rolling_7d_avg,
  COUNT(*) OVER (
    PARTITION BY user_id
    ORDER BY day
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS rolling_count
FROM daily
ORDER BY user_id, day;

Step-by-step trace.

user	day	daily_amt	rows in 6-PRECEDING-to-CURRENT frame	rolling_7d_avg	rolling_count
u_42	2026-05-01	50	1 row	50.0	1
u_42	2026-05-02	100	2 rows	75.0	2
u_42	2026-05-03	60	3 rows	70.0	3
u_42	2026-05-04	40	4 rows	62.5	4
u_42	2026-05-05	80	5 rows	66.0	5
u_42	2026-05-06	70	6 rows	66.7	6
u_42	2026-05-07	110	7 rows	72.9	7
u_42	2026-05-08	90	7 rows	78.6	7

Output:

user_id	day	daily_amt	rolling_7d_avg	rolling_count
u_42	2026-05-07	110	72.9	7
u_42	2026-05-08	90	78.6	7

Why this works — concept by concept:

7-row trailing frame — ROWS BETWEEN 6 PRECEDING AND CURRENT ROW includes the current row and the six immediately previous rows in chronological order. Seven total.
PARTITION BY user_id — each user has its own rolling-window state; the frame does not bleed across users.
Companion COUNT(*) OVER — emits the actual frame size so downstream code can mask values where fewer than 7 rows exist (treat them as 0 or NULL per business rule).
AVG over a frame — same formula as plain AVG, but evaluated row-by-row over the window instead of once per group.
Cost — O(N log N) for the per-partition sort; O(N) for the streaming frame computation. Engines maintain a running sum + count for AVG, so cost is constant per row once the partition is sorted.

SQL
Topic — window functions
Window-function problems (rolling averages, running totals, top-N)

Practice →

Cheat sheet — copy-paste templates

These six templates are the highest-frequency SQL patterns in DE interviews and production code. Memorise the shape, swap the table and column names, and you have answered most live-coding questions in under 90 seconds. Each one is dialect-portable across Postgres, MySQL 8.0+, Snowflake, and BigQuery.

Top-N per group.

WITH ranked AS (
  SELECT
    customer_id,
    order_id,
    amount,
    ROW_NUMBER() OVER (
      PARTITION BY customer_id
      ORDER BY amount DESC
    ) AS rn
  FROM orders
)
SELECT customer_id, order_id, amount
FROM ranked
WHERE rn <= 3;          -- top 3 per customer

Running total.

SELECT
  customer_id,
  order_ts,
  amount,
  SUM(amount) OVER (
    PARTITION BY customer_id
    ORDER BY order_ts
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS running_total
FROM orders;

7-day rolling average.

SELECT
  user_id,
  day,
  daily_amt,
  AVG(daily_amt) OVER (
    PARTITION BY user_id
    ORDER BY day
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS rolling_7d_avg
FROM daily;

De-duplication with ROW_NUMBER.

WITH ranked AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY natural_key
      ORDER BY ingested_at DESC
    ) AS rn
  FROM staging_orders
)
SELECT *
FROM ranked
WHERE rn = 1;           -- keep latest version per natural_key

Anti-join idiom (in A, not in B).

-- LEFT JOIN ... IS NULL form
SELECT u.user_id, u.email
FROM users u
LEFT JOIN orders o
  ON o.user_id = u.user_id
WHERE o.user_id IS NULL;

-- NOT EXISTS form (NULL-safe equivalent)
SELECT u.user_id, u.email
FROM users u
WHERE NOT EXISTS (
  SELECT 1
  FROM orders o
  WHERE o.user_id = u.user_id
);

Median with PERCENTILE_CONT.

SELECT
  customer_id,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median_amount,
  PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY amount) AS p95_amount
FROM orders
GROUP BY customer_id;

Conditional aggregate with FILTER.

SELECT
  customer_id,
  SUM(amount) FILTER (WHERE status = 'paid')      AS paid_amt,
  SUM(amount) FILTER (WHERE status = 'cancelled') AS cancelled_amt,
  COUNT(*) FILTER (WHERE status = 'paid')         AS paid_count
FROM orders
GROUP BY customer_id;

Pivot one column into many (cross-tab).

SELECT
  customer_id,
  SUM(CASE WHEN month = '2026-04' THEN amount ELSE 0 END) AS apr_amt,
  SUM(CASE WHEN month = '2026-05' THEN amount ELSE 0 END) AS may_amt,
  SUM(CASE WHEN month = '2026-06' THEN amount ELSE 0 END) AS jun_amt
FROM orders_monthly
GROUP BY customer_id;

Year-over-year delta with LAG.

SELECT
  product_id,
  year,
  revenue,
  LAG(revenue, 1, 0) OVER (
    PARTITION BY product_id
    ORDER BY year
  ) AS prev_year_revenue,
  revenue - LAG(revenue, 1, 0) OVER (
    PARTITION BY product_id
    ORDER BY year
  ) AS yoy_delta
FROM yearly_revenue;

Sessionisation with LAG + cumulative sum.

WITH gaps AS (
  SELECT
    user_id,
    event_ts,
    CASE
      WHEN event_ts - LAG(event_ts) OVER (PARTITION BY user_id ORDER BY event_ts)
           > INTERVAL '30 minutes' THEN 1
      ELSE 0
    END AS is_new_session
  FROM events
),
sessions AS (
  SELECT
    user_id,
    event_ts,
    SUM(is_new_session) OVER (
      PARTITION BY user_id
      ORDER BY event_ts
      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS session_id
  FROM gaps
)
SELECT user_id, session_id, MIN(event_ts) AS session_start, MAX(event_ts) AS session_end, COUNT(*) AS event_count
FROM sessions
GROUP BY user_id, session_id;

SQL
Topic — SQL
All-purpose SQL drills (top-N, running totals, pivots, sessionisation)

Practice →

Frequently asked questions

Is SQL still in demand in 2026?

Yes — more than ever. Every modern data platform (Snowflake, BigQuery, Databricks SQL Warehouse, Apache Iceberg, dbt, ksqlDB, Materialise) exposes SQL as the primary user surface, and the vast majority of analytics, ML feature engineering, and ad-hoc business questions are still answered in SQL. SQL is the only language that has remained the lingua franca of data across four decades — outliving five generations of "SQL replacements" — and 2026 has actually expanded its reach with streaming SQL and lakehouse SQL engines. A working data engineer in 2026 typically writes more SQL than Python.

Which SQL dialect should I learn first?

Learn PostgreSQL first. It is the closest to standard SQL, has the richest feature set (CTEs, window functions, LATERAL, FILTER, GROUPING SETS, full JSON support, partitions), and the lessons port cleanly to Snowflake, BigQuery, and Redshift. Once Postgres feels comfortable, scan the dialect deltas for whichever warehouse you'll work on day-to-day — Snowflake adds QUALIFY and time-travel syntax, BigQuery adds standard SQL extensions for arrays and structs, and so on. Avoid MySQL as a first dialect because some of its older non-standard behaviour (silent group-by relaxation, implicit casts) builds bad habits you'll have to unlearn.

ANSI SQL vs vendor-specific SQL — which should I write?

Default to ANSI / standard SQL — it ports across engines and survives platform migrations. Reach for vendor-specific syntax only when it is materially clearer or significantly faster. QUALIFY, LATERAL, MATCH_RECOGNIZE, and the array-handling functions are good examples of vendor extensions worth using when the engine supports them; you simply pay a portability cost. In interviews, write standard SQL by default and mention the dialect-specific shortcut as an addendum — that signals both depth and discipline.

What's the hardest SQL concept for interviews?

Window functions with non-default frame clauses. Most candidates can write ROW_NUMBER() OVER (PARTITION BY k ORDER BY v), but the moment the interviewer asks for a 7-day rolling average or a session-bucket assignment, the frame clause (ROWS BETWEEN ... AND ...) becomes the make-or-break detail. Pair that with the difference between ROWS and RANGE framing, and you have the single most common SQL interview pitfall. Drill on running totals, rolling averages, and sessionisation patterns until the syntax is automatic.

How long does it take to master SQL for DE interviews?

Plan on 8–12 weeks of consistent practice — 1–2 hours per day on real problems — to reach the bar for a senior DE SQL round. The first 4 weeks cover the fundamentals (joins, aggregates, window functions, CTEs); the next 4 weeks pattern-match the common interview shapes (top-N, dedup, sessionisation, pivot, time-series); the last weeks polish edge cases (NULL handling, frame clauses, dialect differences, EXPLAIN reading). 450+ DE-focused problems with progressively harder constraints is the canonical sequence — the muscle memory only comes from reps.

Should I learn SQL or Python first as a data engineer?

SQL first. Almost every DE interview opens with a SQL round, and every data role requires SQL fluency from day one. Python is essential at the next layer — Pandas, PySpark, orchestration scripts — but Python without SQL is "I can move data around" while SQL without Python is "I can answer business questions." Spend the first 8 weeks on SQL, the next 8 weeks on Pandas + PySpark + ETL patterns. Once both are at interview-level, you can rotate between them based on the role's emphasis. For most DE pipelines, SQL is still 60–70% of the work; Python wraps it.

Practice on PipeCode

Drill the SQL practice library → for end-to-end interview reps across joins, aggregates, and window functions.
Rehearse window-function problems → when the interviewer asks for top-N per group, rolling totals, or sessionisation.
Sharpen join problems → for the LEFT-vs-INNER bug, anti-join idiom, and self-join patterns.
Practise aggregation drills → for COUNT semantics, HAVING, and GROUPING SETS.
For company-specific reps, work through Meta SQL problems → and Amazon SQL problems →.
For the broader interview surface, read top data engineering interview questions →.
Stack the prerequisites with the only 5 skills you need to become a data engineer →.
Take the full SQL track on SQL for Data Engineering Interviews — From Zero to FAANG →.
Pair it with data modeling for DE interviews → for schema-design fluency.

Pipecode.ai is Leetcode for Data Engineering — every clause, every join, every aggregate, every window function above ships with hands-on practice rooms where you write the SQL, run it against a real engine, and get scored on correctness and plan quality. Start with the SQL library and work outward; PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine.

Practice SQL now →
Window-function drills →

1. Why every DE needs a SQL cheat sheet in 2026

A working SQL reference card is interview survival kit — and on-the-job leverage

Worked example — the 90-second SQL whiteboard

Worked example — write the same query in four dialects

SQL interview question on the muscle-memory drill

Solution Using ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_ts DESC)

2. The SQL clause execution order — what runs when

Written order is not execution order — that one sentence unlocks half the SQL interview

Worked example — why does this query break

Worked example — where window functions sit in the execution order

SQL interview question on clause execution order

Solution Using clause-order reasoning

3. Joins cheat sheet — INNER, LEFT, RIGHT, FULL, SELF, ANTI, SEMI, CROSS

Eight join shapes — and at least three appear in every SQL interview

Worked example — LEFT JOIN with a WHERE that secretly becomes INNER

Worked example — SELF JOIN for hierarchies and previous-row patterns

Worked example — CROSS JOIN to generate a date spine

SQL interview question on the anti-join idiom

Solution Using LEFT JOIN ... WHERE right IS NULL

4. Aggregates and GROUP BY cheat sheet

Five standard aggregates, four advanced extensions, and one universal NULL rule

Worked example — conditional aggregates with FILTER

Worked example — GROUPING SETS for one-query subtotals

Worked example — HAVING vs WHERE debugged the right way

SQL interview question on COUNT + GROUP BY edge cases

Solution Using COUNT(DISTINCT) with HAVING

5. Window functions cheat sheet

Four function families, one OVER clause, and 80% of senior SQL rounds

Worked example — running total of daily revenue per customer

Worked example — LAG and LEAD to compute deltas without self-joins

Worked example — ROW_NUMBER vs RANK vs DENSE_RANK on tied scores

SQL interview question on rolling averages

Solution Using AVG() OVER (... ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)

Cheat sheet — copy-paste templates

Frequently asked questions

Is SQL still in demand in 2026?

Which SQL dialect should I learn first?

ANSI SQL vs vendor-specific SQL — which should I write?

What's the hardest SQL concept for interviews?

How long does it take to master SQL for DE interviews?

Should I learn SQL or Python first as a data engineer?

Practice on PipeCode

Solution Using `ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_ts DESC)`

Solution Using `LEFT JOIN ... WHERE right IS NULL`

Worked example — `HAVING` vs `WHERE` debugged the right way

Solution Using `COUNT(DISTINCT)` with `HAVING`

Four function families, one `OVER` clause, and 80% of senior SQL rounds

Worked example — `LAG` and `LEAD` to compute deltas without self-joins

Worked example — `ROW_NUMBER` vs `RANK` vs `DENSE_RANK` on tied scores

Solution Using `AVG() OVER (... ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)`