A good sql cheat sheet is the single highest-leverage artefact a data engineer can keep open during interview prep — it compresses the language into the eight clauses that run, the eight joins that ship, the five aggregates that count, and the four window-function families that sort. When the screen-share clock starts and the interviewer says "write a query that ranks orders within each customer," you do not have time to scroll Stack Overflow; you need the muscle memory to type ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_ts DESC) without thinking.
This guide is the sql language cheat sheet every senior DE wishes they had at the start of their career. It is also a working sql query cheat sheet for production code reviews — the same clause-order rules that make WHERE reject a SELECT alias also make the planner prune partitions before any expensive join runs. We will walk the sql commands cheat sheet layer by layer: how a query is parsed versus executed, every join shape with a Venn diagram and a one-line snippet, the aggregate functions and their NULL pitfalls, and the window-function syntax with its frame clauses. Treat it as a sql syntax reference for interview prep, a sql reference card for the first six months on the job, and a teaching deck for whoever is mentoring you.
When you want hands-on reps immediately after each section, browse SQL practice library →, drill join problems →, and rehearse on window-function drills →.
On this page
- Why every DE needs a SQL cheat sheet in 2026
- The SQL clause execution order — what runs when
- Joins cheat sheet — INNER, LEFT, RIGHT, FULL, SELF, ANTI, SEMI, CROSS
- Aggregates and GROUP BY cheat sheet
- Window functions cheat sheet
- Cheat sheet — copy-paste templates
- Frequently asked questions
- Practice on PipeCode
1. Why every DE needs a SQL cheat sheet in 2026
A working SQL reference card is interview survival kit — and on-the-job leverage
The one-sentence invariant: SQL is still the lingua franca of data — every warehouse, every lakehouse, every streaming engine eventually exposes a SQL surface, and every DE interview tests whether you can write one in 90 seconds without an IDE. A printable sql syntax reference turns "I think I remember the syntax" into "I know the syntax cold" — which is the difference between solving the question on the whiteboard and burning four of your forty-five minutes on a missing GROUP BY.
The three muscle-memory levels every DE goes through.
-
Level 1 — syntax recall. You can type
SELECT … FROM … WHERE …without typos. You knowJOINdefaults toINNER. You knowCOUNT(*)is different fromCOUNT(col). This is the bar a junior interview tests. -
Level 2 — plan reading. You can read an
EXPLAIN ANALYZEand tell whether the planner picked a hash join, a nested loop, or a merge join. You can tell when aWHEREpredicate was pushed down past aGROUP BYand when it wasn't. This is the bar a senior interview tests. -
Level 3 — optimization. You can rewrite a correlated subquery as a join, an
INas anEXISTS, or a 5-table star join as a CTE chain — and explain why each rewrite changes the plan. This is the bar staff-level rounds test.
Why a one-page reference is so much faster than the docs.
- Latency. Postgres docs are 4,500 pages. The most-used clauses live on one card. In a 45-minute interview, every second spent searching the docs is a second not spent thinking.
-
Coverage in the right shape. Documentation is reference order — alphabetical, exhaustive, abstract. A cheat sheet is decision order — "I need to rank within a group → here is
ROW_NUMBERand the partition syntax" — which is the shape your brain queries during an interview. -
Cross-dialect coverage. Postgres, MySQL, Snowflake, and BigQuery diverge on half the clauses. A good
sql commands cheat sheetlists the dialect deltas inline, so you don't ship aLIMIT 10to Snowflake where you neededLIMIT 10to BigQuery'sLIMIT 10syntax (orTOP 10on T-SQL).
Dialect differences that show up in interviews.
-
Pagination. Postgres / MySQL / Snowflake use
LIMIT N OFFSET M; SQL Server usesOFFSET … FETCH NEXT; legacy T-SQL usesTOP N. BigQuery accepts bothLIMITandFETCH FIRST. -
String concat. Postgres / Standard SQL:
||. MySQL:CONCAT(). SQL Server:+. Cross-dialect-safe:CONCAT()(works in all four). -
Date arithmetic. Postgres:
now() - interval '7 days'. MySQL:NOW() - INTERVAL 7 DAY. Snowflake / BigQuery:DATE_SUB(CURRENT_DATE, INTERVAL 7 DAY). Always say the dialect out loud during the interview — interviewers reward it. -
Window functions. Standard across all four engines for
ROW_NUMBER,RANK,DENSE_RANK,LAG,LEAD,SUM() OVER. MySQL only added them in 8.0; BigQuery has additional analytical extensions likePERCENTILE_CONT. -
QUALIFY. Snowflake / BigQuery / Databricks. Filters on a window-function result without a subquery. Postgres / MySQL still need a subquery.
The 2026 reality check.
- AI assistants. They handle the boilerplate but still hallucinate joins on the wrong key and miss NULL semantics. A human cheat sheet is the verification layer.
- Cross-engine workloads. A senior DE in 2026 typically writes against Snowflake, BigQuery, and Postgres in the same week. The cheat sheet is the cognitive shock absorber for dialect switching.
- Interview shapes. SQL still dominates the 60-minute screen at every FAANG-tier DE loop — usually one window-function problem, one join correctness problem, and one aggregation-with-edge-cases problem.
What interviewers listen for when they ask a SQL question.
- Do you ask "which dialect" before writing? — senior signal.
- Do you say "GROUP BY runs before SELECT, so I can't reuse a SELECT alias here" without prompting? — clause-order mastery signal.
- Do you reach for
EXISTSoverINfor semi-joins on nullable columns? — required answer. - Do you choose
ROW_NUMBERoverRANKdeliberately, naming the tie-breaking semantics? — window-function fluency signal.
Worked example — the 90-second SQL whiteboard
Detailed explanation. A common opening probe is "write the SQL for the top order per customer." It tests three things at once: do you know window functions, do you handle ties correctly, and do you wrap them in a sub-query because WHERE cannot reference a window result.
Question. Given an orders(customer_id, order_id, amount, order_ts) table, write a query that returns one row per customer — the row corresponding to the largest single amount for that customer, with ties broken by the latest order_ts. Use only standard SQL.
Input.
| customer_id | order_id | amount | order_ts |
|---|---|---|---|
| 7 | 1001 | 120.00 | 2026-05-01 09:00 |
| 7 | 1002 | 250.00 | 2026-05-04 11:10 |
| 7 | 1003 | 250.00 | 2026-05-06 08:42 |
| 9 | 2001 | 80.00 | 2026-05-02 10:00 |
| 9 | 2002 | 95.00 | 2026-05-07 14:20 |
Code.
WITH ranked AS (
SELECT
customer_id,
order_id,
amount,
order_ts,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY amount DESC, order_ts DESC
) AS rn
FROM orders
)
SELECT customer_id, order_id, amount, order_ts
FROM ranked
WHERE rn = 1;
Step-by-step explanation.
-
Window first.
ROW_NUMBER()assigns a strictly-increasing integer per partition.PARTITION BY customer_idrestarts the count at every new customer.ORDER BY amount DESC, order_ts DESCranks by amount, with the latest timestamp winning ties. -
Wrap in a CTE.
WHEREruns beforeSELECT, so it cannot reference the aliasrndirectly. The CTE materialises the rank, then the outerSELECTfilters on it. -
Why
ROW_NUMBER, notRANK.RANKwould tie order 1002 and 1003 at rank 1 and return both.ROW_NUMBERgives a deterministic tie-breaker via the secondaryORDER BY. -
Why no
GROUP BY. AGROUP BY customer_idwithMAX(amount)would collapse the rows and loseorder_idandorder_ts. Window functions keep every input row available.
Output.
| customer_id | order_id | amount | order_ts |
|---|---|---|---|
| 7 | 1003 | 250.00 | 2026-05-06 08:42 |
| 9 | 2002 | 95.00 | 2026-05-07 14:20 |
Rule of thumb. "Top-N per group" is always ROW_NUMBER over a partition, filtered in an outer query. Reach for it before any MAX(...) + self-join contortion.
Worked example — write the same query in four dialects
Detailed explanation. A senior interview round sometimes asks "how would this same query change if you ran it on Snowflake instead of Postgres?" — testing whether you can switch dialects on the fly. The query below — top order per customer — appears in every DE interview shop's question bank in some form, and writing it across four dialects in 60 seconds is a senior-level fluency signal.
Question. Write the "top order per customer" query in PostgreSQL (using a CTE), Snowflake (using QUALIFY), BigQuery (using QUALIFY), and MySQL 8.0 (using a CTE). Same semantics, dialect-appropriate idioms.
Input.
| customer_id | order_id | amount |
|---|---|---|
| 7 | 1001 | 120 |
| 7 | 1002 | 250 |
| 9 | 2001 | 80 |
| 9 | 2002 | 95 |
Code (PostgreSQL — CTE).
WITH ranked AS (
SELECT
customer_id,
order_id,
amount,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY amount DESC
) AS rn
FROM orders
)
SELECT customer_id, order_id, amount
FROM ranked
WHERE rn = 1;
Code (Snowflake / BigQuery — QUALIFY).
SELECT
customer_id,
order_id,
amount
FROM orders
QUALIFY ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY amount DESC
) = 1;
Code (MySQL 8.0 — CTE).
WITH ranked AS (
SELECT
customer_id,
order_id,
amount,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY amount DESC
) AS rn
FROM orders
)
SELECT customer_id, order_id, amount
FROM ranked
WHERE rn = 1;
Step-by-step explanation.
-
PostgreSQL. No
QUALIFYsupport, so the CTE pattern is mandatory. Identical to the MySQL 8.0 syntax (MySQL caught up in 8.0). -
Snowflake / BigQuery / Databricks.
QUALIFYlets you filter on a window function inline. Same logical plan as the CTE version; saves one indentation level and a name. -
MySQL 8.0. Window functions arrived in 8.0; before that you'd have written a self-join with
MAX()— slow and verbose. Avoid MySQL 5.x in 2026 unless legacy. - Why the dialect matters. The query plan is identical across all four engines; only the syntactic sugar differs. The CTE form is the dialect-portable fallback when you don't know the engine.
Output (all four dialects).
| customer_id | order_id | amount |
|---|---|---|
| 7 | 1002 | 250 |
| 9 | 2002 | 95 |
Rule of thumb. Default to the CTE pattern in interviews — it works in every dialect. Mention QUALIFY as a one-line shorthand when you know the engine supports it (Snowflake / BigQuery / Databricks / Teradata). That single sentence is a senior signal that you've worked across more than one warehouse.
SQL interview question on the muscle-memory drill
A common opener at a 30-minute screen is "before we look at the data, write me the skeleton of a window-function query that finds the latest event per user." It looks trivial but it tests clause order, partition syntax, and tie-breaking — all in one ~60-second answer.
Solution Using ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_ts DESC)
WITH ranked_events AS (
SELECT
user_id,
event_id,
event_type,
event_ts,
ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY event_ts DESC
) AS rn
FROM events
)
SELECT user_id, event_id, event_type, event_ts
FROM ranked_events
WHERE rn = 1;
Step-by-step trace.
| Input row | user_id | event_id | event_ts | Partition rank (DESC) |
|---|---|---|---|---|
| 1 | u_42 | e_001 | 2026-05-10 09:00 | rn=3 |
| 2 | u_42 | e_002 | 2026-05-11 14:30 | rn=2 |
| 3 | u_42 | e_003 | 2026-05-12 18:42 | rn=1 |
| 4 | u_99 | e_010 | 2026-05-08 07:00 | rn=2 |
| 5 | u_99 | e_011 | 2026-05-09 11:00 | rn=1 |
Output:
| user_id | event_id | event_type | event_ts |
|---|---|---|---|
| u_42 | e_003 | checkout | 2026-05-12 18:42 |
| u_99 | e_011 | view | 2026-05-09 11:00 |
Why this works — concept by concept:
-
Window function —
ROW_NUMBER()enumerates rows without collapsing them. Every input row remains addressable; only the rank column is new. -
PARTITION BY —
user_idresets the counter at every new partition. Twou_42rows and threeu_99rows each get their own 1..N enumeration. -
ORDER BY in OVER —
event_ts DESCdecides which row ends up at rank 1. The window'sORDER BYis independent of any outerORDER BY. -
Outer WHERE on
rn— clause order forbids referencing window aliases inside the sameSELECT'sWHERE. The CTE / sub-query materialisesrnso the outer query can filter on it. -
Cost —
O(N log N)for the partition-sort;O(N)for the outer filter. Dominant term is the sort.
SQL
Topic — window functions
Window-function problems (top-N per group, latest-event-per-user)
2. The SQL clause execution order — what runs when
Written order is not execution order — that one sentence unlocks half the SQL interview
The mental model in one line: SQL is the only language where the order you read is not the order the engine runs. You type SELECT first; the engine runs it fifth. Once you internalise the eight-step execution pipeline, half of the "why doesn't my alias work in WHERE?" questions vanish, and the other half — window-function ordering, HAVING vs WHERE, DISTINCT placement — fall out by deduction.
The written order — what you type.
-
SELECT— projection list, what columns you want. -
FROM— base tables and joins. -
WHERE— row-level filter. -
GROUP BY— grouping keys. -
HAVING— group-level filter. -
ORDER BY— final sort. -
LIMIT— top-N (and optionalOFFSET).
The logical execution order — what the engine actually runs.
-
Step 1 —
FROM+JOIN. The engine resolves the base tables and applies any join predicates. Cross joins fan out the cartesian product; inner joins prune to matches; outer joins preserve unmatched rows. -
Step 2 —
WHERE. Row-level predicates are applied next. This is before aggregation, soWHEREcannot reference aggregate functions orGROUP BYaliases. The optimizer often pushesWHEREpredicates down past the join when it can (predicate pushdown). -
Step 3 —
GROUP BY. Rows are bucketed by the grouping keys. Each bucket becomes one row in the next step; non-grouped columns must be wrapped in aggregates or they will fail strict-mode SQL. -
Step 4 —
HAVING. Group-level filter.HAVINGis the only place you can filter on aggregate results beforeSELECThas finished running. -
Step 5 —
SELECT. The projection list is computed. Aliases defined here are visible to later clauses only. Window functions evaluate here, not earlier — which is whyWHERE rn = 1always needs a subquery. -
Step 6 —
DISTINCT. Deduplicates the projected rows. Surprising placement:DISTINCTruns afterSELECT, soSELECT DISTINCT a, b FROM tdeduplicates on the projected(a, b)tuple — never on the original column count. -
Step 7 —
ORDER BY. The final sort. Because it runs afterSELECT, it can referenceSELECTaliases directly (ORDER BY rn ASCworks without a subquery). -
Step 8 —
LIMIT/OFFSET. The very last step. Determines how many rows the client receives.
Three consequences that show up in interviews.
-
WHEREcan't seeSELECTaliases. Step 2 < step 5. If you writeWHERE total > 100wheretotalis a SELECT alias, the engine raisescolumn does not exist. -
ORDER BYcan seeSELECTaliases. Step 7 > step 5. This is whyORDER BY net_revenue DESCworks on the projected alias. -
Window functions cannot be filtered in
WHERE. Window functions evaluate in step 5;WHEREran in step 2. Wrap in a CTE / subquery, then filter outside.
Two mental tricks that stop the confusion.
-
"Engines read tables before they read columns."
FROMalways runs first. Everything you can do — joins, filters, projections — is downstream of which tables exist. - "Aliases are born at SELECT and live downstream." If a clause runs before SELECT, it can't see the alias. If a clause runs after SELECT, it can.
Worked example — why does this query break
Detailed explanation. A junior writes a query that looks reasonable: SELECT amount * 1.1 AS taxed FROM orders WHERE taxed > 100;. The engine raises column "taxed" does not exist. Walking through the eight-step order tells you why and how to fix it.
Question. Given orders(order_id, amount), the analyst writes the query below and gets ERROR: column "taxed" does not exist. Explain why and rewrite it correctly.
Input.
| order_id | amount |
|---|---|
| 1001 | 120.00 |
| 1002 | 80.00 |
| 1003 | 95.00 |
| 1004 | 250.00 |
Code (the broken query).
SELECT amount * 1.1 AS taxed
FROM orders
WHERE taxed > 100;
Step-by-step explanation.
-
Step 1 —
FROM orders. Engine resolves the table. So far so good. -
Step 2 —
WHERE taxed > 100. Engine tries to filter rows. It looks uptaxedin theordersschema —taxedis not a column there. Error. -
Step 5 —
SELECT amount * 1.1 AS taxed. The engine would have computed the alias here, but execution never reached step 5 because step 2 already failed. -
Two fixes. Either repeat the expression in
WHERE, or wrap the projection in a subquery / CTE so the outerWHEREruns after the projected alias materialises.
Code (fix 1 — repeat the expression).
SELECT amount * 1.1 AS taxed
FROM orders
WHERE amount * 1.1 > 100;
Code (fix 2 — subquery so taxed exists at WHERE time).
SELECT taxed
FROM (
SELECT amount * 1.1 AS taxed
FROM orders
) AS t
WHERE taxed > 100;
Output.
| taxed |
|---|
| 132.00 |
| 104.50 |
| 275.00 |
Rule of thumb. "Can I reference an alias here?" Ask which clause is using it and where the alias was born. Aliases born in SELECT (step 5) live in ORDER BY (step 7) but not in WHERE (step 2). HAVING (step 4) can use grouping-key aliases but not SELECT aliases.
Worked example — where window functions sit in the execution order
Detailed explanation. Window functions evaluate in step 5 (during SELECT) — after GROUP BY and HAVING, before DISTINCT and ORDER BY. This is why you can't filter a window result in WHERE and why you can ORDER BY a window alias without a subquery.
Question. Given events(user_id, event_ts, page), write a single query that returns each user's third-most-recent page view, with the user's total event count as a side column — without any subquery (Snowflake / BigQuery QUALIFY).
Input.
| user_id | event_ts | page |
|---|---|---|
| u_1 | 2026-05-10 08:00 | /home |
| u_1 | 2026-05-11 09:00 | /cart |
| u_1 | 2026-05-12 10:00 | /checkout |
| u_1 | 2026-05-13 11:00 | /done |
| u_2 | 2026-05-09 07:00 | /home |
| u_2 | 2026-05-10 08:30 | /cart |
| u_2 | 2026-05-11 12:15 | /checkout |
| u_2 | 2026-05-12 13:45 | /done |
Code (Snowflake / BigQuery — using QUALIFY).
SELECT
user_id,
event_ts,
page,
COUNT(*) OVER (PARTITION BY user_id) AS total_events,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_ts DESC) AS recency_rank
FROM events
QUALIFY recency_rank = 3
ORDER BY user_id;
Step-by-step explanation.
-
FROM events— step 1 — table resolved. -
WHERE(absent) — step 2 — every row passes. -
GROUP BY(absent) — step 3 — skipped; window functions do not requireGROUP BY. -
HAVING(absent) — step 4 — skipped. -
SELECT … OVER (…)— step 5 — every row is annotated with the window outputs.COUNT(*) OVER (PARTITION BY user_id)adds the total per user;ROW_NUMBER()ranks events from newest to oldest. -
QUALIFY recency_rank = 3— step 5b — Snowflake/BigQuery dialect extension that filters on the window result inline. Behaves like aWHEREwould, ifWHEREcould see window aliases. -
ORDER BY user_id— step 7 — final sort for human readability.
Output.
| user_id | event_ts | page | total_events | recency_rank |
|---|---|---|---|---|
| u_1 | 2026-05-11 09:00 | /cart | 4 | 3 |
| u_2 | 2026-05-10 08:30 | /cart | 4 | 3 |
Rule of thumb. On Snowflake / BigQuery / Databricks reach for QUALIFY; on Postgres / MySQL fall back to the CTE pattern (WITH ranked AS (...) SELECT … WHERE rn = 3). The mental model is identical; only the syntax sugar differs.
SQL interview question on clause execution order
A staff-level interviewer often probes with: "I've got SELECT customer_id, COUNT(*) AS n FROM orders GROUP BY customer_id HAVING n > 5 ORDER BY n DESC; — does that work in Postgres?" The answer tests whether you know that HAVING can see aliases declared in the same SELECT only on lenient dialects, and where the dialect lines fall.
Solution Using clause-order reasoning
-- Standard-SQL safe rewrite (works in every dialect)
SELECT customer_id, COUNT(*) AS n
FROM orders
GROUP BY customer_id
HAVING COUNT(*) > 5
ORDER BY n DESC;
Step-by-step trace.
| Step | Clause | What runs | Visible aliases |
|---|---|---|---|
| 1 | FROM orders |
table resolved | — |
| 2 |
WHERE (absent) |
skipped | — |
| 3 | GROUP BY customer_id |
rows bucketed |
customer_id (grouping key) |
| 4 | HAVING COUNT(*) > 5 |
group filter |
customer_id, aggregates |
| 5 | SELECT customer_id, COUNT(*) AS n |
projection + alias born |
customer_id, n
|
| 7 | ORDER BY n DESC |
sort using SELECT alias |
customer_id, n
|
Output:
| customer_id | n |
|---|---|
| 7 | 12 |
| 9 | 9 |
| 11 | 7 |
Why this works — concept by concept:
-
Logical execution order —
GROUP BYhappens beforeSELECT, soHAVING(which runs betweenGROUP BYandSELECT) must use the aggregate expression itself in standard SQL. Postgres and SQL Server enforce this strictly; MySQL and SQLite let you reference SELECT aliases inHAVINGas a non-standard convenience. -
SELECT aliases are born late — they are visible from
SELECTonward, soORDER BY n DESCworks in every dialect. -
Aggregate visibility in HAVING —
HAVINGis the only place to filter on aggregate functions; usingWHERE COUNT(*) > 5would fail because aggregates are not defined yet at step 2. -
Cost —
O(N)for the group hash;O(G log G)for the final sort whereGis the number of distinct groups.
SQL
Topic — SQL
SQL clause-order problems (HAVING vs WHERE, aliases)
3. Joins cheat sheet — INNER, LEFT, RIGHT, FULL, SELF, ANTI, SEMI, CROSS
Eight join shapes — and at least three appear in every SQL interview
The mental model in one line: a join is two tables and one predicate; the join type decides which rows survive when the predicate is or isn't met. The eight shapes below cover ~100% of interview surface. INNER and LEFT cover 80% of production code; FULL and CROSS are niche; ANTI and SEMI are the senior signals — they have no JOIN keyword in standard SQL but the planner treats them as join types.
The eight joins, one line each.
-
INNER JOIN— matched rows only.A INNER JOIN B ON A.id = B.id. Default when you omit the keyword. -
LEFT JOIN— all of A, nullable B.A LEFT JOIN B ON A.id = B.id. Use when "every A must appear" is the requirement. -
RIGHT JOIN— all of B, nullable A. Most teams ban it in code review; rewrite asB LEFT JOIN Ainstead. -
FULL OUTER JOIN— every row from both, nulls where they don't match. Use for reconciliation queries. -
SELF JOIN— table joined to itself with two aliases. Hierarchical data, pairs, lag-by-key. -
ANTI JOIN— A rows that have no match in B. NoJOINkeyword:LEFT JOIN ... WHERE B.id IS NULLorNOT EXISTS. -
SEMI JOIN— A rows that have at least one match in B, returning each A row once. NoJOINkeyword:EXISTS(preferred) orIN. -
CROSS JOIN— cartesian product.A CROSS JOIN B= every row of A paired with every row of B. Use only with care.
The five join decisions every interview probes.
-
"INNER or LEFT?" — does the row need to survive if there is no match in B? If yes,
LEFT. If no,INNER. Default reflex should beLEFTwhen the right side is a lookup that may be incomplete. -
"
INorEXISTS?" —EXISTSis faster on large subqueries because it short-circuits at the first match.INmaterialises the entire sub-result. UseEXISTSfor SEMI joins on nullable columns to avoid the NULL trap. -
"How do I exclude rows that match a list?" —
NOT EXISTSorLEFT JOIN ... WHERE right IS NULL. NeverNOT INwhen the right side has NULLs —NOT IN (NULL)returns NULL, which evaluates as FALSE for every row, so the query returns zero rows. -
"Self-join vs window function?" — for "previous row per key" patterns, use
LAG()(window). For "every pair satisfying a predicate," use a self-join. Window is faster when the engine supports it. -
"CROSS JOIN with WHERE — danger?" —
A CROSS JOIN B WHERE A.id = B.idis identical toINNER JOIN, but the planner may not realise it (older engines). Always use the explicitJOIN ... ONsyntax — semantics are clearer and plans more predictable.
The NULL trap that catches every junior.
-
LEFT JOIN B ON A.id = B.id— unmatched A rows still appear, with every B column =NULL. -
WHERE B.col = 5after aLEFT JOINaccidentally turns it into anINNER JOIN. Always put the filter into theONclause for outer joins:LEFT JOIN B ON A.id = B.id AND B.col = 5. - Composite key joins with one nullable column —
ON A.k1 = B.k1 AND A.k2 = B.k2— produce zero rows whenA.k2is NULL. UseIS NOT DISTINCT FROM(Postgres) or(A.k2 = B.k2 OR (A.k2 IS NULL AND B.k2 IS NULL))for NULL-safe equality.
Dialect deltas for joins.
-
USING (col)— implicitON a.col = b.col. Works in Postgres / MySQL / Snowflake / BigQuery. Convenient for natural joins on identically-named columns. -
LATERAL— Postgres, BigQuery (LEFT JOIN UNNEST), Snowflake. Lets the right-hand side of the join reference left-hand columns. Powerful for top-N-per-group with a sub-query. -
QUALIFY— Snowflake / BigQuery / Databricks / Teradata. Filters on a window-function result inline — no subquery needed. -
ANTI JOIN/SEMI JOINkeywords — Spark SQL and Databricks support the explicit keywords; Postgres, MySQL, Snowflake, and BigQuery require theEXISTS/LEFT JOIN ... IS NULLpatterns.
Worked example — LEFT JOIN with a WHERE that secretly becomes INNER
Detailed explanation. This is the single most-tested join bug in interviews. The interviewer writes a LEFT JOIN, then pushes a filter on the right table into WHERE, and the unmatched A rows silently vanish. Recognising it and pulling the filter into ON is a junior-to-mid signal.
Question. A team writes SELECT u.user_id, p.plan_name FROM users u LEFT JOIN plans p ON u.user_id = p.user_id WHERE p.tier = 'pro'; and asks why churned users have disappeared from the report. Explain the bug and fix it.
Input — users and plans.
| user_id | plan_name | tier |
|---|---|---|
| 1 | annual | pro |
| 2 | monthly | free |
| 3 | annual | pro |
Code (the broken query).
SELECT u.user_id, p.plan_name
FROM users u
LEFT JOIN plans p
ON u.user_id = p.user_id
WHERE p.tier = 'pro';
Step-by-step explanation.
- The
LEFT JOINproduces 4 rows — every user, with NULLs on the right foruser_id=4(no plan). - The
WHERE p.tier = 'pro'filter compares NULL against'pro'. SQL three-valued logic:NULL = 'pro'isNULL, treated as FALSE. - So
user_id=4is filtered out — converting theLEFT JOINinto an effectiveINNER JOIN. -
user_id=2is also dropped becausetier='free'is not'pro'. - The fix: move the tier filter into the
ONclause. That keeps the join outer — unmatched rows on the right survive — and additionally enforces the tier predicate during the match.
Code (the fix).
SELECT u.user_id, p.plan_name
FROM users u
LEFT JOIN plans p
ON u.user_id = p.user_id
AND p.tier = 'pro';
Output (after fix).
| user_id | plan_name |
|---|---|
| 1 | annual |
| 2 | NULL |
| 3 | annual |
| 4 | NULL |
Rule of thumb. Filters on the right side of a LEFT JOIN belong in the ON, never in WHERE. Filters on the left side belong in WHERE, never in ON. Confuse them and the join semantics silently change.
Worked example — SELF JOIN for hierarchies and previous-row patterns
Detailed explanation. A self join is the same table joined to itself with two different aliases. The two muscle patterns: hierarchical lookups (employee.manager_id → employee.id) and "previous row per key" patterns (predecessor / successor pairs). The latter is now usually written with LAG() instead — but the self-join shape is still asked in interviews because it tests join reasoning without the window-function crutch.
Question. Given employees(emp_id, name, manager_id) where manager_id references emp_id, write a query that lists every employee alongside their manager's name. Employees with no manager (CEO) should still appear.
Input.
| emp_id | name | manager_id |
|---|---|---|
| 1 | Alice | NULL |
| 2 | Bob | 1 |
| 3 | Carla | 1 |
| 4 | Dan | 2 |
| 5 | Eli | 2 |
Code.
SELECT
e.emp_id,
e.name AS employee_name,
m.name AS manager_name
FROM employees e
LEFT JOIN employees m
ON e.manager_id = m.emp_id
ORDER BY e.emp_id;
Step-by-step explanation.
- The
employeestable is referenced twice — once ase(the employee row) and once asm(the same table viewed as the manager record). -
LEFT JOIN ... ON e.manager_id = m.emp_idfinds the manager record for each employee.LEFTkeeps Alice (CEO) in the output despitemanager_id IS NULL. -
m.nameis the manager's display name. For Alice it is NULL because the left-join did not find a match. - The natural alternative —
INNER JOIN— would have dropped Alice silently. Always reach forLEFT JOINwhen "the root must appear" is a requirement.
Output.
| emp_id | employee_name | manager_name |
|---|---|---|
| 1 | Alice | NULL |
| 2 | Bob | Alice |
| 3 | Carla | Alice |
| 4 | Dan | Bob |
| 5 | Eli | Bob |
Rule of thumb. Self joins are the cleanest way to model parent / child relationships in a single table. For multi-level hierarchies (CEO → VP → Manager → IC), reach for a recursive CTE (WITH RECURSIVE) instead.
Worked example — CROSS JOIN to generate a date spine
Detailed explanation. CROSS JOIN produces the cartesian product of two tables — every row of A paired with every row of B. It is almost never what you want at the row level, but it is the canonical way to generate dense reporting grids — a row for every (date, user_id) combination, for example, so downstream LEFT JOIN against fact data fills the gaps.
Question. Build a daily "active or not" report for every user across a 7-day window. Some users have no events on some days; the output must still have one row per (user_id, day) with 0 for inactive days.
Input — events and users.
| user_id |
|---|
| u_1 |
| u_2 |
| u_3 |
| user_id | event_ts |
|---|---|
| u_1 | 2026-05-01 09:00 |
| u_1 | 2026-05-02 11:00 |
| u_2 | 2026-05-03 14:00 |
Code.
WITH date_spine AS (
SELECT day::date AS day
FROM generate_series(DATE '2026-05-01', DATE '2026-05-07', INTERVAL '1 day') AS s(day)
),
grid AS (
SELECT u.user_id, d.day
FROM users u
CROSS JOIN date_spine d
),
daily_events AS (
SELECT
user_id,
DATE(event_ts) AS day,
COUNT(*) AS event_count
FROM events
GROUP BY user_id, DATE(event_ts)
)
SELECT
g.user_id,
g.day,
COALESCE(e.event_count, 0) AS event_count,
CASE WHEN e.event_count > 0 THEN 1 ELSE 0 END AS active_flag
FROM grid g
LEFT JOIN daily_events e
ON e.user_id = g.user_id
AND e.day = g.day
ORDER BY g.user_id, g.day;
Step-by-step explanation.
-
date_spineenumerates the seven calendar days. Usegenerate_series(Postgres),GENERATE_DATE_ARRAY(BigQuery), or a recursive CTE (engine-neutral) for the same effect. -
gridis theCROSS JOINofusers × date_spine—3 × 7 = 21rows, one per(user_id, day)combination. -
daily_eventsaggregates the raw events to one row per(user_id, day)with a count. - The outer
LEFT JOINmatches the grid against the actual data; missing combinations getNULL, whichCOALESCEconverts to0. - The result is dense — every user appears on every day, even on days with no activity. Perfect for downstream BI tools that expect a regular grid.
Output (partial).
| user_id | day | event_count | active_flag |
|---|---|---|---|
| u_1 | 2026-05-01 | 1 | 1 |
| u_1 | 2026-05-02 | 1 | 1 |
| u_1 | 2026-05-03 | 0 | 0 |
| u_1 | 2026-05-04 | 0 | 0 |
| u_1 | 2026-05-07 | 0 | 0 |
| u_2 | 2026-05-01 | 0 | 0 |
| u_2 | 2026-05-03 | 1 | 1 |
| u_3 | 2026-05-01 | 0 | 0 |
| u_3 | 2026-05-07 | 0 | 0 |
Rule of thumb. CROSS JOIN is the only join where the row count multiplies. Use it deliberately to generate dense grids; never use it accidentally — a missing ON clause on an INNER JOIN turns into a CROSS JOIN and silently explodes the row count.
SQL interview question on the anti-join idiom
A common probe is "find users who have never placed an order" — the canonical anti-join question. It tests whether you reach for LEFT JOIN ... IS NULL, NOT EXISTS, or (incorrectly) NOT IN.
Solution Using LEFT JOIN ... WHERE right IS NULL
SELECT u.user_id, u.email
FROM users u
LEFT JOIN orders o
ON o.user_id = u.user_id
WHERE o.user_id IS NULL;
Step-by-step trace.
| u.user_id | u.email | matching o rows | After LEFT JOIN | After WHERE o.user_id IS NULL |
|---|---|---|---|---|
| 1 | a@ex.com | 2 orders | 2 rows with o.user_id=1 | filtered out |
| 2 | b@ex.com | 0 orders | 1 row with o.user_id=NULL | kept |
| 3 | c@ex.com | 1 order | 1 row with o.user_id=3 | filtered out |
| 4 | d@ex.com | 0 orders | 1 row with o.user_id=NULL | kept |
Output:
Why this works — concept by concept:
-
Anti-join shape — "in A and not in B" is a left outer join with a NULL filter on the right join column. The unmatched rows have NULL on every right column; filtering on
WHERE B.join_col IS NULLkeeps exactly those. -
Why not
NOT IN— if anyorders.user_idis NULL (orders with no associated user),NOT IN (NULL, 1, 3)isNULLfor every row, treated as FALSE, and the query returns zero.LEFT JOIN ... IS NULLandNOT EXISTSare immune to this. -
NOT EXISTSequivalent —NOT EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.user_id)produces the same result; many planners optimise it identically. -
Cost —
O(N + M)for a hash-anti-join when the planner picks it;O(N × M)worst case if no index supports the join.
SQL
Topic — anti-join
Anti-join problems (in-A-not-in-B, churn, missing keys)
4. Aggregates and GROUP BY cheat sheet
Five standard aggregates, four advanced extensions, and one universal NULL rule
The mental model in one line: an aggregate function compresses many rows into one value per group; GROUP BY defines the groups; HAVING filters the groups; every standard aggregate ignores NULL except COUNT(*). Once you have that rule, the entire aggregates surface — the silent COUNT bug, the WHERE vs HAVING decision, the GROUPING SETS vs CUBE vs ROLLUP family — is mostly bookkeeping.
The five standard aggregates and what they ignore.
-
COUNT(*)— counts every row, including ones where every column is NULL. ReturnsBIGINT. Cheapest aggregate (no column read). -
COUNT(col)— counts rows wherecol IS NOT NULL. ReturnsBIGINT. The silent NULL bug:COUNT(email)is not the row count; it is the non-null-email row count. -
COUNT(DISTINCT col)— counts distinct non-null values. Expensive — engine must materialise a distinct set. Approximate-COUNT-DISTINCT (APPROX_COUNT_DISTINCT/COUNT(DISTINCT col)with HyperLogLog hints) is much cheaper. -
SUM(col)— adds non-null values. ReturnsNUMERICorBIGINT(watch overflow on summingINTcolumns over billions of rows — cast toBIGINTdefensively). -
AVG(col)— equivalent toSUM(col) / COUNT(col). ReturnsNUMERIC. NULLs are excluded from both numerator and denominator (so the result is the average of defined values). -
MIN(col)/MAX(col)— smallest / largest non-null value. Works on numbers, dates, and strings (lexicographic).
The silent COUNT bug.
- Question: "How many users do we have?" —
SELECT COUNT(email) FROM usersif some users have NULL email returns the wrong number. - The right answer is
SELECT COUNT(*) FROM users, orSELECT COUNT(user_id) FROM usersifuser_idisNOT NULL. - Every senior interviewer asks this. The required reply is "
COUNT(*)counts every row;COUNT(col)ignores NULLs."
WHERE vs HAVING — the two-line rule.
-
WHEREfilters rows before aggregation. Use it to remove input rows. -
HAVINGfilters groups after aggregation. Use it to remove groups based on aggregate values. -
WHERE COUNT(*) > 5is a syntax error — aggregates don't exist yet at step 2. -
HAVING amount > 100works only ifamountis the grouping key; on a non-key column it raises (because aggregates must wrap it).
GROUPING SETS, ROLLUP, CUBE — the advanced grouping family.
-
GROUPING SETS ((a), (b), ())— explicit list of grouping combinations. The empty()is the grand total. Powerful for "one query, many subtotals." -
ROLLUP (year, quarter, month)— hierarchical totals:(year, quarter, month),(year, quarter),(year),(). Used for time-hierarchy summaries. -
CUBE (region, product, channel)— every combination of the dimensions. 2^N groups for N dimensions — careful with cardinality. -
GROUPING(col)— returns 1 ifcolwas grouped on this row, 0 otherwise. Distinguishes "subtotal NULL" from "real data NULL."
FILTER (WHERE …) — conditional aggregates without CASE WHEN.
- Standard SQL since 2003; supported by Postgres, SQLite, and Snowflake.
-
SUM(amount) FILTER (WHERE status = 'paid')replacesSUM(CASE WHEN status = 'paid' THEN amount ELSE 0 END). - Cleaner syntax, identical plan. Most senior reviewers prefer it.
- MySQL / BigQuery don't support
FILTER— fall back toCASE WHEN.
Three classic interview questions on aggregates.
- "How does
COUNT(*)differ fromCOUNT(col)?" — covered above. - "How would you compute the median in SQL?" —
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY col)(standard SQL). Pre-PERCENTILE engines use a window-function trick. - "What does
GROUP BY ()mean?" — single-row aggregate over the whole table (the grand total). Useful insideGROUPING SETS.
Worked example — conditional aggregates with FILTER
Detailed explanation. A common interview probe is "compute paid orders vs cancelled orders per customer in one query." You can do it with SUM(CASE WHEN ...), but FILTER (WHERE ...) is the senior-signal answer in 2026.
Question. Given orders(customer_id, amount, status) with status ∈ {'paid','cancelled','pending'}, compute per customer: total amount paid, total amount cancelled, and total amount pending — in one query.
Input.
| customer_id | order_id | amount | status |
|---|---|---|---|
| 7 | 1001 | 120 | paid |
| 7 | 1002 | 80 | cancelled |
| 7 | 1003 | 95 | paid |
| 7 | 1004 | 60 | pending |
| 9 | 2001 | 200 | paid |
| 9 | 2002 | 50 | cancelled |
Code.
SELECT
customer_id,
SUM(amount) FILTER (WHERE status = 'paid') AS paid_amt,
SUM(amount) FILTER (WHERE status = 'cancelled') AS cancelled_amt,
SUM(amount) FILTER (WHERE status = 'pending') AS pending_amt
FROM orders
GROUP BY customer_id;
Step-by-step explanation.
-
FROM orders— table resolved. -
WHERE(absent) — every row passes. -
GROUP BY customer_id— rows bucketed by customer. -
SELECT— for each group,SUM(amount) FILTER (WHERE status = …)sums only the rows where the filter is true. - The three filters run in parallel inside the same group scan — one pass over the data, three aggregate accumulators per group.
Output.
| customer_id | paid_amt | cancelled_amt | pending_amt |
|---|---|---|---|
| 7 | 215 | 80 | 60 |
| 9 | 200 | 50 | NULL |
Rule of thumb. FILTER (WHERE …) is the modern conditional aggregate. Reach for it on Postgres / Snowflake / SQLite; fall back to SUM(CASE WHEN ... THEN amount ELSE 0 END) on MySQL / BigQuery.
Worked example — GROUPING SETS for one-query subtotals
Detailed explanation. A reporting team wants revenue per region, per product, and the grand total — in one query, in one round-trip to the warehouse. The wrong answer is three separate GROUP BY queries unioned together. The right answer is GROUPING SETS, which the engine evaluates in a single pass and emits all the combinations.
Question. Given sales(region, product, amount), return revenue broken down by region, by product, and as a grand total — in a single result set. Distinguish "subtotal" rows from "real data" rows.
Input.
| region | product | amount |
|---|---|---|
| NA | A | 100 |
| NA | B | 50 |
| EU | A | 80 |
| EU | B | 70 |
| AP | A | 40 |
Code.
SELECT
COALESCE(region, 'ALL_REGIONS') AS region,
COALESCE(product, 'ALL_PRODUCTS') AS product,
GROUPING(region) AS is_region_subtotal,
GROUPING(product) AS is_product_subtotal,
SUM(amount) AS revenue
FROM sales
GROUP BY GROUPING SETS (
(region),
(product),
()
)
ORDER BY is_region_subtotal, is_product_subtotal, region, product;
Step-by-step explanation.
-
GROUPING SETS ((region), (product), ())instructs the engine to aggregate three times in one pass: once grouped byregion, once grouped byproduct, once with no grouping (grand total). -
GROUPING(region)returns 1 on rows whereregionwas not part of the grouping (the "all regions" rows) and 0 otherwise. Same forGROUPING(product). -
COALESCEreplaces the NULL placeholders the engine emits for "subtotal" rows with human-readable labels. - The final
ORDER BYfloats the grand total to the bottom and the per-region / per-product subtotals together.
Output.
| region | product | is_region_subtotal | is_product_subtotal | revenue |
|---|---|---|---|---|
| AP | ALL_PRODUCTS | 0 | 1 | 40 |
| EU | ALL_PRODUCTS | 0 | 1 | 150 |
| NA | ALL_PRODUCTS | 0 | 1 | 150 |
| ALL_REGIONS | A | 1 | 0 | 220 |
| ALL_REGIONS | B | 1 | 0 | 120 |
| ALL_REGIONS | ALL_PRODUCTS | 1 | 1 | 340 |
Rule of thumb. GROUPING SETS ((a), (b), ()) is the explicit form. Reach for ROLLUP (a, b) when the dimensions form a hierarchy (e.g. year → quarter → month), and CUBE (a, b) when you need every combination. All three compile into the same one-pass execution plan.
Worked example — HAVING vs WHERE debugged the right way
Detailed explanation. A common interview trap: candidates put aggregate conditions in WHERE and row-level conditions in HAVING. Either causes confusing errors or — worse — silently incorrect results because the planner allowed it on lenient dialects (MySQL). Mastering the rule "WHERE filters rows before aggregation, HAVING filters groups after aggregation" is muscle memory you need cold.
Question. Given orders(customer_id, order_ts, amount, status), find customers who placed at least 5 paid orders in 2026. Choose the right clause for each predicate and explain why.
Input.
| customer_id | order_ts | amount | status |
|---|---|---|---|
| 7 | 2026-01-15 | 100 | paid |
| 7 | 2026-02-10 | 50 | paid |
| 7 | 2026-02-20 | 80 | cancelled |
| 7 | 2026-03-12 | 120 | paid |
| 7 | 2026-04-01 | 200 | paid |
| 7 | 2026-04-25 | 60 | paid |
| 9 | 2026-03-05 | 300 | paid |
| 9 | 2026-04-10 | 90 | paid |
Code.
SELECT
customer_id,
COUNT(*) AS paid_order_count,
SUM(amount) AS paid_revenue
FROM orders
WHERE order_ts >= DATE '2026-01-01'
AND order_ts < DATE '2027-01-01'
AND status = 'paid'
GROUP BY customer_id
HAVING COUNT(*) >= 5
ORDER BY paid_order_count DESC;
Step-by-step explanation.
-
Date predicate in
WHERE.order_ts >= '2026-01-01'operates on a row column before aggregation runs. Putting it inHAVINGwould be either an error (Postgres / standard SQL) or a silent inefficiency (MySQL would aggregate the entire table and then filter). -
Status predicate in
WHERE.status = 'paid'is also row-level — drop cancelled / pending orders before counting. Putting it inHAVINGwould count cancelled orders too and apply the filter to the wrong number. -
Count predicate in
HAVING.COUNT(*) >= 5is an aggregate — it only makes sense afterGROUP BY. The engine raises an error if you try to use it inWHERE. -
ORDER BYon SELECT alias. Works becauseORDER BYruns afterSELECT, so the aliaspaid_order_countis visible.
Output.
| customer_id | paid_order_count | paid_revenue |
|---|---|---|
| 7 | 5 | 470 |
Rule of thumb. Two-line decision tree: if the predicate is a column-level boolean (date, status, region), put it in WHERE. If the predicate is an aggregate (COUNT, SUM, AVG comparison), put it in HAVING. If the dialect allows both, prefer WHERE because the planner can push it down past joins and indexes.
SQL interview question on COUNT + GROUP BY edge cases
A senior interviewer probes: "Write a query that, for each product, returns total revenue, number of orders, and number of distinct customers — but only for products with at least 100 distinct customers." It tests COUNT(*) vs COUNT(DISTINCT), GROUP BY, and HAVING in one query.
Solution Using COUNT(DISTINCT) with HAVING
SELECT
product_id,
SUM(amount) AS revenue,
COUNT(*) AS order_count,
COUNT(DISTINCT customer_id) AS distinct_customers
FROM orders
WHERE order_ts >= DATE '2026-01-01'
GROUP BY product_id
HAVING COUNT(DISTINCT customer_id) >= 100
ORDER BY revenue DESC;
Step-by-step trace.
| Step | Clause | What it does |
|---|---|---|
| 1 | FROM orders |
resolve table |
| 2 | WHERE order_ts >= '2026-01-01' |
filter rows before aggregation |
| 3 | GROUP BY product_id |
bucket by product |
| 4 | HAVING COUNT(DISTINCT customer_id) >= 100 |
drop groups below threshold |
| 5 | SELECT ... |
compute revenue, order count, distinct customer count |
| 7 | ORDER BY revenue DESC |
final sort |
Output:
| product_id | revenue | order_count | distinct_customers |
|---|---|---|---|
| P_017 | 482,500 | 4,127 | 1,842 |
| P_004 | 391,200 | 3,210 | 1,508 |
| P_022 | 312,000 | 2,890 | 1,201 |
Why this works — concept by concept:
-
COUNT(*)vsCOUNT(DISTINCT col)—COUNT(*)counts rows (one per order);COUNT(DISTINCT customer_id)counts unique customers. Knowing the difference is the senior signal. -
WHEREbefore aggregation —WHERE order_ts >= '2026-01-01'prunes rows before the group hash builds; the engine never aggregates 2024 / 2025 data unnecessarily. -
HAVINGon aggregate —HAVING COUNT(DISTINCT customer_id) >= 100filters groups after aggregation. The same predicate can't go inWHEREbecause the aggregate doesn't exist yet at step 2. -
ORDER BYon SELECT alias —ORDER BY revenue DESCworks becauserevenueis the SELECT alias andORDER BYruns afterSELECT. -
Cost —
O(N)for the date filter;O(N)for the group hash;COUNT(DISTINCT)adds a per-group hash set of customer IDs, making memoryO(N_groups × avg_distinct_customers).
SQL
Topic — aggregation
Aggregation problems (COUNT, GROUP BY, HAVING, conditional aggregates)
5. Window functions cheat sheet
Four function families, one OVER clause, and 80% of senior SQL rounds
The mental model in one line: a window function returns one value per input row computed over a window of related rows defined by PARTITION BY, ordered by ORDER BY, and bounded by a frame clause — it does not collapse rows the way GROUP BY does. Every senior SQL interview asks at least one window-function question, and three-quarters of them reduce to "rank within a partition" or "running total."
The four function families.
-
Ranking.
ROW_NUMBER(),RANK(),DENSE_RANK(),NTILE(n). Assign an integer per row based on theORDER BYinside the window. -
Offset.
LAG(col, n, default),LEAD(col, n, default). Look backward or forward in the partition. The two extra args are offset (default 1) and default value when out of range (default NULL). -
Frame-positional.
FIRST_VALUE(col),LAST_VALUE(col),NTH_VALUE(col, n). Return the value at a specific position in the frame. -
Aggregate-as-window.
SUM() OVER,AVG() OVER,COUNT() OVER,MAX() OVER. Same aggregate functions, evaluated over the window instead of a group.
Ranking — pick the right one.
-
ROW_NUMBER()— strictly 1, 2, 3 with no ties. Use when you need a deterministic "the row" per partition (top-N per group). -
RANK()— 1, 1, 3 — ties share a rank, next rank skips. Use when "joint first place" should be visible. -
DENSE_RANK()— 1, 1, 2 — ties share a rank, next rank does not skip. Use when you want compact ranks across distinct values. -
NTILE(n)— splits the partition intonroughly-equal buckets. Use for quartiles, deciles, percentile bucketing.
Offset — LAG and LEAD.
-
LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_ts)— previous amount in the partition. NULL on the first row. -
LAG(amount, 1, 0) OVER (...)— previous amount with0instead of NULL on the first row. The third arg is the default. -
LEAD(amount) OVER (...)— next amount. Mirror image ofLAG. - Classic use: compute deltas (
amount - LAG(amount) OVER (...)) without a self-join.
Frame clause anatomy.
- The frame clause goes inside the
OVERafterORDER BY. Default frame isRANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW— surprisingly different fromROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROWwhen there are ties in theORDER BY. -
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW— strict row-count frame; running total semantics row-by-row. -
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW— 7-row trailing window; rolling-7 averages. -
ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING— symmetric 7-row window centred on the current row. -
RANGEvsROWS—RANGEgroups ties in theORDER BYinto one logical row;ROWScounts physical rows. UseROWSfor predictability.
PARTITION BY + ORDER BY — the muscle pattern.
-
PARTITION BYrestarts the window at every distinct value of the partition key. Cardinality of partitions × frame width = your memory cost. -
ORDER BYinsideOVERcontrols who comes first. Independent of the outer query'sORDER BY. -
Both can be empty.
OVER ()means "the whole result set is one window" — useful for grand totals as a column on every row.
Common interview shapes — three patterns you'll see.
-
Top-N per group.
ROW_NUMBER() OVER (PARTITION BY k ORDER BY v DESC)thenWHERE rn <= Nin an outer query. -
Running total.
SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_ts ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW). -
7-day rolling average.
AVG(amount) OVER (PARTITION BY user_id ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW).
QUALIFY — Snowflake / BigQuery / Databricks shortcut.
-
QUALIFY ROW_NUMBER() OVER (PARTITION BY k ORDER BY v) = 1filters on a window result without a subquery. - Same semantics as
WHEREwould have ifWHEREcould see window results. - Postgres and MySQL still need the subquery / CTE pattern; in those engines, write the CTE.
Worked example — running total of daily revenue per customer
Detailed explanation. A reporting team wants daily revenue and the running total per customer in the same query. Without window functions you'd self-join or correlate; with SUM() OVER, it is two lines.
Question. Given orders(customer_id, order_ts, amount), produce one row per (customer_id, day) with the daily total and the cumulative total to date for that customer.
Input.
| customer_id | order_ts | amount |
|---|---|---|
| 7 | 2026-05-01 | 120 |
| 7 | 2026-05-01 | 30 |
| 7 | 2026-05-03 | 200 |
| 7 | 2026-05-04 | 50 |
| 9 | 2026-05-02 | 80 |
| 9 | 2026-05-05 | 95 |
Code.
WITH daily AS (
SELECT
customer_id,
DATE(order_ts) AS day,
SUM(amount) AS daily_amt
FROM orders
GROUP BY customer_id, DATE(order_ts)
)
SELECT
customer_id,
day,
daily_amt,
SUM(daily_amt) OVER (
PARTITION BY customer_id
ORDER BY day
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total
FROM daily
ORDER BY customer_id, day;
Step-by-step explanation.
-
Inner CTE. Group rows by
(customer_id, day);SUM(amount)collapses multiple same-day orders into one daily total. -
Outer SELECT — daily values pass through.
customer_id, day, daily_amtare projected as-is. -
Outer SELECT — running total via
SUM() OVER.PARTITION BY customer_idresets the cumulative sum at each new customer;ORDER BY daydecides the chronological order; the explicit frameROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROWensures a true row-by-row running total. -
Outer
ORDER BY. Sort the final output for human reading — independent of the window's internalORDER BY. -
Why the explicit frame. The default
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROWwould lump rows with identicaldayinto one logical step — fine for distinct days but surprising on ties.ROWSis safer to teach and reason about.
Output.
| customer_id | day | daily_amt | running_total |
|---|---|---|---|
| 7 | 2026-05-01 | 150 | 150 |
| 7 | 2026-05-03 | 200 | 350 |
| 7 | 2026-05-04 | 50 | 400 |
| 9 | 2026-05-02 | 80 | 80 |
| 9 | 2026-05-05 | 95 | 175 |
Rule of thumb. Always write the frame clause explicitly for SUM() OVER and AVG() OVER once ORDER BY is present. The default frame is RANGE not ROWS — a subtle mismatch that bites you on ties.
Worked example — LAG and LEAD to compute deltas without self-joins
Detailed explanation. Before window functions existed, "delta vs previous row" required a self-join keyed on row_number or order_ts. With LAG() the query collapses to one line — and the planner often executes it in a single pass with no extra sort. LAG looks backward; LEAD looks forward; both take an optional offset and default.
Question. Given stock_prices(symbol, day, close_price), compute each row's daily change and the percentage change vs the previous trading day, per symbol.
Input.
| symbol | day | close_price |
|---|---|---|
| ACME | 2026-05-01 | 100.00 |
| ACME | 2026-05-02 | 105.00 |
| ACME | 2026-05-03 | 99.00 |
| ACME | 2026-05-04 | 110.00 |
| ZETA | 2026-05-01 | 50.00 |
| ZETA | 2026-05-02 | 52.00 |
Code.
SELECT
symbol,
day,
close_price,
LAG(close_price, 1) OVER (
PARTITION BY symbol
ORDER BY day
) AS prev_close,
close_price - LAG(close_price, 1) OVER (
PARTITION BY symbol
ORDER BY day
) AS abs_change,
ROUND(
100.0 * (close_price - LAG(close_price, 1) OVER (PARTITION BY symbol ORDER BY day))
/ LAG(close_price, 1) OVER (PARTITION BY symbol ORDER BY day),
2
) AS pct_change
FROM stock_prices
ORDER BY symbol, day;
Step-by-step explanation.
-
LAG(close_price, 1) OVER (PARTITION BY symbol ORDER BY day)returns the previous trading day'sclose_priceper symbol. The first row of each partition getsNULLbecause there is no previous row. -
close_price - LAG(...)is the absolute daily change. NULL on the first day of each symbol. - The percentage change wraps the same expression — for production code, factor it into a CTE so
LAG()only appears once and is easier to read. - The outer
ORDER BY symbol, dayis for human reading; it does not influence the window's internal ordering.
Output.
| symbol | day | close_price | prev_close | abs_change | pct_change |
|---|---|---|---|---|---|
| ACME | 2026-05-01 | 100.00 | NULL | NULL | NULL |
| ACME | 2026-05-02 | 105.00 | 100.00 | 5.00 | 5.00 |
| ACME | 2026-05-03 | 99.00 | 105.00 | -6.00 | -5.71 |
| ACME | 2026-05-04 | 110.00 | 99.00 | 11.00 | 11.11 |
| ZETA | 2026-05-01 | 50.00 | NULL | NULL | NULL |
| ZETA | 2026-05-02 | 52.00 | 50.00 | 2.00 | 4.00 |
Rule of thumb. Reach for LAG() / LEAD() whenever the requirement says "previous row" or "next row" inside a key. The default offset is 1; the third argument is the default value when out of range (use LAG(col, 1, 0) if you want 0 instead of NULL on the boundary).
Worked example — ROW_NUMBER vs RANK vs DENSE_RANK on tied scores
Detailed explanation. Every senior SQL interview includes one ranking-function question. The trap: candidates blindly use RANK() everywhere, or default to ROW_NUMBER() without thinking. The three functions handle ties differently, and the choice depends entirely on the business semantics.
Question. Given leaderboard(player_id, score), return three rank columns — ROW_NUMBER, RANK, and DENSE_RANK — so a single result set demonstrates how each function treats tied scores.
Input.
| player_id | score |
|---|---|
| p_1 | 100 |
| p_2 | 95 |
| p_3 | 95 |
| p_4 | 90 |
| p_5 | 85 |
| p_6 | 85 |
| p_7 | 80 |
Code.
SELECT
player_id,
score,
ROW_NUMBER() OVER (ORDER BY score DESC, player_id) AS row_num,
RANK() OVER (ORDER BY score DESC) AS rk,
DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rk
FROM leaderboard
ORDER BY score DESC, player_id;
Step-by-step explanation.
-
ROW_NUMBER() OVER (ORDER BY score DESC, player_id)produces 1, 2, 3, … with no ties because the secondaryORDER BY player_idbreaks them deterministically. Use this when you need exactly one row per "position." -
RANK() OVER (ORDER BY score DESC)produces 1, 2, 2, 4, 5, 5, 7 — tied scores share a rank and the next rank skips by the tie group's size. Use this when "joint second place" must be visible. -
DENSE_RANK() OVER (ORDER BY score DESC)produces 1, 2, 2, 3, 4, 4, 5 — tied scores share a rank but the next rank is contiguous. Use this when ranks should compress to "distinct values seen so far." - The outer
ORDER BY score DESC, player_idmatches theROW_NUMBER's tiebreaker so the output reads top-down.
Output.
| player_id | score | row_num | rk | dense_rk |
|---|---|---|---|---|
| p_1 | 100 | 1 | 1 | 1 |
| p_2 | 95 | 2 | 2 | 2 |
| p_3 | 95 | 3 | 2 | 2 |
| p_4 | 90 | 4 | 4 | 3 |
| p_5 | 85 | 5 | 5 | 4 |
| p_6 | 85 | 6 | 5 | 4 |
| p_7 | 80 | 7 | 7 | 5 |
Rule of thumb. "Top-3 distinct scores including ties" → DENSE_RANK() <= 3. "Top-3 positions, ties shown" → RANK() <= 3. "Top-3 rows, ties broken by a tiebreaker" → ROW_NUMBER() <= 3. The wrong choice changes how many rows you return.
SQL interview question on rolling averages
A senior probe: "Write a 7-day rolling average of revenue per user, with the value defined as 0 if fewer than 7 days are available." Tests frames, partitioning, and the rolling-window vocabulary in one go.
Solution Using AVG() OVER (... ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
WITH daily AS (
SELECT
user_id,
DATE(order_ts) AS day,
SUM(amount) AS daily_amt
FROM orders
GROUP BY user_id, DATE(order_ts)
)
SELECT
user_id,
day,
daily_amt,
AVG(daily_amt) OVER (
PARTITION BY user_id
ORDER BY day
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rolling_7d_avg,
COUNT(*) OVER (
PARTITION BY user_id
ORDER BY day
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rolling_count
FROM daily
ORDER BY user_id, day;
Step-by-step trace.
| user | day | daily_amt | rows in 6-PRECEDING-to-CURRENT frame | rolling_7d_avg | rolling_count |
|---|---|---|---|---|---|
| u_42 | 2026-05-01 | 50 | 1 row | 50.0 | 1 |
| u_42 | 2026-05-02 | 100 | 2 rows | 75.0 | 2 |
| u_42 | 2026-05-03 | 60 | 3 rows | 70.0 | 3 |
| u_42 | 2026-05-04 | 40 | 4 rows | 62.5 | 4 |
| u_42 | 2026-05-05 | 80 | 5 rows | 66.0 | 5 |
| u_42 | 2026-05-06 | 70 | 6 rows | 66.7 | 6 |
| u_42 | 2026-05-07 | 110 | 7 rows | 72.9 | 7 |
| u_42 | 2026-05-08 | 90 | 7 rows | 78.6 | 7 |
Output:
| user_id | day | daily_amt | rolling_7d_avg | rolling_count |
|---|---|---|---|---|
| u_42 | 2026-05-07 | 110 | 72.9 | 7 |
| u_42 | 2026-05-08 | 90 | 78.6 | 7 |
Why this works — concept by concept:
-
7-row trailing frame —
ROWS BETWEEN 6 PRECEDING AND CURRENT ROWincludes the current row and the six immediately previous rows in chronological order. Seven total. -
PARTITION BY user_id— each user has its own rolling-window state; the frame does not bleed across users. -
Companion
COUNT(*) OVER— emits the actual frame size so downstream code can mask values where fewer than 7 rows exist (treat them as 0 or NULL per business rule). -
AVGover a frame — same formula as plainAVG, but evaluated row-by-row over the window instead of once per group. -
Cost —
O(N log N)for the per-partition sort;O(N)for the streaming frame computation. Engines maintain a running sum + count forAVG, so cost is constant per row once the partition is sorted.
SQL
Topic — window functions
Window-function problems (rolling averages, running totals, top-N)
Cheat sheet — copy-paste templates
These six templates are the highest-frequency SQL patterns in DE interviews and production code. Memorise the shape, swap the table and column names, and you have answered most live-coding questions in under 90 seconds. Each one is dialect-portable across Postgres, MySQL 8.0+, Snowflake, and BigQuery.
Top-N per group.
WITH ranked AS (
SELECT
customer_id,
order_id,
amount,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY amount DESC
) AS rn
FROM orders
)
SELECT customer_id, order_id, amount
FROM ranked
WHERE rn <= 3; -- top 3 per customer
Running total.
SELECT
customer_id,
order_ts,
amount,
SUM(amount) OVER (
PARTITION BY customer_id
ORDER BY order_ts
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total
FROM orders;
7-day rolling average.
SELECT
user_id,
day,
daily_amt,
AVG(daily_amt) OVER (
PARTITION BY user_id
ORDER BY day
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rolling_7d_avg
FROM daily;
De-duplication with ROW_NUMBER.
WITH ranked AS (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY natural_key
ORDER BY ingested_at DESC
) AS rn
FROM staging_orders
)
SELECT *
FROM ranked
WHERE rn = 1; -- keep latest version per natural_key
Anti-join idiom (in A, not in B).
-- LEFT JOIN ... IS NULL form
SELECT u.user_id, u.email
FROM users u
LEFT JOIN orders o
ON o.user_id = u.user_id
WHERE o.user_id IS NULL;
-- NOT EXISTS form (NULL-safe equivalent)
SELECT u.user_id, u.email
FROM users u
WHERE NOT EXISTS (
SELECT 1
FROM orders o
WHERE o.user_id = u.user_id
);
Median with PERCENTILE_CONT.
SELECT
customer_id,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median_amount,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY amount) AS p95_amount
FROM orders
GROUP BY customer_id;
Conditional aggregate with FILTER.
SELECT
customer_id,
SUM(amount) FILTER (WHERE status = 'paid') AS paid_amt,
SUM(amount) FILTER (WHERE status = 'cancelled') AS cancelled_amt,
COUNT(*) FILTER (WHERE status = 'paid') AS paid_count
FROM orders
GROUP BY customer_id;
Pivot one column into many (cross-tab).
SELECT
customer_id,
SUM(CASE WHEN month = '2026-04' THEN amount ELSE 0 END) AS apr_amt,
SUM(CASE WHEN month = '2026-05' THEN amount ELSE 0 END) AS may_amt,
SUM(CASE WHEN month = '2026-06' THEN amount ELSE 0 END) AS jun_amt
FROM orders_monthly
GROUP BY customer_id;
Year-over-year delta with LAG.
SELECT
product_id,
year,
revenue,
LAG(revenue, 1, 0) OVER (
PARTITION BY product_id
ORDER BY year
) AS prev_year_revenue,
revenue - LAG(revenue, 1, 0) OVER (
PARTITION BY product_id
ORDER BY year
) AS yoy_delta
FROM yearly_revenue;
Sessionisation with LAG + cumulative sum.
WITH gaps AS (
SELECT
user_id,
event_ts,
CASE
WHEN event_ts - LAG(event_ts) OVER (PARTITION BY user_id ORDER BY event_ts)
> INTERVAL '30 minutes' THEN 1
ELSE 0
END AS is_new_session
FROM events
),
sessions AS (
SELECT
user_id,
event_ts,
SUM(is_new_session) OVER (
PARTITION BY user_id
ORDER BY event_ts
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS session_id
FROM gaps
)
SELECT user_id, session_id, MIN(event_ts) AS session_start, MAX(event_ts) AS session_end, COUNT(*) AS event_count
FROM sessions
GROUP BY user_id, session_id;
SQL
Topic — SQL
All-purpose SQL drills (top-N, running totals, pivots, sessionisation)
Frequently asked questions
Is SQL still in demand in 2026?
Yes — more than ever. Every modern data platform (Snowflake, BigQuery, Databricks SQL Warehouse, Apache Iceberg, dbt, ksqlDB, Materialise) exposes SQL as the primary user surface, and the vast majority of analytics, ML feature engineering, and ad-hoc business questions are still answered in SQL. SQL is the only language that has remained the lingua franca of data across four decades — outliving five generations of "SQL replacements" — and 2026 has actually expanded its reach with streaming SQL and lakehouse SQL engines. A working data engineer in 2026 typically writes more SQL than Python.
Which SQL dialect should I learn first?
Learn PostgreSQL first. It is the closest to standard SQL, has the richest feature set (CTEs, window functions, LATERAL, FILTER, GROUPING SETS, full JSON support, partitions), and the lessons port cleanly to Snowflake, BigQuery, and Redshift. Once Postgres feels comfortable, scan the dialect deltas for whichever warehouse you'll work on day-to-day — Snowflake adds QUALIFY and time-travel syntax, BigQuery adds standard SQL extensions for arrays and structs, and so on. Avoid MySQL as a first dialect because some of its older non-standard behaviour (silent group-by relaxation, implicit casts) builds bad habits you'll have to unlearn.
ANSI SQL vs vendor-specific SQL — which should I write?
Default to ANSI / standard SQL — it ports across engines and survives platform migrations. Reach for vendor-specific syntax only when it is materially clearer or significantly faster. QUALIFY, LATERAL, MATCH_RECOGNIZE, and the array-handling functions are good examples of vendor extensions worth using when the engine supports them; you simply pay a portability cost. In interviews, write standard SQL by default and mention the dialect-specific shortcut as an addendum — that signals both depth and discipline.
What's the hardest SQL concept for interviews?
Window functions with non-default frame clauses. Most candidates can write ROW_NUMBER() OVER (PARTITION BY k ORDER BY v), but the moment the interviewer asks for a 7-day rolling average or a session-bucket assignment, the frame clause (ROWS BETWEEN ... AND ...) becomes the make-or-break detail. Pair that with the difference between ROWS and RANGE framing, and you have the single most common SQL interview pitfall. Drill on running totals, rolling averages, and sessionisation patterns until the syntax is automatic.
How long does it take to master SQL for DE interviews?
Plan on 8–12 weeks of consistent practice — 1–2 hours per day on real problems — to reach the bar for a senior DE SQL round. The first 4 weeks cover the fundamentals (joins, aggregates, window functions, CTEs); the next 4 weeks pattern-match the common interview shapes (top-N, dedup, sessionisation, pivot, time-series); the last weeks polish edge cases (NULL handling, frame clauses, dialect differences, EXPLAIN reading). 450+ DE-focused problems with progressively harder constraints is the canonical sequence — the muscle memory only comes from reps.
Should I learn SQL or Python first as a data engineer?
SQL first. Almost every DE interview opens with a SQL round, and every data role requires SQL fluency from day one. Python is essential at the next layer — Pandas, PySpark, orchestration scripts — but Python without SQL is "I can move data around" while SQL without Python is "I can answer business questions." Spend the first 8 weeks on SQL, the next 8 weeks on Pandas + PySpark + ETL patterns. Once both are at interview-level, you can rotate between them based on the role's emphasis. For most DE pipelines, SQL is still 60–70% of the work; Python wraps it.
Practice on PipeCode
- Drill the SQL practice library → for end-to-end interview reps across joins, aggregates, and window functions.
- Rehearse window-function problems → when the interviewer asks for top-N per group, rolling totals, or sessionisation.
- Sharpen join problems → for the LEFT-vs-INNER bug, anti-join idiom, and self-join patterns.
- Practise aggregation drills → for
COUNTsemantics,HAVING, andGROUPING SETS. - For company-specific reps, work through Meta SQL problems → and Amazon SQL problems →.
- For the broader interview surface, read top data engineering interview questions →.
- Stack the prerequisites with the only 5 skills you need to become a data engineer →.
- Take the full SQL track on SQL for Data Engineering Interviews — From Zero to FAANG →.
- Pair it with data modeling for DE interviews → for schema-design fluency.
Pipecode.ai is Leetcode for Data Engineering — every clause, every join, every aggregate, every window function above ships with hands-on practice rooms where you write the SQL, run it against a real engine, and get scored on correctness and plan quality. Start with the SQL library and work outward; PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine.





Top comments (0)