SQL interview questions for data engineering circle around the same four primitives in every loop, regardless of the company name on the JD: JOIN semantics with LEFT JOIN ... IS NULL for anti-joins, GROUP BY with HAVING for aggregate filters and duplicate detection, window functions like ROW_NUMBER, RANK, DENSE_RANK, LAG, and LEAD for ranking and lookback, and CTEs (including recursive CTEs) plus correlated subqueries for multi-step logic and top-N-per-group queries. Whether the prompt is "find customers who never placed an order", "count duplicate emails", "second-highest salary", or "top 3 salaries per department", the same handful of mental models keeps showing up — interviewers grade fluency with these primitives over memorized syntax.
This guide walks four topic clusters end-to-end, each with a detailed topic explanation, per-sub-topic explanation with a worked example and its solution, common beginner mistakes, and an interview-style problem with a full solution that traces the query step by step. Every example uses PostgreSQL-flavored syntax — the dialect that drives DataLemur, the live-coding environments at most product-analytics companies, and the bulk of public SQL interview corpora — and every solution ends with a concept-by-concept breakdown that explains why the query is correct, what the cost is, and where beginners typically slip.
Top SQL data engineering interview topics
The four numbered sections below follow this topic map — one row per H2, every row expanded into a full section with sub-topics, worked examples, an interview question, and a traced solution:
| # | Topic | Why it shows up in SQL DE interviews |
|---|---|---|
| 1 | INNER JOIN vs LEFT JOIN and `LEFT JOIN ... IS NULL` anti-join | Combine rows from multiple tables without inflating cardinality; the `IS NULL` trick after a `LEFT JOIN` is the canonical "find rows in A with no match in B" pattern (orphan customers, churned users). |
| 2 | GROUP BY and HAVING for aggregates and duplicate detection | `WHERE` filters rows before grouping, `HAVING` filters groups after; the `HAVING COUNT(*) > 1` shape is the universal duplicate-finder. |
| 3 | Window functions — ROW_NUMBER vs RANK vs DENSE_RANK, LAG, LEAD | Per-partition ranking, top-N-per-group, second-highest salary, running totals, lookback for month-over-month deltas — all without collapsing rows. |
| 4 | CTEs, recursive CTEs, and correlated subqueries | `WITH` clauses make multi-step logic readable; recursive CTEs generate sequences and traverse hierarchies; correlated subqueries express row-by-row predicates against the same table. |
Beginner-friendly framing: SQL engines process a query roughly in this order — `FROM`/`JOIN` → `WHERE` (row filter) → `GROUP BY` → aggregates → `HAVING` (group filter) → window functions → `SELECT` → `ORDER BY` → `LIMIT`. When in doubt, ask: "Am I filtering one row at a time (WHERE) or a whole bucket after summing (HAVING)?" That single question resolves more than half of the parse errors candidates hit on a live screen.
1. SQL Joins for Data Engineering — INNER, LEFT, and Anti-Joins
INNER JOIN, LEFT JOIN, and the LEFT JOIN ... IS NULL anti-join in SQL for data engineering
"Find customers who never placed an order" is the signature SQL join interview question — and the cleanest answer is not a NOT IN subquery or a NOT EXISTS correlated query but a LEFT JOIN with WHERE right.key IS NULL. The mental model: INNER JOIN keeps only matching pairs, LEFT JOIN keeps every left row and pads the right side with NULLs when there is no match, and filtering for right.id IS NULL after the LEFT JOIN isolates exactly the left rows that had no match — the anti-join. The same primitive surfaces churned-user, never-bought, never-clicked, and never-converted queries across product-analytics interviews.
Pro tip: `LEFT JOIN ... WHERE right.id IS NULL` is generally as fast as or faster than `NOT IN (subquery)`, because `NOT IN` returns `NULL` (not `FALSE`) for any `NULL` in the subquery and silently drops every outer row. State this gotcha out loud — interviewers grade the candidate who knows why `NOT IN` can return zero rows when the data has a single `NULL`.
INNER JOIN: keep only matching rows
The INNER JOIN invariant: a row from the left table is paired with a row from the right table iff the join predicate evaluates to TRUE; rows on either side that have no match are discarded. The cardinality of the result is at most |left| × |right| but typically far less — bounded by the multiplicity of matching keys.
- Predicate — `ON l.key = r.key` is the common shape; multi-column predicates are fine.
- No padding — unmatched rows on either side are silently dropped.
- Result schema — every column from both tables (use aliases to disambiguate).
- Cardinality risk — `1:N` on the right inflates left rows; `N:M` is a Cartesian-by-key explosion.
Worked example. Two tables, customers(id, name) with rows (1, Alice), (2, Bob), (3, Carol) and orders(order_id, customer_id, amount) with rows (101, 1, 50), (102, 1, 30), (103, 2, 80).
| customer | order_id | amount |
|---|---|---|
| Alice | 101 | 50 |
| Alice | 102 | 30 |
| Bob | 103 | 80 |
Carol has no row — INNER JOIN drops her.
Worked-example solution.
```sql
SELECT c.name AS customer, o.order_id, o.amount
FROM customers c
INNER JOIN orders o
  ON o.customer_id = c.id;
```
Rule of thumb: reach for INNER JOIN when the question is "rows where both sides exist"; it is the smallest, fastest, most common join.
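As a sanity check, the worked example above can be run end-to-end; this sketch uses Python's built-in sqlite3 as a stand-in for PostgreSQL (the join semantics are identical for this query):

```python
import sqlite3

# Load the worked-example rows into an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    INSERT INTO orders VALUES (101, 1, 50), (102, 1, 30), (103, 2, 80);
""")

# INNER JOIN keeps only matching pairs: Carol has no order, so she is absent.
inner_rows = conn.execute("""
    SELECT c.name AS customer, o.order_id, o.amount
    FROM customers c
    INNER JOIN orders o ON o.customer_id = c.id
    ORDER BY o.order_id
""").fetchall()

print(inner_rows)
# [('Alice', 101, 50), ('Alice', 102, 30), ('Bob', 103, 80)]
```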
LEFT JOIN: keep every left row, pad the right with NULL
The LEFT JOIN invariant: every row from the left table appears in the output; if the join predicate matches at least one right row, the right columns are filled in; otherwise the right columns are NULL. The result has at least |left| rows and at most |left| × max_right_match rows.
- All left rows preserved — even unmatched ones.
- Right columns are `NULL` when there is no match — the key signal for the anti-join trick.
- `RIGHT JOIN` — the mirror image; rarely needed, since you can flip the table order.
- `FULL OUTER JOIN` — keeps unmatched rows from both sides; less common in interviews.
Worked example. Same customers and orders. Carol stays in the output.
| customer | order_id | amount |
|---|---|---|
| Alice | 101 | 50 |
| Alice | 102 | 30 |
| Bob | 103 | 80 |
| Carol | NULL | NULL |
Worked-example solution.
```sql
SELECT c.name AS customer, o.order_id, o.amount
FROM customers c
LEFT JOIN orders o
  ON o.customer_id = c.id;
```
Rule of thumb: LEFT JOIN is the right answer whenever the question asks for "every X, with Y when it exists" — a churn report, a coverage report, a left-padded join for downstream pipelines.
LEFT JOIN ... IS NULL anti-join: rows with no match
The anti-join invariant: a LEFT JOIN followed by WHERE right.key IS NULL keeps exactly the left rows for which no right row matched. Equivalent in result to NOT EXISTS and (under NOT NULL constraints) to NOT IN, but typically faster than the latter and immune to the NULL-swallowing bug.
- `LEFT JOIN` — preserves every left row.
- `WHERE right.<pk> IS NULL` — strips out every left row that did match, leaving only the unmatched ones.
- `NOT EXISTS` equivalent — `WHERE NOT EXISTS (SELECT 1 FROM right WHERE right.fk = left.pk)`.
- `NOT IN` pitfall — returns zero rows if the subquery contains a single `NULL`; avoid on production data.
Worked example. Same customers and orders; Carol has no order.
| customer | matched_order_id | passes filter? |
|---|---|---|
| Alice | 101 | ✗ (matched) |
| Bob | 103 | ✗ (matched) |
| Carol | NULL | ✓ (anti-match) |
Worked-example solution.
```sql
SELECT c.name AS customer
FROM customers c
LEFT JOIN orders o
  ON o.customer_id = c.id
WHERE o.order_id IS NULL;
```
Rule of thumb: "find X with no Y" → LEFT JOIN ... WHERE Y.id IS NULL. Memorize this pattern; it is the single most-asked SQL join shape in data-engineering interviews.
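Both the `LEFT JOIN` padding and the anti-join filter can be verified on the same sample rows; a minimal sketch using Python's built-in sqlite3 (same semantics as the PostgreSQL form for these queries):

```python
import sqlite3

# Same sample data as the joins walkthrough.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    INSERT INTO orders VALUES (101, 1, 50), (102, 1, 30), (103, 2, 80);
""")

# LEFT JOIN keeps Carol, padding the right side with NULL ...
left_rows = conn.execute("""
    SELECT c.name, o.order_id
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    ORDER BY c.id, o.order_id
""").fetchall()

# ... and filtering on that padded NULL isolates the anti-join rows.
no_orders = conn.execute("""
    SELECT c.name
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    WHERE o.order_id IS NULL
""").fetchall()

print(left_rows)   # [('Alice', 101), ('Alice', 102), ('Bob', 103), ('Carol', None)]
print(no_orders)   # [('Carol',)]
```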
Common beginner mistakes
- Filtering the right table inside `WHERE` after a `LEFT JOIN` (e.g., `WHERE o.amount > 0`) — silently turns the `LEFT JOIN` back into an `INNER JOIN` because `NULL > 0` is `NULL`, which fails the filter.
- Using `NOT IN (subquery)` when the subquery can return `NULL` — drops every outer row.
- Forgetting to alias both sides of the join — `id` is ambiguous when both tables have it.
- Joining on the wrong column (`o.id = c.id` instead of `o.customer_id = c.id`) — produces a Cartesian-flavored mess.
- Using `LEFT JOIN` when `INNER JOIN` is correct — leaves spurious `NULL` rows in the output.
SQL Interview Question on Customers With No Orders
Given customers(id, name) and orders(order_id, customer_id, amount), return the names of customers who have never placed an order.
Solution Using LEFT JOIN ... WHERE orders.order_id IS NULL
```sql
SELECT c.name
FROM customers c
LEFT JOIN orders o
  ON o.customer_id = c.id
WHERE o.order_id IS NULL;
```
Why this works: the LEFT JOIN preserves every customer row regardless of whether a matching order exists; for matched customers, o.order_id carries a real value; for unmatched customers, the right-side columns are NULL and o.order_id IS NULL is TRUE; filtering on that predicate keeps only the unmatched customers — the anti-join. Single pass over customers; one keyed lookup into orders per customer; no subquery materialization.
Step-by-step trace for the sample input:
| customers.id | customers.name | LEFT JOIN orders.order_id | IS NULL? | survives? |
|---|---|---|---|---|
| 1 | Alice | 101 | ✗ | ✗ |
| 1 | Alice | 102 | ✗ | ✗ |
| 2 | Bob | 103 | ✗ | ✗ |
| 3 | Carol | NULL | ✓ | ✓ |
Only Carol's row survives the WHERE filter.
Output:
| name |
|---|
| Carol |
Why this works — concept by concept:
- `LEFT JOIN` semantics — keeps every left row; right side is `NULL` when there is no match. This `NULL` is the entire signal we filter on.
- `WHERE o.order_id IS NULL` — `o.order_id` is the right-side primary key; it is `NULL` only when the join produced a synthetic unmatched row. A real `NULL` order-id from the source table never happens because primary keys are `NOT NULL`.
- Anti-join semantics — equivalent to `NOT EXISTS (SELECT 1 FROM orders WHERE customer_id = c.id)`; the `LEFT JOIN ... IS NULL` form is typically faster on planners that materialize a hash join.
- No `NULL`-swallowing — unlike `NOT IN`, the predicate is `IS NULL`, which is well-defined for `NULL` values. There is no silent zero-row failure.
- `O(|customers| + |orders|)` time — hash-join build on `orders.customer_id`, single probe per customer. An index on `orders.customer_id` makes this near-linear.
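The `NOT IN` failure mode is worth seeing once with real rows. This sketch assumes one extra order with a `NULL` customer_id (a hypothetical guest checkout); with that single `NULL` in the subquery, `NOT IN` returns zero rows while the anti-join still finds Carol. Run with Python's built-in sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    -- order 104 has a NULL customer_id (assumed guest checkout)
    INSERT INTO orders VALUES (101, 1, 50), (103, 2, 80), (104, NULL, 10);
""")

# NOT IN: the single NULL in the subquery makes every "id NOT IN (...)"
# evaluate to NULL instead of TRUE, so the filter drops ALL outer rows.
not_in = conn.execute("""
    SELECT name FROM customers
    WHERE id NOT IN (SELECT customer_id FROM orders)
""").fetchall()

# The anti-join is immune: Carol still comes back.
anti_join = conn.execute("""
    SELECT c.name
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    WHERE o.order_id IS NULL
""").fetchall()

print(not_in)     # [] -- zero rows, the silent failure
print(anti_join)  # [('Carol',)]
```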
Drill the SQL joins practice page for INNER, LEFT, and anti-join shapes, and the SQL filtering practice page for the WHERE vs HAVING distinction.
2. SQL Aggregations and GROUP BY for Data Engineering
GROUP BY, HAVING, and aggregate functions in SQL for data engineering
"Find duplicate emails in the users table" and "find the department with the highest average salary" are the two signature aggregation prompts — and both reduce to GROUP BY + an aggregate function + HAVING filter. The mental model: GROUP BY col collapses rows that share the same col value into a single output row; COUNT(*), SUM(...), AVG(...), MIN(...), MAX(...) summarize each bucket; WHERE filters individual rows before grouping; HAVING filters whole groups after grouping. Aggregate predicates belong in HAVING; row predicates belong in WHERE. Mixing the two up is the most common parse-error in live coding rounds.
Pro tip: When the question is "find duplicates", the canonical shape is `SELECT key, COUNT(*) FROM t GROUP BY key HAVING COUNT(*) > 1`. State the `HAVING COUNT(*) > 1` half out loud while writing — it signals that you understand `WHERE` cannot reference aggregates and that you can compose row-level and group-level filters in the right order.
COUNT, SUM, AVG, MIN, MAX — NULL-aware aggregates
The aggregate-NULL invariant: COUNT(*) counts every row including ones with NULL columns; COUNT(col) counts only rows where col is not NULL; SUM, AVG, MIN, MAX skip NULL values entirely; if every value in a group is NULL, the result is NULL (not 0). Beginners conflate COUNT(*) and COUNT(col) and silently report wrong totals.
- `COUNT(*)` — every row in the bucket, regardless of `NULL`s.
- `COUNT(col)` — non-`NULL` values of `col` only.
- `COUNT(DISTINCT col)` — unique non-`NULL` values; essential after a `JOIN` that may have inflated rows.
- `SUM`/`AVG` — numeric only; `AVG` is sum-of-non-null divided by count-of-non-null, so `NULL` does not count as `0`.
Worked example. Three rows in one group: amount = 10, NULL, 30.
| aggregate | result |
|---|---|
| `COUNT(*)` | 3 |
| `COUNT(amount)` | 2 |
| `SUM(amount)` | 40 |
| `AVG(amount)` | 20 |
| `MIN(amount)` | 10 |
| `MAX(amount)` | 30 |
Worked-example solution.
```sql
SELECT user_id,
       COUNT(*)      AS n_rows,
       COUNT(amount) AS n_known,
       SUM(amount)   AS total,
       AVG(amount)   AS mean,
       MIN(amount)   AS lo,
       MAX(amount)   AS hi
FROM orders
GROUP BY user_id;
```
Rule of thumb: if the metric is "people who clicked" use COUNT(DISTINCT user_id); if it is "click events" use COUNT(*); never confuse the two.
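The NULL-aware results table above reproduces directly; a minimal sketch of the same three-row group using Python's built-in sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (user_id INTEGER, amount INTEGER);
    INSERT INTO orders VALUES (1, 10), (1, NULL), (1, 30);
""")

# COUNT(*) counts all 3 rows; COUNT(amount) skips the NULL;
# SUM/AVG/MIN/MAX ignore NULL entirely, so AVG is 40 / 2 = 20.
row = conn.execute("""
    SELECT COUNT(*), COUNT(amount), SUM(amount),
           AVG(amount), MIN(amount), MAX(amount)
    FROM orders
    GROUP BY user_id
""").fetchone()

print(row)  # (3, 2, 40, 20.0, 10, 30)
```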
WHERE vs HAVING — row filter vs group filter
The two-clause invariant: WHERE runs before GROUP BY and references raw row columns only; HAVING runs after grouping and can reference aggregate functions; trying to use WHERE COUNT(*) > 1 is a parse error because COUNT(*) does not exist until after grouping. They are not interchangeable — both can appear in the same query.
- `WHERE` — filter rows; uses `col`, `col2`, etc.
- `HAVING` — filter groups; uses `COUNT(*)`, `SUM(col)`, etc.
- Order of evaluation — `FROM` → `WHERE` → `GROUP BY` → `HAVING` → `SELECT`.
- Performance — push predicates into `WHERE` whenever possible; `WHERE` filters before the (often expensive) sort/hash step.
Worked example. employees(department, salary) with six rows; ask for departments whose average salary exceeds 50,000 across employees earning more than 30,000.
| department | salary |
|---|---|
| eng | 40,000 |
| eng | 70,000 |
| eng | 25,000 |
| sales | 60,000 |
| sales | 60,000 |
| sales | 20,000 |
WHERE salary > 30000 drops the 25k and 20k rows; GROUP BY department then computes AVG; HAVING AVG(salary) > 50000 keeps only departments whose surviving rows average above 50k.
Worked-example solution.
```sql
SELECT department, AVG(salary) AS avg_salary
FROM employees
WHERE salary > 30000
GROUP BY department
HAVING AVG(salary) > 50000;
```
Rule of thumb: aggregate predicate → HAVING; row predicate → WHERE. If the predicate uses SUM / COUNT / AVG / MIN / MAX, it must live in HAVING.
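The WHERE-then-HAVING pipeline can be traced on the six-row sample; a sketch via Python's built-in sqlite3 (note that without the `WHERE`, eng would average 45,000 and be dropped by `HAVING`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (department TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('eng', 40000), ('eng', 70000), ('eng', 25000),
        ('sales', 60000), ('sales', 60000), ('sales', 20000);
""")

# WHERE trims rows first (drops the 25k and 20k rows),
# HAVING then trims whole groups after aggregation.
result = conn.execute("""
    SELECT department, AVG(salary) AS avg_salary
    FROM employees
    WHERE salary > 30000
    GROUP BY department
    HAVING AVG(salary) > 50000
    ORDER BY department
""").fetchall()

print(result)  # [('eng', 55000.0), ('sales', 60000.0)]
```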
HAVING COUNT(*) > 1 — the universal duplicate finder
The duplicate-detection invariant: SELECT key, COUNT(*) FROM t GROUP BY key HAVING COUNT(*) > 1 returns every distinct key value that appears more than once in t, along with its multiplicity. Replace key with the column you want to dedupe on (email, user_id, order_id), and the same query surfaces every duplicate group.
- `GROUP BY key` — one bucket per distinct value.
- `COUNT(*) > 1` — keeps only buckets with at least two rows.
- `COUNT(*) = N` — variant for "exactly N copies"; rare but cleanly expressible.
- Composite keys — `GROUP BY a, b, c HAVING COUNT(*) > 1` finds duplicate `(a, b, c)` triples.
Worked example. users(id, email) with two alice@example.com rows.
| id | email |
|---|---|
| 1 | alice@example.com |
| 2 | bob@example.com |
| 3 | alice@example.com |
Group by email; the alice@example.com group has COUNT(*) = 2; the bob@example.com group has COUNT(*) = 1 and is filtered out.
Worked-example solution.
```sql
SELECT email, COUNT(*) AS n_copies
FROM users
GROUP BY email
HAVING COUNT(*) > 1;
```
Rule of thumb: every "find duplicates" question reduces to GROUP BY key HAVING COUNT(*) > 1; reach for ROW_NUMBER() OVER (PARTITION BY key) only when you need to delete the duplicates and keep one canonical row.
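The duplicate finder runs unchanged on SQLite; a sketch with the three-row sample via Python's built-in sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
    INSERT INTO users VALUES
        (1, 'alice@example.com'), (2, 'bob@example.com'),
        (3, 'alice@example.com');
""")

# One bucket per distinct email; HAVING keeps only multi-row buckets.
dupes = conn.execute("""
    SELECT email, COUNT(*) AS n_copies
    FROM users
    GROUP BY email
    HAVING COUNT(*) > 1
""").fetchall()

print(dupes)  # [('alice@example.com', 2)]
```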
Common beginner mistakes
- Writing `WHERE COUNT(*) > 1` — parse error; `WHERE` cannot reference aggregates.
- Selecting a non-aggregated, non-`GROUP BY` column — strict SQL rejects this; lax dialects pick an arbitrary value silently.
- Forgetting `COUNT(DISTINCT user_id)` after a `JOIN` that inflates rows — reports inflated user counts.
- Using `AVG(col)` and forgetting that `NULL` rows are excluded from the denominator — wrong for "treat missing as 0" metrics; use `AVG(COALESCE(col, 0))` only when the spec says so.
- Putting `HAVING` before `GROUP BY` — syntax error; the order is `WHERE → GROUP BY → HAVING`.
SQL Interview Question on Duplicate Emails
Given users(id, email), return every email that appears more than once in the table, along with the number of copies. Output email and n_copies.
Solution Using GROUP BY email HAVING COUNT(*) > 1
```sql
SELECT email,
       COUNT(*) AS n_copies
FROM users
GROUP BY email
HAVING COUNT(*) > 1
ORDER BY n_copies DESC, email;
```
Why this works: GROUP BY email collapses every row with the same email into a single bucket; COUNT(*) counts how many rows fell into each bucket; HAVING COUNT(*) > 1 keeps only buckets with at least two rows; ORDER BY n_copies DESC, email produces a deterministic, reviewer-friendly output. Single pass over users; sort cost dominates only when the email cardinality is huge.
Step-by-step trace for the sample input:
| id | email |
|---|---|
| 1 | alice@example.com |
| 2 | bob@example.com |
| 3 | alice@example.com |
| 4 | carol@example.com |
| 5 | bob@example.com |
| 6 | alice@example.com |
- `FROM users` — read all six rows.
- No `WHERE` — every row passes.
- `GROUP BY email` — three buckets: `alice` (3 rows), `bob` (2 rows), `carol` (1 row).
- `COUNT(*)` — 3, 2, 1 respectively.
- `HAVING COUNT(*) > 1` — drops the `carol` bucket.
- `ORDER BY n_copies DESC, email` — `alice` (3), then `bob` (2).
Output:
| email | n_copies |
|---|---|
| alice@example.com | 3 |
| bob@example.com | 2 |
Why this works — concept by concept:
- `GROUP BY email` — collapses to one bucket per distinct email; the bucket is the unit of all subsequent aggregates and group-level filters.
- `COUNT(*)` — counts every row in the bucket, including ones with `NULL` non-key columns; perfect for "how many copies".
- `HAVING COUNT(*) > 1` — group-level filter; the aggregate predicate must live here, not in `WHERE`. This is the precise interview signal for duplicate detection.
- `ORDER BY n_copies DESC, email` — deterministic ordering; tie-broken by `email` so the output is stable across runs and reviewers.
- `O(|users| + G log G)` time — a single hash aggregation over `users` produces `G` group rows; the final sort is `G log G`. With an index on `email`, the planner may use stream aggregation and skip the hash step.
Drill the SQL aggregation practice page for GROUP BY and HAVING shapes, and the SQL null-handling practice page for NULL-aware aggregates.
3. SQL Window Functions for Data Engineering — ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD
Window functions for ranking and lookback in SQL for data engineering
"Find the second-highest salary" and "find the top 3 salaries per department" are the two signature window-function prompts — and both reduce to a DENSE_RANK() OVER (PARTITION BY ... ORDER BY ...) filter. The mental model: a window function computes a value across a set of rows ("the window") that are related to the current row, without collapsing the rows like GROUP BY does; OVER (PARTITION BY col) defines the window boundary; OVER (ORDER BY col) defines the order within the window. ROW_NUMBER assigns unique sequential numbers; RANK skips after ties (1, 2, 2, 4); DENSE_RANK does not skip (1, 2, 2, 3). LAG looks back; LEAD looks forward. These five primitives drive almost every "ranking" or "lookback" SQL interview question.
Pro tip: When the question is "second-highest salary", reach for `DENSE_RANK` over `RANK` — `DENSE_RANK = 2` reliably means "second distinct salary" even when ties exist at the top, while `RANK = 2` skips entirely if there are two rows tied for first. State this distinction; interviewers grade it.
ROW_NUMBER — unique sequential numbering per partition
The ROW_NUMBER invariant: ROW_NUMBER() OVER (PARTITION BY p ORDER BY o) assigns a unique integer 1, 2, 3, ... to every row inside each partition p, ordered by o; ties in o are broken arbitrarily by the planner. Use it when you need a unique sequence per group regardless of tie semantics — most often for deduplication.
- `OVER (PARTITION BY ...)` — bucket the rows.
- `OVER (ORDER BY ...)` — order within the bucket.
- Ties broken arbitrarily — add a tiebreaker column to `ORDER BY` for determinism.
- Top-N-per-group — `WHERE rn <= N` after `ROW_NUMBER`; works only when ties at rank N are not desired.
Worked example. employees(department, name, salary) with three engineers; rank by salary desc.
| department | name | salary | row_number |
|---|---|---|---|
| eng | Alice | 90,000 | 1 |
| eng | Bob | 80,000 | 2 |
| eng | Carol | 80,000 | 3 |
Bob and Carol tie on salary; ROW_NUMBER still gives them unique ranks (planner-chosen).
Worked-example solution.
```sql
SELECT department, name, salary,
       ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC, name) AS rn
FROM employees;
```
Rule of thumb: ROW_NUMBER is the right tool for deduplication (WHERE rn = 1) and for ordered streams; reach for RANK or DENSE_RANK when ties must be honored.
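The worked example above can be checked with Python's built-in sqlite3 (window functions need a SQLite 3.25+ build, which modern Python releases bundle); the `name` tiebreaker makes the Bob/Carol numbering deterministic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (department TEXT, name TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('eng', 'Alice', 90000), ('eng', 'Bob', 80000), ('eng', 'Carol', 80000);
""")

# Unique sequential numbers per partition; ties broken by name.
ranked = conn.execute("""
    SELECT name,
           ROW_NUMBER() OVER (
               PARTITION BY department ORDER BY salary DESC, name
           ) AS rn
    FROM employees
    ORDER BY rn
""").fetchall()

print(ranked)  # [('Alice', 1), ('Bob', 2), ('Carol', 3)]
```

Wrapping this in a subquery and filtering `WHERE rn = 1` is the standard dedup idiom the rule of thumb refers to.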
RANK vs DENSE_RANK — tie semantics
The rank-vs-dense-rank invariant: both assign the same rank to tied rows; RANK then skips the next k-1 ranks (gap), while DENSE_RANK continues without a gap. For "find the Nth distinct value" questions, DENSE_RANK = N is the correct filter; for "find the Nth row in order" questions, ROW_NUMBER = N is correct.
- `RANK` — `1, 2, 2, 4` — skips after ties.
- `DENSE_RANK` — `1, 2, 2, 3` — no skip.
- `ROW_NUMBER` — `1, 2, 3, 4` — never ties.
- Pick by question semantics — "Nth highest distinct salary" → `DENSE_RANK = N`; "Nth-highest-salaried row in ranking order with skips" → `RANK = N`.
Worked example. Same employees, plus a fourth row (Dan, 70,000) below the tie.
| name | salary | rank | dense_rank | row_number |
|---|---|---|---|---|
| Alice | 90,000 | 1 | 1 | 1 |
| Bob | 80,000 | 2 | 2 | 2 |
| Carol | 80,000 | 2 | 2 | 3 |
| Dan | 70,000 | 4 | 3 | 4 |
RANK jumps 2 → 4; DENSE_RANK continues 2 → 3.
Worked-example solution.
```sql
SELECT name, salary,
       RANK()       OVER (ORDER BY salary DESC) AS rnk,
       DENSE_RANK() OVER (ORDER BY salary DESC) AS dr,
       ROW_NUMBER() OVER (ORDER BY salary DESC) AS rn
FROM employees;
```
Rule of thumb: "second highest salary" → DENSE_RANK = 2; "top 3 distinct salaries" → DENSE_RANK <= 3; never use RANK for these unless the spec explicitly says ties should consume rank slots.
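The three tie behaviors side by side, sketched with Python's built-in sqlite3 (SQLite 3.25+ for window functions; a `name` tiebreaker is added to the `ROW_NUMBER` ordering so the output is deterministic):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('Alice', 90000), ('Bob', 80000), ('Carol', 80000), ('Dan', 70000);
""")

rows = conn.execute("""
    SELECT name,
           RANK()       OVER (ORDER BY salary DESC)       AS rnk,
           DENSE_RANK() OVER (ORDER BY salary DESC)       AS dr,
           ROW_NUMBER() OVER (ORDER BY salary DESC, name) AS rn
    FROM employees
    ORDER BY rn
""").fetchall()

print(rows)
# [('Alice', 1, 1, 1), ('Bob', 2, 2, 2), ('Carol', 2, 2, 3), ('Dan', 4, 3, 4)]
```

Note the Dan row: `RANK` jumps 2 → 4 while `DENSE_RANK` continues 2 → 3, exactly as in the table above.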
LAG and LEAD — lookback and lookahead
The lookback invariant: LAG(col) OVER (PARTITION BY p ORDER BY o) returns the value of col in the previous row within the partition (or NULL for the first row); LEAD returns the next row's value. Both take an optional offset (default 1) and an optional default value to substitute for NULL. They power running deltas, month-over-month growth, sessionization gap detection, and previous-event-aware analytics.
- `LAG(col, n, dflt)` — value `n` rows back; `dflt` (default `NULL`) when out of range.
- `LEAD(col, n, dflt)` — symmetric, looking forward.
- `PARTITION BY user_id` — restart the lookback at each user.
- `col - LAG(col)` — the delta-from-previous-row idiom.
Worked example. sales(sales_date, amount) with a contiguous month series; compute month-over-month delta.
| sales_date | amount | lag_amount | delta |
|---|---|---|---|
| 2026-01-01 | 100 | NULL | NULL |
| 2026-02-01 | 130 | 100 | 30 |
| 2026-03-01 | 120 | 130 | -10 |
The first row's LAG is NULL; subtraction yields NULL; consumers usually COALESCE(delta, 0) for display.
Worked-example solution.
```sql
SELECT sales_date,
       amount,
       LAG(amount) OVER (ORDER BY sales_date) AS prev_amount,
       amount - LAG(amount) OVER (ORDER BY sales_date) AS mom_delta
FROM sales;
```
Rule of thumb: LAG for "compare this row to its predecessor" (delta, retention, gap); LEAD for "what happens next" (sessionization, churn-from-here). Always PARTITION BY the entity if the table holds multiple series.
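The month-over-month delta table reproduces directly; a sketch with Python's built-in sqlite3 (SQLite 3.25+), where the first row's `LAG` comes back as `None`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (sales_date TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('2026-01-01', 100), ('2026-02-01', 130), ('2026-03-01', 120);
""")

# LAG looks one row back; subtracting from the current row gives the delta.
deltas = conn.execute("""
    SELECT sales_date,
           amount,
           LAG(amount) OVER (ORDER BY sales_date) AS prev_amount,
           amount - LAG(amount) OVER (ORDER BY sales_date) AS mom_delta
    FROM sales
""").fetchall()

print(deltas)
# [('2026-01-01', 100, None, None), ('2026-02-01', 130, 100, 30),
#  ('2026-03-01', 120, 130, -10)]
```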
Common beginner mistakes
- Using `RANK` when the question wants the Nth distinct value — `RANK = 2` skips entirely if two rows tie for first.
- Forgetting `PARTITION BY` for a per-group ranking — produces a global ranking instead of per-department.
- Using `WHERE rn = 2` directly without wrapping in a subquery or CTE — window functions cannot be referenced in the `WHERE` of the same `SELECT` (they run after `WHERE`).
- Confusing `LAG` (previous) with `LEAD` (next) — quietly produces inverted deltas.
- Forgetting `ORDER BY` inside `OVER` — required for `ROW_NUMBER`, `RANK`, `LAG`, `LEAD`; the result is non-deterministic without it.
SQL Interview Question on Second-Highest Salary
Given employees(emp_id, name, salary), return the second-highest distinct salary. If there is no second-highest distinct salary (e.g., all employees earn the same), return NULL.
Solution Using DENSE_RANK() OVER (ORDER BY salary DESC)
```sql
SELECT MAX(salary) AS second_highest_salary
FROM (
    SELECT salary,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS dr
    FROM employees
) ranked
WHERE dr = 2;
```
Why this works: DENSE_RANK() OVER (ORDER BY salary DESC) numbers each row by its salary rank, with no gaps after ties — dr = 1 is the highest distinct salary, dr = 2 is the second-highest distinct salary; the outer WHERE dr = 2 filters to that group; MAX(salary) collapses to a single row and (critically) returns NULL if no rows match — handling the "no second-highest" edge case gracefully without a LIMIT 1 OFFSET 1 that would error or return zero rows.
Step-by-step trace for the sample input:
| emp_id | name | salary |
|---|---|---|
| 1 | Alice | 90,000 |
| 2 | Bob | 80,000 |
| 3 | Carol | 80,000 |
| 4 | Dan | 70,000 |
- Inner `SELECT` reads all four rows.
- `DENSE_RANK() OVER (ORDER BY salary DESC)` assigns: Alice → 1, Bob → 2, Carol → 2, Dan → 3.
- Outer `WHERE dr = 2` keeps Bob and Carol.
- `MAX(salary)` collapses to a single row: 80,000.
- No-match path — if every employee earned 90,000, the inner query has only `dr = 1` rows; the outer `WHERE dr = 2` filters to zero rows; `MAX(salary)` returns `NULL` (per the spec).
Output:
| second_highest_salary |
|---|
| 80000 |
Why this works — concept by concept:
- `DENSE_RANK()` over `RANK()` — `DENSE_RANK = 2` is the second distinct salary, regardless of how many people tie for first; `RANK = 2` would skip if two people tied for first.
- `OVER (ORDER BY salary DESC)` — a single global window, ordered by salary descending; no `PARTITION BY` because the question is over the whole table.
- Subquery wrapper — required because window functions cannot be referenced in the `WHERE` of the same `SELECT`; the outer query reads the materialized rank.
- `MAX(salary)` collapses ties — when `dr = 2` matches multiple rows (a tie), `MAX` returns one value; when `dr = 2` matches zero rows, `MAX` returns `NULL` — both edge cases handled.
- `O(N log N)` time — the sort by salary dominates; ranking is `O(N)` over the sorted stream.
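Both paths, including the no-second-highest edge case, can be exercised in one script; a sketch with Python's built-in sqlite3 (SQLite 3.25+ for `DENSE_RANK`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (emp_id INTEGER, name TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        (1, 'Alice', 90000), (2, 'Bob', 80000),
        (3, 'Carol', 80000), (4, 'Dan', 70000);
""")

QUERY = """
    SELECT MAX(salary) AS second_highest_salary
    FROM (SELECT salary,
                 DENSE_RANK() OVER (ORDER BY salary DESC) AS dr
          FROM employees) ranked
    WHERE dr = 2
"""

second = conn.execute(QUERY).fetchone()[0]
print(second)  # 80000

# Edge case: everyone ties for first, so no dr = 2 rows exist and
# MAX over the empty set returns NULL (None in Python).
conn.execute("UPDATE employees SET salary = 90000")
no_second = conn.execute(QUERY).fetchone()[0]
print(no_second)  # None
```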
Drill the SQL window-functions practice page for ROW_NUMBER, RANK, DENSE_RANK, LAG, and LEAD shapes, and the SQL date-functions practice page for time-series window patterns.
4. SQL CTEs and Subqueries for Data Engineering — WITH, Recursive CTEs, and Correlated Subqueries
CTE composition, recursive CTEs, and correlated subqueries in SQL for data engineering
"Find the top 3 salaries per department" and "find employees earning above their department average" are the two signature CTE-and-subquery prompts — and they showcase the two complementary patterns. The mental model: a CTE (WITH name AS (SELECT ...)) names an intermediate result you reference like a table; a recursive CTE (WITH RECURSIVE) repeatedly evaluates a base case plus a recursive case, terminating when no new rows are added; a correlated subquery is a subquery whose WHERE clause references the outer query's alias, re-evaluating per outer row. CTEs win on readability for multi-step logic; correlated subqueries win on per-row predicates against the same table.
Pro tip: When the prompt is "top N per group", reach for the CTE-plus-`DENSE_RANK` pattern, not `LIMIT N` — `LIMIT` does not respect partitions and only works on the global stream. The CTE makes the per-group ranking explicit and trivial to reason about; the alternative correlated-subquery `WHERE col >= (SELECT col FROM t2 ... LIMIT 1 OFFSET N - 1)` is an interview red flag.
WITH name AS (SELECT ...) — non-recursive CTEs for readability
The CTE invariant: WITH name AS (SELECT ...) SELECT ... FROM name defines an intermediate result (a "common table expression") that subsequent SELECTs reference like a table; the engine may inline or materialize it depending on cost; multiple CTEs can be chained in a single WITH clause separated by commas. CTEs let you build up complex queries step by step without nesting subqueries five layers deep.
- Single CTE — `WITH a AS (...) SELECT * FROM a`.
- Chained CTEs — `WITH a AS (...), b AS (SELECT * FROM a WHERE ...) SELECT * FROM b`.
- Re-reference — a CTE can be referenced multiple times in the main query.
- No mandatory materialization — modern planners often inline; PostgreSQL 12+ removed the implicit fence.
Worked example. Find departments where the average salary is above 50,000, then list every employee in those departments.
| step | output rows |
|---|---|
| `high_paying` CTE | departments with `AVG(salary) > 50000` |
| outer `SELECT` | every employee whose department is in `high_paying` |
Worked-example solution.
```sql
WITH high_paying AS (
    SELECT department
    FROM employees
    GROUP BY department
    HAVING AVG(salary) > 50000
)
SELECT e.name, e.salary, e.department
FROM employees e
JOIN high_paying h ON h.department = e.department;
```
Rule of thumb: if you find yourself nesting a subquery three levels deep, refactor to a CTE — the engine produces the same plan but the human review takes a tenth of the time.
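The CTE shape runs unchanged on SQLite; this sketch uses assumed sample rows (not the six-row table from Section 2) chosen so exactly one department clears the 50,000 bar, via Python's built-in sqlite3:

```python
import sqlite3

# Assumed rows: eng averages 85k (qualifies), sales 40k (does not).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('Alice', 'eng', 90000), ('Bob', 'eng', 80000), ('Dan', 'sales', 40000);
""")

members = conn.execute("""
    WITH high_paying AS (
        SELECT department
        FROM employees
        GROUP BY department
        HAVING AVG(salary) > 50000
    )
    SELECT e.name, e.department
    FROM employees e
    JOIN high_paying h ON h.department = e.department
    ORDER BY e.name
""").fetchall()

print(members)  # [('Alice', 'eng'), ('Bob', 'eng')]
```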
WITH RECURSIVE — recursive CTEs for sequences and hierarchies
The recursive-CTE invariant: a WITH RECURSIVE CTE has two parts joined by UNION ALL: an anchor query (base case, evaluated once) and a recursive query (refers to the CTE itself, evaluated repeatedly until it returns no new rows); the planner accumulates results across iterations until termination. Use it for sequence generation (1..N, dates), hierarchy traversal (org charts, BOM trees, parent-child relationships), and graph-style reachability queries.
- Anchor (base) — `SELECT 1` or `SELECT root_id FROM table WHERE parent IS NULL`.
- `UNION ALL` — combines anchor and recursive output.
- Recursive query — references the CTE name; must converge (return zero rows eventually).
- Termination — the engine stops when the recursive step produces no new rows.
Worked example. Generate the integers 1 through 5.
| iteration | new row |
|---|---|
| anchor | 1 |
| step 1 | 2 |
| step 2 | 3 |
| step 3 | 4 |
| step 4 | 5 |
| step 5 | (none — n = 5 fails the predicate) |
Worked-example solution.
```sql
WITH RECURSIVE nums(n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1
    FROM nums
    WHERE n < 5
)
SELECT n FROM nums;
```
Rule of thumb: always include a termination predicate (WHERE n < N); a missing or wrong predicate produces an infinite loop that most planners eventually kill with an out-of-memory error.
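The 1-through-5 generator runs identically on SQLite; a sketch via Python's built-in sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

nums = conn.execute("""
    WITH RECURSIVE nums(n) AS (
        SELECT 1            -- anchor: the base case, evaluated once
        UNION ALL
        SELECT n + 1        -- recursive step, re-evaluated each iteration
        FROM nums
        WHERE n < 5         -- termination predicate: no new rows at n = 5
    )
    SELECT n FROM nums
""").fetchall()

values = [n for (n,) in nums]
print(values)  # [1, 2, 3, 4, 5]
```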
Correlated subqueries — per-row predicates against the same table
The correlated-subquery invariant: a subquery whose WHERE clause references a column of the outer query is re-evaluated for every outer row; this enables predicates like "salary above this row's department average" without a JOIN. Powerful but expensive; the planner sometimes rewrites them into joins or hash-aggregations automatically.
- Outer alias reference — `WHERE inner.dept = outer.dept`.
- `EXISTS` — short-circuit-friendly; stops on the first match.
- `NOT EXISTS` — anti-join equivalent; immune to `NULL` like `LEFT JOIN ... IS NULL`.
- Performance — index the inner referenced column; otherwise the planner runs an inner scan per outer row.
Worked example. Find employees whose salary is above their department's average.
| name | dept | salary | dept_avg | survives? |
|---|---|---|---|---|
| Alice | eng | 90,000 | 76,667 | ✓ |
| Bob | eng | 80,000 | 76,667 | ✓ |
| Carol | eng | 60,000 | 76,667 | ✗ |
| Dan | sales | 60,000 | 60,000 | ✗ |
Worked-example solution.
```sql
SELECT name, department, salary
FROM employees e
WHERE salary > (
    SELECT AVG(salary)
    FROM employees
    WHERE department = e.department
);
```
Rule of thumb: prefer a CTE + window function (AVG(salary) OVER (PARTITION BY department)) for performance on large tables; the correlated subquery is the clearer but slower form. Interviewers grade you on knowing both.
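The window-function rewrite can be sketched as follows. Python's sqlite3 stands in for PostgreSQL (both support AVG(...) OVER (PARTITION BY ...)), and the table contents mirror the worked example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", [
    ("Alice", "eng", 90000), ("Bob", "eng", 80000),
    ("Carol", "eng", 60000), ("Dan", "sales", 60000),
])

# CTE + window form: compute each department's average once per row
# via a window, then filter in the outer query. Equivalent result to
# the correlated subquery, but a single pass over the table.
rows = conn.execute("""
    WITH with_avg AS (
        SELECT name, department, salary,
               AVG(salary) OVER (PARTITION BY department) AS dept_avg
        FROM employees
    )
    SELECT name FROM with_avg
    WHERE salary > dept_avg
    ORDER BY name;
""").fetchall()
print([n for (n,) in rows])  # → ['Alice', 'Bob']
```

Same survivors as the trace table: Alice and Bob beat the eng average of 76,667; Carol falls below it, and Dan equals (does not exceed) the sales average.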
Common beginner mistakes
- Forgetting WITH RECURSIVE for a recursive CTE — non-RECURSIVE CTEs cannot self-reference; parse error.
- Missing the termination predicate in WITH RECURSIVE — infinite loop.
- Using LIMIT N for top-N-per-group — LIMIT is global; it does not respect partitions.
- Writing a correlated subquery without an index on the inner column — quadratic blow-up on large tables.
- Re-materializing the same CTE under different names — the planner may run it twice; inline a single shared CTE instead.
SQL Interview Question on Top 3 Salaries Per Department
Given employees(emp_id, name, department, salary), return the top 3 distinct salaries per department, with ties at rank 3 included. Output department, name, salary, and the rank.
Solution Using DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) in a CTE
WITH ranked AS (
SELECT department,
name,
salary,
DENSE_RANK() OVER (
PARTITION BY department
ORDER BY salary DESC
) AS dr
FROM employees
)
SELECT department, name, salary, dr
FROM ranked
WHERE dr <= 3
ORDER BY department, dr, name;
Why this works: the CTE ranked materializes a per-department DENSE_RANK keyed by salary descending — dr = 1 is the highest distinct salary in that department, dr = 2 the second-highest, and so on. The outer WHERE dr <= 3 keeps every row whose salary is among the top three distinct salaries of its department, including all ties at rank 3, and the ORDER BY produces a deterministic, reviewer-friendly output. DENSE_RANK wins over RANK because we want the top three distinct salaries, and over ROW_NUMBER because the spec requires ties at rank 3 to be retained.
Step-by-step trace for the sample input:
| emp_id | name | department | salary |
|---|---|---|---|
| 1 | Alice | eng | 90,000 |
| 2 | Bob | eng | 80,000 |
| 3 | Carol | eng | 80,000 |
| 4 | Dan | eng | 70,000 |
| 5 | Eve | eng | 60,000 |
| 6 | Frank | sales | 100,000 |
| 7 | Grace | sales | 90,000 |
| 8 | Heidi | sales | 80,000 |
- CTE ranked — partition by department; order by salary DESC.
- DENSE_RANK per partition — eng: Alice → 1, Bob → 2, Carol → 2, Dan → 3, Eve → 4. sales: Frank → 1, Grace → 2, Heidi → 3.
- Outer WHERE dr <= 3 — drops Eve (dr = 4); keeps both Bob and Carol (tied at 2) and Dan (3).
- ORDER BY department, dr, name — eng rows first, then sales; within department by dr, then name for tiebreak.
Output:
| department | name | salary | dr |
|---|---|---|---|
| eng | Alice | 90000 | 1 |
| eng | Bob | 80000 | 2 |
| eng | Carol | 80000 | 2 |
| eng | Dan | 70000 | 3 |
| sales | Frank | 100000 | 1 |
| sales | Grace | 90000 | 2 |
| sales | Heidi | 80000 | 3 |
Why this works — concept by concept:
- CTE ranked — names the intermediate ranked result; the outer query then filters it like a regular table. Far cleaner than a nested subquery.
- PARTITION BY department — restarts the rank at each department boundary; without this, the rank is global and the answer is wrong.
- ORDER BY salary DESC — defines "highest first" inside each partition; required for any deterministic ranking.
- DENSE_RANK over RANK — the spec wants the top three distinct salaries; RANK would skip after ties and miss the third distinct salary if there is a two-way tie above it.
- WHERE dr <= 3 in the outer query — window functions cannot be referenced in the WHERE of the same SELECT; the CTE provides the materialized column the outer query can filter on.
- O(N log N) time — the sort within each partition dominates; with an index on (department, salary DESC) the planner can stream rather than sort.
More practice on PipeCode: SQL CTE problems, SQL subquery problems, and SQL window-function problems.
Tips to crack SQL interviews for data engineering roles
Master the four primitives — joins, aggregates, windows, CTEs
If you can write LEFT JOIN ... IS NULL for orphans, GROUP BY ... HAVING COUNT(*) > 1 for duplicates, DENSE_RANK() OVER (PARTITION BY ...) for top-N-per-group, and a WITH RECURSIVE CTE for sequence generation without thinking — you can pass most fresher and mid-level data-engineering SQL rounds. These four primitives compose into 80% of the questions you will see; the remaining 20% is LAG/LEAD lookback, COALESCE null-safety, and date arithmetic.
Know the difference between WHERE, HAVING, and the OVER clause
WHERE filters individual rows before grouping; HAVING filters groups after aggregation; OVER (...) defines a window for a window function, which is evaluated after WHERE, GROUP BY, and HAVING but before ORDER BY. Window functions cannot be referenced in the WHERE of the same SELECT — wrap them in a CTE or subquery first. State the order of evaluation (FROM → WHERE → GROUP BY → HAVING → window → SELECT → ORDER BY → LIMIT) when an interviewer asks; it is graded as fundamental literacy.
Pick DENSE_RANK for "Nth distinct"; pick ROW_NUMBER for deduplication
The single most-graded ranking distinction: DENSE_RANK = N is the Nth distinct value; RANK = N is the Nth row in skip-aware ranking order; ROW_NUMBER = N is the Nth row in arbitrary order. For "second-highest distinct salary" → DENSE_RANK = 2. For "remove duplicate rows keeping the canonical one" → ROW_NUMBER() OVER (PARTITION BY key ORDER BY tiebreaker) = 1. State which one and why.
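The distinction is easiest to see on tied data. A minimal side-by-side of the three functions, run through Python's sqlite3 (whose window-function semantics match PostgreSQL for this query):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salaries (amount INTEGER)")
conn.executemany("INSERT INTO salaries VALUES (?)", [(90,), (80,), (80,), (70,)])

# One ordering, three numbering schemes over the same tie at 80.
rows = conn.execute("""
    SELECT amount,
           ROW_NUMBER() OVER (ORDER BY amount DESC) AS rn,
           RANK()       OVER (ORDER BY amount DESC) AS rk,
           DENSE_RANK() OVER (ORDER BY amount DESC) AS dr
    FROM salaries
    ORDER BY amount DESC, rn;
""").fetchall()
for amount, rn, rk, dr in rows:
    print(amount, rn, rk, dr)
# 90 1 1 1   <- unique top value: all three agree
# 80 2 2 2   <- first row of the tie
# 80 3 2 2   <- ROW_NUMBER keeps counting; RANK/DENSE_RANK repeat
# 70 4 4 3   <- RANK skips to 4; DENSE_RANK stays dense at 3
```

Note that dr = 3 lands on 70 (the third distinct value) while rk = 3 never occurs at all — exactly why "Nth distinct" questions need DENSE_RANK.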
Use LEFT JOIN ... IS NULL over NOT IN for anti-joins
NOT IN (subquery) returns zero rows if the subquery contains a single NULL because x NOT IN (..., NULL, ...) evaluates to NULL, which fails the WHERE predicate. LEFT JOIN ... WHERE right.id IS NULL and NOT EXISTS (...) are both immune to this. Production data engineers who have been bitten by this once never write NOT IN again. State the gotcha out loud.
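The gotcha is easy to reproduce. The sketch below uses Python's sqlite3, which follows the same three-valued NULL logic as PostgreSQL, with a hypothetical customers/orders pair:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (1), (NULL);   -- one NULL foreign key
""")

# NOT IN: the NULL makes every comparison UNKNOWN at best -> zero rows.
bad = conn.execute("""
    SELECT name FROM customers
    WHERE id NOT IN (SELECT customer_id FROM orders);
""").fetchall()

# NOT EXISTS: NULL-safe -> correctly finds the orderless customer.
good = conn.execute("""
    SELECT name FROM customers c
    WHERE NOT EXISTS (
        SELECT 1 FROM orders o WHERE o.customer_id = c.id
    );
""").fetchall()

print(bad)   # → []
print(good)  # → [('Bob',)]
```

Bob has no order, yet the NOT IN form returns nothing because 2 NOT IN (1, NULL) evaluates to UNKNOWN, not TRUE.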
Practice on PostgreSQL — it is the default dialect of most live coders
DataLemur, CoderPad, HackerRank's SQL practice, most product-analytics live screens, and most public SQL interview corpora use PostgreSQL syntax. Drill EXTRACT(...), INTERVAL '1 month', DATE_TRUNC, ::DATE casting, and COALESCE until they are reflexive. MySQL-only quirks (back-tick identifiers, the comma form LIMIT offset, count) and SQL-Server-only quirks (TOP N, OFFSET ... FETCH ...) are second priority — only drill them if you know the company uses that dialect.
Always ORDER BY and add a tiebreaker for determinism
Window functions, LIMIT N, and "top result" queries all require an ORDER BY with a deterministic tiebreaker (e.g., ORDER BY salary DESC, name). Without one, two runs of the same query can return different rows in the tie band — silently wrong in production and visibly wrong in an interview if the reviewer's reference answer locks an ordering. Always state your tiebreaker.
Read the data shape before writing the query
Before typing, ask: what is the grain of each table (one row per...)? What are the keys? Are there NULLs in the join columns? Is email indexed? The most common "almost passed" failure mode is correct happy-path SQL that breaks on the actual schema — duplicate primary keys, NULL foreign keys, off-by-one cardinality. A 30-second schema sweep prevents it.
Where to practice on PipeCode
Start with the SQL practice surface for the all-language SQL corpus. Drill the four-primitive pages: SQL joins, SQL aggregation, SQL window functions, SQL CTE, SQL subqueries. Add adjacent topics: SQL filtering, SQL null-handling, SQL date functions. The interview courses page bundles structured curricula. For broader coverage, browse by topic or read the top data engineering interview questions 2026 blog and the SQL data types in PostgreSQL guide.
Frequently Asked Questions
What SQL topics are most asked in data engineering interviews?
Four primitives carry the loop: joins (especially LEFT JOIN ... IS NULL for anti-joins), aggregations (GROUP BY plus HAVING, with COUNT(*) > 1 as the universal duplicate finder), window functions (ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD for ranking and lookback), and CTEs (WITH and WITH RECURSIVE for multi-step logic and sequence generation). Adjacent shapes — WHERE vs HAVING, NULL handling with COALESCE, dedup via ROW_NUMBER, and indexes — round out the typical fresher and mid-level loop.
What is the difference between RANK, DENSE_RANK, and ROW_NUMBER?
All three assign integers within a window. ROW_NUMBER gives every row a unique sequential integer (1, 2, 3, 4), even on ties. RANK gives tied rows the same rank but skips after them (1, 2, 2, 4). DENSE_RANK gives tied rows the same rank with no skip (1, 2, 2, 3). For "Nth distinct value" use DENSE_RANK = N; for "Nth row in skip-aware ranking order" use RANK = N; for "Nth row in arbitrary order" or "deduplicate keeping one canonical row" use ROW_NUMBER = 1.
What is the difference between WHERE and HAVING?
WHERE filters individual rows before the GROUP BY step and can reference only raw row columns. HAVING filters whole groups after the GROUP BY step and can reference aggregate functions like COUNT(*), SUM(col), AVG(col). Trying to use an aggregate in WHERE (e.g., WHERE COUNT(*) > 1) is a parse error because the aggregate does not yet exist. Both clauses can appear in the same query; WHERE runs first.
What is a CTE and when should I use one over a subquery?
A CTE (Common Table Expression, written as WITH name AS (SELECT ...)) is a named intermediate result you reference like a table in subsequent SELECTs. Use a CTE when the same intermediate is referenced multiple times, when the multi-step logic is deeply nested as a subquery, or when you need recursion (WITH RECURSIVE). Use a subquery when the intermediate is referenced exactly once and inlined cleanly. Modern PostgreSQL no longer materializes CTEs by default, so the performance gap has narrowed — pick by readability.
How do I find duplicate rows in SQL?
The canonical pattern is SELECT key, COUNT(*) FROM table GROUP BY key HAVING COUNT(*) > 1, which returns every distinct key value that appears more than once along with its multiplicity. To find duplicate (a, b, c) triples, use GROUP BY a, b, c. To delete duplicates while keeping one canonical row per key, use ROW_NUMBER() OVER (PARTITION BY key ORDER BY tiebreaker) inside a CTE and delete every row where ROW_NUMBER > 1.
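That delete pattern, sketched against Python's sqlite3 with a hypothetical users table (PostgreSQL would key the delete on the primary key or ctid rather than SQLite's rowid):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, email TEXT);
    INSERT INTO users VALUES
        (1, 'a@x.com'), (2, 'a@x.com'), (3, 'b@x.com');
""")

# Keep the lowest id per email (the canonical row); delete the rest.
# ROW_NUMBER numbers the rows within each email partition, and every
# row numbered > 1 is a duplicate.
conn.execute("""
    DELETE FROM users
    WHERE rowid IN (
        SELECT rowid FROM (
            SELECT rowid,
                   ROW_NUMBER() OVER (
                       PARTITION BY email ORDER BY id
                   ) AS rn
            FROM users
        )
        WHERE rn > 1
    );
""")
print(conn.execute("SELECT id, email FROM users ORDER BY id").fetchall())
# → [(1, 'a@x.com'), (3, 'b@x.com')]
```

Row 2 (the second 'a@x.com') is the only casualty; each email keeps exactly one canonical row.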
What is the best way to find the second-highest salary?
The cleanest answer is SELECT MAX(salary) FROM (SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS dr FROM employees) t WHERE dr = 2. DENSE_RANK = 2 reliably gives the second distinct salary even when ties exist at the top; the outer MAX(salary) collapses ties at rank 2 into a single value and returns NULL when no second distinct salary exists — both edge cases handled. Avoid LIMIT 1 OFFSET 1 — it double-counts tied top salaries unless you first apply DISTINCT, and it returns no row instead of NULL when there is no second salary.
Should I learn PostgreSQL or MySQL for SQL interviews?
PostgreSQL is the default dialect of most modern data-engineering interview platforms (DataLemur, CoderPad, most product-analytics live screens) because of its strong window-function support, EXTRACT / INTERVAL date arithmetic, and ::TYPE casting syntax. Drill PostgreSQL first; learn MySQL-specific quirks (back-tick identifiers, slightly different LIMIT syntax, no FULL OUTER JOIN at all) only if the company explicitly uses MySQL. Either way, the four primitives — joins, aggregates, windows, CTEs — are dialect-agnostic.