SQL interview questions for data engineering circle around the same four primitives in every loop, regardless of the company name on the JD: JOIN semantics with LEFT JOIN ... IS NULL for anti-joins, GROUP BY with HAVING for aggregate filters and duplicate detection, window functions like ROW_NUMBER, RANK, DENSE_RANK, LAG, and LEAD for ranking and lookback, and CTEs (including recursive CTEs) plus correlated subqueries for multi-step logic and top-N-per-group queries. Whether the prompt is "find customers who never placed an order", "count duplicate emails", "second-highest salary", or "top 3 salaries per department", the same handful of mental models keeps showing up — interviewers grade fluency with these primitives over memorized syntax.
This guide walks four topic clusters end-to-end, each with a detailed topic explanation, per-sub-topic explanation with a worked example and its solution, common beginner mistakes, and an interview-style problem with a full solution that traces the query step by step. Every example uses PostgreSQL-flavored syntax — the dialect that drives DataLemur, the live-coding environments at most product-analytics companies, and the bulk of public SQL interview corpora — and every solution ends with a concept-by-concept breakdown that explains why the query is correct, what the cost is, and where beginners typically slip.
Top SQL data engineering interview topics
The four numbered sections below follow this topic map — one row per H2, every row expanded into a full section with sub-topics, worked examples, an interview question, and a traced solution:
| # | Topic | Why it shows up in SQL DE interviews |
|---|---|---|
| 1 | INNER JOIN vs LEFT JOIN and `LEFT JOIN ... IS NULL` anti-join | Combine rows from multiple tables without inflating cardinality; the `IS NULL` trick after a `LEFT JOIN` is the canonical "find rows in A with no match in B" pattern (orphan customers, churned users). |
| 2 | GROUP BY and HAVING for aggregates and duplicate detection | `WHERE` filters rows before grouping, `HAVING` filters groups after; the `HAVING COUNT(*) > 1` shape is the universal duplicate-finder. |
| 3 | Window functions — ROW_NUMBER vs RANK vs DENSE_RANK, LAG, LEAD | Per-partition ranking, top-N-per-group, second-highest salary, running totals, lookback for month-over-month deltas — all without collapsing rows. |
| 4 | CTEs, recursive CTEs, and correlated subqueries | `WITH` clauses make multi-step logic readable; recursive CTEs generate sequences and traverse hierarchies; correlated subqueries express row-by-row predicates against the same table. |
Beginner-friendly framing: SQL engines process a query roughly in this order — `FROM`/`JOIN` → `WHERE` (row filter) → `GROUP BY` → aggregates → `HAVING` (group filter) → window functions → `SELECT` → `ORDER BY` → `LIMIT`. When in doubt, ask: "Am I filtering one row at a time (WHERE) or a whole bucket after summing (HAVING)?" That single question resolves more than half of the parse errors candidates hit on a live screen.
1. SQL Joins for Data Engineering — INNER, LEFT, and Anti-Joins
INNER JOIN, LEFT JOIN, and the LEFT JOIN ... IS NULL anti-join in SQL for data engineering
"Find customers who never placed an order" is the signature SQL join interview question — and the cleanest answer is not a NOT IN subquery or a NOT EXISTS correlated query but a LEFT JOIN with WHERE right.key IS NULL. The mental model: INNER JOIN keeps only matching pairs, LEFT JOIN keeps every left row and pads the right side with NULLs when there is no match, and filtering for right.id IS NULL after the LEFT JOIN isolates exactly the left rows that had no match — the anti-join. The same primitive surfaces churned-user, never-bought, never-clicked, and never-converted queries across product-analytics interviews.
Pro tip: `LEFT JOIN ... WHERE right.id IS NULL` is generally as fast as or faster than `NOT IN (subquery)`, because `NOT IN` returns `NULL` (not `FALSE`) for any `NULL` in the subquery and silently drops every outer row. State this gotcha out loud — interviewers grade the candidate who knows why `NOT IN` can return zero rows when the data has a single `NULL`.
INNER JOIN: keep only matching rows
The INNER JOIN invariant: a row from the left table is paired with a row from the right table iff the join predicate evaluates to TRUE; rows on either side that have no match are discarded. The cardinality of the result is at most |left| × |right| but typically far less — bounded by the multiplicity of matching keys.
- Predicate — `ON l.key = r.key` is the common shape; multi-column predicates are fine.
- No padding — unmatched rows on either side are silently dropped.
- Result schema — every column from both tables (use aliases to disambiguate).
- Cardinality risk — `1:N` on the right inflates left rows; `N:M` is a Cartesian-by-key explosion.
Worked example. Two tables, customers(id, name) with rows (1, Alice), (2, Bob), (3, Carol) and orders(order_id, customer_id, amount) with rows (101, 1, 50), (102, 1, 30), (103, 2, 80).
| customer | order_id | amount |
|---|---|---|
| Alice | 101 | 50 |
| Alice | 102 | 30 |
| Bob | 103 | 80 |
Carol has no row — INNER JOIN drops her.
Worked-example solution.
```sql
SELECT c.name AS customer, o.order_id, o.amount
FROM customers c
INNER JOIN orders o
  ON o.customer_id = c.id;
```
Rule of thumb: reach for INNER JOIN when the question is "rows where both sides exist"; it is the smallest, fastest, most common join.
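As a sanity check, the worked example above can be run end-to-end; this sketch uses Python's built-in sqlite3 as a stand-in for PostgreSQL (the join semantics are identical for this query):

```python
import sqlite3

# Load the worked-example rows into an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    INSERT INTO orders VALUES (101, 1, 50), (102, 1, 30), (103, 2, 80);
""")

# INNER JOIN keeps only matching pairs: Carol has no order, so she is absent.
inner_rows = conn.execute("""
    SELECT c.name AS customer, o.order_id, o.amount
    FROM customers c
    INNER JOIN orders o ON o.customer_id = c.id
    ORDER BY o.order_id
""").fetchall()

print(inner_rows)
# [('Alice', 101, 50), ('Alice', 102, 30), ('Bob', 103, 80)]
```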
LEFT JOIN: keep every left row, pad the right with NULL
The LEFT JOIN invariant: every row from the left table appears in the output; if the join predicate matches at least one right row, the right columns are filled in; otherwise the right columns are NULL. The result has at least |left| rows and at most |left| × max_right_match rows.
- All left rows preserved — even unmatched ones.
- Right columns are `NULL` when there is no match — the key signal for the anti-join trick.
- `RIGHT JOIN` — the mirror image; rarely needed, since you can flip the table order.
- `FULL OUTER JOIN` — keeps unmatched rows from both sides; less common in interviews.
Worked example. Same customers and orders. Carol stays in the output.
| customer | order_id | amount |
|---|---|---|
| Alice | 101 | 50 |
| Alice | 102 | 30 |
| Bob | 103 | 80 |
| Carol | NULL | NULL |
Worked-example solution.
```sql
SELECT c.name AS customer, o.order_id, o.amount
FROM customers c
LEFT JOIN orders o
  ON o.customer_id = c.id;
```
Rule of thumb: LEFT JOIN is the right answer whenever the question asks for "every X, with Y when it exists" — a churn report, a coverage report, a left-padded join for downstream pipelines.
LEFT JOIN ... IS NULL anti-join: rows with no match
The anti-join invariant: a LEFT JOIN followed by WHERE right.key IS NULL keeps exactly the left rows for which no right row matched. Equivalent in result to NOT EXISTS and (under NOT NULL constraints) to NOT IN, but typically faster than the latter and immune to the NULL-swallowing bug.
- `LEFT JOIN` — preserves every left row.
- `WHERE right.<pk> IS NULL` — strips out every left row that did match, leaving only the unmatched ones.
- `NOT EXISTS` equivalent — `WHERE NOT EXISTS (SELECT 1 FROM right WHERE right.fk = left.pk)`.
- `NOT IN` pitfall — returns zero rows if the subquery contains a single `NULL`; avoid on production data.
Worked example. Same customers and orders; Carol has no order.
| customer | matched_order_id | passes filter? |
|---|---|---|
| Alice | 101 | ✗ (matched) |
| Bob | 103 | ✗ (matched) |
| Carol | NULL | ✓ (anti-match) |
Worked-example solution.
```sql
SELECT c.name AS customer
FROM customers c
LEFT JOIN orders o
  ON o.customer_id = c.id
WHERE o.order_id IS NULL;
```
Rule of thumb: "find X with no Y" → LEFT JOIN ... WHERE Y.id IS NULL. Memorize this pattern; it is the single most-asked SQL join shape in data-engineering interviews.
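Both the `LEFT JOIN` padding and the anti-join filter can be verified on the same sample rows; a minimal sketch using Python's built-in sqlite3 (same semantics as the PostgreSQL form for these queries):

```python
import sqlite3

# Same sample data as the joins walkthrough.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    INSERT INTO orders VALUES (101, 1, 50), (102, 1, 30), (103, 2, 80);
""")

# LEFT JOIN keeps Carol, padding the right side with NULL ...
left_rows = conn.execute("""
    SELECT c.name, o.order_id
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    ORDER BY c.id, o.order_id
""").fetchall()

# ... and filtering on that padded NULL isolates the anti-join rows.
no_orders = conn.execute("""
    SELECT c.name
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    WHERE o.order_id IS NULL
""").fetchall()

print(left_rows)   # [('Alice', 101), ('Alice', 102), ('Bob', 103), ('Carol', None)]
print(no_orders)   # [('Carol',)]
```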
Common beginner mistakes
- Filtering the right table inside `WHERE` after a `LEFT JOIN` (e.g., `WHERE o.amount > 0`) — silently turns the `LEFT JOIN` back into an `INNER JOIN` because `NULL > 0` is `NULL`, which fails the filter.
- Using `NOT IN (subquery)` when the subquery can return `NULL` — drops every outer row.
- Forgetting to alias both sides of the join — `id` is ambiguous when both tables have it.
- Joining on the wrong column (`o.id = c.id` instead of `o.customer_id = c.id`) — produces a Cartesian-flavored mess.
- Using `LEFT JOIN` when `INNER JOIN` is correct — leaves spurious `NULL` rows in the output.
SQL Interview Question on Customers With No Orders
Given customers(id, name) and orders(order_id, customer_id, amount), return the names of customers who have never placed an order.
Solution Using LEFT JOIN ... WHERE orders.order_id IS NULL
```sql
SELECT c.name
FROM customers c
LEFT JOIN orders o
  ON o.customer_id = c.id
WHERE o.order_id IS NULL;
```
Why this works: the LEFT JOIN preserves every customer row regardless of whether a matching order exists; for matched customers, o.order_id carries a real value; for unmatched customers, the right-side columns are NULL and o.order_id IS NULL is TRUE; filtering on that predicate keeps only the unmatched customers — the anti-join. Single pass over customers; one keyed lookup into orders per customer; no subquery materialization.
Step-by-step trace for the sample input:
| customers.id | customers.name | LEFT JOIN orders.order_id | IS NULL? | survives? |
|---|---|---|---|---|
| 1 | Alice | 101 | ✗ | ✗ |
| 1 | Alice | 102 | ✗ | ✗ |
| 2 | Bob | 103 | ✗ | ✗ |
| 3 | Carol | NULL | ✓ | ✓ |
Only Carol's row survives the WHERE filter.
Output:
| name |
|---|
| Carol |
Why this works — concept by concept:
- `LEFT JOIN` semantics — keeps every left row; right side is `NULL` when there is no match. This `NULL` is the entire signal we filter on.
- `WHERE o.order_id IS NULL` — `o.order_id` is the right-side primary key; it is `NULL` only when the join produced a synthetic unmatched row. A real `NULL` order-id from the source table never happens because primary keys are `NOT NULL`.
- Anti-join semantics — equivalent to `NOT EXISTS (SELECT 1 FROM orders WHERE customer_id = c.id)`; the `LEFT JOIN ... IS NULL` form is typically faster on planners that materialize a hash join.
- No `NULL`-swallowing — unlike `NOT IN`, the predicate is `IS NULL`, which is well-defined for `NULL` values. There is no silent zero-row failure.
- `O(|customers| + |orders|)` time — hash-join build on `orders.customer_id`, single probe per customer. An index on `orders.customer_id` makes this near-linear.
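The `NOT IN` failure mode is worth seeing once with real rows. This sketch assumes one extra order with a `NULL` customer_id (a hypothetical guest checkout); with that single `NULL` in the subquery, `NOT IN` returns zero rows while the anti-join still finds Carol. Run with Python's built-in sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    -- order 104 has a NULL customer_id (assumed guest checkout)
    INSERT INTO orders VALUES (101, 1, 50), (103, 2, 80), (104, NULL, 10);
""")

# NOT IN: the single NULL in the subquery makes every "id NOT IN (...)"
# evaluate to NULL instead of TRUE, so the filter drops ALL outer rows.
not_in = conn.execute("""
    SELECT name FROM customers
    WHERE id NOT IN (SELECT customer_id FROM orders)
""").fetchall()

# The anti-join is immune: Carol still comes back.
anti_join = conn.execute("""
    SELECT c.name
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    WHERE o.order_id IS NULL
""").fetchall()

print(not_in)     # [] -- zero rows, the silent failure
print(anti_join)  # [('Carol',)]
```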
Drill the SQL joins practice page for INNER, LEFT, and anti-join shapes, and the SQL filtering practice page for the WHERE vs HAVING distinction.
2. SQL Aggregations and GROUP BY for Data Engineering
GROUP BY, HAVING, and aggregate functions in SQL for data engineering
"Find duplicate emails in the users table" and "find the department with the highest average salary" are the two signature aggregation prompts — and both reduce to GROUP BY + an aggregate function + HAVING filter. The mental model: GROUP BY col collapses rows that share the same col value into a single output row; COUNT(*), SUM(...), AVG(...), MIN(...), MAX(...) summarize each bucket; WHERE filters individual rows before grouping; HAVING filters whole groups after grouping. Aggregate predicates belong in HAVING; row predicates belong in WHERE. Mixing the two up is the most common parse-error in live coding rounds.
Pro tip: When the question is "find duplicates", the canonical shape is `SELECT key, COUNT(*) FROM t GROUP BY key HAVING COUNT(*) > 1`. State the `HAVING COUNT(*) > 1` half out loud while writing — it signals that you understand `WHERE` cannot reference aggregates and that you can compose row-level and group-level filters in the right order.
COUNT, SUM, AVG, MIN, MAX — NULL-aware aggregates
The aggregate-NULL invariant: COUNT(*) counts every row including ones with NULL columns; COUNT(col) counts only rows where col is not NULL; SUM, AVG, MIN, MAX skip NULL values entirely; if every value in a group is NULL, the result is NULL (not 0). Beginners conflate COUNT(*) and COUNT(col) and silently report wrong totals.
- `COUNT(*)` — every row in the bucket, regardless of `NULL`s.
- `COUNT(col)` — non-`NULL` values of `col` only.
- `COUNT(DISTINCT col)` — unique non-`NULL` values; essential after a `JOIN` that may have inflated rows.
- `SUM`/`AVG` — numeric only; `AVG` is sum-of-non-null divided by count-of-non-null, so `NULL` does not count as `0`.
Worked example. Three rows in one group: amount = 10, NULL, 30.
| aggregate | result |
|---|---|
| `COUNT(*)` | 3 |
| `COUNT(amount)` | 2 |
| `SUM(amount)` | 40 |
| `AVG(amount)` | 20 |
| `MIN(amount)` | 10 |
| `MAX(amount)` | 30 |
Worked-example solution.
```sql
SELECT user_id,
       COUNT(*)      AS n_rows,
       COUNT(amount) AS n_known,
       SUM(amount)   AS total,
       AVG(amount)   AS mean,
       MIN(amount)   AS lo,
       MAX(amount)   AS hi
FROM orders
GROUP BY user_id;
```
Rule of thumb: if the metric is "people who clicked" use COUNT(DISTINCT user_id); if it is "click events" use COUNT(*); never confuse the two.
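The NULL-aware results table above reproduces directly; a minimal sketch of the same three-row group using Python's built-in sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (user_id INTEGER, amount INTEGER);
    INSERT INTO orders VALUES (1, 10), (1, NULL), (1, 30);
""")

# COUNT(*) counts all 3 rows; COUNT(amount) skips the NULL;
# SUM/AVG/MIN/MAX ignore NULL entirely, so AVG is 40 / 2 = 20.
row = conn.execute("""
    SELECT COUNT(*), COUNT(amount), SUM(amount),
           AVG(amount), MIN(amount), MAX(amount)
    FROM orders
    GROUP BY user_id
""").fetchone()

print(row)  # (3, 2, 40, 20.0, 10, 30)
```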
WHERE vs HAVING — row filter vs group filter
The two-clause invariant: WHERE runs before GROUP BY and references raw row columns only; HAVING runs after grouping and can reference aggregate functions; trying to use WHERE COUNT(*) > 1 is a parse error because COUNT(*) does not exist until after grouping. They are not interchangeable — both can appear in the same query.
- `WHERE` — filter rows; uses `col`, `col2`, etc.
- `HAVING` — filter groups; uses `COUNT(*)`, `SUM(col)`, etc.
- Order of evaluation — `FROM` → `WHERE` → `GROUP BY` → `HAVING` → `SELECT`.
- Performance — push predicates into `WHERE` whenever possible; `WHERE` filters before the (often expensive) sort/hash step.
Worked example. employees(department, salary) with six rows; ask for departments whose average salary exceeds 50,000 across employees earning more than 30,000.
| department | salary |
|---|---|
| eng | 40,000 |
| eng | 70,000 |
| eng | 25,000 |
| sales | 60,000 |
| sales | 60,000 |
| sales | 20,000 |
WHERE salary > 30000 drops the 25k and 20k rows; GROUP BY department then computes AVG; HAVING AVG(salary) > 50000 keeps only departments whose surviving rows average above 50k.
Worked-example solution.
```sql
SELECT department, AVG(salary) AS avg_salary
FROM employees
WHERE salary > 30000
GROUP BY department
HAVING AVG(salary) > 50000;
```
Rule of thumb: aggregate predicate → HAVING; row predicate → WHERE. If the predicate uses SUM / COUNT / AVG / MIN / MAX, it must live in HAVING.
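The WHERE-then-HAVING pipeline can be traced on the six-row sample; a sketch via Python's built-in sqlite3 (note that without the `WHERE`, eng would average 45,000 and be dropped by `HAVING`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (department TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('eng', 40000), ('eng', 70000), ('eng', 25000),
        ('sales', 60000), ('sales', 60000), ('sales', 20000);
""")

# WHERE trims rows first (drops the 25k and 20k rows),
# HAVING then trims whole groups after aggregation.
result = conn.execute("""
    SELECT department, AVG(salary) AS avg_salary
    FROM employees
    WHERE salary > 30000
    GROUP BY department
    HAVING AVG(salary) > 50000
    ORDER BY department
""").fetchall()

print(result)  # [('eng', 55000.0), ('sales', 60000.0)]
```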
HAVING COUNT(*) > 1 — the universal duplicate finder
The duplicate-detection invariant: SELECT key, COUNT(*) FROM t GROUP BY key HAVING COUNT(*) > 1 returns every distinct key value that appears more than once in t, along with its multiplicity. Replace key with the column you want to dedupe on (email, user_id, order_id), and the same query surfaces every duplicate group.
- `GROUP BY key` — one bucket per distinct value.
- `COUNT(*) > 1` — keeps only buckets with at least two rows.
- `COUNT(*) = N` — variant for "exactly N copies"; rare but cleanly expressible.
- Composite keys — `GROUP BY a, b, c HAVING COUNT(*) > 1` finds duplicate `(a, b, c)` triples.
Worked example. users(id, email) with two alice@example.com rows.
| id | email |
|---|---|
| 1 | alice@example.com |
| 2 | bob@example.com |
| 3 | alice@example.com |
Group by email; the alice@example.com group has COUNT(*) = 2; the bob@example.com group has COUNT(*) = 1 and is filtered out.
Worked-example solution.
```sql
SELECT email, COUNT(*) AS n_copies
FROM users
GROUP BY email
HAVING COUNT(*) > 1;
```
Rule of thumb: every "find duplicates" question reduces to GROUP BY key HAVING COUNT(*) > 1; reach for ROW_NUMBER() OVER (PARTITION BY key) only when you need to delete the duplicates and keep one canonical row.
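The duplicate finder runs unchanged on SQLite; a sketch with the three-row sample via Python's built-in sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
    INSERT INTO users VALUES
        (1, 'alice@example.com'), (2, 'bob@example.com'),
        (3, 'alice@example.com');
""")

# One bucket per distinct email; HAVING keeps only multi-row buckets.
dupes = conn.execute("""
    SELECT email, COUNT(*) AS n_copies
    FROM users
    GROUP BY email
    HAVING COUNT(*) > 1
""").fetchall()

print(dupes)  # [('alice@example.com', 2)]
```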
Common beginner mistakes
- Writing `WHERE COUNT(*) > 1` — parse error; `WHERE` cannot reference aggregates.
- Selecting a non-aggregated, non-`GROUP BY` column — strict SQL rejects this; lax dialects pick an arbitrary value silently.
- Forgetting `COUNT(DISTINCT user_id)` after a `JOIN` that inflates rows — reports inflated user counts.
- Using `AVG(col)` and forgetting that `NULL` rows are excluded from the denominator — wrong for "treat missing as 0" metrics; use `AVG(COALESCE(col, 0))` only when the spec says so.
- Putting `HAVING` before `GROUP BY` — syntax error; the order is `WHERE → GROUP BY → HAVING`.
SQL Interview Question on Duplicate Emails
Given users(id, email), return every email that appears more than once in the table, along with the number of copies. Output email and n_copies.
Solution Using GROUP BY email HAVING COUNT(*) > 1
```sql
SELECT email,
       COUNT(*) AS n_copies
FROM users
GROUP BY email
HAVING COUNT(*) > 1
ORDER BY n_copies DESC, email;
```
Why this works: GROUP BY email collapses every row with the same email into a single bucket; COUNT(*) counts how many rows fell into each bucket; HAVING COUNT(*) > 1 keeps only buckets with at least two rows; ORDER BY n_copies DESC, email produces a deterministic, reviewer-friendly output. Single pass over users; sort cost dominates only when the email cardinality is huge.
Step-by-step trace for the sample input:
| id | email |
|---|---|
| 1 | alice@example.com |
| 2 | bob@example.com |
| 3 | alice@example.com |
| 4 | carol@example.com |
| 5 | bob@example.com |
| 6 | alice@example.com |
- `FROM users` — read all six rows.
- No `WHERE` — every row passes.
- `GROUP BY email` — three buckets: `alice` (3 rows), `bob` (2 rows), `carol` (1 row).
- `COUNT(*)` — 3, 2, 1 respectively.
- `HAVING COUNT(*) > 1` — drops the `carol` bucket.
- `ORDER BY n_copies DESC, email` — `alice` (3), then `bob` (2).
Output:
| email | n_copies |
|---|---|
| alice@example.com | 3 |
| bob@example.com | 2 |
Why this works — concept by concept:
- `GROUP BY email` — collapses to one bucket per distinct email; the bucket is the unit of all subsequent aggregates and group-level filters.
- `COUNT(*)` — counts every row in the bucket, including ones with `NULL` non-key columns; perfect for "how many copies".
- `HAVING COUNT(*) > 1` — group-level filter; the aggregate predicate must live here, not in `WHERE`. This is the precise interview signal for duplicate detection.
- `ORDER BY n_copies DESC, email` — deterministic ordering; tie-broken by `email` so the output is stable across runs and reviewers.
- `O(|users| + G log G)` time — a single hash aggregation over `users` produces `G` group rows; the final sort is `G log G`. With an index on `email`, the planner may use stream aggregation and skip the hash step.
Drill the SQL aggregation practice page for GROUP BY and HAVING shapes, and the SQL null-handling practice page for NULL-aware aggregates.
3. SQL Window Functions for Data Engineering — ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD
Window functions for ranking and lookback in SQL for data engineering
"Find the second-highest salary" and "find the top 3 salaries per department" are the two signature window-function prompts — and both reduce to a DENSE_RANK() OVER (PARTITION BY ... ORDER BY ...) filter. The mental model: a window function computes a value across a set of rows ("the window") that are related to the current row, without collapsing the rows like GROUP BY does; OVER (PARTITION BY col) defines the window boundary; OVER (ORDER BY col) defines the order within the window. ROW_NUMBER assigns unique sequential numbers; RANK skips after ties (1, 2, 2, 4); DENSE_RANK does not skip (1, 2, 2, 3). LAG looks back; LEAD looks forward. These five primitives drive almost every "ranking" or "lookback" SQL interview question.
Pro tip: When the question is "second-highest salary", reach for `DENSE_RANK` over `RANK` — `DENSE_RANK = 2` reliably means "second distinct salary" even when ties exist at the top, while `RANK = 2` skips entirely if there are two rows tied for first. State this distinction; interviewers grade it.
ROW_NUMBER — unique sequential numbering per partition
The ROW_NUMBER invariant: ROW_NUMBER() OVER (PARTITION BY p ORDER BY o) assigns a unique integer 1, 2, 3, ... to every row inside each partition p, ordered by o; ties in o are broken arbitrarily by the planner. Use it when you need a unique sequence per group regardless of tie semantics — most often for deduplication.
- `OVER (PARTITION BY ...)` — bucket the rows.
- `OVER (ORDER BY ...)` — order within the bucket.
- Ties broken arbitrarily — add a tiebreaker column to `ORDER BY` for determinism.
- Top-N-per-group — `WHERE rn <= N` after `ROW_NUMBER`; works only when ties at rank N are not desired.
Worked example. employees(department, name, salary) with three engineers; rank by salary desc.
| department | name | salary | row_number |
|---|---|---|---|
| eng | Alice | 90,000 | 1 |
| eng | Bob | 80,000 | 2 |
| eng | Carol | 80,000 | 3 |
Bob and Carol tie on salary; ROW_NUMBER still gives them unique ranks (planner-chosen).
Worked-example solution.
```sql
SELECT department, name, salary,
       ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC, name) AS rn
FROM employees;
```
Rule of thumb: ROW_NUMBER is the right tool for deduplication (WHERE rn = 1) and for ordered streams; reach for RANK or DENSE_RANK when ties must be honored.
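The worked example above can be checked with Python's built-in sqlite3 (window functions need a SQLite 3.25+ build, which modern Python releases bundle); the `name` tiebreaker makes the Bob/Carol numbering deterministic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (department TEXT, name TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('eng', 'Alice', 90000), ('eng', 'Bob', 80000), ('eng', 'Carol', 80000);
""")

# Unique sequential numbers per partition; ties broken by name.
ranked = conn.execute("""
    SELECT name,
           ROW_NUMBER() OVER (
               PARTITION BY department ORDER BY salary DESC, name
           ) AS rn
    FROM employees
    ORDER BY rn
""").fetchall()

print(ranked)  # [('Alice', 1), ('Bob', 2), ('Carol', 3)]
```

Wrapping this in a subquery and filtering `WHERE rn = 1` is the standard dedup idiom the rule of thumb refers to.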
RANK vs DENSE_RANK — tie semantics
The rank-vs-dense-rank invariant: both assign the same rank to tied rows; RANK then skips the next k-1 ranks (gap), while DENSE_RANK continues without a gap. For "find the Nth distinct value" questions, DENSE_RANK = N is the correct filter; for "find the Nth row in order" questions, ROW_NUMBER = N is correct.
- `RANK` — `1, 2, 2, 4` — skips after ties.
- `DENSE_RANK` — `1, 2, 2, 3` — no skip.
- `ROW_NUMBER` — `1, 2, 3, 4` — never ties.
- Pick by question semantics — "Nth highest distinct salary" → `DENSE_RANK = N`; "Nth-highest-salaried row in ranking order with skips" → `RANK = N`.
Worked example. Same employees, plus a fourth row (Dan, 70,000) below the tie.
| name | salary | rank | dense_rank | row_number |
|---|---|---|---|---|
| Alice | 90,000 | 1 | 1 | 1 |
| Bob | 80,000 | 2 | 2 | 2 |
| Carol | 80,000 | 2 | 2 | 3 |
| Dan | 70,000 | 4 | 3 | 4 |
RANK jumps 2 → 4; DENSE_RANK continues 2 → 3.
Worked-example solution.
```sql
SELECT name, salary,
       RANK()       OVER (ORDER BY salary DESC) AS rnk,
       DENSE_RANK() OVER (ORDER BY salary DESC) AS dr,
       ROW_NUMBER() OVER (ORDER BY salary DESC) AS rn
FROM employees;
```
Rule of thumb: "second highest salary" → DENSE_RANK = 2; "top 3 distinct salaries" → DENSE_RANK <= 3; never use RANK for these unless the spec explicitly says ties should consume rank slots.
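The three tie behaviors side by side, sketched with Python's built-in sqlite3 (SQLite 3.25+ for window functions; a `name` tiebreaker is added to the `ROW_NUMBER` ordering so the output is deterministic):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('Alice', 90000), ('Bob', 80000), ('Carol', 80000), ('Dan', 70000);
""")

rows = conn.execute("""
    SELECT name,
           RANK()       OVER (ORDER BY salary DESC)       AS rnk,
           DENSE_RANK() OVER (ORDER BY salary DESC)       AS dr,
           ROW_NUMBER() OVER (ORDER BY salary DESC, name) AS rn
    FROM employees
    ORDER BY rn
""").fetchall()

print(rows)
# [('Alice', 1, 1, 1), ('Bob', 2, 2, 2), ('Carol', 2, 2, 3), ('Dan', 4, 3, 4)]
```

Note the Dan row: `RANK` jumps 2 → 4 while `DENSE_RANK` continues 2 → 3, exactly as in the table above.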
LAG and LEAD — lookback and lookahead
The lookback invariant: LAG(col) OVER (PARTITION BY p ORDER BY o) returns the value of col in the previous row within the partition (or NULL for the first row); LEAD returns the next row's value. Both take an optional offset (default 1) and an optional default value to substitute for NULL. They power running deltas, month-over-month growth, sessionization gap detection, and previous-event-aware analytics.
- `LAG(col, n, dflt)` — value `n` rows back; `dflt` (default `NULL`) when out of range.
- `LEAD(col, n, dflt)` — symmetric, looking forward.
- `PARTITION BY user_id` — restart the lookback at each user.
- `col - LAG(col)` — the delta-from-previous-row idiom.
Worked example. sales(sales_date, amount) with a contiguous month series; compute month-over-month delta.
| sales_date | amount | lag_amount | delta |
|---|---|---|---|
| 2026-01-01 | 100 | NULL | NULL |
| 2026-02-01 | 130 | 100 | 30 |
| 2026-03-01 | 120 | 130 | -10 |
The first row's LAG is NULL; subtraction yields NULL; consumers usually COALESCE(delta, 0) for display.
Worked-example solution.
```sql
SELECT sales_date,
       amount,
       LAG(amount) OVER (ORDER BY sales_date) AS prev_amount,
       amount - LAG(amount) OVER (ORDER BY sales_date) AS mom_delta
FROM sales;
```
Rule of thumb: LAG for "compare this row to its predecessor" (delta, retention, gap); LEAD for "what happens next" (sessionization, churn-from-here). Always PARTITION BY the entity if the table holds multiple series.
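The month-over-month delta table reproduces directly; a sketch with Python's built-in sqlite3 (SQLite 3.25+), where the first row's `LAG` comes back as `None`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (sales_date TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('2026-01-01', 100), ('2026-02-01', 130), ('2026-03-01', 120);
""")

# LAG looks one row back; subtracting from the current row gives the delta.
deltas = conn.execute("""
    SELECT sales_date,
           amount,
           LAG(amount) OVER (ORDER BY sales_date) AS prev_amount,
           amount - LAG(amount) OVER (ORDER BY sales_date) AS mom_delta
    FROM sales
""").fetchall()

print(deltas)
# [('2026-01-01', 100, None, None), ('2026-02-01', 130, 100, 30),
#  ('2026-03-01', 120, 130, -10)]
```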
Common beginner mistakes
- Using `RANK` when the question wants the Nth distinct value — `RANK = 2` skips entirely if two rows tie for first.
- Forgetting `PARTITION BY` for a per-group ranking — produces a global ranking instead of per-department.
- Using `WHERE rn = 2` directly without wrapping in a subquery or CTE — window functions cannot be referenced in the `WHERE` of the same `SELECT` (they run after `WHERE`).
- Confusing `LAG` (previous) with `LEAD` (next) — quietly produces inverted deltas.
- Forgetting `ORDER BY` inside `OVER` — required for `ROW_NUMBER`, `RANK`, `LAG`, `LEAD`; the result is non-deterministic without it.
SQL Interview Question on Second-Highest Salary
Given employees(emp_id, name, salary), return the second-highest distinct salary. If there is no second-highest distinct salary (e.g., all employees earn the same), return NULL.
Solution Using DENSE_RANK() OVER (ORDER BY salary DESC)
```sql
SELECT MAX(salary) AS second_highest_salary
FROM (
    SELECT salary,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS dr
    FROM employees
) ranked
WHERE dr = 2;
```
Why this works: DENSE_RANK() OVER (ORDER BY salary DESC) numbers each row by its salary rank, with no gaps after ties — dr = 1 is the highest distinct salary, dr = 2 is the second-highest distinct salary; the outer WHERE dr = 2 filters to that group; MAX(salary) collapses to a single row and (critically) returns NULL if no rows match — handling the "no second-highest" edge case gracefully without a LIMIT 1 OFFSET 1 that would error or return zero rows.
Step-by-step trace for the sample input:
| emp_id | name | salary |
|---|---|---|
| 1 | Alice | 90,000 |
| 2 | Bob | 80,000 |
| 3 | Carol | 80,000 |
| 4 | Dan | 70,000 |
- Inner `SELECT` reads all four rows.
- `DENSE_RANK() OVER (ORDER BY salary DESC)` assigns: Alice → 1, Bob → 2, Carol → 2, Dan → 3.
- Outer `WHERE dr = 2` keeps Bob and Carol.
- `MAX(salary)` collapses to a single row: 80,000.
- No-match path — if every employee earned 90,000, the inner query has only `dr = 1` rows; the outer `WHERE dr = 2` filters to zero rows; `MAX(salary)` returns `NULL` (per the spec).
Output:
| second_highest_salary |
|---|
| 80000 |
Why this works — concept by concept:
- `DENSE_RANK()` over `RANK()` — `DENSE_RANK = 2` is the second distinct salary, regardless of how many people tie for first; `RANK = 2` would skip if two people tied for first.
- `OVER (ORDER BY salary DESC)` — a single global window, ordered by salary descending; no `PARTITION BY` because the question is over the whole table.
- Subquery wrapper — required because window functions cannot be referenced in the `WHERE` of the same `SELECT`; the outer query reads the materialized rank.
- `MAX(salary)` collapses ties — when `dr = 2` matches multiple rows (a tie), `MAX` returns one value; when `dr = 2` matches zero rows, `MAX` returns `NULL` — both edge cases handled.
- `O(N log N)` time — the sort by salary dominates; ranking is `O(N)` over the sorted stream.
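Both paths, including the no-second-highest edge case, can be exercised in one script; a sketch with Python's built-in sqlite3 (SQLite 3.25+ for `DENSE_RANK`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (emp_id INTEGER, name TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        (1, 'Alice', 90000), (2, 'Bob', 80000),
        (3, 'Carol', 80000), (4, 'Dan', 70000);
""")

QUERY = """
    SELECT MAX(salary) AS second_highest_salary
    FROM (SELECT salary,
                 DENSE_RANK() OVER (ORDER BY salary DESC) AS dr
          FROM employees) ranked
    WHERE dr = 2
"""

second = conn.execute(QUERY).fetchone()[0]
print(second)  # 80000

# Edge case: everyone ties for first, so no dr = 2 rows exist and
# MAX over the empty set returns NULL (None in Python).
conn.execute("UPDATE employees SET salary = 90000")
no_second = conn.execute(QUERY).fetchone()[0]
print(no_second)  # None
```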
Drill the SQL window-functions practice page for ROW_NUMBER, RANK, DENSE_RANK, LAG, and LEAD shapes, and the SQL date-functions practice page for time-series window patterns.
4. SQL CTEs and Subqueries for Data Engineering — WITH, Recursive CTEs, and Correlated Subqueries
CTE composition, recursive CTEs, and correlated subqueries in SQL for data engineering
"Find the top 3 salaries per department" and "find employees earning above their department average" are the two signature CTE-and-subquery prompts — and they showcase the two complementary patterns. The mental model: a CTE (WITH name AS (SELECT ...)) names an intermediate result you reference like a table; a recursive CTE (WITH RECURSIVE) repeatedly evaluates a base case plus a recursive case, terminating when no new rows are added; a correlated subquery is a subquery whose WHERE clause references the outer query's alias, re-evaluating per outer row. CTEs win on readability for multi-step logic; correlated subqueries win on per-row predicates against the same table.
Pro tip: When the prompt is "top N per group", reach for the CTE-plus-`DENSE_RANK` pattern, not `LIMIT N` — `LIMIT` does not respect partitions and only works on the global stream. The CTE makes the per-group ranking explicit and trivial to reason about; the alternative correlated-subquery `WHERE col >= (SELECT col FROM t2 ... LIMIT 1 OFFSET N - 1)` is an interview red flag.
WITH name AS (SELECT ...) — non-recursive CTEs for readability
The CTE invariant: WITH name AS (SELECT ...) SELECT ... FROM name defines an intermediate result (a "common table expression") that subsequent SELECTs reference like a table; the engine may inline or materialize it depending on cost; multiple CTEs can be chained in a single WITH clause separated by commas. CTEs let you build up complex queries step by step without nesting subqueries five layers deep.
- Single CTE — `WITH a AS (...) SELECT * FROM a`.
- Chained CTEs — `WITH a AS (...), b AS (SELECT * FROM a WHERE ...) SELECT * FROM b`.
- Re-reference — a CTE can be referenced multiple times in the main query.
- No mandatory materialization — modern planners often inline; PostgreSQL 12+ removed the implicit fence.
Worked example. Find departments where the average salary is above 50,000, then list every employee in those departments.
| step | output rows |
|---|---|
| `high_paying` CTE | departments with `AVG(salary) > 50000` |
| outer `SELECT` | every employee whose department is in `high_paying` |
Worked-example solution.
```sql
WITH high_paying AS (
    SELECT department
    FROM employees
    GROUP BY department
    HAVING AVG(salary) > 50000
)
SELECT e.name, e.salary, e.department
FROM employees e
JOIN high_paying h ON h.department = e.department;
```
Rule of thumb: if you find yourself nesting a subquery three levels deep, refactor to a CTE — the engine produces the same plan but the human review takes a tenth of the time.
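The CTE shape runs unchanged on SQLite; this sketch uses assumed sample rows (not the six-row table from Section 2) chosen so exactly one department clears the 50,000 bar, via Python's built-in sqlite3:

```python
import sqlite3

# Assumed rows: eng averages 85k (qualifies), sales 40k (does not).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER);
    INSERT INTO employees VALUES
        ('Alice', 'eng', 90000), ('Bob', 'eng', 80000), ('Dan', 'sales', 40000);
""")

members = conn.execute("""
    WITH high_paying AS (
        SELECT department
        FROM employees
        GROUP BY department
        HAVING AVG(salary) > 50000
    )
    SELECT e.name, e.department
    FROM employees e
    JOIN high_paying h ON h.department = e.department
    ORDER BY e.name
""").fetchall()

print(members)  # [('Alice', 'eng'), ('Bob', 'eng')]
```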
WITH RECURSIVE — recursive CTEs for sequences and hierarchies
The recursive-CTE invariant: a WITH RECURSIVE CTE has two parts joined by UNION ALL: an anchor query (base case, evaluated once) and a recursive query (refers to the CTE itself, evaluated repeatedly until it returns no new rows); the planner accumulates results across iterations until termination. Use it for sequence generation (1..N, dates), hierarchy traversal (org charts, BOM trees, parent-child relationships), and graph-style reachability queries.
- Anchor (base) — `SELECT 1` or `SELECT root_id FROM table WHERE parent IS NULL`.
- `UNION ALL` — combines anchor and recursive output.
- Recursive query — references the CTE name; must converge (return zero rows eventually).
- Termination — the engine stops when the recursive step produces no new rows.
Worked example. Generate the integers 1 through 5.
| iteration | new row |
|---|---|
| anchor | 1 |
| step 1 | 2 |
| step 2 | 3 |
| step 3 | 4 |
| step 4 | 5 |
| step 5 | (none — n = 5 fails the predicate) |
Worked-example solution.
```sql
WITH RECURSIVE nums(n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1
    FROM nums
    WHERE n < 5
)
SELECT n FROM nums;
```
Rule of thumb: always include a termination predicate (WHERE n < N); a missing or wrong predicate produces an infinite loop that most planners eventually kill with an out-of-memory error.
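The 1-through-5 generator runs identically on SQLite; a sketch via Python's built-in sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

nums = conn.execute("""
    WITH RECURSIVE nums(n) AS (
        SELECT 1            -- anchor: the base case, evaluated once
        UNION ALL
        SELECT n + 1        -- recursive step, re-evaluated each iteration
        FROM nums
        WHERE n < 5         -- termination predicate: no new rows at n = 5
    )
    SELECT n FROM nums
""").fetchall()

values = [n for (n,) in nums]
print(values)  # [1, 2, 3, 4, 5]
```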
Correlated subqueries — per-row predicates against the same table
The correlated-subquery invariant: a subquery whose WHERE clause references a column of the outer query is re-evaluated for every outer row; this enables predicates like "salary above this row's department average" without a JOIN. Powerful but expensive; the planner sometimes rewrites them into joins or hash-aggregations automatically.
- Outer alias reference — `WHERE inner.dept = outer.dept`.
- `EXISTS` — short-circuit-friendly; stops on the first match.
- `NOT EXISTS` — anti-join equivalent; immune to `NULL` like `LEFT JOIN ... IS NULL`.
- Performance — index the inner referenced column; otherwise the planner runs an inner scan per outer row.
Worked example. Find employees whose salary is above their department's average.
| name | dept | salary | dept_avg | survives? |
|---|---|---|---|---|
| Alice | eng | 90,000 | 76,667 | ✓ |
| Bob | eng | 80,000 | 76,667 | ✓ |
| Carol | eng | 60,000 | 76,667 | ✗ |
| Dan | sales | 60,000 | 60,000 | ✗ |
Worked-example solution.
```sql
SELECT name, department, salary
FROM employees e
WHERE salary > (
    SELECT AVG(salary)
    FROM employees
    WHERE department = e.department
);
```
Rule of thumb: prefer a CTE + window function (AVG(salary) OVER (PARTITION BY department)) for performance on large tables; the correlated subquery is the clearer but slower form. Interviewers grade you on knowing both.
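The window-function rewrite can be sketched as follows. Python's sqlite3 stands in for PostgreSQL (both support AVG(...) OVER (PARTITION BY ...)), and the table contents mirror the worked example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", [
    ("Alice", "eng", 90000), ("Bob", "eng", 80000),
    ("Carol", "eng", 60000), ("Dan", "sales", 60000),
])

# CTE + window form: compute each department's average once per row
# via a window, then filter in the outer query. Equivalent result to
# the correlated subquery, but a single pass over the table.
rows = conn.execute("""
    WITH with_avg AS (
        SELECT name, department, salary,
               AVG(salary) OVER (PARTITION BY department) AS dept_avg
        FROM employees
    )
    SELECT name FROM with_avg
    WHERE salary > dept_avg
    ORDER BY name;
""").fetchall()
print([n for (n,) in rows])  # → ['Alice', 'Bob']
```

Same survivors as the trace table: Alice and Bob beat the eng average of 76,667; Carol falls below it, and Dan equals (does not exceed) the sales average.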
Common beginner mistakes
- Forgetting WITH RECURSIVE for a recursive CTE — non-RECURSIVE CTEs cannot self-reference; parse error.
- Missing the termination predicate in WITH RECURSIVE — infinite loop.
- Using LIMIT N for top-N-per-group — LIMIT is global; it does not respect partitions.
- Writing a correlated subquery without an index on the inner column — quadratic blow-up on large tables.
- Re-materializing the same CTE under different names — the planner may run it twice; inline a single shared CTE instead.
SQL Interview Question on Top 3 Salaries Per Department
Given employees(emp_id, name, department, salary), return the top 3 distinct salaries per department, with ties at rank 3 included. Output department, name, salary, and the rank.
Solution Using DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) in a CTE
WITH ranked AS (
SELECT department,
name,
salary,
DENSE_RANK() OVER (
PARTITION BY department
ORDER BY salary DESC
) AS dr
FROM employees
)
SELECT department, name, salary, dr
FROM ranked
WHERE dr <= 3
ORDER BY department, dr, name;
Why this works: the CTE ranked materializes a per-department DENSE_RANK keyed by salary descending — dr = 1 is the highest distinct salary in that department, dr = 2 the second-highest, and so on. The outer WHERE dr <= 3 keeps every row whose salary is among the top three distinct salaries of its department, including all ties at rank 3, and the ORDER BY produces a deterministic, reviewer-friendly output. DENSE_RANK wins over RANK because we want the top three distinct salaries, and over ROW_NUMBER because the spec requires ties at rank 3 to be retained.
Step-by-step trace for the sample input:
| emp_id | name | department | salary |
|---|---|---|---|
| 1 | Alice | eng | 90,000 |
| 2 | Bob | eng | 80,000 |
| 3 | Carol | eng | 80,000 |
| 4 | Dan | eng | 70,000 |
| 5 | Eve | eng | 60,000 |
| 6 | Frank | sales | 100,000 |
| 7 | Grace | sales | 90,000 |
| 8 | Heidi | sales | 80,000 |
- CTE ranked — partition by department; order by salary DESC.
- DENSE_RANK per partition — eng: Alice → 1, Bob → 2, Carol → 2, Dan → 3, Eve → 4. sales: Frank → 1, Grace → 2, Heidi → 3.
- Outer WHERE dr <= 3 — drops Eve (dr = 4); keeps both Bob and Carol (tied at 2) and Dan (3).
- ORDER BY department, dr, name — eng rows first, then sales; within department by dr, then name for tiebreak.
Output:
| department | name | salary | dr |
|---|---|---|---|
| eng | Alice | 90000 | 1 |
| eng | Bob | 80000 | 2 |
| eng | Carol | 80000 | 2 |
| eng | Dan | 70000 | 3 |
| sales | Frank | 100000 | 1 |
| sales | Grace | 90000 | 2 |
| sales | Heidi | 80000 | 3 |
Why this works — concept by concept:
- CTE ranked — names the intermediate ranked result; the outer query then filters it like a regular table. Far cleaner than a nested subquery.
- PARTITION BY department — restarts the rank at each department boundary; without this, the rank is global and the answer is wrong.
- ORDER BY salary DESC — defines "highest first" inside each partition; required for any deterministic ranking.
- DENSE_RANK over RANK — the spec wants the top three distinct salaries; RANK would skip after ties and miss the third distinct salary if there is a two-way tie above it.
- WHERE dr <= 3 in the outer query — window functions cannot be referenced in the WHERE of the same SELECT; the CTE provides the materialized column the outer query can filter on.
- O(N log N) time — the sort within each partition dominates; with an index on (department, salary DESC) the planner can stream rather than sort.
More practice on PipeCode: SQL CTE problems, SQL subquery problems, and SQL window-function problems.
Tips to crack SQL interviews for data engineering roles
Master the four primitives — joins, aggregates, windows, CTEs
If you can write LEFT JOIN ... IS NULL for orphans, GROUP BY ... HAVING COUNT(*) > 1 for duplicates, DENSE_RANK() OVER (PARTITION BY ...) for top-N-per-group, and a WITH RECURSIVE CTE for sequence generation without thinking — you can pass most fresher and mid-level data-engineering SQL rounds. These four primitives compose into 80% of the questions you will see; the remaining 20% is LAG/LEAD lookback, COALESCE null-safety, and date arithmetic.
Know the difference between WHERE, HAVING, and the OVER clause
WHERE filters individual rows before grouping; HAVING filters groups after aggregation; OVER (...) defines a window for a window function, which is evaluated after WHERE, GROUP BY, and HAVING but before ORDER BY. Window functions cannot be referenced in the WHERE of the same SELECT — wrap them in a CTE or subquery first. State the order of evaluation (FROM → WHERE → GROUP BY → HAVING → window → SELECT → ORDER BY → LIMIT) when an interviewer asks; it is graded as fundamental literacy.
Pick DENSE_RANK for "Nth distinct"; pick ROW_NUMBER for deduplication
The single most-graded ranking distinction: DENSE_RANK = N is the Nth distinct value; RANK = N is the Nth row in skip-aware ranking order; ROW_NUMBER = N is the Nth row in arbitrary order. For "second-highest distinct salary" → DENSE_RANK = 2. For "remove duplicate rows keeping the canonical one" → ROW_NUMBER() OVER (PARTITION BY key ORDER BY tiebreaker) = 1. State which one and why.
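The distinction is easiest to see on tied data. A minimal side-by-side of the three functions, run through Python's sqlite3 (whose window-function semantics match PostgreSQL for this query):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salaries (amount INTEGER)")
conn.executemany("INSERT INTO salaries VALUES (?)", [(90,), (80,), (80,), (70,)])

# One ordering, three numbering schemes over the same tie at 80.
rows = conn.execute("""
    SELECT amount,
           ROW_NUMBER() OVER (ORDER BY amount DESC) AS rn,
           RANK()       OVER (ORDER BY amount DESC) AS rk,
           DENSE_RANK() OVER (ORDER BY amount DESC) AS dr
    FROM salaries
    ORDER BY amount DESC, rn;
""").fetchall()
for amount, rn, rk, dr in rows:
    print(amount, rn, rk, dr)
# 90 1 1 1   <- unique top value: all three agree
# 80 2 2 2   <- first row of the tie
# 80 3 2 2   <- ROW_NUMBER keeps counting; RANK/DENSE_RANK repeat
# 70 4 4 3   <- RANK skips to 4; DENSE_RANK stays dense at 3
```

Note that dr = 3 lands on 70 (the third distinct value) while rk = 3 never occurs at all — exactly why "Nth distinct" questions need DENSE_RANK.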
Use LEFT JOIN ... IS NULL over NOT IN for anti-joins
NOT IN (subquery) returns zero rows if the subquery contains a single NULL because x NOT IN (..., NULL, ...) evaluates to NULL, which fails the WHERE predicate. LEFT JOIN ... WHERE right.id IS NULL and NOT EXISTS (...) are both immune to this. Production data engineers who have been bitten by this once never write NOT IN again. State the gotcha out loud.
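The gotcha is easy to reproduce. The sketch below uses Python's sqlite3, which follows the same three-valued NULL logic as PostgreSQL, with a hypothetical customers/orders pair:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (1), (NULL);   -- one NULL foreign key
""")

# NOT IN: the NULL makes every comparison UNKNOWN at best -> zero rows.
bad = conn.execute("""
    SELECT name FROM customers
    WHERE id NOT IN (SELECT customer_id FROM orders);
""").fetchall()

# NOT EXISTS: NULL-safe -> correctly finds the orderless customer.
good = conn.execute("""
    SELECT name FROM customers c
    WHERE NOT EXISTS (
        SELECT 1 FROM orders o WHERE o.customer_id = c.id
    );
""").fetchall()

print(bad)   # → []
print(good)  # → [('Bob',)]
```

Bob has no order, yet the NOT IN form returns nothing because 2 NOT IN (1, NULL) evaluates to UNKNOWN, not TRUE.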
Practice on PostgreSQL — it is the default dialect of most live coders
DataLemur, CoderPad, HackerRank's SQL practice, most product-analytics live screens, and most public SQL interview corpora use PostgreSQL syntax. Drill EXTRACT(...), INTERVAL '1 month', DATE_TRUNC, ::DATE casting, and COALESCE until they are reflexive. MySQL-only quirks (back-tick identifiers, the comma form LIMIT offset, count) and SQL-Server-only quirks (TOP N, OFFSET ... FETCH ...) are second priority — only drill them if you know the company uses that dialect.
Always ORDER BY and add a tiebreaker for determinism
Window functions, LIMIT N, and "top result" queries all require an ORDER BY with a deterministic tiebreaker (e.g., ORDER BY salary DESC, name). Without one, two runs of the same query can return different rows in the tie band — silently wrong in production and visibly wrong in an interview if the reviewer's reference answer locks an ordering. Always state your tiebreaker.
Read the data shape before writing the query
Before typing, ask: what is the grain of each table (one row per...)? What are the keys? Are there NULLs in the join columns? Is email indexed? The most common "almost passed" failure mode is correct happy-path SQL that breaks on the actual schema — duplicate primary keys, NULL foreign keys, off-by-one cardinality. A 30-second schema sweep prevents it.
Where to practice on PipeCode
Start with the SQL practice surface for the all-language SQL corpus. Drill the four-primitive pages: SQL joins, SQL aggregation, SQL window functions, SQL CTE, SQL subqueries. Add adjacent topics: SQL filtering, SQL null-handling, SQL date functions. The interview courses page bundles structured curricula. For broader coverage, browse by topic or read the top data engineering interview questions 2026 blog and the SQL data types in PostgreSQL guide.
Frequently Asked Questions
What SQL topics are most asked in data engineering interviews?
Four primitives carry the loop: joins (especially LEFT JOIN ... IS NULL for anti-joins), aggregations (GROUP BY plus HAVING, with COUNT(*) > 1 as the universal duplicate finder), window functions (ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD for ranking and lookback), and CTEs (WITH and WITH RECURSIVE for multi-step logic and sequence generation). Adjacent shapes — WHERE vs HAVING, NULL handling with COALESCE, dedup via ROW_NUMBER, and indexes — round out the typical fresher and mid-level loop.
What is the difference between RANK, DENSE_RANK, and ROW_NUMBER?
All three assign integers within a window. ROW_NUMBER gives every row a unique sequential integer (1, 2, 3, 4), even on ties. RANK gives tied rows the same rank but skips after them (1, 2, 2, 4). DENSE_RANK gives tied rows the same rank with no skip (1, 2, 2, 3). For "Nth distinct value" use DENSE_RANK = N; for "Nth row in skip-aware ranking order" use RANK = N; for "Nth row in arbitrary order" or "deduplicate keeping one canonical row" use ROW_NUMBER = 1.
What is the difference between WHERE and HAVING?
WHERE filters individual rows before the GROUP BY step and can reference only raw row columns. HAVING filters whole groups after the GROUP BY step and can reference aggregate functions like COUNT(*), SUM(col), AVG(col). Trying to use an aggregate in WHERE (e.g., WHERE COUNT(*) > 1) is a parse error because the aggregate does not yet exist. Both clauses can appear in the same query; WHERE runs first.
What is a CTE and when should I use one over a subquery?
A CTE (Common Table Expression, written as WITH name AS (SELECT ...)) is a named intermediate result you reference like a table in subsequent SELECTs. Use a CTE when the same intermediate is referenced multiple times, when the multi-step logic is deeply nested as a subquery, or when you need recursion (WITH RECURSIVE). Use a subquery when the intermediate is referenced exactly once and inlined cleanly. Modern PostgreSQL no longer materializes CTEs by default, so the performance gap has narrowed — pick by readability.
How do I find duplicate rows in SQL?
The canonical pattern is SELECT key, COUNT(*) FROM table GROUP BY key HAVING COUNT(*) > 1, which returns every distinct key value that appears more than once along with its multiplicity. To find duplicate (a, b, c) triples, use GROUP BY a, b, c. To delete duplicates while keeping one canonical row per key, use ROW_NUMBER() OVER (PARTITION BY key ORDER BY tiebreaker) inside a CTE and delete every row where ROW_NUMBER > 1.
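That delete pattern, sketched against Python's sqlite3 with a hypothetical users table (PostgreSQL would key the delete on the primary key or ctid rather than SQLite's rowid):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, email TEXT);
    INSERT INTO users VALUES
        (1, 'a@x.com'), (2, 'a@x.com'), (3, 'b@x.com');
""")

# Keep the lowest id per email (the canonical row); delete the rest.
# ROW_NUMBER numbers the rows within each email partition, and every
# row numbered > 1 is a duplicate.
conn.execute("""
    DELETE FROM users
    WHERE rowid IN (
        SELECT rowid FROM (
            SELECT rowid,
                   ROW_NUMBER() OVER (
                       PARTITION BY email ORDER BY id
                   ) AS rn
            FROM users
        )
        WHERE rn > 1
    );
""")
print(conn.execute("SELECT id, email FROM users ORDER BY id").fetchall())
# → [(1, 'a@x.com'), (3, 'b@x.com')]
```

Row 2 (the second 'a@x.com') is the only casualty; each email keeps exactly one canonical row.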
What is the best way to find the second-highest salary?
The cleanest answer is SELECT MAX(salary) FROM (SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS dr FROM employees) t WHERE dr = 2. DENSE_RANK = 2 reliably gives the second distinct salary even when ties exist at the top; the outer MAX(salary) collapses ties at rank 2 into a single value and returns NULL when no second distinct salary exists — both edge cases handled. Avoid LIMIT 1 OFFSET 1 — it double-counts tied top salaries unless you first apply DISTINCT, and it returns no row instead of NULL when there is no second salary.
Should I learn PostgreSQL or MySQL for SQL interviews?
PostgreSQL is the default dialect of most modern data-engineering interview platforms (DataLemur, CoderPad, most product-analytics live screens) because of its strong window-function support, EXTRACT / INTERVAL date arithmetic, and ::TYPE casting syntax. Drill PostgreSQL first; learn MySQL-specific quirks (back-tick identifiers, slightly different LIMIT syntax, no FULL OUTER JOIN at all) only if the company explicitly uses MySQL. Either way, the four primitives — joins, aggregates, windows, CTEs — are dialect-agnostic.