SQL joins are the single most-tested topic in data engineering interviews after window functions — every screening round eventually asks you to combine two tables, pick the right INNER, LEFT, RIGHT, or FULL OUTER flavour, and explain what happens to rows that fail the match. A join combines rows from two tables based on a match condition; INNER keeps only matching rows, OUTER joins (LEFT / RIGHT / FULL) preserve unmatched rows on one or both sides, and SELF / ANTI / SEMI joins are special patterns built on those primitives.
This guide walks through the six core join families (INNER, LEFT, RIGHT, FULL, CROSS, SELF) plus the two essential interview patterns (ANTI joins and the ON vs WHERE trap) the way real data engineering interview questions test them. Every section ends as sql interview questions with answers: a runnable PostgreSQL query, a traced execution, an output table, and a concept-by-concept why this works breakdown — the exact shape sql for data engineers rounds reward.
When you want hands-on reps immediately after reading, browse Joins SQL practice →, drill the LEFT JOIN lane →, sharpen anti-join patterns →, rehearse self-join problems →, or widen coverage on the full SQL practice library →.
On this page
- Why SQL joins matter in data engineering interviews
- INNER JOIN — keep only matching rows
- LEFT JOIN and RIGHT JOIN — preserve unmatched rows on one side
- FULL OUTER JOIN — keep everything from both sides
- CROSS JOIN — the Cartesian product
- SELF JOIN — joining a table to itself
- ANTI JOIN — rows in one table with no match in another
- ON vs WHERE — the join-condition vs filter trap
- Choosing the right join (cheat sheet)
- Frequently asked questions
- Practice on PipeCode
1. Why SQL joins matter in data engineering interviews
Joins decide which rows survive the combine — naming the family is half the answer
The one-sentence invariant interviewers chase: a SQL join combines rows from two tables based on a match condition; INNER keeps only matching rows, OUTER joins preserve unmatched rows on one or both sides, and SELF / ANTI / SEMI are special patterns built on top of the primitives. If you can say that aloud and pick the right family for each question, you have already cleared the bar most candidates fail to clear.
What a join does at a glance.
-
Definition — a clause that combines rows from two (or more) tables using a match condition expressed in
ONorUSING. -
Required syntax —
FROM left_table {INNER | LEFT | RIGHT | FULL | CROSS} JOIN right_table ON …. - Row count effect — depends entirely on the join family (see Cardinality below); INNER never inflates beyond the smaller side, OUTER never drops the preserved side, CROSS multiplies both.
-
NULL semantics —
NULL = NULLisUNKNOWN, notTRUE. Join conditions on nullable columns silently drop NULL-bearing rows unless you useIS NOT DISTINCT FROM(Postgres) or pre-filter.
The six core join families (the spine of every interview).
-
INNER JOIN— keep only rows that match on both sides. -
LEFT JOIN(aliasLEFT OUTER JOIN) — keep every left row plus matches from the right; unmatched right columns becomeNULL. -
RIGHT JOIN(aliasRIGHT OUTER JOIN) — mirror of LEFT; rarely used in practice. -
FULL OUTER JOIN— keep every row from both sides; unmatched cells becomeNULLon the missing side. -
CROSS JOIN— Cartesian product; every left row paired with every right row; noONclause. -
SELF JOIN— the same table joined to itself, aliased twice; a pattern rather than a new keyword.
Two special patterns built on top.
-
ANTI JOIN— "rows in L that have NO match in R"; phrased asNOT EXISTS,LEFT JOIN ... IS NULL, or (carefully)NOT IN. -
SEMI JOIN— "rows in L that DO have a match in R, but keep only L's columns"; phrased asEXISTSorIN.
Cardinality and fan-out (the gotcha that fails most candidates).
-
Fan-out — joining a one-to-many relationship and treating the inflated row count as if it were the original grain (one order spawning three rows after joining with
order_itemsdoubles yourSUM). -
Defence in depth:
- Verify the uniqueness of the join key on at least one side before aggregating.
- Aggregate inside a CTE first to collapse to one row per join key, then JOIN.
- State the grain of every CTE explicitly ("one row per
(user_id, day)").
- Many-to-many joins multiply rows on both sides; almost always a mistake unless you intend a Cartesian-like cross product.
What interviewers listen for.
- Do you name the join family aloud before writing it? ("This is a LEFT join because we want every customer even if they have zero orders.")
- Do you mention fan-out before joining a one-to-many table? — senior-signal.
- Do you distinguish
ONvsWHEREwhen the right side is nullable? — see §8. - Do you reach for
NOT EXISTS(notNOT IN) for anti-joins? — see §7.
Worked example — INNER vs LEFT on the same input
Question. Show every employee with their department name; first using INNER JOIN, then using LEFT JOIN, so the row-count difference is visible.
Input. employees(emp_id, name, dept_id) and departments(dept_id, dept_name).
| emp_id | name | dept_id |
|---|---|---|
| 1 | Ava | 10 |
| 2 | Omar | 10 |
| 3 | Lin | NULL |
| 4 | Sam | 20 |
| dept_id | dept_name |
|---|---|
| 10 | Retail |
| 20 | Insights |
Code (INNER).
SELECT e.emp_id, e.name, d.dept_name
FROM employees e
INNER JOIN departments d ON d.dept_id = e.dept_id;
Code (LEFT).
SELECT e.emp_id, e.name, d.dept_name
FROM employees e
LEFT JOIN departments d ON d.dept_id = e.dept_id;
Step-by-step explanation.
- For each employee row, the planner looks for matching
dept_idrows indepartments. - Ava and Omar match dept 10 → both kept by both joins.
- Sam matches dept 20 → kept by both joins.
- Lin has
dept_id = NULL—NULL = NULLisUNKNOWN, so no match: INNER drops Lin; LEFT keeps Lin withdept_name = NULL.
Output (INNER).
| emp_id | name | dept_name |
|---|---|---|
| 1 | Ava | Retail |
| 2 | Omar | Retail |
| 4 | Sam | Insights |
Output (LEFT).
| emp_id | name | dept_name |
|---|---|---|
| 1 | Ava | Retail |
| 2 | Omar | Retail |
| 3 | Lin | NULL |
| 4 | Sam | Insights |
Rule of thumb: if the question is "every employee, even orphans," reach for LEFT; if it's "only employees who actually have a department," reach for INNER.
SQL
Topic — joins
Joins SQL problems
SQL
Topic — multi-join
Multi-table join drills
2. INNER JOIN — keep only matching rows
Returns only rows where the match condition is true on both sides
INNER JOIN is the default join when you write just JOIN; the INNER keyword is optional but always preferred in interview code because it states the intent aloud. Use INNER when both sides must exist for the row to be meaningful (orders that actually have a customer, songs that actually have an artist, transactions that actually have a merchant).
INNER JOIN at a glance.
-
Syntax —
FROM L INNER JOIN R ON L.k = R.k. -
Output — every
(l_row, r_row)pair where theONpredicate isTRUE; rows with no match are dropped. -
INNERkeyword — optional;JOINalone meansINNER JOIN. -
NULLs in the join key — silently dropped (since
NULL = anythingisUNKNOWN). - Order of tables — does not change the result for INNER; left and right are interchangeable.
Cardinality rules — INNER never inflates beyond the smaller side… unless duplicates.
-
Upper bound —
|result| ≤ |L| × |R|(worst case when many duplicates). -
One-to-one join —
|result| ≤ min(|L|, |R|). -
One-to-many join —
|result|= number of child rows whose parent exists. -
Many-to-many join —
|result|can balloon fast; almost always a fan-out bug unless intentional.
EQUI JOIN vs THETA JOIN.
-
EQUI JOIN — join condition uses
=(e.g.ON a.k = b.k). The default in 99% of interview problems. -
THETA JOIN — join condition uses any comparison (
<,>,BETWEEN, range overlap). Used for price-tier matching, date-range overlap, time-series join onBETWEEN start_ts AND end_ts. - Planner cost — equi-joins typically run as hash or merge joins; theta joins often degenerate to nested loops (slow on large tables).
ON vs USING vs NATURAL — three ways to specify the join column.
-
ON L.col = R.col— most explicit; the two columns appear separately in the output. -
USING (col)— shorthand when the column has the same name on both sides; produces one merged column. -
NATURAL JOIN— joins on every column with a matching name in both tables; dangerous because schema drift can silently change the join surface — avoid in production code.
Use INNER JOIN when …
- Both sides must exist for the row to be meaningful.
- You are connecting a parent and a child where orphans should be dropped.
- You are filtering rows by existence in another table — though
EXISTS(semi-join) is usually clearer for that intent.
SQL interview question — list every employee with their department name
Assume employees(emp_id, name, dept_id) and departments(dept_id, dept_name). Return every employee with their department name; drop employees with no department.
Solution Using INNER JOIN
Code solution.
SELECT e.emp_id,
e.name,
d.dept_name
FROM employees e
INNER JOIN departments d ON d.dept_id = e.dept_id
ORDER BY d.dept_name, e.name;
Step-by-step trace.
| step | relation | outcome |
|---|---|---|
| 1 | Scan employees
|
4 candidate rows: Ava, Omar, Lin (NULL), Sam |
| 2 | Probe departments on dept_id
|
Ava, Omar, Sam find matches; Lin's NULL join key never matches |
| 3 | INNER join keeps matched pairs only | 3 rows survive |
| 4 | Project (emp_id, name, dept_name)
|
final shape |
| 5 | Outer ORDER BY d.dept_name, e.name
|
output ordering |
Output:
| emp_id | name | dept_name |
|---|---|---|
| 4 | Sam | Insights |
| 1 | Ava | Retail |
| 2 | Omar | Retail |
Why this works — concept by concept:
-
INNER JOIN semantics — drops every row that fails the
ONpredicate on either side; theNULL-bearing employee Lin disappears without warning. -
Equi-join on dept_id — both sides indexed (
departments.dept_idis the primary key) means the planner can use a hash join inΘ(|L| + |R|). -
Explicit
INNERkeyword — signals intent in interview code even thoughJOINalone would behave identically. -
Column projection — selecting only
(emp_id, name, dept_name)keeps the output narrow;SELECT *would include the duplicatedept_idcolumn from both sides. -
Cost — hash join
Θ(|employees| + |departments|)whendept_idis indexed; falls back toΘ(|employees| × |departments|)nested loop without indexes.
SQL
Topic — joins
INNER join interview drills
SQL
Topic — aggregation
Joins + aggregation combos
3. LEFT JOIN and RIGHT JOIN — preserve unmatched rows on one side
LEFT keeps every left row + matches; RIGHT is its mirror — and you almost never need RIGHT
LEFT JOIN (aliased LEFT OUTER JOIN) is the workhorse of analytics SQL — any time you want "every X, even if some are missing Y", reach for it. RIGHT JOIN is the symmetric mirror and is almost always re-expressible as LEFT JOIN with the tables swapped — most teams ban RIGHT JOIN from style guides for that reason.
LEFT JOIN at a glance.
-
Syntax —
FROM L LEFT JOIN R ON L.k = R.k. -
Output — every row of
Lpaired with the matching row ofR; if no match, theRcolumns becomeNULL. -
Cardinality —
|result| ≥ |L|; can exceed|L|ifRhas duplicate join keys. -
LEFTvsLEFT OUTER— identical; theOUTERkeyword is optional. - Order of tables matters — swap left and right and you swap the preserved side.
RIGHT JOIN at a glance — and why you rarely need it.
-
Syntax —
FROM L RIGHT JOIN R ON L.k = R.kkeeps every row ofR. -
Equivalent expression —
FROM R LEFT JOIN L ON L.k = R.kreads left-to-right and gives the same result. - Convention — most style guides standardize on LEFT JOIN; mention this aloud in interviews.
- MySQL / Postgres — both fully support RIGHT JOIN; the convention is preference, not compatibility.
Why LEFT is preferred over RIGHT.
- Readability — humans read left-to-right; LEFT JOIN naturally communicates "everything from the first table, plus matches from the second."
- Code review consistency — switching between LEFT and RIGHT in the same codebase is a frequent source of bugs.
-
Refactor-friendliness —
LEFT JOIN ... LEFT JOIN ... LEFT JOINchains compose cleanly; mixing LEFT and RIGHT confuses the chain.
Cardinality rules for LEFT JOIN.
-
Lower bound —
|result| ≥ |L|(every left row is preserved). -
Upper bound —
|result| ≤ |L| × max_dups_per_left_key_in_R(each left row can match multiple right rows). -
One-to-one join —
|result| = |L|. - One-to-many — left rows with multiple right matches appear multiple times (fan-out — see §1).
The WHERE right_col IS NULL trick — convert LEFT JOIN to anti-join.
-
Pattern —
LEFT JOIN R ON L.k = R.k WHERE R.k IS NULL. -
Semantics — keeps only left rows that had no match in
R; the textbook "anti-join via outer + NULL filter" idiom. -
Trade-off — works on every dialect; planners sometimes produce a worse plan than
NOT EXISTS(see §7). - Common interview use — "find customers who never placed an order" or "find pages with zero likes."
SQL interview question — list every employee with their department name, keeping employees who have no department
Assume employees(emp_id, name, dept_id) and departments(dept_id, dept_name). Return every employee with their department name; employees with no department should appear with NULL in dept_name.
Solution Using LEFT JOIN
Code solution.
SELECT e.emp_id,
e.name,
d.dept_name
FROM employees e
LEFT JOIN departments d ON d.dept_id = e.dept_id
ORDER BY e.emp_id;
Step-by-step trace.
| step | relation | outcome |
|---|---|---|
| 1 | Scan employees
|
4 candidate rows: Ava, Omar, Lin (NULL), Sam |
| 2 | Probe departments on dept_id
|
Ava, Omar match dept 10; Sam matches dept 20; Lin has NULL — no match |
| 3 | LEFT join preserves Lin | 4 rows survive (Lin's d.dept_name becomes NULL) |
| 4 | Outer ORDER BY e.emp_id
|
stable output ordering |
Output:
| emp_id | name | dept_name |
|---|---|---|
| 1 | Ava | Retail |
| 2 | Omar | Retail |
| 3 | Lin | NULL |
| 4 | Sam | Insights |
Why this works — concept by concept:
-
LEFT JOIN preserves every left row — Lin's
NULL-bearingdept_idwould have been dropped by INNER; LEFT keeps her and insertsNULLfor the right-side columns. -
Direction matters —
employeesis on the left because we want every employee preserved; reversing the tables would preserve every department instead. -
No filter on the right side — moving
WHERE d.dept_name IS NOT NULLinto the query would silently degrade this to INNER JOIN (see §8 for why). -
Stable
ORDER BY—emp_idis unique, so the output order is deterministic without further tiebreakers. -
Cost — hash join
Θ(|employees| + |departments|)whendept_idis indexed; otherwise nested loopΘ(|employees| × |departments|).
SQL
Topic — left-join
LEFT JOIN interview problems
SQL
Topic — joins
All-join SQL drills
4. FULL OUTER JOIN — keep everything from both sides
FULL OUTER preserves every row from both tables — perfect for reconciliation reports
FULL OUTER JOIN keeps every row from both tables, filling NULL on whichever side lacks a match. Use it when you need to reconcile two systems and identify rows that exist in one but not the other.
FULL OUTER JOIN at a glance.
-
Syntax —
FROM L FULL OUTER JOIN R ON L.k = R.k. -
Output — three buckets in one result:
- Matched rows (
L_cols+R_cols). - Left-only rows (
L_cols+NULLs on the right). - Right-only rows (
NULLs on the left +R_cols).
- Matched rows (
-
FULLvsFULL OUTER— identical; theOUTERkeyword is optional. - Order of tables — does not matter (symmetric).
Reconciliation use case.
- "Show me every
order_idthat exists in either the orders system or the deliveries system." - "Identify SKUs present in the inventory snapshot but missing from the catalog (or vice versa)."
- "Build a unified view of users across two CRM systems before merging."
IS NULL filters expose the three buckets.
-
WHERE L.k IS NULL— keeps right-only rows (orphans on the right). -
WHERE R.k IS NULL— keeps left-only rows (orphans on the left). -
WHERE L.k IS NOT NULL AND R.k IS NOT NULL— keeps only the matched rows (equivalent to INNER JOIN). -
No
WHERE— keeps all three buckets.
MySQL note (dialect quirk).
- MySQL does not support
FULL OUTER JOINdirectly. - Emulate with
LEFT JOIN ... UNION ALL ... RIGHT JOIN ... WHERE L.k IS NULLto combine left-only + matched + right-only. - PostgreSQL, SQL Server, Oracle, BigQuery, Snowflake, Redshift all support it natively.
SQL interview question — reconcile orders against deliveries
Assume orders(order_id, customer_id) and deliveries(delivery_id, order_id, status). Return every order_id that exists in either table; flag whether the row came from orders, deliveries, or both.
Solution Using FULL OUTER JOIN
Code solution.
SELECT COALESCE(o.order_id, d.order_id) AS order_id,
o.customer_id,
d.status,
CASE
WHEN o.order_id IS NULL THEN 'delivery_only'
WHEN d.order_id IS NULL THEN 'order_only'
ELSE 'matched'
END AS source
FROM orders o
FULL OUTER JOIN deliveries d ON d.order_id = o.order_id
ORDER BY 1;
Step-by-step trace.
| step | relation | outcome |
|---|---|---|
| 1 | Scan orders and deliveries
|
both source tables read |
| 2 | FULL OUTER join on order_id
|
every order_id from either side preserved; mismatched cells are NULL
|
| 3 | COALESCE(o.order_id, d.order_id) |
merges the two order_id columns into one canonical value |
| 4 |
CASE on which side is NULL
|
tags each row as order_only, delivery_only, or matched
|
| 5 | Outer ORDER BY 1
|
stable output ordering on the merged order_id
|
Output:
| order_id | customer_id | status | source |
|---|---|---|---|
| 1001 | 42 | shipped | matched |
| 1002 | 17 | NULL | order_only |
| 1003 | NULL | delivered | delivery_only |
Why this works — concept by concept:
- FULL OUTER semantics — preserves every row from both sides; the two "orphan" buckets are exactly what a reconciliation report needs.
-
COALESCE for the canonical key —
o.order_idisNULLfor delivery-only rows and vice versa;COALESCEpicks whichever is non-null. -
CASE for source labelling — tagging each row with its origin lets downstream consumers filter
WHERE source = 'order_only'to isolate one bucket. -
No
WHEREclause — keeps all three buckets; addingWHERE source = 'matched'would degrade the query to an INNER JOIN. -
Cost — hash join
Θ(|orders| + |deliveries|)on indexedorder_id; the post-join CASE isΘ(n)over the result.
SQL
Topic — joins
FULL OUTER join drills
SQL
Topic — multi-join
Reconciliation problems
5. CROSS JOIN — the Cartesian product
CROSS JOIN multiplies every row from L by every row from R — useful for grids, dangerous when accidental
CROSS JOIN has no ON clause; it returns the Cartesian product of the two tables. Used deliberately, it builds dense calendars and complete grids; used accidentally, it produces query results that crash dashboards.
CROSS JOIN at a glance.
-
Syntax —
FROM L CROSS JOIN R(noON). -
Output size —
|L| × |R|rows. -
Implicit form —
FROM L, Ris the same asCROSS JOIN; this is why forgettingONis the canonical Cartesian bug. -
Use case — dense grids: every
(date × store)slot for a missing-day report, every(product × region)combination for a pivot, every(user × cohort)slot for funnel analysis.
Use CROSS JOIN deliberately for dense calendars and grids.
- Generate a calendar dimension:
generate_series('2026-01-01', '2026-12-31', INTERVAL '1 day')thenCROSS JOIN storesto get one row per(day, store). - Test fixtures:
CROSS JOINtwo small dimension tables to create exhaustive combinations. - Dense pivot scaffolds:
CROSS JOINthe row dimension with the column dimension, then LEFT JOIN the fact table to fill in actuals (missing combinations show asNULL).
Trap — the accidental Cartesian.
- Forgetting the
ONclause on a regular JOIN — older SQL syntaxFROM a, bwithout aWHERE a.k = b.kproduces a Cartesian. - A typo like
ON a.k = a.k(instead ofa.k = b.k) joins every row to every row. - Result size explodes —
|a| × |b|can be 100M rows from two 10k tables.
CROSS JOIN LATERAL (Postgres bonus).
-
CROSS JOIN LATERAL (SELECT … FROM r WHERE r.fk = L.id LIMIT 5)— for each left row, run a parameterised subquery that can reference left columns. - Useful for "top-N children per parent" patterns without window functions.
- Snowflake / BigQuery offer
LATERAL/UNNESTwith similar semantics.
Worked example — dense (date × store) grid for a sales report
Question. Generate a row for every (date, store_id) combination over the first three days of January and the two stores in stores, then LEFT JOIN the sales table to fill in revenue (zeros for missing combinations).
Input. stores(store_id) = {S1, S2}; sales(sale_date, store_id, revenue) has a row only for (2026-01-01, S1, 500).
Code.
WITH calendar AS (
SELECT generate_series('2026-01-01'::date, '2026-01-03'::date, INTERVAL '1 day')::date AS d
),
grid AS (
SELECT c.d, s.store_id
FROM calendar c
CROSS JOIN stores s
)
SELECT g.d, g.store_id, COALESCE(s.revenue, 0) AS revenue
FROM grid g
LEFT JOIN sales s ON s.sale_date = g.d AND s.store_id = g.store_id
ORDER BY g.d, g.store_id;
Step-by-step explanation.
-
calendargenerates 3 dates. -
CROSS JOIN storesmultiplies — 3 dates × 2 stores = 6 rows ingrid. - LEFT JOIN
salesmatches on(d, store_id); only(2026-01-01, S1)has data. -
COALESCEconverts missing-revenueNULLs to0.
Output.
| d | store_id | revenue |
|---|---|---|
| 2026-01-01 | S1 | 500 |
| 2026-01-01 | S2 | 0 |
| 2026-01-02 | S1 | 0 |
| 2026-01-02 | S2 | 0 |
| 2026-01-03 | S1 | 0 |
| 2026-01-03 | S2 | 0 |
Rule of thumb: if the answer requires "a row for every combination, even the empty ones," CROSS JOIN + LEFT JOIN is the canonical pattern.
SQL
Topic — cross-join
CROSS JOIN practice
SQL
Topic — joins
All-join SQL library
6. SELF JOIN — joining a table to itself
SELF JOIN is a pattern, not a keyword — alias the table twice and join it to itself
A SELF JOIN is not a new SQL keyword; it's the pattern of joining a table to itself, aliasing it twice so the planner can resolve column references unambiguously. Use it for hierarchies (employee → manager), pairwise comparisons (find users with overlapping tags), and sequential events (gap detection) — though modern SQL often replaces SELF JOINs with LAG window functions for cleaner code.
SELF JOIN at a glance.
- Definition — a join where both sides are the same table, distinguished by aliases.
-
Syntax —
FROM employees e1 JOIN employees e2 ON e2.manager_id = e1.emp_id. -
Keyword — there is no
SELF JOINkeyword; you simply alias the table twice and use INNER / LEFT / FULL as needed. -
Output — pairs of rows from the same table where the
ONcondition holds.
Why aliasing is mandatory.
- Without aliases, references like
employees.emp_idare ambiguous (which copy?). - Conventional aliases —
e1/e2,parent/child,prev/curr— signal intent. - Same alias rules apply to chained self-joins (
e1,e2,e3) for multi-level hierarchies.
Use cases for SELF JOIN.
- Hierarchies — employee → manager, comment → parent_comment, file → parent_directory.
- Pairwise comparisons — find users with the same tag, products with the same SKU prefix, transactions with the same amount.
-
Sequential events — match each row with the previous row by
id - 1orevent_ts < self.event_ts ORDER BY ts DESC LIMIT 1. -
Tag-matching —
taggings t1 JOIN taggings t2 ON t1.tag_id = t2.tag_id AND t1.user_id < t2.user_idto enumerate pairs.
SELF JOIN vs window functions — when to prefer which.
- SELF JOIN — wins when the relationship is structural (parent-child hierarchies, peer pairs).
-
LAG/LEAD— wins when the relationship is ordered-neighbour (previous-row / next-row on a sorted partition); see Blog70 §5. - Recursive CTE — wins for multi-level hierarchies (full org-chart traversal); see Blog44 §6.
-
Cost — SELF JOIN is
Θ(n²)worst-case without indexes;LAGisΘ(n log n)for the sort plusΘ(n)for the scan.
SQL interview question — for every employee, show their manager's name
Assume employees(emp_id, name, manager_id). manager_id is NULL for top-of-org employees. Return every employee with their name and their manager's name; include employees who have no manager (their manager column should be NULL).
Solution Using SELF JOIN
Code solution.
SELECT e.emp_id,
e.name AS employee,
m.name AS manager
FROM employees e
LEFT JOIN employees m ON m.emp_id = e.manager_id
ORDER BY e.emp_id;
Step-by-step trace.
| step | relation | outcome |
|---|---|---|
| 1 | Scan employees AS e
|
every employee row in the output |
| 2 | Scan employees AS m aliased as "manager" |
second copy of the same table |
| 3 | LEFT JOIN ON m.emp_id = e.manager_id |
each employee's manager_id looked up against emp_id on the manager copy |
| 4 | Top-of-org rows (manager_id IS NULL) |
preserved by LEFT JOIN with m.name = NULL
|
| 5 | Outer ORDER BY e.emp_id
|
deterministic output |
Output:
| emp_id | employee | manager |
|---|---|---|
| 1 | Quinn | NULL |
| 2 | Ravi | Quinn |
| 3 | Lin | Ravi |
| 4 | Ava | Ravi |
Why this works — concept by concept:
-
Self join via aliasing —
employees eandemployees mare two logical references to the same physical table; the planner treats them independently for the join. -
LEFT JOIN preserves top-of-org — Quinn has
manager_id = NULL; INNER JOIN would drop her, LEFT keeps her withm.name = NULL. -
Alias renaming in
SELECT—e.name AS employeeandm.name AS managerkeep the output readable even though both columns areemployees.name. -
Direction of the join —
m.emp_id = e.manager_idmeans "find the manager's row whoseemp_idmatches this employee'smanager_id." -
Cost — hash join
Θ(|employees|)per side whenemp_idis indexed; nested loopΘ(|employees|²)if not.
SQL
Topic — self-join
SELF JOIN interview problems
SQL
Topic — joins
Joins SQL library
7. ANTI JOIN — rows in one table with no match in another
ANTI JOIN keeps rows in L with no match in R — three phrasings, one preferred
ANTI JOIN answers the canonical interview question "find rows in table A that do NOT exist in table B." It is not a separate SQL keyword in most dialects; you write it as NOT EXISTS, LEFT JOIN ... IS NULL, or (carefully) NOT IN. All three are logically equivalent in spirit, but their NULL handling and planner cost differ enough that picking the right phrasing is itself an interview signal.
ANTI JOIN at a glance.
-
Definition — return rows in
Lthat have no matching row inRunder theONpredicate. - Inverse of SEMI JOIN — SEMI keeps "L rows that have a match"; ANTI keeps "L rows that do not."
-
Keyword support — DuckDB has
ANTI JOIN; Snowflake / BigQuery / Postgres rely onNOT EXISTSorLEFT JOIN ... IS NULL. -
Cardinality —
|result| ≤ |L|; never inflates.
Three equivalent phrasings — and their NULL behaviour.
-
WHERE NOT EXISTS (SELECT 1 FROM R WHERE R.k = L.k)— the preferred phrasing; planners produce a clean anti-semi-join plan, and NULLs inR.kare handled correctly (they fail the=predicate and don't disturb the result). -
LEFT JOIN R ON L.k = R.k WHERE R.k IS NULL— the "outer + NULL filter" idiom; works on every dialect; planner may produce a slower plan because it materializes the LEFT JOIN before filtering. -
WHERE L.k NOT IN (SELECT k FROM R)— avoid unlessR.kis guaranteedNOT NULL. If any row inR.kisNULL, the whole query returns empty becausex NOT IN (..., NULL)evaluates toUNKNOWNfor everyx.
Why NOT EXISTS is the preferred phrasing.
-
Plan quality — most planners (Postgres, SQL Server, Oracle, Snowflake) emit a dedicated anti-semi-join operator that short-circuits on the first match found in
R. -
NULL safety — the correlated
=predicate treatsNULL = anythingasUNKNOWN, which is filtered out, so the outer row survives correctly. -
Reads cleanly — "where there is not a row in
Rmatching this one" mirrors the spoken intent. -
Composes with multiple conditions —
WHERE NOT EXISTS (SELECT 1 FROM R WHERE R.k = L.k AND R.status = 'active')is straightforward.
The NOT IN NULL trap (the canonical interview gotcha).
-
Setup —
WHERE L.k NOT IN (SELECT k FROM R). -
Trap — if any row in
R.kisNULL, the subquery returns{..., NULL}. ThenL.k NOT IN (..., NULL)isUNKNOWNfor everyL.k, and the whole query returns zero rows. -
Fix — either use
NOT EXISTS, or filter the subquery:WHERE L.k NOT IN (SELECT k FROM R WHERE k IS NOT NULL). -
Why interviewers love this — it's a silent bug; the query "works" until production data introduces a
NULL.
Use cases for ANTI JOIN.
- Customers who never placed an order.
- Pages with zero likes (DataLemur Meta interview question).
- Subscribers with no recent activity.
- Inventory items missing from the catalog.
- Users who have not yet completed onboarding.
SQL interview question — find Facebook pages with zero likes
Assume pages(page_id, page_name) and page_likes(page_id, user_id). Return the page_id of every page that has not received a single like.
Solution Using NOT EXISTS
Code solution.
SELECT p.page_id,
p.page_name
FROM pages p
WHERE NOT EXISTS (
SELECT 1
FROM page_likes pl
WHERE pl.page_id = p.page_id
)
ORDER BY p.page_id;
Step-by-step trace.
| step | relation | outcome |
|---|---|---|
| 1 | Scan pages p
|
every candidate page row |
| 2 | For each p, probe page_likes for any row with pl.page_id = p.page_id
|
match found → EXISTS is TRUE → NOT EXISTS is FALSE → row dropped |
| 3 | No match found → NOT EXISTS is TRUE → row kept |
only pages with zero likes survive |
| 4 | Outer ORDER BY p.page_id
|
stable output ordering |
Output:
| page_id | page_name |
|---|---|
| 22 | abandoned-page |
| 45 | new-page-no-likes-yet |
Why this works — concept by concept:
-
NOT EXISTS for anti-semi-join — the correlated subquery short-circuits on the first match; planners emit a dedicated anti-semi-join operator that beats
LEFT JOIN ... IS NULLin most cases. -
NULL safety —
pl.page_id = p.page_idreturnsUNKNOWNforNULLkeys, which is filtered out; the outer page row survives correctly without any extraIS NOT NULLfilter. -
Why not
NOT IN— ifpage_likes.page_idever containsNULL,NOT INwould silently return zero rows;NOT EXISTSis robust. -
SELECT 1 inside the subquery — convention; the column list doesn't matter for
EXISTS, soSELECT 1signals "I only care about existence." -
Cost — anti-semi-join
Θ(|pages| + |page_likes|)whenpage_likes.page_idis indexed; otherwiseΘ(|pages| × |page_likes|)worst case.
SQL
Topic — anti-join
ANTI JOIN interview problems
SQL
Topic — joins
All-join SQL drills
8. ON vs WHERE — the join-condition vs filter trap
ON runs during the join; WHERE runs after — and for OUTER joins, the choice changes the result
The ON clause defines the join condition; the WHERE clause filters the join's result. For INNER joins, the choice does not matter — both clauses produce the same final rows. For OUTER joins, the choice silently changes the result, and most interview rejections happen because a candidate moved a predicate from ON to WHERE without realising they had downgraded a LEFT JOIN to an INNER JOIN.
ON runs during the join.
-
Role — controls which
(L, R)row pairs are considered a "match." - Timing — applied while the join is being constructed.
-
For OUTER joins — unmatched rows on the preserved side are kept with
NULLs; predicates inONagainst the non-preserved side do not drop preserved rows.
WHERE runs after the join completes.
- Role — filters the final join result.
- Timing — applied after the join has produced its full output (matched rows + preserved orphans).
-
For OUTER joins — predicates against the non-preserved side do drop preserved rows because their value is
NULL(andNULL op anythingisUNKNOWN).
For INNER joins — ON and WHERE are equivalent.
- Both clauses run on the same set of rows.
- Convention: put the join-key equality in
ONand downstream filters inWHEREso reviewers can spot intent. - The optimiser typically rewrites them to the same plan.
For OUTER joins — placement changes the result.
-
Right-side predicate in
ON— preserves unmatched left rows (right columns stayNULL). -
Same predicate in
WHERE— drops unmatched left rows (becauseNULL = xisUNKNOWN), effectively silently degrading the LEFT JOIN to an INNER JOIN. -
Practical rule — if you want a row to survive even when the right side has no match, the predicate against the right side must stay in
ON.
Cardinality and fan-out reminder.
- Always verify the uniqueness of the join key on at least one side before aggregating after a join.
- One-to-many joins inflate row counts; aggregating without dedup multiplies your sums.
- The safest pattern: aggregate inside a CTE first to collapse to one row per join key, then JOIN.
Worked example — active=true predicate moves between ON and WHERE
Question. List every employee with their department name, but include the department name only when the department is active; employees should appear even if their department is inactive or missing.
Input. employees(emp_id, name, dept_id) and departments(dept_id, dept_name, active). Suppose Ava is in dept 10 (active), Omar is in dept 10 (active), Sam is in dept 20 (inactive), Lin has dept_id = NULL.
Code (correct — predicate in ON).
SELECT e.emp_id, e.name, d.dept_name
FROM employees e
LEFT JOIN departments d
ON d.dept_id = e.dept_id
AND d.active = TRUE
ORDER BY e.emp_id;
Code (subtly wrong — predicate in WHERE).
SELECT e.emp_id, e.name, d.dept_name
FROM employees e
LEFT JOIN departments d ON d.dept_id = e.dept_id
WHERE d.active = TRUE
ORDER BY e.emp_id;
Step-by-step explanation.
- The correct version applies
active = TRUEwhile matching — Sam (dept 20, inactive) and Lin (no dept) both survive the join withdept_name = NULL. - The wrong version applies
active = TRUEafter the join — Sam'sd.activeisFALSE(filtered), and Lin'sd.activeisNULL(NULL = TRUEisUNKNOWN, filtered). Both rows disappear, silently degrading LEFT JOIN to INNER JOIN.
Output (correct).
| emp_id | name | dept_name |
|---|---|---|
| 1 | Ava | Retail |
| 2 | Omar | Retail |
| 3 | Lin | NULL |
| 4 | Sam | NULL |
Output (wrong).
| emp_id | name | dept_name |
|---|---|---|
| 1 | Ava | Retail |
| 2 | Omar | Retail |
Rule of thumb: for OUTER joins, predicates against the non-preserved side belong in ON; predicates against the preserved side belong in WHERE. State this aloud in the interview.
SQL
Topic — joins
ON vs WHERE join drills
SQL
Topic — left-join
LEFT JOIN problems
Choosing the right join (cheat sheet)
| You want … | Reach for … | Watch out for |
|---|---|---|
| Only rows that exist on both sides | INNER JOIN |
NULL keys are silently dropped |
| Every row from one side + matches from the other |
LEFT JOIN (rare: RIGHT JOIN) |
Fan-out when the right side has duplicate join keys |
| Every row from both sides — reconciliation | FULL OUTER JOIN |
Not supported in MySQL; emulate with LEFT UNION ALL RIGHT WHERE … IS NULL
|
| Cartesian product — dense grids | CROSS JOIN |
Result size = ` |
| Same table joined to itself | {% raw %}SELF JOIN (aliased twice) |
LAG / LEAD often clearer for ordered-neighbour problems |
| Rows in L with NO match in R |
NOT EXISTS (preferred), LEFT JOIN ... IS NULL
|
NOT IN is buggy if R has NULLs |
| Rows in L with a match in R, keep only L cols |
EXISTS (semi-join) |
IN works but may fan out duplicates if you accidentally JOIN |
Range / overlap match (BETWEEN, <, >) |
Theta join — INNER JOIN ON L.start <= R.t AND R.t < L.end
|
Falls back to nested loop; index range carefully |
Frequently asked questions
What's the difference between INNER JOIN and OUTER JOIN?
INNER JOIN keeps only rows that match on both sides — anything unmatched on either side is dropped. OUTER JOIN is an umbrella for LEFT, RIGHT, and FULL OUTER, each of which preserves unmatched rows on a specific side and fills the missing columns with NULL. Use INNER when both sides must exist for the row to be meaningful (e.g. orders that have a real customer); use an OUTER variant when you need "everything from one or both sides, even orphans" (e.g. all customers, including those with zero orders).
What's the difference between LEFT JOIN and RIGHT JOIN in SQL?
LEFT JOIN returns every row from the left table plus matching rows from the right; unmatched right columns are NULL. RIGHT JOIN is the mirror — every right row plus matches from the left. They are symmetric: A LEFT JOIN B ON … is identical to B RIGHT JOIN A ON …, so most teams standardise on LEFT JOIN for left-to-right readability and ban RIGHT JOIN from their style guide. The interview-grade soundbite: "If you ever reach for RIGHT JOIN, swap the table order and use LEFT JOIN instead."
What's the difference between the ON clause and the WHERE clause in a SQL join?
ON runs during the join and decides which (L, R) row pairs count as a match; WHERE runs after the join produces its output and filters the final result. For INNER joins the two clauses produce the same answer, but for OUTER joins their placement changes the result: a predicate against the non-preserved side in ON keeps the preserved rows (filling the predicate columns with NULL), while the same predicate in WHERE drops those rows (since NULL op anything is UNKNOWN), silently degrading the OUTER join to an INNER join.
How do I write a SELF JOIN in SQL?
There is no SELF JOIN keyword; you write a regular JOIN (INNER, LEFT, etc.) using the same table twice with different aliases so the planner can disambiguate column references. For example, FROM employees e LEFT JOIN employees m ON m.emp_id = e.manager_id matches each employee row to the manager row inside the same employees table. Use SELF JOIN for hierarchies (employee → manager), pairwise comparisons inside one table, or sequential events when window functions like LAG are not available or not as clear.
How do I write an anti-join — NOT EXISTS vs LEFT JOIN IS NULL?
Three phrasings are equivalent in spirit: WHERE NOT EXISTS (SELECT 1 FROM R WHERE R.k = L.k), LEFT JOIN R ON L.k = R.k WHERE R.k IS NULL, and WHERE L.k NOT IN (SELECT k FROM R). Prefer NOT EXISTS — it emits a clean anti-semi-join plan, is robust to NULLs in R.k, and reads as "where there is not a row in R matching this one." LEFT JOIN ... IS NULL works everywhere but may produce a worse plan. NOT IN is the trap to avoid: if R.k ever contains NULL, the entire query returns zero rows because x NOT IN (..., NULL) is UNKNOWN.
Practice on PipeCode
PipeCode ships 450+ data-engineering interview problems — including PostgreSQL-first SQL practice keyed to joins, anti-join, self-join, aggregation + joins, and multi-table reconciliation patterns.
Kick off via Explore practice →; drill the dedicated joins SQL lane →; fan out into LEFT JOIN problems →; reinforce anti-join patterns →; rehearse SELF JOIN drills →; widen coverage on the full SQL practice library →.





Top comments (0)