Gowtham Potureddi

Posted on May 22

SQL Joins Interview Questions: INNER, LEFT, RIGHT, FULL, SELF & ANTI Joins

#python #sql #interview #dataengineering

SQL joins are the single most-tested topic in data engineering interviews after window functions — every screening round eventually asks you to combine two tables, pick the right INNER, LEFT, RIGHT, or FULL OUTER flavour, and explain what happens to rows that fail the match. A join combines rows from two tables based on a match condition; INNER keeps only matching rows, OUTER joins (LEFT / RIGHT / FULL) preserve unmatched rows on one or both sides, and SELF / ANTI / SEMI joins are special patterns built on those primitives.

This guide walks through the six core join families (INNER, LEFT, RIGHT, FULL, CROSS, SELF) plus the two essential interview patterns (ANTI joins and the ON vs WHERE trap) the way real data engineering interview questions test them. Every section ends as sql interview questions with answers: a runnable PostgreSQL query, a traced execution, an output table, and a concept-by-concept why this works breakdown — the exact shape sql for data engineers rounds reward.

When you want hands-on reps immediately after reading, browse Joins SQL practice →, drill the LEFT JOIN lane →, sharpen anti-join patterns →, rehearse self-join problems →, or widen coverage on the full SQL practice library →.

On this page

Why SQL joins matter in data engineering interviews
INNER JOIN — keep only matching rows
LEFT JOIN and RIGHT JOIN — preserve unmatched rows on one side
FULL OUTER JOIN — keep everything from both sides
CROSS JOIN — the Cartesian product
SELF JOIN — joining a table to itself
ANTI JOIN — rows in one table with no match in another
ON vs WHERE — the join-condition vs filter trap
Choosing the right join (cheat sheet)
Frequently asked questions
Practice on PipeCode

1. Why SQL joins matter in data engineering interviews

Joins decide which rows survive the combine — naming the family is half the answer

The one-sentence invariant interviewers chase: a SQL join combines rows from two tables based on a match condition; INNER keeps only matching rows, OUTER joins preserve unmatched rows on one or both sides, and SELF / ANTI / SEMI are special patterns built on top of the primitives. If you can say that aloud and pick the right family for each question, you have already cleared the bar most candidates fail to clear.

What a join does at a glance.

Definition — a clause that combines rows from two (or more) tables using a match condition expressed in ON or USING.
Required syntax — FROM left_table {INNER | LEFT | RIGHT | FULL | CROSS} JOIN right_table ON ….
Row count effect — depends entirely on the join family (see Cardinality below); INNER never inflates beyond the smaller side, OUTER never drops the preserved side, CROSS multiplies both.
NULL semantics — NULL = NULL is UNKNOWN, not TRUE. Join conditions on nullable columns silently drop NULL-bearing rows unless you use IS NOT DISTINCT FROM (Postgres) or pre-filter.

The six core join families (the spine of every interview).

INNER JOIN — keep only rows that match on both sides.
LEFT JOIN (alias LEFT OUTER JOIN) — keep every left row plus matches from the right; unmatched right columns become NULL.
RIGHT JOIN (alias RIGHT OUTER JOIN) — mirror of LEFT; rarely used in practice.
FULL OUTER JOIN — keep every row from both sides; unmatched cells become NULL on the missing side.
CROSS JOIN — Cartesian product; every left row paired with every right row; no ON clause.
SELF JOIN — the same table joined to itself, aliased twice; a pattern rather than a new keyword.

Two special patterns built on top.

ANTI JOIN — "rows in L that have NO match in R"; phrased as NOT EXISTS, LEFT JOIN ... IS NULL, or (carefully) NOT IN.
SEMI JOIN — "rows in L that DO have a match in R, but keep only L's columns"; phrased as EXISTS or IN.

Cardinality and fan-out (the gotcha that fails most candidates).

Fan-out — joining a one-to-many relationship and treating the inflated row count as if it were the original grain (one order spawning three rows after joining with order_items doubles your SUM).
Defence in depth:
- Verify the uniqueness of the join key on at least one side before aggregating.
- Aggregate inside a CTE first to collapse to one row per join key, then JOIN.
- State the grain of every CTE explicitly ("one row per (user_id, day)").
Many-to-many joins multiply rows on both sides; almost always a mistake unless you intend a Cartesian-like cross product.

What interviewers listen for.

Do you name the join family aloud before writing it? ("This is a LEFT join because we want every customer even if they have zero orders.")
Do you mention fan-out before joining a one-to-many table? — senior-signal.
Do you distinguish ON vs WHERE when the right side is nullable? — see §8.
Do you reach for NOT EXISTS (not NOT IN) for anti-joins? — see §7.

Worked example — INNER vs LEFT on the same input

Question. Show every employee with their department name; first using INNER JOIN, then using LEFT JOIN, so the row-count difference is visible.

Input. employees(emp_id, name, dept_id) and departments(dept_id, dept_name).

emp_id	name	dept_id
1	Ava	10
2	Omar	10
3	Lin	NULL
4	Sam	20

dept_id	dept_name
10	Retail
20	Insights

Code (INNER).

SELECT e.emp_id, e.name, d.dept_name
FROM employees e
INNER JOIN departments d ON d.dept_id = e.dept_id;

Code (LEFT).

SELECT e.emp_id, e.name, d.dept_name
FROM employees e
LEFT JOIN departments d ON d.dept_id = e.dept_id;

Step-by-step explanation.

For each employee row, the planner looks for matching dept_id rows in departments.
Ava and Omar match dept 10 → both kept by both joins.
Sam matches dept 20 → kept by both joins.
Lin has dept_id = NULL — NULL = NULL is UNKNOWN, so no match: INNER drops Lin; LEFT keeps Lin with dept_name = NULL.

Output (INNER).

emp_id	name	dept_name
1	Ava	Retail
2	Omar	Retail
4	Sam	Insights

Output (LEFT).

emp_id	name	dept_name
1	Ava	Retail
2	Omar	Retail
3	Lin	NULL
4	Sam	Insights

Rule of thumb: if the question is "every employee, even orphans," reach for LEFT; if it's "only employees who actually have a department," reach for INNER.

SQL
Topic — joins
Joins SQL problems

Practice →

SQL
Topic — multi-join
Multi-table join drills

Practice →

2. INNER JOIN — keep only matching rows

Returns only rows where the match condition is true on both sides

INNER JOIN is the default join when you write just JOIN; the INNER keyword is optional but always preferred in interview code because it states the intent aloud. Use INNER when both sides must exist for the row to be meaningful (orders that actually have a customer, songs that actually have an artist, transactions that actually have a merchant).

INNER JOIN at a glance.

Syntax — FROM L INNER JOIN R ON L.k = R.k.
Output — every (l_row, r_row) pair where the ON predicate is TRUE; rows with no match are dropped.
INNER keyword — optional; JOIN alone means INNER JOIN.
NULLs in the join key — silently dropped (since NULL = anything is UNKNOWN).
Order of tables — does not change the result for INNER; left and right are interchangeable.

Cardinality rules — INNER never inflates beyond the smaller side… unless duplicates.

Upper bound — |result| ≤ |L| × |R| (worst case when many duplicates).
One-to-one join — |result| ≤ min(|L|, |R|).
One-to-many join — |result| = number of child rows whose parent exists.
Many-to-many join — |result| can balloon fast; almost always a fan-out bug unless intentional.

EQUI JOIN vs THETA JOIN.

EQUI JOIN — join condition uses = (e.g. ON a.k = b.k). The default in 99% of interview problems.
THETA JOIN — join condition uses any comparison (<, >, BETWEEN, range overlap). Used for price-tier matching, date-range overlap, time-series join on BETWEEN start_ts AND end_ts.
Planner cost — equi-joins typically run as hash or merge joins; theta joins often degenerate to nested loops (slow on large tables).

ON vs USING vs NATURAL — three ways to specify the join column.

ON L.col = R.col — most explicit; the two columns appear separately in the output.
USING (col) — shorthand when the column has the same name on both sides; produces one merged column.
NATURAL JOIN — joins on every column with a matching name in both tables; dangerous because schema drift can silently change the join surface — avoid in production code.

Use INNER JOIN when …

Both sides must exist for the row to be meaningful.
You are connecting a parent and a child where orphans should be dropped.
You are filtering rows by existence in another table — though EXISTS (semi-join) is usually clearer for that intent.

SQL interview question — list every employee with their department name

Assume employees(emp_id, name, dept_id) and departments(dept_id, dept_name). Return every employee with their department name; drop employees with no department.

Solution Using INNER JOIN

Code solution.

SELECT e.emp_id,
       e.name,
       d.dept_name
FROM employees e
INNER JOIN departments d ON d.dept_id = e.dept_id
ORDER BY d.dept_name, e.name;

Step-by-step trace.

step	relation	outcome
1	Scan `employees`	4 candidate rows: Ava, Omar, Lin (NULL), Sam
2	Probe `departments` on `dept_id`	Ava, Omar, Sam find matches; Lin's `NULL` join key never matches
3	INNER join keeps matched pairs only	3 rows survive
4	Project `(emp_id, name, dept_name)`	final shape
5	Outer `ORDER BY d.dept_name, e.name`	output ordering

Output:

emp_id	name	dept_name
4	Sam	Insights
1	Ava	Retail
2	Omar	Retail

Why this works — concept by concept:

INNER JOIN semantics — drops every row that fails the ON predicate on either side; the NULL-bearing employee Lin disappears without warning.
Equi-join on dept_id — both sides indexed (departments.dept_id is the primary key) means the planner can use a hash join in Θ(|L| + |R|).
Explicit INNER keyword — signals intent in interview code even though JOIN alone would behave identically.
Column projection — selecting only (emp_id, name, dept_name) keeps the output narrow; SELECT * would include the duplicate dept_id column from both sides.
Cost — hash join Θ(|employees| + |departments|) when dept_id is indexed; falls back to Θ(|employees| × |departments|) nested loop without indexes.

SQL
Topic — joins
INNER join interview drills

Practice →

SQL
Topic — aggregation
Joins + aggregation combos

Practice →

3. LEFT JOIN and RIGHT JOIN — preserve unmatched rows on one side

LEFT keeps every left row + matches; RIGHT is its mirror — and you almost never need RIGHT

LEFT JOIN (aliased LEFT OUTER JOIN) is the workhorse of analytics SQL — any time you want "every X, even if some are missing Y", reach for it. RIGHT JOIN is the symmetric mirror and is almost always re-expressible as LEFT JOIN with the tables swapped — most teams ban RIGHT JOIN from style guides for that reason.

LEFT JOIN at a glance.

Syntax — FROM L LEFT JOIN R ON L.k = R.k.
Output — every row of L paired with the matching row of R; if no match, the R columns become NULL.
Cardinality — |result| ≥ |L|; can exceed |L| if R has duplicate join keys.
LEFT vs LEFT OUTER — identical; the OUTER keyword is optional.
Order of tables matters — swap left and right and you swap the preserved side.

RIGHT JOIN at a glance — and why you rarely need it.

Syntax — FROM L RIGHT JOIN R ON L.k = R.k keeps every row of R.
Equivalent expression — FROM R LEFT JOIN L ON L.k = R.k reads left-to-right and gives the same result.
Convention — most style guides standardize on LEFT JOIN; mention this aloud in interviews.
MySQL / Postgres — both fully support RIGHT JOIN; the convention is preference, not compatibility.

Why LEFT is preferred over RIGHT.

Readability — humans read left-to-right; LEFT JOIN naturally communicates "everything from the first table, plus matches from the second."
Code review consistency — switching between LEFT and RIGHT in the same codebase is a frequent source of bugs.
Refactor-friendliness — LEFT JOIN ... LEFT JOIN ... LEFT JOIN chains compose cleanly; mixing LEFT and RIGHT confuses the chain.

Cardinality rules for LEFT JOIN.

Lower bound — |result| ≥ |L| (every left row is preserved).
Upper bound — |result| ≤ |L| × max_dups_per_left_key_in_R (each left row can match multiple right rows).
One-to-one join — |result| = |L|.
One-to-many — left rows with multiple right matches appear multiple times (fan-out — see §1).

The WHERE right_col IS NULL trick — convert LEFT JOIN to anti-join.

Pattern — LEFT JOIN R ON L.k = R.k WHERE R.k IS NULL.
Semantics — keeps only left rows that had no match in R; the textbook "anti-join via outer + NULL filter" idiom.
Trade-off — works on every dialect; planners sometimes produce a worse plan than NOT EXISTS (see §7).
Common interview use — "find customers who never placed an order" or "find pages with zero likes."

SQL interview question — list every employee with their department name, keeping employees who have no department

Assume employees(emp_id, name, dept_id) and departments(dept_id, dept_name). Return every employee with their department name; employees with no department should appear with NULL in dept_name.

Solution Using LEFT JOIN

Code solution.

SELECT e.emp_id,
       e.name,
       d.dept_name
FROM employees e
LEFT JOIN departments d ON d.dept_id = e.dept_id
ORDER BY e.emp_id;

Step-by-step trace.

step	relation	outcome
1	Scan `employees`	4 candidate rows: Ava, Omar, Lin (NULL), Sam
2	Probe `departments` on `dept_id`	Ava, Omar match dept 10; Sam matches dept 20; Lin has `NULL` — no match
3	LEFT join preserves Lin	4 rows survive (Lin's `d.dept_name` becomes `NULL`)
4	Outer `ORDER BY e.emp_id`	stable output ordering

Output:

emp_id	name	dept_name
1	Ava	Retail
2	Omar	Retail
3	Lin	NULL
4	Sam	Insights

Why this works — concept by concept:

LEFT JOIN preserves every left row — Lin's NULL-bearing dept_id would have been dropped by INNER; LEFT keeps her and inserts NULL for the right-side columns.
Direction matters — employees is on the left because we want every employee preserved; reversing the tables would preserve every department instead.
No filter on the right side — moving WHERE d.dept_name IS NOT NULL into the query would silently degrade this to INNER JOIN (see §8 for why).
Stable ORDER BY — emp_id is unique, so the output order is deterministic without further tiebreakers.
Cost — hash join Θ(|employees| + |departments|) when dept_id is indexed; otherwise nested loop Θ(|employees| × |departments|).

SQL
Topic — left-join
LEFT JOIN interview problems

Practice →

SQL
Topic — joins
All-join SQL drills

Practice →

4. FULL OUTER JOIN — keep everything from both sides

FULL OUTER preserves every row from both tables — perfect for reconciliation reports

FULL OUTER JOIN keeps every row from both tables, filling NULL on whichever side lacks a match. Use it when you need to reconcile two systems and identify rows that exist in one but not the other.

FULL OUTER JOIN at a glance.

Syntax — FROM L FULL OUTER JOIN R ON L.k = R.k.
Output — three buckets in one result:
- Matched rows (L_cols + R_cols).
- Left-only rows (L_cols + NULLs on the right).
- Right-only rows (NULLs on the left + R_cols).
FULL vs FULL OUTER — identical; the OUTER keyword is optional.
Order of tables — does not matter (symmetric).

Reconciliation use case.

"Show me every order_id that exists in either the orders system or the deliveries system."
"Identify SKUs present in the inventory snapshot but missing from the catalog (or vice versa)."
"Build a unified view of users across two CRM systems before merging."

IS NULL filters expose the three buckets.

WHERE L.k IS NULL — keeps right-only rows (orphans on the right).
WHERE R.k IS NULL — keeps left-only rows (orphans on the left).
WHERE L.k IS NOT NULL AND R.k IS NOT NULL — keeps only the matched rows (equivalent to INNER JOIN).
No WHERE — keeps all three buckets.

MySQL note (dialect quirk).

MySQL does not support FULL OUTER JOIN directly.
Emulate with LEFT JOIN ... UNION ALL ... RIGHT JOIN ... WHERE L.k IS NULL to combine left-only + matched + right-only.
PostgreSQL, SQL Server, Oracle, BigQuery, Snowflake, Redshift all support it natively.

SQL interview question — reconcile orders against deliveries

Assume orders(order_id, customer_id) and deliveries(delivery_id, order_id, status). Return every order_id that exists in either table; flag whether the row came from orders, deliveries, or both.

Solution Using FULL OUTER JOIN

Code solution.

SELECT COALESCE(o.order_id, d.order_id)               AS order_id,
       o.customer_id,
       d.status,
       CASE
           WHEN o.order_id IS NULL THEN 'delivery_only'
           WHEN d.order_id IS NULL THEN 'order_only'
           ELSE 'matched'
       END                                            AS source
FROM orders o
FULL OUTER JOIN deliveries d ON d.order_id = o.order_id
ORDER BY 1;

Step-by-step trace.

step	relation	outcome
1	Scan `orders` and `deliveries`	both source tables read
2	FULL OUTER join on `order_id`	every order_id from either side preserved; mismatched cells are `NULL`
3	`COALESCE(o.order_id, d.order_id)`	merges the two `order_id` columns into one canonical value
4	`CASE` on which side is `NULL`	tags each row as `order_only`, `delivery_only`, or `matched`
5	Outer `ORDER BY 1`	stable output ordering on the merged `order_id`

Output:

order_id	customer_id	status	source
1001	42	shipped	matched
1002	17	NULL	order_only
1003	NULL	delivered	delivery_only

Why this works — concept by concept:

FULL OUTER semantics — preserves every row from both sides; the two "orphan" buckets are exactly what a reconciliation report needs.
COALESCE for the canonical key — o.order_id is NULL for delivery-only rows and vice versa; COALESCE picks whichever is non-null.
CASE for source labelling — tagging each row with its origin lets downstream consumers filter WHERE source = 'order_only' to isolate one bucket.
No WHERE clause — keeps all three buckets; adding WHERE source = 'matched' would degrade the query to an INNER JOIN.
Cost — hash join Θ(|orders| + |deliveries|) on indexed order_id; the post-join CASE is Θ(n) over the result.

SQL
Topic — joins
FULL OUTER join drills

Practice →

SQL
Topic — multi-join
Reconciliation problems

Practice →

5. CROSS JOIN — the Cartesian product

CROSS JOIN multiplies every row from L by every row from R — useful for grids, dangerous when accidental

CROSS JOIN has no ON clause; it returns the Cartesian product of the two tables. Used deliberately, it builds dense calendars and complete grids; used accidentally, it produces query results that crash dashboards.

CROSS JOIN at a glance.

Syntax — FROM L CROSS JOIN R (no ON).
Output size — |L| × |R| rows.
Implicit form — FROM L, R is the same as CROSS JOIN; this is why forgetting ON is the canonical Cartesian bug.
Use case — dense grids: every (date × store) slot for a missing-day report, every (product × region) combination for a pivot, every (user × cohort) slot for funnel analysis.

Use CROSS JOIN deliberately for dense calendars and grids.

Generate a calendar dimension: generate_series('2026-01-01', '2026-12-31', INTERVAL '1 day') then CROSS JOIN stores to get one row per (day, store).
Test fixtures: CROSS JOIN two small dimension tables to create exhaustive combinations.
Dense pivot scaffolds: CROSS JOIN the row dimension with the column dimension, then LEFT JOIN the fact table to fill in actuals (missing combinations show as NULL).

Trap — the accidental Cartesian.

Forgetting the ON clause on a regular JOIN — older SQL syntax FROM a, b without a WHERE a.k = b.k produces a Cartesian.
A typo like ON a.k = a.k (instead of a.k = b.k) joins every row to every row.
Result size explodes — |a| × |b| can be 100M rows from two 10k tables.

CROSS JOIN LATERAL (Postgres bonus).

CROSS JOIN LATERAL (SELECT … FROM r WHERE r.fk = L.id LIMIT 5) — for each left row, run a parameterised subquery that can reference left columns.
Useful for "top-N children per parent" patterns without window functions.
Snowflake / BigQuery offer LATERAL / UNNEST with similar semantics.

Worked example — dense `(date × store)` grid for a sales report

Question. Generate a row for every (date, store_id) combination over the first three days of January and the two stores in stores, then LEFT JOIN the sales table to fill in revenue (zeros for missing combinations).

Input. stores(store_id) = {S1, S2}; sales(sale_date, store_id, revenue) has a row only for (2026-01-01, S1, 500).

Code.

WITH calendar AS (
  SELECT generate_series('2026-01-01'::date, '2026-01-03'::date, INTERVAL '1 day')::date AS d
),
grid AS (
  SELECT c.d, s.store_id
  FROM calendar c
  CROSS JOIN stores s
)
SELECT g.d, g.store_id, COALESCE(s.revenue, 0) AS revenue
FROM grid g
LEFT JOIN sales s ON s.sale_date = g.d AND s.store_id = g.store_id
ORDER BY g.d, g.store_id;

Step-by-step explanation.

calendar generates 3 dates.
CROSS JOIN stores multiplies — 3 dates × 2 stores = 6 rows in grid.
LEFT JOIN sales matches on (d, store_id); only (2026-01-01, S1) has data.
COALESCE converts missing-revenue NULLs to 0.

Output.

d	store_id	revenue
2026-01-01	S1	500
2026-01-01	S2	0
2026-01-02	S1	0
2026-01-02	S2	0
2026-01-03	S1	0
2026-01-03	S2	0

Rule of thumb: if the answer requires "a row for every combination, even the empty ones," CROSS JOIN + LEFT JOIN is the canonical pattern.

SQL
Topic — cross-join
CROSS JOIN practice

Practice →

SQL
Topic — joins
All-join SQL library

Practice →

6. SELF JOIN — joining a table to itself

SELF JOIN is a pattern, not a keyword — alias the table twice and join it to itself

A SELF JOIN is not a new SQL keyword; it's the pattern of joining a table to itself, aliasing it twice so the planner can resolve column references unambiguously. Use it for hierarchies (employee → manager), pairwise comparisons (find users with overlapping tags), and sequential events (gap detection) — though modern SQL often replaces SELF JOINs with LAG window functions for cleaner code.

SELF JOIN at a glance.

Definition — a join where both sides are the same table, distinguished by aliases.
Syntax — FROM employees e1 JOIN employees e2 ON e2.manager_id = e1.emp_id.
Keyword — there is no SELF JOIN keyword; you simply alias the table twice and use INNER / LEFT / FULL as needed.
Output — pairs of rows from the same table where the ON condition holds.

Why aliasing is mandatory.

Without aliases, references like employees.emp_id are ambiguous (which copy?).
Conventional aliases — e1 / e2, parent / child, prev / curr — signal intent.
Same alias rules apply to chained self-joins (e1, e2, e3) for multi-level hierarchies.

Use cases for SELF JOIN.

Hierarchies — employee → manager, comment → parent_comment, file → parent_directory.
Pairwise comparisons — find users with the same tag, products with the same SKU prefix, transactions with the same amount.
Sequential events — match each row with the previous row by id - 1 or event_ts < self.event_ts ORDER BY ts DESC LIMIT 1.
Tag-matching — taggings t1 JOIN taggings t2 ON t1.tag_id = t2.tag_id AND t1.user_id < t2.user_id to enumerate pairs.

SELF JOIN vs window functions — when to prefer which.

SELF JOIN — wins when the relationship is structural (parent-child hierarchies, peer pairs).
LAG / LEAD — wins when the relationship is ordered-neighbour (previous-row / next-row on a sorted partition); see Blog70 §5.
Recursive CTE — wins for multi-level hierarchies (full org-chart traversal); see Blog44 §6.
Cost — SELF JOIN is Θ(n²) worst-case without indexes; LAG is Θ(n log n) for the sort plus Θ(n) for the scan.

SQL interview question — for every employee, show their manager's name

Assume employees(emp_id, name, manager_id). manager_id is NULL for top-of-org employees. Return every employee with their name and their manager's name; include employees who have no manager (their manager column should be NULL).

Solution Using SELF JOIN

Code solution.

SELECT e.emp_id,
       e.name        AS employee,
       m.name        AS manager
FROM employees e
LEFT JOIN employees m ON m.emp_id = e.manager_id
ORDER BY e.emp_id;

Step-by-step trace.

step	relation	outcome
1	Scan `employees AS e`	every employee row in the output
2	Scan `employees AS m` aliased as "manager"	second copy of the same table
3	`LEFT JOIN ON m.emp_id = e.manager_id`	each employee's `manager_id` looked up against `emp_id` on the manager copy
4	Top-of-org rows (`manager_id IS NULL`)	preserved by `LEFT JOIN` with `m.name = NULL`
5	Outer `ORDER BY e.emp_id`	deterministic output

Output:

emp_id	employee	manager
1	Quinn	NULL
2	Ravi	Quinn
3	Lin	Ravi
4	Ava	Ravi

Why this works — concept by concept:

Self join via aliasing — employees e and employees m are two logical references to the same physical table; the planner treats them independently for the join.
LEFT JOIN preserves top-of-org — Quinn has manager_id = NULL; INNER JOIN would drop her, LEFT keeps her with m.name = NULL.
Alias renaming in SELECT — e.name AS employee and m.name AS manager keep the output readable even though both columns are employees.name.
Direction of the join — m.emp_id = e.manager_id means "find the manager's row whose emp_id matches this employee's manager_id."
Cost — hash join Θ(|employees|) per side when emp_id is indexed; nested loop Θ(|employees|²) if not.

SQL
Topic — self-join
SELF JOIN interview problems

Practice →

SQL
Topic — joins
Joins SQL library

Practice →

7. ANTI JOIN — rows in one table with no match in another

ANTI JOIN keeps rows in L with no match in R — three phrasings, one preferred

ANTI JOIN answers the canonical interview question "find rows in table A that do NOT exist in table B." It is not a separate SQL keyword in most dialects; you write it as NOT EXISTS, LEFT JOIN ... IS NULL, or (carefully) NOT IN. All three are logically equivalent in spirit, but their NULL handling and planner cost differ enough that picking the right phrasing is itself an interview signal.

ANTI JOIN at a glance.

Definition — return rows in L that have no matching row in R under the ON predicate.
Inverse of SEMI JOIN — SEMI keeps "L rows that have a match"; ANTI keeps "L rows that do not."
Keyword support — DuckDB has ANTI JOIN; Snowflake / BigQuery / Postgres rely on NOT EXISTS or LEFT JOIN ... IS NULL.
Cardinality — |result| ≤ |L|; never inflates.

Three equivalent phrasings — and their NULL behaviour.

WHERE NOT EXISTS (SELECT 1 FROM R WHERE R.k = L.k) — the preferred phrasing; planners produce a clean anti-semi-join plan, and NULLs in R.k are handled correctly (they fail the = predicate and don't disturb the result).
LEFT JOIN R ON L.k = R.k WHERE R.k IS NULL — the "outer + NULL filter" idiom; works on every dialect; planner may produce a slower plan because it materializes the LEFT JOIN before filtering.
WHERE L.k NOT IN (SELECT k FROM R) — avoid unless R.k is guaranteed NOT NULL. If any row in R.k is NULL, the whole query returns empty because x NOT IN (..., NULL) evaluates to UNKNOWN for every x.

Why NOT EXISTS is the preferred phrasing.

Plan quality — most planners (Postgres, SQL Server, Oracle, Snowflake) emit a dedicated anti-semi-join operator that short-circuits on the first match found in R.
NULL safety — the correlated = predicate treats NULL = anything as UNKNOWN, which is filtered out, so the outer row survives correctly.
Reads cleanly — "where there is not a row in R matching this one" mirrors the spoken intent.
Composes with multiple conditions — WHERE NOT EXISTS (SELECT 1 FROM R WHERE R.k = L.k AND R.status = 'active') is straightforward.

The NOT IN NULL trap (the canonical interview gotcha).

Setup — WHERE L.k NOT IN (SELECT k FROM R).
Trap — if any row in R.k is NULL, the subquery returns {..., NULL}. Then L.k NOT IN (..., NULL) is UNKNOWN for every L.k, and the whole query returns zero rows.
Fix — either use NOT EXISTS, or filter the subquery: WHERE L.k NOT IN (SELECT k FROM R WHERE k IS NOT NULL).
Why interviewers love this — it's a silent bug; the query "works" until production data introduces a NULL.

Use cases for ANTI JOIN.

Customers who never placed an order.
Pages with zero likes (DataLemur Meta interview question).
Subscribers with no recent activity.
Inventory items missing from the catalog.
Users who have not yet completed onboarding.

SQL interview question — find Facebook pages with zero likes

Assume pages(page_id, page_name) and page_likes(page_id, user_id). Return the page_id of every page that has not received a single like.

Solution Using NOT EXISTS

Code solution.

SELECT p.page_id,
       p.page_name
FROM pages p
WHERE NOT EXISTS (
    SELECT 1
    FROM page_likes pl
    WHERE pl.page_id = p.page_id
)
ORDER BY p.page_id;

Step-by-step trace.

step	relation	outcome
1	Scan `pages p`	every candidate page row
2	For each `p`, probe `page_likes` for any row with `pl.page_id = p.page_id`	match found → `EXISTS` is `TRUE` → `NOT EXISTS` is `FALSE` → row dropped
3	No match found → `NOT EXISTS` is `TRUE` → row kept	only pages with zero likes survive
4	Outer `ORDER BY p.page_id`	stable output ordering

Output:

page_id	page_name
22	abandoned-page
45	new-page-no-likes-yet

Why this works — concept by concept:

NOT EXISTS for anti-semi-join — the correlated subquery short-circuits on the first match; planners emit a dedicated anti-semi-join operator that beats LEFT JOIN ... IS NULL in most cases.
NULL safety — pl.page_id = p.page_id returns UNKNOWN for NULL keys, which is filtered out; the outer page row survives correctly without any extra IS NOT NULL filter.
Why not NOT IN — if page_likes.page_id ever contains NULL, NOT IN would silently return zero rows; NOT EXISTS is robust.
SELECT 1 inside the subquery — convention; the column list doesn't matter for EXISTS, so SELECT 1 signals "I only care about existence."
Cost — anti-semi-join Θ(|pages| + |page_likes|) when page_likes.page_id is indexed; otherwise Θ(|pages| × |page_likes|) worst case.

SQL
Topic — anti-join
ANTI JOIN interview problems

Practice →

SQL
Topic — joins
All-join SQL drills

Practice →

8. ON vs WHERE — the join-condition vs filter trap

`ON` runs during the join; `WHERE` runs after — and for OUTER joins, the choice changes the result

The ON clause defines the join condition; the WHERE clause filters the join's result. For INNER joins, the choice does not matter — both clauses produce the same final rows. For OUTER joins, the choice silently changes the result, and most interview rejections happen because a candidate moved a predicate from ON to WHERE without realising they had downgraded a LEFT JOIN to an INNER JOIN.

ON runs during the join.

Role — controls which (L, R) row pairs are considered a "match."
Timing — applied while the join is being constructed.
For OUTER joins — unmatched rows on the preserved side are kept with NULLs; predicates in ON against the non-preserved side do not drop preserved rows.

WHERE runs after the join completes.

Role — filters the final join result.
Timing — applied after the join has produced its full output (matched rows + preserved orphans).
For OUTER joins — predicates against the non-preserved side do drop preserved rows because their value is NULL (and NULL op anything is UNKNOWN).

For INNER joins — ON and WHERE are equivalent.

Both clauses run on the same set of rows.
Convention: put the join-key equality in ON and downstream filters in WHERE so reviewers can spot intent.
The optimiser typically rewrites them to the same plan.

For OUTER joins — placement changes the result.

Right-side predicate in ON — preserves unmatched left rows (right columns stay NULL).
Same predicate in WHERE — drops unmatched left rows (because NULL = x is UNKNOWN), effectively silently degrading the LEFT JOIN to an INNER JOIN.
Practical rule — if you want a row to survive even when the right side has no match, the predicate against the right side must stay in ON.

Cardinality and fan-out reminder.

Always verify the uniqueness of the join key on at least one side before aggregating after a join.
One-to-many joins inflate row counts; aggregating without dedup multiplies your sums.
The safest pattern: aggregate inside a CTE first to collapse to one row per join key, then JOIN.

Worked example — `active=true` predicate moves between ON and WHERE

Question. List every employee with their department name, but include the department name only when the department is active; employees should appear even if their department is inactive or missing.

Input. employees(emp_id, name, dept_id) and departments(dept_id, dept_name, active). Suppose Ava is in dept 10 (active), Omar is in dept 10 (active), Sam is in dept 20 (inactive), Lin has dept_id = NULL.

Code (correct — predicate in ON).

SELECT e.emp_id, e.name, d.dept_name
FROM employees e
LEFT JOIN departments d
       ON d.dept_id = e.dept_id
      AND d.active  = TRUE
ORDER BY e.emp_id;

Code (subtly wrong — predicate in WHERE).

SELECT e.emp_id, e.name, d.dept_name
FROM employees e
LEFT JOIN departments d ON d.dept_id = e.dept_id
WHERE d.active = TRUE
ORDER BY e.emp_id;

Step-by-step explanation.

The correct version applies active = TRUE while matching — Sam (dept 20, inactive) and Lin (no dept) both survive the join with dept_name = NULL.
The wrong version applies active = TRUE after the join — Sam's d.active is FALSE (filtered), and Lin's d.active is NULL (NULL = TRUE is UNKNOWN, filtered). Both rows disappear, silently degrading LEFT JOIN to INNER JOIN.

Output (correct).

emp_id	name	dept_name
1	Ava	Retail
2	Omar	Retail
3	Lin	NULL
4	Sam	NULL

Output (wrong).

emp_id	name	dept_name
1	Ava	Retail
2	Omar	Retail

Rule of thumb: for OUTER joins, predicates against the non-preserved side belong in ON; predicates against the preserved side belong in WHERE. State this aloud in the interview.

SQL
Topic — joins
ON vs WHERE join drills

Practice →

SQL
Topic — left-join
LEFT JOIN problems

Practice →

Choosing the right join (cheat sheet)

You want …	Reach for …	Watch out for
Only rows that exist on both sides	`INNER JOIN`	NULL keys are silently dropped
Every row from one side + matches from the other	`LEFT JOIN` (rare: `RIGHT JOIN`)	Fan-out when the right side has duplicate join keys
Every row from both sides — reconciliation	`FULL OUTER JOIN`	Not supported in MySQL; emulate with `LEFT UNION ALL RIGHT WHERE … IS NULL`
Cartesian product — dense grids	`CROSS JOIN`	Result size = `
Same table joined to itself	{% raw %}`SELF JOIN` (aliased twice)	`LAG` / `LEAD` often clearer for ordered-neighbour problems
Rows in L with NO match in R	`NOT EXISTS` (preferred), `LEFT JOIN ... IS NULL`	`NOT IN` is buggy if R has `NULL`s
Rows in L with a match in R, keep only L cols	`EXISTS` (semi-join)	`IN` works but may fan out duplicates if you accidentally JOIN
Range / overlap match (`BETWEEN`, `<`, `>`)	Theta join — `INNER JOIN ON L.start <= R.t AND R.t < L.end`	Falls back to nested loop; index range carefully

Frequently asked questions

What's the difference between INNER JOIN and OUTER JOIN?

INNER JOIN keeps only rows that match on both sides — anything unmatched on either side is dropped. OUTER JOIN is an umbrella for LEFT, RIGHT, and FULL OUTER, each of which preserves unmatched rows on a specific side and fills the missing columns with NULL. Use INNER when both sides must exist for the row to be meaningful (e.g. orders that have a real customer); use an OUTER variant when you need "everything from one or both sides, even orphans" (e.g. all customers, including those with zero orders).

What's the difference between LEFT JOIN and RIGHT JOIN in SQL?

LEFT JOIN returns every row from the left table plus matching rows from the right; unmatched right columns are NULL. RIGHT JOIN is the mirror — every right row plus matches from the left. They are symmetric: A LEFT JOIN B ON … is identical to B RIGHT JOIN A ON …, so most teams standardise on LEFT JOIN for left-to-right readability and ban RIGHT JOIN from their style guide. The interview-grade soundbite: "If you ever reach for RIGHT JOIN, swap the table order and use LEFT JOIN instead."

What's the difference between the ON clause and the WHERE clause in a SQL join?

ON runs during the join and decides which (L, R) row pairs count as a match; WHERE runs after the join produces its output and filters the final result. For INNER joins the two clauses produce the same answer, but for OUTER joins their placement changes the result: a predicate against the non-preserved side in ON keeps the preserved rows (filling the predicate columns with NULL), while the same predicate in WHERE drops those rows (since NULL op anything is UNKNOWN), silently degrading the OUTER join to an INNER join.

How do I write a SELF JOIN in SQL?

There is no SELF JOIN keyword; you write a regular JOIN (INNER, LEFT, etc.) using the same table twice with different aliases so the planner can disambiguate column references. For example, FROM employees e LEFT JOIN employees m ON m.emp_id = e.manager_id matches each employee row to the manager row inside the same employees table. Use SELF JOIN for hierarchies (employee → manager), pairwise comparisons inside one table, or sequential events when window functions like LAG are not available or not as clear.

How do I write an anti-join — NOT EXISTS vs LEFT JOIN IS NULL?

Three phrasings are equivalent in spirit: WHERE NOT EXISTS (SELECT 1 FROM R WHERE R.k = L.k), LEFT JOIN R ON L.k = R.k WHERE R.k IS NULL, and WHERE L.k NOT IN (SELECT k FROM R). Prefer NOT EXISTS — it emits a clean anti-semi-join plan, is robust to NULLs in R.k, and reads as "where there is not a row in R matching this one." LEFT JOIN ... IS NULL works everywhere but may produce a worse plan. NOT IN is the trap to avoid: if R.k ever contains NULL, the entire query returns zero rows because x NOT IN (..., NULL) is UNKNOWN.

Practice on PipeCode

PipeCode ships 450+ data-engineering interview problems — including PostgreSQL-first SQL practice keyed to joins, anti-join, self-join, aggregation + joins, and multi-table reconciliation patterns.

Kick off via Explore practice →; drill the dedicated joins SQL lane →; fan out into LEFT JOIN problems →; reinforce anti-join patterns →; rehearse SELF JOIN drills →; widen coverage on the full SQL practice library →.

DEV Community

SQL Joins Interview Questions: INNER, LEFT, RIGHT, FULL, SELF & ANTI Joins

1. Why SQL joins matter in data engineering interviews

Joins decide which rows survive the combine — naming the family is half the answer

Worked example — INNER vs LEFT on the same input

2. INNER JOIN — keep only matching rows

Returns only rows where the match condition is true on both sides

SQL interview question — list every employee with their department name

Solution Using INNER JOIN

3. LEFT JOIN and RIGHT JOIN — preserve unmatched rows on one side

LEFT keeps every left row + matches; RIGHT is its mirror — and you almost never need RIGHT

SQL interview question — list every employee with their department name, keeping employees who have no department

Solution Using LEFT JOIN

4. FULL OUTER JOIN — keep everything from both sides

FULL OUTER preserves every row from both tables — perfect for reconciliation reports

SQL interview question — reconcile orders against deliveries

Solution Using FULL OUTER JOIN

5. CROSS JOIN — the Cartesian product

CROSS JOIN multiplies every row from L by every row from R — useful for grids, dangerous when accidental

Worked example — dense `(date × store)` grid for a sales report

6. SELF JOIN — joining a table to itself

SELF JOIN is a pattern, not a keyword — alias the table twice and join it to itself

SQL interview question — for every employee, show their manager's name

Solution Using SELF JOIN

7. ANTI JOIN — rows in one table with no match in another

ANTI JOIN keeps rows in L with no match in R — three phrasings, one preferred

SQL interview question — find Facebook pages with zero likes

Solution Using NOT EXISTS

8. ON vs WHERE — the join-condition vs filter trap

`ON` runs during the join; `WHERE` runs after — and for OUTER joins, the choice changes the result

Worked example — `active=true` predicate moves between ON and WHERE

Choosing the right join (cheat sheet)

Frequently asked questions

What's the difference between INNER JOIN and OUTER JOIN?

What's the difference between LEFT JOIN and RIGHT JOIN in SQL?

What's the difference between the ON clause and the WHERE clause in a SQL join?

How do I write a SELF JOIN in SQL?

How do I write an anti-join — NOT EXISTS vs LEFT JOIN IS NULL?

Practice on PipeCode

Top comments (0)

1. Why SQL joins matter in data engineering interviews

Joins decide which rows survive the combine — naming the family is half the answer

Worked example — INNER vs LEFT on the same input

2. INNER JOIN — keep only matching rows

Returns only rows where the match condition is true on both sides

SQL interview question — list every employee with their department name

Solution Using INNER JOIN

3. LEFT JOIN and RIGHT JOIN — preserve unmatched rows on one side

LEFT keeps every left row + matches; RIGHT is its mirror — and you almost never need RIGHT

SQL interview question — list every employee with their department name, keeping employees who have no department

Solution Using LEFT JOIN

4. FULL OUTER JOIN — keep everything from both sides

FULL OUTER preserves every row from both tables — perfect for reconciliation reports

SQL interview question — reconcile orders against deliveries

Solution Using FULL OUTER JOIN

5. CROSS JOIN — the Cartesian product

CROSS JOIN multiplies every row from L by every row from R — useful for grids, dangerous when accidental

Worked example — dense (date × store) grid for a sales report

6. SELF JOIN — joining a table to itself

SELF JOIN is a pattern, not a keyword — alias the table twice and join it to itself

SQL interview question — for every employee, show their manager's name

Solution Using SELF JOIN

7. ANTI JOIN — rows in one table with no match in another

ANTI JOIN keeps rows in L with no match in R — three phrasings, one preferred

SQL interview question — find Facebook pages with zero likes

Solution Using NOT EXISTS

8. ON vs WHERE — the join-condition vs filter trap

ON runs during the join; WHERE runs after — and for OUTER joins, the choice changes the result

Worked example — active=true predicate moves between ON and WHERE

Choosing the right join (cheat sheet)

Frequently asked questions

What's the difference between INNER JOIN and OUTER JOIN?

What's the difference between LEFT JOIN and RIGHT JOIN in SQL?

What's the difference between the ON clause and the WHERE clause in a SQL join?

How do I write a SELF JOIN in SQL?

How do I write an anti-join — NOT EXISTS vs LEFT JOIN IS NULL?

Practice on PipeCode

Worked example — dense `(date × store)` grid for a sales report

`ON` runs during the join; `WHERE` runs after — and for OUTER joins, the choice changes the result

Worked example — `active=true` predicate moves between ON and WHERE