sql set operations stitch two or more result sets together along the row axis: union sql stacks rows and drops duplicates, union all sql stacks rows and keeps every duplicate, intersect sql keeps only rows that appear in both inputs, and except sql (called minus sql in Oracle) keeps rows from the left input that do not appear in the right. These four operators answer the bulk of the sql interview questions in the "combine result sets" cluster, and the union vs union all distinction alone trips up more candidates than almost any other shape on a SQL whiteboard.
This guide walks through every operator in the sql set operations family that reviewers love to test in data engineering interview questions: the four-way mental model, the column-count + type-compatibility contract that binds every set op, the union vs union all deduplication-and-performance trap, intersect in sql for finding rows present in both inputs, except sql / minus sql for subtraction, how set operators compare to JOIN and WHERE IN for the same intent, and the seven interview gotchas (NULL equality, ORDER BY placement, type widening) that fail most candidates. Every section ends as sql interview questions with answers: a runnable PostgreSQL query, a traced execution, an output table, and a concept-by-concept why this works breakdown — the exact shape sql for data engineers rounds reward when set operations in sql come up.
When you want hands-on reps immediately after reading, browse set-operations SQL practice →, drill the UNION ALL lane →, sharpen aggregation SQL drills →, rehearse JOIN SQL problems →, or widen coverage on the full SQL practice library →.
On this page
- Why SQL set operations matter in data engineering interviews
- Anatomy of a set operation — column-count and type-compatibility rules
- UNION vs UNION ALL — deduplication and the performance trap
- INTERSECT — rows present in both inputs
- EXCEPT and MINUS — rows in A but not in B
- Set operations vs JOIN, WHERE IN, and NOT EXISTS
- Interview gotchas — NULL equality, ORDER BY, type widening
- Choosing the right set operator (cheat sheet)
- Frequently asked questions
- Practice on PipeCode
1. Why SQL set operations matter in data engineering interviews
Four operators, one contract — the whole set-ops interview surface in a sentence
The one-sentence invariant: sql set operations stack two result sets vertically — UNION drops duplicates, UNION ALL keeps every row, INTERSECT keeps shared rows, EXCEPT (MINUS in Oracle) keeps rows from the left input absent from the right — and every operator requires the same column count and compatible types between the two inputs. Once you internalise that, every prompt in the union intersect except in sql family becomes "pick the right operator and confirm the column contract."
The four operators at a glance.
-
UNION— vertical stack with DISTINCT semantics; drops duplicates across both inputs.union sqlandunion in sqlare the two most-searched names for this same operator. -
UNION ALL— vertical stack with bag semantics; keeps every row, including duplicates. -
INTERSECT— keeps rows that appear in both inputs; treats them as sets (DISTINCT). -
EXCEPT/MINUS— keeps rows in the left input that do not appear in the right;DISTINCTsemantics by default.
The non-negotiable contract every set op enforces.
-
Same number of columns in both
SELECTlists —(a, b, c)cannot UNION with(a, b). -
Compatible types per column position —
INTEGERUNIONBIGINTwidens toBIGINT;INTEGERUNIONTEXTerrors or silently casts depending on dialect. -
Column names come from the left input — the second
SELECT's names are ignored in the output. -
One
ORDER BYfor the whole expression — placed after the finalSELECT, not after each input.
Why interviewers love set operations.
-
Every analytic question with "combine these two tables" reduces to a
set operations in sqlprompt — active vs churned users, this month vs last, paying vs trial;union in sqlandintersect in sqlare the most-tested variants. -
The
union vs union allperformance trap is universal — candidates who reach forUNIONinstinctively burn an unnecessary dedup pass on every prompt. -
NULL semantics diverge from
JOIN—INTERSECTtreatsNULL = NULLas true (set-equality);JOIN ON a = btreats it asUNKNOWN. Senior signal. -
Dialect divergence on
EXCEPTvsMINUS— Oracle is the lone holdout withMINUS; every other major engine usesEXCEPT.
What interviewers listen for.
- Do you name the column contract before writing the query? — "both inputs return
(user_id, region)in the same order with compatible types." - Do you default to
UNION ALLunless deduplication is explicitly required? — the cost-aware answer. - Do you mention that
EXCEPTandINTERSECTtreatNULLas equal when compared toJOIN'sUNKNOWN? — bonus points. - Do you place
ORDER BYat the bottom of the whole expression, not after each input? — basic-but-tested fluency.
Worked example — combine active and churned users into one feed
Detailed explanation. When a stakeholder asks for "all users who interacted with the product in 2026, whether currently active or churned," the natural shape is a vertical stack of two queries. The two inputs already share the same projection — (user_id, status) — so a single set op gives the union of the two cohorts.
Question. From active_users(user_id) and churned_users(user_id), return a deduplicated list of all interacting users with a status label.
Input. active_users slice and churned_users slice.
| active_users.user_id |
|---|
| 1 |
| 2 |
| 3 |
| churned_users.user_id |
|---|
| 3 |
| 4 |
| 5 |
Code.
SELECT user_id, 'active' AS status FROM active_users
UNION
SELECT user_id, 'churned' AS status FROM churned_users
ORDER BY user_id;
Step-by-step explanation.
- The first
SELECTprojects every active user with a literal'active'label. - The second
SELECTprojects every churned user with a literal'churned'label. -
UNIONstacks the two result sets vertically and drops exact-duplicate rows (a row is a duplicate only when every column matches, so(3, 'active')and(3, 'churned')are distinct and both survive). - The single
ORDER BY user_idat the bottom sorts the combined output. - Column names come from the first
SELECT— the secondSELECT's aliases are discarded.
Output.
| user_id | status |
|---|---|
| 1 | active |
| 2 | active |
| 3 | active |
| 3 | churned |
| 4 | churned |
| 5 | churned |
Rule of thumb: when the question says "combine these two cohorts" and the projection already matches, set ops are the cleanest shape; if you'd lose information by deduping, switch to UNION ALL.
SQL
Topic — set-operations
Set operations SQL drills
SQL
Topic — union-all
UNION ALL SQL practice
2. Anatomy of a set operation — column-count and type-compatibility rules
The contract every set op enforces — column count, compatible types, single ORDER BY
Before any deduplication or intersection happens, every set operation enforces the same structural contract on its two inputs. Get the contract right and the four operators behave consistently; get it wrong and you get a dialect-specific cocktail of errors.
Rule 1 — same number of columns in both inputs.
- Hard requirement — every dialect raises an error if the column counts differ.
- Error message — Postgres says "each UNION query must have the same number of columns"; MySQL says "The used SELECT statements have a different number of columns"; SQL Server says "All queries combined using a UNION, INTERSECT or EXCEPT operator must have an equal number of expressions in their target lists."
-
Fix — pad the shorter side with
NULLliterals:SELECT a, b, NULL FROM ….
Rule 2 — column types must be compatible at each position.
- Position-based, not name-based — column 1 of left input pairs with column 1 of right input regardless of column names.
-
Compatible —
INTEGERandBIGINTwiden toBIGINT;VARCHAR(10)andVARCHAR(50)widen toVARCHAR(50);DATEandTIMESTAMPwiden toTIMESTAMP. -
Incompatible —
INTEGERandDATEerror in Postgres; MySQL silently casts; SQL Server errors. Senior fix — write explicit casts in both sides. - Type-widening rule — the output type is the "least common supertype" of the two inputs at each column position.
Rule 3 — column names come from the first input.
-
The output's column names equal the first
SELECT's aliases — every other input's aliases are ignored. -
Practical consequence — if you want the output column called
status, alias itAS statusin the firstSELECT. -
ORDER BYreferences — must use the first-input column names or positional indexes (ORDER BY 1).
Rule 4 — one ORDER BY for the whole expression.
-
Placement — at the very bottom, after the last
SELECT. Postgres / MySQL / Snowflake all enforce this. -
Per-input
ORDER BY— most dialects either error or silently discardORDER BYplaced inside an innerSELECTof a set op (the order is undefined until the final operator runs). - Workaround — wrap each input in a CTE with the desired internal order, but that order is not guaranteed to survive the set op anyway.
Rule 5 — parentheses control precedence in chained set ops.
-
INTERSECTbinds tighter thanUNIONandEXCEPTin ANSI SQL —A UNION B INTERSECT Cis parsed asA UNION (B INTERSECT C). -
Always parenthesise when chaining different operators —
(A UNION B) INTERSECT Creads as you expect. - Dialect divergence — some engines (older MySQL) don't honor parentheses correctly; rewrite as CTEs to force order.
DISTINCT semantics vs ALL modifier — every operator supports both.
-
UNION DISTINCT— same asUNION(default isDISTINCT). -
UNION ALL— keep duplicates. -
INTERSECT [DISTINCT]— default isDISTINCT; rows that appear in both inputs, deduplicated. -
INTERSECT ALL— bag-multiplicity intersection; if a row appears 3 times in A and 5 times in B, it appears 3 times in the output (min of the two counts). Postgres / DB2 / Snowflake support this; MySQL and SQL Server do not. -
EXCEPT [DISTINCT]— rows in A not in B, deduplicated. -
EXCEPT ALL— bag-multiplicity difference; if a row appears 5 times in A and 2 times in B, it appears 3 times in the output. Postgres / DB2 / Snowflake support this; MySQL and SQL Server do not.
SQL
Topic — set-operations
Set-operation contract drills
SQL
Topic — type-casting
Type-compatibility SQL practice
3. UNION vs UNION ALL — deduplication and the performance trap
union vs union all — the single most-tested set-ops fact
The union vs union all distinction is the single most-asked set-operations interview question. The answer in one sentence: UNION performs an implicit DISTINCT over the combined result; UNION ALL skips that step and is materially cheaper on every dialect. Reaching for UNION when you do not need dedup is the textbook sql for data engineers mistake.
UNION — deduplicates the combined result.
- Semantics — vertical stack, then drop exact-duplicate rows.
-
"Exact duplicate" — every column matches (including
NULLs;INTERSECT/UNIONtreatNULLas equal toNULL). -
Plan — typically
HashAggregateon the union output orSort+Unique; both add a pass over the data. -
Cost —
Θ(n + m)for the scan plusΘ((n + m))for the dedup hash; memory grows with distinct row count. - Use it when — you truly need a deduplicated stack and the inputs may overlap (e.g. "all users who ever interacted, once each").
UNION ALL — passes every row through.
- Semantics — vertical stack with bag semantics; preserves duplicates.
-
Plan — typically
Append(Postgres) orConcatenation(SQL Server); essentially free beyond the input scans. -
Cost —
Θ(n + m)for the scans only; no extra hash or sort. - Use it when — the inputs are known to be disjoint (e.g. partitioned tables stitched together) OR when duplicates are meaningful (e.g. event-log unions).
difference between union and union all — the headline contrasts.
-
Deduplication —
UNIONdrops duplicates;UNION ALLkeeps them. -
Performance —
UNION ALLis faster by aΘ(dedup)pass; the gap widens with row count. -
Output count —
UNION≤ sum of input counts;UNION ALL= sum of input counts exactly. -
NULL handling — both treat
NULL = NULLas true for dedup purposes (set semantics, not JOIN semantics).
When UNION ALL is strictly correct.
-
Disjoint inputs — partitioned tables where you know no row exists in both. Stack with
UNION ALL; dedup adds zero rows and just costs CPU. - Meaningful duplicates — combining event streams where each occurrence matters.
-
Pre-deduplicated inputs — each side already ran
DISTINCTupstream; another dedup is wasted work.
When UNION is the right call.
- Overlapping inputs and you need a set — "users who appear in either cohort, once each".
- The downstream consumer expects unique rows — reporting tables that should never contain dupes.
- You have no way to know the inputs are disjoint — the cautious default for ad-hoc analytics.
SQL interview question — total daily revenue across two regions, kept disjoint
Assume eu_sales(sale_date, amount) and us_sales(sale_date, amount) are two disjoint tables (one per region). Return total revenue per day across both regions, sorted descending.
Solution Using UNION ALL + GROUP BY
Detailed explanation. Because EU and US sales never overlap (they are partitioned by region), UNION ALL is the strictly correct stack — every row is unique by construction, and any UNION-style dedup would be wasted work without changing the result.
Code (PostgreSQL).
SELECT sale_date,
SUM(amount) AS total_revenue
FROM (
SELECT sale_date, amount FROM eu_sales
UNION ALL
SELECT sale_date, amount FROM us_sales
) all_sales
GROUP BY sale_date
ORDER BY total_revenue DESC;
Equivalent in MySQL.
SELECT sale_date,
SUM(amount) AS total_revenue
FROM (
SELECT sale_date, amount FROM eu_sales
UNION ALL
SELECT sale_date, amount FROM us_sales
) AS all_sales
GROUP BY sale_date
ORDER BY total_revenue DESC;
Step-by-step trace.
| step | relation | outcome |
|---|---|---|
| 1 | Scan eu_sales and us_sales independently |
two row streams |
| 2 |
UNION ALL stacks them |
one combined stream, no dedup |
| 3 | Wrapped as all_sales derived table |
named source for the outer SELECT
|
| 4 | GROUP BY sale_date |
one row per calendar day |
| 5 |
SUM(amount) per day |
aggregate across both regions |
| 6 | ORDER BY total_revenue DESC |
best days first |
Output:
| sale_date | total_revenue |
|---|---|
| 2026-05-22 | 32,500 |
| 2026-05-21 | 28,200 |
| 2026-05-20 | 25,800 |
Why this works — concept by concept:
- UNION ALL over UNION — the inputs are disjoint by construction; dedup adds nothing but CPU.
-
Derived table wraps the stack —
(SELECT … UNION ALL SELECT …) all_salesexposes the combined stream as a single relation. - GROUP BY sale_date — collapses the combined stream to one row per day.
-
SUM(amount) aggregates across regions — the post-stack aggregate is the same as you'd get with two separate
GROUP BYs plus an outerSUM. -
Single ORDER BY at the bottom — sorts the final result; any
ORDER BYinside the innerSELECTs would be discarded. -
Cost — two scans + one append + one hash aggregate;
Θ(n + m)end-to-end.
SQL
Topic — union-all
UNION ALL SQL drills
SQL
Topic — aggregation
Aggregation + UNION ALL patterns
4. INTERSECT — rows present in both inputs
intersect in sql — find rows that exist in BOTH inputs
INTERSECT returns rows that appear in both result sets. Where UNION is the OR of two sets, INTERSECT is the AND — keep only the rows present in every input. Default semantics are DISTINCT: even if a row appears 3× in A and 5× in B, the intersection ships it once unless you ask for INTERSECT ALL.
INTERSECT — the canonical AND of two sets.
-
Signature —
SELECT … FROM A INTERSECT SELECT … FROM B. -
Semantics — keep rows that appear in both inputs; treat
NULL = NULLas true (set semantics). -
Default —
DISTINCT; one occurrence per shared row in the output. -
Returns — at most
MIN(distinct_rows(A), distinct_rows(B))rows.
INTERSECT ALL — bag-multiplicity intersection.
-
Semantics — preserves multiplicity: row
rappearsMIN(count_in_A(r), count_in_B(r))times. - Postgres / DB2 / Snowflake — supported.
-
MySQL pre-8.0.31 / SQL Server / Oracle — not supported; emulate via
INNER JOIN+GROUP BY. - Use case — counting overlap between two event streams where each occurrence matters.
Dialect availability.
-
PostgreSQL — full support since 8.4, including
INTERSECT ALL. -
SQL Server — supported since 2005; no
INTERSECT ALL. -
Oracle — supported; no
INTERSECT ALL. -
Snowflake / BigQuery / Databricks — supported, including
INTERSECT ALL(Snowflake). -
MySQL — supported since 8.0.31 (relatively recent); older MySQL must emulate via
INNER JOIN.
Emulating INTERSECT without the operator (older MySQL).
SELECT DISTINCT a.col1, a.col2
FROM A a
INNER JOIN B b
ON a.col1 = b.col1
AND a.col2 = b.col2;
-
The
NULL = NULLgap —INTERSECTtreats twoNULLs as equal;JOIN ON a = btreatsNULL = NULLasUNKNOWNand drops the rows. To emulate set semantics in aJOIN, useIS NOT DISTINCT FROM(Postgres) or<=>(MySQL) — see Blog75 §2.
SQL interview question — users who both purchased AND reviewed
Assume purchases(user_id, product_id) and reviews(user_id, product_id). Return distinct (user_id, product_id) pairs where the user both purchased AND reviewed the same product.
Solution Using INTERSECT
Detailed explanation. The natural set-ops shape: keep (user_id, product_id) pairs that appear in both inputs. The DISTINCT semantics of INTERSECT automatically dedup users who reviewed the same product twice.
Code (PostgreSQL / Snowflake).
SELECT user_id, product_id FROM purchases
INTERSECT
SELECT user_id, product_id FROM reviews
ORDER BY user_id, product_id;
Equivalent in MySQL 8.0.31+.
SELECT user_id, product_id FROM purchases
INTERSECT
SELECT user_id, product_id FROM reviews
ORDER BY user_id, product_id;
Equivalent in older MySQL (no INTERSECT).
SELECT DISTINCT p.user_id, p.product_id
FROM purchases p
INNER JOIN reviews r
ON p.user_id = r.user_id
AND p.product_id = r.product_id
ORDER BY p.user_id, p.product_id;
Step-by-step trace.
| step | relation | outcome |
|---|---|---|
| 1 | Scan purchases projecting (user_id, product_id)
|
left input |
| 2 | Scan reviews projecting (user_id, product_id)
|
right input |
| 3 |
INTERSECT keeps pairs present in both |
DISTINCT set output |
| 4 | ORDER BY user_id, product_id |
sorted final output |
Output:
| user_id | product_id |
|---|---|
| 17 | P-42 |
| 17 | P-88 |
| 42 | P-12 |
Why this works — concept by concept:
- INTERSECT is the AND of two sets — keeps only pairs found in both inputs; perfect for "did both X AND Y".
-
DISTINCT semantics by default — duplicate reviews don't inflate the output; each shared
(user_id, product_id)pair appears once. -
No JOIN needed — the set operator handles the equality check internally and treats
NULLas equal toNULL. -
JOIN emulation is equivalent only when no NULLs exist — if any column can be
NULL, theJOINform silently drops those rows whileINTERSECTpreserves them. -
Cost —
Θ(n + m)for the scans + one hash for the set match; no extra GROUP BY needed.
SQL
Topic — set-operations
INTERSECT SQL drills
SQL
Topic — joins
JOIN SQL problems
5. EXCEPT and MINUS — rows in A but not in B
except sql / minus sql — subtract one result set from another
EXCEPT (called MINUS in Oracle) returns rows that appear in the left input but not in the right. It is the textbook "rows in A and not in B" shape — perfect for diff queries, churn detection, missing-rows audits, and any prompt phrased as "X but not Y."
EXCEPT — subtract one set from another.
-
Signature —
SELECT … FROM A EXCEPT SELECT … FROM B. -
Semantics — keep rows in A that do not appear in B; treat
NULL = NULLas true. -
Default —
DISTINCT; the output has at mostdistinct_rows(A)rows. - Returns — rows present in the left input minus rows present in the right input.
MINUS — Oracle's alias.
-
Oracle uses
MINUShistorically; modern Oracle versions (21c+) also supportEXCEPTas an alias. - DB2 supports both.
-
Other dialects —
EXCEPTonly.
EXCEPT ALL — bag-multiplicity difference.
-
Semantics — if row
rappearsmtimes in A andntimes in B, the output hasMAX(m - n, 0)occurrences. - Postgres / DB2 / Snowflake — supported.
- MySQL 8.0.31+ — supported.
-
SQL Server / Oracle — not supported; emulate via
LEFT JOIN+WHERE b.col IS NULL+GROUP BYwith multiplicity math.
Common use cases.
-
Churn detection —
last_month_active_users EXCEPT this_month_active_users. -
Missing-rows audit —
expected_keys EXCEPT actual_keysreturns the missing IDs. -
Diff queries —
expected_snapshot EXCEPT actual_snapshotplus the reverse direction surfaces both missing and extra rows. - Data quality — find rows in source that didn't land in destination.
EXCEPT vs NOT IN vs NOT EXISTS — three ways to spell the same intent.
-
EXCEPT— set-level subtraction; cleanest when the two inputs share an identical projection. -
NOT IN (SELECT …)— dangerous with NULL: if any row in the subquery hasNULLin the compared column, the entire predicate returnsUNKNOWNand the outer query returns zero rows. -
NOT EXISTS— robust toNULL; correlated subquery form; the canonical alternative toEXCEPT.
SQL interview question — users who were active last month but not this month
Assume last_month_active(user_id) and this_month_active(user_id). Return distinct user_ids of users who were active last month but not this month — i.e. the churned cohort.
Solution Using EXCEPT
Detailed explanation. This is the textbook churn-detection shape: "rows in A and not in B". EXCEPT is the cleanest expression because the projection (user_id) is identical in both inputs and the operator handles NULL safely.
Code (PostgreSQL / SQL Server / Snowflake).
SELECT user_id FROM last_month_active
EXCEPT
SELECT user_id FROM this_month_active
ORDER BY user_id;
Equivalent in Oracle.
SELECT user_id FROM last_month_active
MINUS
SELECT user_id FROM this_month_active
ORDER BY user_id;
Equivalent using NOT EXISTS (every dialect).
SELECT DISTINCT l.user_id
FROM last_month_active l
WHERE NOT EXISTS (
SELECT 1
FROM this_month_active t
WHERE t.user_id = l.user_id
)
ORDER BY l.user_id;
Step-by-step trace.
| step | relation | outcome |
|---|---|---|
| 1 | Scan last_month_active
|
left input |
| 2 | Scan this_month_active
|
right input |
| 3 |
EXCEPT removes any row from left that appears in right |
churned users only |
| 4 | DISTINCT semantics dedup within the left set | one row per churned user |
| 5 | ORDER BY user_id |
sorted final output |
Output:
| user_id |
|---|
| 12 |
| 27 |
| 88 |
Why this works — concept by concept:
- EXCEPT is subtraction at the set level — perfect when both inputs share the same projection and you want "A minus B".
- DISTINCT semantics by default — even if a user appeared twice in last month's table (unlikely but possible), they appear once in the output.
-
NULL-safe by construction — set-semantic
NULL = NULLavoids theNOT INNULL trap that quietly returns zero rows. -
Oracle alias MINUS is identical — same operator, different keyword; both clamp to
DISTINCTby default. - NOT EXISTS is the dialect-portable fallback — works on every engine; just verbose.
-
Cost —
Θ(n + m)for scans + one hash anti-join; the planner often picks the same plan asNOT EXISTS.
SQL
Topic — set-operations
EXCEPT / MINUS SQL drills
SQL
Topic — joins
JOIN + anti-join library
6. Set operations vs JOIN, WHERE IN, and NOT EXISTS
Four shapes, one intent — pick the operator that matches the projection and the NULL story
For "rows in A also in B" and "rows in A not in B", SQL gives you multiple ways to spell the same intent. Knowing the trade-offs — projection, NULL handling, deduplication, dialect availability — is the senior signal interviewers chase.
"Rows in A also in B" — four shapes.
-
INTERSECT— set-level AND; identical projections required;DISTINCTsemantics;NULL = NULLis true. -
INNER JOIN— row-level matching on explicit keys; preserves all of A's columns (and B's, if projected); duplicates inflate;NULL = NULLisUNKNOWN. -
WHERE col IN (SELECT col FROM B)— semi-join; preserves A's row count and columns;NULLin the subquery still works (unlikeNOT IN). -
WHERE EXISTS (SELECT 1 FROM B WHERE B.x = A.x)— correlated semi-join; functionally identical toIN; often the same plan; safe withNULLs in A.
"Rows in A not in B" — four shapes.
-
EXCEPT/MINUS— set-level subtraction; identical projections;DISTINCT;NULL = NULLtrue. -
LEFT JOIN … WHERE B.col IS NULL— anti-join via left join + null filter; classic; verbose; behaves like a set difference only when the join keys cover the full row. -
WHERE col NOT IN (SELECT col FROM B)— dangerous: if any row in B hasNULLin the compared column, the predicate isUNKNOWNand A returns zero rows. -
WHERE NOT EXISTS (SELECT 1 FROM B WHERE B.x = A.x)— robust anti-join; works withNULLs; portable; the canonical replacement forEXCEPTwhen projections diverge.
Decision table.
| Intent | Identical projection? | NULLs present? | Best shape |
|---|---|---|---|
| Rows in A also in B | Yes | Yes | INTERSECT |
| Rows in A also in B | No (need A's columns) | Yes | EXISTS |
| Rows in A not in B | Yes | Yes |
EXCEPT / MINUS
|
| Rows in A not in B | No (need A's columns) | Yes | NOT EXISTS |
| Cross-tabular metrics | No (need both sides' cols) | Either | JOIN |
Why INTERSECT and EXCEPT differ from JOIN on NULL.
-
Set-op semantics —
INTERSECTandEXCEPTuse distinct-row equality, whereNULL = NULLis TRUE. -
JOIN semantics —
JOIN ON a = bevaluates the predicate per row;NULL = NULLis UNKNOWN, so the row is dropped. -
Practical consequence —
INTERSECTkeeps(NULL, 'x')pairs that appear in both inputs;INNER JOIN ON ...drops them. -
Fix for JOIN — use
IS NOT DISTINCT FROM(Postgres) or<=>(MySQL); see Blog75 §2.
UNION ALL vs OR in WHERE.
-
UNION ALLof two selectivities is sometimes faster thanWHERE … OR …when the predicates target different indexes. -
The planner can use a separate index per leg of
UNION ALLand append the results; anORmay force a single index choice or a full scan. -
UNION ALLis also more readable for "rows matching condition A OR condition B with shared projection."
NOT IN vs NOT EXISTS vs EXCEPT — the senior trap.
-
NOT IN (SELECT x FROM B)— ifB.xcontains anyNULL, the result is silently empty. Almost never the right operator unless you explicitly filterWHERE x IS NOT NULLinside. -
NOT EXISTS— null-safe; the canonical anti-join. -
EXCEPT/MINUS— null-safe; requires identical projection. -
Pick
NOT EXISTSfor cross-table anti-joins; pickEXCEPTwhen projections already match.
SQL
Topic — joins
JOIN vs set-op SQL drills
SQL
Topic — set-operations
Set-operations SQL library
7. Interview gotchas — NULL equality, ORDER BY, type widening
The seven bugs interviewers test most often on set operations
Set operations have a small surface area but a long tail of edge cases that fail candidates. These are the seven gotchas reviewers test most often when sql interview questions drift into the set-ops zone.
Gotcha 1 — UNION is silently expensive on large inputs.
-
The bug — reaching for
UNIONwhenUNION ALLwould do; the dedup pass costs a sort or hash over the combined output. -
Symptom — query slows by
Θ(n + m)for no functional benefit on disjoint inputs. -
Fix — default to
UNION ALL; reach forUNIONonly when you actively need a set.
Gotcha 2 — ORDER BY placement is undefined except at the bottom.
-
The bug —
ORDER BYinside an innerSELECTof a set op is either an error or silently ignored. -
Fix — write a single
ORDER BYat the very end of the whole expression; reference the first input's column names or positional indexes.
Gotcha 3 — Column count mismatch errors are easy to introduce.
-
The bug — adding a column to one side of a UNION without adding it (or
NULL) to the other. - Symptom — "each UNION query must have the same number of columns" error.
-
Fix — pad missing columns with explicit
NULL AS <name>; this also documents intent.
Gotcha 4 — Type widening surprises with strings and numbers.
-
The bug — UNION'ing
INTEGERandVARCHARcolumns; Postgres errors, MySQL silently casts toVARCHAR(which then sorts lexicographically —'10'<'2'). - Fix — write explicit casts on both sides; never rely on implicit widening.
Gotcha 5 — NULL equality differs between set ops and JOINs.
-
The behaviour —
INTERSECT/EXCEPTtreatNULL = NULLas true;JOIN ON a = btreats it asUNKNOWN. -
Symptom — rewriting
INTERSECTasINNER JOINsilently drops rows where the join column isNULL. -
Fix — use
IS NOT DISTINCT FROMin the JOIN, or keep the set operator.
Gotcha 6 — NOT IN (subquery) collapses to zero rows on any NULL in the subquery.
-
The bug —
WHERE x NOT IN (SELECT y FROM B)returns zero rows the momentB.ycontains a singleNULL. - Symptom — query returns empty; the trap is silent.
-
Fix — use
NOT EXISTSorEXCEPT; or addWHERE y IS NOT NULLinside the subquery.
Gotcha 7 — Column names come from the first input only.
-
The behaviour —
SELECT a AS x … UNION SELECT b AS y FROM …outputs the column asx;yis discarded. -
Subtle —
ORDER BY ywould error; you must referencexor useORDER BY 1. -
Fix — alias every column in the first
SELECTconsistently; ignore aliases in the rest.
SQL
Topic — set-operations
Set-ops gotcha drills
SQL
Topic — union-all
UNION ALL gotcha drills
Choosing the right set operator (cheat sheet)
A one-screen cheat sheet for using SQL set operations — pick the operator that matches your intent, then confirm the column contract.
| You want to … | Operator | DISTINCT vs ALL | Typical cost |
|---|---|---|---|
| Stack two result sets and keep every row | UNION ALL |
bag (keeps duplicates) | Θ(n + m) |
| Stack two result sets and dedupe | UNION |
set (DISTINCT default) | Θ(n + m) + Θ(dedup) |
| Rows in both inputs | INTERSECT |
set (DISTINCT default) | Θ(n + m) + Θ(hash) |
| Rows in both inputs with multiplicity |
INTERSECT ALL (Postgres / Snowflake) |
bag | Θ(n + m) + Θ(hash) |
| Rows in A but not in B |
EXCEPT (MINUS on Oracle) |
set (DISTINCT default) | Θ(n + m) + Θ(anti-join) |
| Rows in A but not in B with multiplicity |
EXCEPT ALL (Postgres / Snowflake) |
bag | Θ(n + m) + Θ(anti-join) |
| "Rows in A also in B" with extra A columns |
EXISTS (semi-join) |
preserves A | Θ(n + m) + Θ(hash) |
| "Rows in A not in B" with extra A columns |
NOT EXISTS (anti-join) |
preserves A | Θ(n + m) + Θ(hash) |
| Combine two index-friendly predicates |
UNION ALL over OR
|
bag | usually faster than WHERE … OR …
|
| Diff two snapshots in both directions | (A EXCEPT B) UNION ALL (B EXCEPT A) |
bag | symmetric difference |
Frequently asked questions
What's the difference between UNION and UNION ALL in SQL?
UNION stacks two result sets vertically and then drops exact-duplicate rows; UNION ALL stacks them and keeps every row, including duplicates. The semantic difference is set-vs-bag; the performance difference is that UNION adds a DISTINCT pass over the combined output (a sort or hash) while UNION ALL is essentially free beyond the input scans. The canonical interview rule of thumb: default to UNION ALL for performance, and reach for UNION only when you actively need deduplication. On large tables the gap is material — a UNION on 100 million rows can be several times slower than the equivalent UNION ALL for no functional benefit when the inputs are disjoint.
What does INTERSECT do in SQL?
INTERSECT returns rows that appear in both input result sets. It treats the inputs as sets — NULL = NULL is true (unlike JOIN's UNKNOWN), and the default DISTINCT semantics return each shared row exactly once even if it appears multiple times in either input. INTERSECT requires the two inputs to have the same column count and compatible types at each position, and the output column names come from the first input. Use INTERSECT for "rows in both" prompts where the two inputs already share an identical projection; for "rows in both with extra columns from the left input", reach for INNER JOIN or EXISTS instead. Modern PostgreSQL, SQL Server, Oracle, Snowflake, BigQuery, and MySQL 8.0.31+ all support INTERSECT; older MySQL must emulate via INNER JOIN + DISTINCT.
What's the difference between EXCEPT and MINUS in SQL?
EXCEPT (ANSI standard) and MINUS (Oracle's historical keyword) are the same operator: both return rows in the left input that do not appear in the right input, with DISTINCT semantics by default and set-equality NULL = NULL true. Oracle has supported EXCEPT as an alias for MINUS since 21c. PostgreSQL, SQL Server, Snowflake, BigQuery, and MySQL 8.0.31+ all use EXCEPT. The canonical interview phrasing — "rows in A but not in B" — points to EXCEPT / MINUS when both inputs share a projection; for "rows in A not in B with extra A columns," use NOT EXISTS instead. Note that NOT IN (SELECT … FROM B) is not equivalent: if B contains any NULL in the compared column, NOT IN collapses to zero rows silently — EXCEPT and NOT EXISTS are both null-safe.
When should I use UNION vs JOIN in SQL?
UNION (and UNION ALL) combine result sets vertically — stacking rows on top of each other, requiring matching column counts and types. JOIN combines result sets horizontally — matching rows by a key and producing a wider row with columns from both sides. Use UNION / UNION ALL when you want a single column projection but rows from multiple sources (e.g. "all sales from EU + US tables"). Use JOIN when you want columns from multiple tables aligned by a key (e.g. "orders with customer names"). They are not interchangeable: UNION cannot add columns, and JOIN cannot stack rows. The same intent can sometimes be expressed both ways — "users in either cohort A or cohort B" can be cohort_a UNION cohort_b or users LEFT JOIN cohort_a … LEFT JOIN cohort_b … WHERE …, but the UNION form is usually shorter and faster when projections align.
Are UNION, INTERSECT, and EXCEPT NULL-safe?
Yes — all three set operators treat NULL = NULL as true for the purpose of deduplication and equality, which is opposite to how JOIN ON a = b evaluates the same predicate (where NULL = NULL is UNKNOWN and the row is dropped). This makes set operators the right tool when your join key can be NULL and you want NULL-NULL matches to count. The practical consequence: rewriting INTERSECT as INNER JOIN silently drops rows where the join column is NULL; rewriting EXCEPT as LEFT JOIN … WHERE B.x IS NULL has the same problem at a different boundary. If you must use a JOIN, switch to IS NOT DISTINCT FROM (Postgres) or <=> (MySQL) for null-safe equality; otherwise keep the set operator.
Practice on PipeCode
PipeCode ships 450+ data-engineering interview problems — including PostgreSQL-first SQL practice keyed to set operations (UNION, UNION ALL, INTERSECT, EXCEPT/MINUS), anti-join patterns with NOT EXISTS, semi-join patterns with EXISTS / IN, and the dialect quirks that fail candidates who memorise only one engine. Whether you're drilling sql interview questions with answers for sql for data engineers loops or grinding through data engineering interview questions end-to-end, the practice library mirrors the same four-operator taxonomy this guide teaches.
Kick off via Explore practice →; drill the dedicated set-operations SQL lane →; fan out into the UNION ALL lane →; rehearse JOIN SQL problems →; reinforce aggregation + UNION ALL patterns →; widen coverage on the full SQL practice library →.





Top comments (0)