DEV Community

Cover image for SQL UNION vs UNION ALL vs INTERSECT vs EXCEPT
Gowtham Potureddi
Gowtham Potureddi

Posted on

SQL UNION vs UNION ALL vs INTERSECT vs EXCEPT

sql set operations stitch two or more result sets together along the row axis: union sql stacks rows and drops duplicates, union all sql stacks rows and keeps every duplicate, intersect sql keeps only rows that appear in both inputs, and except sql (called minus sql in Oracle) keeps rows from the left input that do not appear in the right. These four operators answer the bulk of the sql interview questions in the "combine result sets" cluster, and the union vs union all distinction alone trips up more candidates than almost any other shape on a SQL whiteboard.

This guide walks through every operator in the sql set operations family that reviewers love to test in data engineering interview questions: the four-way mental model, the column-count + type-compatibility contract that binds every set op, the union vs union all deduplication-and-performance trap, intersect in sql for finding rows present in both inputs, except sql / minus sql for subtraction, how set operators compare to JOIN and WHERE IN for the same intent, and the seven interview gotchas (NULL equality, ORDER BY placement, type widening) that fail most candidates. Every section ends as sql interview questions with answers: a runnable PostgreSQL query, a traced execution, an output table, and a concept-by-concept why this works breakdown — the exact shape sql for data engineers rounds reward when set operations in sql come up.

PipeCode blog header for a SQL set operations tutorial — bold white headline 'UNION · UNION ALL · INTERSECT · EXCEPT' with subtitle 'set operations for data engineers' and a minimal Venn-diagram code snippet on a dark gradient with purple, green, and orange accents and a small pipecode.ai attribution.

When you want hands-on reps immediately after reading, browse set-operations SQL practice →, drill the UNION ALL lane →, sharpen aggregation SQL drills →, rehearse JOIN SQL problems →, or widen coverage on the full SQL practice library →.


On this page


1. Why SQL set operations matter in data engineering interviews

Four operators, one contract — the whole set-ops interview surface in a sentence

The one-sentence invariant: sql set operations stack two result sets vertically — UNION drops duplicates, UNION ALL keeps every row, INTERSECT keeps shared rows, EXCEPT (MINUS in Oracle) keeps rows from the left input absent from the right — and every operator requires the same column count and compatible types between the two inputs. Once you internalise that, every prompt in the union intersect except in sql family becomes "pick the right operator and confirm the column contract."

The four operators at a glance.

  • UNION — vertical stack with DISTINCT semantics; drops duplicates across both inputs. union sql and union in sql are the two most-searched names for this same operator.
  • UNION ALL — vertical stack with bag semantics; keeps every row, including duplicates.
  • INTERSECT — keeps rows that appear in both inputs; treats them as sets (DISTINCT).
  • EXCEPT / MINUS — keeps rows in the left input that do not appear in the right; DISTINCT semantics by default.

The non-negotiable contract every set op enforces.

  • Same number of columns in both SELECT lists — (a, b, c) cannot UNION with (a, b).
  • Compatible types per column positionINTEGER UNION BIGINT widens to BIGINT; INTEGER UNION TEXT errors or silently casts depending on dialect.
  • Column names come from the left input — the second SELECT's names are ignored in the output.
  • One ORDER BY for the whole expression — placed after the final SELECT, not after each input.

Why interviewers love set operations.

  • Every analytic question with "combine these two tables" reduces to a set operations in sql prompt — active vs churned users, this month vs last, paying vs trial; union in sql and intersect in sql are the most-tested variants.
  • The union vs union all performance trap is universal — candidates who reach for UNION instinctively burn an unnecessary dedup pass on every prompt.
  • NULL semantics diverge from JOININTERSECT treats NULL = NULL as true (set-equality); JOIN ON a = b treats it as UNKNOWN. Senior signal.
  • Dialect divergence on EXCEPT vs MINUS — Oracle is the lone holdout with MINUS; every other major engine uses EXCEPT.

What interviewers listen for.

  • Do you name the column contract before writing the query? — "both inputs return (user_id, region) in the same order with compatible types."
  • Do you default to UNION ALL unless deduplication is explicitly required? — the cost-aware answer.
  • Do you mention that EXCEPT and INTERSECT treat NULL as equal when compared to JOIN's UNKNOWN? — bonus points.
  • Do you place ORDER BY at the bottom of the whole expression, not after each input? — basic-but-tested fluency.

Worked example — combine active and churned users into one feed

Detailed explanation. When a stakeholder asks for "all users who interacted with the product in 2026, whether currently active or churned," the natural shape is a vertical stack of two queries. The two inputs already share the same projection — (user_id, status) — so a single set op gives the union of the two cohorts.

Question. From active_users(user_id) and churned_users(user_id), return a deduplicated list of all interacting users with a status label.

Input. active_users slice and churned_users slice.

active_users.user_id
1
2
3
churned_users.user_id
3
4
5

Code.

SELECT user_id, 'active'  AS status FROM active_users
UNION
SELECT user_id, 'churned' AS status FROM churned_users
ORDER BY user_id;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The first SELECT projects every active user with a literal 'active' label.
  2. The second SELECT projects every churned user with a literal 'churned' label.
  3. UNION stacks the two result sets vertically and drops exact-duplicate rows (a row is a duplicate only when every column matches, so (3, 'active') and (3, 'churned') are distinct and both survive).
  4. The single ORDER BY user_id at the bottom sorts the combined output.
  5. Column names come from the first SELECT — the second SELECT's aliases are discarded.

Output.

user_id status
1 active
2 active
3 active
3 churned
4 churned
5 churned

Rule of thumb: when the question says "combine these two cohorts" and the projection already matches, set ops are the cleanest shape; if you'd lose information by deduping, switch to UNION ALL.

SQL
Topic — set-operations
Set operations SQL drills

Practice →

SQL
Topic — union-all
UNION ALL SQL practice

Practice →


2. Anatomy of a set operation — column-count and type-compatibility rules

Diagram of the set operations contract — two SELECT statements side by side with arrows showing the column-count and type-compatibility rules between the two inputs, plus an annotation pointing to the single ORDER BY at the bottom, on a light PipeCode card.

The contract every set op enforces — column count, compatible types, single ORDER BY

Before any deduplication or intersection happens, every set operation enforces the same structural contract on its two inputs. Get the contract right and the four operators behave consistently; get it wrong and you get a dialect-specific cocktail of errors.

Rule 1 — same number of columns in both inputs.

  • Hard requirement — every dialect raises an error if the column counts differ.
  • Error message — Postgres says "each UNION query must have the same number of columns"; MySQL says "The used SELECT statements have a different number of columns"; SQL Server says "All queries combined using a UNION, INTERSECT or EXCEPT operator must have an equal number of expressions in their target lists."
  • Fix — pad the shorter side with NULL literals: SELECT a, b, NULL FROM ….

Rule 2 — column types must be compatible at each position.

  • Position-based, not name-based — column 1 of left input pairs with column 1 of right input regardless of column names.
  • CompatibleINTEGER and BIGINT widen to BIGINT; VARCHAR(10) and VARCHAR(50) widen to VARCHAR(50); DATE and TIMESTAMP widen to TIMESTAMP.
  • IncompatibleINTEGER and DATE error in Postgres; MySQL silently casts; SQL Server errors. Senior fix — write explicit casts in both sides.
  • Type-widening rule — the output type is the "least common supertype" of the two inputs at each column position.

Rule 3 — column names come from the first input.

  • The output's column names equal the first SELECT's aliases — every other input's aliases are ignored.
  • Practical consequence — if you want the output column called status, alias it AS status in the first SELECT.
  • ORDER BY references — must use the first-input column names or positional indexes (ORDER BY 1).

Rule 4 — one ORDER BY for the whole expression.

  • Placement — at the very bottom, after the last SELECT. Postgres / MySQL / Snowflake all enforce this.
  • Per-input ORDER BY — most dialects either error or silently discard ORDER BY placed inside an inner SELECT of a set op (the order is undefined until the final operator runs).
  • Workaround — wrap each input in a CTE with the desired internal order, but that order is not guaranteed to survive the set op anyway.

Rule 5 — parentheses control precedence in chained set ops.

  • INTERSECT binds tighter than UNION and EXCEPT in ANSI SQL — A UNION B INTERSECT C is parsed as A UNION (B INTERSECT C).
  • Always parenthesise when chaining different operators — (A UNION B) INTERSECT C reads as you expect.
  • Dialect divergence — some engines (older MySQL) don't honor parentheses correctly; rewrite as CTEs to force order.

DISTINCT semantics vs ALL modifier — every operator supports both.

  • UNION DISTINCT — same as UNION (default is DISTINCT).
  • UNION ALL — keep duplicates.
  • INTERSECT [DISTINCT] — default is DISTINCT; rows that appear in both inputs, deduplicated.
  • INTERSECT ALL — bag-multiplicity intersection; if a row appears 3 times in A and 5 times in B, it appears 3 times in the output (min of the two counts). Postgres / DB2 / Snowflake support this; MySQL and SQL Server do not.
  • EXCEPT [DISTINCT] — rows in A not in B, deduplicated.
  • EXCEPT ALL — bag-multiplicity difference; if a row appears 5 times in A and 2 times in B, it appears 3 times in the output. Postgres / DB2 / Snowflake support this; MySQL and SQL Server do not.

SQL
Topic — set-operations
Set-operation contract drills

Practice →

SQL
Topic — type-casting
Type-compatibility SQL practice

Practice →


3. UNION vs UNION ALL — deduplication and the performance trap

Diagram of UNION vs UNION ALL — left card shows UNION's de-duplication pass eliminating identical rows between two inputs; right card shows UNION ALL passing every row through unchanged, with arrows annotated for the extra sort/hash cost on the UNION side, on a light PipeCode card.

union vs union all — the single most-tested set-ops fact

The union vs union all distinction is the single most-asked set-operations interview question. The answer in one sentence: UNION performs an implicit DISTINCT over the combined result; UNION ALL skips that step and is materially cheaper on every dialect. Reaching for UNION when you do not need dedup is the textbook sql for data engineers mistake.

UNION — deduplicates the combined result.

  • Semantics — vertical stack, then drop exact-duplicate rows.
  • "Exact duplicate" — every column matches (including NULLs; INTERSECT/UNION treat NULL as equal to NULL).
  • Plan — typically HashAggregate on the union output or Sort + Unique; both add a pass over the data.
  • CostΘ(n + m) for the scan plus Θ((n + m)) for the dedup hash; memory grows with distinct row count.
  • Use it when — you truly need a deduplicated stack and the inputs may overlap (e.g. "all users who ever interacted, once each").

UNION ALL — passes every row through.

  • Semantics — vertical stack with bag semantics; preserves duplicates.
  • Plan — typically Append (Postgres) or Concatenation (SQL Server); essentially free beyond the input scans.
  • CostΘ(n + m) for the scans only; no extra hash or sort.
  • Use it when — the inputs are known to be disjoint (e.g. partitioned tables stitched together) OR when duplicates are meaningful (e.g. event-log unions).

difference between union and union all — the headline contrasts.

  • DeduplicationUNION drops duplicates; UNION ALL keeps them.
  • PerformanceUNION ALL is faster by a Θ(dedup) pass; the gap widens with row count.
  • Output countUNION ≤ sum of input counts; UNION ALL = sum of input counts exactly.
  • NULL handling — both treat NULL = NULL as true for dedup purposes (set semantics, not JOIN semantics).

When UNION ALL is strictly correct.

  • Disjoint inputs — partitioned tables where you know no row exists in both. Stack with UNION ALL; dedup adds zero rows and just costs CPU.
  • Meaningful duplicates — combining event streams where each occurrence matters.
  • Pre-deduplicated inputs — each side already ran DISTINCT upstream; another dedup is wasted work.

When UNION is the right call.

  • Overlapping inputs and you need a set — "users who appear in either cohort, once each".
  • The downstream consumer expects unique rows — reporting tables that should never contain dupes.
  • You have no way to know the inputs are disjoint — the cautious default for ad-hoc analytics.

SQL interview question — total daily revenue across two regions, kept disjoint

Assume eu_sales(sale_date, amount) and us_sales(sale_date, amount) are two disjoint tables (one per region). Return total revenue per day across both regions, sorted descending.

Solution Using UNION ALL + GROUP BY

Detailed explanation. Because EU and US sales never overlap (they are partitioned by region), UNION ALL is the strictly correct stack — every row is unique by construction, and any UNION-style dedup would be wasted work without changing the result.

Code (PostgreSQL).

SELECT sale_date,
       SUM(amount) AS total_revenue
FROM (
    SELECT sale_date, amount FROM eu_sales
    UNION ALL
    SELECT sale_date, amount FROM us_sales
) all_sales
GROUP BY sale_date
ORDER BY total_revenue DESC;
Enter fullscreen mode Exit fullscreen mode

Equivalent in MySQL.

SELECT sale_date,
       SUM(amount) AS total_revenue
FROM (
    SELECT sale_date, amount FROM eu_sales
    UNION ALL
    SELECT sale_date, amount FROM us_sales
) AS all_sales
GROUP BY sale_date
ORDER BY total_revenue DESC;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

step relation outcome
1 Scan eu_sales and us_sales independently two row streams
2 UNION ALL stacks them one combined stream, no dedup
3 Wrapped as all_sales derived table named source for the outer SELECT
4 GROUP BY sale_date one row per calendar day
5 SUM(amount) per day aggregate across both regions
6 ORDER BY total_revenue DESC best days first

Output:

sale_date total_revenue
2026-05-22 32,500
2026-05-21 28,200
2026-05-20 25,800

Why this works — concept by concept:

  • UNION ALL over UNION — the inputs are disjoint by construction; dedup adds nothing but CPU.
  • Derived table wraps the stack(SELECT … UNION ALL SELECT …) all_sales exposes the combined stream as a single relation.
  • GROUP BY sale_date — collapses the combined stream to one row per day.
  • SUM(amount) aggregates across regions — the post-stack aggregate is the same as you'd get with two separate GROUP BYs plus an outer SUM.
  • Single ORDER BY at the bottom — sorts the final result; any ORDER BY inside the inner SELECTs would be discarded.
  • Cost — two scans + one append + one hash aggregate; Θ(n + m) end-to-end.

SQL
Topic — union-all
UNION ALL SQL drills

Practice →

SQL
Topic — aggregation
Aggregation + UNION ALL patterns

Practice →


4. INTERSECT — rows present in both inputs

intersect in sql — find rows that exist in BOTH inputs

INTERSECT returns rows that appear in both result sets. Where UNION is the OR of two sets, INTERSECT is the AND — keep only the rows present in every input. Default semantics are DISTINCT: even if a row appears 3× in A and 5× in B, the intersection ships it once unless you ask for INTERSECT ALL.

INTERSECT — the canonical AND of two sets.

  • SignatureSELECT … FROM A INTERSECT SELECT … FROM B.
  • Semantics — keep rows that appear in both inputs; treat NULL = NULL as true (set semantics).
  • DefaultDISTINCT; one occurrence per shared row in the output.
  • Returns — at most MIN(distinct_rows(A), distinct_rows(B)) rows.

INTERSECT ALL — bag-multiplicity intersection.

  • Semantics — preserves multiplicity: row r appears MIN(count_in_A(r), count_in_B(r)) times.
  • Postgres / DB2 / Snowflake — supported.
  • MySQL pre-8.0.31 / SQL Server / Oracle — not supported; emulate via INNER JOIN + GROUP BY.
  • Use case — counting overlap between two event streams where each occurrence matters.

Dialect availability.

  • PostgreSQL — full support since 8.4, including INTERSECT ALL.
  • SQL Server — supported since 2005; no INTERSECT ALL.
  • Oracle — supported; no INTERSECT ALL.
  • Snowflake / BigQuery / Databricks — supported, including INTERSECT ALL (Snowflake).
  • MySQL — supported since 8.0.31 (relatively recent); older MySQL must emulate via INNER JOIN.

Emulating INTERSECT without the operator (older MySQL).

SELECT DISTINCT a.col1, a.col2
FROM A a
INNER JOIN B b
  ON a.col1 = b.col1
 AND a.col2 = b.col2;
Enter fullscreen mode Exit fullscreen mode
  • The NULL = NULL gapINTERSECT treats two NULLs as equal; JOIN ON a = b treats NULL = NULL as UNKNOWN and drops the rows. To emulate set semantics in a JOIN, use IS NOT DISTINCT FROM (Postgres) or <=> (MySQL) — see Blog75 §2.

SQL interview question — users who both purchased AND reviewed

Assume purchases(user_id, product_id) and reviews(user_id, product_id). Return distinct (user_id, product_id) pairs where the user both purchased AND reviewed the same product.

Solution Using INTERSECT

Detailed explanation. The natural set-ops shape: keep (user_id, product_id) pairs that appear in both inputs. The DISTINCT semantics of INTERSECT automatically dedup users who reviewed the same product twice.

Code (PostgreSQL / Snowflake).

SELECT user_id, product_id FROM purchases
INTERSECT
SELECT user_id, product_id FROM reviews
ORDER BY user_id, product_id;
Enter fullscreen mode Exit fullscreen mode

Equivalent in MySQL 8.0.31+.

SELECT user_id, product_id FROM purchases
INTERSECT
SELECT user_id, product_id FROM reviews
ORDER BY user_id, product_id;
Enter fullscreen mode Exit fullscreen mode

Equivalent in older MySQL (no INTERSECT).

SELECT DISTINCT p.user_id, p.product_id
FROM purchases p
INNER JOIN reviews r
  ON p.user_id    = r.user_id
 AND p.product_id = r.product_id
ORDER BY p.user_id, p.product_id;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

step relation outcome
1 Scan purchases projecting (user_id, product_id) left input
2 Scan reviews projecting (user_id, product_id) right input
3 INTERSECT keeps pairs present in both DISTINCT set output
4 ORDER BY user_id, product_id sorted final output

Output:

user_id product_id
17 P-42
17 P-88
42 P-12

Why this works — concept by concept:

  • INTERSECT is the AND of two sets — keeps only pairs found in both inputs; perfect for "did both X AND Y".
  • DISTINCT semantics by default — duplicate reviews don't inflate the output; each shared (user_id, product_id) pair appears once.
  • No JOIN needed — the set operator handles the equality check internally and treats NULL as equal to NULL.
  • JOIN emulation is equivalent only when no NULLs exist — if any column can be NULL, the JOIN form silently drops those rows while INTERSECT preserves them.
  • CostΘ(n + m) for the scans + one hash for the set match; no extra GROUP BY needed.

SQL
Topic — set-operations
INTERSECT SQL drills

Practice →

SQL
Topic — joins
JOIN SQL problems

Practice →


5. EXCEPT and MINUS — rows in A but not in B

Diagram of EXCEPT — a left circle labelled 'Set A' with a right circle labelled 'Set B' overlapping it; the shaded region is the part of A that does not overlap with B, labelled 'A EXCEPT B'; a small note shows the Oracle MINUS alias and the EXCEPT ALL bag-difference variant, on a light PipeCode card.

except sql / minus sql — subtract one result set from another

EXCEPT (called MINUS in Oracle) returns rows that appear in the left input but not in the right. It is the textbook "rows in A and not in B" shape — perfect for diff queries, churn detection, missing-rows audits, and any prompt phrased as "X but not Y."

EXCEPT — subtract one set from another.

  • SignatureSELECT … FROM A EXCEPT SELECT … FROM B.
  • Semantics — keep rows in A that do not appear in B; treat NULL = NULL as true.
  • DefaultDISTINCT; the output has at most distinct_rows(A) rows.
  • Returns — rows present in the left input minus rows present in the right input.

MINUS — Oracle's alias.

  • Oracle uses MINUS historically; modern Oracle versions (21c+) also support EXCEPT as an alias.
  • DB2 supports both.
  • Other dialectsEXCEPT only.

EXCEPT ALL — bag-multiplicity difference.

  • Semantics — if row r appears m times in A and n times in B, the output has MAX(m - n, 0) occurrences.
  • Postgres / DB2 / Snowflake — supported.
  • MySQL 8.0.31+ — supported.
  • SQL Server / Oracle — not supported; emulate via LEFT JOIN + WHERE b.col IS NULL + GROUP BY with multiplicity math.

Common use cases.

  • Churn detectionlast_month_active_users EXCEPT this_month_active_users.
  • Missing-rows auditexpected_keys EXCEPT actual_keys returns the missing IDs.
  • Diff queriesexpected_snapshot EXCEPT actual_snapshot plus the reverse direction surfaces both missing and extra rows.
  • Data quality — find rows in source that didn't land in destination.

EXCEPT vs NOT IN vs NOT EXISTS — three ways to spell the same intent.

  • EXCEPT — set-level subtraction; cleanest when the two inputs share an identical projection.
  • NOT IN (SELECT …)dangerous with NULL: if any row in the subquery has NULL in the compared column, the entire predicate returns UNKNOWN and the outer query returns zero rows.
  • NOT EXISTS — robust to NULL; correlated subquery form; the canonical alternative to EXCEPT.

SQL interview question — users who were active last month but not this month

Assume last_month_active(user_id) and this_month_active(user_id). Return distinct user_ids of users who were active last month but not this month — i.e. the churned cohort.

Solution Using EXCEPT

Detailed explanation. This is the textbook churn-detection shape: "rows in A and not in B". EXCEPT is the cleanest expression because the projection (user_id) is identical in both inputs and the operator handles NULL safely.

Code (PostgreSQL / SQL Server / Snowflake).

SELECT user_id FROM last_month_active
EXCEPT
SELECT user_id FROM this_month_active
ORDER BY user_id;
Enter fullscreen mode Exit fullscreen mode

Equivalent in Oracle.

SELECT user_id FROM last_month_active
MINUS
SELECT user_id FROM this_month_active
ORDER BY user_id;
Enter fullscreen mode Exit fullscreen mode

Equivalent using NOT EXISTS (every dialect).

SELECT DISTINCT l.user_id
FROM last_month_active l
WHERE NOT EXISTS (
    SELECT 1
    FROM this_month_active t
    WHERE t.user_id = l.user_id
)
ORDER BY l.user_id;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

step relation outcome
1 Scan last_month_active left input
2 Scan this_month_active right input
3 EXCEPT removes any row from left that appears in right churned users only
4 DISTINCT semantics dedup within the left set one row per churned user
5 ORDER BY user_id sorted final output

Output:

user_id
12
27
88

Why this works — concept by concept:

  • EXCEPT is subtraction at the set level — perfect when both inputs share the same projection and you want "A minus B".
  • DISTINCT semantics by default — even if a user appeared twice in last month's table (unlikely but possible), they appear once in the output.
  • NULL-safe by construction — set-semantic NULL = NULL avoids the NOT IN NULL trap that quietly returns zero rows.
  • Oracle alias MINUS is identical — same operator, different keyword; both clamp to DISTINCT by default.
  • NOT EXISTS is the dialect-portable fallback — works on every engine; just verbose.
  • CostΘ(n + m) for scans + one hash anti-join; the planner often picks the same plan as NOT EXISTS.

SQL
Topic — set-operations
EXCEPT / MINUS SQL drills

Practice →

SQL
Topic — joins
JOIN + anti-join library

Practice →


6. Set operations vs JOIN, WHERE IN, and NOT EXISTS

Diagram comparing four shapes for 'find rows in A also in B' — INTERSECT, INNER JOIN, WHERE IN, and EXISTS — as four side-by-side code cards with annotations about NULL handling, deduplication, and performance, on a light PipeCode card.

Four shapes, one intent — pick the operator that matches the projection and the NULL story

For "rows in A also in B" and "rows in A not in B", SQL gives you multiple ways to spell the same intent. Knowing the trade-offs — projection, NULL handling, deduplication, dialect availability — is the senior signal interviewers chase.

"Rows in A also in B" — four shapes.

  • INTERSECT — set-level AND; identical projections required; DISTINCT semantics; NULL = NULL is true.
  • INNER JOIN — row-level matching on explicit keys; preserves all of A's columns (and B's, if projected); duplicates inflate; NULL = NULL is UNKNOWN.
  • WHERE col IN (SELECT col FROM B) — semi-join; preserves A's row count and columns; NULL in the subquery still works (unlike NOT IN).
  • WHERE EXISTS (SELECT 1 FROM B WHERE B.x = A.x) — correlated semi-join; functionally identical to IN; often the same plan; safe with NULLs in A.

"Rows in A not in B" — four shapes.

  • EXCEPT / MINUS — set-level subtraction; identical projections; DISTINCT; NULL = NULL true.
  • LEFT JOIN … WHERE B.col IS NULL — anti-join via left join + null filter; classic; verbose; behaves like a set difference only when the join keys cover the full row.
  • WHERE col NOT IN (SELECT col FROM B)dangerous: if any row in B has NULL in the compared column, the predicate is UNKNOWN and A returns zero rows.
  • WHERE NOT EXISTS (SELECT 1 FROM B WHERE B.x = A.x) — robust anti-join; works with NULLs; portable; the canonical replacement for EXCEPT when projections diverge.

Decision table.

Intent Identical projection? NULLs present? Best shape
Rows in A also in B Yes Yes INTERSECT
Rows in A also in B No (need A's columns) Yes EXISTS
Rows in A not in B Yes Yes EXCEPT / MINUS
Rows in A not in B No (need A's columns) Yes NOT EXISTS
Cross-tabular metrics No (need both sides' cols) Either JOIN

Why INTERSECT and EXCEPT differ from JOIN on NULL.

  • Set-op semanticsINTERSECT and EXCEPT use distinct-row equality, where NULL = NULL is TRUE.
  • JOIN semanticsJOIN ON a = b evaluates the predicate per row; NULL = NULL is UNKNOWN, so the row is dropped.
  • Practical consequenceINTERSECT keeps (NULL, 'x') pairs that appear in both inputs; INNER JOIN ON ... drops them.
  • Fix for JOIN — use IS NOT DISTINCT FROM (Postgres) or <=> (MySQL); see Blog75 §2.

UNION ALL vs OR in WHERE.

  • UNION ALL of two selectivities is sometimes faster than WHERE … OR … when the predicates target different indexes.
  • The planner can use a separate index per leg of UNION ALL and append the results; an OR may force a single index choice or a full scan.
  • UNION ALL is also more readable for "rows matching condition A OR condition B with shared projection."

NOT IN vs NOT EXISTS vs EXCEPT — the senior trap.

  • NOT IN (SELECT x FROM B) — if B.x contains any NULL, the result is silently empty. Almost never the right operator unless you explicitly filter WHERE x IS NOT NULL inside.
  • NOT EXISTS — null-safe; the canonical anti-join.
  • EXCEPT / MINUS — null-safe; requires identical projection.
  • Pick NOT EXISTS for cross-table anti-joins; pick EXCEPT when projections already match.

SQL
Topic — joins
JOIN vs set-op SQL drills

Practice →

SQL
Topic — set-operations
Set-operations SQL library

Practice →


7. Interview gotchas — NULL equality, ORDER BY, type widening

The seven bugs interviewers test most often on set operations

Set operations have a small surface area but a long tail of edge cases that fail candidates. These are the seven gotchas reviewers test most often when sql interview questions drift into the set-ops zone.

Gotcha 1 — UNION is silently expensive on large inputs.

  • The bug — reaching for UNION when UNION ALL would do; the dedup pass costs a sort or hash over the combined output.
  • Symptom — query slows by Θ(n + m) for no functional benefit on disjoint inputs.
  • Fix — default to UNION ALL; reach for UNION only when you actively need a set.

Gotcha 2 — ORDER BY placement is undefined except at the bottom.

  • The bugORDER BY inside an inner SELECT of a set op is either an error or silently ignored.
  • Fix — write a single ORDER BY at the very end of the whole expression; reference the first input's column names or positional indexes.

Gotcha 3 — Column count mismatch errors are easy to introduce.

  • The bug — adding a column to one side of a UNION without adding it (or NULL) to the other.
  • Symptom — "each UNION query must have the same number of columns" error.
  • Fix — pad missing columns with explicit NULL AS <name>; this also documents intent.

Gotcha 4 — Type widening surprises with strings and numbers.

  • The bug — UNION'ing INTEGER and VARCHAR columns; Postgres errors, MySQL silently casts to VARCHAR (which then sorts lexicographically — '10' < '2').
  • Fix — write explicit casts on both sides; never rely on implicit widening.

Gotcha 5 — NULL equality differs between set ops and JOINs.

  • The behaviourINTERSECT/EXCEPT treat NULL = NULL as true; JOIN ON a = b treats it as UNKNOWN.
  • Symptom — rewriting INTERSECT as INNER JOIN silently drops rows where the join column is NULL.
  • Fix — use IS NOT DISTINCT FROM in the JOIN, or keep the set operator.

Gotcha 6 — NOT IN (subquery) collapses to zero rows on any NULL in the subquery.

  • The bugWHERE x NOT IN (SELECT y FROM B) returns zero rows the moment B.y contains a single NULL.
  • Symptom — query returns empty; the trap is silent.
  • Fix — use NOT EXISTS or EXCEPT; or add WHERE y IS NOT NULL inside the subquery.

Gotcha 7 — Column names come from the first input only.

  • The behaviourSELECT a AS x … UNION SELECT b AS y FROM … outputs the column as x; y is discarded.
  • SubtleORDER BY y would error; you must reference x or use ORDER BY 1.
  • Fix — alias every column in the first SELECT consistently; ignore aliases in the rest.

SQL
Topic — set-operations
Set-ops gotcha drills

Practice →

SQL
Topic — union-all
UNION ALL gotcha drills

Practice →


Choosing the right set operator (cheat sheet)

A one-screen cheat sheet for using SQL set operations — pick the operator that matches your intent, then confirm the column contract.

You want to … Operator DISTINCT vs ALL Typical cost
Stack two result sets and keep every row UNION ALL bag (keeps duplicates) Θ(n + m)
Stack two result sets and dedupe UNION set (DISTINCT default) Θ(n + m) + Θ(dedup)
Rows in both inputs INTERSECT set (DISTINCT default) Θ(n + m) + Θ(hash)
Rows in both inputs with multiplicity INTERSECT ALL (Postgres / Snowflake) bag Θ(n + m) + Θ(hash)
Rows in A but not in B EXCEPT (MINUS on Oracle) set (DISTINCT default) Θ(n + m) + Θ(anti-join)
Rows in A but not in B with multiplicity EXCEPT ALL (Postgres / Snowflake) bag Θ(n + m) + Θ(anti-join)
"Rows in A also in B" with extra A columns EXISTS (semi-join) preserves A Θ(n + m) + Θ(hash)
"Rows in A not in B" with extra A columns NOT EXISTS (anti-join) preserves A Θ(n + m) + Θ(hash)
Combine two index-friendly predicates UNION ALL over OR bag usually faster than WHERE … OR …
Diff two snapshots in both directions (A EXCEPT B) UNION ALL (B EXCEPT A) bag symmetric difference

Frequently asked questions

What's the difference between UNION and UNION ALL in SQL?

UNION stacks two result sets vertically and then drops exact-duplicate rows; UNION ALL stacks them and keeps every row, including duplicates. The semantic difference is set-vs-bag; the performance difference is that UNION adds a DISTINCT pass over the combined output (a sort or hash) while UNION ALL is essentially free beyond the input scans. The canonical interview rule of thumb: default to UNION ALL for performance, and reach for UNION only when you actively need deduplication. On large tables the gap is material — a UNION on 100 million rows can be several times slower than the equivalent UNION ALL for no functional benefit when the inputs are disjoint.

What does INTERSECT do in SQL?

INTERSECT returns rows that appear in both input result sets. It treats the inputs as sets — NULL = NULL is true (unlike JOIN's UNKNOWN), and the default DISTINCT semantics return each shared row exactly once even if it appears multiple times in either input. INTERSECT requires the two inputs to have the same column count and compatible types at each position, and the output column names come from the first input. Use INTERSECT for "rows in both" prompts where the two inputs already share an identical projection; for "rows in both with extra columns from the left input", reach for INNER JOIN or EXISTS instead. Modern PostgreSQL, SQL Server, Oracle, Snowflake, BigQuery, and MySQL 8.0.31+ all support INTERSECT; older MySQL must emulate via INNER JOIN + DISTINCT.

What's the difference between EXCEPT and MINUS in SQL?

EXCEPT (ANSI standard) and MINUS (Oracle's historical keyword) are the same operator: both return rows in the left input that do not appear in the right input, with DISTINCT semantics by default and set-equality NULL = NULL true. Oracle has supported EXCEPT as an alias for MINUS since 21c. PostgreSQL, SQL Server, Snowflake, BigQuery, and MySQL 8.0.31+ all use EXCEPT. The canonical interview phrasing — "rows in A but not in B" — points to EXCEPT / MINUS when both inputs share a projection; for "rows in A not in B with extra A columns," use NOT EXISTS instead. Note that NOT IN (SELECT … FROM B) is not equivalent: if B contains any NULL in the compared column, NOT IN collapses to zero rows silently — EXCEPT and NOT EXISTS are both null-safe.

When should I use UNION vs JOIN in SQL?

UNION (and UNION ALL) combine result sets vertically — stacking rows on top of each other, requiring matching column counts and types. JOIN combines result sets horizontally — matching rows by a key and producing a wider row with columns from both sides. Use UNION / UNION ALL when you want a single column projection but rows from multiple sources (e.g. "all sales from EU + US tables"). Use JOIN when you want columns from multiple tables aligned by a key (e.g. "orders with customer names"). They are not interchangeable: UNION cannot add columns, and JOIN cannot stack rows. The same intent can sometimes be expressed both ways — "users in either cohort A or cohort B" can be cohort_a UNION cohort_b or users LEFT JOIN cohort_a … LEFT JOIN cohort_b … WHERE …, but the UNION form is usually shorter and faster when projections align.

Are UNION, INTERSECT, and EXCEPT NULL-safe?

Yes — all three set operators treat NULL = NULL as true for the purpose of deduplication and equality, which is opposite to how JOIN ON a = b evaluates the same predicate (where NULL = NULL is UNKNOWN and the row is dropped). This makes set operators the right tool when your join key can be NULL and you want NULL-NULL matches to count. The practical consequence: rewriting INTERSECT as INNER JOIN silently drops rows where the join column is NULL; rewriting EXCEPT as LEFT JOIN … WHERE B.x IS NULL has the same problem at a different boundary. If you must use a JOIN, switch to IS NOT DISTINCT FROM (Postgres) or <=> (MySQL) for null-safe equality; otherwise keep the set operator.


Practice on PipeCode

PipeCode ships 450+ data-engineering interview problems — including PostgreSQL-first SQL practice keyed to set operations (UNION, UNION ALL, INTERSECT, EXCEPT/MINUS), anti-join patterns with NOT EXISTS, semi-join patterns with EXISTS / IN, and the dialect quirks that fail candidates who memorise only one engine. Whether you're drilling sql interview questions with answers for sql for data engineers loops or grinding through data engineering interview questions end-to-end, the practice library mirrors the same four-operator taxonomy this guide teaches.

Kick off via Explore practice →; drill the dedicated set-operations SQL lane →; fan out into the UNION ALL lane →; rehearse JOIN SQL problems →; reinforce aggregation + UNION ALL patterns →; widen coverage on the full SQL practice library →.

Top comments (0)