sql unique is the most under-rated keyword in the SQL standard — a one-line schema declaration that turns "this column should not have duplicates" from a tribal comment in a Slack channel into a hard rule the database enforces on every write. Interviewers know that engineers who can't separate hard uniqueness (constraint-enforced at insert time) from soft uniqueness (collapsed at query time with ROW_NUMBER() or DISTINCT ON) ship dashboards that double-count, joins that explode row counts, and upserts that randomly trigger duplicate-key errors in production.
This guide walks the entire uniqueness surface end to end — the unique constraint vs primary key decision matrix, composite unique for multi-tenant schemas, partial unique index for soft-delete and only-one-active patterns, the read-time toolkit (row_number deduplication, distinct on, QUALIFY), and the three-dialect upsert matrix (Postgres ON CONFLICT, MySQL ON DUPLICATE KEY UPDATE, MERGE for SQL Server / Snowflake / BigQuery). Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.
When you want hands-on reps immediately after reading, drill the DISTINCT practice library →, rehearse on window-function deduplication problems →, and stack the schema muscles with aggregation drills →.
On this page
- Why uniqueness is the foundation of clean data
- UNIQUE vs PRIMARY KEY — the matrix
- Composite & partial unique indexes
- Hard vs soft deduplication — constraint vs query
- UPSERT — INSERT or UPDATE on duplicate key
- Cheat sheet — uniqueness recipes
- Frequently asked questions
- Practice on PipeCode
1. Why uniqueness is the foundation of clean data
Uniqueness is a contract, not a comment — the schema either enforces it on every write, or you re-pay the bill on every read
The one-sentence invariant: every table has at least one "business key" that must be unique, and the cheapest way to keep it unique is a UNIQUE constraint on the write path — anything you delegate to a query is a tax you pay forever. Once you internalise the difference between hard uniqueness (a write-time constraint the database refuses to violate) and soft uniqueness (a read-time projection that collapses duplicates), the entire family of sql unique and deduplication interview questions becomes a sequence of small choices.
The two flavours.
-
Hard uniqueness — constraint-enforced.
UNIQUE (col)orPRIMARY KEY (col)rejects the offending row at insert time with a duplicate-key error. The duplicate never reaches the table; downstream consumers never see it. -
Soft uniqueness — query-enforced. The table allows duplicates; every read projects through
ROW_NUMBER() OVER (PARTITION BY key ORDER BY recency DESC) WHERE rn = 1orDISTINCT ON (key)to pick the "right" row. The duplicates linger in storage; every read pays the dedup cost.
The three places duplicates enter.
- Source feeds with no contract. Vendors push CSVs that re-include yesterday's rows by accident. Every nightly load creates duplicates unless the load is idempotent.
-
Retries. A network blip causes the producer to re-emit the same event. Without a dedup key, the message lands twice — once under the original
event_idand once under a new one if the producer assigned ids before retry. - Late-arriving CDC. Change-data-capture streams emit one row per change. If the same row updates twice in five seconds, the warehouse holds both versions; the query layer must pick the latest.
Cost of duplicates.
-
Inflated aggregates.
SUM(amount)over a table with 3% duplicates over-reports revenue by 3%. Multiply by an executive bonus tied to that metric and the bug becomes expensive. - Broken joins. A row that appears twice on the right side of a join doubles the left side — row counts explode multiplicatively. The classic "my join returned 10x more rows than I expected" ticket.
- Idempotency failures. Retried writes that re-insert the same logical record cause downstream consumers to double-process. Event-driven systems get particularly hurt because a duplicate trigger looks identical to a fresh trigger.
What interviewers listen for.
- Do you say "uniqueness is a write-time constraint" before "uniqueness is a query-time filter"? — senior signal.
- Do you reach for
UNIQUE (tenant_id, slug)instead ofUNIQUE (slug)when the schema is multi-tenant? — required answer. - Do you mention
ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) = 1when asked how you'd dedup a CDC stream? — required answer. - Do you recognise that Snowflake / BigQuery treat
UNIQUEas informational only and require adbt uniquetest for enforcement? — senior signal.
The 2026 reality.
-
OLTP databases (Postgres, MySQL, SQL Server, Oracle) enforce
UNIQUEandPRIMARY KEYat write time. The constraint is a contract. -
Cloud warehouses (Snowflake, BigQuery, Redshift) accept the syntax but do not enforce it —
UNIQUEis a hint to the optimiser, nothing more. Enforce withdbt test unique(or its equivalent) in the transformation layer. -
SQL:2023 ships
UNIQUE NULLS NOT DISTINCT— the long-missing knob that makes UNIQUE treat NULLs as equal so only one NULL row is allowed (Postgres 15+, DuckDB). -
Partial unique indexes (Postgres, SQL Server filtered indexes, SQLite) let you scope uniqueness to a
WHEREpredicate — the only clean way to express "one active subscription per user, but unlimited cancelled ones."
Worked example — the duplicate-source feed disaster
Detailed explanation. A finance team loads a vendor CSV every night into a transactions table. The vendor's CSV occasionally re-includes the previous day's tail rows because their export script uses a date-window join that overlaps by a few hours. With no UNIQUE constraint, the duplicates slip in. A month later, the revenue number is off by 4% and finance opens a ticket.
Question. Given the transactions table below, show how a missing UNIQUE on (vendor_tx_id) lets duplicates land, and how the constraint would have rejected them at write time.
Input.
| id | vendor_tx_id | amount |
|---|---|---|
| 1 | TX-001 | 100 |
| 2 | TX-002 | 50 |
| 3 | TX-001 | 100 |
| 4 | TX-003 | 75 |
Code.
-- BROKEN — no UNIQUE on vendor_tx_id, duplicate TX-001 sneaks in
CREATE TABLE transactions (
id BIGINT PRIMARY KEY,
vendor_tx_id TEXT NOT NULL,
amount NUMERIC(10, 2) NOT NULL
);
-- FIXED — UNIQUE constraint on the business key rejects duplicate
CREATE TABLE transactions (
id BIGINT PRIMARY KEY,
vendor_tx_id TEXT NOT NULL,
amount NUMERIC(10, 2) NOT NULL,
CONSTRAINT transactions_vendor_tx_id_uk UNIQUE (vendor_tx_id)
);
-- Aggregate that surfaces the bug
SELECT SUM(amount) AS total_revenue
FROM transactions;
Step-by-step explanation.
- The broken schema has only a surrogate
PRIMARY KEY (id).idis generated by the ETL job, so each insert produces a new id even for a duplicatevendor_tx_id. The constraint cannot catch the duplicate because the primary key is a different column. - After loading the CSV, the table contains four rows including the duplicate
TX-001.SUM(amount)returns100 + 50 + 100 + 75 = 325— the duplicate is counted twice. - The fixed schema adds
UNIQUE (vendor_tx_id). The second insert ofTX-001raises a duplicate-key error (23505on Postgres,1062on MySQL). The ETL job sees the error and either skips or upserts — but the duplicate never reaches the table. -
SUM(amount)now returns100 + 50 + 75 = 225— the correct revenue.
Output.
| Schema | total_revenue |
|---|---|
| No UNIQUE (broken) | 325 |
| With UNIQUE (fixed) | 225 |
Rule of thumb. Every dimension table ships with at least one UNIQUE constraint on its business key, even if the primary key is a surrogate id. The constraint is free at write time and saves you a finance ticket nine months later.
Worked example — the retry-storm duplicate
Detailed explanation. A payment service emits a webhook for every charge. The webhook handler writes a row to charges. A network blip causes the handler to retry the same webhook three seconds later. Without a UNIQUE constraint on the idempotency key (stripe_event_id), both inserts succeed and the charge is recorded twice. With the constraint, the second insert raises a duplicate-key error and the handler treats it as "already processed."
Question. Show the schema and the idempotent-insert pattern that survives a retry storm without double-charging the customer.
Input.
| stripe_event_id | charge_id | amount |
|---|---|---|
| evt_1 | ch_a | 100 |
| evt_1 | ch_a | 100 |
| evt_2 | ch_b | 50 |
Code.
CREATE TABLE charges (
id BIGSERIAL PRIMARY KEY,
stripe_event_id TEXT NOT NULL,
charge_id TEXT NOT NULL,
amount NUMERIC(10, 2) NOT NULL,
CONSTRAINT charges_stripe_event_id_uk UNIQUE (stripe_event_id)
);
-- Idempotent insert — second retry no-ops
INSERT INTO charges (stripe_event_id, charge_id, amount)
VALUES ('evt_1', 'ch_a', 100)
ON CONFLICT (stripe_event_id) DO NOTHING;
Step-by-step explanation.
- The first webhook call inserts
evt_1 / ch_a / 100. The row lands; the UNIQUE constraint is satisfied. - The retry fires three seconds later with the same
stripe_event_id. The naiveINSERTwould raise23505 duplicate key value violates unique constraint. The handler would have to catch and ignore the error — workable but ugly. - The idempotent form
ON CONFLICT (stripe_event_id) DO NOTHINGtells Postgres: "if this row already exists, silently skip." No exception, no row added — the handler treats both the first call and the retry identically. - The third call inserts
evt_2 / ch_b / 50. Different idempotency key, different row, no conflict.
Output.
| Final row count | total_amount |
|---|---|
| 2 | 150 |
Rule of thumb. For every external-event ingest, the handler writes its row with ON CONFLICT (idempotency_key) DO NOTHING against a UNIQUE index on that key. The pattern is a one-liner and eliminates an entire class of retry bugs.
SQL interview question on uniqueness-as-contract
A senior interviewer often opens with: "Walk me through every place a duplicate could land in a typical data pipeline — source feed, CDC, retry, late-arriving event, manual backfill — and tell me where you'd put the UNIQUE constraint to catch each one." This blends the schema design instinct with the operational instinct.
Solution Using layered UNIQUE constraints across the pipeline
-- 1) Raw landing zone — minimal constraints, capture everything
CREATE TABLE raw_events (
id BIGSERIAL PRIMARY KEY,
event_id TEXT NOT NULL,
payload JSONB NOT NULL,
_loaded_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- 2) Staging — UNIQUE on the business key + a uniqueness audit
CREATE TABLE stg_events (
event_id TEXT PRIMARY KEY,
user_id BIGINT NOT NULL,
event_type TEXT NOT NULL,
event_ts TIMESTAMPTZ NOT NULL,
_loaded_at TIMESTAMPTZ NOT NULL
);
-- Audit query — how many duplicates landed in the raw zone?
SELECT
COUNT(*) AS raw_rows,
COUNT(DISTINCT event_id) AS unique_event_ids,
COUNT(*) - COUNT(DISTINCT event_id) AS duplicate_count
FROM raw_events;
-- 3) Production warehouse — UNIQUE (composite) + dbt unique test
CREATE TABLE fct_events (
event_id TEXT NOT NULL,
user_id BIGINT NOT NULL,
event_type TEXT NOT NULL,
event_ts TIMESTAMPTZ NOT NULL,
CONSTRAINT fct_events_pk PRIMARY KEY (event_id)
);
Step-by-step trace.
| Layer | Duplicate behaviour | Enforcement |
|---|---|---|
raw_events |
duplicates accepted | none — capture everything, audit later |
stg_events |
duplicates rejected |
PRIMARY KEY (event_id) + idempotent upsert |
fct_events |
duplicates rejected |
PRIMARY KEY (event_id) + dbt unique test (warehouse) |
After the audit query reveals duplicate_count, the team rewrites the staging load as INSERT ... ON CONFLICT (event_id) DO NOTHING against stg_events, so every duplicate raw row no-ops at the staging boundary. The fact table inherits clean data from staging and the dbt test catches any regression introduced by a future schema change.
Output:
| Layer | Row count | Distinct event_id | Duplicates |
|---|---|---|---|
| raw_events | 1,000 | 970 | 30 |
| stg_events | 970 | 970 | 0 |
| fct_events | 970 | 970 | 0 |
Why this works — concept by concept:
- Raw is the truth — the landing zone captures every row including duplicates so that the audit query can quantify the upstream defect rate. Constraining the raw zone hides the problem.
-
Staging is the contract — the
PRIMARY KEY (event_id)constraint plus the idempotent upsert turns "duplicates in raw" into "exactly-once in staging." This is where hard uniqueness lives. -
Production inherits from staging —
fct_eventsdoes not need to re-enforce uniqueness because staging already did; the dbtuniquetest is a safety net that catches schema drift. -
Audit before metric — running
COUNT(*) - COUNT(DISTINCT event_id)against the raw zone surfaces the upstream defect rate without blocking the load. Cheap insurance against a silent 3% inflation. -
Cost —
UNIQUEadds a B-tree index per constrained column (O(log n) insert + O(n) storage). The cost is paid once at write time; the alternative — soft dedup on every read — is paid forever.
SQL
Topic — distinct
DISTINCT and uniqueness problems (SQL)
SQL
Topic — aggregation
Aggregation problems with duplicates (SQL)
2. UNIQUE vs PRIMARY KEY — the matrix
unique constraint and primary key look interchangeable but differ in NULL handling, cardinality, clustering, and FK convention — five rows the interviewer wants you to recite
The mental model in one line: a table has exactly one PRIMARY KEY (which is NOT NULL and conventionally the FK target) and any number of UNIQUE constraints (which allow NULLs and live alongside the PK). Once you internalise that "PK = surrogate identity, UNIQUE = business key," the entire matrix becomes obvious.
The five-row matrix.
| Attribute | UNIQUE | PRIMARY KEY |
|---|---|---|
| NULLs allowed? | Yes — usually treated as distinct so many NULL rows allowed | No — every column is implicitly NOT NULL
|
| Count per table | Unlimited — one per business key | Exactly one |
| Clustered index | No — separate non-clustered index (or row-store entry) | Yes by default in SQL Server / InnoDB |
| Foreign-key target | Yes — works | Yes — the conventional default |
| Dialect NULL quirks | SQL Server allows exactly 1 NULL; Postgres allows many; SQL:2023 adds NULLS NOT DISTINCT | All dialects reject NULL universally |
NULL handling — the most-asked sub-question.
-
Postgres / Oracle / Snowflake / BigQuery / DuckDB — every NULL is distinct from every other NULL by default.
UNIQUE (col)lets you insert ten rows withcol = NULL. - SQL Server — only one NULL is allowed in a UNIQUE column. The reasoning: SQL Server treats NULL as a value for the purposes of indexing, then sees duplicates.
- MySQL — many NULLs allowed (matches Postgres semantics).
-
SQL:2023
UNIQUE NULLS NOT DISTINCT— opt-in syntax that makes NULL collide with NULL, so only one NULL row is allowed. Postgres 15+, DuckDB.
Foreign-key conventions.
- A
FOREIGN KEYcan point at anyUNIQUEorPRIMARY KEYcolumn — the engine just needs an enforced uniqueness guarantee at the target. - Convention is to point at the surrogate
PRIMARY KEYbecause it never changes and is a small integer (cheap to copy and index). - Pointing at a
UNIQUEbusiness key (email,username) is valid but discouraged — if the business key ever changes (a user updates their email), every FK must cascade.
Clustered vs non-clustered.
-
SQL Server / InnoDB (MySQL) — the
PRIMARY KEYis the clustered index by default. The table is physically sorted by the PK; the PK lookup is one B-tree traversal. - UNIQUE constraints in SQL Server / InnoDB create non-clustered indexes — a separate B-tree that maps the unique column to the PK, which then locates the row.
-
Postgres — no clustered indexes by default. All indexes are non-clustered; the heap stores rows in insertion order.
CLUSTERreorders the heap on demand but is rarely used.
Snowflake / BigQuery / Redshift caveats.
- All three accept
UNIQUEandPRIMARY KEYsyntax but do not enforce the constraints. They are informational hints to the query optimiser. - The optimiser may use the hint to skip a
DISTINCTstep or assume cardinality. If your data violates the hint, results may be wrong. - Enforce with
dbt test unique: column_nameor the equivalent in your transformation framework. The test runs on every CI and catches violations within minutes of the regression.
Common interview probes.
- "Can a column be both UNIQUE and a foreign key?" — yes; the UNIQUE makes it a one-to-one FK relationship.
- "How many NULLs can a UNIQUE column have in SQL Server?" — exactly one.
- "What does
UNIQUE NULLS NOT DISTINCTdo?" — treats NULLs as equal, so only one NULL row is allowed. - "Does Snowflake enforce UNIQUE?" — no; use
dbt test unique.
Worked example — same column as PRIMARY KEY vs UNIQUE
Detailed explanation. A signup table has a single email column. The team debates whether email should be the PRIMARY KEY (no surrogate id) or a UNIQUE column on top of a surrogate BIGSERIAL id. The decision changes NULL handling, FK targets, and the rename flow.
Question. Show the two schema choices side by side. Highlight the four practical differences: NULL behaviour, FK target convention, primary key changeability, and operational consequences.
Input. Conceptual — no rows, just two CREATE TABLE statements.
Code.
-- Option A — email as PRIMARY KEY (natural key only)
CREATE TABLE users_a (
email TEXT PRIMARY KEY,
full_name TEXT
);
-- Option B — surrogate id PK + UNIQUE email
CREATE TABLE users_b (
id BIGSERIAL PRIMARY KEY,
email TEXT NOT NULL UNIQUE,
full_name TEXT
);
-- FK pattern downstream
CREATE TABLE orders (
id BIGSERIAL PRIMARY KEY,
-- Option A
user_email TEXT REFERENCES users_a(email),
-- Option B
user_id BIGINT REFERENCES users_b(id)
);
Step-by-step explanation.
- Option A makes
emailthe natural primary key. The column is implicitlyNOT NULL. The downstreamorders.user_emailforeign key references this column directly. - Option B introduces a surrogate
BIGSERIAL idas the primary key.emailisNOT NULL UNIQUE— same uniqueness guarantee but separate from identity. Downstreamorders.user_idreferences the integer id. - If a user updates their email, Option A must
UPDATE users_a SET email = ...and cascade the change to everyorders.user_emailrow. Option B updatesusers_b.emailin place; the surrogateiddoes not change, soorders.user_idis unaffected. - Option B is also more storage-efficient: an 8-byte
BIGINTinorders.user_idis smaller than a 30+ character email inorders.user_email. With 100M order rows the difference is gigabytes. - The standard recommendation: use surrogate PK + UNIQUE business key for any table whose business key might ever change. Reserve natural PKs for immutable codes (ISO country codes, currency codes).
Output.
| Concern | Option A (email as PK) | Option B (id PK + email UNIQUE) |
|---|---|---|
| NULL email allowed? | No | No (UNIQUE + NOT NULL) |
| FK column type in orders | TEXT (email string) | BIGINT |
| Storage per FK row | 30+ bytes | 8 bytes |
| Email rename impact | cascades to every FK | none |
| Recommended for users table | rarely | almost always |
Rule of thumb. Default to surrogate primary key plus UNIQUE business key. Only use a natural primary key when the column is provably immutable and small (currency codes, country codes, ISBN-after-revision-policy).
Worked example — SQL Server's "only one NULL" quirk
Detailed explanation. A team migrates a Postgres schema to SQL Server. In Postgres, UNIQUE (referral_code) allows many NULL referral_code rows because Postgres treats NULLs as distinct. SQL Server's UNIQUE constraint allows exactly one NULL. The migration fails on the first insert with a second NULL.
Question. Show the schema and the cross-dialect difference. Provide the SQL Server workaround using a filtered index.
Input.
| user_id | referral_code |
|---|---|
| 1 | ABC |
| 2 | NULL |
| 3 | NULL |
| 4 | XYZ |
Code.
-- Postgres / MySQL / Oracle — both NULL rows accepted
CREATE TABLE users (
user_id BIGSERIAL PRIMARY KEY,
referral_code TEXT UNIQUE -- allows multiple NULLs
);
-- SQL Server — second NULL fails with 2627 violation
CREATE TABLE users (
user_id BIGINT IDENTITY PRIMARY KEY,
referral_code NVARCHAR(50) UNIQUE -- allows exactly one NULL
);
-- SQL Server workaround — filtered unique index that ignores NULL
CREATE TABLE users (
user_id BIGINT IDENTITY PRIMARY KEY,
referral_code NVARCHAR(50)
);
CREATE UNIQUE INDEX users_referral_code_uk
ON users (referral_code)
WHERE referral_code IS NOT NULL;
Step-by-step explanation.
- In Postgres, inserting
(2, NULL)then(3, NULL)both succeed because the engine treats every NULL as distinct from every other NULL — a different NULL each time. - In SQL Server, the first NULL insert succeeds and SQL Server records "I have a NULL in this UNIQUE column." The second NULL insert raises
2627 Violation of UNIQUE KEY constraint. - The workaround is a filtered unique index —
CREATE UNIQUE INDEX ... WHERE referral_code IS NOT NULL. The index covers only non-NULL rows, so NULLs are entirely outside the uniqueness check and any number are allowed. - The filtered-index trick is the canonical SQL Server pattern for "UNIQUE on a nullable column." It mirrors Postgres semantics and is also the pattern for SQL:2023
NULLS NOT DISTINCT(which Postgres 15+ supports natively).
Output.
| Engine | Insert (2, NULL) | Insert (3, NULL) |
|---|---|---|
| Postgres | OK | OK |
| MySQL | OK | OK |
| SQL Server — vanilla UNIQUE | OK | 2627 violation |
| SQL Server — filtered index | OK | OK |
Rule of thumb. When a column is both nullable and unique, state the NULL behaviour explicitly — NULLS DISTINCT (default in most engines), NULLS NOT DISTINCT (Postgres 15+ / DuckDB), or a filtered unique index (SQL Server / SQLite). Future readers and future engines will thank you.
SQL interview question on the UNIQUE vs PRIMARY KEY decision
A senior interviewer might frame this as: "You're designing a subscriptions table that holds one row per (user_id, plan_id) pair. Walk me through whether you'd use a composite PRIMARY KEY, a surrogate id + UNIQUE, or both — and why."
Solution Using surrogate PK + composite UNIQUE for stability
CREATE TABLE subscriptions (
id BIGSERIAL PRIMARY KEY,
user_id BIGINT NOT NULL REFERENCES users(id),
plan_id BIGINT NOT NULL REFERENCES plans(id),
started_at TIMESTAMPTZ NOT NULL DEFAULT now(),
CONSTRAINT subscriptions_user_plan_uk UNIQUE (user_id, plan_id)
);
Step-by-step trace.
| Decision | Choice | Why |
|---|---|---|
| Primary key | BIGSERIAL id |
small immutable surrogate; stable FK target |
| Business key | UNIQUE (user_id, plan_id) |
hard enforcement of "one row per pair" |
| Downstream FK | REFERENCES subscriptions(id) |
tiny 8-byte FK column |
| Future change tolerance | rename / migrate plan ids without breaking FKs | the surrogate id never changes |
After the schema lands, the team writes the standard INSERT ... ON CONFLICT (user_id, plan_id) DO NOTHING for the signup flow. Duplicate signup attempts no-op cleanly without re-implementing the uniqueness check in application code.
Output:
| Property | Value |
|---|---|
| Primary key column | id |
| Unique business key | (user_id, plan_id) |
| Downstream FK column type | BIGINT |
| Duplicate signup behaviour | silently no-op via ON CONFLICT DO NOTHING |
Why this works — concept by concept:
-
Surrogate PK + business UNIQUE — splits identity (the
id) from meaning (the business key). The id never changes; the business key is the contract. -
Composite UNIQUE —
UNIQUE (user_id, plan_id)says "a (user_id, plan_id) pair is unique" — not user_id alone, not plan_id alone. The composite index supports both the uniqueness check and queries that filter byuser_id(leading column). -
Foreign key target — pointing downstream FKs at
subscriptions.id(8 bytes) instead of(user_id, plan_id)(16+ bytes) saves storage and simplifies joins. -
Idempotent insert —
ON CONFLICT (user_id, plan_id) DO NOTHINGuses the UNIQUE constraint as the conflict target. A retried signup never doubles the row. -
Cost — one B-tree index on
(user_id, plan_id)(O(log n)insert +O(n)storage). Same cost as any composite UNIQUE; pays for itself on the first duplicate.
SQL
Topic — distinct
UNIQUE constraint problems (SQL)
SQL
Topic — joins
JOIN problems on unique keys (SQL)
3. Composite & partial unique indexes
composite unique scopes the constraint to a column tuple; partial unique index scopes it to a row predicate — together they cover every "unique within X" requirement
The mental model in one line: UNIQUE (a, b) makes the tuple unique (rows can share a or b alone, just not both together); UNIQUE INDEX ... WHERE pred makes uniqueness apply only to rows that match pred. Compose the two and you can express any business rule of the form "X must be unique within scope Y."
Composite UNIQUE in detail.
-
Syntax.
UNIQUE (a, b)— the combination ofaandbmust be unique.(1, 'x')and(1, 'y')are both allowed;(1, 'x')twice is not. -
Column order. The order in
UNIQUE (a, b)does not affect the uniqueness semantics —(a, b)is the same set as(b, a). But the underlying index is(a, b), which supports queries that filter byaalone (leading column) but not bybalone. -
Use case 1 — multi-tenant slug uniqueness.
UNIQUE (tenant_id, slug)— a slug like'launch'is unique within a tenant, but two tenants can each have their own'launch'page. -
Use case 2 — daily metric uniqueness.
UNIQUE (date, metric_name)— one row per metric per day. The upsert flow then writesON CONFLICT (date, metric_name) DO UPDATE SET value = EXCLUDED.value.
Partial UNIQUE in detail (Postgres / SQL Server / SQLite / DuckDB).
-
Syntax.
CREATE UNIQUE INDEX name ON table (col) WHERE pred. The uniqueness check applies only to rows wherepredis true. -
Use case 1 — one-active pattern.
CREATE UNIQUE INDEX one_active_sub ON subscriptions (user_id) WHERE is_active = true— at most one active subscription per user, but unlimited cancelled ones. -
Use case 2 — soft-delete-aware uniqueness.
CREATE UNIQUE INDEX unique_email ON users (email) WHERE deleted_at IS NULL— uniqueness applies only to live rows; soft-deleted rows can reuse the email. -
Use case 3 — type-scoped uniqueness.
CREATE UNIQUE INDEX one_primary_address ON addresses (user_id) WHERE is_primary = true— at most one primary address per user, multiple secondary ones allowed.
Expression-based UNIQUE (Postgres / Oracle / SQL Server with computed columns).
-
Syntax.
UNIQUE (LOWER(email))— the result of the expression is unique, not the raw column. -
Use case 1 — case-insensitive uniqueness. Emails
Alice@x.comandalice@x.comshould be considered duplicates.UNIQUE (LOWER(email))enforces it at the schema level. -
Use case 2 — normalised key uniqueness.
UNIQUE (REGEXP_REPLACE(phone, '\D', '', 'g'))— phone numbers are unique after stripping non-digits. -
Use case 3 — temporal bucketing.
UNIQUE (date_trunc('day', created_at), user_id)— at most one event per user per day, regardless of the timestamp's hour.
Snowflake / BigQuery / Redshift caveats.
- All three accept the syntax for
UNIQUE (a, b)but do not enforce it. The constraint is a hint. - Partial unique indexes do not exist in the warehouse-OLAP family — there is no enforced-at-write mechanism. Use
dbt test uniquewith awhereclause:tests: [{ name: unique_active_subs, column: user_id, config: { where: "is_active = true" } }]. - Expression-based UNIQUE — same story; emulate with a dbt test on
lower(email).
MySQL caveats.
- Composite UNIQUE — fully supported, same as Postgres.
-
Partial UNIQUE — not supported. MySQL has no filtered indexes. The standard workaround is a generated column that is NULL when the predicate is false, then UNIQUE on the generated column. Postgres-style
WHEREpartial indexes have to be emulated. -
Expression UNIQUE — supported via generated columns:
email_lower VARCHAR(255) GENERATED ALWAYS AS (LOWER(email)) STORED, UNIQUE (email_lower).
Worked example — multi-tenant slug uniqueness
Detailed explanation. A SaaS product hosts marketing pages for many tenants. Each tenant owns a set of page slugs ('launch', 'pricing', 'about'). The slug is unique within the tenant, not globally — two tenants can each have a 'launch' page. The schema enforces this with UNIQUE (tenant_id, slug).
Question. Show the table, attempt four inserts (two from each tenant), and explain which ones succeed.
Input — intended inserts.
| tenant_id | slug |
|---|---|
| 1 | launch |
| 1 | launch |
| 2 | launch |
| 2 | intro |
Code.
CREATE TABLE pages (
id BIGSERIAL PRIMARY KEY,
tenant_id BIGINT NOT NULL REFERENCES tenants(id),
slug TEXT NOT NULL,
title TEXT NOT NULL,
CONSTRAINT pages_tenant_slug_uk UNIQUE (tenant_id, slug)
);
INSERT INTO pages (tenant_id, slug, title) VALUES (1, 'launch', 'Launch v1');
INSERT INTO pages (tenant_id, slug, title) VALUES (1, 'launch', 'Launch v2'); -- fails
INSERT INTO pages (tenant_id, slug, title) VALUES (2, 'launch', 'Launch');
INSERT INTO pages (tenant_id, slug, title) VALUES (2, 'intro', 'Intro');
Step-by-step explanation.
- The first insert
(1, 'launch')lands. The composite UNIQUE check sees no existing(1, 'launch')row. - The second insert
(1, 'launch')collides. The constraint fires; the insert raises23505 duplicate key value violates unique constraint. No row added. - The third insert
(2, 'launch')lands. The composite key(2, 'launch')is different from(1, 'launch'). Even thoughslug='launch'is "the same value" as row 1, the UNIQUE is on the pair, not the column. - The fourth insert
(2, 'intro')lands. Differentslug, no collision with anything.
Output.
| Row in table | tenant_id | slug | title |
|---|---|---|---|
| 1 | 1 | launch | Launch v1 |
| 2 | 2 | launch | Launch |
| 3 | 2 | intro | Intro |
Rule of thumb. Every multi-tenant table that has a "human-friendly key" (slug, name, code) needs UNIQUE (tenant_id, that_key). Forgetting the tenant_id leading column means tenant A's 'launch' blocks tenant B from creating their own.
Worked example — one-active partial unique index
Detailed explanation. Each user can have many subscriptions rows over their lifetime (one per signup, one per cancellation, one per re-signup). At any given moment, at most one subscription per user should be is_active = true. A partial unique index expresses exactly this — uniqueness on user_id, but only for rows where is_active.
Question. Show the schema and four inserts. Explain which succeed.
Input — intended inserts.
| user_id | is_active | started_at |
|---|---|---|
| 42 | true | 2026-01-01 |
| 42 | true | 2026-02-01 |
| 42 | false | 2025-10-01 |
| 42 | false | 2025-12-01 |
Code.
CREATE TABLE subscriptions (
id BIGSERIAL PRIMARY KEY,
user_id BIGINT NOT NULL,
is_active BOOLEAN NOT NULL,
started_at TIMESTAMPTZ NOT NULL,
cancelled_at TIMESTAMPTZ
);
CREATE UNIQUE INDEX one_active_sub_per_user
ON subscriptions (user_id)
WHERE is_active = true;
-- Inserts
INSERT INTO subscriptions (user_id, is_active, started_at) VALUES (42, true, '2026-01-01');
INSERT INTO subscriptions (user_id, is_active, started_at) VALUES (42, true, '2026-02-01'); -- fails
INSERT INTO subscriptions (user_id, is_active, started_at) VALUES (42, false, '2025-10-01');
INSERT INTO subscriptions (user_id, is_active, started_at) VALUES (42, false, '2025-12-01');
Step-by-step explanation.
- The first insert lands — the
WHERE is_active = truefilter includes this row, but no prior row matches, so the uniqueness check sees no collision. - The second insert collides — both
(42, true, ...)rows would land in the partial index, violatingUNIQUE (user_id)within that subset.23505raised. - The third insert lands —
is_active = false, so the row is outside the partial index. No uniqueness check applies; the row is free to coexist with anything. - The fourth insert lands — also
is_active = false. Multiple cancelled rows per user are allowed.
Output.
| Row in table | user_id | is_active | started_at |
|---|---|---|---|
| 1 | 42 | true | 2026-01-01 |
| 2 | 42 | false | 2025-10-01 |
| 3 | 42 | false | 2025-12-01 |
Rule of thumb. Whenever the business rule is "one X per Y only when condition holds," reach for a partial unique index. Common candidates: one active subscription per user, one primary address per user, one open invoice per project, one in-progress task per worker.
Worked example — case-insensitive email uniqueness
Detailed explanation. Users sign up with email addresses; the product treats Alice@X.com and alice@x.com as the same address. A naive UNIQUE (email) lets both rows coexist. The fix is UNIQUE (LOWER(email)) — the index stores the lowercase form and uniqueness applies to that.
Question. Show the schema and three inserts. Explain which succeed and how to query case-insensitively after the index is in place.
Input — intended inserts.
| Alice@X.com |
| alice@x.com |
| bob@x.com |
Code.
-- Postgres / Oracle / SQL Server (with computed column)
CREATE TABLE users (
id BIGSERIAL PRIMARY KEY,
email TEXT NOT NULL
);
CREATE UNIQUE INDEX users_email_lower_uk ON users (LOWER(email));
INSERT INTO users (email) VALUES ('Alice@X.com');
INSERT INTO users (email) VALUES ('alice@x.com'); -- fails
INSERT INTO users (email) VALUES ('bob@x.com');
-- Case-insensitive lookup uses the index
SELECT * FROM users WHERE LOWER(email) = LOWER('ALICE@X.COM');
Step-by-step explanation.
- The first insert lands. The index stores
lower('Alice@X.com') = 'alice@x.com'. - The second insert computes
lower('alice@x.com') = 'alice@x.com'and collides with the existing index entry.23505raised. - The third insert lands.
lower('bob@x.com') = 'bob@x.com'is new. - The lookup
WHERE LOWER(email) = LOWER('ALICE@X.COM')rewrites both sides to'alice@x.com'. The query planner uses the expression index to locate the row in O(log n) — no full table scan.
Output.
| Row in table | |
|---|---|
| 1 | Alice@X.com |
| 2 | bob@x.com |
Rule of thumb. Any business key that humans type freely (email, username, product code) needs an expression-based UNIQUE that normalises the format. Lowercasing covers 80% of the cases; REGEXP_REPLACE handles the rest. Without it, the dataset accumulates "duplicates" that the application UI shows as distinct accounts.
Worked example — Snowflake informational UNIQUE + dbt test
Detailed explanation. Snowflake accepts UNIQUE syntax but does not enforce it at write time. A team relies on the appearance of the constraint and is surprised when duplicates appear in production after a buggy load. The fix is to add a dbt test unique on the column — the test runs in CI and surfaces the violation before the bad data hits the dashboard.
Question. Show the Snowflake schema, the (non-)enforcement behaviour, and the dbt test that gives you real protection.
Input — intended inserts.
| event_id | user_id | event_ts |
|---|---|---|
| evt_1 | 42 | 2026-06-01 10:00 |
| evt_1 | 42 | 2026-06-01 10:00 |
| evt_2 | 43 | 2026-06-01 10:05 |
Code.
-- Snowflake — UNIQUE accepted but NOT enforced
CREATE TABLE fct_events (
event_id STRING NOT NULL,
user_id NUMBER NOT NULL,
event_ts TIMESTAMP_TZ NOT NULL,
CONSTRAINT fct_events_pk PRIMARY KEY (event_id), -- also informational
CONSTRAINT fct_events_uk UNIQUE (event_id) -- also informational
);
-- All three inserts succeed despite the "constraint"
INSERT INTO fct_events VALUES ('evt_1', 42, '2026-06-01 10:00:00');
INSERT INTO fct_events VALUES ('evt_1', 42, '2026-06-01 10:00:00'); -- duplicates land!
INSERT INTO fct_events VALUES ('evt_2', 43, '2026-06-01 10:05:00');
# dbt test that DOES enforce
models:
- name: fct_events
columns:
- name: event_id
tests:
- unique
- not_null
Step-by-step explanation.
- Snowflake's
CREATE TABLEaccepts bothPRIMARY KEYandUNIQUEsyntax. The Snowflake docs call these "informational integrity constraints" — they exist for the query optimiser and for downstream tools that read the schema metadata. - All three inserts succeed because Snowflake does not run an uniqueness check at write time. The duplicate
evt_1lands silently. - The dbt test runs against the materialised table (
dbt test --select fct_events). It executesSELECT event_id, COUNT(*) FROM fct_events GROUP BY event_id HAVING COUNT(*) > 1and fails the build if any row returns. - The test runs on every CI commit and every scheduled
dbt run, so the duplicate is surfaced within minutes of being introduced — before it reaches the dashboard.
Output.
| Layer | Duplicate behaviour |
|---|---|
Snowflake UNIQUE constraint |
not enforced — duplicates land |
Snowflake PRIMARY KEY constraint |
not enforced — duplicates land |
dbt unique test |
fails CI on duplicate — surfaces the bug |
Rule of thumb. On Snowflake / BigQuery / Redshift, write PRIMARY KEY and UNIQUE constraints as documentation — they help the optimiser and tools — but always pair them with a dbt unique test for real enforcement. The two-line YAML test is the warehouse-equivalent of the OLTP write-time constraint.
SQL interview question on composite + partial uniqueness
A senior interviewer often combines: "Design the constraints for a team_memberships table where each user can be a member of many teams, but only one owner per team. Show the schema and explain every constraint."
Solution Using composite UNIQUE plus partial UNIQUE
CREATE TABLE team_memberships (
id BIGSERIAL PRIMARY KEY,
team_id BIGINT NOT NULL REFERENCES teams(id),
user_id BIGINT NOT NULL REFERENCES users(id),
role TEXT NOT NULL CHECK (role IN ('owner', 'admin', 'member')),
joined_at TIMESTAMPTZ NOT NULL DEFAULT now(),
-- A user is in a team at most once
CONSTRAINT team_memberships_team_user_uk UNIQUE (team_id, user_id)
);
-- At most one owner per team
CREATE UNIQUE INDEX one_owner_per_team
ON team_memberships (team_id)
WHERE role = 'owner';
Step-by-step trace.
| Inserted row | UNIQUE (team_id, user_id) | partial UNIQUE (team_id) WHERE role='owner' | Result |
|---|---|---|---|
| (team=1, user=10, role=owner) | new pair | new owner — first | OK |
| (team=1, user=11, role=member) | new pair | not in partial index | OK |
| (team=1, user=12, role=owner) | new pair | second owner on team 1 → collision | reject |
| (team=2, user=10, role=owner) | new pair | new owner on team 2 | OK |
| (team=1, user=10, role=member) | existing pair → collision | not checked (constraint fired earlier) | reject |
The two constraints compose: composite UNIQUE on (team_id, user_id) prevents the same user from being added twice to the same team; partial UNIQUE on (team_id) WHERE role='owner' ensures at most one owner per team.
Output:
| team_id | user_id | role |
|---|---|---|
| 1 | 10 | owner |
| 1 | 11 | member |
| 2 | 10 | owner |
Why this works — concept by concept:
- Composite UNIQUE (team_id, user_id) — the pair is the business key. Same user in a different team is fine; same user twice in one team is not.
- Partial UNIQUE (team_id) WHERE role='owner' — applies uniqueness only to owner rows. Member and admin rows are outside the index and can repeat.
-
CHECK constraint on role — defends the enum at the schema layer. Application bugs that try to insert
role='superuser'fail loudly. - Two indexes, two business rules — each constraint expresses a single rule. Composing rules in one constraint (e.g. "unique by team_id, user_id, role") would let two owners coexist (different role per row) — the wrong semantics.
- Cost — two B-tree indexes (composite + partial). Combined: O(log n) write, O(n) storage. Negligible vs the cost of an "I see two owners on the team page" bug.
SQL
Topic — conditional logic
Conditional uniqueness problems (SQL)
SQL
Topic — group-by
GROUP BY duplicate-detection problems (SQL)
4. Hard vs soft deduplication — constraint vs query
hard uniqueness rejects duplicates at write time; soft uniqueness collapses them at read time with row_number deduplication or distinct on — pick the right one for the right layer
The mental model in one line: the OLTP source-of-truth ships hard uniqueness (constraint rejects the duplicate); the warehouse staging layer ships soft uniqueness (the duplicate lands but the read picks the latest). Once you choose the layer, the dedup technique follows mechanically.
Hard vs soft — the comparison.
| Property | Hard (constraint) | Soft (query) |
|---|---|---|
| When uniqueness is enforced | at INSERT / UPDATE time | at SELECT time |
| Storage cost | constant — duplicates never land | grows with input volume |
| Read cost | constant — no dedup at query time | window function or DISTINCT on every read |
| Where it shines | OLTP source-of-truth, idempotent writes | warehouse staging, CDC streams, late-arriving data |
| Failure mode | duplicate-key error visible to the writer | silently picks a row — bug surfaces in metrics weeks later |
The soft-dedup toolkit.
-
SELECT DISTINCT col1, col2 FROM t— collapses rows where every selected column is identical. Simple but inflexible: you cannot pick which row wins among the duplicates. -
GROUP BY col1, col2— same row-collapse as DISTINCT but lets you aggregate other columns.SELECT col1, col2, MAX(updated_at) FROM t GROUP BY col1, col2. -
ROW_NUMBER() OVER (PARTITION BY key ORDER BY recency DESC) AS rnthenWHERE rn = 1— the canonical "latest per key" pattern. Works in every dialect. -
DISTINCT ON (key)(Postgres / DuckDB) — Postgres-specific shorthand:SELECT DISTINCT ON (id) * FROM t ORDER BY id, updated_at DESC. Picks the first row peridafter ordering. -
QUALIFY rn = 1(Snowflake / BigQuery / DuckDB / Teradata) — lets you filter on a window function without wrapping in a subquery:SELECT *, ROW_NUMBER() OVER (...) AS rn FROM t QUALIFY rn = 1.
DISTINCT vs GROUP BY.
- Semantically identical — both collapse rows by a unique key. The optimiser usually compiles them to the same plan.
-
Stylistic preference — use
DISTINCTwhen you don't need aggregates; useGROUP BYwhen you do.SELECT DISTINCT id FROM t≡SELECT id FROM t GROUP BY id. - Performance — DISTINCT and GROUP BY have the same cost on modern engines. The myth that "GROUP BY is faster" is a holdover from MySQL pre-5.6.
Choosing between hard and soft.
- OLTP source-of-truth → hard. The application owns the writes; uniqueness is a contract. Examples: user accounts, order ids, payment events.
-
Warehouse staging from CDC → soft. The CDC stream emits one row per change; multiple
_loaded_atversions of the same id are expected; the read picks the latest. -
Vendor CSV with no idempotency key → hard if you trust the vendor to provide a unique identifier, soft otherwise. Add a synthetic UNIQUE on
(vendor, file_date, row_number_in_file)if necessary. - High-throughput log ingest → soft. Hard constraints add per-row index lookups; at 100k rows/sec the cost matters. Land into a heap; dedup on the read.
Worked example — DISTINCT ON vs QUALIFY across dialects.
-
Postgres / DuckDB —
SELECT DISTINCT ON (id) * FROM events ORDER BY id, event_ts DESC— clean one-liner. -
Snowflake / BigQuery / DuckDB / Teradata —
SELECT * FROM events QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY event_ts DESC) = 1. -
MySQL / SQL Server / Oracle (universal) — CTE + ROW_NUMBER:
WITH ranked AS (SELECT *, ROW_NUMBER() OVER (...) AS rn FROM events) SELECT * FROM ranked WHERE rn = 1.
Worked example — dedup a CDC stream with ROW_NUMBER
Detailed explanation. A CDC stream from Postgres emits one row per change to a customers table. The warehouse load picks up changes every five minutes; sometimes a row updates twice within a window and both versions land. The downstream dim_customers should hold only the latest row per id. Window-function dedup is the textbook pattern.
Question. Given the stg_customers table with two _loaded_at versions of id=42, write a query that selects the latest row per id.
Input — stg_customers.
| id | name | _loaded_at |
|---|---|---|
| 42 | Ana | 2026-06-01 10:00 |
| 42 | Anna | 2026-06-01 10:05 |
| 43 | Bo | 2026-06-01 10:01 |
Code.
-- Universal pattern — CTE + ROW_NUMBER
WITH ranked AS (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY _loaded_at DESC) AS rn
FROM stg_customers
)
SELECT id, name, _loaded_at
FROM ranked
WHERE rn = 1;
-- Snowflake / BigQuery / DuckDB — one fewer CTE
SELECT id, name, _loaded_at
FROM stg_customers
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY _loaded_at DESC) = 1;
-- Postgres / DuckDB — DISTINCT ON shortcut
SELECT DISTINCT ON (id) id, name, _loaded_at
FROM stg_customers
ORDER BY id, _loaded_at DESC;
Step-by-step explanation.
-
PARTITION BY idgroups rows by the deduplication key — every distinctidbecomes its own window. -
ORDER BY _loaded_at DESCranks within each window — the row with the most recent_loaded_atgetsrn = 1. -
WHERE rn = 1(orQUALIFY rn = 1) keeps only the top-ranked row per partition. The older version is dropped at read time. -
DISTINCT ON (id)is the Postgres shortcut — theORDER BY id, _loaded_at DESCfirst sorts by the dedup key (required) then by recency.DISTINCT ONpicks the first row after that sort.
Output.
| id | name | _loaded_at |
|---|---|---|
| 42 | Anna | 2026-06-01 10:05 |
| 43 | Bo | 2026-06-01 10:01 |
Rule of thumb. Every staging table from a CDC source or message queue ships with a standard dedup view: ROW_NUMBER() OVER (PARTITION BY business_key ORDER BY ingestion_ts DESC) filtered to rn = 1. Bake the view into the staging-layer template so analysts always query the dedup'd version downstream.
Worked example — DISTINCT vs GROUP BY on a duplicate-heavy table
Detailed explanation. A reporting team wants the list of distinct (user_id, event_type) pairs in the last 30 days. DISTINCT and GROUP BY both work. The DISTINCT form is shorter; the GROUP BY form lets the team add a COUNT(*) later.
Question. Given the events table, list every distinct (user_id, event_type) pair two ways. Show that the results are identical.
Input.
| event_id | user_id | event_type |
|---|---|---|
| 1 | 10 | click |
| 2 | 10 | click |
| 3 | 10 | view |
| 4 | 11 | click |
| 5 | 11 | click |
Code.
-- DISTINCT form
SELECT DISTINCT user_id, event_type
FROM events
WHERE event_ts >= CURRENT_DATE - INTERVAL '30 days';
-- GROUP BY form
SELECT user_id, event_type
FROM events
WHERE event_ts >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY user_id, event_type;
-- GROUP BY with a useful aggregate added on
SELECT user_id, event_type, COUNT(*) AS n
FROM events
WHERE event_ts >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY user_id, event_type;
Step-by-step explanation.
- DISTINCT collapses every row that has identical
(user_id, event_type). Three distinct pairs remain:(10, click),(10, view),(11, click). - GROUP BY produces the same three groups. Without an aggregate, the SELECT list reduces to the group-by columns — identical result to DISTINCT.
- Adding
COUNT(*)to the GROUP BY gives(10, click, 2),(10, view, 1),(11, click, 2). The DISTINCT form cannot add this column. - On a modern optimiser, both queries compile to the same hash-aggregate plan. There is no performance reason to prefer one form.
Output (DISTINCT or plain GROUP BY).
| user_id | event_type |
|---|---|
| 10 | click |
| 10 | view |
| 11 | click |
Rule of thumb. Use DISTINCT when the result is "just the distinct combinations"; switch to GROUP BY the moment you need an aggregate alongside the dedup. Both compile to the same plan.
Worked example — picking the latest row with DISTINCT ON
Detailed explanation. A Postgres-only stack wants to fetch the latest order per customer. DISTINCT ON (customer_id) is the cleanest one-liner: order by (customer_id, created_at DESC) and DISTINCT ON picks the first row per customer_id.
Question. Given the orders table, return the latest order per customer using DISTINCT ON. Show the equivalent universal pattern with ROW_NUMBER.
Input.
| order_id | customer_id | created_at | amount |
|---|---|---|---|
| 1 | 10 | 2026-05-01 | 100 |
| 2 | 10 | 2026-06-01 | 150 |
| 3 | 11 | 2026-05-15 | 200 |
| 4 | 11 | 2026-04-10 | 50 |
Code.
-- Postgres / DuckDB — DISTINCT ON shortcut
SELECT DISTINCT ON (customer_id)
customer_id, order_id, created_at, amount
FROM orders
ORDER BY customer_id, created_at DESC;
-- Universal pattern — same result
WITH ranked AS (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY created_at DESC) AS rn
FROM orders
)
SELECT customer_id, order_id, created_at, amount
FROM ranked
WHERE rn = 1;
Step-by-step explanation.
-
DISTINCT ON (customer_id)tells Postgres: "keep the first row encountered percustomer_id." - The
ORDER BY customer_id, created_at DESCcontrols which row is "first." Sorting bycustomer_idis required (it must match the DISTINCT ON columns); sorting bycreated_at DESCwithin each customer puts the most recent order first. - For
customer_id = 10, the ordered rows are(10, 2, 2026-06-01)and(10, 1, 2026-05-01). DISTINCT ON keeps the first:order_id = 2. - For
customer_id = 11, the ordered rows are(11, 3, 2026-05-15)and(11, 4, 2026-04-10). DISTINCT ON keepsorder_id = 3. - The ROW_NUMBER variant produces identical results via the standard window-function pattern and is the portable choice.
Output.
| customer_id | order_id | created_at | amount |
|---|---|---|---|
| 10 | 2 | 2026-06-01 | 150 |
| 11 | 3 | 2026-05-15 | 200 |
Rule of thumb. On Postgres-only stacks, DISTINCT ON is the cleanest "latest per key" syntax. When the SQL must travel between Postgres, Snowflake, and BigQuery, fall back to the CTE + ROW_NUMBER pattern (or QUALIFY on dialects that support it). All three encode the same plan.
SQL interview question on dedup of a CDC staging table
A senior interviewer often combines: "Your warehouse staging table holds CDC events with possible duplicates per (table_name, primary_key). Write a query that returns the latest event per (table_name, primary_key) across all tables, then explain which dialects offer the cleanest syntax."
Solution Using QUALIFY ROW_NUMBER for cross-dialect dedup
-- Snowflake / BigQuery / DuckDB / Teradata — QUALIFY
SELECT
table_name,
primary_key,
op,
payload,
_loaded_at
FROM stg_cdc
QUALIFY
ROW_NUMBER() OVER (
PARTITION BY table_name, primary_key
ORDER BY _loaded_at DESC
) = 1;
-- Universal portable pattern — CTE + ROW_NUMBER
WITH ranked AS (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY table_name, primary_key
ORDER BY _loaded_at DESC
) AS rn
FROM stg_cdc
)
SELECT table_name, primary_key, op, payload, _loaded_at
FROM ranked
WHERE rn = 1;
Step-by-step trace.
| Row in stg_cdc | table_name | primary_key | _loaded_at | rn |
|---|---|---|---|---|
| 1 | users | 42 | 10:00 | 2 |
| 2 | users | 42 | 10:05 | 1 ✓ |
| 3 | users | 43 | 10:01 | 1 ✓ |
| 4 | orders | 99 | 10:02 | 1 ✓ |
| 5 | orders | 99 | 10:08 | — (depends: actually the newest, so it would be rn = 1, and row 4 becomes rn = 2) |
Re-rank row 4 and 5 since 10:08 > 10:02: row 5 becomes rn = 1 and row 4 becomes rn = 2. The output keeps rn = 1 rows per partition.
Output:
| table_name | primary_key | _loaded_at |
|---|---|---|
| users | 42 | 10:05 |
| users | 43 | 10:01 |
| orders | 99 | 10:08 |
Why this works — concept by concept:
-
PARTITION BY (table_name, primary_key) — the dedup happens per composite key. Two different tables can both have
primary_key = 42without competing for the same row slot. -
ORDER BY _loaded_at DESC — picks the most recent CDC event. If two events share the same
_loaded_at, you'd add a tiebreaker like_lsnor_offsetto make the ranking deterministic. -
QUALIFY = 1 — the same as a CTE-wrapped
WHERE rn = 1but reads top-down: filter the rows whose window-function value equals 1. Available on Snowflake / BigQuery / DuckDB / Teradata. -
Portable CTE form — the CTE +
WHERE rn = 1variant compiles on every dialect. Same plan as QUALIFY; just two more lines of SQL. - Cost — one window-function pass over the staging table: O(n log n) due to the sort inside the partition. Materialise the dedup'd view if multiple downstream queries read from it.
SQL
Topic — window functions
Window-function deduplication problems (SQL)
SQL
Topic — ranking
ROW_NUMBER ranking problems (SQL)
5. UPSERT — INSERT or UPDATE on duplicate key
upsert is "insert if new, update if existing" — three syntaxes (ON CONFLICT, ON DUPLICATE KEY, MERGE) over the same UNIQUE-constraint mechanism
The mental model in one line: every upsert relies on a pre-existing UNIQUE or PRIMARY KEY index — the constraint is the contract that tells the engine "this is the duplicate; here's how to merge". Once you accept that the UNIQUE has to be there first, the dialect-specific syntax is just decoration.
The three syntaxes.
- Postgres / SQLite / DuckDB —
INSERT ... ON CONFLICT (col) DO UPDATE SET ... -
MySQL —
INSERT ... ON DUPLICATE KEY UPDATE col = VALUES(col)(orcol = NEW.colin 8.0.20+) - SQL Server / Oracle / Snowflake / BigQuery / Redshift —
MERGE INTO target USING source ON ... WHEN MATCHED THEN UPDATE ... WHEN NOT MATCHED THEN INSERT ...
Postgres ON CONFLICT in detail.
-
Conflict target —
ON CONFLICT (col)orON CONFLICT ON CONSTRAINT constraint_name. Must reference an existing unique constraint or index. -
Actions —
DO NOTHING(silent idempotent insert) orDO UPDATE SET col = EXCLUDED.col. The pseudo-rowEXCLUDEDholds the values that would have been inserted. -
Conditional update —
ON CONFLICT (col) DO UPDATE SET col = EXCLUDED.col WHERE target.updated_at < EXCLUDED.updated_at. Skip the update if the incoming row is older. - Atomicity — the conflict check and the resulting INSERT or UPDATE happen in a single statement under the same transaction; the operation is race-safe against concurrent inserts.
MySQL ON DUPLICATE KEY UPDATE in detail.
- No explicit conflict target — MySQL uses any UNIQUE or PRIMARY KEY index that the row would violate. Tricky if multiple indexes could match different existing rows — behaviour is undefined.
-
VALUES(col)— references the incoming row's value. In MySQL 8.0.20+, the preferred form isINSERT INTO t (a, b) VALUES (1, 2) AS new ON DUPLICATE KEY UPDATE b = new.b. -
INSERT IGNORE— sibling form: silently skips rows that would violate a constraint. The MySQL equivalent ofON CONFLICT DO NOTHING.
MERGE in detail (SQL Server / Snowflake / BigQuery / Oracle / Redshift).
-
Three clauses —
WHEN MATCHED THEN UPDATE,WHEN NOT MATCHED THEN INSERT,WHEN NOT MATCHED BY SOURCE THEN DELETE(optional, deletes from target rows missing in source). -
Source can be a query —
MERGE INTO target t USING (SELECT ... FROM staging) s ON t.k = s.k WHEN MATCHED THEN UPDATE SET .... Powerful for bulk upserts from staging. -
Atomicity caveat — SQL Server — MERGE is not atomic against concurrent inserts without
HOLDLOCKorWITH (SERIALIZABLE)hints. Documented Microsoft KB article; bites teams that assume MERGE behaves like ON CONFLICT. - Atomicity caveat — BigQuery / Snowflake — both serialise MERGE statements; race conditions are absent at the SQL layer, but the operation is more expensive (full table scan in some cases).
WHERE target IS DISTINCT FROM source — the no-op update optimisation.
- A naive
MERGEorON CONFLICT DO UPDATErewrites every matched row even when the new value equals the old value. - The result is wasted writes, bloated WAL, and unnecessary downstream change-feed events.
- The optimisation:
WHEN MATCHED AND t.val IS DISTINCT FROM s.val THEN UPDATE SET val = s.val(Postgres / Snowflake). Same idea inON CONFLICT (col) DO UPDATE SET val = EXCLUDED.val WHERE target.val IS DISTINCT FROM EXCLUDED.val.
Common interview probes.
- "Walk me through
INSERT ... ON CONFLICT (col) DO UPDATE SET val = EXCLUDED.val. What does EXCLUDED refer to?" — the pseudo-row of values that would have been inserted. - "Why is
INSERT ON CONFLICTpreferred over MERGE on SQL Server?" — MERGE is not race-safe without HOLDLOCK; ON CONFLICT (Postgres) is atomic by design. - "What does
WHEN NOT MATCHED BY SOURCE THEN DELETEdo?" — deletes rows from the target that have no matching row in the source. The full "make target look like source" operation.
Worked example — Postgres ON CONFLICT for a daily metrics roll-up
Detailed explanation. A daily ETL job computes daily_metrics (date, metric_name, value) and upserts into a metrics table keyed on (date, metric_name). The upsert pattern uses ON CONFLICT (date, metric_name) DO UPDATE SET value = EXCLUDED.value so that re-running the job (same day, same metric) updates the row in place instead of duplicating.
Question. Show the schema, the upsert statement, and the result after two runs (one with a typo, one with the fix).
Input — first run.
| date | metric_name | value |
|---|---|---|
| 2026-06-01 | revenue | 1000 |
| 2026-06-01 | signups | 50 |
Input — second run (revenue corrected).
| date | metric_name | value |
|---|---|---|
| 2026-06-01 | revenue | 1200 |
Code.
CREATE TABLE metrics (
date DATE NOT NULL,
metric_name TEXT NOT NULL,
value NUMERIC(12, 2) NOT NULL,
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
CONSTRAINT metrics_pk PRIMARY KEY (date, metric_name)
);
-- First run
INSERT INTO metrics (date, metric_name, value)
VALUES ('2026-06-01', 'revenue', 1000),
('2026-06-01', 'signups', 50)
ON CONFLICT (date, metric_name) DO UPDATE
SET value = EXCLUDED.value,
updated_at = now();
-- Second run — same key, corrected value
INSERT INTO metrics (date, metric_name, value)
VALUES ('2026-06-01', 'revenue', 1200)
ON CONFLICT (date, metric_name) DO UPDATE
SET value = EXCLUDED.value,
updated_at = now();
Step-by-step explanation.
- The first run inserts two rows. No conflicts because the table is empty. The
ON CONFLICTclause is dormant. - The second run attempts to insert
(2026-06-01, 'revenue', 1200). The PRIMARY KEY constraint on(date, metric_name)flags the conflict because that pair already exists. -
DO UPDATE SET value = EXCLUDED.valueruns.EXCLUDEDholds the new row's values;valuebecomes1200.updated_atis bumped tonow(). - The
signupsrow is untouched — no conflict and the statement only operates on the rows it tried to insert.
Output.
| date | metric_name | value | updated_at |
|---|---|---|---|
| 2026-06-01 | revenue | 1200 | run 2 timestamp |
| 2026-06-01 | signups | 50 | run 1 timestamp |
Rule of thumb. Every metrics or roll-up table ships with INSERT ... ON CONFLICT (key_columns) DO UPDATE SET .... Re-running the job is then safe and idempotent — the same input produces the same final state regardless of how many times it ran. Backfills become trivial.
Worked example — MySQL ON DUPLICATE KEY UPDATE
Detailed explanation. A MySQL-backed analytics service stores hit counters per (page_id, day). Every page view triggers an upsert that increments the count. The ON DUPLICATE KEY UPDATE form is MySQL's native syntax.
Question. Show the table, the upsert statement, and the resulting row counts after 3 page views to page 42 on the same day.
Input — three calls.
| page_id | day | delta |
|---|---|---|
| 42 | 2026-06-01 | 1 |
| 42 | 2026-06-01 | 1 |
| 42 | 2026-06-01 | 1 |
Code.
CREATE TABLE page_views (
page_id BIGINT NOT NULL,
day DATE NOT NULL,
view_count BIGINT NOT NULL,
PRIMARY KEY (page_id, day)
);
-- Upsert with increment (MySQL 8.0.20+)
INSERT INTO page_views (page_id, day, view_count)
VALUES (42, '2026-06-01', 1) AS new
ON DUPLICATE KEY UPDATE
view_count = page_views.view_count + new.view_count;
Step-by-step explanation.
- First call:
(42, '2026-06-01')does not exist. The INSERT lands;view_count = 1. - Second call:
(42, '2026-06-01')exists. The PRIMARY KEY collision fires.ON DUPLICATE KEY UPDATEruns:view_count = page_views.view_count + new.view_count = 1 + 1 = 2. - Third call: same conflict.
view_count = 2 + 1 = 3. - The increment pattern
target_col = target.target_col + new.target_colis the canonical "counter upsert" in MySQL. Older syntax usedVALUES(view_count)instead ofnew.view_count.
Output.
| page_id | day | view_count |
|---|---|---|
| 42 | 2026-06-01 | 3 |
Rule of thumb. Counter tables (page views, click counts, daily aggregates) live or die by the upsert-with-increment pattern. Use MySQL's INSERT ... AS new ON DUPLICATE KEY UPDATE col = target.col + new.col or the equivalent Postgres ON CONFLICT (key) DO UPDATE SET col = target.col + EXCLUDED.col. The pattern is race-safe and idempotent at the SQL layer.
Worked example — SQL Server MERGE with HOLDLOCK
Detailed explanation. SQL Server's MERGE statement is the canonical "upsert from staging" pattern, but it is not race-safe against concurrent inserts without the HOLDLOCK hint. Without HOLDLOCK, two concurrent MERGEs can both see "no match" for the same key, then both insert — leaving the table with a duplicate. The fix is one keyword.
Question. Show the MERGE statement, the race condition without HOLDLOCK, and the fixed form with HOLDLOCK.
Input — staging table.
| account_id | balance |
|---|---|
| 1 | 100 |
| 2 | 200 |
Code.
-- BROKEN under concurrency — duplicates possible
MERGE INTO accounts AS target
USING staging_accounts AS source
ON target.account_id = source.account_id
WHEN MATCHED THEN
UPDATE SET balance = source.balance
WHEN NOT MATCHED THEN
INSERT (account_id, balance) VALUES (source.account_id, source.balance);
-- FIXED — HOLDLOCK ensures range lock on the conflict target
MERGE INTO accounts WITH (HOLDLOCK) AS target
USING staging_accounts AS source
ON target.account_id = source.account_id
WHEN MATCHED THEN
UPDATE SET balance = source.balance
WHEN NOT MATCHED THEN
INSERT (account_id, balance) VALUES (source.account_id, source.balance);
Step-by-step explanation.
- Without
HOLDLOCK, MERGE takes only update locks on matched rows and shared locks on the conflict check. Two concurrent transactions can both see "no row foraccount_id=3" simultaneously, then both INSERT — producing a duplicate that the PRIMARY KEY constraint may or may not catch (depending on whether one transaction commits first). -
WITH (HOLDLOCK)requestsSERIALIZABLEisolation for the target table — the engine takes a range lock over the conflict-checked keys. Concurrent MERGEs serialise; only one runs at a time on overlapping keys. - The fix is universal: every SQL Server MERGE statement on a hot-path upsert ships with
WITH (HOLDLOCK). The cost is a small concurrency reduction; the benefit is correctness. - The Postgres
INSERT ... ON CONFLICTform does not need this hint — Postgres's conflict check is atomic by design.
Output (single MERGE, no race).
| account_id | balance |
|---|---|
| 1 | 100 |
| 2 | 200 |
Rule of thumb. On SQL Server, never write MERGE without WITH (HOLDLOCK) unless you've proven via load tests that the workload is serial. The duplicate-row race is documented Microsoft behaviour and shows up in every "production has unexpected duplicates" post-mortem.
Worked example — MERGE for snapshot reconciliation (Snowflake)
Detailed explanation. A Snowflake dimension table dim_customers should reflect the daily snapshot stg_customers_daily exactly — new rows inserted, changed rows updated, deleted rows removed. The MERGE statement with all three WHEN clauses is the textbook pattern.
Question. Show the MERGE statement and explain each WHEN clause.
Input. Daily snapshot vs current dim — three deltas (added, changed, removed).
Code.
MERGE INTO dim_customers AS t
USING stg_customers_daily AS s
ON t.customer_id = s.customer_id
WHEN MATCHED AND (
t.name IS DISTINCT FROM s.name OR
t.email IS DISTINCT FROM s.email
) THEN UPDATE SET
name = s.name,
email = s.email,
updated_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN INSERT (
customer_id, name, email, created_at, updated_at
) VALUES (
s.customer_id, s.name, s.email, CURRENT_TIMESTAMP(), CURRENT_TIMESTAMP()
)
WHEN NOT MATCHED BY SOURCE THEN DELETE;
Step-by-step explanation.
-
WHEN MATCHED AND (t.name IS DISTINCT FROM s.name OR ...)— update only when at least one column actually changed. TheIS DISTINCT FROMis NULL-safe inequality, so a NULL on either side is handled correctly. Without this guard, every row would be rewritten on every run, bloating change-tracking and wasting compute. -
WHEN NOT MATCHED THEN INSERT— new customers in the snapshot land as fresh rows in the dim.created_atandupdated_atboth set on insert. -
WHEN NOT MATCHED BY SOURCE THEN DELETE— customers that disappeared from the snapshot are removed from the dim. This makesdim_customersa full mirror of the snapshot. Omit this clause if the snapshot is incremental and old customers should stick around. - All three clauses share the
ON t.customer_id = s.customer_idjoin condition — the UNIQUE / PRIMARY KEY oncustomer_idis the contract.
Output. The dim table becomes a row-for-row mirror of the snapshot after the MERGE completes.
Rule of thumb. Three-clause MERGE (WHEN MATCHED, WHEN NOT MATCHED, WHEN NOT MATCHED BY SOURCE) is the canonical "make target equal source" pattern. Always guard the UPDATE clause with IS DISTINCT FROM to skip no-op writes. Only add WHEN NOT MATCHED BY SOURCE THEN DELETE when the source is a full snapshot; never on an incremental load.
SQL interview question on safe daily-roll-up upserts
A senior interviewer often combines: "Your team runs a daily metrics ETL that may re-run on backfill or after a partial failure. Write the upsert statement that updates only changed rows, never duplicates, and is safe to re-run any number of times."
Solution Using ON CONFLICT with IS DISTINCT FROM guard
INSERT INTO daily_metrics (date, metric_name, value, computed_at)
SELECT date, metric_name, value, now()
FROM staging_daily_metrics
ON CONFLICT (date, metric_name) DO UPDATE
SET value = EXCLUDED.value,
computed_at = EXCLUDED.computed_at
WHERE daily_metrics.value IS DISTINCT FROM EXCLUDED.value;
Step-by-step trace.
| Staging row | Target row | Conflict? | Action |
|---|---|---|---|
| (2026-06-01, revenue, 1000) | (none) | no | INSERT |
| (2026-06-01, revenue, 1000) | (2026-06-01, revenue, 1000) | yes | guard FALSE → no-op |
| (2026-06-01, revenue, 1200) | (2026-06-01, revenue, 1000) | yes | guard TRUE → UPDATE |
| (2026-06-02, signups, 75) | (2026-06-02, signups, NULL) | yes | guard TRUE (NULL DISTINCT FROM 75) → UPDATE |
| (2026-06-03, ctr, NULL) | (2026-06-03, ctr, NULL) | yes | guard FALSE (NULL IS NOT DISTINCT FROM NULL) → no-op |
The IS DISTINCT FROM guard is the no-op skip — only rewrite rows whose value actually changed. Re-running the same backfill three times produces zero writes after the first; the table stabilises and downstream change-data-capture stays quiet.
Output:
| Property | Value |
|---|---|
| First run writes | N rows (every row inserted) |
| Re-run same input writes | 0 rows (guard FALSE everywhere) |
| Re-run with changed values writes | only changed rows |
| Final table state | identical regardless of re-run count |
Why this works — concept by concept:
-
ON CONFLICT (date, metric_name) — uses the PRIMARY KEY (
(date, metric_name)) as the conflict target. The constraint must exist before the upsert is written. -
EXCLUDED — the pseudo-row holding the values from the would-have-inserted row.
EXCLUDED.valueis the new value;daily_metrics.valueis the existing target value. - IS DISTINCT FROM — NULL-safe inequality. Treats two NULLs as "the same" so a NULL → NULL re-run is a no-op; treats NULL vs non-NULL as "different" so a NULL → 75 update fires.
- Idempotency by design — same input ⇒ same final state, regardless of how many times the statement runs. Backfills become safe; partial failures become safe; retries become safe.
-
Cost — one B-tree lookup per row to check the conflict; one no-op or one write per row. The
IS DISTINCT FROMguard cuts no-op write cost to zero in steady state — the bulk of the daily run skips the table entirely.
SQL
Topic — CTE
Upsert and merge problems (SQL)
SQL
Topic — anti-join
Anti-join and MERGE delete patterns (SQL)
Cheat sheet — uniqueness recipes
-
One UNIQUE per business key.
UNIQUE (tenant_id, slug)on every multi-tenant "human-friendly key" — slug, username, code, name. -
Case-insensitive uniqueness.
UNIQUE (LOWER(email))(Postgres / Oracle), oremail_lower GENERATED ALWAYS AS (LOWER(email)) STORED, UNIQUE (email_lower)(MySQL). -
Only-active uniqueness.
CREATE UNIQUE INDEX one_active_sub ON subs(user_id) WHERE is_active = true(Postgres / SQL Server filtered index / SQLite / DuckDB). -
Soft-delete-aware uniqueness.
CREATE UNIQUE INDEX unique_email ON users(email) WHERE deleted_at IS NULL. -
NULL-tolerant unique.
UNIQUE NULLS NOT DISTINCT (col)(Postgres 15+ / DuckDB) — only one NULL row allowed. -
Surrogate PK + business UNIQUE.
id BIGSERIAL PRIMARY KEY, UNIQUE (business_key)— the default for any table whose business key might ever change. -
Latest-per-key dedup (universal).
ROW_NUMBER() OVER (PARTITION BY key ORDER BY updated_at DESC) AS rnthenWHERE rn = 1. -
Latest-per-key dedup (Postgres / DuckDB).
SELECT DISTINCT ON (key) * FROM t ORDER BY key, updated_at DESC. -
Latest-per-key dedup (Snowflake / BigQuery / DuckDB / Teradata).
SELECT * FROM t QUALIFY ROW_NUMBER() OVER (...) = 1. -
Idempotent insert (Postgres).
INSERT ... ON CONFLICT (key) DO NOTHING. (MySQL:INSERT IGNORE. SQL Server:IF NOT EXISTS ... INSERT ...under HOLDLOCK.) -
Upsert (Postgres).
INSERT ... ON CONFLICT (key) DO UPDATE SET col = EXCLUDED.col WHERE target.col IS DISTINCT FROM EXCLUDED.col. -
Upsert (MySQL 8.0.20+).
INSERT ... AS new ON DUPLICATE KEY UPDATE col = target.col + new.col(for counters) orcol = new.col(for replacements). -
Upsert (SQL Server / Snowflake / BigQuery).
MERGE INTO target [WITH (HOLDLOCK)] USING source ON ... WHEN MATCHED THEN UPDATE WHEN NOT MATCHED THEN INSERT. -
Warehouse enforcement fallback.
dbt test unique: column_name(andwhere:for partial uniqueness) — the standard for Snowflake / BigQuery / Redshift where constraints are informational. -
Duplicate audit.
COUNT(*) - COUNT(DISTINCT key)to count duplicates per column;SELECT key, COUNT(*) FROM t GROUP BY key HAVING COUNT(*) > 1to list them.
Frequently asked questions
What's the difference between UNIQUE and PRIMARY KEY in SQL?
A PRIMARY KEY is a special UNIQUE constraint with three extra rules: every column is implicitly NOT NULL, a table has exactly one primary key, and the primary key is the conventional foreign-key target. A regular UNIQUE constraint allows NULLs (one or many depending on the dialect — SQL Server: one; Postgres / MySQL: many), permits unlimited UNIQUE constraints per table, and is a valid FK target. Use the primary key for the surrogate identity (id BIGSERIAL); use UNIQUE constraints for every business key (email, slug, (tenant_id, name) pair). Both compile to a B-tree index — they are siblings, not opposites.
Can a UNIQUE column have NULL values in SQL?
Yes — usually many. Postgres, MySQL, Oracle, Snowflake, BigQuery, and DuckDB treat every NULL as distinct from every other NULL by default, so a UNIQUE (col) column can hold any number of NULL rows. SQL Server is the exception: it allows exactly one NULL in a UNIQUE column. The workaround on SQL Server (and now natively on Postgres 15+) is UNIQUE NULLS NOT DISTINCT or a filtered index CREATE UNIQUE INDEX ... WHERE col IS NOT NULL. State the NULL behaviour explicitly in the DDL — future readers and future engines will not have to guess.
How do I make a UNIQUE constraint only apply to active rows?
Use a partial unique index (also called a filtered index in SQL Server). The syntax is CREATE UNIQUE INDEX one_active_sub ON subscriptions (user_id) WHERE is_active = true on Postgres / SQL Server / SQLite / DuckDB. The index only covers rows where is_active = true, so uniqueness is enforced within that subset — at most one active subscription per user — while unlimited cancelled rows can coexist. MySQL does not support partial indexes directly; the workaround is a generated column that is NULL when the row is inactive, combined with a UNIQUE constraint on the generated column. Snowflake / BigQuery / Redshift enforce neither — use a dbt unique test with a where: clause.
Should I use DISTINCT or ROW_NUMBER for deduplication in SQL?
Use DISTINCT when you only want the distinct combination of selected columns and don't care which row "wins" among the duplicates. Use ROW_NUMBER() OVER (PARTITION BY key ORDER BY recency DESC) when you need to pick a specific row per key — typically the latest by timestamp. The ROW_NUMBER pattern is the canonical "deduplicate a CDC stream" or "latest per customer" idiom and works in every dialect. Postgres / DuckDB also offer DISTINCT ON (key) as a shorter syntax; Snowflake / BigQuery / DuckDB / Teradata offer QUALIFY ROW_NUMBER() OVER (...) = 1 to skip the CTE wrapper. All three encode the same plan.
Does Snowflake enforce UNIQUE constraints?
No — Snowflake, BigQuery, and Redshift accept UNIQUE and PRIMARY KEY syntax but do not enforce the constraints at write time. The constraints are informational hints that help the query optimiser and downstream tools understand the schema. Duplicates can land silently in production. The canonical fix is a dbt test unique: column_name in the model YAML — the test runs in CI and surfaces violations within minutes of the regression. Pair every informational UNIQUE with a dbt test (or an equivalent in your transformation framework) for real protection. On Snowflake specifically, you can also rely on streams + tasks to audit uniqueness on a schedule, but the dbt test is simpler and runs on every CI commit.
What's the safest way to upsert in Postgres?
INSERT INTO t (key, val) VALUES (...) ON CONFLICT (key) DO UPDATE SET val = EXCLUDED.val WHERE t.val IS DISTINCT FROM EXCLUDED.val. This pattern is atomic by design (Postgres's conflict check and the resulting INSERT or UPDATE happen in one statement), idempotent (same input → same final state regardless of retries), and efficient (the IS DISTINCT FROM guard skips no-op rewrites that would otherwise bloat the WAL and noise the change-data-capture stream). The conflict target (key) must reference an existing UNIQUE or PRIMARY KEY constraint — the upsert mechanism leans on the constraint, and you cannot upsert without one. For idempotent insert-only flows, drop the SET clause: ON CONFLICT (key) DO NOTHING.
Practice on PipeCode
- Drill the DISTINCT and uniqueness practice library → for the COUNT-DISTINCT and dedup family of probes.
- Rehearse on window-function deduplication problems → when the interviewer wants ROW_NUMBER / RANK on a CDC table.
- Sharpen ranking drills → for the "latest per key" and tiebreaker patterns.
- Layer the aggregation library → for the COUNT(*) vs COUNT(DISTINCT col) family.
- Stack the GROUP BY library → for the duplicate-detection idiom.
- Stack the joins practice library → for upsert-from-staging and anti-join patterns.
- For the broader surface, read top data engineering interview questions →.
- Stack the prerequisites with the only 5 skills you need to become a data engineer →.
- Sharpen the dialect axis with the SQL for data engineering interviews course →.
- For long-form schema craft, work through data modelling for DE interviews →.
Pipecode.ai is Leetcode for Data Engineering — every uniqueness recipe above ships with hands-on practice rooms where you write the UNIQUE constraint, the partial index, and the ROW_NUMBER dedup against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your fix to a `null not in sql` upsert behaves the same on Postgres as on Snowflake.
Practice DISTINCT and uniqueness now →
Window-function dedup drills →





Top comments (0)