Gowtham Potureddi

Posted on Jun 8

SQL UNIQUE Constraints & Deduplication Strategies: Hard vs Soft Uniqueness

#python #sql #dataengineering #interview

sql unique is the most under-rated keyword in the SQL standard — a one-line schema declaration that turns "this column should not have duplicates" from a tribal comment in a Slack channel into a hard rule the database enforces on every write. Interviewers know that engineers who can't separate hard uniqueness (constraint-enforced at insert time) from soft uniqueness (collapsed at query time with ROW_NUMBER() or DISTINCT ON) ship dashboards that double-count, joins that explode row counts, and upserts that randomly trigger duplicate-key errors in production.

This guide walks the entire uniqueness surface end to end — the unique constraint vs primary key decision matrix, composite unique for multi-tenant schemas, partial unique index for soft-delete and only-one-active patterns, the read-time toolkit (row_number deduplication, distinct on, QUALIFY), and the three-dialect upsert matrix (Postgres ON CONFLICT, MySQL ON DUPLICATE KEY UPDATE, MERGE for SQL Server / Snowflake / BigQuery). Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.

When you want hands-on reps immediately after reading, drill the DISTINCT practice library →, rehearse on window-function deduplication problems →, and stack the schema muscles with aggregation drills →.

On this page

Why uniqueness is the foundation of clean data
UNIQUE vs PRIMARY KEY — the matrix
Composite & partial unique indexes
Hard vs soft deduplication — constraint vs query
UPSERT — INSERT or UPDATE on duplicate key
Cheat sheet — uniqueness recipes
Frequently asked questions
Practice on PipeCode

1. Why uniqueness is the foundation of clean data

Uniqueness is a contract, not a comment — the schema either enforces it on every write, or you re-pay the bill on every read

The one-sentence invariant: every table has at least one "business key" that must be unique, and the cheapest way to keep it unique is a UNIQUE constraint on the write path — anything you delegate to a query is a tax you pay forever. Once you internalise the difference between hard uniqueness (a write-time constraint the database refuses to violate) and soft uniqueness (a read-time projection that collapses duplicates), the entire family of sql unique and deduplication interview questions becomes a sequence of small choices.

The two flavours.

Hard uniqueness — constraint-enforced. UNIQUE (col) or PRIMARY KEY (col) rejects the offending row at insert time with a duplicate-key error. The duplicate never reaches the table; downstream consumers never see it.
Soft uniqueness — query-enforced. The table allows duplicates; every read projects through ROW_NUMBER() OVER (PARTITION BY key ORDER BY recency DESC) WHERE rn = 1 or DISTINCT ON (key) to pick the "right" row. The duplicates linger in storage; every read pays the dedup cost.

The three places duplicates enter.

Source feeds with no contract. Vendors push CSVs that re-include yesterday's rows by accident. Every nightly load creates duplicates unless the load is idempotent.
Retries. A network blip causes the producer to re-emit the same event. Without a dedup key, the message lands twice — once under the original event_id and once under a new one if the producer assigned ids before retry.
Late-arriving CDC. Change-data-capture streams emit one row per change. If the same row updates twice in five seconds, the warehouse holds both versions; the query layer must pick the latest.

Cost of duplicates.

Inflated aggregates. SUM(amount) over a table with 3% duplicates over-reports revenue by 3%. Multiply by an executive bonus tied to that metric and the bug becomes expensive.
Broken joins. A row that appears twice on the right side of a join doubles the left side — row counts explode multiplicatively. The classic "my join returned 10x more rows than I expected" ticket.
Idempotency failures. Retried writes that re-insert the same logical record cause downstream consumers to double-process. Event-driven systems get particularly hurt because a duplicate trigger looks identical to a fresh trigger.

What interviewers listen for.

Do you say "uniqueness is a write-time constraint" before "uniqueness is a query-time filter"? — senior signal.
Do you reach for UNIQUE (tenant_id, slug) instead of UNIQUE (slug) when the schema is multi-tenant? — required answer.
Do you mention ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) = 1 when asked how you'd dedup a CDC stream? — required answer.
Do you recognise that Snowflake / BigQuery treat UNIQUE as informational only and require a dbt unique test for enforcement? — senior signal.

The 2026 reality.

OLTP databases (Postgres, MySQL, SQL Server, Oracle) enforce UNIQUE and PRIMARY KEY at write time. The constraint is a contract.
Cloud warehouses (Snowflake, BigQuery, Redshift) accept the syntax but do not enforce it — UNIQUE is a hint to the optimiser, nothing more. Enforce with dbt test unique (or its equivalent) in the transformation layer.
SQL:2023 ships UNIQUE NULLS NOT DISTINCT — the long-missing knob that makes UNIQUE treat NULLs as equal so only one NULL row is allowed (Postgres 15+, DuckDB).
Partial unique indexes (Postgres, SQL Server filtered indexes, SQLite) let you scope uniqueness to a WHERE predicate — the only clean way to express "one active subscription per user, but unlimited cancelled ones."

Worked example — the duplicate-source feed disaster

Detailed explanation. A finance team loads a vendor CSV every night into a transactions table. The vendor's CSV occasionally re-includes the previous day's tail rows because their export script uses a date-window join that overlaps by a few hours. With no UNIQUE constraint, the duplicates slip in. A month later, the revenue number is off by 4% and finance opens a ticket.

Question. Given the transactions table below, show how a missing UNIQUE on (vendor_tx_id) lets duplicates land, and how the constraint would have rejected them at write time.

Input.

id	vendor_tx_id	amount
1	TX-001	100
2	TX-002	50
3	TX-001	100
4	TX-003	75

Code.

-- BROKEN — no UNIQUE on vendor_tx_id, duplicate TX-001 sneaks in
CREATE TABLE transactions (
    id            BIGINT PRIMARY KEY,
    vendor_tx_id  TEXT NOT NULL,
    amount        NUMERIC(10, 2) NOT NULL
);

-- FIXED — UNIQUE constraint on the business key rejects duplicate
CREATE TABLE transactions (
    id            BIGINT PRIMARY KEY,
    vendor_tx_id  TEXT NOT NULL,
    amount        NUMERIC(10, 2) NOT NULL,
    CONSTRAINT transactions_vendor_tx_id_uk UNIQUE (vendor_tx_id)
);

-- Aggregate that surfaces the bug
SELECT SUM(amount) AS total_revenue
FROM transactions;

Step-by-step explanation.

The broken schema has only a surrogate PRIMARY KEY (id). id is generated by the ETL job, so each insert produces a new id even for a duplicate vendor_tx_id. The constraint cannot catch the duplicate because the primary key is a different column.
After loading the CSV, the table contains four rows including the duplicate TX-001. SUM(amount) returns 100 + 50 + 100 + 75 = 325 — the duplicate is counted twice.
The fixed schema adds UNIQUE (vendor_tx_id). The second insert of TX-001 raises a duplicate-key error (23505 on Postgres, 1062 on MySQL). The ETL job sees the error and either skips or upserts — but the duplicate never reaches the table.
SUM(amount) now returns 100 + 50 + 75 = 225 — the correct revenue.

Output.

Schema	total_revenue
No UNIQUE (broken)	325
With UNIQUE (fixed)	225

Rule of thumb. Every dimension table ships with at least one UNIQUE constraint on its business key, even if the primary key is a surrogate id. The constraint is free at write time and saves you a finance ticket nine months later.

Worked example — the retry-storm duplicate

Detailed explanation. A payment service emits a webhook for every charge. The webhook handler writes a row to charges. A network blip causes the handler to retry the same webhook three seconds later. Without a UNIQUE constraint on the idempotency key (stripe_event_id), both inserts succeed and the charge is recorded twice. With the constraint, the second insert raises a duplicate-key error and the handler treats it as "already processed."

Question. Show the schema and the idempotent-insert pattern that survives a retry storm without double-charging the customer.

Input.

stripe_event_id	charge_id	amount
evt_1	ch_a	100
evt_1	ch_a	100
evt_2	ch_b	50

Code.

CREATE TABLE charges (
    id                BIGSERIAL PRIMARY KEY,
    stripe_event_id   TEXT NOT NULL,
    charge_id         TEXT NOT NULL,
    amount            NUMERIC(10, 2) NOT NULL,
    CONSTRAINT charges_stripe_event_id_uk UNIQUE (stripe_event_id)
);

-- Idempotent insert — second retry no-ops
INSERT INTO charges (stripe_event_id, charge_id, amount)
VALUES ('evt_1', 'ch_a', 100)
ON CONFLICT (stripe_event_id) DO NOTHING;

Step-by-step explanation.

The first webhook call inserts evt_1 / ch_a / 100. The row lands; the UNIQUE constraint is satisfied.
The retry fires three seconds later with the same stripe_event_id. The naive INSERT would raise 23505 duplicate key value violates unique constraint. The handler would have to catch and ignore the error — workable but ugly.
The idempotent form ON CONFLICT (stripe_event_id) DO NOTHING tells Postgres: "if this row already exists, silently skip." No exception, no row added — the handler treats both the first call and the retry identically.
The third call inserts evt_2 / ch_b / 50. Different idempotency key, different row, no conflict.

Output.

Final row count	total_amount
2	150

Rule of thumb. For every external-event ingest, the handler writes its row with ON CONFLICT (idempotency_key) DO NOTHING against a UNIQUE index on that key. The pattern is a one-liner and eliminates an entire class of retry bugs.

SQL interview question on uniqueness-as-contract

A senior interviewer often opens with: "Walk me through every place a duplicate could land in a typical data pipeline — source feed, CDC, retry, late-arriving event, manual backfill — and tell me where you'd put the UNIQUE constraint to catch each one." This blends the schema design instinct with the operational instinct.

Solution Using layered UNIQUE constraints across the pipeline

-- 1) Raw landing zone — minimal constraints, capture everything
CREATE TABLE raw_events (
    id              BIGSERIAL PRIMARY KEY,
    event_id        TEXT NOT NULL,
    payload         JSONB NOT NULL,
    _loaded_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- 2) Staging — UNIQUE on the business key + a uniqueness audit
CREATE TABLE stg_events (
    event_id        TEXT PRIMARY KEY,
    user_id         BIGINT NOT NULL,
    event_type      TEXT NOT NULL,
    event_ts        TIMESTAMPTZ NOT NULL,
    _loaded_at      TIMESTAMPTZ NOT NULL
);

-- Audit query — how many duplicates landed in the raw zone?
SELECT
    COUNT(*)                                    AS raw_rows,
    COUNT(DISTINCT event_id)                    AS unique_event_ids,
    COUNT(*) - COUNT(DISTINCT event_id)         AS duplicate_count
FROM raw_events;

-- 3) Production warehouse — UNIQUE (composite) + dbt unique test
CREATE TABLE fct_events (
    event_id        TEXT NOT NULL,
    user_id         BIGINT NOT NULL,
    event_type      TEXT NOT NULL,
    event_ts        TIMESTAMPTZ NOT NULL,
    CONSTRAINT fct_events_pk PRIMARY KEY (event_id)
);

Step-by-step trace.

Layer	Duplicate behaviour	Enforcement
`raw_events`	duplicates accepted	none — capture everything, audit later
`stg_events`	duplicates rejected	`PRIMARY KEY (event_id)` + idempotent upsert
`fct_events`	duplicates rejected	`PRIMARY KEY (event_id)` + dbt `unique` test (warehouse)

After the audit query reveals duplicate_count, the team rewrites the staging load as INSERT ... ON CONFLICT (event_id) DO NOTHING against stg_events, so every duplicate raw row no-ops at the staging boundary. The fact table inherits clean data from staging and the dbt test catches any regression introduced by a future schema change.

Output:

Layer	Row count	Distinct event_id	Duplicates
raw_events	1,000	970	30
stg_events	970	970	0
fct_events	970	970	0

Why this works — concept by concept:

Raw is the truth — the landing zone captures every row including duplicates so that the audit query can quantify the upstream defect rate. Constraining the raw zone hides the problem.
Staging is the contract — the PRIMARY KEY (event_id) constraint plus the idempotent upsert turns "duplicates in raw" into "exactly-once in staging." This is where hard uniqueness lives.
Production inherits from staging — fct_events does not need to re-enforce uniqueness because staging already did; the dbt unique test is a safety net that catches schema drift.
Audit before metric — running COUNT(*) - COUNT(DISTINCT event_id) against the raw zone surfaces the upstream defect rate without blocking the load. Cheap insurance against a silent 3% inflation.
Cost — UNIQUE adds a B-tree index per constrained column (O(log n) insert + O(n) storage). The cost is paid once at write time; the alternative — soft dedup on every read — is paid forever.

SQL
Topic — distinct
DISTINCT and uniqueness problems (SQL)

Practice →

SQL
Topic — aggregation
Aggregation problems with duplicates (SQL)

Practice →

2. UNIQUE vs PRIMARY KEY — the matrix

`unique constraint` and `primary key` look interchangeable but differ in NULL handling, cardinality, clustering, and FK convention — five rows the interviewer wants you to recite

The mental model in one line: a table has exactly one PRIMARY KEY (which is NOT NULL and conventionally the FK target) and any number of UNIQUE constraints (which allow NULLs and live alongside the PK). Once you internalise that "PK = surrogate identity, UNIQUE = business key," the entire matrix becomes obvious.

The five-row matrix.

Attribute	UNIQUE	PRIMARY KEY
NULLs allowed?	Yes — usually treated as distinct so many NULL rows allowed	No — every column is implicitly `NOT NULL`
Count per table	Unlimited — one per business key	Exactly one
Clustered index	No — separate non-clustered index (or row-store entry)	Yes by default in SQL Server / InnoDB
Foreign-key target	Yes — works	Yes — the conventional default
Dialect NULL quirks	SQL Server allows exactly 1 NULL; Postgres allows many; SQL:2023 adds NULLS NOT DISTINCT	All dialects reject NULL universally

NULL handling — the most-asked sub-question.

Postgres / Oracle / Snowflake / BigQuery / DuckDB — every NULL is distinct from every other NULL by default. UNIQUE (col) lets you insert ten rows with col = NULL.
SQL Server — only one NULL is allowed in a UNIQUE column. The reasoning: SQL Server treats NULL as a value for the purposes of indexing, then sees duplicates.
MySQL — many NULLs allowed (matches Postgres semantics).
SQL:2023 UNIQUE NULLS NOT DISTINCT — opt-in syntax that makes NULL collide with NULL, so only one NULL row is allowed. Postgres 15+, DuckDB.

Foreign-key conventions.

A FOREIGN KEY can point at any UNIQUE or PRIMARY KEY column — the engine just needs an enforced uniqueness guarantee at the target.
Convention is to point at the surrogate PRIMARY KEY because it never changes and is a small integer (cheap to copy and index).
Pointing at a UNIQUE business key (email, username) is valid but discouraged — if the business key ever changes (a user updates their email), every FK must cascade.

Clustered vs non-clustered.

SQL Server / InnoDB (MySQL) — the PRIMARY KEY is the clustered index by default. The table is physically sorted by the PK; the PK lookup is one B-tree traversal.
UNIQUE constraints in SQL Server / InnoDB create non-clustered indexes — a separate B-tree that maps the unique column to the PK, which then locates the row.
Postgres — no clustered indexes by default. All indexes are non-clustered; the heap stores rows in insertion order. CLUSTER reorders the heap on demand but is rarely used.

Snowflake / BigQuery / Redshift caveats.

All three accept UNIQUE and PRIMARY KEY syntax but do not enforce the constraints. They are informational hints to the query optimiser.
The optimiser may use the hint to skip a DISTINCT step or assume cardinality. If your data violates the hint, results may be wrong.
Enforce with dbt test unique: column_name or the equivalent in your transformation framework. The test runs on every CI and catches violations within minutes of the regression.

Common interview probes.

"Can a column be both UNIQUE and a foreign key?" — yes; the UNIQUE makes it a one-to-one FK relationship.
"How many NULLs can a UNIQUE column have in SQL Server?" — exactly one.
"What does UNIQUE NULLS NOT DISTINCT do?" — treats NULLs as equal, so only one NULL row is allowed.
"Does Snowflake enforce UNIQUE?" — no; use dbt test unique.

Worked example — same column as PRIMARY KEY vs UNIQUE

Detailed explanation. A signup table has a single email column. The team debates whether email should be the PRIMARY KEY (no surrogate id) or a UNIQUE column on top of a surrogate BIGSERIAL id. The decision changes NULL handling, FK targets, and the rename flow.

Question. Show the two schema choices side by side. Highlight the four practical differences: NULL behaviour, FK target convention, primary key changeability, and operational consequences.

Input. Conceptual — no rows, just two CREATE TABLE statements.

Code.

-- Option A — email as PRIMARY KEY (natural key only)
CREATE TABLE users_a (
    email      TEXT PRIMARY KEY,
    full_name  TEXT
);

-- Option B — surrogate id PK + UNIQUE email
CREATE TABLE users_b (
    id         BIGSERIAL PRIMARY KEY,
    email      TEXT NOT NULL UNIQUE,
    full_name  TEXT
);

-- FK pattern downstream
CREATE TABLE orders (
    id         BIGSERIAL PRIMARY KEY,
    -- Option A
    user_email TEXT REFERENCES users_a(email),
    -- Option B
    user_id    BIGINT REFERENCES users_b(id)
);

Step-by-step explanation.

Option A makes email the natural primary key. The column is implicitly NOT NULL. The downstream orders.user_email foreign key references this column directly.
Option B introduces a surrogate BIGSERIAL id as the primary key. email is NOT NULL UNIQUE — same uniqueness guarantee but separate from identity. Downstream orders.user_id references the integer id.
If a user updates their email, Option A must UPDATE users_a SET email = ... and cascade the change to every orders.user_email row. Option B updates users_b.email in place; the surrogate id does not change, so orders.user_id is unaffected.
Option B is also more storage-efficient: an 8-byte BIGINT in orders.user_id is smaller than a 30+ character email in orders.user_email. With 100M order rows the difference is gigabytes.
The standard recommendation: use surrogate PK + UNIQUE business key for any table whose business key might ever change. Reserve natural PKs for immutable codes (ISO country codes, currency codes).

Output.

Concern	Option A (email as PK)	Option B (id PK + email UNIQUE)
NULL email allowed?	No	No (UNIQUE + NOT NULL)
FK column type in orders	TEXT (email string)	BIGINT
Storage per FK row	30+ bytes	8 bytes
Email rename impact	cascades to every FK	none
Recommended for users table	rarely	almost always

Rule of thumb. Default to surrogate primary key plus UNIQUE business key. Only use a natural primary key when the column is provably immutable and small (currency codes, country codes, ISBN-after-revision-policy).

Worked example — SQL Server's "only one NULL" quirk

Detailed explanation. A team migrates a Postgres schema to SQL Server. In Postgres, UNIQUE (referral_code) allows many NULL referral_code rows because Postgres treats NULLs as distinct. SQL Server's UNIQUE constraint allows exactly one NULL. The migration fails on the first insert with a second NULL.

Question. Show the schema and the cross-dialect difference. Provide the SQL Server workaround using a filtered index.

Input.

user_id	referral_code
1	ABC
2	NULL
3	NULL
4	XYZ

Code.

-- Postgres / MySQL / Oracle — both NULL rows accepted
CREATE TABLE users (
    user_id        BIGSERIAL PRIMARY KEY,
    referral_code  TEXT UNIQUE   -- allows multiple NULLs
);

-- SQL Server — second NULL fails with 2627 violation
CREATE TABLE users (
    user_id        BIGINT IDENTITY PRIMARY KEY,
    referral_code  NVARCHAR(50) UNIQUE  -- allows exactly one NULL
);

-- SQL Server workaround — filtered unique index that ignores NULL
CREATE TABLE users (
    user_id        BIGINT IDENTITY PRIMARY KEY,
    referral_code  NVARCHAR(50)
);
CREATE UNIQUE INDEX users_referral_code_uk
    ON users (referral_code)
    WHERE referral_code IS NOT NULL;

Step-by-step explanation.

In Postgres, inserting (2, NULL) then (3, NULL) both succeed because the engine treats every NULL as distinct from every other NULL — a different NULL each time.
In SQL Server, the first NULL insert succeeds and SQL Server records "I have a NULL in this UNIQUE column." The second NULL insert raises 2627 Violation of UNIQUE KEY constraint.
The workaround is a filtered unique index — CREATE UNIQUE INDEX ... WHERE referral_code IS NOT NULL. The index covers only non-NULL rows, so NULLs are entirely outside the uniqueness check and any number are allowed.
The filtered-index trick is the canonical SQL Server pattern for "UNIQUE on a nullable column." It mirrors Postgres semantics and is also the pattern for SQL:2023 NULLS NOT DISTINCT (which Postgres 15+ supports natively).

Output.

Engine	Insert (2, NULL)	Insert (3, NULL)
Postgres	OK	OK
MySQL	OK	OK
SQL Server — vanilla UNIQUE	OK	2627 violation
SQL Server — filtered index	OK	OK

Rule of thumb. When a column is both nullable and unique, state the NULL behaviour explicitly — NULLS DISTINCT (default in most engines), NULLS NOT DISTINCT (Postgres 15+ / DuckDB), or a filtered unique index (SQL Server / SQLite). Future readers and future engines will thank you.

SQL interview question on the UNIQUE vs PRIMARY KEY decision

A senior interviewer might frame this as: "You're designing a subscriptions table that holds one row per (user_id, plan_id) pair. Walk me through whether you'd use a composite PRIMARY KEY, a surrogate id + UNIQUE, or both — and why."

Solution Using surrogate PK + composite UNIQUE for stability

CREATE TABLE subscriptions (
    id          BIGSERIAL PRIMARY KEY,
    user_id     BIGINT NOT NULL REFERENCES users(id),
    plan_id     BIGINT NOT NULL REFERENCES plans(id),
    started_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    CONSTRAINT subscriptions_user_plan_uk UNIQUE (user_id, plan_id)
);

Step-by-step trace.

Decision	Choice	Why
Primary key	`BIGSERIAL id`	small immutable surrogate; stable FK target
Business key	`UNIQUE (user_id, plan_id)`	hard enforcement of "one row per pair"
Downstream FK	`REFERENCES subscriptions(id)`	tiny 8-byte FK column
Future change tolerance	rename / migrate plan ids without breaking FKs	the surrogate id never changes

After the schema lands, the team writes the standard INSERT ... ON CONFLICT (user_id, plan_id) DO NOTHING for the signup flow. Duplicate signup attempts no-op cleanly without re-implementing the uniqueness check in application code.

Output:

Property	Value
Primary key column	id
Unique business key	(user_id, plan_id)
Downstream FK column type	BIGINT
Duplicate signup behaviour	silently no-op via ON CONFLICT DO NOTHING

Why this works — concept by concept:

Surrogate PK + business UNIQUE — splits identity (the id) from meaning (the business key). The id never changes; the business key is the contract.
Composite UNIQUE — UNIQUE (user_id, plan_id) says "a (user_id, plan_id) pair is unique" — not user_id alone, not plan_id alone. The composite index supports both the uniqueness check and queries that filter by user_id (leading column).
Foreign key target — pointing downstream FKs at subscriptions.id (8 bytes) instead of (user_id, plan_id) (16+ bytes) saves storage and simplifies joins.
Idempotent insert — ON CONFLICT (user_id, plan_id) DO NOTHING uses the UNIQUE constraint as the conflict target. A retried signup never doubles the row.
Cost — one B-tree index on (user_id, plan_id) (O(log n) insert + O(n) storage). Same cost as any composite UNIQUE; pays for itself on the first duplicate.

SQL
Topic — distinct
UNIQUE constraint problems (SQL)

Practice →

SQL
Topic — joins
JOIN problems on unique keys (SQL)

Practice →

3. Composite & partial unique indexes

`composite unique` scopes the constraint to a column tuple; `partial unique index` scopes it to a row predicate — together they cover every "unique within X" requirement

The mental model in one line: UNIQUE (a, b) makes the tuple unique (rows can share a or b alone, just not both together); UNIQUE INDEX ... WHERE pred makes uniqueness apply only to rows that match pred. Compose the two and you can express any business rule of the form "X must be unique within scope Y."

Composite UNIQUE in detail.

Syntax. UNIQUE (a, b) — the combination of a and b must be unique. (1, 'x') and (1, 'y') are both allowed; (1, 'x') twice is not.
Column order. The order in UNIQUE (a, b) does not affect the uniqueness semantics — (a, b) is the same set as (b, a). But the underlying index is (a, b), which supports queries that filter by a alone (leading column) but not by b alone.
Use case 1 — multi-tenant slug uniqueness. UNIQUE (tenant_id, slug) — a slug like 'launch' is unique within a tenant, but two tenants can each have their own 'launch' page.
Use case 2 — daily metric uniqueness. UNIQUE (date, metric_name) — one row per metric per day. The upsert flow then writes ON CONFLICT (date, metric_name) DO UPDATE SET value = EXCLUDED.value.

Partial UNIQUE in detail (Postgres / SQL Server / SQLite / DuckDB).

Syntax. CREATE UNIQUE INDEX name ON table (col) WHERE pred. The uniqueness check applies only to rows where pred is true.
Use case 1 — one-active pattern. CREATE UNIQUE INDEX one_active_sub ON subscriptions (user_id) WHERE is_active = true — at most one active subscription per user, but unlimited cancelled ones.
Use case 2 — soft-delete-aware uniqueness. CREATE UNIQUE INDEX unique_email ON users (email) WHERE deleted_at IS NULL — uniqueness applies only to live rows; soft-deleted rows can reuse the email.
Use case 3 — type-scoped uniqueness. CREATE UNIQUE INDEX one_primary_address ON addresses (user_id) WHERE is_primary = true — at most one primary address per user, multiple secondary ones allowed.

Expression-based UNIQUE (Postgres / Oracle / SQL Server with computed columns).

Syntax. UNIQUE (LOWER(email)) — the result of the expression is unique, not the raw column.
Use case 1 — case-insensitive uniqueness. Emails Alice@x.com and alice@x.com should be considered duplicates. UNIQUE (LOWER(email)) enforces it at the schema level.
Use case 2 — normalised key uniqueness. UNIQUE (REGEXP_REPLACE(phone, '\D', '', 'g')) — phone numbers are unique after stripping non-digits.
Use case 3 — temporal bucketing. UNIQUE (date_trunc('day', created_at), user_id) — at most one event per user per day, regardless of the timestamp's hour.

Snowflake / BigQuery / Redshift caveats.

All three accept the syntax for UNIQUE (a, b) but do not enforce it. The constraint is a hint.
Partial unique indexes do not exist in the warehouse-OLAP family — there is no enforced-at-write mechanism. Use dbt test unique with a where clause: tests: [{ name: unique_active_subs, column: user_id, config: { where: "is_active = true" } }].
Expression-based UNIQUE — same story; emulate with a dbt test on lower(email).

MySQL caveats.

Composite UNIQUE — fully supported, same as Postgres.
Partial UNIQUE — not supported. MySQL has no filtered indexes. The standard workaround is a generated column that is NULL when the predicate is false, then UNIQUE on the generated column. Postgres-style WHERE partial indexes have to be emulated.
Expression UNIQUE — supported via generated columns: email_lower VARCHAR(255) GENERATED ALWAYS AS (LOWER(email)) STORED, UNIQUE (email_lower).

Worked example — multi-tenant slug uniqueness

Detailed explanation. A SaaS product hosts marketing pages for many tenants. Each tenant owns a set of page slugs ('launch', 'pricing', 'about'). The slug is unique within the tenant, not globally — two tenants can each have a 'launch' page. The schema enforces this with UNIQUE (tenant_id, slug).

Question. Show the table, attempt four inserts (two from each tenant), and explain which ones succeed.

Input — intended inserts.

tenant_id	slug
1	launch
1	launch
2	launch
2	intro

Code.

CREATE TABLE pages (
    id          BIGSERIAL PRIMARY KEY,
    tenant_id   BIGINT NOT NULL REFERENCES tenants(id),
    slug        TEXT NOT NULL,
    title       TEXT NOT NULL,
    CONSTRAINT pages_tenant_slug_uk UNIQUE (tenant_id, slug)
);

INSERT INTO pages (tenant_id, slug, title) VALUES (1, 'launch', 'Launch v1');
INSERT INTO pages (tenant_id, slug, title) VALUES (1, 'launch', 'Launch v2');  -- fails
INSERT INTO pages (tenant_id, slug, title) VALUES (2, 'launch', 'Launch');
INSERT INTO pages (tenant_id, slug, title) VALUES (2, 'intro',  'Intro');

Step-by-step explanation.

The first insert (1, 'launch') lands. The composite UNIQUE check sees no existing (1, 'launch') row.
The second insert (1, 'launch') collides. The constraint fires; the insert raises 23505 duplicate key value violates unique constraint. No row added.
The third insert (2, 'launch') lands. The composite key (2, 'launch') is different from (1, 'launch'). Even though slug='launch' is "the same value" as row 1, the UNIQUE is on the pair, not the column.
The fourth insert (2, 'intro') lands. Different slug, no collision with anything.

Output.

Row in table	tenant_id	slug	title
1	1	launch	Launch v1
2	2	launch	Launch
3	2	intro	Intro

Rule of thumb. Every multi-tenant table that has a "human-friendly key" (slug, name, code) needs UNIQUE (tenant_id, that_key). Forgetting the tenant_id leading column means tenant A's 'launch' blocks tenant B from creating their own.

Worked example — one-active partial unique index

Detailed explanation. Each user can have many subscriptions rows over their lifetime (one per signup, one per cancellation, one per re-signup). At any given moment, at most one subscription per user should be is_active = true. A partial unique index expresses exactly this — uniqueness on user_id, but only for rows where is_active.

Question. Show the schema and four inserts. Explain which succeed.

Input — intended inserts.

user_id	is_active	started_at
42	true	2026-01-01
42	true	2026-02-01
42	false	2025-10-01
42	false	2025-12-01

Code.

CREATE TABLE subscriptions (
    id           BIGSERIAL PRIMARY KEY,
    user_id      BIGINT NOT NULL,
    is_active    BOOLEAN NOT NULL,
    started_at   TIMESTAMPTZ NOT NULL,
    cancelled_at TIMESTAMPTZ
);

CREATE UNIQUE INDEX one_active_sub_per_user
    ON subscriptions (user_id)
    WHERE is_active = true;

-- Inserts
INSERT INTO subscriptions (user_id, is_active, started_at) VALUES (42, true,  '2026-01-01');
INSERT INTO subscriptions (user_id, is_active, started_at) VALUES (42, true,  '2026-02-01'); -- fails
INSERT INTO subscriptions (user_id, is_active, started_at) VALUES (42, false, '2025-10-01');
INSERT INTO subscriptions (user_id, is_active, started_at) VALUES (42, false, '2025-12-01');

Step-by-step explanation.

The first insert lands — the WHERE is_active = true filter includes this row, but no prior row matches, so the uniqueness check sees no collision.
The second insert collides — both (42, true, ...) rows would land in the partial index, violating UNIQUE (user_id) within that subset. 23505 raised.
The third insert lands — is_active = false, so the row is outside the partial index. No uniqueness check applies; the row is free to coexist with anything.
The fourth insert lands — also is_active = false. Multiple cancelled rows per user are allowed.

Output.

Row in table	user_id	is_active	started_at
1	42	true	2026-01-01
2	42	false	2025-10-01
3	42	false	2025-12-01

Rule of thumb. Whenever the business rule is "one X per Y only when condition holds," reach for a partial unique index. Common candidates: one active subscription per user, one primary address per user, one open invoice per project, one in-progress task per worker.

Worked example — case-insensitive email uniqueness

Detailed explanation. Users sign up with email addresses; the product treats Alice@X.com and alice@x.com as the same address. A naive UNIQUE (email) lets both rows coexist. The fix is UNIQUE (LOWER(email)) — the index stores the lowercase form and uniqueness applies to that.

Question. Show the schema and three inserts. Explain which succeed and how to query case-insensitively after the index is in place.

Input — intended inserts.

email
Alice@X.com
alice@x.com
bob@x.com

Code.

-- Postgres / Oracle / SQL Server (with computed column)
CREATE TABLE users (
    id     BIGSERIAL PRIMARY KEY,
    email  TEXT NOT NULL
);

CREATE UNIQUE INDEX users_email_lower_uk ON users (LOWER(email));

INSERT INTO users (email) VALUES ('Alice@X.com');
INSERT INTO users (email) VALUES ('alice@x.com');  -- fails
INSERT INTO users (email) VALUES ('bob@x.com');

-- Case-insensitive lookup uses the index
SELECT * FROM users WHERE LOWER(email) = LOWER('ALICE@X.COM');

Step-by-step explanation.

The first insert lands. The index stores lower('Alice@X.com') = 'alice@x.com'.
The second insert computes lower('alice@x.com') = 'alice@x.com' and collides with the existing index entry. 23505 raised.
The third insert lands. lower('bob@x.com') = 'bob@x.com' is new.
The lookup WHERE LOWER(email) = LOWER('ALICE@X.COM') rewrites both sides to 'alice@x.com'. The query planner uses the expression index to locate the row in O(log n) — no full table scan.

Output.

Row in table	email
1	Alice@X.com
2	bob@x.com

Rule of thumb. Any business key that humans type freely (email, username, product code) needs an expression-based UNIQUE that normalises the format. Lowercasing covers 80% of the cases; REGEXP_REPLACE handles the rest. Without it, the dataset accumulates "duplicates" that the application UI shows as distinct accounts.

Worked example — Snowflake informational UNIQUE + dbt test

Detailed explanation. Snowflake accepts UNIQUE syntax but does not enforce it at write time. A team relies on the appearance of the constraint and is surprised when duplicates appear in production after a buggy load. The fix is to add a dbt test unique on the column — the test runs in CI and surfaces the violation before the bad data hits the dashboard.

Question. Show the Snowflake schema, the (non-)enforcement behaviour, and the dbt test that gives you real protection.

Input — intended inserts.

event_id	user_id	event_ts
evt_1	42	2026-06-01 10:00
evt_1	42	2026-06-01 10:00
evt_2	43	2026-06-01 10:05

Code.

-- Snowflake — UNIQUE accepted but NOT enforced
CREATE TABLE fct_events (
    event_id   STRING NOT NULL,
    user_id    NUMBER NOT NULL,
    event_ts   TIMESTAMP_TZ NOT NULL,
    CONSTRAINT fct_events_pk PRIMARY KEY (event_id),  -- also informational
    CONSTRAINT fct_events_uk UNIQUE (event_id)         -- also informational
);

-- All three inserts succeed despite the "constraint"
INSERT INTO fct_events VALUES ('evt_1', 42, '2026-06-01 10:00:00');
INSERT INTO fct_events VALUES ('evt_1', 42, '2026-06-01 10:00:00');  -- duplicates land!
INSERT INTO fct_events VALUES ('evt_2', 43, '2026-06-01 10:05:00');

# dbt test that DOES enforce
models:
  - name: fct_events
    columns:
      - name: event_id
        tests:
          - unique
          - not_null

Step-by-step explanation.

Snowflake's CREATE TABLE accepts both PRIMARY KEY and UNIQUE syntax. The Snowflake docs call these "informational integrity constraints" — they exist for the query optimiser and for downstream tools that read the schema metadata.
All three inserts succeed because Snowflake does not run an uniqueness check at write time. The duplicate evt_1 lands silently.
The dbt test runs against the materialised table (dbt test --select fct_events). It executes SELECT event_id, COUNT(*) FROM fct_events GROUP BY event_id HAVING COUNT(*) > 1 and fails the build if any row returns.
The test runs on every CI commit and every scheduled dbt run, so the duplicate is surfaced within minutes of being introduced — before it reaches the dashboard.

Output.

Layer	Duplicate behaviour
Snowflake `UNIQUE` constraint	not enforced — duplicates land
Snowflake `PRIMARY KEY` constraint	not enforced — duplicates land
dbt `unique` test	fails CI on duplicate — surfaces the bug

Rule of thumb. On Snowflake / BigQuery / Redshift, write PRIMARY KEY and UNIQUE constraints as documentation — they help the optimiser and tools — but always pair them with a dbt unique test for real enforcement. The two-line YAML test is the warehouse-equivalent of the OLTP write-time constraint.

SQL interview question on composite + partial uniqueness

A senior interviewer often combines: "Design the constraints for a team_memberships table where each user can be a member of many teams, but only one owner per team. Show the schema and explain every constraint."

Solution Using composite UNIQUE plus partial UNIQUE

CREATE TABLE team_memberships (
    id         BIGSERIAL PRIMARY KEY,
    team_id    BIGINT NOT NULL REFERENCES teams(id),
    user_id    BIGINT NOT NULL REFERENCES users(id),
    role       TEXT NOT NULL CHECK (role IN ('owner', 'admin', 'member')),
    joined_at  TIMESTAMPTZ NOT NULL DEFAULT now(),

    -- A user is in a team at most once
    CONSTRAINT team_memberships_team_user_uk UNIQUE (team_id, user_id)
);

-- At most one owner per team
CREATE UNIQUE INDEX one_owner_per_team
    ON team_memberships (team_id)
    WHERE role = 'owner';

Step-by-step trace.

Inserted row	UNIQUE (team_id, user_id)	partial UNIQUE (team_id) WHERE role='owner'	Result
(team=1, user=10, role=owner)	new pair	new owner — first	OK
(team=1, user=11, role=member)	new pair	not in partial index	OK
(team=1, user=12, role=owner)	new pair	second owner on team 1 → collision	reject
(team=2, user=10, role=owner)	new pair	new owner on team 2	OK
(team=1, user=10, role=member)	existing pair → collision	not checked (constraint fired earlier)	reject

The two constraints compose: composite UNIQUE on (team_id, user_id) prevents the same user from being added twice to the same team; partial UNIQUE on (team_id) WHERE role='owner' ensures at most one owner per team.

Output:

team_id	user_id	role
1	10	owner
1	11	member
2	10	owner

Why this works — concept by concept:

Composite UNIQUE (team_id, user_id) — the pair is the business key. Same user in a different team is fine; same user twice in one team is not.
Partial UNIQUE (team_id) WHERE role='owner' — applies uniqueness only to owner rows. Member and admin rows are outside the index and can repeat.
CHECK constraint on role — defends the enum at the schema layer. Application bugs that try to insert role='superuser' fail loudly.
Two indexes, two business rules — each constraint expresses a single rule. Composing rules in one constraint (e.g. "unique by team_id, user_id, role") would let two owners coexist (different role per row) — the wrong semantics.
Cost — two B-tree indexes (composite + partial). Combined: O(log n) write, O(n) storage. Negligible vs the cost of an "I see two owners on the team page" bug.

SQL
Topic — conditional logic
Conditional uniqueness problems (SQL)

Practice →

SQL
Topic — group-by
GROUP BY duplicate-detection problems (SQL)

Practice →

4. Hard vs soft deduplication — constraint vs query

`hard uniqueness` rejects duplicates at write time; `soft uniqueness` collapses them at read time with `row_number deduplication` or `distinct on` — pick the right one for the right layer

The mental model in one line: the OLTP source-of-truth ships hard uniqueness (constraint rejects the duplicate); the warehouse staging layer ships soft uniqueness (the duplicate lands but the read picks the latest). Once you choose the layer, the dedup technique follows mechanically.

Hard vs soft — the comparison.

Property	Hard (constraint)	Soft (query)
When uniqueness is enforced	at INSERT / UPDATE time	at SELECT time
Storage cost	constant — duplicates never land	grows with input volume
Read cost	constant — no dedup at query time	window function or DISTINCT on every read
Where it shines	OLTP source-of-truth, idempotent writes	warehouse staging, CDC streams, late-arriving data
Failure mode	duplicate-key error visible to the writer	silently picks a row — bug surfaces in metrics weeks later

The soft-dedup toolkit.

SELECT DISTINCT col1, col2 FROM t — collapses rows where every selected column is identical. Simple but inflexible: you cannot pick which row wins among the duplicates.
GROUP BY col1, col2 — same row-collapse as DISTINCT but lets you aggregate other columns. SELECT col1, col2, MAX(updated_at) FROM t GROUP BY col1, col2.
ROW_NUMBER() OVER (PARTITION BY key ORDER BY recency DESC) AS rn then WHERE rn = 1 — the canonical "latest per key" pattern. Works in every dialect.
DISTINCT ON (key) (Postgres / DuckDB) — Postgres-specific shorthand: SELECT DISTINCT ON (id) * FROM t ORDER BY id, updated_at DESC. Picks the first row per id after ordering.
QUALIFY rn = 1 (Snowflake / BigQuery / DuckDB / Teradata) — lets you filter on a window function without wrapping in a subquery: SELECT *, ROW_NUMBER() OVER (...) AS rn FROM t QUALIFY rn = 1.

DISTINCT vs GROUP BY.

Semantically identical — both collapse rows by a unique key. The optimiser usually compiles them to the same plan.
Stylistic preference — use DISTINCT when you don't need aggregates; use GROUP BY when you do. SELECT DISTINCT id FROM t ≡ SELECT id FROM t GROUP BY id.
Performance — DISTINCT and GROUP BY have the same cost on modern engines. The myth that "GROUP BY is faster" is a holdover from MySQL pre-5.6.

Choosing between hard and soft.

OLTP source-of-truth → hard. The application owns the writes; uniqueness is a contract. Examples: user accounts, order ids, payment events.
Warehouse staging from CDC → soft. The CDC stream emits one row per change; multiple _loaded_at versions of the same id are expected; the read picks the latest.
Vendor CSV with no idempotency key → hard if you trust the vendor to provide a unique identifier, soft otherwise. Add a synthetic UNIQUE on (vendor, file_date, row_number_in_file) if necessary.
High-throughput log ingest → soft. Hard constraints add per-row index lookups; at 100k rows/sec the cost matters. Land into a heap; dedup on the read.

Worked example — DISTINCT ON vs QUALIFY across dialects.

Postgres / DuckDB — SELECT DISTINCT ON (id) * FROM events ORDER BY id, event_ts DESC — clean one-liner.
Snowflake / BigQuery / DuckDB / Teradata — SELECT * FROM events QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY event_ts DESC) = 1.
MySQL / SQL Server / Oracle (universal) — CTE + ROW_NUMBER: WITH ranked AS (SELECT *, ROW_NUMBER() OVER (...) AS rn FROM events) SELECT * FROM ranked WHERE rn = 1.

Worked example — dedup a CDC stream with ROW_NUMBER

Detailed explanation. A CDC stream from Postgres emits one row per change to a customers table. The warehouse load picks up changes every five minutes; sometimes a row updates twice within a window and both versions land. The downstream dim_customers should hold only the latest row per id. Window-function dedup is the textbook pattern.

Question. Given the stg_customers table with two _loaded_at versions of id=42, write a query that selects the latest row per id.

Input — stg_customers.

id	name	_loaded_at
42	Ana	2026-06-01 10:00
42	Anna	2026-06-01 10:05
43	Bo	2026-06-01 10:01

Code.

-- Universal pattern — CTE + ROW_NUMBER
WITH ranked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY id ORDER BY _loaded_at DESC) AS rn
    FROM stg_customers
)
SELECT id, name, _loaded_at
FROM ranked
WHERE rn = 1;

-- Snowflake / BigQuery / DuckDB — one fewer CTE
SELECT id, name, _loaded_at
FROM stg_customers
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY _loaded_at DESC) = 1;

-- Postgres / DuckDB — DISTINCT ON shortcut
SELECT DISTINCT ON (id) id, name, _loaded_at
FROM stg_customers
ORDER BY id, _loaded_at DESC;

Step-by-step explanation.

PARTITION BY id groups rows by the deduplication key — every distinct id becomes its own window.
ORDER BY _loaded_at DESC ranks within each window — the row with the most recent _loaded_at gets rn = 1.
WHERE rn = 1 (or QUALIFY rn = 1) keeps only the top-ranked row per partition. The older version is dropped at read time.
DISTINCT ON (id) is the Postgres shortcut — the ORDER BY id, _loaded_at DESC first sorts by the dedup key (required) then by recency. DISTINCT ON picks the first row after that sort.

Output.

id	name	_loaded_at
42	Anna	2026-06-01 10:05
43	Bo	2026-06-01 10:01

Rule of thumb. Every staging table from a CDC source or message queue ships with a standard dedup view: ROW_NUMBER() OVER (PARTITION BY business_key ORDER BY ingestion_ts DESC) filtered to rn = 1. Bake the view into the staging-layer template so analysts always query the dedup'd version downstream.

Worked example — DISTINCT vs GROUP BY on a duplicate-heavy table

Detailed explanation. A reporting team wants the list of distinct (user_id, event_type) pairs in the last 30 days. DISTINCT and GROUP BY both work. The DISTINCT form is shorter; the GROUP BY form lets the team add a COUNT(*) later.

Question. Given the events table, list every distinct (user_id, event_type) pair two ways. Show that the results are identical.

Input.

event_id	user_id	event_type
1	10	click
2	10	click
3	10	view
4	11	click
5	11	click

Code.

-- DISTINCT form
SELECT DISTINCT user_id, event_type
FROM events
WHERE event_ts >= CURRENT_DATE - INTERVAL '30 days';

-- GROUP BY form
SELECT user_id, event_type
FROM events
WHERE event_ts >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY user_id, event_type;

-- GROUP BY with a useful aggregate added on
SELECT user_id, event_type, COUNT(*) AS n
FROM events
WHERE event_ts >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY user_id, event_type;

Step-by-step explanation.

DISTINCT collapses every row that has identical (user_id, event_type). Three distinct pairs remain: (10, click), (10, view), (11, click).
GROUP BY produces the same three groups. Without an aggregate, the SELECT list reduces to the group-by columns — identical result to DISTINCT.
Adding COUNT(*) to the GROUP BY gives (10, click, 2), (10, view, 1), (11, click, 2). The DISTINCT form cannot add this column.
On a modern optimiser, both queries compile to the same hash-aggregate plan. There is no performance reason to prefer one form.

Output (DISTINCT or plain GROUP BY).

user_id	event_type
10	click
10	view
11	click

Rule of thumb. Use DISTINCT when the result is "just the distinct combinations"; switch to GROUP BY the moment you need an aggregate alongside the dedup. Both compile to the same plan.

Worked example — picking the latest row with DISTINCT ON

Detailed explanation. A Postgres-only stack wants to fetch the latest order per customer. DISTINCT ON (customer_id) is the cleanest one-liner: order by (customer_id, created_at DESC) and DISTINCT ON picks the first row per customer_id.

Question. Given the orders table, return the latest order per customer using DISTINCT ON. Show the equivalent universal pattern with ROW_NUMBER.

Input.

order_id	customer_id	created_at	amount
1	10	2026-05-01	100
2	10	2026-06-01	150
3	11	2026-05-15	200
4	11	2026-04-10	50

Code.

-- Postgres / DuckDB — DISTINCT ON shortcut
SELECT DISTINCT ON (customer_id)
    customer_id, order_id, created_at, amount
FROM orders
ORDER BY customer_id, created_at DESC;

-- Universal pattern — same result
WITH ranked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY created_at DESC) AS rn
    FROM orders
)
SELECT customer_id, order_id, created_at, amount
FROM ranked
WHERE rn = 1;

Step-by-step explanation.

DISTINCT ON (customer_id) tells Postgres: "keep the first row encountered per customer_id."
The ORDER BY customer_id, created_at DESC controls which row is "first." Sorting by customer_id is required (it must match the DISTINCT ON columns); sorting by created_at DESC within each customer puts the most recent order first.
For customer_id = 10, the ordered rows are (10, 2, 2026-06-01) and (10, 1, 2026-05-01). DISTINCT ON keeps the first: order_id = 2.
For customer_id = 11, the ordered rows are (11, 3, 2026-05-15) and (11, 4, 2026-04-10). DISTINCT ON keeps order_id = 3.
The ROW_NUMBER variant produces identical results via the standard window-function pattern and is the portable choice.

Output.

customer_id	order_id	created_at	amount
10	2	2026-06-01	150
11	3	2026-05-15	200

Rule of thumb. On Postgres-only stacks, DISTINCT ON is the cleanest "latest per key" syntax. When the SQL must travel between Postgres, Snowflake, and BigQuery, fall back to the CTE + ROW_NUMBER pattern (or QUALIFY on dialects that support it). All three encode the same plan.

SQL interview question on dedup of a CDC staging table

A senior interviewer often combines: "Your warehouse staging table holds CDC events with possible duplicates per (table_name, primary_key). Write a query that returns the latest event per (table_name, primary_key) across all tables, then explain which dialects offer the cleanest syntax."

Solution Using QUALIFY ROW_NUMBER for cross-dialect dedup

-- Snowflake / BigQuery / DuckDB / Teradata — QUALIFY
SELECT
    table_name,
    primary_key,
    op,
    payload,
    _loaded_at
FROM stg_cdc
QUALIFY
    ROW_NUMBER() OVER (
        PARTITION BY table_name, primary_key
        ORDER BY _loaded_at DESC
    ) = 1;

-- Universal portable pattern — CTE + ROW_NUMBER
WITH ranked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY table_name, primary_key
            ORDER BY _loaded_at DESC
        ) AS rn
    FROM stg_cdc
)
SELECT table_name, primary_key, op, payload, _loaded_at
FROM ranked
WHERE rn = 1;

Step-by-step trace.

Row in stg_cdc	table_name	primary_key	_loaded_at	rn
1	users	42	10:00	2
2	users	42	10:05	1 ✓
3	users	43	10:01	1 ✓
4	orders	99	10:02	1 ✓
5	orders	99	10:08	— (depends: actually the newest, so it would be rn = 1, and row 4 becomes rn = 2)

Re-rank row 4 and 5 since 10:08 > 10:02: row 5 becomes rn = 1 and row 4 becomes rn = 2. The output keeps rn = 1 rows per partition.

Output:

table_name	primary_key	_loaded_at
users	42	10:05
users	43	10:01
orders	99	10:08

Why this works — concept by concept:

PARTITION BY (table_name, primary_key) — the dedup happens per composite key. Two different tables can both have primary_key = 42 without competing for the same row slot.
ORDER BY _loaded_at DESC — picks the most recent CDC event. If two events share the same _loaded_at, you'd add a tiebreaker like _lsn or _offset to make the ranking deterministic.
QUALIFY = 1 — the same as a CTE-wrapped WHERE rn = 1 but reads top-down: filter the rows whose window-function value equals 1. Available on Snowflake / BigQuery / DuckDB / Teradata.
Portable CTE form — the CTE + WHERE rn = 1 variant compiles on every dialect. Same plan as QUALIFY; just two more lines of SQL.
Cost — one window-function pass over the staging table: O(n log n) due to the sort inside the partition. Materialise the dedup'd view if multiple downstream queries read from it.

SQL
Topic — window functions
Window-function deduplication problems (SQL)

Practice →

SQL
Topic — ranking
ROW_NUMBER ranking problems (SQL)

Practice →

5. UPSERT — INSERT or UPDATE on duplicate key

`upsert` is "insert if new, update if existing" — three syntaxes (`ON CONFLICT`, `ON DUPLICATE KEY`, `MERGE`) over the same UNIQUE-constraint mechanism

The mental model in one line: every upsert relies on a pre-existing UNIQUE or PRIMARY KEY index — the constraint is the contract that tells the engine "this is the duplicate; here's how to merge". Once you accept that the UNIQUE has to be there first, the dialect-specific syntax is just decoration.

The three syntaxes.

Postgres / SQLite / DuckDB — INSERT ... ON CONFLICT (col) DO UPDATE SET ...
MySQL — INSERT ... ON DUPLICATE KEY UPDATE col = VALUES(col) (or col = NEW.col in 8.0.20+)
SQL Server / Oracle / Snowflake / BigQuery / Redshift — MERGE INTO target USING source ON ... WHEN MATCHED THEN UPDATE ... WHEN NOT MATCHED THEN INSERT ...

Postgres ON CONFLICT in detail.

Conflict target — ON CONFLICT (col) or ON CONFLICT ON CONSTRAINT constraint_name. Must reference an existing unique constraint or index.
Actions — DO NOTHING (silent idempotent insert) or DO UPDATE SET col = EXCLUDED.col. The pseudo-row EXCLUDED holds the values that would have been inserted.
Conditional update — ON CONFLICT (col) DO UPDATE SET col = EXCLUDED.col WHERE target.updated_at < EXCLUDED.updated_at. Skip the update if the incoming row is older.
Atomicity — the conflict check and the resulting INSERT or UPDATE happen in a single statement under the same transaction; the operation is race-safe against concurrent inserts.

MySQL ON DUPLICATE KEY UPDATE in detail.

No explicit conflict target — MySQL uses any UNIQUE or PRIMARY KEY index that the row would violate. Tricky if multiple indexes could match different existing rows — behaviour is undefined.
VALUES(col) — references the incoming row's value. In MySQL 8.0.20+, the preferred form is INSERT INTO t (a, b) VALUES (1, 2) AS new ON DUPLICATE KEY UPDATE b = new.b.
INSERT IGNORE — sibling form: silently skips rows that would violate a constraint. The MySQL equivalent of ON CONFLICT DO NOTHING.

MERGE in detail (SQL Server / Snowflake / BigQuery / Oracle / Redshift).

Three clauses — WHEN MATCHED THEN UPDATE, WHEN NOT MATCHED THEN INSERT, WHEN NOT MATCHED BY SOURCE THEN DELETE (optional, deletes from target rows missing in source).
Source can be a query — MERGE INTO target t USING (SELECT ... FROM staging) s ON t.k = s.k WHEN MATCHED THEN UPDATE SET .... Powerful for bulk upserts from staging.
Atomicity caveat — SQL Server — MERGE is not atomic against concurrent inserts without HOLDLOCK or WITH (SERIALIZABLE) hints. Documented Microsoft KB article; bites teams that assume MERGE behaves like ON CONFLICT.
Atomicity caveat — BigQuery / Snowflake — both serialise MERGE statements; race conditions are absent at the SQL layer, but the operation is more expensive (full table scan in some cases).

WHERE target IS DISTINCT FROM source — the no-op update optimisation.

A naive MERGE or ON CONFLICT DO UPDATE rewrites every matched row even when the new value equals the old value.
The result is wasted writes, bloated WAL, and unnecessary downstream change-feed events.
The optimisation: WHEN MATCHED AND t.val IS DISTINCT FROM s.val THEN UPDATE SET val = s.val (Postgres / Snowflake). Same idea in ON CONFLICT (col) DO UPDATE SET val = EXCLUDED.val WHERE target.val IS DISTINCT FROM EXCLUDED.val.

Common interview probes.

"Walk me through INSERT ... ON CONFLICT (col) DO UPDATE SET val = EXCLUDED.val. What does EXCLUDED refer to?" — the pseudo-row of values that would have been inserted.
"Why is INSERT ON CONFLICT preferred over MERGE on SQL Server?" — MERGE is not race-safe without HOLDLOCK; ON CONFLICT (Postgres) is atomic by design.
"What does WHEN NOT MATCHED BY SOURCE THEN DELETE do?" — deletes rows from the target that have no matching row in the source. The full "make target look like source" operation.

Worked example — Postgres ON CONFLICT for a daily metrics roll-up

Detailed explanation. A daily ETL job computes daily_metrics (date, metric_name, value) and upserts into a metrics table keyed on (date, metric_name). The upsert pattern uses ON CONFLICT (date, metric_name) DO UPDATE SET value = EXCLUDED.value so that re-running the job (same day, same metric) updates the row in place instead of duplicating.

Question. Show the schema, the upsert statement, and the result after two runs (one with a typo, one with the fix).

Input — first run.

date	metric_name	value
2026-06-01	revenue	1000
2026-06-01	signups	50

Input — second run (revenue corrected).

date	metric_name	value
2026-06-01	revenue	1200

Code.

CREATE TABLE metrics (
    date          DATE NOT NULL,
    metric_name   TEXT NOT NULL,
    value         NUMERIC(12, 2) NOT NULL,
    updated_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
    CONSTRAINT metrics_pk PRIMARY KEY (date, metric_name)
);

-- First run
INSERT INTO metrics (date, metric_name, value)
VALUES ('2026-06-01', 'revenue', 1000),
       ('2026-06-01', 'signups', 50)
ON CONFLICT (date, metric_name) DO UPDATE
SET value = EXCLUDED.value,
    updated_at = now();

-- Second run — same key, corrected value
INSERT INTO metrics (date, metric_name, value)
VALUES ('2026-06-01', 'revenue', 1200)
ON CONFLICT (date, metric_name) DO UPDATE
SET value = EXCLUDED.value,
    updated_at = now();

Step-by-step explanation.

The first run inserts two rows. No conflicts because the table is empty. The ON CONFLICT clause is dormant.
The second run attempts to insert (2026-06-01, 'revenue', 1200). The PRIMARY KEY constraint on (date, metric_name) flags the conflict because that pair already exists.
DO UPDATE SET value = EXCLUDED.value runs. EXCLUDED holds the new row's values; value becomes 1200. updated_at is bumped to now().
The signups row is untouched — no conflict and the statement only operates on the rows it tried to insert.

Output.

date	metric_name	value	updated_at
2026-06-01	revenue	1200	run 2 timestamp
2026-06-01	signups	50	run 1 timestamp

Rule of thumb. Every metrics or roll-up table ships with INSERT ... ON CONFLICT (key_columns) DO UPDATE SET .... Re-running the job is then safe and idempotent — the same input produces the same final state regardless of how many times it ran. Backfills become trivial.

Worked example — MySQL ON DUPLICATE KEY UPDATE

Detailed explanation. A MySQL-backed analytics service stores hit counters per (page_id, day). Every page view triggers an upsert that increments the count. The ON DUPLICATE KEY UPDATE form is MySQL's native syntax.

Question. Show the table, the upsert statement, and the resulting row counts after 3 page views to page 42 on the same day.

Input — three calls.

page_id	day	delta
42	2026-06-01	1
42	2026-06-01	1
42	2026-06-01	1

Code.

CREATE TABLE page_views (
    page_id   BIGINT NOT NULL,
    day       DATE NOT NULL,
    view_count BIGINT NOT NULL,
    PRIMARY KEY (page_id, day)
);

-- Upsert with increment (MySQL 8.0.20+)
INSERT INTO page_views (page_id, day, view_count)
VALUES (42, '2026-06-01', 1) AS new
ON DUPLICATE KEY UPDATE
    view_count = page_views.view_count + new.view_count;

Step-by-step explanation.

First call: (42, '2026-06-01') does not exist. The INSERT lands; view_count = 1.
Second call: (42, '2026-06-01') exists. The PRIMARY KEY collision fires. ON DUPLICATE KEY UPDATE runs: view_count = page_views.view_count + new.view_count = 1 + 1 = 2.
Third call: same conflict. view_count = 2 + 1 = 3.
The increment pattern target_col = target.target_col + new.target_col is the canonical "counter upsert" in MySQL. Older syntax used VALUES(view_count) instead of new.view_count.

Output.

page_id	day	view_count
42	2026-06-01	3

Rule of thumb. Counter tables (page views, click counts, daily aggregates) live or die by the upsert-with-increment pattern. Use MySQL's INSERT ... AS new ON DUPLICATE KEY UPDATE col = target.col + new.col or the equivalent Postgres ON CONFLICT (key) DO UPDATE SET col = target.col + EXCLUDED.col. The pattern is race-safe and idempotent at the SQL layer.

Worked example — SQL Server MERGE with HOLDLOCK

Detailed explanation. SQL Server's MERGE statement is the canonical "upsert from staging" pattern, but it is not race-safe against concurrent inserts without the HOLDLOCK hint. Without HOLDLOCK, two concurrent MERGEs can both see "no match" for the same key, then both insert — leaving the table with a duplicate. The fix is one keyword.

Question. Show the MERGE statement, the race condition without HOLDLOCK, and the fixed form with HOLDLOCK.

Input — staging table.

account_id	balance
1	100
2	200

Code.

-- BROKEN under concurrency — duplicates possible
MERGE INTO accounts AS target
USING staging_accounts AS source
ON target.account_id = source.account_id
WHEN MATCHED THEN
    UPDATE SET balance = source.balance
WHEN NOT MATCHED THEN
    INSERT (account_id, balance) VALUES (source.account_id, source.balance);

-- FIXED — HOLDLOCK ensures range lock on the conflict target
MERGE INTO accounts WITH (HOLDLOCK) AS target
USING staging_accounts AS source
ON target.account_id = source.account_id
WHEN MATCHED THEN
    UPDATE SET balance = source.balance
WHEN NOT MATCHED THEN
    INSERT (account_id, balance) VALUES (source.account_id, source.balance);

Step-by-step explanation.

Without HOLDLOCK, MERGE takes only update locks on matched rows and shared locks on the conflict check. Two concurrent transactions can both see "no row for account_id=3" simultaneously, then both INSERT — producing a duplicate that the PRIMARY KEY constraint may or may not catch (depending on whether one transaction commits first).
WITH (HOLDLOCK) requests SERIALIZABLE isolation for the target table — the engine takes a range lock over the conflict-checked keys. Concurrent MERGEs serialise; only one runs at a time on overlapping keys.
The fix is universal: every SQL Server MERGE statement on a hot-path upsert ships with WITH (HOLDLOCK). The cost is a small concurrency reduction; the benefit is correctness.
The Postgres INSERT ... ON CONFLICT form does not need this hint — Postgres's conflict check is atomic by design.

Output (single MERGE, no race).

account_id	balance
1	100
2	200

Rule of thumb. On SQL Server, never write MERGE without WITH (HOLDLOCK) unless you've proven via load tests that the workload is serial. The duplicate-row race is documented Microsoft behaviour and shows up in every "production has unexpected duplicates" post-mortem.

Worked example — MERGE for snapshot reconciliation (Snowflake)

Detailed explanation. A Snowflake dimension table dim_customers should reflect the daily snapshot stg_customers_daily exactly — new rows inserted, changed rows updated, deleted rows removed. The MERGE statement with all three WHEN clauses is the textbook pattern.

Question. Show the MERGE statement and explain each WHEN clause.

Input. Daily snapshot vs current dim — three deltas (added, changed, removed).

Code.

MERGE INTO dim_customers AS t
USING stg_customers_daily AS s
ON t.customer_id = s.customer_id

WHEN MATCHED AND (
    t.name IS DISTINCT FROM s.name OR
    t.email IS DISTINCT FROM s.email
) THEN UPDATE SET
    name = s.name,
    email = s.email,
    updated_at = CURRENT_TIMESTAMP()

WHEN NOT MATCHED THEN INSERT (
    customer_id, name, email, created_at, updated_at
) VALUES (
    s.customer_id, s.name, s.email, CURRENT_TIMESTAMP(), CURRENT_TIMESTAMP()
)

WHEN NOT MATCHED BY SOURCE THEN DELETE;

Step-by-step explanation.

WHEN MATCHED AND (t.name IS DISTINCT FROM s.name OR ...) — update only when at least one column actually changed. The IS DISTINCT FROM is NULL-safe inequality, so a NULL on either side is handled correctly. Without this guard, every row would be rewritten on every run, bloating change-tracking and wasting compute.
WHEN NOT MATCHED THEN INSERT — new customers in the snapshot land as fresh rows in the dim. created_at and updated_at both set on insert.
WHEN NOT MATCHED BY SOURCE THEN DELETE — customers that disappeared from the snapshot are removed from the dim. This makes dim_customers a full mirror of the snapshot. Omit this clause if the snapshot is incremental and old customers should stick around.
All three clauses share the ON t.customer_id = s.customer_id join condition — the UNIQUE / PRIMARY KEY on customer_id is the contract.

Output. The dim table becomes a row-for-row mirror of the snapshot after the MERGE completes.

Rule of thumb. Three-clause MERGE (WHEN MATCHED, WHEN NOT MATCHED, WHEN NOT MATCHED BY SOURCE) is the canonical "make target equal source" pattern. Always guard the UPDATE clause with IS DISTINCT FROM to skip no-op writes. Only add WHEN NOT MATCHED BY SOURCE THEN DELETE when the source is a full snapshot; never on an incremental load.

SQL interview question on safe daily-roll-up upserts

A senior interviewer often combines: "Your team runs a daily metrics ETL that may re-run on backfill or after a partial failure. Write the upsert statement that updates only changed rows, never duplicates, and is safe to re-run any number of times."

Solution Using ON CONFLICT with IS DISTINCT FROM guard

INSERT INTO daily_metrics (date, metric_name, value, computed_at)
SELECT date, metric_name, value, now()
FROM staging_daily_metrics
ON CONFLICT (date, metric_name) DO UPDATE
SET value = EXCLUDED.value,
    computed_at = EXCLUDED.computed_at
WHERE daily_metrics.value IS DISTINCT FROM EXCLUDED.value;

Step-by-step trace.

Staging row	Target row	Conflict?	Action
(2026-06-01, revenue, 1000)	(none)	no	INSERT
(2026-06-01, revenue, 1000)	(2026-06-01, revenue, 1000)	yes	guard FALSE → no-op
(2026-06-01, revenue, 1200)	(2026-06-01, revenue, 1000)	yes	guard TRUE → UPDATE
(2026-06-02, signups, 75)	(2026-06-02, signups, NULL)	yes	guard TRUE (NULL DISTINCT FROM 75) → UPDATE
(2026-06-03, ctr, NULL)	(2026-06-03, ctr, NULL)	yes	guard FALSE (NULL IS NOT DISTINCT FROM NULL) → no-op

The IS DISTINCT FROM guard is the no-op skip — only rewrite rows whose value actually changed. Re-running the same backfill three times produces zero writes after the first; the table stabilises and downstream change-data-capture stays quiet.

Output:

Property	Value
First run writes	N rows (every row inserted)
Re-run same input writes	0 rows (guard FALSE everywhere)
Re-run with changed values writes	only changed rows
Final table state	identical regardless of re-run count

Why this works — concept by concept:

ON CONFLICT (date, metric_name) — uses the PRIMARY KEY ((date, metric_name)) as the conflict target. The constraint must exist before the upsert is written.
EXCLUDED — the pseudo-row holding the values from the would-have-inserted row. EXCLUDED.value is the new value; daily_metrics.value is the existing target value.
IS DISTINCT FROM — NULL-safe inequality. Treats two NULLs as "the same" so a NULL → NULL re-run is a no-op; treats NULL vs non-NULL as "different" so a NULL → 75 update fires.
Idempotency by design — same input ⇒ same final state, regardless of how many times the statement runs. Backfills become safe; partial failures become safe; retries become safe.
Cost — one B-tree lookup per row to check the conflict; one no-op or one write per row. The IS DISTINCT FROM guard cuts no-op write cost to zero in steady state — the bulk of the daily run skips the table entirely.

SQL
Topic — CTE
Upsert and merge problems (SQL)

Practice →

SQL
Topic — anti-join
Anti-join and MERGE delete patterns (SQL)

Practice →

Cheat sheet — uniqueness recipes

One UNIQUE per business key. UNIQUE (tenant_id, slug) on every multi-tenant "human-friendly key" — slug, username, code, name.
Case-insensitive uniqueness. UNIQUE (LOWER(email)) (Postgres / Oracle), or email_lower GENERATED ALWAYS AS (LOWER(email)) STORED, UNIQUE (email_lower) (MySQL).
Only-active uniqueness. CREATE UNIQUE INDEX one_active_sub ON subs(user_id) WHERE is_active = true (Postgres / SQL Server filtered index / SQLite / DuckDB).
Soft-delete-aware uniqueness. CREATE UNIQUE INDEX unique_email ON users(email) WHERE deleted_at IS NULL.
NULL-tolerant unique. UNIQUE NULLS NOT DISTINCT (col) (Postgres 15+ / DuckDB) — only one NULL row allowed.
Surrogate PK + business UNIQUE. id BIGSERIAL PRIMARY KEY, UNIQUE (business_key) — the default for any table whose business key might ever change.
Latest-per-key dedup (universal). ROW_NUMBER() OVER (PARTITION BY key ORDER BY updated_at DESC) AS rn then WHERE rn = 1.
Latest-per-key dedup (Postgres / DuckDB). SELECT DISTINCT ON (key) * FROM t ORDER BY key, updated_at DESC.
Latest-per-key dedup (Snowflake / BigQuery / DuckDB / Teradata). SELECT * FROM t QUALIFY ROW_NUMBER() OVER (...) = 1.
Idempotent insert (Postgres). INSERT ... ON CONFLICT (key) DO NOTHING. (MySQL: INSERT IGNORE. SQL Server: IF NOT EXISTS ... INSERT ... under HOLDLOCK.)
Upsert (Postgres). INSERT ... ON CONFLICT (key) DO UPDATE SET col = EXCLUDED.col WHERE target.col IS DISTINCT FROM EXCLUDED.col.
Upsert (MySQL 8.0.20+). INSERT ... AS new ON DUPLICATE KEY UPDATE col = target.col + new.col (for counters) or col = new.col (for replacements).
Upsert (SQL Server / Snowflake / BigQuery). MERGE INTO target [WITH (HOLDLOCK)] USING source ON ... WHEN MATCHED THEN UPDATE WHEN NOT MATCHED THEN INSERT.
Warehouse enforcement fallback. dbt test unique: column_name (and where: for partial uniqueness) — the standard for Snowflake / BigQuery / Redshift where constraints are informational.
Duplicate audit. COUNT(*) - COUNT(DISTINCT key) to count duplicates per column; SELECT key, COUNT(*) FROM t GROUP BY key HAVING COUNT(*) > 1 to list them.

Frequently asked questions

What's the difference between UNIQUE and PRIMARY KEY in SQL?

A PRIMARY KEY is a special UNIQUE constraint with three extra rules: every column is implicitly NOT NULL, a table has exactly one primary key, and the primary key is the conventional foreign-key target. A regular UNIQUE constraint allows NULLs (one or many depending on the dialect — SQL Server: one; Postgres / MySQL: many), permits unlimited UNIQUE constraints per table, and is a valid FK target. Use the primary key for the surrogate identity (id BIGSERIAL); use UNIQUE constraints for every business key (email, slug, (tenant_id, name) pair). Both compile to a B-tree index — they are siblings, not opposites.

Can a UNIQUE column have NULL values in SQL?

Yes — usually many. Postgres, MySQL, Oracle, Snowflake, BigQuery, and DuckDB treat every NULL as distinct from every other NULL by default, so a UNIQUE (col) column can hold any number of NULL rows. SQL Server is the exception: it allows exactly one NULL in a UNIQUE column. The workaround on SQL Server (and now natively on Postgres 15+) is UNIQUE NULLS NOT DISTINCT or a filtered index CREATE UNIQUE INDEX ... WHERE col IS NOT NULL. State the NULL behaviour explicitly in the DDL — future readers and future engines will not have to guess.

How do I make a UNIQUE constraint only apply to active rows?

Use a partial unique index (also called a filtered index in SQL Server). The syntax is CREATE UNIQUE INDEX one_active_sub ON subscriptions (user_id) WHERE is_active = true on Postgres / SQL Server / SQLite / DuckDB. The index only covers rows where is_active = true, so uniqueness is enforced within that subset — at most one active subscription per user — while unlimited cancelled rows can coexist. MySQL does not support partial indexes directly; the workaround is a generated column that is NULL when the row is inactive, combined with a UNIQUE constraint on the generated column. Snowflake / BigQuery / Redshift enforce neither — use a dbt unique test with a where: clause.

Should I use DISTINCT or ROW_NUMBER for deduplication in SQL?

Use DISTINCT when you only want the distinct combination of selected columns and don't care which row "wins" among the duplicates. Use ROW_NUMBER() OVER (PARTITION BY key ORDER BY recency DESC) when you need to pick a specific row per key — typically the latest by timestamp. The ROW_NUMBER pattern is the canonical "deduplicate a CDC stream" or "latest per customer" idiom and works in every dialect. Postgres / DuckDB also offer DISTINCT ON (key) as a shorter syntax; Snowflake / BigQuery / DuckDB / Teradata offer QUALIFY ROW_NUMBER() OVER (...) = 1 to skip the CTE wrapper. All three encode the same plan.

Does Snowflake enforce UNIQUE constraints?

No — Snowflake, BigQuery, and Redshift accept UNIQUE and PRIMARY KEY syntax but do not enforce the constraints at write time. The constraints are informational hints that help the query optimiser and downstream tools understand the schema. Duplicates can land silently in production. The canonical fix is a dbt test unique: column_name in the model YAML — the test runs in CI and surfaces violations within minutes of the regression. Pair every informational UNIQUE with a dbt test (or an equivalent in your transformation framework) for real protection. On Snowflake specifically, you can also rely on streams + tasks to audit uniqueness on a schedule, but the dbt test is simpler and runs on every CI commit.

What's the safest way to upsert in Postgres?

INSERT INTO t (key, val) VALUES (...) ON CONFLICT (key) DO UPDATE SET val = EXCLUDED.val WHERE t.val IS DISTINCT FROM EXCLUDED.val. This pattern is atomic by design (Postgres's conflict check and the resulting INSERT or UPDATE happen in one statement), idempotent (same input → same final state regardless of retries), and efficient (the IS DISTINCT FROM guard skips no-op rewrites that would otherwise bloat the WAL and noise the change-data-capture stream). The conflict target (key) must reference an existing UNIQUE or PRIMARY KEY constraint — the upsert mechanism leans on the constraint, and you cannot upsert without one. For idempotent insert-only flows, drop the SET clause: ON CONFLICT (key) DO NOTHING.

Practice on PipeCode

Drill the DISTINCT and uniqueness practice library → for the COUNT-DISTINCT and dedup family of probes.
Rehearse on window-function deduplication problems → when the interviewer wants ROW_NUMBER / RANK on a CDC table.
Sharpen ranking drills → for the "latest per key" and tiebreaker patterns.
Layer the aggregation library → for the COUNT(*) vs COUNT(DISTINCT col) family.
Stack the GROUP BY library → for the duplicate-detection idiom.
Stack the joins practice library → for upsert-from-staging and anti-join patterns.
For the broader surface, read top data engineering interview questions →.
Stack the prerequisites with the only 5 skills you need to become a data engineer →.
Sharpen the dialect axis with the SQL for data engineering interviews course →.
For long-form schema craft, work through data modelling for DE interviews →.

Pipecode.ai is Leetcode for Data Engineering — every uniqueness recipe above ships with hands-on practice rooms where you write the UNIQUE constraint, the partial index, and the ROW_NUMBER dedup against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your fix to a `null not in sql` upsert behaves the same on Postgres as on Snowflake.

Practice DISTINCT and uniqueness now →
Window-function dedup drills →

1. Why uniqueness is the foundation of clean data

Uniqueness is a contract, not a comment — the schema either enforces it on every write, or you re-pay the bill on every read

Worked example — the duplicate-source feed disaster

Worked example — the retry-storm duplicate

SQL interview question on uniqueness-as-contract

Solution Using layered UNIQUE constraints across the pipeline

2. UNIQUE vs PRIMARY KEY — the matrix

unique constraint and primary key look interchangeable but differ in NULL handling, cardinality, clustering, and FK convention — five rows the interviewer wants you to recite

Worked example — same column as PRIMARY KEY vs UNIQUE

Worked example — SQL Server's "only one NULL" quirk

SQL interview question on the UNIQUE vs PRIMARY KEY decision

Solution Using surrogate PK + composite UNIQUE for stability

3. Composite & partial unique indexes

composite unique scopes the constraint to a column tuple; partial unique index scopes it to a row predicate — together they cover every "unique within X" requirement

Worked example — multi-tenant slug uniqueness

Worked example — one-active partial unique index

Worked example — case-insensitive email uniqueness

Worked example — Snowflake informational UNIQUE + dbt test

SQL interview question on composite + partial uniqueness

Solution Using composite UNIQUE plus partial UNIQUE

4. Hard vs soft deduplication — constraint vs query

hard uniqueness rejects duplicates at write time; soft uniqueness collapses them at read time with row_number deduplication or distinct on — pick the right one for the right layer

Worked example — dedup a CDC stream with ROW_NUMBER

Worked example — DISTINCT vs GROUP BY on a duplicate-heavy table

Worked example — picking the latest row with DISTINCT ON

SQL interview question on dedup of a CDC staging table

Solution Using QUALIFY ROW_NUMBER for cross-dialect dedup

5. UPSERT — INSERT or UPDATE on duplicate key

upsert is "insert if new, update if existing" — three syntaxes (ON CONFLICT, ON DUPLICATE KEY, MERGE) over the same UNIQUE-constraint mechanism

Worked example — Postgres ON CONFLICT for a daily metrics roll-up

Worked example — MySQL ON DUPLICATE KEY UPDATE

Worked example — SQL Server MERGE with HOLDLOCK

Worked example — MERGE for snapshot reconciliation (Snowflake)

SQL interview question on safe daily-roll-up upserts

Solution Using ON CONFLICT with IS DISTINCT FROM guard

Cheat sheet — uniqueness recipes

Frequently asked questions

What's the difference between UNIQUE and PRIMARY KEY in SQL?

Can a UNIQUE column have NULL values in SQL?

How do I make a UNIQUE constraint only apply to active rows?

Should I use DISTINCT or ROW_NUMBER for deduplication in SQL?

Does Snowflake enforce UNIQUE constraints?

What's the safest way to upsert in Postgres?

Practice on PipeCode

`unique constraint` and `primary key` look interchangeable but differ in NULL handling, cardinality, clustering, and FK convention — five rows the interviewer wants you to recite

`composite unique` scopes the constraint to a column tuple; `partial unique index` scopes it to a row predicate — together they cover every "unique within X" requirement

`hard uniqueness` rejects duplicates at write time; `soft uniqueness` collapses them at read time with `row_number deduplication` or `distinct on` — pick the right one for the right layer

`upsert` is "insert if new, update if existing" — three syntaxes (`ON CONFLICT`, `ON DUPLICATE KEY`, `MERGE`) over the same UNIQUE-constraint mechanism