Gowtham Potureddi

Posted on Jun 4

SQL DISTINCT + COUNT(DISTINCT): Deduplication, Approximate Counts, HyperLogLog

#python #sql #dataengineering #interview

sql distinct looks like the simplest operator in the language — a one-word switch that strips duplicates from a result set — and that is exactly why it shows up in dozens of production incidents and interview rounds every year. The catch is that sql distinct is not free. Behind every SELECT DISTINCT is a sort or a hash aggregation; behind every sql count distinct is a unique-set whose memory cost is O(distinct cardinality); behind every dedup pattern is a choice between DISTINCT, GROUP BY, ROW_NUMBER(), and DISTINCT ON that decides whether the query finishes in seconds or grinds the warehouse for an hour.

This guide walks through the five things every senior SQL round probes about select distinct sql and count distinct sql — when DISTINCT and GROUP BY produce the same plan and when they do not, why COUNT(DISTINCT col) blows past work_mem on billion-row tables, how APPROX_COUNT_DISTINCT and HyperLogLog deliver ~1.6% error in ~1.5 KB, and the three cross-dialect patterns for sql deduplicate (ROW_NUMBER, DISTINCT ON, GROUP BY + ARG_MAX). Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works. By the end, distinct in sql stops being a one-word operator and starts being an explicit cost discussion you can defend on any query plan.

When you want hands-on reps immediately after reading, browse distinct + dedup practice library →, drill aggregation problems for the cost discussion →, and rehearse window-function dedup patterns →.

On this page

Why DISTINCT is more expensive than people think
SELECT DISTINCT vs GROUP BY — when same, when not
COUNT(DISTINCT col) anatomy and performance
APPROX_COUNT_DISTINCT and HyperLogLog
Deduplication patterns — ROW_NUMBER vs DISTINCT vs GROUP BY
Cheat sheet — DISTINCT recipes
Frequently asked questions
Practice on PipeCode

1. Why DISTINCT is more expensive than people think

`sql distinct` is never "free" — every DISTINCT triggers a sort or a hash aggregation under the hood

The one-sentence invariant: SELECT DISTINCT and SELECT COUNT(DISTINCT col) always materialise a unique set, which means either a sort over the inputs or a hash aggregation, and the memory cost of that unique set is O(distinct cardinality). Once you internalise that — DISTINCT is a real operator, not a syntactic flag — most of the select distinct sql interview question family becomes a discussion about plan choices and memory budgets.

The three myths every junior engineer believes.

"DISTINCT is just a syntactic switch." It is not. Postgres EXPLAIN shows HashAggregate (or Unique + Sort for ordered output); MySQL shows Using temporary; Using filesort for the multi-column case; Snowflake shows an extra aggregation stage in the profile. A switch with a plan node attached is not a switch — it is an operator.
"DISTINCT is cheaper than GROUP BY." Almost never. For a single column the planner usually picks the same plan for both. For multi-column DISTINCT the planner sometimes picks a worse hash table layout than the equivalent GROUP BY would — because GROUP BY is allowed to pick an aggregation strategy first and then think about uniqueness, whereas DISTINCT * has to treat every column as a group key.
"SELECT DISTINCT * is the safe way to dedup." It is the dangerous way to dedup. DISTINCT * keys the entire row including the created_at timestamp, the updated_at watermark, and any audit column; if even one of those changes between two otherwise-identical writes, the duplicate is preserved. The right pattern is DISTINCT ON (business_key) (Postgres) or ROW_NUMBER() OVER (PARTITION BY business_key ORDER BY tie_breaker) = 1 (everywhere) — covered in §5.

What a senior engineer reaches for instead.

For "unique values of one column" — SELECT DISTINCT col FROM t or SELECT col FROM t GROUP BY col are interchangeable.
For "unique tuples of N columns" — write GROUP BY a, b, … and you can fold in COUNT(*) or MAX(ts) later without re-querying.
For "exact count of unique values" — COUNT(DISTINCT col) if cardinality is small; APPROX_COUNT_DISTINCT(col) if cardinality is large and ~1.6% error is acceptable.
For "one row per business key" — never DISTINCT *. Always ROW_NUMBER() or DISTINCT ON keyed on the business key.

The production code review heuristics.

If a query has SELECT DISTINCT a, b, c, d, e, f, g, ask why. There is almost always a business key buried in there and the engineer has reached for DISTINCT because they were unsure which key.
If a query has COUNT(DISTINCT col) over a known billion-row table for a dashboard, ask whether APPROX_COUNT_DISTINCT is acceptable. For MAU / DAU dashboards the answer is almost always yes.
If a query has multiple COUNT(DISTINCT) calls, ask whether the engine supports GROUPING SETS or HLL_COMBINE to fold them into one pass instead of N passes.

What interviewers listen for.

Do you say "DISTINCT triggers a sort or hash aggregation" when asked the cost? — senior signal.
Do you reach for GROUP BY when an aggregate is needed alongside the unique set? — required answer.
Do you mention APPROX_COUNT_DISTINCT / HyperLogLog when the table is huge? — senior signal.
Do you flag DISTINCT * as the dedup anti-pattern? — required answer.

Worked example — measure the cost of DISTINCT on a 100M-row events table

Detailed explanation. A common interview probe is "how much does SELECT DISTINCT user_id FROM events cost?" The senior answer is not a number; it is a method: "Show me the EXPLAIN, the cardinality of user_id, and work_mem, and I will tell you whether it stays in memory or spills."

Question. Given an events table with 100 million rows and ~5 million distinct user_id values, write the query that returns the unique users, and explain what the planner does on Postgres with work_mem=64MB.

Input.

Setting	Value
Table	`events(user_id BIGINT, event_ts TIMESTAMPTZ, …)`
Rows	100,000,000
Distinct `user_id`	5,000,000
`work_mem`	64 MB
Bytes per hash entry	~24 (key + hash + chain)

Code.

EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT user_id
FROM events;

Step-by-step explanation.

Planner picks a strategy. Postgres compares HashAggregate (build a hash table keyed by user_id) vs Unique + Sort (sort everything, then walk the sorted stream). For 5M distinct keys the hash table is ~120 MB — bigger than work_mem — so the planner may either pick Unique + Sort (external sort, spills to disk) or use the partitioned-hash-aggregation strategy added in Postgres 13+ (also spills).
Memory footprint = O(distinct cardinality). 5M × 24 bytes ≈ 120 MB hash table. With work_mem=64MB, the operator either spills to disk in batches or the planner switches to Sort (which spills via external merge).
Wall time dominated by I/O during spill. Once the operator spills, the cost is dominated by reading and writing temp files, not by hashing CPU.
The fix. Raise work_mem to ~256 MB just for this session, or pre-aggregate at write time, or accept APPROX_COUNT_DISTINCT when an exact list of users is not needed.

Output (illustrative).

Strategy	work_mem	Wall time	Disk temp files
`HashAggregate` (in memory)	256 MB	8 s	0
`HashAggregate` (spilled)	64 MB	38 s	~120 MB
`Sort + Unique`	64 MB	42 s	~2 GB (sorts all 100M rows first)

Rule of thumb. work_mem >= hash_entry_bytes × distinct_cardinality keeps DISTINCT in memory. When you cannot raise work_mem, switch to APPROX_COUNT_DISTINCT or pre-aggregate the unique set in a daily rollup.

Worked example — why DISTINCT * preserves duplicates that look identical

Detailed explanation. A real production trap: a team has a signups(email, user_id, source, signup_ts, updated_at) table and tries to dedupe with SELECT DISTINCT * FROM signups. They expect one row per email; they get every version because updated_at differs across replays of the same logical signup.

Question. Given a table where the business key is email_lower but the row carries signup_ts and updated_at audit columns, explain why SELECT DISTINCT * does not produce one row per email, and rewrite the query so it does.

Input.

email	user_id	source	signup_ts	updated_at
alice@x	1	direct	2026-04-01 09:00	2026-04-01 09:00
alice@x	1	direct	2026-04-01 09:00	2026-04-15 10:30
alice@x	1	direct	2026-04-01 09:00	2026-05-02 14:00
bob@x	2	ad	2026-04-02 10:00	2026-04-02 10:00

Code.

-- Wrong — DISTINCT * preserves duplicates because updated_at differs
SELECT DISTINCT *
FROM signups;
-- 4 rows returned (no rows dropped — alice@x has 3 different updated_at values)

-- Right — key on the business key
WITH ranked AS (
    SELECT
        s.*,
        ROW_NUMBER() OVER (
            PARTITION BY LOWER(email)
            ORDER BY updated_at DESC
        ) AS rn
    FROM signups s
)
SELECT email, user_id, source, signup_ts, updated_at
FROM ranked
WHERE rn = 1;

-- Postgres concise form
SELECT DISTINCT ON (LOWER(email)) *
FROM signups
ORDER BY LOWER(email), updated_at DESC;

Step-by-step explanation.

DISTINCT * hashes the entire row including the updated_at audit column. Two rows with identical business data but different updated_at are not duplicates by the DISTINCT * definition — and so neither is dropped.
The right answer is "key on the business key" — LOWER(email) (so casing differences don't fragment the key) — and pick the row with the latest updated_at.
ROW_NUMBER() ... PARTITION BY LOWER(email) ORDER BY updated_at DESC does exactly that. Filter rn = 1 to keep the latest.
DISTINCT ON (LOWER(email)) ... ORDER BY LOWER(email), updated_at DESC is the Postgres-only equivalent.

Output (with the correct dedup).

email	user_id	source	signup_ts	updated_at
alice@x	1	direct	2026-04-01 09:00	2026-05-02 14:00
bob@x	2	ad	2026-04-02 10:00	2026-04-02 10:00

Rule of thumb. When you reach for DISTINCT, ask "by what key?" The answer is almost never "every column in the row." Build the dedup around the business key explicitly — LOWER(email), order_id, (customer_id, event_date) — and pick the canonical row with a deterministic ORDER BY tie-breaker.

SQL interview question on the cost of DISTINCT

A senior interviewer often opens this round as: "Walk me through what happens when I write SELECT DISTINCT user_id FROM events on a 100M-row table. Show me the plan, the memory cost, and the failure mode."

Solution Using EXPLAIN + work_mem reasoning

-- The query under test
SELECT DISTINCT user_id
FROM events;

-- The diagnostic command
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT DISTINCT user_id
FROM events;

Step-by-step trace.

Step	Action	Observable effect
1	Planner estimates `n_distinct` for `user_id` from `pg_statistic`	`n_distinct = -0.05` → ~5% of rows are distinct → ~5M unique values
2	Planner picks `HashAggregate` because it expects O(5M) groups	`EXPLAIN` line: `HashAggregate (cost=… rows=5000000)`
3	Executor builds the hash table keyed by `user_id`	Memory grows toward ~120 MB
4	If `work_mem < 120 MB`, the operator spills hash partitions to disk	`EXPLAIN ANALYZE` line: `Batches: 8 Memory Usage: 64MB Disk Usage: 56MB`
5	After the seq-scan finishes, the executor emits the unique set	Result row count = ~5M

Output:

Metric	In-memory run (work_mem=256MB)	Spilled run (work_mem=64MB)
Wall time	~8 s	~38 s
Buffers shared hit	1.2M	1.2M
Temp files written	0	8 partitions, ~120 MB total
Result rows	5,000,000	5,000,000

Why this works — concept by concept:

HashAggregate vs Sort+Unique — Postgres picks whichever is cheaper given work_mem and the estimated cardinality. HashAggregate is O(N) CPU but O(distinct) memory; Sort+Unique is O(N log N) CPU but O(N) temp space.
work_mem as the spill gate — when the hash table exceeds work_mem, modern Postgres switches to a partitioned-hash-aggregation that spills excess partitions to disk and merges them. The wall time triples but the query still completes.
n_distinct statistics — the planner reads n_distinct from pg_statistic (filled by ANALYZE). Stale stats can pick the wrong plan; ANALYZE events periodically.
The escape hatch — for billion-row tables, neither plan is fast. The right answer is APPROX_COUNT_DISTINCT (§4) or a daily-rolled-up distinct_users table.
Cost — memory = O(distinct cardinality × entry size); CPU = O(N) for hash, O(N log N) for sort; disk = O(distinct cardinality) on spill.

SQL
Topic — distinct
DISTINCT cost and deduplication problems (SQL)

Practice →

2. SELECT DISTINCT vs GROUP BY — when same, when not

`select distinct sql` and `GROUP BY` are two syntaxes for the same plan — until an aggregate or an `ORDER BY` enters the picture

The mental model in one line: for the question "give me the unique values of column X" the planner picks the same operator for SELECT DISTINCT X FROM t and SELECT X FROM t GROUP BY X; for "give me the unique values plus an aggregate", GROUP BY wins because DISTINCT cannot aggregate inline. Once you say that out loud, the distinct in sql vs GROUP BY interview question collapses to a two-bullet checklist.

The four cases that show up in interviews.

Single-column uniqueness. SELECT DISTINCT col ≡ SELECT col GROUP BY col. Planner picks HashAggregate for both; same memory, same wall time.
Multi-column uniqueness. SELECT DISTINCT a, b, c ≡ SELECT a, b, c GROUP BY a, b, c. Same plan in Postgres; subtle plan differences in MySQL because the optimizer for GROUP BY is older and sometimes adds an implicit ORDER BY (configurable since MySQL 8 with ONLY_FULL_GROUP_BY).
Uniqueness + aggregate. GROUP BY wins. You cannot write SELECT DISTINCT a, COUNT(*) FROM t GROUP BY a more concisely than the GROUP BY itself — and DISTINCT a, b followed by an outer COUNT(*) is an extra pass.
Uniqueness + ordering. Postgres requires every ORDER BY expression to appear in the DISTINCT list (otherwise the planner can't guarantee deterministic order). GROUP BY is allowed to sort by any aggregate-valid expression.

The three planner observations every senior interviewer wants to hear.

EXPLAIN tells the truth. Run both queries and read the plan. If both show HashAggregate (cost=… rows=… width=…) with the same numbers, the choice is purely stylistic.
MySQL is the exception, not the rule. MySQL prior to 8.0 used to implicitly ORDER BY the grouped columns; this was removed in 8.0. The DISTINCT vs GROUP BY performance gap on MySQL 5.7 was sometimes real because of the implicit sort; on 8.0+ it is not.
The aggregate axis is the real divergence. DISTINCT cannot be combined with an aggregate inline. Once you need COUNT(*), MAX(ts), SUM(x), the equivalent GROUP BY is shorter and the plan is identical to the DISTINCT + aggregate rewrite anyway.

Postgres-specific subtlety — DISTINCT ORDER BY.

Legal in Postgres: SELECT DISTINCT a, b FROM t ORDER BY a, b — a and b are in the DISTINCT list.
Illegal in Postgres: SELECT DISTINCT a FROM t ORDER BY a, b — b is not in the DISTINCT list, so the planner cannot guarantee order.
Equivalent GROUP BY: SELECT a, MAX(b) FROM t GROUP BY a ORDER BY a, MAX(b) — works because MAX(b) is a deterministic function of the group.

Worked example — DISTINCT a, b vs GROUP BY a, b

Detailed explanation. A common interview probe is "are SELECT DISTINCT a, b FROM t and SELECT a, b FROM t GROUP BY a, b always identical?" The senior answer is "yes, modulo the aggregate axis and modulo ORDER BY."

Question. Given a 50M-row pageviews(user_id, page_url, ts) table, return the unique (user_id, page_url) tuples two ways — using DISTINCT and using GROUP BY — and compare the plans.

Input.

Column	Type	Cardinality
user_id	BIGINT	5M distinct
page_url	TEXT	200 distinct
ts	TIMESTAMPTZ	unique per row

Code.

-- A — using DISTINCT
SELECT DISTINCT user_id, page_url
FROM pageviews;

-- B — using GROUP BY
SELECT user_id, page_url
FROM pageviews
GROUP BY user_id, page_url;

Step-by-step explanation.

Both queries hash on the tuple (user_id, page_url) — same key, same hash function.
Postgres's planner picks HashAggregate for both. EXPLAIN output is byte-for-byte identical except for the operator label.
Estimated rows = min(rows_in_t, cardinality_a × cardinality_b) = min(50M, 5M × 200) = 50M — but the planner caps the estimate at n_distinct ≈ 5M × ~50 (actual co-occurrence) ≈ 250M, then clips to the table row count.
Wall time is identical within ±5% on warm caches; choose the syntax your team's style guide prefers.

Output.

Query	Plan node	Estimated rows	Wall time	Memory
`DISTINCT a, b`	HashAggregate	~50M	24 s	~3 GB (with spill)
`GROUP BY a, b`	HashAggregate	~50M	24 s	~3 GB (with spill)

Rule of thumb. Use DISTINCT when the query is only "give me unique values." Use GROUP BY the moment you might want to fold in COUNT(*), MAX(ts), or any aggregate — even speculatively. The GROUP BY shape leaves the door open without a rewrite.

Worked example — DISTINCT with ORDER BY that breaks on Postgres

Detailed explanation. A subtle Postgres-only behaviour every senior interview probes: SELECT DISTINCT a FROM t ORDER BY b is illegal because the planner cannot guarantee a deterministic order when the sort key is not in the DISTINCT list. GROUP BY does not have this restriction — it can sort by any aggregate of the grouped columns.

Question. Given an events(user_id, last_seen_at, country) table, write a query that returns one row per (user_id) ordered by last_seen_at DESC. Show why the naïve DISTINCT version errors on Postgres and how to rewrite it.

Input.

user_id	last_seen_at	country
1	2026-05-01 09:00	US
1	2026-05-04 12:00	US
2	2026-05-03 11:00	DE
3	2026-05-02 10:00	UK

Code.

-- Broken on Postgres
SELECT DISTINCT user_id
FROM events
ORDER BY last_seen_at DESC;   -- ERROR: for SELECT DISTINCT, ORDER BY expressions must appear in select list

-- Correct rewrite using GROUP BY + MAX
SELECT user_id
FROM events
GROUP BY user_id
ORDER BY MAX(last_seen_at) DESC;

-- Also correct — fold the column into the DISTINCT list and key on user_id
SELECT DISTINCT ON (user_id) user_id, last_seen_at
FROM events
ORDER BY user_id, last_seen_at DESC;

Step-by-step explanation.

The first query fails because Postgres cannot prove that any chosen last_seen_at per user_id is canonical — different rows for the same user_id could be at different last_seen_at.
The GROUP BY shape works because MAX(last_seen_at) is a deterministic function of the group.
The DISTINCT ON shape works because it explicitly says "pick one row per user_id, namely the one with the largest last_seen_at."
MySQL is more permissive (ONLY_FULL_GROUP_BY off by default before 8.0) but the resulting order is then non-deterministic.

Output.

user_id	(sort key)
1	2026-05-04 12:00
2	2026-05-03 11:00
3	2026-05-02 10:00

Rule of thumb. If you find yourself reaching for DISTINCT col ORDER BY other_col, you actually want GROUP BY col ORDER BY aggregate(other_col) or DISTINCT ON (col). The shape of the answer is "one row per key with a chosen tie-breaker," and that is exactly what GROUP BY / DISTINCT ON model.

SQL interview question on DISTINCT vs GROUP BY for aggregation

A scaling probe: "For a daily ETL that needs the unique (user_id, page_url) tuples and a count of how many times each tuple occurred, which is faster — DISTINCT followed by a self-join, or GROUP BY with COUNT(*)?"

Solution Using GROUP BY with COUNT(*) in one pass

-- The wrong way — DISTINCT then re-aggregate (two passes)
WITH unique_pairs AS (
    SELECT DISTINCT user_id, page_url FROM pageviews
)
SELECT u.user_id, u.page_url, COUNT(*) AS hits
FROM unique_pairs u
JOIN pageviews p USING (user_id, page_url)
GROUP BY u.user_id, u.page_url;

-- The right way — GROUP BY does both in one pass
SELECT user_id, page_url, COUNT(*) AS hits
FROM pageviews
GROUP BY user_id, page_url;

Step-by-step trace.

Approach	Passes	Operators	Reads
`DISTINCT` + self-join	2	HashAggregate → HashJoin → HashAggregate	50M + 50M = 100M
Single `GROUP BY`	1	HashAggregate	50M

Output:

Approach	Wall time	Result rows	Peak memory
`DISTINCT` + self-join	78 s	~250K	~6 GB (two hash tables)
Single `GROUP BY`	28 s	~250K	~3 GB (one hash table)

Why this works — concept by concept:

One hash pass vs two — GROUP BY builds the hash table keyed by (user_id, page_url) once and accumulates COUNT(*) into each bucket while it scans. The DISTINCT shape builds the table once, throws away the bucket counts, and forces a re-scan to recompute them.
No self-join cost — the DISTINCT + JOIN shape pays a hash-join cost (build + probe of two 50M streams) that GROUP BY does not.
The planner cannot rescue you — Postgres does not have a rewrite rule that turns DISTINCT + self-join into a single GROUP BY. The optimization is at the SQL level, not the planner level.
Same memory upper bound, smaller peak — both shapes have O(distinct tuples) memory; the GROUP BY shape never instantiates the second hash table for the join.
Cost — GROUP BY + COUNT ≈ O(N) CPU, O(distinct) memory, 1 pass; DISTINCT + JOIN + COUNT ≈ O(N) + O(N log N) CPU, O(2 × distinct) memory, 2 passes.

SQL
Topic — aggregation
Aggregation problems (GROUP BY, DISTINCT, COUNT)

Practice →

3. COUNT(DISTINCT col) anatomy and performance

`sql count distinct` is exact, expensive, and the most common performance footgun in analytics queries

The mental model: COUNT(DISTINCT col) must build the entire unique set of col values in memory (or on disk if it spills), then emit the cardinality of that set. The set never shrinks during the scan — it can only grow — so memory cost is O(distinct cardinality) and any index on col does not save you, because the engine must touch every row to know whether each value is "new" or "already seen."

The five cost properties every interview probes.

Memory = O(distinct cardinality). Not O(row count). A column with 1B rows but only 100 distinct values fits in a few KB; a column with 100M rows and 100M distinct values needs ~2.4 GB of hash table.
Disk spill above work_mem. Postgres spills hash partitions; MySQL writes a temp file to tmpdir; Snowflake spills to remote storage and the credit cost spikes.
Index does not help. A B-tree on col lets the engine sort cheaply, but it still must read every leaf entry. COUNT(DISTINCT col) cannot be answered from an index summary unless the engine maintains a MIN/MAX/UNIQUE_COUNT per leaf page (no major OLTP engine does).
Multiple COUNT(DISTINCT) in one query → N passes. SELECT COUNT(DISTINCT a), COUNT(DISTINCT b), COUNT(DISTINCT c) FROM t builds three separate hash tables. Postgres does this with three Aggregate nodes that each consume the input; some warehouses parallelise it but the memory cost is the sum.
COUNT(DISTINCT a, b) is dialect-specific. Postgres, Snowflake, and BigQuery all support it. MySQL supports it but treats (NULL, NULL) differently. SQL Server requires a CONCAT workaround. Spark supports it via count(distinct a, b) which translates to a tuple hash.

The four most-asked count distinct sql interview questions.

"What's the difference between COUNT(*), COUNT(col), and COUNT(DISTINCT col)?" — row count, non-null row count, unique non-null count.
"Why is COUNT(DISTINCT) slow on big tables?" — the unique set lives in memory; spill is brutal.
"How do I count distinct over a windowed range?" — typically not directly; either array_agg(DISTINCT) + cardinality or a daily-rolled-up table.
"Multiple COUNT(DISTINCT) in one query — how to avoid N passes?" — GROUPING SETS, HLL_COMBINE, or pre-aggregate.

Dialect notes that matter at the whiteboard.

Postgres. COUNT(DISTINCT col) always uses Sort + Aggregate (not HashAggregate) — historical, will be fixed in newer versions. The workaround for hash performance is SELECT COUNT(*) FROM (SELECT DISTINCT col FROM t) s.
MySQL. COUNT(DISTINCT a, b) is fine; (NULL, x) and (NULL, y) are treated as distinct from each other only if the engine sees them as a tuple (configurable).
Snowflake. COUNT(DISTINCT col) is exact; APPROX_COUNT_DISTINCT(col) is the HLL variant. The exact form is the default — and the expensive one.
BigQuery. COUNT(DISTINCT col) over more than ~1B rows triggers a Resources Exceeded error unless the cluster is sized for it; APPROX_COUNT_DISTINCT works without limit.

Worked example — measure the cost of multiple COUNT(DISTINCT) in one query

Detailed explanation. A dashboard query asks: "for the events table, how many distinct users, distinct sessions, and distinct pages did we see yesterday?" Naïvely written, this is three COUNT(DISTINCT) calls and three hash tables — three full scans worth of memory.

Question. On a 1B-row daily events table, what does SELECT COUNT(DISTINCT user_id), COUNT(DISTINCT session_id), COUNT(DISTINCT page_url) FROM events WHERE event_date = '2026-06-01' do on Postgres, and how would you reduce the cost?

Input.

Column	Distinct values (1 day)	Hash bytes per entry
user_id	10,000,000	24 (BIGINT + hash + chain)
session_id	25,000,000	40 (UUID + hash + chain)
page_url	50,000	80 (TEXT avg 40 bytes + overhead)

Code (the naive query).

SELECT
    COUNT(DISTINCT user_id)    AS distinct_users,
    COUNT(DISTINCT session_id) AS distinct_sessions,
    COUNT(DISTINCT page_url)   AS distinct_pages
FROM events
WHERE event_date = '2026-06-01';

Step-by-step explanation.

Three hash tables in memory. ~240 MB (user) + ~1 GB (session) + ~4 MB (page) ≈ 1.24 GB.
Single seq-scan with three operators on top. Postgres reads the day's partition once and feeds three independent Aggregate nodes — each maintains its own hash table.
Spill behaviour. With work_mem=64MB, the user_id and session_id operators each spill to ~200 MB of temp files. Wall time grows by ~3×.
The fix. Either size work_mem to fit (SET work_mem = '2GB' for this session) or switch to APPROX_COUNT_DISTINCT (or its dialect equivalent) which uses ~1.5 KB per metric.

Output (illustrative, Postgres on a 64-core box).

Strategy	Memory	Wall time	Result accuracy
Three exact `COUNT(DISTINCT)`	1.24 GB	4 min	exact
Three `APPROX_COUNT_DISTINCT`	~4.5 KB	12 s	±1.6%
One `COUNT(DISTINCT)` + two `APPROX_COUNT_DISTINCT`	~240 MB + ~3 KB	45 s	exact / ±1.6% / ±1.6%

Rule of thumb. For dashboards, default to APPROX_COUNT_DISTINCT. Use exact COUNT(DISTINCT) only when (a) the cardinality is small (< 1M distinct values) or (b) a regulator or product requirement says "exact."

Worked example — COUNT(DISTINCT a, b) and dialect differences

Detailed explanation. The multi-column form COUNT(DISTINCT a, b) is the second-most-frequent interview shape after COUNT(DISTINCT col). It is widely supported but the dialect quirks bite: SQL Server does not have it, MySQL handles NULL differently from Postgres, and Spark's tuple-hash makes it cheap to compute.

Question. Given a page_visits(user_id, session_id, page_url) table, write a query that returns the number of distinct (user_id, page_url) pairs. Show three forms — Postgres / Snowflake, MySQL with caveat, and SQL Server with CONCAT fallback.

Input.

user_id	session_id	page_url
1	s1	/home
1	s1	/pricing
1	s2	/home
2	s3	/home
2	s3	NULL
2	s4	NULL

Code.

-- Postgres / Snowflake / BigQuery / Spark — direct tuple form
SELECT COUNT(DISTINCT user_id, page_url) AS distinct_pairs
FROM page_visits;
-- Result: 3 distinct non-NULL pairs (1, /home), (1, /pricing), (2, /home)
-- The (2, NULL) and (2, NULL) duplicate rows are skipped because page_url is NULL

-- MySQL — same syntax works, NULL handling is similar but verify on your version
SELECT COUNT(DISTINCT user_id, page_url) AS distinct_pairs
FROM page_visits;

-- SQL Server — must concatenate
SELECT COUNT(DISTINCT CONCAT(CAST(user_id AS VARCHAR(20)), '||', ISNULL(page_url, '<NULL>'))) AS distinct_pairs
FROM page_visits;

-- Portable form — GROUP BY then COUNT
SELECT COUNT(*) AS distinct_pairs
FROM (
    SELECT user_id, page_url
    FROM page_visits
    GROUP BY user_id, page_url
) g;

Step-by-step explanation.

Postgres / Snowflake / BigQuery / Spark — the engine hashes the tuple (user_id, page_url) directly. Rows where any column is NULL are skipped entirely from the count (this is the standard-SQL behaviour for COUNT(DISTINCT) over any column).
MySQL — same syntax, but historically had different NULL semantics; on 8.0+ it matches standard SQL.
SQL Server — no native multi-column form; the workaround concatenates with a separator that cannot occur in either column, and explicitly maps NULL to a sentinel string so distinct (2, NULL) rows are counted instead of dropped.
Portable GROUP BY form — SELECT COUNT(*) FROM (SELECT a, b FROM t GROUP BY a, b) is the universal cross-dialect answer. It also makes the NULL semantics explicit (you choose whether to filter them out in the inner query).

Output (under standard-SQL NULL semantics).

Pair	Counted?
(1, /home)	yes
(1, /pricing)	yes
(2, /home)	yes
(2, NULL)	no — NULL in the tuple

Result: distinct_pairs = 3.

Rule of thumb. If your query has to run on SQL Server and Postgres, write the portable COUNT(*) FROM (SELECT … GROUP BY) form. If you control the dialect, use COUNT(DISTINCT a, b) — it is shorter and the planner can sometimes optimise the tuple hash.

SQL interview question on speeding up COUNT(DISTINCT)

A senior probe: "Your daily 'distinct active users' dashboard query is now taking 12 minutes because the events table crossed 5B rows. What are the three knobs you would consider — in order — to fix it?"

Solution Using pre-aggregation + APPROX_COUNT_DISTINCT + work_mem

-- Step 1 — pre-aggregate at write time
CREATE MATERIALIZED VIEW daily_active_users AS
SELECT
    event_date,
    COUNT(DISTINCT user_id) AS exact_dau
FROM events
GROUP BY event_date;

REFRESH MATERIALIZED VIEW daily_active_users;   -- runs once per day, cost amortised

-- Step 2 — switch to APPROX for the dashboard (no materialized view needed)
SELECT
    event_date,
    APPROX_COUNT_DISTINCT(user_id) AS approx_dau
FROM events
WHERE event_date BETWEEN '2026-05-01' AND '2026-06-01'
GROUP BY event_date;

-- Step 3 — if exact is required, raise work_mem for this session only
SET work_mem = '4GB';

SELECT
    event_date,
    COUNT(DISTINCT user_id) AS exact_dau
FROM events
WHERE event_date BETWEEN '2026-05-01' AND '2026-06-01'
GROUP BY event_date;

Step-by-step trace.

Step	Action	Effect
1	Pre-aggregate via materialized view, refreshed daily	dashboard reads ~30 rows instead of 5B
2	Switch dashboard to `APPROX_COUNT_DISTINCT`	error budget ±1.6%, memory ~1.5 KB/day
3	If exact required, raise `work_mem` for the session	avoids spill, hash stays in RAM
4	Cache the materialized view in a fast tier (Postgres read replica or Snowflake result cache)	dashboard latency drops to single-digit ms

Output:

Strategy	Memory	Wall time	Accuracy
Original (raw `COUNT(DISTINCT)`)	~12 GB (spilled)	12 min	exact
`APPROX_COUNT_DISTINCT` on raw events	~50 KB	28 s	±1.6%
Materialized view (exact)	~30 rows scanned	80 ms	exact
Materialized view + `work_mem` for refresh	~12 GB during refresh only	12 min (1×/day)	exact, daily snapshot

Why this works — concept by concept:

Pre-aggregation amortises the cost — the expensive COUNT(DISTINCT) runs once per day in a controlled window, not on every dashboard load.
APPROX is the right default for dashboards — a ±1.6% error band on a metric like DAU is invisible to humans and well below the day-to-day variance the dashboard already shows.
work_mem as a session-level lever — SET work_mem = '4GB' only applies to the current session and lets a one-off exact run finish without changing global config.
Materialized view + read replica — separates the OLAP compute from the dashboard latency budget; the dashboard reads pre-computed numbers, not raw events.
Cost — original = O(N) scan + O(distinct) memory + spill; materialized = O(days) scan + tiny memory; APPROX = O(N) scan + O(1) memory.

SQL
Topic — aggregation
COUNT(DISTINCT) + dashboard aggregation problems (SQL)

Practice →

4. APPROX_COUNT_DISTINCT and HyperLogLog

`APPROX_COUNT_DISTINCT` trades `±1.6%` error for `O(log log n)` memory — the right default for dashboards

The mental model: HyperLogLog (HLL) is a probabilistic data structure that estimates the cardinality of a set in roughly 1.5 KB of memory, with ~1.6% standard error, regardless of whether the set has 10K or 10B distinct elements. Every major analytics warehouse exposes it under a slightly different name — APPROX_COUNT_DISTINCT in Snowflake and BigQuery, approx_count_distinct in Spark, the hll extension on Postgres, PFCOUNT on Redis — but the underlying algorithm is the same.

The HyperLogLog one-page summary.

Hash every input. Each value v is hashed to a uniform b-bit number, e.g. 64 bits.
Split into bucket + tail. The first p bits index one of m = 2^p registers (typically p=14, so m=16384). The remaining bits form the "tail."
Track the max leading-zero count per register. For each new value, count how many leading zeros the tail has (call it R); store max(register[bucket], R).
Estimate cardinality. estimate = αₘ × m² / Σ 2^(-Rⱼ) where αₘ is a known bias-correction constant. With p=14, the error is ~1.04 / √m ≈ 0.81% (HLL++ further improves small-cardinality accuracy).
Memory = m × 5 bits (per-register max leading zero counter) ≈ 16384 × 5 / 8 ≈ 10 KB. Many implementations compress further to ~1.5 KB for typical workloads.

Engines and their flavours.

Snowflake — APPROX_COUNT_DISTINCT(col); standard error 1.62%. Also HLL(col) (returns the HLL state) and HLL_COMBINE (merges states).
BigQuery — APPROX_COUNT_DISTINCT(col); backed by HLL++. Also HLL_COUNT.INIT, HLL_COUNT.MERGE, HLL_COUNT.EXTRACT for state-management workflows.
Spark — approx_count_distinct(col, rsd=0.05) where rsd is the target relative standard deviation (default 5%). Lower rsd → more memory.
Postgres — hll extension (citus, aiven, timescale). Use hll_cardinality(hll_add_agg(col)).
Redis — PFADD, PFCOUNT, PFMERGE — the canonical streaming HLL surface.

When the approximation is safe and when it is not.

Safe. Dashboards (MAU, DAU, ad-network reach), weekly reports, anomaly-detection thresholds, capacity planning. A 1.6% miss on "monthly active users" is invisible.
Unsafe. Billing and revenue reconciliation, regulatory reporting, deduplication of a list of records (HLL gives you a cardinality, not the list). A 1.6% miss on "users billed this month" is a lawsuit.
Hybrid. Keep exact COUNT(DISTINCT) in the monthly close report (regulatory) and HLL everywhere else (dashboards). The two never collide because the close report runs once a month with work_mem cranked up.

The mergeable property — why HLL is huge for ETL.

HLL states are mergeable — HLL(A) ∪ HLL(B) = HLL(A ∪ B). That means you can compute the daily HLL state per partition, store it, and merge weeks or months later without re-scanning the underlying events.
This makes HLL the right pattern for MAU / DAU rollups — store one HLL per (day, segment) and merge on demand for the dashboard query.

Worked example — APPROX_COUNT_DISTINCT on a 10B-row events table

Detailed explanation. A practical comparison: a 10B-row events table, the question "how many distinct users in the last 30 days?", with exact and approximate variants run on the same Snowflake warehouse.

Question. Given events(user_id, event_date, …) with 10B rows over 30 days and ~80M distinct users, compare COUNT(DISTINCT user_id) and APPROX_COUNT_DISTINCT(user_id) on Snowflake. Show the SQL, the trace, and the cost numbers a senior engineer would quote.

Input.

Setting	Value
Table	`events(user_id BIGINT, event_date DATE, …)`
Rows	10,000,000,000
Distinct user_id	~80,000,000
Warehouse	Snowflake LARGE (8 nodes)
Tables clustered on	`event_date`

Code.

-- Exact distinct count
SELECT COUNT(DISTINCT user_id) AS exact_users
FROM events
WHERE event_date >= CURRENT_DATE - INTERVAL '30 days';

-- Approximate distinct count
SELECT APPROX_COUNT_DISTINCT(user_id) AS approx_users
FROM events
WHERE event_date >= CURRENT_DATE - INTERVAL '30 days';

-- The mergeable pattern — store one HLL state per day, merge on demand
CREATE OR REPLACE TABLE daily_user_hll AS
SELECT
    event_date,
    HLL(user_id) AS user_hll_state    -- serialised HLL state, one row per day
FROM events
GROUP BY event_date;

-- Dashboard query reads the rollup, not the raw events
SELECT HLL_ESTIMATE(HLL_COMBINE(user_hll_state)) AS approx_30d_users
FROM daily_user_hll
WHERE event_date >= CURRENT_DATE - INTERVAL '30 days';

Step-by-step explanation.

Exact path. The 8-node warehouse scans ~10B rows, hashes each user_id into a global unique set of ~80M, then emits one number. Wall time ≈ 2 minutes; warehouse credits ≈ 0.27.
Approximate path. Same scan, but instead of a hash set it maintains 8 × 16384 HLL registers across nodes, then merges them at the coordinator. Wall time ≈ 8 seconds; credits ≈ 0.02.
Mergeable rollup path. Daily HLL states are pre-computed (one row per day, ~10 KB serialised); the dashboard query reads 30 rows and merges them. Wall time ≈ 200 ms; credits ≈ negligible.
Accuracy. APPROX returns ~80.1M (true is 80.0M) — 0.125% error, well within the ~1.6% envelope.

Output.

Strategy	Wall time	Warehouse credits	Result	Error
`COUNT(DISTINCT user_id)`	2 min	0.27	80,000,000 (exact)	0
`APPROX_COUNT_DISTINCT`	8 s	0.02	80,135,000	0.17%
HLL rollup + merge	200 ms	< 0.001	80,142,000	0.18%

Rule of thumb. For any "distinct over a date range" dashboard, start with APPROX_COUNT_DISTINCT. Materialise one HLL state per partition (day, region, segment) if the dashboard latency budget is sub-second.

Worked example — streaming distinct counts with Redis PFADD

Detailed explanation. Outside the warehouse, HLL also powers streaming distinct counts in OLTP-adjacent stores. Redis exposes the HyperLogLog primitive as PFADD / PFCOUNT / PFMERGE — perfect for "how many distinct visitors are on the site right now" without keeping a row per visitor in a Redis Set.

Question. Build a real-time "unique visitors today" counter for a website that receives ~1B events per day. Compare a Redis Set vs a Redis HyperLogLog (PFADD).

Input.

Setting	Value
Events / day	~1,000,000,000
Distinct visitors / day	~10,000,000
Average user_id size	36 bytes (UUID string)

Code.

# Variant A — Redis Set
import redis
r = redis.Redis()

def track_visitor_set(user_id: str):
    r.sadd("visitors:2026-06-01", user_id)

def count_visitors_set():
    return r.scard("visitors:2026-06-01")

# Variant B — Redis HyperLogLog
def track_visitor_hll(user_id: str):
    r.pfadd("visitors_hll:2026-06-01", user_id)

def count_visitors_hll():
    return r.pfcount("visitors_hll:2026-06-01")

# Merge multiple days for "last 7 days unique"
def merge_week():
    keys = [f"visitors_hll:2026-05-{d:02d}" for d in range(26, 32)] + ["visitors_hll:2026-06-01"]
    r.pfmerge("visitors_week", *keys)
    return r.pfcount("visitors_week")

Step-by-step explanation.

Redis Set memory. A SET keyed by user_id stores every distinct value — 10M UUIDs × ~80 bytes each (Redis overhead) ≈ 800 MB / day.
Redis HLL memory. A single PFADD key stays at ~12 KB regardless of cardinality (Redis HLL is fixed-size). Memory drops by ~65,000×.
Merge for ranges. PFMERGE combines per-day HLLs into a single weekly state in one constant-time operation per source key. The Set alternative would require iterating every member of every day — impossibly slow.
Error envelope. Redis HLL claims 0.81% standard error; observed in production at ~1.6% peak. Acceptable for any visitor-counter dashboard.

Output.

Metric	Redis Set	Redis HLL
Memory per day	~800 MB	~12 KB
`track_visitor` latency	~50 µs (SADD)	~30 µs (PFADD)
`count_visitors` latency	~10 µs (SCARD)	~20 µs (PFCOUNT)
7-day merge	impractical	one `PFMERGE`, ~80 µs
Result accuracy	exact	±1.6%

Rule of thumb. Unless you need an exact list of the visitors (not just the count) or unless you need to test membership (SISMEMBER), reach for HLL. The memory savings are not 2× — they are 4–5 orders of magnitude.

Worked example — trace the HLL update for a single insert

Detailed explanation. To defend the algorithm at the whiteboard you need to walk through one update step by hand. The interview probe usually sounds like "what does HLL actually do when I PFADD user_42?" — and the answer is a four-step pipeline you can sketch in 30 seconds.

Question. With m=8 registers (for whiteboard simplicity; production HLL uses m=16384), walk through what happens when a new value v = "user_42" is added to the HLL. Show the hash, the bucket selection, the leading-zero count, and the register update.

Input.

Setting	Value
Register count `m`	8 (so `p = log2(m) = 3`)
Hash function	32-bit `xxhash`
Input value `v`	`"user_42"`
Initial register array	`[0, 1, 0, 2, 0, 0, 1, 0]`
Suppose `hash(v)`	`0x00031a4f` = `0000 0000 0000 0011 0001 1010 0100 1111`

Code (pseudocode for the update step).

def pfadd(registers, value, p=3, b=32):
    h = xxhash(value)                          # 32-bit hash
    bucket = h >> (b - p)                      # top `p` bits select the bucket
    tail   = (h << p) & ((1 << b) - 1)         # remaining bits form the tail
    # Count leading zeros in `tail`, then add 1 to land on the position
    R = leading_zeros(tail, width=b - p) + 1
    if R > registers[bucket]:
        registers[bucket] = R
    return registers

# Trace
registers = [0, 1, 0, 2, 0, 0, 1, 0]
pfadd(registers, "user_42", p=3, b=32)

Step-by-step explanation.

Hash. hash("user_42") = 0x00031a4f. Binary: 0000 0000 0000 0011 0001 1010 0100 1111.
Bucket. Top p=3 bits = 000 → bucket index 0.
Tail. Remaining b - p = 29 bits = 0 0000 0000 0011 0001 1010 0100 1111. Leading zeros in this 29-bit tail = 13.
R (rank). R = 13 + 1 = 14 — by convention HLL stores 1 + leading_zeros so the minimum value is 1.
Register update. registers[0] = max(0, 14) = 14. The array becomes [14, 1, 0, 2, 0, 0, 1, 0].
Cardinality estimate. Z = Σ 2^(-Rⱼ) = 2^-14 + 2^-1 + 2^-0 + 2^-2 + 2^-0 + 2^-0 + 2^-1 + 2^-0 (using the new values). For m=8, αₘ ≈ 0.673. estimate = αₘ × m² / Z.

Output.

Register	Before	After
0	0	14
1	1	1
2	0	0
3	2	2
4	0	0
5	0	0
6	1	1
7	0	0

Rule of thumb. Each register only ever grows — max(old, new). That is what makes HLL mergeable (the union of two HLLs is the element-wise max of their register arrays) and what gives it constant memory regardless of input size.

Worked example — choose the right approximation across engines

Detailed explanation. Every major warehouse has its own knobs for the precision-vs-memory trade-off. Knowing the right call across Snowflake, BigQuery, and Spark is the kind of dialect-fluency interview round senior interviewers love to probe.

Question. A team wants to compute unique-visitor counts per region across the last year, with a target relative standard deviation of 1%. Compare the API and the memory cost on Snowflake, BigQuery, and Spark.

Input.

Setting	Value
Rows scanned	~50B (1 year of events)
Distinct visitors / region	~5M
Regions	12
Target rsd	1%

Code.

-- Snowflake — APPROX_COUNT_DISTINCT (precision is fixed at p=14, ~1.6% standard error)
SELECT region, APPROX_COUNT_DISTINCT(user_id) AS approx_users
FROM events
WHERE event_date >= '2025-06-01'
GROUP BY region;

-- BigQuery — HLL_COUNT.MERGE for finer control
SELECT region,
       HLL_COUNT.MERGE(user_id_hll) AS approx_users
FROM (
    SELECT region, HLL_COUNT.INIT(user_id, 15) AS user_id_hll  -- p=15 → ~1.0% std err
    FROM events
    WHERE event_date >= '2025-06-01'
    GROUP BY region
);

-- Spark — approx_count_distinct(col, rsd)
SELECT region, approx_count_distinct(user_id, 0.01) AS approx_users
FROM events
WHERE event_date >= '2025-06-01'
GROUP BY region;

Step-by-step explanation.

Snowflake — APPROX_COUNT_DISTINCT uses HLL with fixed p=14 (~16,384 registers, ~1.6% std err). To get a tighter envelope, use HLL + HLL_COMBINE and reason about per-day rollups instead.
BigQuery — HLL_COUNT.INIT(col, precision) lets you set precision from 10 to 24. p=15 gives ~1.0% std err with ~32KB per HLL state. HLL_COUNT.MERGE combines.
Spark — approx_count_distinct(col, rsd) takes a target relative standard deviation directly. rsd=0.01 chooses p internally to hit 1% accuracy at the cost of more memory per estimator.
Memory comparison (per-region HLL state).

Output.

Engine	API	Tuning knob	Memory per region	Std err
Snowflake	`APPROX_COUNT_DISTINCT`	none (fixed p=14)	~12 KB	~1.6%
BigQuery	`HLL_COUNT.INIT(col, p=15)`	`p` 10–24	~32 KB	~1.0%
Spark	`approx_count_distinct(col, 0.01)`	`rsd` 0.001–0.5	~32 KB	~1.0%

Rule of thumb. When the spec says "I need ~1% accuracy," BigQuery's HLL_COUNT.INIT(col, 15) and Spark's approx_count_distinct(col, 0.01) give you the tunable knob. Snowflake's APPROX_COUNT_DISTINCT is fixed at ~1.6% — for tighter envelopes, store an HLL state per partition and combine.

SQL interview question on choosing between exact and approximate distinct

A common probe: "Your finance team wants the monthly active users dashboard to load in under 500ms. The events table is 10B rows. Walk me through your reasoning."

Solution Using mergeable HLL rollups

-- Step 1 — define a daily rollup of HLL states
CREATE OR REPLACE TABLE daily_user_hll (
    event_date     DATE       NOT NULL,
    user_hll_state BINARY     NOT NULL,
    PRIMARY KEY (event_date)
);

-- Step 2 — populate it once per day (e.g. via Snowflake task or Airflow DAG)
INSERT INTO daily_user_hll
SELECT
    event_date,
    HLL(user_id)
FROM events
WHERE event_date = CURRENT_DATE - 1
GROUP BY event_date;

-- Step 3 — dashboard query merges the per-day HLLs
SELECT HLL_ESTIMATE(HLL_COMBINE(user_hll_state)) AS mau_estimate
FROM daily_user_hll
WHERE event_date >= CURRENT_DATE - INTERVAL '30 days';

Step-by-step trace.

Step	Operation	Memory	I/O	Wall time
1	Daily ETL builds one HLL state per `(event_date)`	~1.5 KB per row	1 day of events	~30 s/day
2	Dashboard reads 30 rows from `daily_user_hll`	~45 KB total	30 rows	~10 ms
3	`HLL_COMBINE` merges 30 HLL states into one	~1.5 KB	none	~5 ms
4	`HLL_ESTIMATE` derives the final cardinality	~constant	none	~1 ms
5	Dashboard renders the number	n/a	n/a	total ~30 ms

Output:

Metric	Raw `COUNT(DISTINCT)`	HLL rollup pattern
Dashboard latency	2 min	~30 ms
Warehouse credits per dashboard load	0.27	< 0.001
Result	80,000,000 (exact)	80,142,000 (±1.6%)
Daily ETL cost	none	~30 s × 1/day
Memory per query	~12 GB	~45 KB

Why this works — concept by concept:

HyperLogLog as a sub-linear cardinality estimator — O(log log n) memory, ~1.6% standard error, regardless of input cardinality. This is the algorithmic core that makes the rollup possible.
Mergeability — HLL(A ∪ B) = HLL_COMBINE(HLL(A), HLL(B)). The dashboard query never re-scans the raw events; it merges 30 pre-computed states.
Amortised cost — the expensive scan happens once per day in a controlled window; the dashboard cost is essentially zero.
Sub-second dashboard latency — reading 30 rows + merging tiny HLL states fits comfortably in any web request budget.
Cost — exact = O(N) scan + O(distinct) memory + O(N) time per load; HLL rollup = O(N) scan once per day + O(days) time per dashboard load + O(1) memory.

SQL
Topic — aggregation
Approximate aggregation + dashboard rollup problems (SQL)

Practice →

5. Deduplication patterns — ROW_NUMBER vs DISTINCT vs GROUP BY

`sql deduplicate` is a three-pattern matrix — ROW_NUMBER (general), DISTINCT ON (Postgres-only), GROUP BY + ARG_MAX (cross-dialect with helpers)

The mental model: for the question "give me one row per business key, picking a specific one when there are duplicates," the right tool depends on the dialect and on what column decides the tie. DISTINCT * is the wrong answer in production — it preserves rows that differ only by audit columns. The three patterns below cover every interview shape from "latest order per customer" to "first session per user."

The three patterns at a glance.

Pattern 1 — ROW_NUMBER() over a partition. The universal answer. Works in every dialect. Pick the row where rn = 1 after partitioning by the business key and ordering by the tie-breaker.
Pattern 2 — DISTINCT ON (key). Postgres-only, concise. Returns the first row per group given the trailing ORDER BY.
Pattern 3 — GROUP BY key + ARG_MAX / FIRST_VALUE. The warehouse-native shape — Snowflake has MAX_BY / MIN_BY, BigQuery has ARRAY_AGG ... LIMIT 1, Spark has first(...) after an ORDER BY. Concise once you know the engine's helper.

Decision tree for the interview.

"Cross-dialect, no guarantees on engine?" — ROW_NUMBER(). It is in standard SQL since 2003.
"Postgres, and I want concise code?" — DISTINCT ON. One of Postgres's best features.
"Snowflake / BigQuery / Spark warehouse query?" — MAX_BY / ARG_MAX / ARRAY_AGG LIMIT 1. The engine's helper is the shortest and the planner gives it the best plan.
"I just need the latest ts per (key), no other columns?" — MAX(ts) GROUP BY key. Don't reach for ROW_NUMBER() when a plain aggregate suffices.

The classic interview shape — "latest order per customer."

Input: orders(order_id, customer_id, order_ts, status, …).
Output: one row per customer_id, the row with the largest order_ts.
Three valid answers:
- ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_ts DESC) = 1
- DISTINCT ON (customer_id) * ORDER BY customer_id, order_ts DESC
- JOIN orders USING (customer_id, order_ts) against a GROUP BY customer_id SELECT MAX(order_ts) derived table — or MAX_BY(*, order_ts) if the engine has it.

Why DISTINCT * is the wrong answer for this shape.

DISTINCT * deduplicates rows that are byte-for-byte identical. If a row is updated with a new updated_at watermark, DISTINCT * returns both versions — defeating the purpose.
The right answer keys explicitly on the business key (customer_id, order_id, email_lower), not on every column.

Worked example — pick the latest order per customer with ROW_NUMBER

Detailed explanation. The senior pattern is ROW_NUMBER() with a deterministic tie-breaker — usually order_ts DESC, order_id DESC so ties on order_ts resolve to the higher order_id. The CTE shape makes the intent obvious and is easy to refactor when the spec changes.

Question. Given an orders table with multiple rows per customer, return the most recent order per customer, breaking ties by order_id DESC.

Input.

order_id	customer_id	order_ts	status
101	42	2026-05-01 09:00	placed
102	42	2026-05-02 14:00	placed
103	42	2026-05-02 14:00	paid
104	99	2026-05-03 10:00	paid
105	99	2026-05-04 11:00	shipped

Code.

WITH ranked AS (
    SELECT
        o.*,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY order_ts DESC, order_id DESC
        ) AS rn
    FROM orders o
)
SELECT order_id, customer_id, order_ts, status
FROM ranked
WHERE rn = 1;

Step-by-step explanation.

The window function sorts each customer_id partition by (order_ts DESC, order_id DESC).
For customer_id=42, the rows are ranked 103 (rn=1), 102 (rn=2), 101 (rn=3) — 103 wins because its order_id is higher and the order_ts tied with 102.
For customer_id=99, the rows are ranked 105 (rn=1), 104 (rn=2) — 105 wins on order_ts.
The outer filter keeps only rn = 1.

Output.

order_id	customer_id	order_ts	status
103	42	2026-05-02 14:00	paid
105	99	2026-05-04 11:00	shipped

Rule of thumb. Always include a deterministic tie-breaker in the ORDER BY (order_id DESC here). Without it, the engine is allowed to return either row when order_ts ties — and the choice can flip between query runs.

Worked example — the same query in Postgres with DISTINCT ON

Detailed explanation. Postgres's DISTINCT ON (key) ... ORDER BY key, tie_breaker is a near-perfect dual to the ROW_NUMBER() shape — shorter, same plan, Postgres-only.

Question. Re-write the "latest order per customer" query using DISTINCT ON (Postgres).

Input. Same as above.

Code.

SELECT DISTINCT ON (customer_id)
    order_id, customer_id, order_ts, status
FROM orders
ORDER BY customer_id, order_ts DESC, order_id DESC;

Step-by-step explanation.

Postgres sorts the input by (customer_id, order_ts DESC, order_id DESC).
DISTINCT ON (customer_id) keeps the first row per customer_id after that sort.
Because order_ts DESC, order_id DESC is the second/third sort key, the kept row is the most recent.
The plan is Unique → Sort → SeqScan — equivalent to the ROW_NUMBER() = 1 plan on the same Postgres version.

Output. Identical to the ROW_NUMBER() shape.

Rule of thumb. When you are sure the engine is Postgres, DISTINCT ON is the cleanest expression of "one row per key." Outside Postgres, fall back to ROW_NUMBER().

Worked example — the same query in Snowflake with MAX_BY

Detailed explanation. Snowflake's MAX_BY(value_col, sort_col) returns the value of value_col from the row that has the maximum sort_col. Combined with GROUP BY customer_id, this is the shortest cross-dialect-ish answer.

Question. Re-write the "latest order per customer" query in Snowflake using MAX_BY.

Input. Same as above.

Code.

SELECT
    customer_id,
    MAX_BY(order_id, (order_ts, order_id))  AS latest_order_id,
    MAX(order_ts)                            AS latest_order_ts,
    MAX_BY(status,   (order_ts, order_id))  AS latest_status
FROM orders
GROUP BY customer_id;

Step-by-step explanation.

MAX_BY(order_id, (order_ts, order_id)) returns the order_id from the row whose (order_ts, order_id) is largest — that is, the latest order, with order_id DESC as the tie-breaker.
The same pattern repeats for status (and for any other column the spec needs).
The GROUP BY customer_id aggregation runs in one pass — O(N) time, O(distinct customers) memory.
Equivalent in BigQuery: ARRAY_AGG(STRUCT(order_id, status) ORDER BY order_ts DESC, order_id DESC LIMIT 1)[OFFSET(0)].

Output. Identical to the ROW_NUMBER() shape.

Rule of thumb. Inside a Snowflake / BigQuery query, the engine's MAX_BY / ARG_MAX is often the shortest and the most efficient — no window, no extra sort.

Worked example — keep all ties with RANK() vs DENSE_RANK() vs ROW_NUMBER()

Detailed explanation. A nuance interviewers love: ROW_NUMBER() always assigns distinct ranks, but RANK() and DENSE_RANK() produce ties when the ORDER BY keys tie. The choice depends on whether you want one row per business key or all tied rows.

Question. Given an events(user_id, score, ts) table, return:

The single top-scoring event per user_id (any one if there's a tie).
All top-scoring events per user_id (including ties).

Compare ROW_NUMBER(), RANK(), and DENSE_RANK().

Input.

user_id	score	ts
1	90	09:00
1	90	09:05
1	85	09:10
2	75	09:02
2	75	09:08

Code.

-- A — ROW_NUMBER: one row per user_id, deterministic by adding ts as tie-breaker
SELECT user_id, score, ts
FROM (
    SELECT
        e.*,
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY score DESC, ts ASC) AS rn
    FROM events e
) ranked
WHERE rn = 1;

-- B — RANK: keep all rows where the score is the user's top score (handles ties)
SELECT user_id, score, ts
FROM (
    SELECT
        e.*,
        RANK() OVER (PARTITION BY user_id ORDER BY score DESC) AS rk
    FROM events e
) ranked
WHERE rk = 1;

Step-by-step explanation.

ROW_NUMBER() assigns 1, 2, 3, … even when scores tie — the tie-break is the secondary ORDER BY (here ts ASC).
RANK() assigns 1, 1, 3, … — both score=90 rows for user 1 get rank 1; the score=85 row jumps to rank 3 (skipping 2).
DENSE_RANK() assigns 1, 1, 2, … — same as RANK but without the gap.
Filter rk = 1 to keep all rows tied at the top score; filter rn = 1 to keep exactly one.

Output A (ROW_NUMBER).

user_id	score	ts
1	90	09:00
2	75	09:02

Output B (RANK).

user_id	score	ts
1	90	09:00
1	90	09:05
2	75	09:02
2	75	09:08

Rule of thumb. Use ROW_NUMBER() when the spec says "the latest" / "the top" (one row). Use RANK() or DENSE_RANK() when the spec says "all rows that are tied at the top." If you reach for ROW_NUMBER() to dedup and the spec changes to "keep ties," you'll need to swap to RANK() — flag this in the comment.

SQL interview question on cross-dialect deduplication

A senior probe: "I have an orders table on Postgres in dev and Snowflake in prod. Write the 'latest order per customer' query so the same SQL runs on both dialects."

Solution Using ROW_NUMBER() — the universal pattern

WITH ranked AS (
    SELECT
        o.*,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY order_ts DESC, order_id DESC
        ) AS rn
    FROM orders o
)
SELECT order_id, customer_id, order_ts, status
FROM ranked
WHERE rn = 1;

Step-by-step trace.

Row	customer_id	order_ts	order_id	partition_rn	kept?
1	42	2026-05-01 09:00	101	3	no
2	42	2026-05-02 14:00	102	2	no
3	42	2026-05-02 14:00	103	1	yes
4	99	2026-05-03 10:00	104	2	no
5	99	2026-05-04 11:00	105	1	yes

Output:

order_id	customer_id	order_ts	status
103	42	2026-05-02 14:00	paid
105	99	2026-05-04 11:00	shipped

Why this works — concept by concept:

ROW_NUMBER() as the universal dedup primitive — every major dialect (Postgres, MySQL 8+, Snowflake, BigQuery, Spark, Oracle, SQL Server) supports the same standard-SQL form. One query, every engine.
Deterministic tie-breaker — including order_id DESC in the ORDER BY removes non-determinism when order_ts ties. Without it, the engine is allowed to pick either row.
Cross-dialect filter — WHERE rn = 1 is a plain comparison, not engine-specific syntax. It survives a port from Postgres to Snowflake unchanged.
Plan parity — Postgres's ROW_NUMBER() and Snowflake's window operator pick effectively the same plan: a sort followed by a Unique-style filter. Memory is O(partition size).
Cost — O(N log N) for the sort, O(N) for the window numbering, O(N) for the filter. The same O(N log N) cost the DISTINCT ON and MAX_BY shapes pay too.

SQL
Topic — window-functions
Window-function deduplication problems (ROW_NUMBER, RANK)

Practice →

SQL interview question on dedup of CDC change events

A senior probe heavy on data-engineering nuance: "You have a cdc_events(pk, op, after_image, lsn) table where op ∈ {INSERT, UPDATE, DELETE} and lsn is the log sequence number. Each pk may have many rows; you want the current state of each pk for a downstream warehouse load. How do you deduplicate?"

Solution Using ROW_NUMBER() with LSN tie-breaker and DELETE filter

WITH latest AS (
    SELECT
        c.*,
        ROW_NUMBER() OVER (
            PARTITION BY pk
            ORDER BY lsn DESC
        ) AS rn
    FROM cdc_events c
)
SELECT pk, after_image
FROM latest
WHERE rn = 1
  AND op <> 'DELETE';   -- skip tombstoned rows

Step-by-step trace.

pk	op	lsn	rn	kept?
42	INSERT	100	3	no
42	UPDATE	110	2	no
42	UPDATE	130	1	yes — latest non-DELETE
99	INSERT	120	2	no
99	DELETE	140	1	dropped — tombstoned
7	INSERT	105	1	yes

Output:

pk	after_image
42	(latest UPDATE payload at lsn=130)
7	(INSERT payload at lsn=105)

Why this works — concept by concept:

LSN as the canonical tie-breaker — lsn is monotonically increasing per replication slot, so it is a perfect deterministic order for "latest change wins."
rn = 1 filter — for each pk, the row with the highest lsn survives; older operations are dropped.
DELETE handling — without the op <> 'DELETE' filter, tombstoned rows would re-appear as the current state. Filtering them after the ranking is what turns the change log into a current-state materialisation.
Idempotent re-runs — the query depends only on (pk, lsn, op, after_image) — replaying the same CDC stream produces the same output. Safe to re-run after any failure.
Cost — O(N log N) for the sort inside the window; O(distinct pks) memory for the window state in the executor. Comparable to a DISTINCT ON (pk) ORDER BY pk, lsn DESC shape on Postgres.

SQL
Topic — distinct
CDC-style change-log dedup problems (SQL)

Practice →

Cheat sheet — DISTINCT recipes

Unique values of one column. SELECT DISTINCT col FROM t — interchangeable with SELECT col FROM t GROUP BY col. Same plan in Postgres.
Unique tuples of N columns. SELECT a, b, c FROM t GROUP BY a, b, c — prefer GROUP BY so an aggregate can be folded in without rewriting.
Exact unique count. SELECT COUNT(DISTINCT col) FROM t — pays O(distinct cardinality) memory; tune work_mem or pre-aggregate.
Approximate unique count (single query). SELECT APPROX_COUNT_DISTINCT(col) FROM t — Snowflake / BigQuery / Spark; ~1.6% error, ~1.5 KB memory.
Approximate unique count (mergeable rollup). HLL(col) per partition + HLL_COMBINE at read time — sub-second dashboard latency on 10B-row tables.
Latest row per business key, any dialect. ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts DESC, tie DESC) = 1. Always include a deterministic tie-breaker.
Latest row per business key, Postgres only. SELECT DISTINCT ON (id) * FROM t ORDER BY id, ts DESC, tie DESC. Shortest expression of "one row per key."
Latest row per business key, warehouse-native. MAX_BY(col, (ts, tie)) (Snowflake), ARRAY_AGG(STRUCT(col) ORDER BY ts DESC, tie DESC LIMIT 1)[OFFSET(0)] (BigQuery).
Avoid DISTINCT * for dedup. It preserves rows that differ only by an audit column. Always key on the business key explicitly.
Multiple COUNT(DISTINCT) in one query. Either three APPROX_COUNT_DISTINCT calls (cheap, ±1.6%) or GROUPING SETS / pre-aggregated rollup (exact, cheap-amortised).

Frequently asked questions

Is SELECT DISTINCT faster or slower than GROUP BY in SQL?

For the common case — "give me the unique values of one or more columns" — SELECT DISTINCT a, b and SELECT a, b FROM t GROUP BY a, b produce the same plan in every modern engine (Postgres, MySQL 8+, Snowflake, BigQuery), so they run at the same speed. GROUP BY pulls ahead the moment an aggregate enters the picture, because DISTINCT cannot aggregate inline — the equivalent DISTINCT shape needs a self-join to recompute the aggregate. The rule of thumb is: use DISTINCT for read-only uniqueness, use GROUP BY everywhere else, and always read EXPLAIN if you're unsure.

Why is COUNT(DISTINCT col) so slow on big tables in SQL?

COUNT(DISTINCT col) must materialise the entire unique set of col values in memory, which costs O(distinct cardinality) bytes. On a 1B-row table with 100M distinct values that's 2–3 GB of hash table — well above the default work_mem in Postgres or the per-node spill threshold in Snowflake, so the operator spills to disk and the wall time triples. An index on col does not help, because the engine still has to touch every row to know whether each value is "new" or "already seen." The fix is to raise work_mem for the session, pre-aggregate the unique set in a daily rollup, or switch to APPROX_COUNT_DISTINCT (HyperLogLog).

How accurate is HyperLogLog and APPROX_COUNT_DISTINCT in practice?

HyperLogLog (HLL) and its successor HLL++ deliver a standard error of ~1.04 / √m where m is the number of registers — m=16384 (default in most engines) gives ~0.81% standard error and a ~1.6% 95th-percentile error. On real datasets the error is usually below 1% as long as the cardinality is above a few thousand. Snowflake, BigQuery, and Spark all use HLL++ under the hood for APPROX_COUNT_DISTINCT, so they share these accuracy guarantees. The trade-off is unbeatable: ~1.5 KB of memory for a cardinality estimate that holds for sets ranging from 10K to 10B distinct values.

Can I do COUNT(DISTINCT col1, col2) across multiple columns in SQL?

Yes in Postgres, Snowflake, BigQuery, and Spark — COUNT(DISTINCT col1, col2) treats the pair as a tuple and counts the distinct tuples. In MySQL it works too but NULL handling differs from the tuple convention; double-check on your version. In SQL Server you must concatenate manually: COUNT(DISTINCT CONCAT(col1, '|', col2)) with a separator chosen so it cannot occur inside either column. The cost is the same as COUNT(DISTINCT) of one column with cardinality distinct(col1) × co-occurrence factor.

Does an index on a column help SELECT DISTINCT col FROM t?

Sometimes. A B-tree index on col lets the engine perform an index-only scan in sorted order — Postgres can then walk the index leaf pages and emit unique values via a Unique operator without a separate sort, which is much faster than the table-scan + hash path on wide tables. The catch is that this only works when the index covers the query (no other columns needed) and when the visibility map is up to date (Postgres-specific — VACUUM recently). For SELECT DISTINCT col1, col2, an index on (col1, col2) can give the same benefit. For COUNT(DISTINCT col), the index helps the scan but not the hash, because the engine still must visit every leaf entry.

What is DISTINCT ON in PostgreSQL and when should I use it?

SELECT DISTINCT ON (key) col1, col2, … FROM t ORDER BY key, tie_breaker DESC is a Postgres-only extension that keeps the first row per group given the trailing ORDER BY. It is the cleanest expression of "one row per key" in Postgres — equivalent to ROW_NUMBER() OVER (PARTITION BY key ORDER BY tie_breaker DESC) = 1 but with one less CTE and one less filter. Use it for "latest order per customer," "first session per user," "most recent snapshot per asset," and similar patterns when you know the engine is Postgres. Outside Postgres, fall back to ROW_NUMBER() — the standard-SQL universal pattern.

How do I handle NULLs in SELECT DISTINCT and COUNT(DISTINCT)?

SELECT DISTINCT col treats NULL as a single distinct value — all NULL rows collapse into one row in the output. COUNT(DISTINCT col), in standard SQL, excludes NULL from the count entirely — only non-NULL distinct values are counted. For multi-column forms, COUNT(DISTINCT a, b) skips any tuple where any column is NULL. If you need NULL to count as a distinct value, project it to a sentinel: COUNT(DISTINCT COALESCE(CAST(col AS TEXT), '<NULL>')) works in any dialect.

Should I use SELECT DISTINCT to deduplicate after a JOIN that explodes rows?

Almost never. A SELECT DISTINCT * after a JOIN papers over a join-cardinality bug — usually a many-to-many join that should have been many-to-one. Instead, find the join key that exploded and rewrite the join with a GROUP BY or a CTE that pre-aggregates one side, so the join is one-to-many before it enters the outer query. The DISTINCT "fix" hides the bug, doubles the memory cost, and often masks a row-count mismatch that breaks downstream metrics later.

Practice on PipeCode

Drill the DISTINCT + deduplication practice library → for hands-on SELECT DISTINCT, COUNT(DISTINCT), and DISTINCT ON reps.
Rehearse aggregation problems → for the cost discussion around COUNT(DISTINCT) and GROUP BY.
Sharpen SQL aggregation drills → for warehouse-native helpers like MAX_BY, ARG_MAX, and ARRAY_AGG.
Work window-function dedup problems → for the universal ROW_NUMBER() pattern.
For the broader interview surface, read top data engineering interview questions →.
Pair this with the only 5 skills you need to become a data engineer → for the breadth-first prep map.
Take the structured deep dive in SQL for DE interviews: from zero to FAANG → — the full course covers every primitive on this page with graded problems.
Reinforce the modelling side with data modelling for DE interviews → so the next dedup question lands on a well-keyed table.

Pipecode.ai is Leetcode for Data Engineering — every DISTINCT and COUNT(DISTINCT) primitive above ships with hands-on practice rooms where you write the queries, read the real EXPLAIN output, and watch the work_mem spill happen on a real engine. Start with the distinct + dedup library and work outward; PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine.

Practice DISTINCT now →
Aggregation drills →

1. Why DISTINCT is more expensive than people think

sql distinct is never "free" — every DISTINCT triggers a sort or a hash aggregation under the hood

Worked example — measure the cost of DISTINCT on a 100M-row events table

Worked example — why DISTINCT * preserves duplicates that look identical

SQL interview question on the cost of DISTINCT

Solution Using EXPLAIN + work_mem reasoning

2. SELECT DISTINCT vs GROUP BY — when same, when not

select distinct sql and GROUP BY are two syntaxes for the same plan — until an aggregate or an ORDER BY enters the picture

Worked example — DISTINCT a, b vs GROUP BY a, b

Worked example — DISTINCT with ORDER BY that breaks on Postgres

SQL interview question on DISTINCT vs GROUP BY for aggregation

Solution Using GROUP BY with COUNT(*) in one pass

3. COUNT(DISTINCT col) anatomy and performance

sql count distinct is exact, expensive, and the most common performance footgun in analytics queries

Worked example — measure the cost of multiple COUNT(DISTINCT) in one query

Worked example — COUNT(DISTINCT a, b) and dialect differences

SQL interview question on speeding up COUNT(DISTINCT)

Solution Using pre-aggregation + APPROX_COUNT_DISTINCT + work_mem

4. APPROX_COUNT_DISTINCT and HyperLogLog

APPROX_COUNT_DISTINCT trades ±1.6% error for O(log log n) memory — the right default for dashboards

Worked example — APPROX_COUNT_DISTINCT on a 10B-row events table

Worked example — streaming distinct counts with Redis PFADD

Worked example — trace the HLL update for a single insert

Worked example — choose the right approximation across engines

SQL interview question on choosing between exact and approximate distinct

Solution Using mergeable HLL rollups

5. Deduplication patterns — ROW_NUMBER vs DISTINCT vs GROUP BY

sql deduplicate is a three-pattern matrix — ROW_NUMBER (general), DISTINCT ON (Postgres-only), GROUP BY + ARG_MAX (cross-dialect with helpers)

Worked example — pick the latest order per customer with ROW_NUMBER

Worked example — the same query in Postgres with DISTINCT ON

Worked example — the same query in Snowflake with MAX_BY

Worked example — keep all ties with RANK() vs DENSE_RANK() vs ROW_NUMBER()

SQL interview question on cross-dialect deduplication

Solution Using ROW_NUMBER() — the universal pattern

SQL interview question on dedup of CDC change events

Solution Using ROW_NUMBER() with LSN tie-breaker and DELETE filter

Cheat sheet — DISTINCT recipes

Frequently asked questions

Is SELECT DISTINCT faster or slower than GROUP BY in SQL?

Why is COUNT(DISTINCT col) so slow on big tables in SQL?

How accurate is HyperLogLog and APPROX_COUNT_DISTINCT in practice?

Can I do COUNT(DISTINCT col1, col2) across multiple columns in SQL?

Does an index on a column help SELECT DISTINCT col FROM t?

What is DISTINCT ON in PostgreSQL and when should I use it?

How do I handle NULLs in SELECT DISTINCT and COUNT(DISTINCT)?

Should I use SELECT DISTINCT to deduplicate after a JOIN that explodes rows?

Practice on PipeCode

`sql distinct` is never "free" — every DISTINCT triggers a sort or a hash aggregation under the hood

`select distinct sql` and `GROUP BY` are two syntaxes for the same plan — until an aggregate or an `ORDER BY` enters the picture

`sql count distinct` is exact, expensive, and the most common performance footgun in analytics queries

`APPROX_COUNT_DISTINCT` trades `±1.6%` error for `O(log log n)` memory — the right default for dashboards

`sql deduplicate` is a three-pattern matrix — ROW_NUMBER (general), DISTINCT ON (Postgres-only), GROUP BY + ARG_MAX (cross-dialect with helpers)