sql distinct looks like the simplest operator in the language — a one-word switch that strips duplicates from a result set — and that is exactly why it shows up in dozens of production incidents and interview rounds every year. The catch is that sql distinct is not free. Behind every SELECT DISTINCT is a sort or a hash aggregation; behind every sql count distinct is a unique-set whose memory cost is O(distinct cardinality); behind every dedup pattern is a choice between DISTINCT, GROUP BY, ROW_NUMBER(), and DISTINCT ON that decides whether the query finishes in seconds or grinds the warehouse for an hour.
This guide walks through the five things every senior SQL round probes about select distinct sql and count distinct sql — when DISTINCT and GROUP BY produce the same plan and when they do not, why COUNT(DISTINCT col) blows past work_mem on billion-row tables, how APPROX_COUNT_DISTINCT and HyperLogLog deliver ~1.6% error in ~1.5 KB, and the three cross-dialect patterns for sql deduplicate (ROW_NUMBER, DISTINCT ON, GROUP BY + ARG_MAX). Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works. By the end, distinct in sql stops being a one-word operator and starts being an explicit cost discussion you can defend on any query plan.
When you want hands-on reps immediately after reading, browse distinct + dedup practice library →, drill aggregation problems for the cost discussion →, and rehearse window-function dedup patterns →.
On this page
- Why DISTINCT is more expensive than people think
- SELECT DISTINCT vs GROUP BY — when same, when not
- COUNT(DISTINCT col) anatomy and performance
- APPROX_COUNT_DISTINCT and HyperLogLog
- Deduplication patterns — ROW_NUMBER vs DISTINCT vs GROUP BY
- Cheat sheet — DISTINCT recipes
- Frequently asked questions
- Practice on PipeCode
1. Why DISTINCT is more expensive than people think
sql distinct is never "free" — every DISTINCT triggers a sort or a hash aggregation under the hood
The one-sentence invariant: SELECT DISTINCT and SELECT COUNT(DISTINCT col) always materialise a unique set, which means either a sort over the inputs or a hash aggregation, and the memory cost of that unique set is O(distinct cardinality). Once you internalise that — DISTINCT is a real operator, not a syntactic flag — most of the select distinct sql interview question family becomes a discussion about plan choices and memory budgets.
The three myths every junior engineer believes.
-
"DISTINCT is just a syntactic switch." It is not. Postgres EXPLAIN shows
HashAggregate(orUnique+Sortfor ordered output); MySQL showsUsing temporary; Using filesortfor the multi-column case; Snowflake shows an extra aggregation stage in the profile. A switch with a plan node attached is not a switch — it is an operator. -
"DISTINCT is cheaper than GROUP BY." Almost never. For a single column the planner usually picks the same plan for both. For multi-column DISTINCT the planner sometimes picks a worse hash table layout than the equivalent
GROUP BYwould — becauseGROUP BYis allowed to pick an aggregation strategy first and then think about uniqueness, whereasDISTINCT *has to treat every column as a group key. -
"
SELECT DISTINCT *is the safe way to dedup." It is the dangerous way to dedup.DISTINCT *keys the entire row including thecreated_attimestamp, theupdated_atwatermark, and any audit column; if even one of those changes between two otherwise-identical writes, the duplicate is preserved. The right pattern isDISTINCT ON (business_key)(Postgres) orROW_NUMBER() OVER (PARTITION BY business_key ORDER BY tie_breaker) = 1(everywhere) — covered in §5.
What a senior engineer reaches for instead.
-
For "unique values of one column" —
SELECT DISTINCT col FROM torSELECT col FROM t GROUP BY colare interchangeable. -
For "unique tuples of N columns" — write
GROUP BY a, b, …and you can fold inCOUNT(*)orMAX(ts)later without re-querying. -
For "exact count of unique values" —
COUNT(DISTINCT col)if cardinality is small;APPROX_COUNT_DISTINCT(col)if cardinality is large and ~1.6% error is acceptable. -
For "one row per business key" — never
DISTINCT *. AlwaysROW_NUMBER()orDISTINCT ONkeyed on the business key.
The production code review heuristics.
- If a query has
SELECT DISTINCT a, b, c, d, e, f, g, ask why. There is almost always a business key buried in there and the engineer has reached forDISTINCTbecause they were unsure which key. - If a query has
COUNT(DISTINCT col)over a known billion-row table for a dashboard, ask whetherAPPROX_COUNT_DISTINCTis acceptable. For MAU / DAU dashboards the answer is almost always yes. - If a query has multiple
COUNT(DISTINCT)calls, ask whether the engine supportsGROUPING SETSorHLL_COMBINEto fold them into one pass instead of N passes.
What interviewers listen for.
- Do you say "
DISTINCTtriggers a sort or hash aggregation" when asked the cost? — senior signal. - Do you reach for
GROUP BYwhen an aggregate is needed alongside the unique set? — required answer. - Do you mention
APPROX_COUNT_DISTINCT/ HyperLogLog when the table is huge? — senior signal. - Do you flag
DISTINCT *as the dedup anti-pattern? — required answer.
Worked example — measure the cost of DISTINCT on a 100M-row events table
Detailed explanation. A common interview probe is "how much does SELECT DISTINCT user_id FROM events cost?" The senior answer is not a number; it is a method: "Show me the EXPLAIN, the cardinality of user_id, and work_mem, and I will tell you whether it stays in memory or spills."
Question. Given an events table with 100 million rows and ~5 million distinct user_id values, write the query that returns the unique users, and explain what the planner does on Postgres with work_mem=64MB.
Input.
| Setting | Value |
|---|---|
| Table | events(user_id BIGINT, event_ts TIMESTAMPTZ, …) |
| Rows | 100,000,000 |
Distinct user_id
|
5,000,000 |
work_mem |
64 MB |
| Bytes per hash entry | ~24 (key + hash + chain) |
Code.
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT user_id
FROM events;
Step-by-step explanation.
-
Planner picks a strategy. Postgres compares
HashAggregate(build a hash table keyed byuser_id) vsUnique + Sort(sort everything, then walk the sorted stream). For 5M distinct keys the hash table is ~120 MB — bigger thanwork_mem— so the planner may either pickUnique + Sort(external sort, spills to disk) or use the partitioned-hash-aggregation strategy added in Postgres 13+ (also spills). -
Memory footprint =
O(distinct cardinality). 5M × 24 bytes ≈ 120 MB hash table. Withwork_mem=64MB, the operator either spills to disk in batches or the planner switches toSort(which spills viaexternal merge). - Wall time dominated by I/O during spill. Once the operator spills, the cost is dominated by reading and writing temp files, not by hashing CPU.
-
The fix. Raise
work_memto ~256 MB just for this session, or pre-aggregate at write time, or acceptAPPROX_COUNT_DISTINCTwhen an exact list of users is not needed.
Output (illustrative).
| Strategy | work_mem | Wall time | Disk temp files |
|---|---|---|---|
HashAggregate (in memory) |
256 MB | 8 s | 0 |
HashAggregate (spilled) |
64 MB | 38 s | ~120 MB |
Sort + Unique |
64 MB | 42 s | ~2 GB (sorts all 100M rows first) |
Rule of thumb. work_mem >= hash_entry_bytes × distinct_cardinality keeps DISTINCT in memory. When you cannot raise work_mem, switch to APPROX_COUNT_DISTINCT or pre-aggregate the unique set in a daily rollup.
Worked example — why DISTINCT * preserves duplicates that look identical
Detailed explanation. A real production trap: a team has a signups(email, user_id, source, signup_ts, updated_at) table and tries to dedupe with SELECT DISTINCT * FROM signups. They expect one row per email; they get every version because updated_at differs across replays of the same logical signup.
Question. Given a table where the business key is email_lower but the row carries signup_ts and updated_at audit columns, explain why SELECT DISTINCT * does not produce one row per email, and rewrite the query so it does.
Input.
| user_id | source | signup_ts | updated_at | |
|---|---|---|---|---|
| alice@x | 1 | direct | 2026-04-01 09:00 | 2026-04-01 09:00 |
| alice@x | 1 | direct | 2026-04-01 09:00 | 2026-04-15 10:30 |
| alice@x | 1 | direct | 2026-04-01 09:00 | 2026-05-02 14:00 |
| bob@x | 2 | ad | 2026-04-02 10:00 | 2026-04-02 10:00 |
Code.
-- Wrong — DISTINCT * preserves duplicates because updated_at differs
SELECT DISTINCT *
FROM signups;
-- 4 rows returned (no rows dropped — alice@x has 3 different updated_at values)
-- Right — key on the business key
WITH ranked AS (
SELECT
s.*,
ROW_NUMBER() OVER (
PARTITION BY LOWER(email)
ORDER BY updated_at DESC
) AS rn
FROM signups s
)
SELECT email, user_id, source, signup_ts, updated_at
FROM ranked
WHERE rn = 1;
-- Postgres concise form
SELECT DISTINCT ON (LOWER(email)) *
FROM signups
ORDER BY LOWER(email), updated_at DESC;
Step-by-step explanation.
-
DISTINCT *hashes the entire row including theupdated_ataudit column. Two rows with identical business data but differentupdated_atare not duplicates by theDISTINCT *definition — and so neither is dropped. - The right answer is "key on the business key" —
LOWER(email)(so casing differences don't fragment the key) — and pick the row with the latestupdated_at. -
ROW_NUMBER() ... PARTITION BY LOWER(email) ORDER BY updated_at DESCdoes exactly that. Filterrn = 1to keep the latest. -
DISTINCT ON (LOWER(email)) ... ORDER BY LOWER(email), updated_at DESCis the Postgres-only equivalent.
Output (with the correct dedup).
| user_id | source | signup_ts | updated_at | |
|---|---|---|---|---|
| alice@x | 1 | direct | 2026-04-01 09:00 | 2026-05-02 14:00 |
| bob@x | 2 | ad | 2026-04-02 10:00 | 2026-04-02 10:00 |
Rule of thumb. When you reach for DISTINCT, ask "by what key?" The answer is almost never "every column in the row." Build the dedup around the business key explicitly — LOWER(email), order_id, (customer_id, event_date) — and pick the canonical row with a deterministic ORDER BY tie-breaker.
SQL interview question on the cost of DISTINCT
A senior interviewer often opens this round as: "Walk me through what happens when I write SELECT DISTINCT user_id FROM events on a 100M-row table. Show me the plan, the memory cost, and the failure mode."
Solution Using EXPLAIN + work_mem reasoning
-- The query under test
SELECT DISTINCT user_id
FROM events;
-- The diagnostic command
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT DISTINCT user_id
FROM events;
Step-by-step trace.
| Step | Action | Observable effect |
|---|---|---|
| 1 | Planner estimates n_distinct for user_id from pg_statistic
|
n_distinct = -0.05 → ~5% of rows are distinct → ~5M unique values |
| 2 | Planner picks HashAggregate because it expects O(5M) groups |
EXPLAIN line: HashAggregate (cost=… rows=5000000)
|
| 3 | Executor builds the hash table keyed by user_id
|
Memory grows toward ~120 MB |
| 4 | If work_mem < 120 MB, the operator spills hash partitions to disk |
EXPLAIN ANALYZE line: Batches: 8 Memory Usage: 64MB Disk Usage: 56MB
|
| 5 | After the seq-scan finishes, the executor emits the unique set | Result row count = ~5M |
Output:
| Metric | In-memory run (work_mem=256MB) | Spilled run (work_mem=64MB) |
|---|---|---|
| Wall time | ~8 s | ~38 s |
| Buffers shared hit | 1.2M | 1.2M |
| Temp files written | 0 | 8 partitions, ~120 MB total |
| Result rows | 5,000,000 | 5,000,000 |
Why this works — concept by concept:
-
HashAggregate vs Sort+Unique — Postgres picks whichever is cheaper given
work_memand the estimated cardinality.HashAggregateisO(N)CPU butO(distinct)memory;Sort+UniqueisO(N log N)CPU butO(N)temp space. -
work_mem as the spill gate — when the hash table exceeds
work_mem, modern Postgres switches to a partitioned-hash-aggregation that spills excess partitions to disk and merges them. The wall time triples but the query still completes. -
n_distinct statistics — the planner reads
n_distinctfrompg_statistic(filled byANALYZE). Stale stats can pick the wrong plan;ANALYZE eventsperiodically. -
The escape hatch — for billion-row tables, neither plan is fast. The right answer is
APPROX_COUNT_DISTINCT(§4) or a daily-rolled-updistinct_userstable. -
Cost — memory =
O(distinct cardinality × entry size); CPU =O(N)for hash,O(N log N)for sort; disk =O(distinct cardinality)on spill.
SQL
Topic — distinct
DISTINCT cost and deduplication problems (SQL)
2. SELECT DISTINCT vs GROUP BY — when same, when not
select distinct sql and GROUP BY are two syntaxes for the same plan — until an aggregate or an ORDER BY enters the picture
The mental model in one line: for the question "give me the unique values of column X" the planner picks the same operator for SELECT DISTINCT X FROM t and SELECT X FROM t GROUP BY X; for "give me the unique values plus an aggregate", GROUP BY wins because DISTINCT cannot aggregate inline. Once you say that out loud, the distinct in sql vs GROUP BY interview question collapses to a two-bullet checklist.
The four cases that show up in interviews.
-
Single-column uniqueness.
SELECT DISTINCT col≡SELECT col GROUP BY col. Planner picksHashAggregatefor both; same memory, same wall time. -
Multi-column uniqueness.
SELECT DISTINCT a, b, c≡SELECT a, b, c GROUP BY a, b, c. Same plan in Postgres; subtle plan differences in MySQL because the optimizer forGROUP BYis older and sometimes adds an implicitORDER BY(configurable since MySQL 8 withONLY_FULL_GROUP_BY). -
Uniqueness + aggregate.
GROUP BYwins. You cannot writeSELECT DISTINCT a, COUNT(*) FROM t GROUP BY amore concisely than theGROUP BYitself — andDISTINCT a, bfollowed by an outerCOUNT(*)is an extra pass. -
Uniqueness + ordering. Postgres requires every
ORDER BYexpression to appear in theDISTINCTlist (otherwise the planner can't guarantee deterministic order).GROUP BYis allowed to sort by any aggregate-valid expression.
The three planner observations every senior interviewer wants to hear.
-
EXPLAIN tells the truth. Run both queries and read the plan. If both show
HashAggregate (cost=… rows=… width=…)with the same numbers, the choice is purely stylistic. -
MySQL is the exception, not the rule. MySQL prior to 8.0 used to implicitly
ORDER BYthe grouped columns; this was removed in 8.0. TheDISTINCTvsGROUP BYperformance gap on MySQL 5.7 was sometimes real because of the implicit sort; on 8.0+ it is not. -
The aggregate axis is the real divergence.
DISTINCTcannot be combined with an aggregate inline. Once you needCOUNT(*),MAX(ts),SUM(x), the equivalentGROUP BYis shorter and the plan is identical to theDISTINCT + aggregaterewrite anyway.
Postgres-specific subtlety — DISTINCT ORDER BY.
- Legal in Postgres:
SELECT DISTINCT a, b FROM t ORDER BY a, b—aandbare in theDISTINCTlist. - Illegal in Postgres:
SELECT DISTINCT a FROM t ORDER BY a, b—bis not in theDISTINCTlist, so the planner cannot guarantee order. - Equivalent
GROUP BY:SELECT a, MAX(b) FROM t GROUP BY a ORDER BY a, MAX(b)— works becauseMAX(b)is a deterministic function of the group.
Worked example — DISTINCT a, b vs GROUP BY a, b
Detailed explanation. A common interview probe is "are SELECT DISTINCT a, b FROM t and SELECT a, b FROM t GROUP BY a, b always identical?" The senior answer is "yes, modulo the aggregate axis and modulo ORDER BY."
Question. Given a 50M-row pageviews(user_id, page_url, ts) table, return the unique (user_id, page_url) tuples two ways — using DISTINCT and using GROUP BY — and compare the plans.
Input.
| Column | Type | Cardinality |
|---|---|---|
| user_id | BIGINT | 5M distinct |
| page_url | TEXT | 200 distinct |
| ts | TIMESTAMPTZ | unique per row |
Code.
-- A — using DISTINCT
SELECT DISTINCT user_id, page_url
FROM pageviews;
-- B — using GROUP BY
SELECT user_id, page_url
FROM pageviews
GROUP BY user_id, page_url;
Step-by-step explanation.
- Both queries hash on the tuple
(user_id, page_url)— same key, same hash function. - Postgres's planner picks
HashAggregatefor both.EXPLAINoutput is byte-for-byte identical except for the operator label. - Estimated rows =
min(rows_in_t, cardinality_a × cardinality_b)=min(50M, 5M × 200) = 50M— but the planner caps the estimate atn_distinct≈ 5M × ~50 (actual co-occurrence) ≈ 250M, then clips to the table row count. - Wall time is identical within ±5% on warm caches; choose the syntax your team's style guide prefers.
Output.
| Query | Plan node | Estimated rows | Wall time | Memory |
|---|---|---|---|---|
DISTINCT a, b |
HashAggregate | ~50M | 24 s | ~3 GB (with spill) |
GROUP BY a, b |
HashAggregate | ~50M | 24 s | ~3 GB (with spill) |
Rule of thumb. Use DISTINCT when the query is only "give me unique values." Use GROUP BY the moment you might want to fold in COUNT(*), MAX(ts), or any aggregate — even speculatively. The GROUP BY shape leaves the door open without a rewrite.
Worked example — DISTINCT with ORDER BY that breaks on Postgres
Detailed explanation. A subtle Postgres-only behaviour every senior interview probes: SELECT DISTINCT a FROM t ORDER BY b is illegal because the planner cannot guarantee a deterministic order when the sort key is not in the DISTINCT list. GROUP BY does not have this restriction — it can sort by any aggregate of the grouped columns.
Question. Given an events(user_id, last_seen_at, country) table, write a query that returns one row per (user_id) ordered by last_seen_at DESC. Show why the naïve DISTINCT version errors on Postgres and how to rewrite it.
Input.
| user_id | last_seen_at | country |
|---|---|---|
| 1 | 2026-05-01 09:00 | US |
| 1 | 2026-05-04 12:00 | US |
| 2 | 2026-05-03 11:00 | DE |
| 3 | 2026-05-02 10:00 | UK |
Code.
-- Broken on Postgres
SELECT DISTINCT user_id
FROM events
ORDER BY last_seen_at DESC; -- ERROR: for SELECT DISTINCT, ORDER BY expressions must appear in select list
-- Correct rewrite using GROUP BY + MAX
SELECT user_id
FROM events
GROUP BY user_id
ORDER BY MAX(last_seen_at) DESC;
-- Also correct — fold the column into the DISTINCT list and key on user_id
SELECT DISTINCT ON (user_id) user_id, last_seen_at
FROM events
ORDER BY user_id, last_seen_at DESC;
Step-by-step explanation.
- The first query fails because Postgres cannot prove that any chosen
last_seen_atperuser_idis canonical — different rows for the sameuser_idcould be at differentlast_seen_at. - The
GROUP BYshape works becauseMAX(last_seen_at)is a deterministic function of the group. - The
DISTINCT ONshape works because it explicitly says "pick one row peruser_id, namely the one with the largestlast_seen_at." - MySQL is more permissive (
ONLY_FULL_GROUP_BYoff by default before 8.0) but the resulting order is then non-deterministic.
Output.
| user_id | (sort key) |
|---|---|
| 1 | 2026-05-04 12:00 |
| 2 | 2026-05-03 11:00 |
| 3 | 2026-05-02 10:00 |
Rule of thumb. If you find yourself reaching for DISTINCT col ORDER BY other_col, you actually want GROUP BY col ORDER BY aggregate(other_col) or DISTINCT ON (col). The shape of the answer is "one row per key with a chosen tie-breaker," and that is exactly what GROUP BY / DISTINCT ON model.
SQL interview question on DISTINCT vs GROUP BY for aggregation
A scaling probe: "For a daily ETL that needs the unique (user_id, page_url) tuples and a count of how many times each tuple occurred, which is faster — DISTINCT followed by a self-join, or GROUP BY with COUNT(*)?"
Solution Using GROUP BY with COUNT(*) in one pass
-- The wrong way — DISTINCT then re-aggregate (two passes)
WITH unique_pairs AS (
SELECT DISTINCT user_id, page_url FROM pageviews
)
SELECT u.user_id, u.page_url, COUNT(*) AS hits
FROM unique_pairs u
JOIN pageviews p USING (user_id, page_url)
GROUP BY u.user_id, u.page_url;
-- The right way — GROUP BY does both in one pass
SELECT user_id, page_url, COUNT(*) AS hits
FROM pageviews
GROUP BY user_id, page_url;
Step-by-step trace.
| Approach | Passes | Operators | Reads |
|---|---|---|---|
DISTINCT + self-join |
2 | HashAggregate → HashJoin → HashAggregate | 50M + 50M = 100M |
Single GROUP BY
|
1 | HashAggregate | 50M |
Output:
| Approach | Wall time | Result rows | Peak memory |
|---|---|---|---|
DISTINCT + self-join |
78 s | ~250K | ~6 GB (two hash tables) |
Single GROUP BY
|
28 s | ~250K | ~3 GB (one hash table) |
Why this works — concept by concept:
-
One hash pass vs two —
GROUP BYbuilds the hash table keyed by(user_id, page_url)once and accumulatesCOUNT(*)into each bucket while it scans. TheDISTINCTshape builds the table once, throws away the bucket counts, and forces a re-scan to recompute them. -
No self-join cost — the
DISTINCT + JOINshape pays a hash-join cost (build + probe of two 50M streams) thatGROUP BYdoes not. -
The planner cannot rescue you — Postgres does not have a rewrite rule that turns
DISTINCT+ self-join into a singleGROUP BY. The optimization is at the SQL level, not the planner level. -
Same memory upper bound, smaller peak — both shapes have
O(distinct tuples)memory; theGROUP BYshape never instantiates the second hash table for the join. -
Cost —
GROUP BY + COUNT≈O(N)CPU,O(distinct)memory, 1 pass;DISTINCT + JOIN + COUNT≈O(N) + O(N log N)CPU,O(2 × distinct)memory, 2 passes.
SQL
Topic — aggregation
Aggregation problems (GROUP BY, DISTINCT, COUNT)
3. COUNT(DISTINCT col) anatomy and performance
sql count distinct is exact, expensive, and the most common performance footgun in analytics queries
The mental model: COUNT(DISTINCT col) must build the entire unique set of col values in memory (or on disk if it spills), then emit the cardinality of that set. The set never shrinks during the scan — it can only grow — so memory cost is O(distinct cardinality) and any index on col does not save you, because the engine must touch every row to know whether each value is "new" or "already seen."
The five cost properties every interview probes.
-
Memory =
O(distinct cardinality). NotO(row count). A column with 1B rows but only 100 distinct values fits in a few KB; a column with 100M rows and 100M distinct values needs ~2.4 GB of hash table. -
Disk spill above
work_mem. Postgres spills hash partitions; MySQL writes a temp file totmpdir; Snowflake spills to remote storage and the credit cost spikes. -
Index does not help. A B-tree on
collets the engine sort cheaply, but it still must read every leaf entry.COUNT(DISTINCT col)cannot be answered from an index summary unless the engine maintains aMIN/MAX/UNIQUE_COUNTper leaf page (no major OLTP engine does). -
Multiple
COUNT(DISTINCT)in one query → N passes.SELECT COUNT(DISTINCT a), COUNT(DISTINCT b), COUNT(DISTINCT c) FROM tbuilds three separate hash tables. Postgres does this with threeAggregatenodes that each consume the input; some warehouses parallelise it but the memory cost is the sum. -
COUNT(DISTINCT a, b)is dialect-specific. Postgres, Snowflake, and BigQuery all support it. MySQL supports it but treats(NULL, NULL)differently. SQL Server requires aCONCATworkaround. Spark supports it viacount(distinct a, b)which translates to a tuple hash.
The four most-asked count distinct sql interview questions.
- "What's the difference between
COUNT(*),COUNT(col), andCOUNT(DISTINCT col)?" — row count, non-null row count, unique non-null count. - "Why is
COUNT(DISTINCT)slow on big tables?" — the unique set lives in memory; spill is brutal. - "How do I count distinct over a windowed range?" — typically not directly; either
array_agg(DISTINCT)+cardinalityor a daily-rolled-up table. - "Multiple
COUNT(DISTINCT)in one query — how to avoid N passes?" —GROUPING SETS,HLL_COMBINE, or pre-aggregate.
Dialect notes that matter at the whiteboard.
-
Postgres.
COUNT(DISTINCT col)always usesSort + Aggregate(notHashAggregate) — historical, will be fixed in newer versions. The workaround for hash performance isSELECT COUNT(*) FROM (SELECT DISTINCT col FROM t) s. -
MySQL.
COUNT(DISTINCT a, b)is fine;(NULL, x)and(NULL, y)are treated as distinct from each other only if the engine sees them as a tuple (configurable). -
Snowflake.
COUNT(DISTINCT col)is exact;APPROX_COUNT_DISTINCT(col)is the HLL variant. The exact form is the default — and the expensive one. -
BigQuery.
COUNT(DISTINCT col)over more than ~1B rows triggers aResources Exceedederror unless the cluster is sized for it;APPROX_COUNT_DISTINCTworks without limit.
Worked example — measure the cost of multiple COUNT(DISTINCT) in one query
Detailed explanation. A dashboard query asks: "for the events table, how many distinct users, distinct sessions, and distinct pages did we see yesterday?" Naïvely written, this is three COUNT(DISTINCT) calls and three hash tables — three full scans worth of memory.
Question. On a 1B-row daily events table, what does SELECT COUNT(DISTINCT user_id), COUNT(DISTINCT session_id), COUNT(DISTINCT page_url) FROM events WHERE event_date = '2026-06-01' do on Postgres, and how would you reduce the cost?
Input.
| Column | Distinct values (1 day) | Hash bytes per entry |
|---|---|---|
| user_id | 10,000,000 | 24 (BIGINT + hash + chain) |
| session_id | 25,000,000 | 40 (UUID + hash + chain) |
| page_url | 50,000 | 80 (TEXT avg 40 bytes + overhead) |
Code (the naive query).
SELECT
COUNT(DISTINCT user_id) AS distinct_users,
COUNT(DISTINCT session_id) AS distinct_sessions,
COUNT(DISTINCT page_url) AS distinct_pages
FROM events
WHERE event_date = '2026-06-01';
Step-by-step explanation.
- Three hash tables in memory. ~240 MB (user) + ~1 GB (session) + ~4 MB (page) ≈ 1.24 GB.
-
Single seq-scan with three operators on top. Postgres reads the day's partition once and feeds three independent
Aggregatenodes — each maintains its own hash table. -
Spill behaviour. With
work_mem=64MB, the user_id and session_id operators each spill to ~200 MB of temp files. Wall time grows by ~3×. -
The fix. Either size
work_memto fit (SET work_mem = '2GB'for this session) or switch toAPPROX_COUNT_DISTINCT(or its dialect equivalent) which uses ~1.5 KB per metric.
Output (illustrative, Postgres on a 64-core box).
| Strategy | Memory | Wall time | Result accuracy |
|---|---|---|---|
Three exact COUNT(DISTINCT)
|
1.24 GB | 4 min | exact |
Three APPROX_COUNT_DISTINCT
|
~4.5 KB | 12 s | ±1.6% |
One COUNT(DISTINCT) + two APPROX_COUNT_DISTINCT
|
~240 MB + ~3 KB | 45 s | exact / ±1.6% / ±1.6% |
Rule of thumb. For dashboards, default to APPROX_COUNT_DISTINCT. Use exact COUNT(DISTINCT) only when (a) the cardinality is small (< 1M distinct values) or (b) a regulator or product requirement says "exact."
Worked example — COUNT(DISTINCT a, b) and dialect differences
Detailed explanation. The multi-column form COUNT(DISTINCT a, b) is the second-most-frequent interview shape after COUNT(DISTINCT col). It is widely supported but the dialect quirks bite: SQL Server does not have it, MySQL handles NULL differently from Postgres, and Spark's tuple-hash makes it cheap to compute.
Question. Given a page_visits(user_id, session_id, page_url) table, write a query that returns the number of distinct (user_id, page_url) pairs. Show three forms — Postgres / Snowflake, MySQL with caveat, and SQL Server with CONCAT fallback.
Input.
| user_id | session_id | page_url |
|---|---|---|
| 1 | s1 | /home |
| 1 | s1 | /pricing |
| 1 | s2 | /home |
| 2 | s3 | /home |
| 2 | s3 | NULL |
| 2 | s4 | NULL |
Code.
-- Postgres / Snowflake / BigQuery / Spark — direct tuple form
SELECT COUNT(DISTINCT user_id, page_url) AS distinct_pairs
FROM page_visits;
-- Result: 3 distinct non-NULL pairs (1, /home), (1, /pricing), (2, /home)
-- The (2, NULL) and (2, NULL) duplicate rows are skipped because page_url is NULL
-- MySQL — same syntax works, NULL handling is similar but verify on your version
SELECT COUNT(DISTINCT user_id, page_url) AS distinct_pairs
FROM page_visits;
-- SQL Server — must concatenate
SELECT COUNT(DISTINCT CONCAT(CAST(user_id AS VARCHAR(20)), '||', ISNULL(page_url, '<NULL>'))) AS distinct_pairs
FROM page_visits;
-- Portable form — GROUP BY then COUNT
SELECT COUNT(*) AS distinct_pairs
FROM (
SELECT user_id, page_url
FROM page_visits
GROUP BY user_id, page_url
) g;
Step-by-step explanation.
-
Postgres / Snowflake / BigQuery / Spark — the engine hashes the tuple
(user_id, page_url)directly. Rows where any column isNULLare skipped entirely from the count (this is the standard-SQL behaviour forCOUNT(DISTINCT)over any column). -
MySQL — same syntax, but historically had different
NULLsemantics; on 8.0+ it matches standard SQL. -
SQL Server — no native multi-column form; the workaround concatenates with a separator that cannot occur in either column, and explicitly maps
NULLto a sentinel string so distinct(2, NULL)rows are counted instead of dropped. -
Portable GROUP BY form —
SELECT COUNT(*) FROM (SELECT a, b FROM t GROUP BY a, b)is the universal cross-dialect answer. It also makes theNULLsemantics explicit (you choose whether to filter them out in the inner query).
Output (under standard-SQL NULL semantics).
| Pair | Counted? |
|---|---|
| (1, /home) | yes |
| (1, /pricing) | yes |
| (2, /home) | yes |
| (2, NULL) | no — NULL in the tuple |
Result: distinct_pairs = 3.
Rule of thumb. If your query has to run on SQL Server and Postgres, write the portable COUNT(*) FROM (SELECT … GROUP BY) form. If you control the dialect, use COUNT(DISTINCT a, b) — it is shorter and the planner can sometimes optimise the tuple hash.
SQL interview question on speeding up COUNT(DISTINCT)
A senior probe: "Your daily 'distinct active users' dashboard query is now taking 12 minutes because the events table crossed 5B rows. What are the three knobs you would consider — in order — to fix it?"
Solution Using pre-aggregation + APPROX_COUNT_DISTINCT + work_mem
-- Step 1 — pre-aggregate at write time
CREATE MATERIALIZED VIEW daily_active_users AS
SELECT
event_date,
COUNT(DISTINCT user_id) AS exact_dau
FROM events
GROUP BY event_date;
REFRESH MATERIALIZED VIEW daily_active_users; -- runs once per day, cost amortised
-- Step 2 — switch to APPROX for the dashboard (no materialized view needed)
SELECT
event_date,
APPROX_COUNT_DISTINCT(user_id) AS approx_dau
FROM events
WHERE event_date BETWEEN '2026-05-01' AND '2026-06-01'
GROUP BY event_date;
-- Step 3 — if exact is required, raise work_mem for this session only
SET work_mem = '4GB';
SELECT
event_date,
COUNT(DISTINCT user_id) AS exact_dau
FROM events
WHERE event_date BETWEEN '2026-05-01' AND '2026-06-01'
GROUP BY event_date;
Step-by-step trace.
| Step | Action | Effect |
|---|---|---|
| 1 | Pre-aggregate via materialized view, refreshed daily | dashboard reads ~30 rows instead of 5B |
| 2 | Switch dashboard to APPROX_COUNT_DISTINCT
|
error budget ±1.6%, memory ~1.5 KB/day |
| 3 | If exact required, raise work_mem for the session |
avoids spill, hash stays in RAM |
| 4 | Cache the materialized view in a fast tier (Postgres read replica or Snowflake result cache) | dashboard latency drops to single-digit ms |
Output:
| Strategy | Memory | Wall time | Accuracy |
|---|---|---|---|
Original (raw COUNT(DISTINCT)) |
~12 GB (spilled) | 12 min | exact |
APPROX_COUNT_DISTINCT on raw events |
~50 KB | 28 s | ±1.6% |
| Materialized view (exact) | ~30 rows scanned | 80 ms | exact |
Materialized view + work_mem for refresh |
~12 GB during refresh only | 12 min (1×/day) | exact, daily snapshot |
Why this works — concept by concept:
-
Pre-aggregation amortises the cost — the expensive
COUNT(DISTINCT)runs once per day in a controlled window, not on every dashboard load. -
APPROX is the right default for dashboards — a
±1.6%error band on a metric like DAU is invisible to humans and well below the day-to-day variance the dashboard already shows. -
work_mem as a session-level lever —
SET work_mem = '4GB'only applies to the current session and lets a one-off exact run finish without changing global config. - Materialized view + read replica — separates the OLAP compute from the dashboard latency budget; the dashboard reads pre-computed numbers, not raw events.
-
Cost — original =
O(N)scan +O(distinct)memory + spill; materialized =O(days)scan + tiny memory; APPROX =O(N)scan +O(1)memory.
SQL
Topic — aggregation
COUNT(DISTINCT) + dashboard aggregation problems (SQL)
4. APPROX_COUNT_DISTINCT and HyperLogLog
APPROX_COUNT_DISTINCT trades ±1.6% error for O(log log n) memory — the right default for dashboards
The mental model: HyperLogLog (HLL) is a probabilistic data structure that estimates the cardinality of a set in roughly 1.5 KB of memory, with ~1.6% standard error, regardless of whether the set has 10K or 10B distinct elements. Every major analytics warehouse exposes it under a slightly different name — APPROX_COUNT_DISTINCT in Snowflake and BigQuery, approx_count_distinct in Spark, the hll extension on Postgres, PFCOUNT on Redis — but the underlying algorithm is the same.
The HyperLogLog one-page summary.
-
Hash every input. Each value
vis hashed to a uniformb-bit number, e.g. 64 bits. -
Split into bucket + tail. The first
pbits index one ofm = 2^pregisters (typicallyp=14, som=16384). The remaining bits form the "tail." -
Track the max leading-zero count per register. For each new value, count how many leading zeros the tail has (call it
R); storemax(register[bucket], R). -
Estimate cardinality.
estimate = αₘ × m² / Σ 2^(-Rⱼ)whereαₘis a known bias-correction constant. Withp=14, the error is~1.04 / √m ≈ 0.81%(HLL++ further improves small-cardinality accuracy). -
Memory =
m × 5 bits(per-register max leading zero counter) ≈16384 × 5 / 8 ≈ 10 KB. Many implementations compress further to~1.5 KBfor typical workloads.
Engines and their flavours.
-
Snowflake —
APPROX_COUNT_DISTINCT(col); standard error1.62%. AlsoHLL(col)(returns the HLL state) andHLL_COMBINE(merges states). -
BigQuery —
APPROX_COUNT_DISTINCT(col); backed by HLL++. AlsoHLL_COUNT.INIT,HLL_COUNT.MERGE,HLL_COUNT.EXTRACTfor state-management workflows. -
Spark —
approx_count_distinct(col, rsd=0.05)wherersdis the target relative standard deviation (default 5%). Lower rsd → more memory. -
Postgres —
hllextension (citus,aiven,timescale). Usehll_cardinality(hll_add_agg(col)). -
Redis —
PFADD,PFCOUNT,PFMERGE— the canonical streaming HLL surface.
When the approximation is safe and when it is not.
- Safe. Dashboards (MAU, DAU, ad-network reach), weekly reports, anomaly-detection thresholds, capacity planning. A 1.6% miss on "monthly active users" is invisible.
- Unsafe. Billing and revenue reconciliation, regulatory reporting, deduplication of a list of records (HLL gives you a cardinality, not the list). A 1.6% miss on "users billed this month" is a lawsuit.
-
Hybrid. Keep exact
COUNT(DISTINCT)in the monthly close report (regulatory) and HLL everywhere else (dashboards). The two never collide because the close report runs once a month withwork_memcranked up.
The mergeable property — why HLL is huge for ETL.
- HLL states are mergeable —
HLL(A) ∪ HLL(B) = HLL(A ∪ B). That means you can compute the daily HLL state per partition, store it, and merge weeks or months later without re-scanning the underlying events. - This makes HLL the right pattern for MAU / DAU rollups — store one HLL per
(day, segment)and merge on demand for the dashboard query.
Worked example — APPROX_COUNT_DISTINCT on a 10B-row events table
Detailed explanation. A practical comparison: a 10B-row events table, the question "how many distinct users in the last 30 days?", with exact and approximate variants run on the same Snowflake warehouse.
Question. Given events(user_id, event_date, …) with 10B rows over 30 days and ~80M distinct users, compare COUNT(DISTINCT user_id) and APPROX_COUNT_DISTINCT(user_id) on Snowflake. Show the SQL, the trace, and the cost numbers a senior engineer would quote.
Input.
| Setting | Value |
|---|---|
| Table | events(user_id BIGINT, event_date DATE, …) |
| Rows | 10,000,000,000 |
| Distinct user_id | ~80,000,000 |
| Warehouse | Snowflake LARGE (8 nodes) |
| Tables clustered on | event_date |
Code.
-- Exact distinct count
SELECT COUNT(DISTINCT user_id) AS exact_users
FROM events
WHERE event_date >= CURRENT_DATE - INTERVAL '30 days';
-- Approximate distinct count
SELECT APPROX_COUNT_DISTINCT(user_id) AS approx_users
FROM events
WHERE event_date >= CURRENT_DATE - INTERVAL '30 days';
-- The mergeable pattern — store one HLL state per day, merge on demand
CREATE OR REPLACE TABLE daily_user_hll AS
SELECT
event_date,
HLL(user_id) AS user_hll_state -- serialised HLL state, one row per day
FROM events
GROUP BY event_date;
-- Dashboard query reads the rollup, not the raw events
SELECT HLL_ESTIMATE(HLL_COMBINE(user_hll_state)) AS approx_30d_users
FROM daily_user_hll
WHERE event_date >= CURRENT_DATE - INTERVAL '30 days';
Step-by-step explanation.
-
Exact path. The 8-node warehouse scans ~10B rows, hashes each
user_idinto a global unique set of ~80M, then emits one number. Wall time ≈ 2 minutes; warehouse credits ≈ 0.27. - Approximate path. Same scan, but instead of a hash set it maintains 8 × 16384 HLL registers across nodes, then merges them at the coordinator. Wall time ≈ 8 seconds; credits ≈ 0.02.
- Mergeable rollup path. Daily HLL states are pre-computed (one row per day, ~10 KB serialised); the dashboard query reads 30 rows and merges them. Wall time ≈ 200 ms; credits ≈ negligible.
-
Accuracy. APPROX returns ~80.1M (true is 80.0M) — 0.125% error, well within the
~1.6%envelope.
Output.
| Strategy | Wall time | Warehouse credits | Result | Error |
|---|---|---|---|---|
COUNT(DISTINCT user_id) |
2 min | 0.27 | 80,000,000 (exact) | 0 |
APPROX_COUNT_DISTINCT |
8 s | 0.02 | 80,135,000 | 0.17% |
| HLL rollup + merge | 200 ms | < 0.001 | 80,142,000 | 0.18% |
Rule of thumb. For any "distinct over a date range" dashboard, start with APPROX_COUNT_DISTINCT. Materialise one HLL state per partition (day, region, segment) if the dashboard latency budget is sub-second.
Worked example — streaming distinct counts with Redis PFADD
Detailed explanation. Outside the warehouse, HLL also powers streaming distinct counts in OLTP-adjacent stores. Redis exposes the HyperLogLog primitive as PFADD / PFCOUNT / PFMERGE — perfect for "how many distinct visitors are on the site right now" without keeping a row per visitor in a Redis Set.
Question. Build a real-time "unique visitors today" counter for a website that receives ~1B events per day. Compare a Redis Set vs a Redis HyperLogLog (PFADD).
Input.
| Setting | Value |
|---|---|
| Events / day | ~1,000,000,000 |
| Distinct visitors / day | ~10,000,000 |
| Average user_id size | 36 bytes (UUID string) |
Code.
# Variant A — Redis Set
import redis
r = redis.Redis()
def track_visitor_set(user_id: str):
r.sadd("visitors:2026-06-01", user_id)
def count_visitors_set():
return r.scard("visitors:2026-06-01")
# Variant B — Redis HyperLogLog
def track_visitor_hll(user_id: str):
r.pfadd("visitors_hll:2026-06-01", user_id)
def count_visitors_hll():
return r.pfcount("visitors_hll:2026-06-01")
# Merge multiple days for "last 7 days unique"
def merge_week():
keys = [f"visitors_hll:2026-05-{d:02d}" for d in range(26, 32)] + ["visitors_hll:2026-06-01"]
r.pfmerge("visitors_week", *keys)
return r.pfcount("visitors_week")
Step-by-step explanation.
-
Redis Set memory. A
SETkeyed byuser_idstores every distinct value — 10M UUIDs × ~80 bytes each (Redis overhead) ≈ 800 MB / day. -
Redis HLL memory. A single
PFADDkey stays at ~12 KB regardless of cardinality (Redis HLL is fixed-size). Memory drops by ~65,000×. -
Merge for ranges.
PFMERGEcombines per-day HLLs into a single weekly state in one constant-time operation per source key. TheSetalternative would require iterating every member of every day — impossibly slow. -
Error envelope. Redis HLL claims
0.81%standard error; observed in production at~1.6%peak. Acceptable for any visitor-counter dashboard.
Output.
| Metric | Redis Set | Redis HLL |
|---|---|---|
| Memory per day | ~800 MB | ~12 KB |
track_visitor latency |
~50 µs (SADD) | ~30 µs (PFADD) |
count_visitors latency |
~10 µs (SCARD) | ~20 µs (PFCOUNT) |
| 7-day merge | impractical | one PFMERGE, ~80 µs |
| Result accuracy | exact | ±1.6% |
Rule of thumb. Unless you need an exact list of the visitors (not just the count) or unless you need to test membership (SISMEMBER), reach for HLL. The memory savings are not 2× — they are 4–5 orders of magnitude.
Worked example — trace the HLL update for a single insert
Detailed explanation. To defend the algorithm at the whiteboard you need to walk through one update step by hand. The interview probe usually sounds like "what does HLL actually do when I PFADD user_42?" — and the answer is a four-step pipeline you can sketch in 30 seconds.
Question. With m=8 registers (for whiteboard simplicity; production HLL uses m=16384), walk through what happens when a new value v = "user_42" is added to the HLL. Show the hash, the bucket selection, the leading-zero count, and the register update.
Input.
| Setting | Value |
|---|---|
Register count m
|
8 (so p = log2(m) = 3) |
| Hash function | 32-bit xxhash
|
Input value v
|
"user_42" |
| Initial register array | [0, 1, 0, 2, 0, 0, 1, 0] |
Suppose hash(v)
|
0x00031a4f = 0000 0000 0000 0011 0001 1010 0100 1111
|
Code (pseudocode for the update step).
def pfadd(registers, value, p=3, b=32):
h = xxhash(value) # 32-bit hash
bucket = h >> (b - p) # top `p` bits select the bucket
tail = (h << p) & ((1 << b) - 1) # remaining bits form the tail
# Count leading zeros in `tail`, then add 1 to land on the position
R = leading_zeros(tail, width=b - p) + 1
if R > registers[bucket]:
registers[bucket] = R
return registers
# Trace
registers = [0, 1, 0, 2, 0, 0, 1, 0]
pfadd(registers, "user_42", p=3, b=32)
Step-by-step explanation.
-
Hash.
hash("user_42")=0x00031a4f. Binary:0000 0000 0000 0011 0001 1010 0100 1111. -
Bucket. Top
p=3bits =000→ bucket index0. -
Tail. Remaining
b - p = 29bits =0 0000 0000 0011 0001 1010 0100 1111. Leading zeros in this 29-bit tail =13. -
R (rank).
R = 13 + 1 = 14— by convention HLL stores1 + leading_zerosso the minimum value is 1. -
Register update.
registers[0] = max(0, 14) = 14. The array becomes[14, 1, 0, 2, 0, 0, 1, 0]. -
Cardinality estimate.
Z = Σ 2^(-Rⱼ) = 2^-14 + 2^-1 + 2^-0 + 2^-2 + 2^-0 + 2^-0 + 2^-1 + 2^-0(using the new values). Form=8,αₘ ≈ 0.673.estimate = αₘ × m² / Z.
Output.
| Register | Before | After |
|---|---|---|
| 0 | 0 | 14 |
| 1 | 1 | 1 |
| 2 | 0 | 0 |
| 3 | 2 | 2 |
| 4 | 0 | 0 |
| 5 | 0 | 0 |
| 6 | 1 | 1 |
| 7 | 0 | 0 |
Rule of thumb. Each register only ever grows — max(old, new). That is what makes HLL mergeable (the union of two HLLs is the element-wise max of their register arrays) and what gives it constant memory regardless of input size.
Worked example — choose the right approximation across engines
Detailed explanation. Every major warehouse has its own knobs for the precision-vs-memory trade-off. Knowing the right call across Snowflake, BigQuery, and Spark is the kind of dialect-fluency interview round senior interviewers love to probe.
Question. A team wants to compute unique-visitor counts per region across the last year, with a target relative standard deviation of 1%. Compare the API and the memory cost on Snowflake, BigQuery, and Spark.
Input.
| Setting | Value |
|---|---|
| Rows scanned | ~50B (1 year of events) |
| Distinct visitors / region | ~5M |
| Regions | 12 |
| Target rsd | 1% |
Code.
-- Snowflake — APPROX_COUNT_DISTINCT (precision is fixed at p=14, ~1.6% standard error)
SELECT region, APPROX_COUNT_DISTINCT(user_id) AS approx_users
FROM events
WHERE event_date >= '2025-06-01'
GROUP BY region;
-- BigQuery — HLL_COUNT.MERGE for finer control
SELECT region,
HLL_COUNT.MERGE(user_id_hll) AS approx_users
FROM (
SELECT region, HLL_COUNT.INIT(user_id, 15) AS user_id_hll -- p=15 → ~1.0% std err
FROM events
WHERE event_date >= '2025-06-01'
GROUP BY region
);
-- Spark — approx_count_distinct(col, rsd)
SELECT region, approx_count_distinct(user_id, 0.01) AS approx_users
FROM events
WHERE event_date >= '2025-06-01'
GROUP BY region;
Step-by-step explanation.
-
Snowflake —
APPROX_COUNT_DISTINCTuses HLL with fixedp=14(~16,384 registers, ~1.6% std err). To get a tighter envelope, useHLL+HLL_COMBINEand reason about per-day rollups instead. -
BigQuery —
HLL_COUNT.INIT(col, precision)lets you setprecisionfrom 10 to 24.p=15gives ~1.0% std err with ~32KB per HLL state.HLL_COUNT.MERGEcombines. -
Spark —
approx_count_distinct(col, rsd)takes a target relative standard deviation directly.rsd=0.01choosespinternally to hit 1% accuracy at the cost of more memory per estimator. - Memory comparison (per-region HLL state).
Output.
| Engine | API | Tuning knob | Memory per region | Std err |
|---|---|---|---|---|
| Snowflake | APPROX_COUNT_DISTINCT |
none (fixed p=14) | ~12 KB | ~1.6% |
| BigQuery | HLL_COUNT.INIT(col, p=15) |
p 10–24 |
~32 KB | ~1.0% |
| Spark | approx_count_distinct(col, 0.01) |
rsd 0.001–0.5 |
~32 KB | ~1.0% |
Rule of thumb. When the spec says "I need ~1% accuracy," BigQuery's HLL_COUNT.INIT(col, 15) and Spark's approx_count_distinct(col, 0.01) give you the tunable knob. Snowflake's APPROX_COUNT_DISTINCT is fixed at ~1.6% — for tighter envelopes, store an HLL state per partition and combine.
SQL interview question on choosing between exact and approximate distinct
A common probe: "Your finance team wants the monthly active users dashboard to load in under 500ms. The events table is 10B rows. Walk me through your reasoning."
Solution Using mergeable HLL rollups
-- Step 1 — define a daily rollup of HLL states
CREATE OR REPLACE TABLE daily_user_hll (
event_date DATE NOT NULL,
user_hll_state BINARY NOT NULL,
PRIMARY KEY (event_date)
);
-- Step 2 — populate it once per day (e.g. via Snowflake task or Airflow DAG)
INSERT INTO daily_user_hll
SELECT
event_date,
HLL(user_id)
FROM events
WHERE event_date = CURRENT_DATE - 1
GROUP BY event_date;
-- Step 3 — dashboard query merges the per-day HLLs
SELECT HLL_ESTIMATE(HLL_COMBINE(user_hll_state)) AS mau_estimate
FROM daily_user_hll
WHERE event_date >= CURRENT_DATE - INTERVAL '30 days';
Step-by-step trace.
| Step | Operation | Memory | I/O | Wall time |
|---|---|---|---|---|
| 1 | Daily ETL builds one HLL state per (event_date)
|
~1.5 KB per row | 1 day of events | ~30 s/day |
| 2 | Dashboard reads 30 rows from daily_user_hll
|
~45 KB total | 30 rows | ~10 ms |
| 3 |
HLL_COMBINE merges 30 HLL states into one |
~1.5 KB | none | ~5 ms |
| 4 |
HLL_ESTIMATE derives the final cardinality |
~constant | none | ~1 ms |
| 5 | Dashboard renders the number | n/a | n/a | total ~30 ms |
Output:
| Metric | Raw COUNT(DISTINCT)
|
HLL rollup pattern |
|---|---|---|
| Dashboard latency | 2 min | ~30 ms |
| Warehouse credits per dashboard load | 0.27 | < 0.001 |
| Result | 80,000,000 (exact) | 80,142,000 (±1.6%) |
| Daily ETL cost | none | ~30 s × 1/day |
| Memory per query | ~12 GB | ~45 KB |
Why this works — concept by concept:
-
HyperLogLog as a sub-linear cardinality estimator —
O(log log n)memory,~1.6%standard error, regardless of input cardinality. This is the algorithmic core that makes the rollup possible. -
Mergeability —
HLL(A ∪ B) = HLL_COMBINE(HLL(A), HLL(B)). The dashboard query never re-scans the raw events; it merges 30 pre-computed states. - Amortised cost — the expensive scan happens once per day in a controlled window; the dashboard cost is essentially zero.
- Sub-second dashboard latency — reading 30 rows + merging tiny HLL states fits comfortably in any web request budget.
-
Cost — exact =
O(N)scan +O(distinct)memory +O(N)time per load; HLL rollup =O(N)scan once per day +O(days)time per dashboard load +O(1)memory.
SQL
Topic — aggregation
Approximate aggregation + dashboard rollup problems (SQL)
5. Deduplication patterns — ROW_NUMBER vs DISTINCT vs GROUP BY
sql deduplicate is a three-pattern matrix — ROW_NUMBER (general), DISTINCT ON (Postgres-only), GROUP BY + ARG_MAX (cross-dialect with helpers)
The mental model: for the question "give me one row per business key, picking a specific one when there are duplicates," the right tool depends on the dialect and on what column decides the tie. DISTINCT * is the wrong answer in production — it preserves rows that differ only by audit columns. The three patterns below cover every interview shape from "latest order per customer" to "first session per user."
The three patterns at a glance.
-
Pattern 1 —
ROW_NUMBER()over a partition. The universal answer. Works in every dialect. Pick the row wherern = 1after partitioning by the business key and ordering by the tie-breaker. -
Pattern 2 —
DISTINCT ON (key). Postgres-only, concise. Returns the first row per group given the trailingORDER BY. -
Pattern 3 —
GROUP BY key+ARG_MAX/FIRST_VALUE. The warehouse-native shape — Snowflake hasMAX_BY/MIN_BY, BigQuery hasARRAY_AGG ... LIMIT 1, Spark hasfirst(...)after anORDER BY. Concise once you know the engine's helper.
Decision tree for the interview.
-
"Cross-dialect, no guarantees on engine?" —
ROW_NUMBER(). It is in standard SQL since 2003. -
"Postgres, and I want concise code?" —
DISTINCT ON. One of Postgres's best features. -
"Snowflake / BigQuery / Spark warehouse query?" —
MAX_BY/ARG_MAX/ARRAY_AGG LIMIT 1. The engine's helper is the shortest and the planner gives it the best plan. -
"I just need the latest
tsper(key), no other columns?" —MAX(ts) GROUP BY key. Don't reach forROW_NUMBER()when a plain aggregate suffices.
The classic interview shape — "latest order per customer."
- Input:
orders(order_id, customer_id, order_ts, status, …). - Output: one row per
customer_id, the row with the largestorder_ts. - Three valid answers:
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_ts DESC) = 1DISTINCT ON (customer_id) * ORDER BY customer_id, order_ts DESC-
JOIN orders USING (customer_id, order_ts)against aGROUP BY customer_id SELECT MAX(order_ts)derived table — orMAX_BY(*, order_ts)if the engine has it.
Why DISTINCT * is the wrong answer for this shape.
-
DISTINCT *deduplicates rows that are byte-for-byte identical. If a row is updated with a newupdated_atwatermark,DISTINCT *returns both versions — defeating the purpose. - The right answer keys explicitly on the business key (
customer_id,order_id,email_lower), not on every column.
Worked example — pick the latest order per customer with ROW_NUMBER
Detailed explanation. The senior pattern is ROW_NUMBER() with a deterministic tie-breaker — usually order_ts DESC, order_id DESC so ties on order_ts resolve to the higher order_id. The CTE shape makes the intent obvious and is easy to refactor when the spec changes.
Question. Given an orders table with multiple rows per customer, return the most recent order per customer, breaking ties by order_id DESC.
Input.
| order_id | customer_id | order_ts | status |
|---|---|---|---|
| 101 | 42 | 2026-05-01 09:00 | placed |
| 102 | 42 | 2026-05-02 14:00 | placed |
| 103 | 42 | 2026-05-02 14:00 | paid |
| 104 | 99 | 2026-05-03 10:00 | paid |
| 105 | 99 | 2026-05-04 11:00 | shipped |
Code.
WITH ranked AS (
SELECT
o.*,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY order_ts DESC, order_id DESC
) AS rn
FROM orders o
)
SELECT order_id, customer_id, order_ts, status
FROM ranked
WHERE rn = 1;
Step-by-step explanation.
- The window function sorts each
customer_idpartition by(order_ts DESC, order_id DESC). - For
customer_id=42, the rows are ranked103 (rn=1), 102 (rn=2), 101 (rn=3)—103wins because itsorder_idis higher and theorder_tstied with102. - For
customer_id=99, the rows are ranked105 (rn=1), 104 (rn=2)—105wins onorder_ts. - The outer filter keeps only
rn = 1.
Output.
| order_id | customer_id | order_ts | status |
|---|---|---|---|
| 103 | 42 | 2026-05-02 14:00 | paid |
| 105 | 99 | 2026-05-04 11:00 | shipped |
Rule of thumb. Always include a deterministic tie-breaker in the ORDER BY (order_id DESC here). Without it, the engine is allowed to return either row when order_ts ties — and the choice can flip between query runs.
Worked example — the same query in Postgres with DISTINCT ON
Detailed explanation. Postgres's DISTINCT ON (key) ... ORDER BY key, tie_breaker is a near-perfect dual to the ROW_NUMBER() shape — shorter, same plan, Postgres-only.
Question. Re-write the "latest order per customer" query using DISTINCT ON (Postgres).
Input. Same as above.
Code.
SELECT DISTINCT ON (customer_id)
order_id, customer_id, order_ts, status
FROM orders
ORDER BY customer_id, order_ts DESC, order_id DESC;
Step-by-step explanation.
- Postgres sorts the input by
(customer_id, order_ts DESC, order_id DESC). -
DISTINCT ON (customer_id)keeps the first row percustomer_idafter that sort. - Because
order_ts DESC, order_id DESCis the second/third sort key, the kept row is the most recent. - The plan is
Unique → Sort → SeqScan— equivalent to theROW_NUMBER() = 1plan on the same Postgres version.
Output. Identical to the ROW_NUMBER() shape.
Rule of thumb. When you are sure the engine is Postgres, DISTINCT ON is the cleanest expression of "one row per key." Outside Postgres, fall back to ROW_NUMBER().
Worked example — the same query in Snowflake with MAX_BY
Detailed explanation. Snowflake's MAX_BY(value_col, sort_col) returns the value of value_col from the row that has the maximum sort_col. Combined with GROUP BY customer_id, this is the shortest cross-dialect-ish answer.
Question. Re-write the "latest order per customer" query in Snowflake using MAX_BY.
Input. Same as above.
Code.
SELECT
customer_id,
MAX_BY(order_id, (order_ts, order_id)) AS latest_order_id,
MAX(order_ts) AS latest_order_ts,
MAX_BY(status, (order_ts, order_id)) AS latest_status
FROM orders
GROUP BY customer_id;
Step-by-step explanation.
-
MAX_BY(order_id, (order_ts, order_id))returns theorder_idfrom the row whose(order_ts, order_id)is largest — that is, the latest order, withorder_id DESCas the tie-breaker. - The same pattern repeats for
status(and for any other column the spec needs). - The
GROUP BY customer_idaggregation runs in one pass —O(N)time,O(distinct customers)memory. - Equivalent in BigQuery:
ARRAY_AGG(STRUCT(order_id, status) ORDER BY order_ts DESC, order_id DESC LIMIT 1)[OFFSET(0)].
Output. Identical to the ROW_NUMBER() shape.
Rule of thumb. Inside a Snowflake / BigQuery query, the engine's MAX_BY / ARG_MAX is often the shortest and the most efficient — no window, no extra sort.
Worked example — keep all ties with RANK() vs DENSE_RANK() vs ROW_NUMBER()
Detailed explanation. A nuance interviewers love: ROW_NUMBER() always assigns distinct ranks, but RANK() and DENSE_RANK() produce ties when the ORDER BY keys tie. The choice depends on whether you want one row per business key or all tied rows.
Question. Given an events(user_id, score, ts) table, return:
- The single top-scoring event per
user_id(any one if there's a tie). - All top-scoring events per
user_id(including ties).
Compare ROW_NUMBER(), RANK(), and DENSE_RANK().
Input.
| user_id | score | ts |
|---|---|---|
| 1 | 90 | 09:00 |
| 1 | 90 | 09:05 |
| 1 | 85 | 09:10 |
| 2 | 75 | 09:02 |
| 2 | 75 | 09:08 |
Code.
-- A — ROW_NUMBER: one row per user_id, deterministic by adding ts as tie-breaker
SELECT user_id, score, ts
FROM (
SELECT
e.*,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY score DESC, ts ASC) AS rn
FROM events e
) ranked
WHERE rn = 1;
-- B — RANK: keep all rows where the score is the user's top score (handles ties)
SELECT user_id, score, ts
FROM (
SELECT
e.*,
RANK() OVER (PARTITION BY user_id ORDER BY score DESC) AS rk
FROM events e
) ranked
WHERE rk = 1;
Step-by-step explanation.
-
ROW_NUMBER()assigns1, 2, 3, …even when scores tie — the tie-break is the secondaryORDER BY(herets ASC). -
RANK()assigns1, 1, 3, …— bothscore=90rows for user 1 get rank 1; thescore=85row jumps to rank 3 (skipping 2). -
DENSE_RANK()assigns1, 1, 2, …— same asRANKbut without the gap. - Filter
rk = 1to keep all rows tied at the top score; filterrn = 1to keep exactly one.
Output A (ROW_NUMBER).
| user_id | score | ts |
|---|---|---|
| 1 | 90 | 09:00 |
| 2 | 75 | 09:02 |
Output B (RANK).
| user_id | score | ts |
|---|---|---|
| 1 | 90 | 09:00 |
| 1 | 90 | 09:05 |
| 2 | 75 | 09:02 |
| 2 | 75 | 09:08 |
Rule of thumb. Use ROW_NUMBER() when the spec says "the latest" / "the top" (one row). Use RANK() or DENSE_RANK() when the spec says "all rows that are tied at the top." If you reach for ROW_NUMBER() to dedup and the spec changes to "keep ties," you'll need to swap to RANK() — flag this in the comment.
SQL interview question on cross-dialect deduplication
A senior probe: "I have an orders table on Postgres in dev and Snowflake in prod. Write the 'latest order per customer' query so the same SQL runs on both dialects."
Solution Using ROW_NUMBER() — the universal pattern
WITH ranked AS (
SELECT
o.*,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY order_ts DESC, order_id DESC
) AS rn
FROM orders o
)
SELECT order_id, customer_id, order_ts, status
FROM ranked
WHERE rn = 1;
Step-by-step trace.
| Row | customer_id | order_ts | order_id | partition_rn | kept? |
|---|---|---|---|---|---|
| 1 | 42 | 2026-05-01 09:00 | 101 | 3 | no |
| 2 | 42 | 2026-05-02 14:00 | 102 | 2 | no |
| 3 | 42 | 2026-05-02 14:00 | 103 | 1 | yes |
| 4 | 99 | 2026-05-03 10:00 | 104 | 2 | no |
| 5 | 99 | 2026-05-04 11:00 | 105 | 1 | yes |
Output:
| order_id | customer_id | order_ts | status |
|---|---|---|---|
| 103 | 42 | 2026-05-02 14:00 | paid |
| 105 | 99 | 2026-05-04 11:00 | shipped |
Why this works — concept by concept:
- ROW_NUMBER() as the universal dedup primitive — every major dialect (Postgres, MySQL 8+, Snowflake, BigQuery, Spark, Oracle, SQL Server) supports the same standard-SQL form. One query, every engine.
-
Deterministic tie-breaker — including
order_id DESCin theORDER BYremoves non-determinism whenorder_tsties. Without it, the engine is allowed to pick either row. -
Cross-dialect filter —
WHERE rn = 1is a plain comparison, not engine-specific syntax. It survives a port from Postgres to Snowflake unchanged. -
Plan parity — Postgres's
ROW_NUMBER()and Snowflake's window operator pick effectively the same plan: a sort followed by aUnique-style filter. Memory isO(partition size). -
Cost —
O(N log N)for the sort,O(N)for the window numbering,O(N)for the filter. The sameO(N log N)cost theDISTINCT ONandMAX_BYshapes pay too.
SQL
Topic — window-functions
Window-function deduplication problems (ROW_NUMBER, RANK)
SQL interview question on dedup of CDC change events
A senior probe heavy on data-engineering nuance: "You have a cdc_events(pk, op, after_image, lsn) table where op ∈ {INSERT, UPDATE, DELETE} and lsn is the log sequence number. Each pk may have many rows; you want the current state of each pk for a downstream warehouse load. How do you deduplicate?"
Solution Using ROW_NUMBER() with LSN tie-breaker and DELETE filter
WITH latest AS (
SELECT
c.*,
ROW_NUMBER() OVER (
PARTITION BY pk
ORDER BY lsn DESC
) AS rn
FROM cdc_events c
)
SELECT pk, after_image
FROM latest
WHERE rn = 1
AND op <> 'DELETE'; -- skip tombstoned rows
Step-by-step trace.
| pk | op | lsn | rn | kept? |
|---|---|---|---|---|
| 42 | INSERT | 100 | 3 | no |
| 42 | UPDATE | 110 | 2 | no |
| 42 | UPDATE | 130 | 1 | yes — latest non-DELETE |
| 99 | INSERT | 120 | 2 | no |
| 99 | DELETE | 140 | 1 | dropped — tombstoned |
| 7 | INSERT | 105 | 1 | yes |
Output:
| pk | after_image |
|---|---|
| 42 | (latest UPDATE payload at lsn=130) |
| 7 | (INSERT payload at lsn=105) |
Why this works — concept by concept:
-
LSN as the canonical tie-breaker —
lsnis monotonically increasing per replication slot, so it is a perfect deterministic order for "latest change wins." -
rn = 1 filter — for each
pk, the row with the highestlsnsurvives; older operations are dropped. -
DELETE handling — without the
op <> 'DELETE'filter, tombstoned rows would re-appear as the current state. Filtering them after the ranking is what turns the change log into a current-state materialisation. -
Idempotent re-runs — the query depends only on
(pk, lsn, op, after_image)— replaying the same CDC stream produces the same output. Safe to re-run after any failure. -
Cost —
O(N log N)for the sort inside the window;O(distinct pks)memory for the window state in the executor. Comparable to aDISTINCT ON (pk) ORDER BY pk, lsn DESCshape on Postgres.
SQL
Topic — distinct
CDC-style change-log dedup problems (SQL)
Cheat sheet — DISTINCT recipes
-
Unique values of one column.
SELECT DISTINCT col FROM t— interchangeable withSELECT col FROM t GROUP BY col. Same plan in Postgres. -
Unique tuples of N columns.
SELECT a, b, c FROM t GROUP BY a, b, c— preferGROUP BYso an aggregate can be folded in without rewriting. -
Exact unique count.
SELECT COUNT(DISTINCT col) FROM t— paysO(distinct cardinality)memory; tunework_memor pre-aggregate. -
Approximate unique count (single query).
SELECT APPROX_COUNT_DISTINCT(col) FROM t— Snowflake / BigQuery / Spark;~1.6%error,~1.5 KBmemory. -
Approximate unique count (mergeable rollup).
HLL(col)per partition +HLL_COMBINEat read time — sub-second dashboard latency on 10B-row tables. -
Latest row per business key, any dialect.
ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts DESC, tie DESC) = 1. Always include a deterministic tie-breaker. -
Latest row per business key, Postgres only.
SELECT DISTINCT ON (id) * FROM t ORDER BY id, ts DESC, tie DESC. Shortest expression of "one row per key." -
Latest row per business key, warehouse-native.
MAX_BY(col, (ts, tie))(Snowflake),ARRAY_AGG(STRUCT(col) ORDER BY ts DESC, tie DESC LIMIT 1)[OFFSET(0)](BigQuery). -
Avoid
DISTINCT *for dedup. It preserves rows that differ only by an audit column. Always key on the business key explicitly. -
Multiple
COUNT(DISTINCT)in one query. Either threeAPPROX_COUNT_DISTINCTcalls (cheap, ±1.6%) orGROUPING SETS/ pre-aggregated rollup (exact, cheap-amortised).
Frequently asked questions
Is SELECT DISTINCT faster or slower than GROUP BY in SQL?
For the common case — "give me the unique values of one or more columns" — SELECT DISTINCT a, b and SELECT a, b FROM t GROUP BY a, b produce the same plan in every modern engine (Postgres, MySQL 8+, Snowflake, BigQuery), so they run at the same speed. GROUP BY pulls ahead the moment an aggregate enters the picture, because DISTINCT cannot aggregate inline — the equivalent DISTINCT shape needs a self-join to recompute the aggregate. The rule of thumb is: use DISTINCT for read-only uniqueness, use GROUP BY everywhere else, and always read EXPLAIN if you're unsure.
Why is COUNT(DISTINCT col) so slow on big tables in SQL?
COUNT(DISTINCT col) must materialise the entire unique set of col values in memory, which costs O(distinct cardinality) bytes. On a 1B-row table with 100M distinct values that's 2–3 GB of hash table — well above the default work_mem in Postgres or the per-node spill threshold in Snowflake, so the operator spills to disk and the wall time triples. An index on col does not help, because the engine still has to touch every row to know whether each value is "new" or "already seen." The fix is to raise work_mem for the session, pre-aggregate the unique set in a daily rollup, or switch to APPROX_COUNT_DISTINCT (HyperLogLog).
How accurate is HyperLogLog and APPROX_COUNT_DISTINCT in practice?
HyperLogLog (HLL) and its successor HLL++ deliver a standard error of ~1.04 / √m where m is the number of registers — m=16384 (default in most engines) gives ~0.81% standard error and a ~1.6% 95th-percentile error. On real datasets the error is usually below 1% as long as the cardinality is above a few thousand. Snowflake, BigQuery, and Spark all use HLL++ under the hood for APPROX_COUNT_DISTINCT, so they share these accuracy guarantees. The trade-off is unbeatable: ~1.5 KB of memory for a cardinality estimate that holds for sets ranging from 10K to 10B distinct values.
Can I do COUNT(DISTINCT col1, col2) across multiple columns in SQL?
Yes in Postgres, Snowflake, BigQuery, and Spark — COUNT(DISTINCT col1, col2) treats the pair as a tuple and counts the distinct tuples. In MySQL it works too but NULL handling differs from the tuple convention; double-check on your version. In SQL Server you must concatenate manually: COUNT(DISTINCT CONCAT(col1, '|', col2)) with a separator chosen so it cannot occur inside either column. The cost is the same as COUNT(DISTINCT) of one column with cardinality distinct(col1) × co-occurrence factor.
Does an index on a column help SELECT DISTINCT col FROM t?
Sometimes. A B-tree index on col lets the engine perform an index-only scan in sorted order — Postgres can then walk the index leaf pages and emit unique values via a Unique operator without a separate sort, which is much faster than the table-scan + hash path on wide tables. The catch is that this only works when the index covers the query (no other columns needed) and when the visibility map is up to date (Postgres-specific — VACUUM recently). For SELECT DISTINCT col1, col2, an index on (col1, col2) can give the same benefit. For COUNT(DISTINCT col), the index helps the scan but not the hash, because the engine still must visit every leaf entry.
What is DISTINCT ON in PostgreSQL and when should I use it?
SELECT DISTINCT ON (key) col1, col2, … FROM t ORDER BY key, tie_breaker DESC is a Postgres-only extension that keeps the first row per group given the trailing ORDER BY. It is the cleanest expression of "one row per key" in Postgres — equivalent to ROW_NUMBER() OVER (PARTITION BY key ORDER BY tie_breaker DESC) = 1 but with one less CTE and one less filter. Use it for "latest order per customer," "first session per user," "most recent snapshot per asset," and similar patterns when you know the engine is Postgres. Outside Postgres, fall back to ROW_NUMBER() — the standard-SQL universal pattern.
How do I handle NULLs in SELECT DISTINCT and COUNT(DISTINCT)?
SELECT DISTINCT col treats NULL as a single distinct value — all NULL rows collapse into one row in the output. COUNT(DISTINCT col), in standard SQL, excludes NULL from the count entirely — only non-NULL distinct values are counted. For multi-column forms, COUNT(DISTINCT a, b) skips any tuple where any column is NULL. If you need NULL to count as a distinct value, project it to a sentinel: COUNT(DISTINCT COALESCE(CAST(col AS TEXT), '<NULL>')) works in any dialect.
Should I use SELECT DISTINCT to deduplicate after a JOIN that explodes rows?
Almost never. A SELECT DISTINCT * after a JOIN papers over a join-cardinality bug — usually a many-to-many join that should have been many-to-one. Instead, find the join key that exploded and rewrite the join with a GROUP BY or a CTE that pre-aggregates one side, so the join is one-to-many before it enters the outer query. The DISTINCT "fix" hides the bug, doubles the memory cost, and often masks a row-count mismatch that breaks downstream metrics later.
Practice on PipeCode
- Drill the DISTINCT + deduplication practice library → for hands-on
SELECT DISTINCT,COUNT(DISTINCT), andDISTINCT ONreps. - Rehearse aggregation problems → for the cost discussion around
COUNT(DISTINCT)andGROUP BY. - Sharpen SQL aggregation drills → for warehouse-native helpers like
MAX_BY,ARG_MAX, andARRAY_AGG. - Work window-function dedup problems → for the universal
ROW_NUMBER()pattern. - For the broader interview surface, read top data engineering interview questions →.
- Pair this with the only 5 skills you need to become a data engineer → for the breadth-first prep map.
- Take the structured deep dive in SQL for DE interviews: from zero to FAANG → — the full course covers every primitive on this page with graded problems.
- Reinforce the modelling side with data modelling for DE interviews → so the next dedup question lands on a well-keyed table.
Pipecode.ai is Leetcode for Data Engineering — every DISTINCT and COUNT(DISTINCT) primitive above ships with hands-on practice rooms where you write the queries, read the real EXPLAIN output, and watch the work_mem spill happen on a real engine. Start with the distinct + dedup library and work outward; PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine.





Top comments (0)