slowly changing dimensions interview questions are the dimensional-modelling round every senior data engineering loop opens with after warehouse design. Interviewers don't stop at "what's a dim table?" — they probe whether you understand scd type 2 as the production default, the scd type 1 vs type 2 vs type 3 trade-offs, the surrogate key as the dim's join key, and the late-arriving / retroactive-delete gotchas that fail most candidates.
This guide walks through the seven SCD primitives that show up most often in data engineering interview questions at FAANG and warehouse-heavy shops (Snowflake, Databricks, Netflix, Airbnb, Stripe). Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works. By the end you'll be able to design a Type 2 dim with MERGE INTO, defend dbt snapshot timestamp vs check strategy, and walk through the surrogate-key collision that breaks every junior's first SCD implementation — the exact shape dimensional modelling interview rounds reward when fact dim join and type 6 hybrid scd come up.
When you want hands-on reps immediately after reading, drill the dimensional modelling library →, browse slowly-changing-data practice →, and rehearse join patterns → plus window-function drills → on the effective-date side.
On this page
- Why SCD shows up in every senior data engineering interview
- SCD Type 1 — overwrite in place, no history
- SCD Type 2 — add row with valid_from, valid_to, is_current
- SCD Type 3 — add column for previous value
- SCD Type 6 — hybrid 1+2+3 for the messy real world
- SCD with dbt snapshots — timestamp vs check strategy
- SCD interview gotchas — late updates, retroactive deletes, surrogate-key collisions
- Choosing the right SCD type (cheat sheet)
- Frequently asked questions
- Practice on PipeCode
1. Why SCD shows up in every senior data engineering interview
SCD is how dimensional warehouses preserve "what was true when" — the core of every analytics question
The one-sentence invariant: a slowly changing dimension is a dim table where the attributes of a business entity (customer, product, employee) change over time, and the warehouse must decide whether to preserve, partially preserve, or overwrite the history. Once you internalise that — history preservation is a modelling choice, not a feature — every slowly changing dimensions interview questions prompt resolves to "Type 1, 2, 3, or 6, and why."
Why interviewers love SCD questions.
- It surfaces dimensional-modelling instincts. Does the candidate know the difference between a dimension and a fact?
- It tests historical reasoning. Can the answer reconstruct "what was the customer's segment when this order was placed in 2024?"
-
It probes SQL fluency.
MERGE INTOwith multiple WHEN branches, effective-date predicates, surrogate keys. - It hits operational realism. Late-arriving updates, retroactive deletes, sk collisions — every team has been bitten at least once.
The four key concepts every SCD interview opens with.
-
Business key vs surrogate key. Business key = the natural id (e.g.
customer_id). Surrogate key = a synthetic id (sk) that uniquely identifies a version of the row. - History preservation. SCD type decides whether old values are kept (Type 2, 3, 6) or overwritten (Type 1).
-
Effective dates.
valid_from+valid_tocolumns;valid_to=NULL(or far-future date) marks the current row. -
is_current flag. A redundant-but-useful boolean for fast
WHERE is_current = truelookups.
What interviewers listen for.
- Do you reach for Type 2 when "preserve history" is mentioned? — required answer.
- Do you separate business key from surrogate key? — senior signal.
- Do you mention effective-date predicates when joining fact to dim? — senior-architectural signal.
- Do you bring up
MERGE INTOas the standard SCD-2 update pattern? — required answer.
Worked example — pick the SCD type for a customer dimension
Detailed explanation. A common opener: "We have a customers table with customer_id, name, email, city, segment, lifetime_value. The warehouse needs to answer 'what was the customer's segment when they placed an order?' What SCD type and why?"
Question. Justify the SCD type per column.
Input (column-by-column).
| Column | Changes? | Need history? | SCD type |
|---|---|---|---|
| customer_id | never | n/a (business key) | n/a |
| name | rarely | no | Type 1 |
| sometimes | no | Type 1 | |
| city | sometimes | yes (location at order time) | Type 2 |
| segment | yes | yes (segment at order time) | Type 2 |
| lifetime_value | continuously | no — just current | Type 1 |
Code (resulting dim_customer schema).
CREATE TABLE dim_customer (
sk BIGSERIAL PRIMARY KEY, -- surrogate key
customer_id BIGINT NOT NULL, -- business key
name VARCHAR(200) NOT NULL, -- Type 1 (overwrite)
email VARCHAR(200) NOT NULL, -- Type 1
city VARCHAR(100) NOT NULL, -- Type 2 (history)
segment VARCHAR(50) NOT NULL, -- Type 2
lifetime_value NUMERIC(12,2), -- Type 1
valid_from DATE NOT NULL,
valid_to DATE, -- NULL = current
is_current BOOLEAN NOT NULL DEFAULT TRUE
);
CREATE UNIQUE INDEX uq_dim_customer_current
ON dim_customer (customer_id) WHERE is_current = true;
Step-by-step explanation.
-
Business key =
customer_id. It identifies the entity. -
Surrogate key =
sk. Auto-generated; unique per row version. - Type 1 columns (name, email, lifetime_value) — overwritten in place on every change.
-
Type 2 columns (city, segment) — when these change, the current row's
valid_tois set to today and a new row is inserted withvalid_from = today, is_current = true. -
Partial-index uniqueness — at most one current row per
customer_id.
Output (joining fact_orders to dim_customer with effective-date predicate).
SELECT f.order_id, f.amount, c.segment AS segment_at_order_time
FROM fact_orders f
JOIN dim_customer c
ON c.customer_id = f.customer_id
AND f.order_date >= c.valid_from
AND (f.order_date < c.valid_to OR c.valid_to IS NULL);
Rule of thumb. Type 2 for any attribute that drives reporting "as of"; Type 1 for everything else. When in doubt, ask the interviewer "do we need to reconstruct the past value?"
SCD interview question on column-by-column SCD typing
A senior probe: "Walk through each column and pick its SCD type." This tests whether the candidate can reason at the attribute level, not just the table level.
Solution Using hybrid per-column typing within one dim
-- dim_customer combining Type 1 (Volatile attributes overwritten)
-- and Type 2 (historical attributes versioned)
-- Single row per (customer_id, valid_from); is_current=true is the latest.
CREATE TABLE dim_customer (
sk BIGSERIAL PRIMARY KEY,
customer_id BIGINT NOT NULL,
name VARCHAR(200) NOT NULL, -- T1
email VARCHAR(200) NOT NULL, -- T1
city VARCHAR(100) NOT NULL, -- T2
segment VARCHAR(50) NOT NULL, -- T2
lifetime_value NUMERIC(12,2), -- T1
valid_from DATE NOT NULL,
valid_to DATE,
is_current BOOLEAN NOT NULL DEFAULT TRUE
);
Step-by-step trace.
| Change | Action |
|---|---|
| email changed | UPDATE all rows for customer_id SET email = new_email |
| name changed | UPDATE all rows SET name = new_name |
| city changed | close current row (set valid_to = today, is_current = false), INSERT new row |
| segment changed | close current row, INSERT new row |
| lifetime_value changed | UPDATE all rows SET lifetime_value = new_value |
Output:
| Behaviour | Per-column SCD | Pure SCD2 (every col) | Pure SCD1 (no history) |
|---|---|---|---|
| Storage | bounded by T2-column change rate | unbounded by every-column changes | minimal |
| Historical accuracy | T2 columns preserved | every column preserved | none |
| Read complexity | medium (effective-date join) | medium | trivial |
Why this works — concept by concept:
-
Business key vs surrogate key —
customer_idis what humans recognise;skis what facts join on. Decoupling them is the foundation of SCD2. - Type 1 columns update all rows — when email changes, every historical row's email becomes the new value (so historical reports still show current email but historical city / segment).
- Type 2 columns version the row — when city changes, a new row is born; the old row is closed.
- Hybrid per-column typing — the most common production pattern. Type 6 (next sections) formalises this.
- Cost — storage = O(rows × Type-2-column-change-rate); read cost = one join with effective-date predicate.
Dimensional modelling
Topic — dimensional-modeling
Dimensional modelling problems
2. SCD Type 1 — overwrite in place, no history
scd type 1 is "UPDATE the row in place" — the cheapest, most-common, history-free strategy
The mental model: Type 1 is for attributes where you only ever care about the current value — name, email, address — and where reconstructing the past would mislead the reader of any report. There's exactly one row per business key. UPDATE in place.
When to use Type 1.
- The attribute is descriptive, not analytical (display name, contact info).
- Historical reports would be misleading or unnecessary.
- Storage / cost matters more than audit trail.
When NOT to use Type 1.
- The attribute drives slicing in reports ("revenue by region" → region must be Type 2).
- Compliance requires an audit trail.
- The attribute changes in step with billing or contractual terms.
The classic Type 1 trap.
A team Type-1's the customer's
countrycolumn. Six months later, finance asks "revenue by country last year" — every historical order now shows the customer's current country. The dim was destroying its own history.
Worked example — Type 1 update via MERGE
Detailed explanation. Even Type 1 benefits from MERGE INTO over plain UPDATE: it handles INSERT (new customer) and UPDATE (existing customer) in one statement.
Question. Write the MERGE that Type-1-updates dim_customer from a daily stg_customer extract.
Input (stg_customer).
| customer_id | name | city | |
|---|---|---|---|
| 42 | Ada Lovelace | ada@example.com | Seattle |
| 43 | Alan Turing | alan@example.com | London |
Existing dim_customer.
| sk | customer_id | name | city | |
|---|---|---|---|---|
| 101 | 42 | Ada Lovelace | ada@old.com | Austin |
| (no row) | 43 |
Code.
MERGE INTO dim_customer t
USING stg_customer s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN
UPDATE SET name = s.name, email = s.email, city = s.city
WHEN NOT MATCHED THEN
INSERT (customer_id, name, email, city)
VALUES (s.customer_id, s.name, s.email, s.city);
Step-by-step explanation.
- The MERGE walks every row in
stg_customer. - For
customer_id=42— matched → UPDATE — email goes fromada@old.comtoada@example.com, city fromAustintoSeattle. - For
customer_id=43— not matched → INSERT new row. - Surrogate key
sk=101is preserved on the update. - End state has one row per
customer_id; the prioremailandcityvalues are gone.
Output.
| sk | customer_id | name | city | |
|---|---|---|---|---|
| 101 | 42 | Ada Lovelace | ada@example.com | Seattle |
| 102 | 43 | Alan Turing | alan@example.com | London |
Rule of thumb. Always use MERGE for SCD operations, even Type 1 — it handles both UPDATE and INSERT in one statement and is replay-safe.
SCD interview question on Type 1 idempotency
The probe: "If we re-run the same MERGE twice, what happens?"
Solution Using deterministic MERGE that's idempotent
-- Same MERGE as above
MERGE INTO dim_customer t
USING stg_customer s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET name = s.name, email = s.email, city = s.city
WHEN NOT MATCHED THEN INSERT VALUES (s.customer_id, s.name, s.email, s.city);
Step-by-step trace.
| Run | Source row | Action | End state |
|---|---|---|---|
| 1 | (42, Ada, ada@example.com, Seattle) | MATCHED → UPDATE | row 101 reflects source |
| 2 (same source) | (42, Ada, ada@example.com, Seattle) | MATCHED → UPDATE (no-op effect) | row 101 unchanged |
Output:
| Property | Value |
|---|---|
| Idempotent? | yes — re-running produces the same end state |
| Side effects? | minor — UPDATE timestamp on metadata layer changes |
| Recommended? | yes — keeps the load script idempotent |
Why this works — concept by concept:
- MERGE is deterministic on the same input — second run finds the same matched/unmatched partition and applies the same updates.
- Type 1 has no row-creation on UPDATE — re-running doesn't multiply rows.
- sk preserved across re-runs — the fact tables that reference this sk are unaffected.
- Idempotency = replay safety — Airflow / dbt / cron can re-run without polluting state.
- Cost — O(stg rows) compute, O(stg rows) UPDATE writes. Trivial at most scales.
Dimensional modelling
Topic — joins
SCD Type 1 + MERGE pattern problems
3. SCD Type 2 — add row with valid_from, valid_to, is_current
scd type 2 is the production default — close the old row, insert the new row, preserve all history
The mental model: Type 2 preserves the full history of every change as separate rows; each row has a valid_from and valid_to (or is_current) marking its lifetime; facts join on the surrogate key based on event date.
The three columns every SCD-2 dim has.
-
valid_from(DATE / TIMESTAMP) — when this version became active. -
valid_to(DATE / TIMESTAMP, NULL or far-future for current) — when this version stopped being active. -
is_current(BOOLEAN) — TRUE for the current row, FALSE for all others. Redundant but useful for fast lookups.
The MERGE that drives Type 2.
- WHEN MATCHED AND
is_currentAND any-tracked-column changed → UPDATE to close the row (valid_to = today, is_current = false). - WHEN NOT MATCHED (by
customer_id) → INSERT a new current row. - For an UPDATE-then-INSERT pattern that needs both branches, use two passes or
MERGE INTOwith multiple statements.
Fact-to-dim joins.
- Always join on the surrogate key with an effective-date predicate.
-
fact.event_date BETWEEN dim.valid_from AND COALESCE(dim.valid_to, '9999-12-31').
Worked example — Type 2 MERGE for a customer city change
Detailed explanation. Customer 42's city changes from Austin to Seattle on 2026-05-01. Walk through how the dim table transforms.
Question. Show the staging table, the MERGE, and the final dim state.
Input (stg_customer).
| customer_id | city | snapshot_date |
|---|---|---|
| 42 | Seattle | 2026-05-01 |
Existing dim_customer.
| sk | customer_id | city | valid_from | valid_to | is_current |
|---|---|---|---|---|---|
| 101 | 42 | Austin | 2024-01-01 | NULL | true |
Code (PostgreSQL).
BEGIN;
-- Step 1: close the current row when city changed
UPDATE dim_customer
SET valid_to = s.snapshot_date - INTERVAL '1 day',
is_current = false
FROM stg_customer s
WHERE dim_customer.customer_id = s.customer_id
AND dim_customer.is_current = true
AND dim_customer.city <> s.city;
-- Step 2: insert the new current row
INSERT INTO dim_customer (customer_id, city, valid_from, valid_to, is_current)
SELECT s.customer_id, s.city, s.snapshot_date, NULL, true
FROM stg_customer s
WHERE NOT EXISTS (
SELECT 1 FROM dim_customer d
WHERE d.customer_id = s.customer_id
AND d.is_current = true
);
COMMIT;
Step-by-step explanation.
-
Step 1 UPDATE — for every customer whose current row has a different city than the staging row, close the current row by setting
valid_to = snapshot_date - 1 dayandis_current = false. -
Step 2 INSERT — for every customer who no longer has a current row (because Step 1 just closed it) OR who didn't have one at all, insert a new row with
valid_from = snapshot_date, valid_to = NULL, is_current = true. - The two-step pattern ensures that the dim never has two current rows for the same customer.
- Wrap both in a transaction so partial failure can't leave the dim in an inconsistent state.
Output (after MERGE).
| sk | customer_id | city | valid_from | valid_to | is_current |
|---|---|---|---|---|---|
| 101 | 42 | Austin | 2024-01-01 | 2026-04-30 | false |
| 102 | 42 | Seattle | 2026-05-01 | NULL | true |
Rule of thumb. is_current is a redundant convenience — always derivable from valid_to IS NULL, but worth keeping because the partial-index uniqueness constraint depends on it.
SCD interview question on the fact-to-dim join
The probe: "How do I get the customer's city at the time of every order?"
Solution Using surrogate-key + effective-date predicate
SELECT
f.order_id,
f.order_date,
c.customer_id,
c.city AS city_at_order_time
FROM fact_orders f
JOIN dim_customer c
ON c.customer_id = f.customer_id
AND f.order_date >= c.valid_from
AND (f.order_date < c.valid_to OR c.valid_to IS NULL);
Step-by-step trace.
| order_id | order_date | dim_customer match | city_at_order_time |
|---|---|---|---|
| 7001 | 2024-06-15 | sk=101 (valid 2024-01-01..2026-04-30) | Austin |
| 7002 | 2026-04-20 | sk=101 (still in range) | Austin |
| 7003 | 2026-05-15 | sk=102 (valid 2026-05-01..NULL) | Seattle |
Output:
| order_id | order_date | city_at_order_time |
|---|---|---|
| 7001 | 2024-06-15 | Austin |
| 7002 | 2026-04-20 | Austin |
| 7003 | 2026-05-15 | Seattle |
Why this works — concept by concept:
- Effective-date predicate — the BETWEEN-style predicate ensures each fact row matches exactly one dim row.
-
Surrogate key in fact (alternative)__ — many warehouses store the dim's
skdirectly in the fact, so the join becomesfact.customer_sk = dim.skwith no date predicate. Faster join but the fact must look up the right sk at load time. -
COALESCE on valid_to — required because NULL doesn't behave normally in range predicates; using
9999-12-31as a sentinel is the alternative. -
Partial-index uniqueness on
(customer_id) WHERE is_current = true— guarantees exactly one current row per business key. -
Cost — one join with effective-date predicate; indexed on
(customer_id, valid_from)for performance.
Dimensional modelling
Topic — slowly-changing-data
SCD Type 2 + effective-date join problems
4. SCD Type 3 — add column for previous value
scd type 3 is "I want to know the previous value, but not every value" — limited history, no row explosion
The mental model: Type 3 stores N previous values as extra columns on a single row per business key. Most common form is "current" + "previous" — two columns per tracked attribute. The dim never grows beyond one row per business key.
When to use Type 3.
- You only need "before" and "after" comparisons.
- The attribute changes rarely (a few times per row's lifetime).
- Storage is a serious concern.
- Reports always show "current vs previous" patterns.
When NOT to use Type 3.
- You need more than the last N values.
- The attribute changes more than once per reporting window.
- You need point-in-time reconstruction.
The typical schema.
CREATE TABLE dim_customer_t3 (
customer_id BIGINT PRIMARY KEY,
name VARCHAR(200) NOT NULL,
city VARCHAR(100) NOT NULL, -- current
prev_city VARCHAR(100), -- previous
city_changed_at DATE, -- when did it change?
segment VARCHAR(50) NOT NULL,
prev_segment VARCHAR(50),
segment_changed_at DATE
);
Worked example — Type 3 update for a customer city change
Detailed explanation. When city changes, shift the current to prev_city, set the new value, and record the change date.
Question. Write the MERGE for Type 3.
Code.
MERGE INTO dim_customer_t3 t
USING stg_customer s
ON t.customer_id = s.customer_id
WHEN MATCHED AND t.city <> s.city THEN
UPDATE SET prev_city = t.city,
city = s.city,
city_changed_at = s.snapshot_date
WHEN NOT MATCHED THEN
INSERT (customer_id, name, city, prev_city, city_changed_at, segment)
VALUES (s.customer_id, s.name, s.city, NULL, NULL, s.segment);
Step-by-step explanation.
- For matched rows where
citychanged → shift current toprev_city, set new value, record changed-date. - For matched rows where
cityis unchanged → no UPDATE (MATCHED AND clause filters out unchanged rows). - For unmatched → INSERT new row with
prev_city = NULL(no history yet). - Re-running with the same input is idempotent — the MATCHED-AND condition is false on the second pass.
Output.
Before:
| customer_id | name | city | prev_city | city_changed_at |
|---|---|---|---|---|
| 42 | Ada | Austin | NULL | NULL |
Source: (42, Ada, Seattle, 2026-05-01).
After:
| customer_id | name | city | prev_city | city_changed_at |
|---|---|---|---|---|
| 42 | Ada | Seattle | Austin | 2026-05-01 |
After another change to "Boston" on 2026-06-01:
| customer_id | name | city | prev_city | city_changed_at |
|---|---|---|---|---|
| 42 | Ada | Boston | Seattle | 2026-06-01 |
— note: "Austin" is gone forever after the second change.
Rule of thumb. Type 3 trades unbounded history for a single-row schema. Use only when you genuinely only need "current vs previous."
SCD interview question on Type 3 vs Type 2 trade-off
The probe: "When would you pick Type 3 over Type 2?"
Solution Using a trade-off decision table
| Concern | Type 2 | Type 3 |
|---|---|---|
| Number of historical values kept | unbounded | N (usually 1) |
| Row count per business key | grows with changes | always 1 |
| Storage | larger | minimal |
| Fact-to-dim join complexity | effective-date predicate | trivial — no date join |
| Point-in-time reporting | yes | no |
| "Current vs previous" reporting | yes (filter is_current) | yes (single row) |
Step-by-step trace.
| Question | Type 2 | Type 3 |
|---|---|---|
| "What was the city when order 7001 was placed?" | yes (effective-date join) | no — only current + previous available |
| "What's the previous city?" | yes (filter by valid_to ordering) | yes (read prev_city column) |
| "How many times has this customer moved?" | count of rows | no (only the most recent move) |
| "Latest city" | filter is_current = true | read city column |
Output:
| Question | Answer |
|---|---|
| Need full history? | Type 2 |
| Need only "before/after" comparison? | Type 3 |
| Need a single-row dim for cost reasons? | Type 3 |
| Need reporting "as of date"? | Type 2 |
Why this works — concept by concept:
- Bounded vs unbounded history — Type 3 commits to a constant-size schema; Type 2 lets the table grow with changes.
- Join complexity — Type 3 joins are trivial (no date predicate); Type 2 requires effective-date predicates.
- Storage / cost — Type 3 is cheap; Type 2 grows with change-rate.
- Reporting power — Type 2 wins on any "as of" question; Type 3 wins on "current vs previous" cards.
- Cost — choice cost is the application's specific reporting needs; the technical cost is the row-count difference at scale.
Dimensional modelling
Topic — dimensional-modeling
SCD type-selection problems
5. SCD Type 6 — hybrid 1+2+3 for the messy real world
scd type 6 combines Type 1 (current), Type 2 (history rows), and Type 3 (previous column) — the production reality
The mental model: Type 6 is "all of the above" — multiple rows per business key (Type 2), each with current_* columns updated on every change (Type 1), and optional prev_* columns for the last-known-prior value (Type 3). Sounds messy because it is — but it's how real dims look.
Type 6 columns.
-
Type 2 history columns:
city,valid_from,valid_to,is_current. -
Type 1 "current" columns:
current_city— set to the same value on EVERY row of this business key. Lets you do "show me the current city for every historical row" without a sub-query. -
Type 3 "previous" columns (optional):
prev_city— the value just before the row'svalid_from.
Why Type 6 exists.
- Reporting often asks two questions at once: "what was the city at order time?" AND "what is the current city?" — Type 6 answers both without joins.
- "Show me orders by current segment" without a sub-query — current_segment is duplicated on every historical row.
Type 6 trade-offs.
- Storage: bigger than Type 2 (more columns); bigger than Type 1 (more rows).
- Update cost: every row of the business key must be updated when the Type 1 column changes.
- Read flexibility: highest of any SCD type.
Worked example — Type 6 dim with current_city kept in sync
Detailed explanation. Walk through inserting two history rows for customer 42 and updating current_city on both when the city changes.
Question. Show the dim with two history rows and a city change, both before and after.
Input. Customer 42 was in Austin (2024-01-01 → 2026-04-30), then moved to Seattle (2026-05-01).
Schema.
CREATE TABLE dim_customer_t6 (
sk BIGSERIAL PRIMARY KEY,
customer_id BIGINT NOT NULL,
city VARCHAR(100) NOT NULL, -- Type 2 historical
current_city VARCHAR(100) NOT NULL, -- Type 1 (same on all rows)
valid_from DATE NOT NULL,
valid_to DATE,
is_current BOOLEAN NOT NULL
);
Code (the Type 6 update on a city change).
BEGIN;
-- Step 1: close the current row (Type 2 part)
UPDATE dim_customer_t6
SET valid_to = '2026-04-30', is_current = false
WHERE customer_id = 42 AND is_current = true;
-- Step 2: insert the new current row (Type 2 part)
INSERT INTO dim_customer_t6 (customer_id, city, current_city, valid_from, valid_to, is_current)
VALUES (42, 'Seattle', 'Seattle', '2026-05-01', NULL, true);
-- Step 3: Type 1 update — set current_city on ALL rows for this customer
UPDATE dim_customer_t6
SET current_city = 'Seattle'
WHERE customer_id = 42;
COMMIT;
Step-by-step explanation.
- Step 1 — Type 2 — close the existing current row.
- Step 2 — Type 2 — insert the new current row.
-
Step 3 — Type 1 — update
current_cityon every row of this business key (including the historical one) so a join showing "current city" doesn't need a sub-query. - End state has two history rows and identical
current_cityvalues on both.
Output (after the change).
| sk | customer_id | city | current_city | valid_from | valid_to | is_current |
|---|---|---|---|---|---|---|
| 101 | 42 | Austin | Seattle | 2024-01-01 | 2026-04-30 | false |
| 102 | 42 | Seattle | Seattle | 2026-05-01 | NULL | true |
Rule of thumb. current_city (Type 1) is denormalised redundancy that buys read simplicity. Use Type 6 when both "as of" and "current" reports are common.
SCD interview question on Type 6 fact-to-dim joins
The probe: "With Type 6, can I report both 'orders by historical city' and 'orders by current city' in one query?"
Solution Using both city and current_city from the same row
SELECT
c.customer_id,
c.city AS city_at_order_time, -- Type 2
c.current_city AS city_now, -- Type 1
COUNT(*) AS orders
FROM fact_orders f
JOIN dim_customer_t6 c
ON c.customer_id = f.customer_id
AND f.order_date >= c.valid_from
AND (f.order_date < c.valid_to OR c.valid_to IS NULL)
GROUP BY 1, 2, 3
ORDER BY 1;
Step-by-step trace.
| order | order_date | matched dim row | city (T2) | current_city (T1) |
|---|---|---|---|---|
| 7001 | 2024-06-15 | sk=101 (Austin) | Austin | Seattle |
| 7002 | 2026-05-15 | sk=102 (Seattle) | Seattle | Seattle |
Output:
| customer_id | city_at_order_time | city_now | orders |
|---|---|---|---|
| 42 | Austin | Seattle | 1 |
| 42 | Seattle | Seattle | 1 |
Why this works — concept by concept:
- Two attributes from one row — Type 6 lets one effective-date join return both the historical and the current values.
-
No sub-query for "current city" —
current_cityis denormalised on every history row. - Update cost — every Type 1 update touches every history row of the business key (potentially expensive on hot keys).
- Storage — extra column on every row; usually a small price.
- Cost — read query = O(matching dim rows); write update = O(history rows per business key).
Dimensional modelling
Topic — joins
Type 6 hybrid dim modelling problems
6. SCD with dbt snapshots — timestamp vs check strategy
dbt snapshot automates SCD Type 2 — pick timestamp or check strategy
dbt snapshots are the canonical way to materialise SCD Type 2 in the modern data stack. dbt handles the surrogate key, dbt_valid_from, dbt_valid_to, and detects changes automatically.
dbt snapshot file shape.
-- snapshots/customers_snapshot.sql
{% snapshot customers_snapshot %}
{{
config(
target_schema='snapshots',
unique_key='customer_id',
strategy='timestamp', -- or 'check'
updated_at='updated_at', -- timestamp strategy
-- check_cols=['city','segment'], -- check strategy
)
}}
SELECT * FROM {{ source('app', 'customers') }}
{% endsnapshot %}
The two strategies.
-
timestamp— relies on a monotonically-updatedupdated_atcolumn in the source. Fast: dbt comparesupdated_atagainst the snapshot'sdbt_valid_from. Best when the source guarantees an accurateupdated_at. -
check— compares a list of named columns row-by-row; if any differ, dbt closes the old row and inserts a new one. More flexible but slightly slower (column-by-column hash comparison).
Generated columns.
-
dbt_valid_from— when this version became current. -
dbt_valid_to— when this version stopped being current (NULL for current). -
dbt_scd_id— surrogate key, hash of(unique_key, dbt_updated_at). -
dbt_updated_at— copy of theupdated_atcolumn (timestamp strategy only).
Running snapshots.
-
dbt snapshot— runs every snapshot in the project; idempotent. - Schedule daily/hourly via Airflow / dbt Cloud / cron.
Worked example — timestamp strategy snapshot
Detailed explanation. A customer's city changes; dbt detects the updated_at increase and rolls the snapshot.
Question. Trace one snapshot run for a city change.
Input (source customers).
| customer_id | city | updated_at |
|---|---|---|
| 42 | Austin | 2024-01-01 |
After the change:
| customer_id | city | updated_at |
|---|---|---|
| 42 | Seattle | 2026-05-01 |
Code.
{% snapshot customers_snapshot %}
{{
config(
target_schema='snapshots',
unique_key='customer_id',
strategy='timestamp',
updated_at='updated_at',
)
}}
SELECT * FROM {{ source('app', 'customers') }}
{% endsnapshot %}
Run:
dbt snapshot
Step-by-step explanation.
-
First run (2024-01-01) — source has
(42, Austin, 2024-01-01). Snapshot table is empty → INSERT row withdbt_valid_from=2024-01-01, dbt_valid_to=NULL, dbt_scd_id=hash(42, 2024-01-01). -
Second run (2026-05-01) — source now has
(42, Seattle, 2026-05-01). dbt comparesupdated_at=2026-05-01against the snapshot's currentdbt_updated_at=2024-01-01— they differ. - dbt closes the old row:
UPDATE … SET dbt_valid_to='2026-05-01' WHERE dbt_scd_id=hash(42, 2024-01-01). - dbt inserts the new row:
INSERT … (42, 'Seattle', '2026-05-01', dbt_valid_from='2026-05-01', dbt_valid_to=NULL, dbt_scd_id=hash(42, '2026-05-01')).
Output (snapshot table after the change).
| customer_id | city | updated_at | dbt_valid_from | dbt_valid_to | dbt_scd_id |
|---|---|---|---|---|---|
| 42 | Austin | 2024-01-01 | 2024-01-01 | 2026-05-01 | hash(42, 2024-01-01) |
| 42 | Seattle | 2026-05-01 | 2026-05-01 | NULL | hash(42, 2026-05-01) |
Rule of thumb. timestamp strategy if the source has a reliable updated_at; check strategy otherwise. Never use check when timestamp would work — it's noticeably slower on big tables.
dbt snapshot interview question on check strategy
The probe: "Our source DB doesn't have an updated_at column. SCD2 anyway?"
Solution Using check strategy with named columns
{% snapshot customers_snapshot_check %}
{{
config(
target_schema='snapshots',
unique_key='customer_id',
strategy='check',
check_cols=['city', 'segment', 'lifetime_value'],
)
}}
SELECT * FROM {{ source('app', 'customers') }}
{% endsnapshot %}
Step-by-step trace.
| Run | Source | Snapshot state | Action |
|---|---|---|---|
| 1 | (42, Austin, premium, 1000) | empty | INSERT row, dbt_valid_from = current_timestamp |
| 2 (no change) | (42, Austin, premium, 1000) | one row | NO-OP — hash(city, segment, lifetime_value) unchanged |
| 3 (city changes) | (42, Seattle, premium, 1000) | one row | close old row, INSERT new — hash changed |
| 4 (lifetime_value changes) | (42, Seattle, premium, 1500) | two rows | close current row, INSERT new |
Output (snapshot table after 4 runs):
| customer_id | city | segment | lifetime_value | dbt_valid_from | dbt_valid_to |
|---|---|---|---|---|---|
| 42 | Austin | premium | 1000 | run-1-time | run-3-time |
| 42 | Seattle | premium | 1000 | run-3-time | run-4-time |
| 42 | Seattle | premium | 1500 | run-4-time | NULL |
Why this works — concept by concept:
-
check_cols— dbt hashes the listed columns each run; any difference triggers a snapshot row update. -
No source
updated_atneeded — dbt uses the run timestamp (dbt_valid_from = current_timestamp). -
Trade-off —
checkstrategy reads more data and does more comparison thantimestamp; performance overhead grows with table size and column count. -
Idempotent runs — running
dbt snapshottwice with no source change produces no new rows. - Cost — read = O(rows), hash = O(rows × columns), write = O(changed rows). Acceptable up to ~tens of millions of rows.
Dimensional modelling
Topic — slowly-changing-data
dbt snapshot / SCD design problems
7. SCD interview gotchas — late updates, retroactive deletes, surrogate-key collisions
The three failure modes every senior slowly changing dimensions interview questions round probes
Gotcha 1 — Late-arriving updates.
- Scenario: a customer update from 2026-04-15 arrives on 2026-05-15, but the dim already has a current row from 2026-04-20.
- Naive code closes the wrong row or duplicates rows.
- Fix: insert the late row by splitting the valid_from / valid_to window of the existing row that "owned" the affected date range. Slot the new version in the middle.
Gotcha 2 — Retroactive deletes.
- Scenario: a row is deleted in the source DB.
- Naive code does nothing (the staging extract doesn't include the deleted row, so no update fires).
- Fix: anti-join staging against the dim — for every business key in the dim that's NOT in the source extract, flip
is_current = false, valid_to = todayon the current row.
Gotcha 3 — Surrogate-key collisions on re-run.
- Scenario: a snapshot re-runs and assigns a new
skto the same(customer_id, valid_from)combination. Fact tables that referenced the oldskare now orphaned. - Naive code uses
BIGSERIALforsk, so re-runs always create new sks. - Fix: use a deterministic hash of
(customer_id, valid_from)as thesk— re-runs produce the same value, fact joins stay intact.
Worked example — handle a late-arriving update
Detailed explanation. A 2026-04-15 update arrives on 2026-05-15; the dim has a current row from 2026-04-20.
Question. Show the dim before, the late row, and the dim after a correct late-arriving fix.
Input.
Existing dim:
| sk | customer_id | city | valid_from | valid_to | is_current |
|---|---|---|---|---|---|
| 101 | 42 | Austin | 2024-01-01 | NULL | true |
Late row (arriving 2026-05-15): (customer_id=42, city='Boston', effective_from='2026-04-15').
Code (corrected late-arriving update).
BEGIN;
-- Step 1: split the overlapping current row at the late row's effective date
INSERT INTO dim_customer (customer_id, city, valid_from, valid_to, is_current)
SELECT customer_id, city, valid_from, '2026-04-14'::DATE, false
FROM dim_customer
WHERE sk = 101;
-- Step 2: insert the late row (Boston, effective 2026-04-15, valid_to=NULL)
INSERT INTO dim_customer (customer_id, city, valid_from, valid_to, is_current)
VALUES (42, 'Boston', '2026-04-15', NULL, true);
-- Step 3: delete (or mark) the original overlapping row
DELETE FROM dim_customer WHERE sk = 101;
COMMIT;
Step-by-step explanation.
- The original row sk=101 covered 2024-01-01 → NULL. The late update says Boston was true from 2026-04-15.
- Split the old row into: 2024-01-01 → 2026-04-14 (the period when Austin was actually true).
- Insert the late Boston row: 2026-04-15 → NULL.
- Delete the original sk=101 because it has been superseded by the split rows.
- End state has TWO rows: Austin (2024-01-01 → 2026-04-14) and Boston (2026-04-15 → NULL).
Output.
| customer_id | city | valid_from | valid_to | is_current |
|---|---|---|---|---|
| 42 | Austin | 2024-01-01 | 2026-04-14 | false |
| 42 | Boston | 2026-04-15 | NULL | true |
Rule of thumb. Late-arriving updates need window splitting, not append-only inserts. Always validate that effective dates don't overlap within the same business key.
SCD interview question on deterministic surrogate keys
The probe: "Our backfill re-runs the snapshot and now half the fact joins are broken. Why?"
Solution Using a hash-based deterministic surrogate key
-- Use a deterministic hash instead of BIGSERIAL
CREATE TABLE dim_customer (
sk BIGINT PRIMARY KEY, -- BIGSERIAL replaced by hash
customer_id BIGINT NOT NULL,
city VARCHAR(100) NOT NULL,
valid_from DATE NOT NULL,
valid_to DATE,
is_current BOOLEAN NOT NULL DEFAULT TRUE
);
-- compute sk deterministically on insert (PostgreSQL, hashtextextended)
INSERT INTO dim_customer (sk, customer_id, city, valid_from, valid_to, is_current)
SELECT
hashtextextended(customer_id::text || '|' || valid_from::text, 0) AS sk,
customer_id, city, valid_from, valid_to, is_current
FROM stg;
Step-by-step trace.
| Run | Source row | Computed sk | Result |
|---|---|---|---|
| 1 | (42, Austin, 2024-01-01) | hash('42 | 2024-01-01') = -7842341 |
| 2 (backfill) | (42, Austin, 2024-01-01) | hash('42 | 2024-01-01') = -7842341 |
| 1 | (42, Seattle, 2026-05-01) | hash('42 | 2026-05-01') = 1893221 |
| 2 (backfill) | (42, Seattle, 2026-05-01) | hash('42 | 2026-05-01') = 1893221 |
Output:
| sk | customer_id | city | valid_from | Stable across re-runs? |
|---|---|---|---|---|
| -7842341 | 42 | Austin | 2024-01-01 | yes |
| 1893221 | 42 | Seattle | 2026-05-01 | yes |
Why this works — concept by concept:
-
Deterministic hash —
hash(business_key, valid_from)produces the sameskon every re-run. -
No fact-table orphans — fact rows referencing
sk = -7842341always resolve to the same dim row. -
BIGSERIAL collision avoided — re-running with
BIGSERIALwould assign new sequential ids and break all existing joins. -
dbt does this for you —
dbt_scd_idis exactly this hash (overunique_key+dbt_updated_at). - Cost — hash compute is trivial; the bigger win is that the sks are stable across data refreshes.
Dimensional modelling
Topic — dimensional-modeling
SCD gotchas + deterministic-sk problems
Choosing the right SCD type (cheat sheet)
- Need only current value? Type 1 — UPDATE in place, MERGE on business key.
- Need full history for "as of" reports? Type 2 — close old row, insert new row, surrogate key + effective dates.
-
Need only "before/after" comparison? Type 3 — extra
prev_*columns on a single row. -
Need both current and historical without sub-queries? Type 6 — Type 2 history rows with denormalised
current_*columns kept in sync. -
Need automated SCD2 in dbt?
dbt snapshotwithstrategy='timestamp'if source has reliableupdated_at, otherwisestrategy='check'with column list. -
Need stable surrogate keys across re-runs? Deterministic hash of
(business_key, valid_from), notBIGSERIAL. -
Need to fact-join on a Type 2 dim? Surrogate key + effective-date predicate
(fact_date BETWEEN valid_from AND COALESCE(valid_to, '9999-12-31')). - Handle late-arriving updates? Split the existing valid_from/valid_to window at the late effective date.
-
Handle retroactive deletes? Anti-join staging against dim and flip
is_current=false, valid_to=todayon missing keys. -
Optimize Type 2 read performance? Partial index
(business_key) WHERE is_current = true+ index on(business_key, valid_from).
Frequently asked questions
What's the difference between SCD Type 1, Type 2, Type 3, and Type 6?
Type 1 overwrites in place — one row per business key, no history. Type 2 preserves full history by inserting a new row on every change, with valid_from, valid_to, and is_current columns. Type 3 keeps only "current" and "previous" values as separate columns on a single row — bounded history, no row growth. Type 6 is a hybrid: Type 2 history rows plus Type 1 current_* columns kept in sync across every row of the business key, plus optional Type 3 prev_* columns. Type 2 is the production default; Type 6 is for when both "as of" and "current" reports are common.
How do I implement SCD Type 2 with SQL MERGE INTO?
Use a two-step pattern wrapped in a transaction: (1) UPDATE the current row to set valid_to = snapshot_date - 1 day and is_current = false where the tracked column changed; (2) INSERT a new current row with valid_from = snapshot_date, valid_to = NULL, is_current = true. Some warehouses (Snowflake, BigQuery, Databricks) let you express both branches in a single MERGE INTO ... WHEN MATCHED ... WHEN NOT MATCHED ... statement. Always wrap the operations in a transaction so partial failure can't leave the dim in an inconsistent state.
What's the difference between a business key and a surrogate key in SCD?
The business key is the natural identifier from the source system — customer_id, product_sku, employee_number. It identifies the entity across time. The surrogate key (sk) is a synthetic id that uniquely identifies a version of the row — one sk per Type 2 row. Fact tables join to dims on the surrogate key, not the business key, because the same customer_id can have multiple historical versions. Use a deterministic hash of (business_key, valid_from) as the surrogate key so re-runs of the snapshot produce stable sks and don't orphan fact joins.
How does dbt snapshot implement SCD Type 2?
dbt snapshot is dbt's built-in SCD Type 2 materialisation. You write a snapshot file with a unique_key and a strategy of either timestamp (uses the source's updated_at column) or check (compares a named list of columns). dbt generates dbt_valid_from, dbt_valid_to, dbt_scd_id, and dbt_updated_at columns automatically. Each dbt snapshot run closes old versions and inserts new ones for any source row whose tracked attributes have changed. Schedule snapshots daily/hourly via Airflow, dbt Cloud, or cron.
How do I join a fact table to an SCD Type 2 dim?
Join on the business key with an effective-date predicate: JOIN dim_customer c ON c.customer_id = f.customer_id AND f.order_date >= c.valid_from AND (f.order_date < c.valid_to OR c.valid_to IS NULL). This ensures each fact row matches exactly one dim version — the one that was current at the time of the fact event. Alternatively, store the dim's surrogate key directly in the fact table at load time; then the join is simply fact.customer_sk = dim.sk with no date predicate, but the fact load must look up the right sk for the event's date.
What are the most common SCD interview gotchas?
Three failure modes every senior round probes: (1) late-arriving updates — a change with an effective date earlier than the current dim row's valid_from; the fix is to split the existing valid_from/valid_to window at the late effective date. (2) Retroactive deletes — a row deleted in the source DB; the fix is to anti-join the staging extract against the dim and flip is_current=false, valid_to=today on any business key missing from the source. (3) Surrogate-key collisions on re-run — BIGSERIAL assigns new sks on backfill, breaking fact joins; the fix is a deterministic hash of (business_key, valid_from) as the sk.
Practice on PipeCode
- Drill the dimensional modelling library → for SCD design questions.
- Rehearse slowly-changing-data drills → for Type 2 / Type 6 implementations.
- Sharpen join problems → for effective-date predicates and fact-dim joins.
- Stretch into window-function drills → for
LEAD/LAG-based effective-date computation. - For the broader interview surface, read top data engineering interview questions →.
- Stack the prerequisites with the only 5 skills you need to become a data engineer →.
- Reinforce the compute side with Apache Spark internals for DE interviews →.
- For the design-round muscles, work through ETL system design for DE interviews →.
- For the modelling muscles, browse data modelling for DE interviews →.
Pipecode.ai is Leetcode for Data Engineering — every SCD concept above ships with hands-on practice rooms where you design dim tables, write SCD2 MERGEs, and trace fact-dim joins. Start with the dimensional-modelling library and work outward; PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine.
Practice dimensional modelling →
Slowly-changing-data drills →





Top comments (0)