kimball data warehouse is still the gravity well every analytics interview falls back into: a fact table keyed by a handful of foreign keys, a halo of dimension table rows that describe context, a single declared grain that fixes what one row of the fact means, and a discipline for handling change over time — the four slowly changing dimension patterns (Type 1 overwrite, Type 2 new row, Type 3 new column, Type 6 hybrid). Together those primitives — plus the conformed dimensions that let fact_sales, fact_returns, and fact_inventory all share the same dim_customer — form the kimball methodology that powers Snowflake, BigQuery, Databricks, and Redshift warehouses in 2026, and the deep-dive interview track this guide walks through, end to end, in five numbered teaching sections.
This is the deep-dive companion to a tighter Q&A round-up: where a 5-section data-modeling cheat sheet ranges across OLTP normalisation, Inmon's third-normal-form warehouse, and Data Vault, this guide narrows the scope to dimensional modeling the way Ralph Kimball and Margy Ross actually teach it — fact tables vs dimension tables (the atoms), grain + SCDs (the decisions that bite you later), conformed dimensions + the bus matrix (modeling at enterprise scale), and the Kimball 4-step design process (business process → grain → dimensions → facts). Each section ends as dimensional modeling interview questions and answers: a question, a SQL or Python snippet, a traced execution, a sample output, and a concept-by-concept why this works breakdown — the exact shape kimball methodology rounds reward at FAANG, fintech, and every modern analytics shop.
When you want hands-on reps immediately after reading, browse dimensional-modeling practice →, drill slowly-changing-data problems →, sharpen SQL practice library →, rehearse aggregation reconciliation patterns →, reinforce database problems →, or widen coverage on the full Python practice library →.
On this page
- Why Kimball is still the dimensional-modeling interview standard
- Fact tables vs dimension tables — the atoms of Kimball modeling
- Grain + Slowly Changing Dimensions — Type 1, 2, 3, 6 with SQL
- Conformed dimensions + the Kimball bus matrix — modeling at enterprise scale
- The Kimball 4-step design process — business process → grain → dimensions → facts
- Choosing the right SCD type (cheat sheet)
- Frequently asked questions
- Practice on PipeCode
1. Why Kimball is still the dimensional-modeling interview standard
kimball data warehouse — the dimensional model that outlived every "Kimball is dead" hot take
The one-sentence invariant: kimball data warehouse is a denormalised star schema built around fact tables (narrow, tall, numeric, foreign-key heavy) and dimension tables (wide, short, descriptive, business-key plus surrogate-key), designed so that BI users can write SELECT … FROM fact JOIN dim_a JOIN dim_b GROUP BY dim_a.something, dim_b.something and get the answer back in under a second. Every "Kimball is dead" hot take since 2010 — Inmon CIF, Data Vault 2.0, "just put everything in S3", "the warehouse is the lakehouse" — has been followed by a quiet rediscovery that, underneath the storage layer, analysts still want a star schema because that is the shape SQL pivots and BI tools natively consume.
Why dimensional modeling won the BI war (and is still winning in 2026).
-
Query simplicity —
SELECT … FROM fact_sales f JOIN dim_customer c JOIN dim_date dis teachable to a finance analyst in 30 minutes; a 6-table normalised join graph is not. - Read performance — denormalised dims mean fewer joins per query; the warehouse cost model rewards wide, short tables.
-
Stable interface —
dim_customerevolves (Type 2 history) without breaking thecustomer_keyjoin key downstream. - Tool affinity — Tableau, Looker, Power BI, ThoughtSpot, and Mode are all designed against a star schema; trying to drive them off a 3NF model is a 6-month integration project.
- Mental model fit — humans think in nouns + verbs; dimensions are nouns, facts are verbs ("the customer bought the product on the date at the store"); the schema matches the sentence.
What interviewers actually score on kimball methodology rounds.
-
Vocabulary fluency — can you crisply distinguish
fact,dimension,grain,surrogate key,business key,conformed dimension,SCD Type 2,bus matrix,degenerate dimension,junk dimension, andfactless fact tablein one sentence each? - The 4-step design — given a business request, can you walk business process → grain → dimensions → facts out loud, with explicit example values at each step?
-
graindefence — given a fact-table proposal, can you state its grain in one sentence and justify why no row is finer or coarser? -
SCD type selection per attribute — given a
dim_customerschema, can you mark each column as Type 1, 2, 3, or 6 and explain why? - Conformed-dimension reasoning — given three business processes (sales, returns, inventory), can you identify which dimensions should be shared (conformed) and which should remain process-local?
- The bus matrix — can you sketch a small bus matrix on a whiteboard with processes as rows and dimensions as columns?
The 5-section interview map this guide walks through.
- Section 2 — fact tables vs dimension tables — the two atoms; what columns belong where; the FK + measure structure of facts; the surrogate + business key + attribute structure of dims; the rule of thumb (facts are tall + skinny + numeric, dims are short + wide + descriptive).
-
Section 3 — grain + SCDs — declaring grain before any column is named; the four SCD types (1, 2, 3, 6) with full SQL
MERGEpatterns; the cost / benefit / use-case for each. -
Section 4 — conformed dimensions + the Kimball bus matrix — building
dim_customeronce and reusing it acrossfact_sales,fact_returns,fact_inventory; the bus matrix as the org-wide design artefact. - Section 5 — the Kimball 4-step design process — business process → grain → dimensions → facts, with a fully worked end-to-end example.
- Cheat sheet + FAQ — when to pick which SCD type, plus the senior-round Q&A every loop circles back to.
Why dimensional modeling is still the interview default in 2026 (and not "old hat").
- Snowflake, BigQuery, Databricks, and Redshift all publish reference architectures with star-schema gold-layer models.
-
dbt is built around dimensional modeling;
dim_/fact_naming is the de-facto convention;dbt_utilsshipsgenerate_surrogate_key, anddbt-expectationsships dimensional-model assertions. - The lakehouse did not kill it — Iceberg / Delta / Hudi tables still get a Kimball-shaped gold layer on top.
- Modern semantic layers — Cube, LookML, Snowflake's Semantic Layer — all assume a star-schema input.
- Data Vault complements, not replaces — DV 2.0 is increasingly used in the raw / integration layer with a Kimball star schema on top as the consumption layer.
Worked example — turn a one-sentence business request into the Kimball vocabulary
Detailed explanation. Real interviews probe whether you can translate a vague business request into the Kimball primitives (grain, fact, dimensions) on the spot. Below is the canonical translation drill — "track our online order line revenue by customer, product, date, and store" — and how a senior modeler maps it.
Question. A finance PM asks: "I want to see daily revenue by customer segment, product category, and store region, with the ability to drill into individual order lines." In one minute, name the fact table, its grain, and the four dimensions; include the surrogate-key columns on each.
Input. No tables exist yet. The OLTP source is a single orders table with one row per checkout and an embedded line-item array. The warehouse is empty.
Code.
-- Kimball translation of the PM request.
CREATE TABLE fact_sales (
sale_key BIGINT NOT NULL, -- surrogate PK of fact
customer_key BIGINT NOT NULL, -- FK -> dim_customer
product_key BIGINT NOT NULL, -- FK -> dim_product
date_key INT NOT NULL, -- FK -> dim_date (YYYYMMDD)
store_key BIGINT NOT NULL, -- FK -> dim_store
order_id VARCHAR(40) NOT NULL, -- degenerate dim (no own table)
line_id INT NOT NULL, -- degenerate dim
quantity INT NOT NULL,
unit_price NUMERIC(12,2) NOT NULL,
discount_amount NUMERIC(12,2) NOT NULL,
revenue NUMERIC(12,2) NOT NULL -- = quantity * unit_price - discount
);
-- Grain: one row per (order_id, line_id) — i.e. one row per ordered SKU.
Step-by-step explanation.
- Business process — online sales; the noun + verb pair tells you the process you're modelling.
- Grain — one row per order line; declared in the comment on the table; defended against finer (no row per scan event) and coarser (no row per order header) alternatives.
-
Dimensions —
customer,product,date,store; one per who / what / when / where; each becomes its owndim_*table with a surrogate*_key. -
Facts (measures) —
quantity,unit_price,discount_amount,revenue; numeric, additive, aggregatable bySUM. -
Degenerate dimensions —
order_idandline_idlive on the fact (no separate dim) because they have no descriptive attributes worth storing in their own table.
Output (the column list, grouped by role).
| role | columns | shape |
|---|---|---|
| surrogate PK | sale_key | numeric |
| FK to dim | customer_key, product_key, date_key, store_key | numeric |
| degenerate dim | order_id, line_id | string + int |
| measures | quantity, unit_price, discount_amount, revenue | numeric, additive |
Rule of thumb: every interview translation answer should explicitly name the grain in one sentence before any column is listed. Skip the grain and the rest of the model is unfounded.
dimensional modeling — Kimball vs Inmon vs Data Vault in one minute
The three competing schools (and when each wins).
- Kimball (dimensional modeling) — denormalised star / snowflake schema, fact + dim tables, grain-first, optimised for BI query speed and analyst ergonomics; the default for the gold / consumption layer.
- Inmon (Corporate Information Factory) — fully normalised 3NF enterprise warehouse acting as the integration layer, with downstream Kimball-style data marts hanging off it; the heavyweight enterprise pattern, less common at modern startups.
- Data Vault 2.0 — hub / link / satellite pattern designed for source-aware audit-friendly raw integration; excellent for the raw / integration layer, frequently combined with a Kimball star on top as the consumption layer.
Why Kimball wins the interview question by default.
-
The PM asks a single business question — "show me revenue by region by month"; the answer is "join
fact_salestodim_storeanddim_date". - Junior engineers can read it — a star schema is teachable; a 7-table Data Vault is not.
- It composes with everything — modern stacks layer Kimball on top of Vault or on top of a raw bronze lake.
-
The vocabulary travels — every BI tool, every dbt project, every dimensional textbook uses the same
fact_*/dim_*convention.
The senior signal — "Kimball + something" beats "Kimball or nothing".
- Kimball gold + Data Vault silver — DV in the integration layer handles source heterogeneity; Kimball star on top serves analysts.
-
Kimball gold + bronze raw —
bronze.orders_rawlands the source untransformed;silver.orders_cleanedadds standardisation;gold.fact_salesis the Kimball star. - Kimball gold + semantic layer — define metrics in Cube / LookML / dbt-metricflow on top of the Kimball star; the metric definitions live above the schema.
-
Kimball gold + reverse ETL — push
dim_customerType-2 history back into Salesforce / HubSpot for marketing personalisation.
SQL
Topic — dimensional-modeling
Dimensional modeling drills
SQL
Topic — slowly-changing-data
Slowly changing dimensions practice
Solution Using a Kimball-vocabulary lookup matrix
Code.
-- Materialise the Kimball vocabulary as a quick-reference matrix
-- every interview answer can be grounded against.
CREATE TABLE kimball_vocabulary AS
SELECT * FROM (VALUES
('fact table', 'narrow + tall + numeric', 'one row per business event', 'fact_sales, fact_returns'),
('dimension table', 'short + wide + descriptive','one row per business entity', 'dim_customer, dim_product'),
('grain', 'declared sentence', 'what one row of the fact means', '1 row = 1 order line'),
('surrogate key', 'numeric, system-generated', 'stable, history-aware join key', 'customer_key BIGINT'),
('business key', 'natural key from source', 'the OLTP identifier', 'customer_id VARCHAR(40)'),
('SCD Type 1', 'overwrite', 'no history kept', 'email change'),
('SCD Type 2', 'add new row', 'full history with valid_from/to', 'segment change'),
('SCD Type 3', 'add new column', 'limited history (current + prev)', 'sales_region rename'),
('SCD Type 6', 'hybrid (1+2+3)', 'full history + fast current lookup','enterprise customer dim'),
('conformed dimension', 'shared across fact tables', 'one dim_customer for sales+returns','dim_customer'),
('degenerate dim', 'on the fact, no own table', 'identifier with no attributes', 'order_id, line_id'),
('junk dimension', 'combine low-card flags', 'shrink fact width, group flags', 'dim_order_flags'),
('bridge table', 'many-to-many resolver', 'connect fact to multi-valued dim', 'bridge_account_customer'),
('factless fact', 'event with no measures', 'occurrence-only event log', 'fact_login, fact_class_attendance'),
('bus matrix', 'process x dim grid', 'org-wide dim conformance map', 'sales|returns|inventory x customer|product|date')
) AS t(term, shape, definition, example);
Step-by-step trace.
| term | shape | definition | example |
|---|---|---|---|
| fact table | narrow + tall + numeric | one row per business event | fact_sales, fact_returns |
| dimension table | short + wide + descriptive | one row per business entity | dim_customer, dim_product |
| grain | declared sentence | what one row of the fact means | 1 row = 1 order line |
| surrogate key | numeric, system-generated | stable, history-aware join key | customer_key BIGINT |
| business key | natural key from source | the OLTP identifier | customer_id VARCHAR(40) |
| SCD Type 1 | overwrite | no history kept | email change |
| SCD Type 2 | add new row | full history with valid_from/to | segment change |
| SCD Type 3 | add new column | limited history (current + prev) | sales_region rename |
| SCD Type 6 | hybrid (1+2+3) | full history + fast current lookup | enterprise customer dim |
| conformed dimension | shared across fact tables | one dim_customer for sales+returns | dim_customer |
| degenerate dim | on the fact, no own table | identifier with no attributes | order_id, line_id |
| junk dimension | combine low-card flags | shrink fact width, group flags | dim_order_flags |
| bridge table | many-to-many resolver | connect fact to multi-valued dim | bridge_account_customer |
| factless fact | event with no measures | occurrence-only event log | fact_login |
| bus matrix | process x dim grid | org-wide dim conformance map | sales |
- Rows 1–2 — the two atoms; every other term is built on top.
- Row 3 —
grainis declared as a sentence, not a column; it constrains every later modeling decision. - Rows 4–5 — every dimension has both a surrogate (system) key and a business (source) key; the surrogate joins, the business identifies.
- Rows 6–9 — the four SCD types; section 3 ships full SQL for each.
- Row 10 —
conformed dimensionsare the contract that lets cross-process analytics actually work; section 4 covers them in depth. - Rows 11–14 — the less-common but interview-favourite primitives (degenerate, junk, bridge, factless).
- Row 15 — the bus matrix is the org-wide design artefact; section 4 sketches one.
Output.
| term | example |
|---|---|
| fact table | fact_sales |
| dimension table | dim_customer |
| grain | 1 row = 1 order line |
| surrogate key | customer_key |
| business key | customer_id |
| SCD Type 2 | segment change |
| conformed dimension | dim_customer shared by sales + returns |
| bus matrix | sales |
Why this works — concept by concept:
- Vocabulary matrix — turns 15 fuzzy terms into one-row definitions you can recite under pressure; interviewers reward crisp definitions over hand-waving.
- Shape column — pairs each term with its physical characteristic (narrow + tall, short + wide, etc.); this is the senior signal that you've actually built dim models, not just read about them.
- Definition column — one sentence per term; if you can't fit it in a sentence, you don't understand it yet.
- Example column — grounds every abstract term in a concrete table or column name; concrete examples beat abstract definitions in every interview.
-
Cost —
O(1)to read; the actual schemas built from this vocabulary areO(N rows)to materialise but the vocabulary itself is constant-time recall.
2. Fact tables vs dimension tables — the atoms of Kimball modeling
fact table vs dimension table — the two atoms every Kimball schema is built from
fact table rows answer "how much"; dimension table rows answer "who / what / when / where / why". The two are physically different shapes — facts are narrow + tall + numeric (a handful of foreign-key columns plus a handful of additive measures, repeated millions of times); dims are short + wide + descriptive (one row per entity, dozens of text and date attributes, history-aware columns layered on top). Mastering Kimball is mostly mastering these two shapes and the discipline of not mixing them.
Anatomy of a fact table.
-
Foreign keys — one column per dimension that participates in the grain; named
*_key(the surrogate key, not the business key). -
Degenerate dimensions — identifiers that live on the fact because they have no descriptive attributes (
order_id,line_id,transaction_id). -
Measures — numeric columns aggregatable by
SUM/COUNT/MIN/MAX/AVG; ideally fully additive across all dimensions. - Grain comment — a one-sentence declaration of what one row means; lives in the table comment so it can't drift from the schema.
-
Surrogate fact key — optional; many shops use the composite
(order_id, line_id)as the natural PK and skip the surrogate fact key entirely.
Three flavours of fact tables (the senior interviewer will ask).
-
Transaction fact — one row per business event (one order line, one click, one payment); the most common shape; fully additive measures; example
fact_sales. -
Periodic snapshot fact — one row per (entity, time period); useful for slowly evolving balances; semi-additive over time (
balancedoesn'tSUMacross days); examplefact_account_balance_daily. -
Accumulating snapshot fact — one row per long-running process, with multiple date columns that get updated as the process advances; example
fact_order_lifecyclewithorder_date,ship_date,deliver_date,return_date.
Anatomy of a dimension table.
-
Surrogate key — system-generated
BIGINT, monotonically increasing; the only column the fact joins against; stable across SCD changes. -
Business key — the OLTP source identifier (
customer_id,product_sku); preserved for traceability but never used as a join key. -
Descriptive attributes —
name,segment,country,category,sub_category; the columns BI usersGROUP BY. -
SCD columns —
valid_from,valid_to,is_currentfor Type 2;current_*/previous_*pairs for Type 3; both layers for Type 6. -
Audit columns —
inserted_at,updated_at,source_system; metadata that helps with reconciliation and DQ.
The rule of thumb (memorise this; recite it under pressure).
- Facts are tall + skinny + numeric — billions of rows, ~10 columns, mostly FKs + measures.
- Dims are short + wide + descriptive — thousands or millions of rows (rarely > 100M), 20-100 columns, mostly text + date attributes.
- If you find yourself adding a long text column to a fact, you're modelling it wrong — that attribute belongs on a dimension.
- If you find yourself adding a numeric measure to a dimension, you're modelling it wrong — that measure belongs on a fact.
- The two atoms never mix — facts join to dims; dims do not join to dims (snowflake schema being the rare exception).
Worked example — design fact_sales and dim_customer from a raw orders source
Detailed explanation. Real interviews ask you to physically design both atoms from an OLTP source. Below is the canonical translation of a raw orders source into a Kimball fact + dim pair, with explicit column lists and surrogate-key wiring.
Question. Given an OLTP orders source with a customers lookup and a products lookup, design fact_sales and dim_customer end-to-end. Name every column, every type, every key, and the grain.
Input. The OLTP source has three tables: orders(order_id, customer_id, order_ts), order_lines(order_id, line_id, sku, qty, unit_price, discount), customers(customer_id, name, email, segment, country, signup_dt).
Code.
-- The fact: narrow + tall + numeric.
CREATE TABLE fact_sales (
sale_key BIGINT IDENTITY(1,1) PRIMARY KEY,
customer_key BIGINT NOT NULL REFERENCES dim_customer(customer_key),
product_key BIGINT NOT NULL REFERENCES dim_product(product_key),
date_key INT NOT NULL REFERENCES dim_date(date_key),
store_key BIGINT NOT NULL REFERENCES dim_store(store_key),
order_id VARCHAR(40) NOT NULL, -- degenerate dim
line_id INT NOT NULL, -- degenerate dim
quantity INT NOT NULL,
unit_price NUMERIC(12,2) NOT NULL,
discount_amount NUMERIC(12,2) NOT NULL DEFAULT 0,
revenue NUMERIC(12,2) NOT NULL,
CONSTRAINT uq_fact_sales UNIQUE (order_id, line_id)
);
-- GRAIN: one row per (order_id, line_id) — i.e. one row per ordered SKU.
-- The dim: short + wide + descriptive + SCD-aware.
CREATE TABLE dim_customer (
customer_key BIGINT IDENTITY(1,1) PRIMARY KEY, -- surrogate
customer_id VARCHAR(40) NOT NULL, -- business key
name VARCHAR(120) NOT NULL,
email VARCHAR(120) NOT NULL, -- SCD Type 1 (overwrite)
segment VARCHAR(32) NOT NULL, -- SCD Type 2 (history)
country VARCHAR(64) NOT NULL, -- SCD Type 2 (history)
signup_date DATE NOT NULL,
valid_from TIMESTAMP NOT NULL,
valid_to TIMESTAMP NOT NULL DEFAULT '9999-12-31',
is_current BOOLEAN NOT NULL DEFAULT TRUE,
inserted_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
source_system VARCHAR(40) NOT NULL DEFAULT 'oltp_orders'
);
CREATE INDEX ix_dim_customer_bk ON dim_customer(customer_id, is_current);
Step-by-step explanation.
-
fact_sales.sale_key— optional system-generated PK; some teams skip it and use(order_id, line_id)directly. -
fact_sales.{customer,product,date,store}_key— four FKs, one per dimension; named*_key(never*_id) to signal "this is the surrogate, not the source identifier". -
fact_sales.{order_id, line_id}— degenerate dimensions; they live on the fact because they have no descriptive attributes worth storing in their own table. -
fact_sales.{quantity, unit_price, discount_amount, revenue}— additive measures;revenueis stored even though it's derivable, so that BI queries don't have to recompute it on every aggregation. -
dim_customer.customer_keyvscustomer_id— surrogate (used by the fact) vs business (preserved for traceability); the fact never referencescustomer_iddirectly. -
dim_customer.email— SCD Type 1 (overwrite); change history of email addresses is rarely interesting and inflates row counts. -
dim_customer.segmentandcountry— SCD Type 2 (full history); these are historically interesting (a customer moved from "starter" to "enterprise" in March; revenue before vs after that change is a real question). -
dim_customer.valid_from / valid_to / is_current— the SCD Type 2 columns;is_currentis a precomputed flag so theWHERE is_currentlookup is index-friendly.
Output (the fact + dim shapes side by side).
| table | rows | columns | shape |
|---|---|---|---|
| fact_sales | 100M+ | ~12 (4 FK + 2 degen + 4 measure + 2 admin) | tall + skinny + numeric |
| dim_customer | 1-10M | ~12 (1 surr + 1 biz + 5 attr + 3 SCD + 2 audit) | short + wide + descriptive |
Rule of thumb: the physical difference between facts and dims is the easiest senior signal to give — "the fact is roughly 12 columns × 100M rows, the dim is roughly 12 columns × 1M rows, and the column types tell you which is which".
surrogate key vs business key — the rule that lets SCD history actually work
surrogate key is the system-generated BIGINT you stamp on every dimension row, and it is the only column the fact joins against. The business key (a.k.a. natural key) is the OLTP source identifier — customer_id = 'C-00012345', product_sku = 'SKU-RED-MEDIUM' — and you preserve it on the dim for traceability, but you never use it as a join key. The distinction matters because once you start tracking SCD Type 2 history, a single business key can map to multiple dim rows (one per historical version), so the join from fact to dim has to use the surrogate, never the business key.
The 5-rule surrogate-key discipline.
-
Surrogate is
BIGINT, system-generated —IDENTITY(1,1)in SQL Server,GENERATED ALWAYS AS IDENTITYin PostgreSQL,AUTOINCREMENTin Snowflake. -
Surrogate is opaque — never embed business meaning;
customer_key = 12345should mean nothing outside the warehouse. -
Fact stores only the surrogate — never
customer_id, alwayscustomer_key. -
Business key +
is_current = TRUEis the lookup recipe — to find the current row for a given customer:WHERE customer_id = 'C-001' AND is_current = TRUE. -
The surrogate key remains stable when the source business key changes — if
customer_idis reissued by the OLTP team, the surrogate stays put; the source change is just another SCD event.
Why this matters in interviews.
-
The Type 2 join is broken without surrogate keys — if the fact stores
customer_idanddim_customerhas 3 historical rows for that customer, the fact join is now 3x ambiguous. -
Hashing replaces auto-increment in modern shops —
dbt_utils.generate_surrogate_key(['customer_id', 'valid_from'])is the idiomatic Snowflake / BigQuery / dbt pattern. - Surrogate keys decouple the warehouse from the source — the source can renumber, re-key, or migrate; the warehouse surrogate is untouched.
SQL
Topic — dimensional-modeling
Fact and dimension design drills
SQL
Topic — database
Database design practice
Solution Using a surrogate-key + business-key join harness
Code.
-- Join fact_sales to dim_customer using the surrogate, with point-in-time correctness.
SELECT
d.segment,
d.country,
SUM(f.revenue) AS total_revenue,
COUNT(DISTINCT f.order_id) AS order_count
FROM fact_sales f
JOIN dim_customer d
ON d.customer_key = f.customer_key -- surrogate join, never customer_id
WHERE d.is_current = TRUE -- current segment lookup
AND f.date_key BETWEEN 20260101 AND 20260131
GROUP BY d.segment, d.country
ORDER BY total_revenue DESC;
Step-by-step trace.
| f.customer_key | d.customer_key | d.customer_id | d.segment | d.is_current | f.revenue |
|---|---|---|---|---|---|
| 101 | 101 | C-001 | enterprise | true | 5000.00 |
| 102 | 102 | C-002 | starter | true | 1200.00 |
| 103 | 103 | C-003 | enterprise | true | 8400.00 |
| 101 | 101 | C-001 | enterprise | true | 3300.00 |
- The fact stores
customer_key = 101, notcustomer_id = 'C-001'; the join isd.customer_key = f.customer_key. -
is_current = TRUEfilters to one dim row per business customer; without it the result set would multiply by SCD history depth. - Rows 1 + 4 belong to the same customer (C-001); they roll up in the
GROUP BYbecause they share the samesegment+country. - The grain of the result is one row per (
segment,country); each cell is theSUM(revenue)andCOUNT(DISTINCT order_id)for that bucket. - The
WHERE date_key BETWEENclause hits the fact-side partition pruning; the dim is small enough that no partitioning is needed.
Output:
| segment | country | total_revenue | order_count |
|---|---|---|---|
| enterprise | US | 16700.00 | 3 |
| starter | UK | 1200.00 | 1 |
Why this works — concept by concept:
-
Surrogate-key join —
d.customer_key = f.customer_keyis the only valid join shape; it survives SCD Type 2 history and source re-keying. -
is_current filter — without
WHERE is_current = TRUE, the join multiplies by historical depth; with it, you get one current row per customer. -
date_key partition pruning —
date_key BETWEEN 20260101 AND 20260131lets the warehouse skip every other partition; this is why we useINTYYYYMMDDfor date keys. -
Additive measures —
SUM(revenue)is safe because revenue is fully additive across all four dims; this is the payoff for storing the derivedrevenuecolumn on the fact. -
Cost — fact scan is
O(rows in matching partitions); dim join isO(distinct customers); the surrogate key makes both lookups index-friendly.
3. Grain + Slowly Changing Dimensions — Type 1, 2, 3, 6 with SQL
grain — declare it first, defend it forever
grain is the single sentence that defines what one row of a fact table means, declared before you name a single column, and defended for the life of the table. "One row per order line", "one row per customer per day", "one row per order, accumulated across the lifecycle" — three different grains, three different fact tables, three different physical shapes. The Kimball discipline is declare the grain first, never mix grains in the same fact table, and never change the grain after the table is built.
The three grain families.
- Transaction grain — one row per business event; the most common; example "one row per (order_id, line_id)"; measures are fully additive.
-
Periodic snapshot grain — one row per (entity, period); example "one row per (account_id, date_key)"; measures are semi-additive over time (
balancedoes notSUMacross days). -
Accumulating snapshot grain — one row per long-running process, updated in place as the process advances; example "one row per order, with
ordered_date_key,shipped_date_key,delivered_date_key,returned_date_key"; measures track lag (days_to_ship,days_to_deliver).
Why grain has to be declared first (and never changed later).
- The grain is the schema — the FK list, the degenerate-dim list, the measure list, and the additivity rules all follow from the grain.
-
Mixing grains corrupts every aggregate — if some rows are
(order_id, line_id)and others are(order_id)alone,SUM(revenue) GROUP BY product_keydouble-counts on the order-level rows. - Changing the grain breaks every downstream model — a re-grain triggers a coordinated re-publish of every BI dashboard that consumed the prior grain.
- The grain is the contract — write it in the table comment, the dbt model docstring, the data catalog, and the wiki; multiple sources of truth keep it from drifting.
Concrete grain examples (memorise the wording).
-
"One row per (order_id, line_id)" — transaction grain for
fact_sales. -
"One row per (account_id, snapshot_date_key)" — periodic snapshot grain for
fact_account_balance_daily. -
"One row per order, lifecycle-accumulating" — accumulating snapshot grain for
fact_order_lifecycle. -
"One row per (customer_id, day, event_name)" — semi-aggregated event grain for
fact_user_event_daily. -
"One row per class session, no measures" — factless fact for
fact_class_attendance.
slowly changing dimension — the four types you have to know cold
The acronym SCD covers strategies for handling change in dimension attributes over time, and every Kimball interview will probe at least Types 1, 2, and 6. The trick is not memorising the types; it is knowing which type to pick per attribute and writing the MERGE statements from memory.
SCD Type 1 — overwrite (no history)
Detailed explanation. SCD Type 1 simply overwrites the existing value in place; no history is preserved. Use it for attributes where past values are not interesting (typos, formatting changes, contact-info updates), and where the cost of preserving history outweighs the analytical value.
Question. A customer changes their email from alice@old.com to alice@new.com. Write the SCD Type 1 MERGE that updates the dimension.
Input. Source row: customer_id='C-001', email='alice@new.com'. Existing dim row: customer_key=101, customer_id='C-001', email='alice@old.com'.
Code.
-- SCD Type 1: overwrite in place.
MERGE INTO dim_customer_t1 AS tgt
USING (
SELECT
'C-001' AS customer_id,
'alice@new.com' AS email,
'Alice Smith' AS name
) AS src
ON tgt.customer_id = src.customer_id
WHEN MATCHED THEN UPDATE SET
tgt.email = src.email,
tgt.name = src.name
WHEN NOT MATCHED THEN INSERT (customer_id, email, name)
VALUES (src.customer_id, src.email, src.name);
Step-by-step explanation.
-
MERGE INTO dim_customer_t1targets the dim table; the aliastgtis conventional. -
USING (SELECT …) AS srclifts the new source row into a CTE-like alias; in production this would be a CTE over the staging table. -
ON tgt.customer_id = src.customer_idmatches on the business key; this is the only SCD type where matching on business key is safe (because there is no history). -
WHEN MATCHED THEN UPDATEoverwrites the email + name in place; the prior values are lost forever. -
WHEN NOT MATCHED THEN INSERTcovers the brand-new-customer case; first-time customers get a fresh row.
Output (the dim after the merge).
| customer_key | customer_id | name | |
|---|---|---|---|
| 101 | C-001 | alice@new.com | Alice Smith |
Rule of thumb: Type 1 is fast and cheap but lossy; use it for attributes nobody will ever ask "what was that on Feb 14th" about.
SCD Type 2 — add a new row (full history)
Detailed explanation. SCD Type 2 is the workhorse of dimensional modeling: when an attribute changes, insert a new row with a fresh surrogate key and stamp valid_from + valid_to + is_current on both the old and new rows. The prior row's valid_to becomes the new row's valid_from; the prior row's is_current becomes FALSE.
Question. Customer C-001 upgrades from starter to enterprise segment on 2026-04-15 10:30:00. Write the SCD Type 2 MERGE (or insert + update pair) that closes the old row and inserts the new one.
Input. Existing dim row: customer_key=101, customer_id='C-001', segment='starter', valid_from='2025-01-01', valid_to='9999-12-31', is_current=TRUE. Source change: customer_id='C-001', segment='enterprise', change_ts='2026-04-15 10:30:00'.
Code.
-- SCD Type 2: insert + update pair (the classic 2-step pattern).
BEGIN;
-- Step 1: close out the current row.
UPDATE dim_customer
SET
valid_to = TIMESTAMP '2026-04-15 10:30:00',
is_current = FALSE
WHERE customer_id = 'C-001'
AND is_current = TRUE;
-- Step 2: insert the new current row.
INSERT INTO dim_customer (
customer_id, name, email, segment, country, signup_date,
valid_from, valid_to, is_current
)
SELECT
'C-001', name, email, 'enterprise', country, signup_date,
TIMESTAMP '2026-04-15 10:30:00',
TIMESTAMP '9999-12-31',
TRUE
FROM dim_customer
WHERE customer_id = 'C-001'
AND valid_to = TIMESTAMP '2026-04-15 10:30:00' -- the row we just closed
LIMIT 1;
COMMIT;
Step-by-step explanation.
- Step 1 closes the prior current row by stamping
valid_to = change_tsandis_current = FALSE; this row now represents the historical state. - Step 2 inserts a new row with a fresh surrogate key (auto-generated by
IDENTITY),segment = 'enterprise',valid_from = change_ts,valid_to = '9999-12-31',is_current = TRUE. - The
WHERE valid_to = change_tsclause in step 2'sSELECTis how we copy the immutable attributes (name, email, country, signup_date) from the prior row. - The two steps run inside a transaction so a downstream reader never sees the dim with zero current rows for
C-001. - The surrogate key of the new row is different from the prior row's surrogate — that's the whole point; the fact table will join to whichever key matches the order's
valid_fromwindow.
Output (the dim after the merge — two rows now).
| customer_key | customer_id | segment | valid_from | valid_to | is_current |
|---|---|---|---|---|---|
| 101 | C-001 | starter | 2025-01-01 | 2026-04-15 10:30 | false |
| 132 | C-001 | enterprise | 2026-04-15 10:30 | 9999-12-31 | true |
Rule of thumb: SCD Type 2 inflates row count but preserves full history; pick it for any attribute where "what was the value on date X" is a real analytical question.
SCD Type 3 — add a new column (limited history)
Detailed explanation. SCD Type 3 adds a new column (typically previous_*) alongside the existing one, so the dim carries both the current and the immediately prior value side by side. It tracks one level of history per attribute; older history is lost.
Question. A company renames its sales_region from 'NorthAm' to 'Americas'. Track both the current and previous region on dim_store without inserting new rows.
Input. Existing dim row: store_key=11, store_id='S-100', sales_region='NorthAm'. Source change: store_id='S-100', sales_region='Americas', change_ts='2026-03-01'.
Code.
-- SCD Type 3: shift the current value into a previous column.
ALTER TABLE dim_store
ADD COLUMN previous_sales_region VARCHAR(64),
ADD COLUMN sales_region_changed_at TIMESTAMP;
UPDATE dim_store
SET
previous_sales_region = sales_region,
sales_region = 'Americas',
sales_region_changed_at = TIMESTAMP '2026-03-01'
WHERE store_id = 'S-100';
Step-by-step explanation.
-
ALTER TABLEadds two new columns:previous_sales_region(the prior value) andsales_region_changed_at(the change timestamp). - The
UPDATEshifts the existingsales_regionvalue intoprevious_sales_region, then overwritessales_regionwith the new value. - The row count of the dim is unchanged — Type 3 is in-place, no new rows.
- The new column lets BI write
SUM(revenue) GROUP BY sales_regionfor the current view andSUM(revenue) GROUP BY previous_sales_regionfor the prior view, without rewriting the fact joins. - Type 3 is brittle — if the region renames again a year later, the
previous_*column now holds two-changes-ago by default; some shops addprevious_previous_*, which quickly becomes silly.
Output (the dim after the merge).
| store_key | store_id | sales_region | previous_sales_region | sales_region_changed_at |
|---|---|---|---|---|
| 11 | S-100 | Americas | NorthAm | 2026-03-01 |
Rule of thumb: Type 3 fits "we just renamed one attribute and analysts want a side-by-side compare for a few quarters". Use sparingly; if you need full history, escalate to Type 2.
SCD Type 6 — hybrid (1 + 2 + 3 combined)
Detailed explanation. SCD Type 6 is the senior interview answer: combine Type 1 (overwrite the current attribute in every historical row), Type 2 (insert new rows for change), and Type 3 (carry the prior value on every row) into a single hybrid pattern. The result is a dim where every row carries both its own historical value and the current value, so BI can pivot on either without re-joining.
Question. Track customer segment changes with full history (Type 2) and let a query say WHERE current_segment = 'enterprise' cheaply on every historical row (Type 1) and expose previous_segment on each new row (Type 3). Write the SCD Type 6 update.
Input. Two existing dim rows for C-001: the original starter row and the enterprise row inserted above.
Code.
-- SCD Type 6: insert new row + overwrite current_segment on every historical row.
BEGIN;
-- Step 1: close out the prior current row (Type 2 mechanics).
UPDATE dim_customer_t6
SET valid_to = TIMESTAMP '2026-04-15 10:30:00',
is_current = FALSE
WHERE customer_id = 'C-001' AND is_current = TRUE;
-- Step 2: insert the new row with previous_segment carried (Type 3 mechanics).
INSERT INTO dim_customer_t6 (
customer_id, name, email,
segment, previous_segment, current_segment,
valid_from, valid_to, is_current
)
SELECT
'C-001', name, email,
'enterprise', -- this row's historical segment
'starter', -- the prior segment (Type 3)
'enterprise', -- the current segment (Type 1)
TIMESTAMP '2026-04-15 10:30:00',
TIMESTAMP '9999-12-31',
TRUE
FROM dim_customer_t6
WHERE customer_id = 'C-001'
AND valid_to = TIMESTAMP '2026-04-15 10:30:00'
LIMIT 1;
-- Step 3: overwrite current_segment on every historical row (Type 1 mechanics).
UPDATE dim_customer_t6
SET current_segment = 'enterprise'
WHERE customer_id = 'C-001';
COMMIT;
Step-by-step explanation.
- Step 1 mirrors SCD Type 2: close the prior current row by stamping
valid_to+is_current = FALSE. - Step 2 inserts a new row with three segment columns:
segment(the historical value for this row, here'enterprise'),previous_segment(the prior value, here'starter', the Type 3 carry-over), andcurrent_segment(the as-of-now value, here also'enterprise'). - Step 3 mirrors SCD Type 1: overwrite
current_segmenton every historical row forC-001, so even the closed-out starter row now carriescurrent_segment = 'enterprise'. - The payoff: BI can write
WHERE current_segment = 'enterprise'and get all historical revenue for that customer regardless of which row matches the order date; orWHERE segment = 'enterprise'to filter by historical segment-at-time-of-purchase. - Type 6 is the senior answer because it solves the "we want both views" problem without two separate dim tables.
Output (the dim after the merge — two rows, both carrying current_segment = 'enterprise').
| customer_key | customer_id | segment | previous_segment | current_segment | is_current |
|---|---|---|---|---|---|
| 101 | C-001 | starter | NULL | enterprise | false |
| 132 | C-001 | enterprise | starter | enterprise | true |
Rule of thumb: Type 6 is the "I want history and fast current lookup" pattern; it costs one extra column per Type-1-overwritten attribute but eliminates a whole class of join + filter complexity.
slowly changing dimension — beginner mistakes to avoid
- Joining the fact on the business key instead of the surrogate — breaks the moment you adopt SCD Type 2; the join multiplies by history depth.
-
Forgetting
is_current = TRUE— every current-state query needs it; without it the result silently sums historical rows. -
Letting
valid_tobe NULL — use'9999-12-31'instead soBETWEEN valid_from AND valid_toworks withoutIS NULLbranches. -
Updating
valid_fromon an open row —valid_fromis immutable once stamped; onlyvalid_toandis_currentflip during SCD updates. - Mixing Type 1 and Type 2 attributes in the same row without comment — every dim column should be annotated with its SCD type in the table comment or dbt YAML.
- Picking Type 2 for every attribute "just in case" — Type 2 inflates row counts; pick the type that matches the analytical question you'll be asked.
SQL
Topic — slowly-changing-data
SCD practice problems
SQL
Topic — dimensional-modeling
Grain-and-SCD dimensional modeling drills
Solution Using a per-attribute SCD type assignment matrix
Code.
-- Codify the SCD type for every attribute on dim_customer.
CREATE TABLE dim_customer_scd_plan AS
SELECT * FROM (VALUES
('customer_id', 'business key', 'NA', 'preserved for traceability; not updated after first insert'),
('name', 'descriptive', 'Type 1', 'typos and rebrands; history not interesting'),
('email', 'descriptive', 'Type 1', 'overwrite; do not preserve email history'),
('phone', 'descriptive', 'Type 1', 'overwrite; do not preserve phone history'),
('segment', 'analytical', 'Type 2', 'revenue per historical segment is a real question'),
('country', 'analytical', 'Type 2', 'geo migration matters for tax + analytics'),
('account_mgr', 'analytical', 'Type 2', 'attribution to manager-at-time-of-sale'),
('credit_score', 'analytical', 'Type 2', 'risk analysis needs historical score'),
('signup_date', 'immutable', 'NA', 'never changes; set once at insert'),
('current_segment','derived', 'Type 6', 'overwrite on all rows for fast current-state lookup')
) AS t(attribute, role, scd_type, rationale);
Step-by-step trace.
| attribute | role | scd_type | rationale |
|---|---|---|---|
| customer_id | business key | NA | preserved for traceability; not updated after first insert |
| name | descriptive | Type 1 | typos and rebrands; history not interesting |
| descriptive | Type 1 | overwrite; do not preserve email history | |
| phone | descriptive | Type 1 | overwrite; do not preserve phone history |
| segment | analytical | Type 2 | revenue per historical segment is a real question |
| country | analytical | Type 2 | geo migration matters for tax + analytics |
| account_mgr | analytical | Type 2 | attribution to manager-at-time-of-sale |
| credit_score | analytical | Type 2 | risk analysis needs historical score |
| signup_date | immutable | NA | never changes; set once at insert |
| current_segment | derived | Type 6 | overwrite on all rows for fast current-state lookup |
- Rows 1, 9 — business key + immutable; never updated after first insert.
- Rows 2-4 — Type 1; overwrite; cheap; loses history; appropriate for cosmetic and contact attributes.
- Rows 5-8 — Type 2; the analytical attributes; revenue / risk / attribution per historical value is a real question.
- Row 10 — Type 6 layered on top of
segment; one extra column gives BI a fast "current state" pivot without a join. - The matrix is the deliverable; every senior data modeler ships a per-attribute SCD plan, not a blanket "everything is Type 2".
Output.
| attribute | scd_type |
|---|---|
| customer_id | NA |
| name | Type 1 |
| Type 1 | |
| segment | Type 2 |
| country | Type 2 |
| account_mgr | Type 2 |
| credit_score | Type 2 |
| current_segment | Type 6 |
Why this works — concept by concept:
- Per-attribute SCD assignment — Kimball's discipline is "pick the SCD type per attribute, not per table"; a single dim can mix Types 1, 2, and 6 across its columns.
- Type 1 for cosmetic, Type 2 for analytical — the rule of thumb that keeps row counts down without losing analytical value; cosmetic attributes (typos, rebrands, contact info) overwrite, analytical attributes (segment, region, tier) preserve history.
-
Type 6 for derived current-state — pairing a Type 2 attribute with a Type 1
current_*column gives BI both views with zero join cost. -
Documentation as deliverable — the assignment matrix itself is shipped as part of the model design; without it the next engineer can't tell why
emailis Type 1 butcountryis Type 2. -
Cost —
O(1)to read the matrix; the actual updates costO(rows-per-change)for Type 1 (overwrite all rows for that business key) vsO(1)for Type 2 (insert one new row).
4. Conformed dimensions + the Kimball bus matrix — modeling at enterprise scale
conformed dimensions — build dim_customer once, use it in every fact
conformed dimensions are dimensions designed to be shared across multiple business processes — fact_sales, fact_returns, fact_inventory all join to the same dim_customer, dim_product, dim_date. The conformance contract is the heart of Kimball at enterprise scale: without it, every team builds their own dim_customer_sales, dim_customer_marketing, dim_customer_support, and cross-process analytics becomes impossible because the definition of "customer" has diverged in five places.
The conformance contract — what makes a dim "conformed".
-
Same columns —
dim_customer.segmentmeans the same thing whether you join it tofact_salesorfact_returns. -
Same surrogate-key generation —
customer_key = 12345resolves to the same business customer in every fact. - Same SCD policy — segment changes are tracked as Type 2 in every fact that uses the dim.
- Same grain — if the dim is at customer-account level (not customer-individual level), every fact agrees on that grain.
-
Same source of truth — one team owns
dim_customer; the other teams consume it, they don't fork it.
Why conformance matters in interviews.
-
Cross-process analytics depends on it — "what % of customers who bought in Q1 returned in Q2" requires
fact_salesandfact_returnsto sharedim_customer. -
Reconciliation breaks without it — if
fact_sales.customer_key = 12345is "Alice" butfact_returns.customer_key = 12345is "Bob", every reconciliation query lies. - It is the senior signal — junior modelers build a dim per fact; senior modelers build a conformed dim and reuse it.
- The Kimball bus matrix is the deliverable — section 4.2 walks through it.
Three flavours of conformance (with examples).
-
Identical conformance — the strongest; the dim row, the surrogate key, and every attribute match exactly across facts; example
dim_date. -
Shrunken conformance — a coarser version of the dim is used in lower-grain facts; example
dim_dateat month-grain (dim_month) for inventory snapshots whiledim_dateat day-grain servesfact_sales. -
Subset conformance — one fact uses only a subset of the dim's rows (e.g.
fact_internal_salesfiltersdim_customerto internal customers); attributes and keys match, but row set differs.
Worked example — design dim_customer once and use it in three facts
Detailed explanation. Real interviews ask you to demonstrate conformance by showing the same dim_customer being consumed by multiple fact_* tables. Below is the canonical three-fact pattern.
Question. Sales, returns, and customer-support tickets all need to be analysed by customer segment, country, and tier. Design dim_customer once and show how fact_sales, fact_returns, and fact_support_ticket all consume it.
Input. Three OLTP sources: oltp.orders, oltp.returns, oltp.support_tickets. The current state has three separate dim_customer_* tables, one per team; consolidate them.
Code.
-- One canonical conformed dim.
CREATE TABLE dim_customer (
customer_key BIGINT IDENTITY(1,1) PRIMARY KEY,
customer_id VARCHAR(40) NOT NULL,
name VARCHAR(120) NOT NULL,
email VARCHAR(120) NOT NULL, -- Type 1
segment VARCHAR(32) NOT NULL, -- Type 2
country VARCHAR(64) NOT NULL, -- Type 2
tier VARCHAR(16) NOT NULL, -- Type 2
current_segment VARCHAR(32) NOT NULL, -- Type 6 (derived)
valid_from TIMESTAMP NOT NULL,
valid_to TIMESTAMP NOT NULL DEFAULT '9999-12-31',
is_current BOOLEAN NOT NULL DEFAULT TRUE
);
-- Three facts, one dim.
CREATE TABLE fact_sales (
sale_key BIGINT IDENTITY(1,1) PRIMARY KEY,
customer_key BIGINT NOT NULL REFERENCES dim_customer(customer_key),
product_key BIGINT NOT NULL,
date_key INT NOT NULL,
revenue NUMERIC(12,2) NOT NULL
);
CREATE TABLE fact_returns (
return_key BIGINT IDENTITY(1,1) PRIMARY KEY,
customer_key BIGINT NOT NULL REFERENCES dim_customer(customer_key), -- conformed
product_key BIGINT NOT NULL,
date_key INT NOT NULL,
refund_amount NUMERIC(12,2) NOT NULL
);
CREATE TABLE fact_support_ticket (
ticket_key BIGINT IDENTITY(1,1) PRIMARY KEY,
customer_key BIGINT NOT NULL REFERENCES dim_customer(customer_key), -- conformed
date_key INT NOT NULL,
severity_key BIGINT NOT NULL,
resolution_minutes INT NOT NULL
);
Step-by-step explanation.
- One
dim_customertable is the single source of truth; every fact'scustomer_keyFK references it. - All three facts agree on the
customer_keysurrogate; if Alice iscustomer_key = 12345in sales, she iscustomer_key = 12345in returns and support. - All three facts inherit the same SCD policy: when Alice's segment changes, a new dim row is inserted with a new surrogate, and future facts in all three tables join to the new key.
- Cross-process queries work without effort:
SELECT segment, SUM(revenue), SUM(refund_amount), AVG(resolution_minutes) FROM dim_customer d LEFT JOIN fact_sales s LEFT JOIN fact_returns r LEFT JOIN fact_support_ticket t … GROUP BY segmentreturns a single row per segment with all three measures. - The conformance contract is enforced by the FK + the team agreement; both layers matter (DB constraints catch the technical violation, the team agreement catches the policy violation).
Output (one row per segment, joining all three facts).
| segment | total_revenue | total_refunds | avg_resolution_min |
|---|---|---|---|
| enterprise | 250000.00 | 12500.00 | 45 |
| growth | 95000.00 | 4200.00 | 60 |
| starter | 38000.00 | 1900.00 | 90 |
Rule of thumb: if you're tempted to build dim_customer_v2 for a new team, stop — the cost of forking the dim today is paid for the next decade in cross-process analytics that don't tie out.
kimball bus matrix — the org-wide design view of which dims serve which processes
The Kimball bus matrix is a 2D grid with business processes as rows and conformed dimensions as columns; a checkmark in cell (process, dim) says "this process's fact table joins to this dim". The matrix is the single artefact the data platform team uses to plan, govern, and communicate dimensional modeling at enterprise scale.
The shape of a bus matrix.
| business process | customer | product | date | store | employee | channel |
|---|---|---|---|---|---|---|
| Sales | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Returns | ✓ | ✓ | ✓ | ✓ | – | ✓ |
| Inventory snapshot | – | ✓ | ✓ (month) | ✓ | – | – |
| Support ticket | ✓ | ✓ | ✓ | – | ✓ | ✓ |
| Marketing campaign | ✓ | ✓ | ✓ | – | ✓ | ✓ |
| Web event | ✓ | ✓ | ✓ | – | – | ✓ |
| Subscription billing | ✓ | ✓ | ✓ | – | – | ✓ |
How to read it.
- Each row is a business process — a single subject area that produces a fact table.
- Each column is a conformed dimension shared across processes.
- A checkmark means "this process's fact joins to this dim".
- A dash means "this dim does not apply to this process".
- A "(month)" annotation means shrunken conformance — the inventory fact joins at month grain while sales joins at day grain.
Why the bus matrix is the senior-modeler deliverable.
-
It surfaces missing dims — if
customeris checked for sales but missing for support, that's a gap analytics will pay for later. - It exposes redundant facts — if two facts cover the same process at slightly different grains, you probably have a re-grain bug.
- It plans roadmap — each cell is a unit of work; add a fact, add a dim, conform a dim across processes.
- It governs ownership — each column has an owner (the team that owns the dim); each row has an owner (the team that owns the process).
- It travels across tools — the matrix lives in a wiki, a dbt docs page, or a Confluence page; every BI dashboard ties back to it.
Worked example — sketch a 3-row × 4-column bus matrix on a whiteboard
Detailed explanation. Whiteboard rounds love this question because it's tiny but reveals whether you actually use the bus matrix or just read about it. The drill is to design a 3-row × 4-column matrix in 60 seconds.
Question. Sketch a Kimball bus matrix for an e-commerce platform with three business processes (sales, returns, inventory snapshot) and four candidate conformed dimensions (customer, product, date, store). Mark which dims are conformed across all three, which are partial, and call out one shrunken-conformance cell.
Input. Three facts: fact_sales (transaction grain), fact_returns (transaction grain), fact_inventory_snapshot (daily snapshot, but stored monthly for cost reasons). Four candidate dims: dim_customer, dim_product, dim_date, dim_store.
Code.
-- Materialise the bus matrix as a small table for governance.
CREATE TABLE bus_matrix AS
SELECT * FROM (VALUES
('sales', 'customer', 'full'),
('sales', 'product', 'full'),
('sales', 'date', 'full'),
('sales', 'store', 'full'),
('returns', 'customer', 'full'),
('returns', 'product', 'full'),
('returns', 'date', 'full'),
('returns', 'store', 'full'),
('inventory_snapshot', 'customer', 'not_applicable'),
('inventory_snapshot', 'product', 'full'),
('inventory_snapshot', 'date', 'shrunken_to_month'),
('inventory_snapshot', 'store', 'full')
) AS t(business_process, dimension, conformance);
Step-by-step explanation.
-
salesandreturnsboth share all four dims —customer,product,date,store— at full conformance; cross-process queries (refund rate per segment per region) are trivial. -
inventory_snapshotdoes not usecustomer— the dim isnot_applicablebecause inventory is product-and-store-keyed, not customer-keyed. -
inventory_snapshotusesdim_dateat month grain (the snapshot fact stores one row per(product, store, month)); this is the shrunken-conformance cell. - The 12-row table is the bus matrix; pivot it in a BI tool for a visual grid.
- The matrix lives in version control alongside the model definitions; PR-reviewed changes to the matrix are the governance gate for adding new processes or dims.
Output (the matrix pivoted into the classic grid).
| business_process | customer | product | date | store |
|---|---|---|---|---|
| sales | full | full | full | full |
| returns | full | full | full | full |
| inventory_snapshot | – | full | month | full |
Rule of thumb: every interview-day system-design answer for an analytics platform should start with a hand-sketched bus matrix; the matrix anchors the rest of the design.
SQL
Topic — dimensional-modeling
Conformed-dimension and bus-matrix drills
SQL
Topic — aggregation
Cross-process aggregation practice
Solution Using a cross-process analytics query that depends on conformance
Code.
-- One query, three facts, one conformed dim_customer — the payoff of conformance.
WITH sales AS (
SELECT customer_key, SUM(revenue) AS total_revenue
FROM fact_sales
WHERE date_key BETWEEN 20260101 AND 20260331
GROUP BY customer_key
), returns AS (
SELECT customer_key, SUM(refund_amount) AS total_refunds
FROM fact_returns
WHERE date_key BETWEEN 20260101 AND 20260331
GROUP BY customer_key
), support AS (
SELECT customer_key,
COUNT(*) AS ticket_count,
AVG(resolution_minutes) AS avg_resolution_min
FROM fact_support_ticket
WHERE date_key BETWEEN 20260101 AND 20260331
GROUP BY customer_key
)
SELECT
d.segment,
d.country,
SUM(COALESCE(s.total_revenue, 0)) AS total_revenue,
SUM(COALESCE(r.total_refunds, 0)) AS total_refunds,
SUM(COALESCE(t.ticket_count, 0)) AS ticket_count,
AVG(t.avg_resolution_min) AS avg_resolution_min
FROM dim_customer d
LEFT JOIN sales s ON s.customer_key = d.customer_key
LEFT JOIN returns r ON r.customer_key = d.customer_key
LEFT JOIN support t ON t.customer_key = d.customer_key
WHERE d.is_current = TRUE
GROUP BY d.segment, d.country
ORDER BY total_revenue DESC;
Step-by-step trace.
| d.customer_key | d.segment | d.country | s.total_revenue | r.total_refunds | t.ticket_count |
|---|---|---|---|---|---|
| 101 | enterprise | US | 50000.00 | 2500.00 | 8 |
| 102 | growth | US | 18000.00 | 900.00 | 12 |
| 103 | enterprise | UK | 40000.00 | 1800.00 | 5 |
| 104 | starter | UK | 6000.00 | 300.00 | 20 |
- Each CTE aggregates one fact to the customer level; the grain of each CTE is
(customer_key). - The main
SELECTjoins all three CTEs todim_customer;LEFT JOINpreserves customers with no sales / no returns / no tickets. -
WHERE d.is_current = TRUEfilters the dim to one current row per customer; without it the rollup would multiply by SCD history depth. -
GROUP BY d.segment, d.countrycollapses to one row per segment-country bucket. - The conformance contract is what makes this query possible — every fact agrees that
customer_key = 101is the same customer.
Output:
| segment | country | total_revenue | total_refunds | ticket_count | avg_resolution_min |
|---|---|---|---|---|---|
| enterprise | US | 50000.00 | 2500.00 | 8 | 45 |
| enterprise | UK | 40000.00 | 1800.00 | 5 | 55 |
| growth | US | 18000.00 | 900.00 | 12 | 60 |
| starter | UK | 6000.00 | 300.00 | 20 | 90 |
Why this works — concept by concept:
-
Conformed surrogate key —
customer_keyresolves identically in all three facts; without this, the threeLEFT JOINs would silently disagree. -
One CTE per fact — pre-aggregating each fact to customer-grain before joining keeps the join cardinality manageable (
O(customers)notO(customers × sales × returns × tickets)). -
COALESCE on outer-joined measures — customers with no sales return
NULL;COALESCE(…, 0)turns nulls into zeros so theSUMis correct. -
is_current filter — required because
dim_customeris SCD Type 2; without it, the rollup multiplies by historical row count. -
Cost — three CTE scans are
O(rows in date range)each; the join isO(distinct customers); the whole query is cheap because the fact-level pre-aggregation collapses the data before the join.
5. The Kimball 4-step design process — business process → grain → dimensions → facts
kimball methodology — the canonical 4-step design process
The Kimball 4-step design is the recipe every dimensional model follows: (1) select the business process, (2) declare the grain, (3) choose the dimensions, (4) identify the facts. The order matters: skip a step, or do them out of order, and the model fails predictably — grain mistakes are the most expensive class of failure because they propagate through every downstream model and dashboard.
Step 1 — select the business process.
- Definition — a business process is a single measurement event the OLTP source produces: placing an order, shipping a parcel, returning a product, clicking a button, posting a payment.
- Rule — one business process per fact table; never combine "orders and returns" into a single fact because their grains and measures don't align.
- Sanity check — write the process down as "the system measures X when Y happens"; if you can't, you haven't picked a real process yet.
- Examples — "the system measures revenue when an order line is placed", "the system measures days-late when a shipment status changes", "the system measures attendance when a class session occurs".
Step 2 — declare the grain.
- Definition — the grain is a single sentence defining what one row of the fact table means.
- Rule — declare grain before you name a single column; defend it against finer (more atomic) and coarser (more aggregated) alternatives.
- Sanity check — fill in the blank: "One row of this fact represents ____."; the sentence is the grain.
- Examples — "one row per (order_id, line_id)", "one row per (account_id, day)", "one row per order, lifecycle-accumulating".
Step 3 — choose the dimensions.
- Definition — the dimensions are the who / what / when / where / why contexts surrounding the grain.
- Rule — pick the minimum set of dimensions the grain requires; don't drag in dimensions that aren't relevant to the process.
- Sanity check — for each candidate dim, ask "if I removed this dim, can I still answer the analytical questions the PM cares about?"; if yes, drop it.
-
Examples — for
fact_salesat order-line grain:dim_customer,dim_product,dim_date,dim_store; that's it.
Step 4 — identify the facts (measures).
- Definition — the facts are the numeric measures that aggregate up the dimension hierarchies.
-
Rule — favour additive measures (those that
SUMcorrectly across all dims); be wary of semi-additive (SUMonly across some dims) and non-additive (ratios, percentages) measures. -
Sanity check — for each candidate measure, ask "does
SUM(this) GROUP BY any dimmake sense?"; if no, it's not a fact. -
Examples — for
fact_sales:quantity,unit_price,discount_amount,revenue.
The iteration loop.
- One business process per iteration — design the sales model first, ship it, then iterate into returns.
- Re-declare grain when the source changes — if the OLTP team adds line-level cancellation, the grain may need to shift.
-
Add dims as use cases emerge —
dim_promotionmay not be needed on day 1 but becomes essential when marketing wants attribution. -
Add facts as measures are requested —
discount_pct(a derived ratio) may emerge later; store the additive components and derive the ratio in BI.
Worked example — apply the 4-step process to an e-commerce sales request
Detailed explanation. Interviews love this one because it lets the candidate demonstrate the process, not just the artefact. Below is a fully worked end-to-end design from a one-paragraph PM request.
Question. A PM says: "Our e-commerce platform sells products to customers via a web store. I want to analyse revenue by customer segment, product category, day, and store region — and drill into individual order lines." Apply the 4-step process and produce the fact_sales schema.
Input. OLTP source: orders(order_id, customer_id, order_ts, store_id) joined to order_lines(order_id, line_id, sku, qty, unit_price, discount). No other tables.
Code.
-- Step 1: business process = "online sales (order-line placement)".
-- Step 2: grain = "one row per (order_id, line_id)".
-- Step 3: dimensions = customer, product, date, store.
-- Step 4: facts = quantity, unit_price, discount_amount, revenue.
CREATE TABLE fact_sales (
sale_key BIGINT IDENTITY(1,1) PRIMARY KEY,
customer_key BIGINT NOT NULL REFERENCES dim_customer(customer_key),
product_key BIGINT NOT NULL REFERENCES dim_product(product_key),
date_key INT NOT NULL REFERENCES dim_date(date_key),
store_key BIGINT NOT NULL REFERENCES dim_store(store_key),
order_id VARCHAR(40) NOT NULL,
line_id INT NOT NULL,
quantity INT NOT NULL,
unit_price NUMERIC(12,2) NOT NULL,
discount_amount NUMERIC(12,2) NOT NULL DEFAULT 0,
revenue NUMERIC(12,2) NOT NULL,
CONSTRAINT uq_fact_sales UNIQUE (order_id, line_id)
);
-- Sanity-check the grain: COUNT(*) = COUNT(DISTINCT (order_id, line_id)).
Step-by-step explanation.
- Step 1 (business process) — "placing an order line on the web store"; it is a single measurement event; not a roll-up.
-
Step 2 (grain) — "one row per (order_id, line_id)"; the most atomic grain the source supports; declared in the table comment and enforced by the
UNIQUEconstraint. - Step 3 (dimensions) — customer (who), product (what), date (when), store (where); four FKs.
-
Step 4 (facts) — quantity, unit_price, discount_amount, revenue; the first three come directly from the source, revenue is derived (=
qty × price - discount) and stored to avoid recomputation in BI. - The order matters — process before grain before dims before facts; reversing the order (e.g. picking facts first) leads to mid-design rework when the grain doesn't support them.
Output (the schema deliverable).
| step | artefact |
|---|---|
| 1. business process | online sales |
| 2. grain | one row per (order_id, line_id) |
| 3. dimensions | customer, product, date, store |
| 4. facts | quantity, unit_price, discount_amount, revenue |
| final |
fact_sales table with 4 FKs + 2 degen dims + 4 measures |
Rule of thumb: every dimensional design should ship the 4-step process as a comment block on the fact table; the comment is the design rationale that survives turnover.
kimball methodology — common beginner mistakes when applying the 4-step process
- Skipping the business process step — jumping straight to grain or dimensions without naming the process leads to bloated facts that mix multiple processes.
- Declaring grain in plural — "one row per orders" is wrong; the grain is always singular ("one row per order line").
-
Picking dimensions before grain — the grain constrains the dimensions; you cannot have a
dim_line_itemif your grain isone row per order. -
Stuffing descriptive attributes into the fact — if you find yourself adding
customer_nameorproduct_categoryto the fact, you're modelling backwards; those belong on the dim. -
Picking non-additive measures as primary facts —
discount_pctandmargin_pctcannotSUM; store the additive components and let BI derive the ratios. -
Forgetting
dim_date— every fact has a time dimension; even factless facts have one; never store dates only asDATEcolumns on the fact without adate_keyFK.
Worked example — translate a tricky multi-process PM request into separate facts
Detailed explanation. A senior interviewer will deliberately mix processes in the PM request and see whether the candidate correctly splits them. Below is the drill.
Question. A PM says: "I want to track everything that happens to an order — when it's placed, when each line is shipped, when each line is returned. Build me one big fact." Resist the temptation; design three fact tables and explain why.
Input. OLTP sources: orders, order_lines, shipments(line_id, shipped_ts), returns(line_id, return_ts, refund_amount).
Code.
-- Three processes, three facts, three grains.
-- Process 1: order-line placement.
CREATE TABLE fact_sales (
customer_key BIGINT, product_key BIGINT, date_key INT, store_key BIGINT,
order_id VARCHAR(40), line_id INT,
quantity INT, unit_price NUMERIC(12,2), revenue NUMERIC(12,2)
);
-- grain: one row per (order_id, line_id) at placement time.
-- Process 2: shipment.
CREATE TABLE fact_shipments (
customer_key BIGINT, product_key BIGINT, ship_date_key INT, carrier_key BIGINT,
order_id VARCHAR(40), line_id INT,
quantity_shipped INT, days_from_order INT
);
-- grain: one row per (order_id, line_id) per shipment event.
-- Process 3: return.
CREATE TABLE fact_returns (
customer_key BIGINT, product_key BIGINT, return_date_key INT,
order_id VARCHAR(40), line_id INT,
quantity_returned INT, refund_amount NUMERIC(12,2),
days_from_ship INT
);
-- grain: one row per (order_id, line_id) per return event.
Step-by-step explanation.
- The PM's "one big fact" is the trap; combining three processes into one fact gives you a wide, sparse, semi-additive mess.
- Each process has its own measurement event (
placed,shipped,returned) and therefore its own fact table. - Each fact has its own grain and its own set of measures;
fact_sales.revenuedoesn't apply tofact_shipments, andfact_shipments.days_from_orderdoesn't apply tofact_sales. - The three facts share
customer_key,product_key, and(order_id, line_id)as a degenerate dim, so cross-process analytics (placed-to-shipped lag) is one join away. - An accumulating snapshot fact (
fact_order_lifecycle) can sit on top of the three transaction facts to give BI a denormalised one-row-per-order view; the three transaction facts remain the source of truth.
Output (the three-fact design with shared conformed dims).
| fact | grain | measures |
|---|---|---|
| fact_sales | 1 row per (order_id, line_id) at placement | quantity, unit_price, revenue |
| fact_shipments | 1 row per (order_id, line_id) per shipment | quantity_shipped, days_from_order |
| fact_returns | 1 row per (order_id, line_id) per return | quantity_returned, refund_amount, days_from_ship |
Rule of thumb: if a PM asks for "one big fact", count the measurement events in the request; each event is its own fact.
SQL
Topic — dimensional-modeling
4-step design process drills
SQL
Topic — database
Database / schema design practice
Solution Using a 4-step design checklist as a deliverable
Code.
-- Ship the 4-step design as a checklist row per fact table.
CREATE TABLE dim_design_checklist AS
SELECT * FROM (VALUES
('fact_sales', 1, 'business_process', 'online sales (order-line placement)'),
('fact_sales', 2, 'grain', 'one row per (order_id, line_id)'),
('fact_sales', 3, 'dimensions', 'customer, product, date, store'),
('fact_sales', 4, 'facts', 'quantity, unit_price, discount_amount, revenue'),
('fact_shipments', 1, 'business_process', 'shipment dispatch'),
('fact_shipments', 2, 'grain', 'one row per (order_id, line_id) per shipment'),
('fact_shipments', 3, 'dimensions', 'customer, product, ship_date, carrier'),
('fact_shipments', 4, 'facts', 'quantity_shipped, days_from_order'),
('fact_returns', 1, 'business_process', 'product return'),
('fact_returns', 2, 'grain', 'one row per (order_id, line_id) per return'),
('fact_returns', 3, 'dimensions', 'customer, product, return_date'),
('fact_returns', 4, 'facts', 'quantity_returned, refund_amount, days_from_ship')
) AS t(fact_table, step_no, step_name, value);
Step-by-step trace.
| fact_table | step_no | step_name | value |
|---|---|---|---|
| fact_sales | 1 | business_process | online sales (order-line placement) |
| fact_sales | 2 | grain | one row per (order_id, line_id) |
| fact_sales | 3 | dimensions | customer, product, date, store |
| fact_sales | 4 | facts | quantity, unit_price, discount_amount, revenue |
| fact_shipments | 1 | business_process | shipment dispatch |
| fact_shipments | 2 | grain | one row per (order_id, line_id) per shipment |
| fact_shipments | 3 | dimensions | customer, product, ship_date, carrier |
| fact_shipments | 4 | facts | quantity_shipped, days_from_order |
| fact_returns | 1 | business_process | product return |
| fact_returns | 2 | grain | one row per (order_id, line_id) per return |
| fact_returns | 3 | dimensions | customer, product, return_date |
| fact_returns | 4 | facts | quantity_returned, refund_amount, days_from_ship |
- Each fact gets exactly four rows in the checklist — one per step; if a fact has fewer, the design is incomplete.
- The
step_nocolumn enforces the canonical order;grainbeforedimensionsbeforefacts. - The
valuecolumn is plain English (not SQL); junior engineers and PMs can read it without warehouse fluency. - The table itself becomes the design contract; PR review against the checklist catches gaps before code lands.
- Three facts × four steps = 12 rows; with seven facts a real platform might have 28 rows, all in one queryable artefact.
Output.
| fact_table | grain |
|---|---|
| fact_sales | one row per (order_id, line_id) |
| fact_shipments | one row per (order_id, line_id) per shipment |
| fact_returns | one row per (order_id, line_id) per return |
Why this works — concept by concept:
- Checklist as deliverable — the design itself is a row-per-step artefact; this is what makes Kimball governable at scale.
-
One row per step per fact — turns a vague "did we follow the process" question into a
COUNT(*) GROUP BY fact_table HAVING COUNT(*) = 4query. - Plain-English value column — the design has to be readable by the PM, the analyst, and the DBA; SQL syntax in the design doc is over-engineering.
- Versionable in source control — the checklist lives in dbt YAML / the data catalog / a Confluence page; changes are PR-reviewed.
-
Cost —
O(1)to read; the actual schemas built from the design costO(N rows × N attributes)to materialise, but the design itself is constant-time recall.
Choosing the right SCD type (cheat sheet)
A one-screen cheat sheet for slowly changing dimension decisions — pick the type that matches the analytical question you'll be asked.
| You want to … | SCD type | Mechanism | Row impact |
|---|---|---|---|
| Fix typos in a name | Type 1 | overwrite in place | none |
| Update an email after a change | Type 1 | overwrite in place | none |
| Track historical customer segment | Type 2 | insert new row + close prior row | +1 row per change |
| Track historical region / country | Type 2 | insert new row + close prior row | +1 row per change |
| Track historical account manager | Type 2 | insert new row + close prior row | +1 row per change |
| Track historical credit score | Type 2 | insert new row + close prior row | +1 row per change |
| Track just-the-last-change region rename | Type 3 | add previous_* column |
none |
| Provide fast current-state lookup and full history | Type 6 | Type 2 + Type 1 + Type 3 hybrid | +1 row per change |
| Preserve everything in a never-purged audit | Type 4 (history table) | move old rows to dim_*_history
|
none in main dim |
| Roll back to a prior version on demand | Type 2 + retention | keep all rows; query valid_from window |
+1 row per change |
| Surface "as-of" reporting at any date | Type 2 with valid_from / valid_to
|
BETWEEN predicate |
full Type 2 cost |
| Audit who changed a field | Type 2 + updated_by audit column |
every row carries updater | full Type 2 cost |
| Track an immutable attribute (signup_date) | NA | never updated | none |
| Encode a derived current-state pivot | Type 6 (current_* column) | overwrite current_* on every row | none new |
Frequently asked questions
How is this Kimball deep-dive different from a generic data-modeling Q&A round-up?
A quick data modeling interview questions round-up usually covers OLTP normalisation (1NF / 2NF / 3NF), the Kimball-vs-Inmon-vs-Vault landscape, basic star vs snowflake schema vocabulary, and a few generic FAQ-style questions in one sitting — perfect for last-minute review. This deep-dive narrows the lens to Kimball dimensional modeling specifically, walking five numbered teaching sections — fact-vs-dim atoms, grain + the four SCD types with full SQL, conformed dimensions + the bus matrix, and the canonical 4-step design process — with worked examples, end-to-end schemas, and a per-attribute SCD assignment matrix. Pick the deep-dive when you have a week to prepare, want to teach dimensional modeling in a senior loop, or need the SCD MERGE statements memorised. Pick the round-up the night before. The two formats are complements, not duplicates — same family of topics, different depth.
What is the difference between a fact table and a dimension table in Kimball modeling?
Fact tables are narrow + tall + numeric — they have a handful of foreign-key columns (one per participating dimension), one or two degenerate dimensions (order_id, line_id), and a handful of additive measures (quantity, revenue, discount_amount). One row per business event; billions of rows over time. Dimension tables are short + wide + descriptive — they have a surrogate key, a business key, and dozens of descriptive text and date attributes (name, segment, country, category, valid_from, valid_to, is_current); one row per business entity per historical version. The interview-day rule of thumb: facts answer "how much"; dimensions answer "who / what / when / where / why". If you find a long text column on a fact, it's mis-modelled; if you find a numeric measure on a dim, it's mis-modelled.
What is grain and why is declaring it first the most important rule in Kimball?
grain is a single sentence that defines what one row of a fact table means — "one row per (order_id, line_id)", "one row per (account_id, day)", "one row per order, lifecycle-accumulating". It must be declared before any column is named, because every other modeling decision (which dimensions apply, which measures are additive, what the unique constraint is) follows from the grain. Mixing grains in the same fact table double-counts every aggregate; changing the grain after launch breaks every downstream dashboard; ambiguity about the grain produces queries that return wrong numbers silently. The Kimball discipline: write the grain in the table comment, the dbt YAML, the data-catalog entry, and the design wiki; multiple sources of truth keep it from drifting. Defending the grain in a design review — explaining why your grain isn't finer (more atomic) or coarser (more aggregated) — is the single biggest senior-modeler signal you can send.
What are the four SCD types I have to know cold for an interview?
The four canonical types are: Type 1 (overwrite) — replace the value in place; no history; cheap; use for typos, contact info, and rebrands where past values aren't analytically interesting. Type 2 (add new row) — insert a new row with a fresh surrogate key + valid_from / valid_to / is_current flags; full history; the workhorse; use for analytical attributes (segment, region, tier, account manager) where revenue-per-historical-value is a real question. Type 3 (add new column) — add a previous_* column to track one level of history per attribute; limited; use sparingly for one-time renames (region rebrand). Type 6 (hybrid 1+2+3) — the senior interview answer: layer all three patterns to give you full history and a fast current_* lookup and a per-row prior value. Memorise the SQL MERGE for each (section 3 ships all four). The interview rule: pick the SCD type per attribute, not per table — a single dim_customer can mix Type 1 on email, Type 2 on segment, Type 6 on current_segment.
What are conformed dimensions and how do they enable enterprise-scale analytics?
Conformed dimensions are dimensions designed to be shared across multiple business processes — fact_sales, fact_returns, fact_inventory, fact_support_ticket all join to the same dim_customer, dim_product, dim_date. The conformance contract specifies that the surrogate key, column set, SCD policy, and grain are identical across every fact that consumes the dim. Without conformance, every team builds their own dim_customer_sales, dim_customer_marketing, dim_customer_support, and cross-process analytics ("what % of customers who bought in Q1 returned in Q2 and opened a support ticket in Q3") becomes impossible because the definition of "customer" has diverged in five places. The Kimball bus matrix is the org-wide design artefact that surfaces conformance: business processes as rows, conformed dimensions as columns, checkmarks where the process uses the dim. Senior data modelers ship the bus matrix first as the platform's analytics blueprint; junior modelers skip it and pay for missing conformance the next decade.
Is Kimball dimensional modeling still relevant in 2026 with the lakehouse, Iceberg, and modern semantic layers?
Yes — emphatically. Snowflake, BigQuery, Databricks, and Redshift all publish reference architectures with star-schema gold-layer models. dbt is built around dimensional modeling — dim_ / fact_ naming is the de-facto convention, dbt_utils.generate_surrogate_key is universal, and dbt-expectations ships dimensional-model assertions. The lakehouse did not kill it: Iceberg, Delta, and Hudi tables still get a Kimball-shaped gold layer on top of the bronze raw + silver cleaned layers. Modern semantic layers (Cube, LookML, dbt-metricflow, Snowflake Semantic Layer) all assume a star-schema input. Data Vault complements rather than replaces — DV 2.0 increasingly handles the raw / integration layer with a Kimball star on top as the consumption layer. The reason dimensional modeling outlived every "Kimball is dead" hot take is that, underneath the storage layer, analysts still want a star schema because that is the shape SQL pivots and BI tools natively consume. In 2026, knowing Kimball cold is still the price of admission to a senior data-engineering interview at a serious analytics org.
Practice on PipeCode
PipeCode ships 450+ data-engineering interview problems — including SQL + Python drills keyed to the exact kimball data warehouse skill set this guide teaches (fact-vs-dim design, grain declaration, surrogate keys, SCD Types 1 / 2 / 3 / 6 MERGE patterns, conformed-dimension reasoning, bus-matrix governance, the 4-step design process). Whether you're drilling dimensional modeling interview questions the night before a screen or grinding the Kimball methodology vocabulary across a multi-week prep cycle, the practice library mirrors the same five-section mental model — plus the dbt, Snowflake, BigQuery, and Databricks warehouse stacks you'll wire into your production star schema.
Kick off via Explore practice →; drill the dimensional-modeling lane →; fan out into slowly-changing-data problems →; reinforce the broader SQL practice library →; rehearse aggregation patterns →; widen coverage on the full Python practice library →.





Top comments (0)