Gowtham Potureddi

Posted on May 31

Kimball Dimensional Modeling for Data Engineering Interviews: Facts, Dimensions, Grain & SCDs

#python #sql #dataengineering #interview

kimball data warehouse is still the gravity well every analytics interview falls back into: a fact table keyed by a handful of foreign keys, a halo of dimension table rows that describe context, a single declared grain that fixes what one row of the fact means, and a discipline for handling change over time — the four slowly changing dimension patterns (Type 1 overwrite, Type 2 new row, Type 3 new column, Type 6 hybrid). Together those primitives — plus the conformed dimensions that let fact_sales, fact_returns, and fact_inventory all share the same dim_customer — form the kimball methodology that powers Snowflake, BigQuery, Databricks, and Redshift warehouses in 2026, and the deep-dive interview track this guide walks through, end to end, in five numbered teaching sections.

This is the deep-dive companion to a tighter Q&A round-up: where a 5-section data-modeling cheat sheet ranges across OLTP normalisation, Inmon's third-normal-form warehouse, and Data Vault, this guide narrows the scope to dimensional modeling the way Ralph Kimball and Margy Ross actually teach it — fact tables vs dimension tables (the atoms), grain + SCDs (the decisions that bite you later), conformed dimensions + the bus matrix (modeling at enterprise scale), and the Kimball 4-step design process (business process → grain → dimensions → facts). Each section ends as dimensional modeling interview questions and answers: a question, a SQL or Python snippet, a traced execution, a sample output, and a concept-by-concept why this works breakdown — the exact shape kimball methodology rounds reward at FAANG, fintech, and every modern analytics shop.

When you want hands-on reps immediately after reading, browse dimensional-modeling practice →, drill slowly-changing-data problems →, sharpen SQL practice library →, rehearse aggregation reconciliation patterns →, reinforce database problems →, or widen coverage on the full Python practice library →.

On this page

Why Kimball is still the dimensional-modeling interview standard
Fact tables vs dimension tables — the atoms of Kimball modeling
Grain + Slowly Changing Dimensions — Type 1, 2, 3, 6 with SQL
Conformed dimensions + the Kimball bus matrix — modeling at enterprise scale
The Kimball 4-step design process — business process → grain → dimensions → facts
Choosing the right SCD type (cheat sheet)
Frequently asked questions
Practice on PipeCode

1. Why Kimball is still the dimensional-modeling interview standard

`kimball data warehouse` — the dimensional model that outlived every "Kimball is dead" hot take

The one-sentence invariant: kimball data warehouse is a denormalised star schema built around fact tables (narrow, tall, numeric, foreign-key heavy) and dimension tables (wide, short, descriptive, business-key plus surrogate-key), designed so that BI users can write SELECT … FROM fact JOIN dim_a JOIN dim_b GROUP BY dim_a.something, dim_b.something and get the answer back in under a second. Every "Kimball is dead" hot take since 2010 — Inmon CIF, Data Vault 2.0, "just put everything in S3", "the warehouse is the lakehouse" — has been followed by a quiet rediscovery that, underneath the storage layer, analysts still want a star schema because that is the shape SQL pivots and BI tools natively consume.

Why dimensional modeling won the BI war (and is still winning in 2026).

Query simplicity — SELECT … FROM fact_sales f JOIN dim_customer c JOIN dim_date d is teachable to a finance analyst in 30 minutes; a 6-table normalised join graph is not.
Read performance — denormalised dims mean fewer joins per query; the warehouse cost model rewards wide, short tables.
Stable interface — dim_customer evolves (Type 2 history) without breaking the customer_key join key downstream.
Tool affinity — Tableau, Looker, Power BI, ThoughtSpot, and Mode are all designed against a star schema; trying to drive them off a 3NF model is a 6-month integration project.
Mental model fit — humans think in nouns + verbs; dimensions are nouns, facts are verbs ("the customer bought the product on the date at the store"); the schema matches the sentence.

What interviewers actually score on kimball methodology rounds.

Vocabulary fluency — can you crisply distinguish fact, dimension, grain, surrogate key, business key, conformed dimension, SCD Type 2, bus matrix, degenerate dimension, junk dimension, and factless fact table in one sentence each?
The 4-step design — given a business request, can you walk business process → grain → dimensions → facts out loud, with explicit example values at each step?
grain defence — given a fact-table proposal, can you state its grain in one sentence and justify why no row is finer or coarser?
SCD type selection per attribute — given a dim_customer schema, can you mark each column as Type 1, 2, 3, or 6 and explain why?
Conformed-dimension reasoning — given three business processes (sales, returns, inventory), can you identify which dimensions should be shared (conformed) and which should remain process-local?
The bus matrix — can you sketch a small bus matrix on a whiteboard with processes as rows and dimensions as columns?

The 5-section interview map this guide walks through.

Section 2 — fact tables vs dimension tables — the two atoms; what columns belong where; the FK + measure structure of facts; the surrogate + business key + attribute structure of dims; the rule of thumb (facts are tall + skinny + numeric, dims are short + wide + descriptive).
Section 3 — grain + SCDs — declaring grain before any column is named; the four SCD types (1, 2, 3, 6) with full SQL MERGE patterns; the cost / benefit / use-case for each.
Section 4 — conformed dimensions + the Kimball bus matrix — building dim_customer once and reusing it across fact_sales, fact_returns, fact_inventory; the bus matrix as the org-wide design artefact.
Section 5 — the Kimball 4-step design process — business process → grain → dimensions → facts, with a fully worked end-to-end example.
Cheat sheet + FAQ — when to pick which SCD type, plus the senior-round Q&A every loop circles back to.

Why dimensional modeling is still the interview default in 2026 (and not "old hat").

Snowflake, BigQuery, Databricks, and Redshift all publish reference architectures with star-schema gold-layer models.
dbt is built around dimensional modeling; dim_ / fact_ naming is the de-facto convention; dbt_utils ships generate_surrogate_key, and dbt-expectations ships dimensional-model assertions.
The lakehouse did not kill it — Iceberg / Delta / Hudi tables still get a Kimball-shaped gold layer on top.
Modern semantic layers — Cube, LookML, Snowflake's Semantic Layer — all assume a star-schema input.
Data Vault complements, not replaces — DV 2.0 is increasingly used in the raw / integration layer with a Kimball star schema on top as the consumption layer.

Worked example — turn a one-sentence business request into the Kimball vocabulary

Detailed explanation. Real interviews probe whether you can translate a vague business request into the Kimball primitives (grain, fact, dimensions) on the spot. Below is the canonical translation drill — "track our online order line revenue by customer, product, date, and store" — and how a senior modeler maps it.

Question. A finance PM asks: "I want to see daily revenue by customer segment, product category, and store region, with the ability to drill into individual order lines." In one minute, name the fact table, its grain, and the four dimensions; include the surrogate-key columns on each.

Input. No tables exist yet. The OLTP source is a single orders table with one row per checkout and an embedded line-item array. The warehouse is empty.

Code.

-- Kimball translation of the PM request.
CREATE TABLE fact_sales (
    sale_key       BIGINT       NOT NULL,  -- surrogate PK of fact
    customer_key   BIGINT       NOT NULL,  -- FK -> dim_customer
    product_key    BIGINT       NOT NULL,  -- FK -> dim_product
    date_key       INT          NOT NULL,  -- FK -> dim_date (YYYYMMDD)
    store_key      BIGINT       NOT NULL,  -- FK -> dim_store
    order_id       VARCHAR(40)  NOT NULL,  -- degenerate dim (no own table)
    line_id        INT          NOT NULL,  -- degenerate dim
    quantity       INT          NOT NULL,
    unit_price     NUMERIC(12,2) NOT NULL,
    discount_amount NUMERIC(12,2) NOT NULL,
    revenue        NUMERIC(12,2) NOT NULL  -- = quantity * unit_price - discount
);
-- Grain: one row per (order_id, line_id) — i.e. one row per ordered SKU.

Step-by-step explanation.

Business process — online sales; the noun + verb pair tells you the process you're modelling.
Grain — one row per order line; declared in the comment on the table; defended against finer (no row per scan event) and coarser (no row per order header) alternatives.
Dimensions — customer, product, date, store; one per who / what / when / where; each becomes its own dim_* table with a surrogate *_key.
Facts (measures) — quantity, unit_price, discount_amount, revenue; numeric, additive, aggregatable by SUM.
Degenerate dimensions — order_id and line_id live on the fact (no separate dim) because they have no descriptive attributes worth storing in their own table.

Output (the column list, grouped by role).

role	columns	shape
surrogate PK	sale_key	numeric
FK to dim	customer_key, product_key, date_key, store_key	numeric
degenerate dim	order_id, line_id	string + int
measures	quantity, unit_price, discount_amount, revenue	numeric, additive

Rule of thumb: every interview translation answer should explicitly name the grain in one sentence before any column is listed. Skip the grain and the rest of the model is unfounded.

`dimensional modeling` — Kimball vs Inmon vs Data Vault in one minute

The three competing schools (and when each wins).

Kimball (dimensional modeling) — denormalised star / snowflake schema, fact + dim tables, grain-first, optimised for BI query speed and analyst ergonomics; the default for the gold / consumption layer.
Inmon (Corporate Information Factory) — fully normalised 3NF enterprise warehouse acting as the integration layer, with downstream Kimball-style data marts hanging off it; the heavyweight enterprise pattern, less common at modern startups.
Data Vault 2.0 — hub / link / satellite pattern designed for source-aware audit-friendly raw integration; excellent for the raw / integration layer, frequently combined with a Kimball star on top as the consumption layer.

Why Kimball wins the interview question by default.

The PM asks a single business question — "show me revenue by region by month"; the answer is "join fact_sales to dim_store and dim_date".
Junior engineers can read it — a star schema is teachable; a 7-table Data Vault is not.
It composes with everything — modern stacks layer Kimball on top of Vault or on top of a raw bronze lake.
The vocabulary travels — every BI tool, every dbt project, every dimensional textbook uses the same fact_* / dim_* convention.

The senior signal — "Kimball + something" beats "Kimball or nothing".

Kimball gold + Data Vault silver — DV in the integration layer handles source heterogeneity; Kimball star on top serves analysts.
Kimball gold + bronze raw — bronze.orders_raw lands the source untransformed; silver.orders_cleaned adds standardisation; gold.fact_sales is the Kimball star.
Kimball gold + semantic layer — define metrics in Cube / LookML / dbt-metricflow on top of the Kimball star; the metric definitions live above the schema.
Kimball gold + reverse ETL — push dim_customer Type-2 history back into Salesforce / HubSpot for marketing personalisation.

SQL
Topic — dimensional-modeling
Dimensional modeling drills

Practice →

SQL
Topic — slowly-changing-data
Slowly changing dimensions practice

Practice →

Solution Using a Kimball-vocabulary lookup matrix

Code.

-- Materialise the Kimball vocabulary as a quick-reference matrix
-- every interview answer can be grounded against.
CREATE TABLE kimball_vocabulary AS
SELECT * FROM (VALUES
    ('fact table',          'narrow + tall + numeric',   'one row per business event',         'fact_sales, fact_returns'),
    ('dimension table',     'short + wide + descriptive','one row per business entity',        'dim_customer, dim_product'),
    ('grain',               'declared sentence',          'what one row of the fact means',     '1 row = 1 order line'),
    ('surrogate key',       'numeric, system-generated', 'stable, history-aware join key',     'customer_key BIGINT'),
    ('business key',        'natural key from source',   'the OLTP identifier',                'customer_id VARCHAR(40)'),
    ('SCD Type 1',          'overwrite',                  'no history kept',                    'email change'),
    ('SCD Type 2',          'add new row',                'full history with valid_from/to',    'segment change'),
    ('SCD Type 3',          'add new column',             'limited history (current + prev)',   'sales_region rename'),
    ('SCD Type 6',          'hybrid (1+2+3)',             'full history + fast current lookup','enterprise customer dim'),
    ('conformed dimension', 'shared across fact tables', 'one dim_customer for sales+returns','dim_customer'),
    ('degenerate dim',      'on the fact, no own table', 'identifier with no attributes',      'order_id, line_id'),
    ('junk dimension',      'combine low-card flags',     'shrink fact width, group flags',     'dim_order_flags'),
    ('bridge table',        'many-to-many resolver',     'connect fact to multi-valued dim',   'bridge_account_customer'),
    ('factless fact',       'event with no measures',    'occurrence-only event log',          'fact_login, fact_class_attendance'),
    ('bus matrix',          'process x dim grid',         'org-wide dim conformance map',       'sales|returns|inventory x customer|product|date')
) AS t(term, shape, definition, example);

Step-by-step trace.

term	shape	definition	example
fact table	narrow + tall + numeric	one row per business event	fact_sales, fact_returns
dimension table	short + wide + descriptive	one row per business entity	dim_customer, dim_product
grain	declared sentence	what one row of the fact means	1 row = 1 order line
surrogate key	numeric, system-generated	stable, history-aware join key	customer_key BIGINT
business key	natural key from source	the OLTP identifier	customer_id VARCHAR(40)
SCD Type 1	overwrite	no history kept	email change
SCD Type 2	add new row	full history with valid_from/to	segment change
SCD Type 3	add new column	limited history (current + prev)	sales_region rename
SCD Type 6	hybrid (1+2+3)	full history + fast current lookup	enterprise customer dim
conformed dimension	shared across fact tables	one dim_customer for sales+returns	dim_customer
degenerate dim	on the fact, no own table	identifier with no attributes	order_id, line_id
junk dimension	combine low-card flags	shrink fact width, group flags	dim_order_flags
bridge table	many-to-many resolver	connect fact to multi-valued dim	bridge_account_customer
factless fact	event with no measures	occurrence-only event log	fact_login
bus matrix	process x dim grid	org-wide dim conformance map	sales

Rows 1–2 — the two atoms; every other term is built on top.
Row 3 — grain is declared as a sentence, not a column; it constrains every later modeling decision.
Rows 4–5 — every dimension has both a surrogate (system) key and a business (source) key; the surrogate joins, the business identifies.
Rows 6–9 — the four SCD types; section 3 ships full SQL for each.
Row 10 — conformed dimensions are the contract that lets cross-process analytics actually work; section 4 covers them in depth.
Rows 11–14 — the less-common but interview-favourite primitives (degenerate, junk, bridge, factless).
Row 15 — the bus matrix is the org-wide design artefact; section 4 sketches one.

Output.

term	example
fact table	fact_sales
dimension table	dim_customer
grain	1 row = 1 order line
surrogate key	customer_key
business key	customer_id
SCD Type 2	segment change
conformed dimension	dim_customer shared by sales + returns
bus matrix	sales

Why this works — concept by concept:

Vocabulary matrix — turns 15 fuzzy terms into one-row definitions you can recite under pressure; interviewers reward crisp definitions over hand-waving.
Shape column — pairs each term with its physical characteristic (narrow + tall, short + wide, etc.); this is the senior signal that you've actually built dim models, not just read about them.
Definition column — one sentence per term; if you can't fit it in a sentence, you don't understand it yet.
Example column — grounds every abstract term in a concrete table or column name; concrete examples beat abstract definitions in every interview.
Cost — O(1) to read; the actual schemas built from this vocabulary are O(N rows) to materialise but the vocabulary itself is constant-time recall.

2. Fact tables vs dimension tables — the atoms of Kimball modeling

`fact table` vs `dimension table` — the two atoms every Kimball schema is built from

fact table rows answer "how much"; dimension table rows answer "who / what / when / where / why". The two are physically different shapes — facts are narrow + tall + numeric (a handful of foreign-key columns plus a handful of additive measures, repeated millions of times); dims are short + wide + descriptive (one row per entity, dozens of text and date attributes, history-aware columns layered on top). Mastering Kimball is mostly mastering these two shapes and the discipline of not mixing them.

Anatomy of a fact table.

Foreign keys — one column per dimension that participates in the grain; named *_key (the surrogate key, not the business key).
Degenerate dimensions — identifiers that live on the fact because they have no descriptive attributes (order_id, line_id, transaction_id).
Measures — numeric columns aggregatable by SUM / COUNT / MIN / MAX / AVG; ideally fully additive across all dimensions.
Grain comment — a one-sentence declaration of what one row means; lives in the table comment so it can't drift from the schema.
Surrogate fact key — optional; many shops use the composite (order_id, line_id) as the natural PK and skip the surrogate fact key entirely.

Three flavours of fact tables (the senior interviewer will ask).

Transaction fact — one row per business event (one order line, one click, one payment); the most common shape; fully additive measures; example fact_sales.
Periodic snapshot fact — one row per (entity, time period); useful for slowly evolving balances; semi-additive over time (balance doesn't SUM across days); example fact_account_balance_daily.
Accumulating snapshot fact — one row per long-running process, with multiple date columns that get updated as the process advances; example fact_order_lifecycle with order_date, ship_date, deliver_date, return_date.

Anatomy of a dimension table.

Surrogate key — system-generated BIGINT, monotonically increasing; the only column the fact joins against; stable across SCD changes.
Business key — the OLTP source identifier (customer_id, product_sku); preserved for traceability but never used as a join key.
Descriptive attributes — name, segment, country, category, sub_category; the columns BI users GROUP BY.
SCD columns — valid_from, valid_to, is_current for Type 2; current_* / previous_* pairs for Type 3; both layers for Type 6.
Audit columns — inserted_at, updated_at, source_system; metadata that helps with reconciliation and DQ.

The rule of thumb (memorise this; recite it under pressure).

Facts are tall + skinny + numeric — billions of rows, ~10 columns, mostly FKs + measures.
Dims are short + wide + descriptive — thousands or millions of rows (rarely > 100M), 20-100 columns, mostly text + date attributes.
If you find yourself adding a long text column to a fact, you're modelling it wrong — that attribute belongs on a dimension.
If you find yourself adding a numeric measure to a dimension, you're modelling it wrong — that measure belongs on a fact.
The two atoms never mix — facts join to dims; dims do not join to dims (snowflake schema being the rare exception).

Worked example — design `fact_sales` and `dim_customer` from a raw `orders` source

Detailed explanation. Real interviews ask you to physically design both atoms from an OLTP source. Below is the canonical translation of a raw orders source into a Kimball fact + dim pair, with explicit column lists and surrogate-key wiring.

Question. Given an OLTP orders source with a customers lookup and a products lookup, design fact_sales and dim_customer end-to-end. Name every column, every type, every key, and the grain.

Input. The OLTP source has three tables: orders(order_id, customer_id, order_ts), order_lines(order_id, line_id, sku, qty, unit_price, discount), customers(customer_id, name, email, segment, country, signup_dt).

Code.

-- The fact: narrow + tall + numeric.
CREATE TABLE fact_sales (
    sale_key       BIGINT IDENTITY(1,1) PRIMARY KEY,
    customer_key   BIGINT NOT NULL REFERENCES dim_customer(customer_key),
    product_key    BIGINT NOT NULL REFERENCES dim_product(product_key),
    date_key       INT    NOT NULL REFERENCES dim_date(date_key),
    store_key      BIGINT NOT NULL REFERENCES dim_store(store_key),
    order_id       VARCHAR(40) NOT NULL,             -- degenerate dim
    line_id        INT         NOT NULL,             -- degenerate dim
    quantity       INT           NOT NULL,
    unit_price     NUMERIC(12,2) NOT NULL,
    discount_amount NUMERIC(12,2) NOT NULL DEFAULT 0,
    revenue        NUMERIC(12,2) NOT NULL,
    CONSTRAINT uq_fact_sales UNIQUE (order_id, line_id)
);
-- GRAIN: one row per (order_id, line_id) — i.e. one row per ordered SKU.

-- The dim: short + wide + descriptive + SCD-aware.
CREATE TABLE dim_customer (
    customer_key   BIGINT IDENTITY(1,1) PRIMARY KEY,   -- surrogate
    customer_id    VARCHAR(40) NOT NULL,               -- business key
    name           VARCHAR(120) NOT NULL,
    email          VARCHAR(120) NOT NULL,              -- SCD Type 1 (overwrite)
    segment        VARCHAR(32)  NOT NULL,              -- SCD Type 2 (history)
    country        VARCHAR(64)  NOT NULL,              -- SCD Type 2 (history)
    signup_date    DATE         NOT NULL,
    valid_from     TIMESTAMP    NOT NULL,
    valid_to       TIMESTAMP    NOT NULL DEFAULT '9999-12-31',
    is_current     BOOLEAN      NOT NULL DEFAULT TRUE,
    inserted_at    TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    source_system  VARCHAR(40)  NOT NULL DEFAULT 'oltp_orders'
);
CREATE INDEX ix_dim_customer_bk ON dim_customer(customer_id, is_current);

Step-by-step explanation.

fact_sales.sale_key — optional system-generated PK; some teams skip it and use (order_id, line_id) directly.
fact_sales.{customer,product,date,store}_key — four FKs, one per dimension; named *_key (never *_id) to signal "this is the surrogate, not the source identifier".
fact_sales.{order_id, line_id} — degenerate dimensions; they live on the fact because they have no descriptive attributes worth storing in their own table.
fact_sales.{quantity, unit_price, discount_amount, revenue} — additive measures; revenue is stored even though it's derivable, so that BI queries don't have to recompute it on every aggregation.
dim_customer.customer_key vs customer_id — surrogate (used by the fact) vs business (preserved for traceability); the fact never references customer_id directly.
dim_customer.email — SCD Type 1 (overwrite); change history of email addresses is rarely interesting and inflates row counts.
dim_customer.segment and country — SCD Type 2 (full history); these are historically interesting (a customer moved from "starter" to "enterprise" in March; revenue before vs after that change is a real question).
dim_customer.valid_from / valid_to / is_current — the SCD Type 2 columns; is_current is a precomputed flag so the WHERE is_current lookup is index-friendly.

Output (the fact + dim shapes side by side).

table	rows	columns	shape
fact_sales	100M+	~12 (4 FK + 2 degen + 4 measure + 2 admin)	tall + skinny + numeric
dim_customer	1-10M	~12 (1 surr + 1 biz + 5 attr + 3 SCD + 2 audit)	short + wide + descriptive

Rule of thumb: the physical difference between facts and dims is the easiest senior signal to give — "the fact is roughly 12 columns × 100M rows, the dim is roughly 12 columns × 1M rows, and the column types tell you which is which".

`surrogate key` vs `business key` — the rule that lets SCD history actually work

surrogate key is the system-generated BIGINT you stamp on every dimension row, and it is the only column the fact joins against. The business key (a.k.a. natural key) is the OLTP source identifier — customer_id = 'C-00012345', product_sku = 'SKU-RED-MEDIUM' — and you preserve it on the dim for traceability, but you never use it as a join key. The distinction matters because once you start tracking SCD Type 2 history, a single business key can map to multiple dim rows (one per historical version), so the join from fact to dim has to use the surrogate, never the business key.

The 5-rule surrogate-key discipline.

Surrogate is BIGINT, system-generated — IDENTITY(1,1) in SQL Server, GENERATED ALWAYS AS IDENTITY in PostgreSQL, AUTOINCREMENT in Snowflake.
Surrogate is opaque — never embed business meaning; customer_key = 12345 should mean nothing outside the warehouse.
Fact stores only the surrogate — never customer_id, always customer_key.
Business key + is_current = TRUE is the lookup recipe — to find the current row for a given customer: WHERE customer_id = 'C-001' AND is_current = TRUE.
The surrogate key remains stable when the source business key changes — if customer_id is reissued by the OLTP team, the surrogate stays put; the source change is just another SCD event.

Why this matters in interviews.

The Type 2 join is broken without surrogate keys — if the fact stores customer_id and dim_customer has 3 historical rows for that customer, the fact join is now 3x ambiguous.
Hashing replaces auto-increment in modern shops — dbt_utils.generate_surrogate_key(['customer_id', 'valid_from']) is the idiomatic Snowflake / BigQuery / dbt pattern.
Surrogate keys decouple the warehouse from the source — the source can renumber, re-key, or migrate; the warehouse surrogate is untouched.

SQL
Topic — dimensional-modeling
Fact and dimension design drills

Practice →

SQL
Topic — database
Database design practice

Practice →

Solution Using a surrogate-key + business-key join harness

Code.

-- Join fact_sales to dim_customer using the surrogate, with point-in-time correctness.
SELECT
    d.segment,
    d.country,
    SUM(f.revenue)         AS total_revenue,
    COUNT(DISTINCT f.order_id) AS order_count
FROM fact_sales f
JOIN dim_customer d
  ON d.customer_key = f.customer_key       -- surrogate join, never customer_id
WHERE d.is_current = TRUE                  -- current segment lookup
  AND f.date_key BETWEEN 20260101 AND 20260131
GROUP BY d.segment, d.country
ORDER BY total_revenue DESC;

Step-by-step trace.

f.customer_key	d.customer_key	d.customer_id	d.segment	d.is_current	f.revenue
101	101	C-001	enterprise	true	5000.00
102	102	C-002	starter	true	1200.00
103	103	C-003	enterprise	true	8400.00
101	101	C-001	enterprise	true	3300.00

The fact stores customer_key = 101, not customer_id = 'C-001'; the join is d.customer_key = f.customer_key.
is_current = TRUE filters to one dim row per business customer; without it the result set would multiply by SCD history depth.
Rows 1 + 4 belong to the same customer (C-001); they roll up in the GROUP BY because they share the same segment + country.
The grain of the result is one row per (segment, country); each cell is the SUM(revenue) and COUNT(DISTINCT order_id) for that bucket.
The WHERE date_key BETWEEN clause hits the fact-side partition pruning; the dim is small enough that no partitioning is needed.

Output:

segment	country	total_revenue	order_count
enterprise	US	16700.00	3
starter	UK	1200.00	1

Why this works — concept by concept:

Surrogate-key join — d.customer_key = f.customer_key is the only valid join shape; it survives SCD Type 2 history and source re-keying.
is_current filter — without WHERE is_current = TRUE, the join multiplies by historical depth; with it, you get one current row per customer.
date_key partition pruning — date_key BETWEEN 20260101 AND 20260131 lets the warehouse skip every other partition; this is why we use INT YYYYMMDD for date keys.
Additive measures — SUM(revenue) is safe because revenue is fully additive across all four dims; this is the payoff for storing the derived revenue column on the fact.
Cost — fact scan is O(rows in matching partitions); dim join is O(distinct customers); the surrogate key makes both lookups index-friendly.

3. Grain + Slowly Changing Dimensions — Type 1, 2, 3, 6 with SQL

`grain` — declare it first, defend it forever

grain is the single sentence that defines what one row of a fact table means, declared before you name a single column, and defended for the life of the table. "One row per order line", "one row per customer per day", "one row per order, accumulated across the lifecycle" — three different grains, three different fact tables, three different physical shapes. The Kimball discipline is declare the grain first, never mix grains in the same fact table, and never change the grain after the table is built.

The three grain families.

Transaction grain — one row per business event; the most common; example "one row per (order_id, line_id)"; measures are fully additive.
Periodic snapshot grain — one row per (entity, period); example "one row per (account_id, date_key)"; measures are semi-additive over time (balance does not SUM across days).
Accumulating snapshot grain — one row per long-running process, updated in place as the process advances; example "one row per order, with ordered_date_key, shipped_date_key, delivered_date_key, returned_date_key"; measures track lag (days_to_ship, days_to_deliver).

Why grain has to be declared first (and never changed later).

The grain is the schema — the FK list, the degenerate-dim list, the measure list, and the additivity rules all follow from the grain.
Mixing grains corrupts every aggregate — if some rows are (order_id, line_id) and others are (order_id) alone, SUM(revenue) GROUP BY product_key double-counts on the order-level rows.
Changing the grain breaks every downstream model — a re-grain triggers a coordinated re-publish of every BI dashboard that consumed the prior grain.
The grain is the contract — write it in the table comment, the dbt model docstring, the data catalog, and the wiki; multiple sources of truth keep it from drifting.

Concrete grain examples (memorise the wording).

"One row per (order_id, line_id)" — transaction grain for fact_sales.
"One row per (account_id, snapshot_date_key)" — periodic snapshot grain for fact_account_balance_daily.
"One row per order, lifecycle-accumulating" — accumulating snapshot grain for fact_order_lifecycle.
"One row per (customer_id, day, event_name)" — semi-aggregated event grain for fact_user_event_daily.
"One row per class session, no measures" — factless fact for fact_class_attendance.

`slowly changing dimension` — the four types you have to know cold

The acronym SCD covers strategies for handling change in dimension attributes over time, and every Kimball interview will probe at least Types 1, 2, and 6. The trick is not memorising the types; it is knowing which type to pick per attribute and writing the MERGE statements from memory.

SCD Type 1 — overwrite (no history)

Detailed explanation. SCD Type 1 simply overwrites the existing value in place; no history is preserved. Use it for attributes where past values are not interesting (typos, formatting changes, contact-info updates), and where the cost of preserving history outweighs the analytical value.

Question. A customer changes their email from alice@old.com to alice@new.com. Write the SCD Type 1 MERGE that updates the dimension.

Input. Source row: customer_id='C-001', email='alice@new.com'. Existing dim row: customer_key=101, customer_id='C-001', email='alice@old.com'.

Code.

-- SCD Type 1: overwrite in place.
MERGE INTO dim_customer_t1 AS tgt
USING (
    SELECT
        'C-001'              AS customer_id,
        'alice@new.com'      AS email,
        'Alice Smith'        AS name
) AS src
ON tgt.customer_id = src.customer_id
WHEN MATCHED THEN UPDATE SET
    tgt.email = src.email,
    tgt.name  = src.name
WHEN NOT MATCHED THEN INSERT (customer_id, email, name)
VALUES (src.customer_id, src.email, src.name);

Step-by-step explanation.

MERGE INTO dim_customer_t1 targets the dim table; the alias tgt is conventional.
USING (SELECT …) AS src lifts the new source row into a CTE-like alias; in production this would be a CTE over the staging table.
ON tgt.customer_id = src.customer_id matches on the business key; this is the only SCD type where matching on business key is safe (because there is no history).
WHEN MATCHED THEN UPDATE overwrites the email + name in place; the prior values are lost forever.
WHEN NOT MATCHED THEN INSERT covers the brand-new-customer case; first-time customers get a fresh row.

Output (the dim after the merge).

customer_key	customer_id	email	name
101	C-001	alice@new.com	Alice Smith

Rule of thumb: Type 1 is fast and cheap but lossy; use it for attributes nobody will ever ask "what was that on Feb 14th" about.

SCD Type 2 — add a new row (full history)

Detailed explanation. SCD Type 2 is the workhorse of dimensional modeling: when an attribute changes, insert a new row with a fresh surrogate key and stamp valid_from + valid_to + is_current on both the old and new rows. The prior row's valid_to becomes the new row's valid_from; the prior row's is_current becomes FALSE.

Question. Customer C-001 upgrades from starter to enterprise segment on 2026-04-15 10:30:00. Write the SCD Type 2 MERGE (or insert + update pair) that closes the old row and inserts the new one.

Input. Existing dim row: customer_key=101, customer_id='C-001', segment='starter', valid_from='2025-01-01', valid_to='9999-12-31', is_current=TRUE. Source change: customer_id='C-001', segment='enterprise', change_ts='2026-04-15 10:30:00'.

Code.

-- SCD Type 2: insert + update pair (the classic 2-step pattern).
BEGIN;

-- Step 1: close out the current row.
UPDATE dim_customer
SET
    valid_to   = TIMESTAMP '2026-04-15 10:30:00',
    is_current = FALSE
WHERE customer_id = 'C-001'
  AND is_current  = TRUE;

-- Step 2: insert the new current row.
INSERT INTO dim_customer (
    customer_id, name, email, segment, country, signup_date,
    valid_from, valid_to, is_current
)
SELECT
    'C-001', name, email, 'enterprise', country, signup_date,
    TIMESTAMP '2026-04-15 10:30:00',
    TIMESTAMP '9999-12-31',
    TRUE
FROM dim_customer
WHERE customer_id = 'C-001'
  AND valid_to = TIMESTAMP '2026-04-15 10:30:00'   -- the row we just closed
LIMIT 1;

COMMIT;

Step-by-step explanation.

Step 1 closes the prior current row by stamping valid_to = change_ts and is_current = FALSE; this row now represents the historical state.
Step 2 inserts a new row with a fresh surrogate key (auto-generated by IDENTITY), segment = 'enterprise', valid_from = change_ts, valid_to = '9999-12-31', is_current = TRUE.
The WHERE valid_to = change_ts clause in step 2's SELECT is how we copy the immutable attributes (name, email, country, signup_date) from the prior row.
The two steps run inside a transaction so a downstream reader never sees the dim with zero current rows for C-001.
The surrogate key of the new row is different from the prior row's surrogate — that's the whole point; the fact table will join to whichever key matches the order's valid_from window.

Output (the dim after the merge — two rows now).

customer_key	customer_id	segment	valid_from	valid_to	is_current
101	C-001	starter	2025-01-01	2026-04-15 10:30	false
132	C-001	enterprise	2026-04-15 10:30	9999-12-31	true

Rule of thumb: SCD Type 2 inflates row count but preserves full history; pick it for any attribute where "what was the value on date X" is a real analytical question.

SCD Type 3 — add a new column (limited history)

Detailed explanation. SCD Type 3 adds a new column (typically previous_*) alongside the existing one, so the dim carries both the current and the immediately prior value side by side. It tracks one level of history per attribute; older history is lost.

Question. A company renames its sales_region from 'NorthAm' to 'Americas'. Track both the current and previous region on dim_store without inserting new rows.

Input. Existing dim row: store_key=11, store_id='S-100', sales_region='NorthAm'. Source change: store_id='S-100', sales_region='Americas', change_ts='2026-03-01'.

Code.

-- SCD Type 3: shift the current value into a previous column.
ALTER TABLE dim_store
    ADD COLUMN previous_sales_region VARCHAR(64),
    ADD COLUMN sales_region_changed_at TIMESTAMP;

UPDATE dim_store
SET
    previous_sales_region   = sales_region,
    sales_region            = 'Americas',
    sales_region_changed_at = TIMESTAMP '2026-03-01'
WHERE store_id = 'S-100';

Step-by-step explanation.

ALTER TABLE adds two new columns: previous_sales_region (the prior value) and sales_region_changed_at (the change timestamp).
The UPDATE shifts the existing sales_region value into previous_sales_region, then overwrites sales_region with the new value.
The row count of the dim is unchanged — Type 3 is in-place, no new rows.
The new column lets BI write SUM(revenue) GROUP BY sales_region for the current view and SUM(revenue) GROUP BY previous_sales_region for the prior view, without rewriting the fact joins.
Type 3 is brittle — if the region renames again a year later, the previous_* column now holds two-changes-ago by default; some shops add previous_previous_*, which quickly becomes silly.

Output (the dim after the merge).

store_key	store_id	sales_region	previous_sales_region	sales_region_changed_at
11	S-100	Americas	NorthAm	2026-03-01

Rule of thumb: Type 3 fits "we just renamed one attribute and analysts want a side-by-side compare for a few quarters". Use sparingly; if you need full history, escalate to Type 2.

SCD Type 6 — hybrid (1 + 2 + 3 combined)

Detailed explanation. SCD Type 6 is the senior interview answer: combine Type 1 (overwrite the current attribute in every historical row), Type 2 (insert new rows for change), and Type 3 (carry the prior value on every row) into a single hybrid pattern. The result is a dim where every row carries both its own historical value and the current value, so BI can pivot on either without re-joining.

Question. Track customer segment changes with full history (Type 2) and let a query say WHERE current_segment = 'enterprise' cheaply on every historical row (Type 1) and expose previous_segment on each new row (Type 3). Write the SCD Type 6 update.

Input. Two existing dim rows for C-001: the original starter row and the enterprise row inserted above.

Code.

-- SCD Type 6: insert new row + overwrite current_segment on every historical row.
BEGIN;

-- Step 1: close out the prior current row (Type 2 mechanics).
UPDATE dim_customer_t6
SET valid_to = TIMESTAMP '2026-04-15 10:30:00',
    is_current = FALSE
WHERE customer_id = 'C-001' AND is_current = TRUE;

-- Step 2: insert the new row with previous_segment carried (Type 3 mechanics).
INSERT INTO dim_customer_t6 (
    customer_id, name, email,
    segment, previous_segment, current_segment,
    valid_from, valid_to, is_current
)
SELECT
    'C-001', name, email,
    'enterprise',              -- this row's historical segment
    'starter',                 -- the prior segment (Type 3)
    'enterprise',              -- the current segment (Type 1)
    TIMESTAMP '2026-04-15 10:30:00',
    TIMESTAMP '9999-12-31',
    TRUE
FROM dim_customer_t6
WHERE customer_id = 'C-001'
  AND valid_to = TIMESTAMP '2026-04-15 10:30:00'
LIMIT 1;

-- Step 3: overwrite current_segment on every historical row (Type 1 mechanics).
UPDATE dim_customer_t6
SET current_segment = 'enterprise'
WHERE customer_id = 'C-001';

COMMIT;

Step-by-step explanation.

Step 1 mirrors SCD Type 2: close the prior current row by stamping valid_to + is_current = FALSE.
Step 2 inserts a new row with three segment columns: segment (the historical value for this row, here 'enterprise'), previous_segment (the prior value, here 'starter', the Type 3 carry-over), and current_segment (the as-of-now value, here also 'enterprise').
Step 3 mirrors SCD Type 1: overwrite current_segment on every historical row for C-001, so even the closed-out starter row now carries current_segment = 'enterprise'.
The payoff: BI can write WHERE current_segment = 'enterprise' and get all historical revenue for that customer regardless of which row matches the order date; or WHERE segment = 'enterprise' to filter by historical segment-at-time-of-purchase.
Type 6 is the senior answer because it solves the "we want both views" problem without two separate dim tables.

Output (the dim after the merge — two rows, both carrying current_segment = 'enterprise').

customer_key	customer_id	segment	previous_segment	current_segment	is_current
101	C-001	starter	NULL	enterprise	false
132	C-001	enterprise	starter	enterprise	true

Rule of thumb: Type 6 is the "I want history and fast current lookup" pattern; it costs one extra column per Type-1-overwritten attribute but eliminates a whole class of join + filter complexity.

`slowly changing dimension` — beginner mistakes to avoid

Joining the fact on the business key instead of the surrogate — breaks the moment you adopt SCD Type 2; the join multiplies by history depth.
Forgetting is_current = TRUE — every current-state query needs it; without it the result silently sums historical rows.
Letting valid_to be NULL — use '9999-12-31' instead so BETWEEN valid_from AND valid_to works without IS NULL branches.
Updating valid_from on an open row — valid_from is immutable once stamped; only valid_to and is_current flip during SCD updates.
Mixing Type 1 and Type 2 attributes in the same row without comment — every dim column should be annotated with its SCD type in the table comment or dbt YAML.
Picking Type 2 for every attribute "just in case" — Type 2 inflates row counts; pick the type that matches the analytical question you'll be asked.

SQL
Topic — slowly-changing-data
SCD practice problems

Practice →

SQL
Topic — dimensional-modeling
Grain-and-SCD dimensional modeling drills

Practice →

Solution Using a per-attribute SCD type assignment matrix

Code.

-- Codify the SCD type for every attribute on dim_customer.
CREATE TABLE dim_customer_scd_plan AS
SELECT * FROM (VALUES
    ('customer_id',  'business key',      'NA',     'preserved for traceability; not updated after first insert'),
    ('name',         'descriptive',       'Type 1', 'typos and rebrands; history not interesting'),
    ('email',        'descriptive',       'Type 1', 'overwrite; do not preserve email history'),
    ('phone',        'descriptive',       'Type 1', 'overwrite; do not preserve phone history'),
    ('segment',      'analytical',        'Type 2', 'revenue per historical segment is a real question'),
    ('country',      'analytical',        'Type 2', 'geo migration matters for tax + analytics'),
    ('account_mgr',  'analytical',        'Type 2', 'attribution to manager-at-time-of-sale'),
    ('credit_score', 'analytical',        'Type 2', 'risk analysis needs historical score'),
    ('signup_date',  'immutable',         'NA',     'never changes; set once at insert'),
    ('current_segment','derived',         'Type 6', 'overwrite on all rows for fast current-state lookup')
) AS t(attribute, role, scd_type, rationale);

Step-by-step trace.

attribute	role	scd_type	rationale
customer_id	business key	NA	preserved for traceability; not updated after first insert
name	descriptive	Type 1	typos and rebrands; history not interesting
email	descriptive	Type 1	overwrite; do not preserve email history
phone	descriptive	Type 1	overwrite; do not preserve phone history
segment	analytical	Type 2	revenue per historical segment is a real question
country	analytical	Type 2	geo migration matters for tax + analytics
account_mgr	analytical	Type 2	attribution to manager-at-time-of-sale
credit_score	analytical	Type 2	risk analysis needs historical score
signup_date	immutable	NA	never changes; set once at insert
current_segment	derived	Type 6	overwrite on all rows for fast current-state lookup

Rows 1, 9 — business key + immutable; never updated after first insert.
Rows 2-4 — Type 1; overwrite; cheap; loses history; appropriate for cosmetic and contact attributes.
Rows 5-8 — Type 2; the analytical attributes; revenue / risk / attribution per historical value is a real question.
Row 10 — Type 6 layered on top of segment; one extra column gives BI a fast "current state" pivot without a join.
The matrix is the deliverable; every senior data modeler ships a per-attribute SCD plan, not a blanket "everything is Type 2".

Output.

attribute	scd_type
customer_id	NA
name	Type 1
email	Type 1
segment	Type 2
country	Type 2
account_mgr	Type 2
credit_score	Type 2
current_segment	Type 6

Why this works — concept by concept:

Per-attribute SCD assignment — Kimball's discipline is "pick the SCD type per attribute, not per table"; a single dim can mix Types 1, 2, and 6 across its columns.
Type 1 for cosmetic, Type 2 for analytical — the rule of thumb that keeps row counts down without losing analytical value; cosmetic attributes (typos, rebrands, contact info) overwrite, analytical attributes (segment, region, tier) preserve history.
Type 6 for derived current-state — pairing a Type 2 attribute with a Type 1 current_* column gives BI both views with zero join cost.
Documentation as deliverable — the assignment matrix itself is shipped as part of the model design; without it the next engineer can't tell why email is Type 1 but country is Type 2.
Cost — O(1) to read the matrix; the actual updates cost O(rows-per-change) for Type 1 (overwrite all rows for that business key) vs O(1) for Type 2 (insert one new row).

4. Conformed dimensions + the Kimball bus matrix — modeling at enterprise scale

`conformed dimensions` — build `dim_customer` once, use it in every fact

conformed dimensions are dimensions designed to be shared across multiple business processes — fact_sales, fact_returns, fact_inventory all join to the same dim_customer, dim_product, dim_date. The conformance contract is the heart of Kimball at enterprise scale: without it, every team builds their own dim_customer_sales, dim_customer_marketing, dim_customer_support, and cross-process analytics becomes impossible because the definition of "customer" has diverged in five places.

The conformance contract — what makes a dim "conformed".

Same columns — dim_customer.segment means the same thing whether you join it to fact_sales or fact_returns.
Same surrogate-key generation — customer_key = 12345 resolves to the same business customer in every fact.
Same SCD policy — segment changes are tracked as Type 2 in every fact that uses the dim.
Same grain — if the dim is at customer-account level (not customer-individual level), every fact agrees on that grain.
Same source of truth — one team owns dim_customer; the other teams consume it, they don't fork it.

Why conformance matters in interviews.

Cross-process analytics depends on it — "what % of customers who bought in Q1 returned in Q2" requires fact_sales and fact_returns to share dim_customer.
Reconciliation breaks without it — if fact_sales.customer_key = 12345 is "Alice" but fact_returns.customer_key = 12345 is "Bob", every reconciliation query lies.
It is the senior signal — junior modelers build a dim per fact; senior modelers build a conformed dim and reuse it.
The Kimball bus matrix is the deliverable — section 4.2 walks through it.

Three flavours of conformance (with examples).

Identical conformance — the strongest; the dim row, the surrogate key, and every attribute match exactly across facts; example dim_date.
Shrunken conformance — a coarser version of the dim is used in lower-grain facts; example dim_date at month-grain (dim_month) for inventory snapshots while dim_date at day-grain serves fact_sales.
Subset conformance — one fact uses only a subset of the dim's rows (e.g. fact_internal_sales filters dim_customer to internal customers); attributes and keys match, but row set differs.

Worked example — design `dim_customer` once and use it in three facts

Detailed explanation. Real interviews ask you to demonstrate conformance by showing the same dim_customer being consumed by multiple fact_* tables. Below is the canonical three-fact pattern.

Question. Sales, returns, and customer-support tickets all need to be analysed by customer segment, country, and tier. Design dim_customer once and show how fact_sales, fact_returns, and fact_support_ticket all consume it.

Input. Three OLTP sources: oltp.orders, oltp.returns, oltp.support_tickets. The current state has three separate dim_customer_* tables, one per team; consolidate them.

Code.

-- One canonical conformed dim.
CREATE TABLE dim_customer (
    customer_key   BIGINT IDENTITY(1,1) PRIMARY KEY,
    customer_id    VARCHAR(40) NOT NULL,
    name           VARCHAR(120) NOT NULL,
    email          VARCHAR(120) NOT NULL,                -- Type 1
    segment        VARCHAR(32)  NOT NULL,                -- Type 2
    country        VARCHAR(64)  NOT NULL,                -- Type 2
    tier           VARCHAR(16)  NOT NULL,                -- Type 2
    current_segment VARCHAR(32) NOT NULL,                -- Type 6 (derived)
    valid_from     TIMESTAMP    NOT NULL,
    valid_to       TIMESTAMP    NOT NULL DEFAULT '9999-12-31',
    is_current     BOOLEAN      NOT NULL DEFAULT TRUE
);

-- Three facts, one dim.
CREATE TABLE fact_sales (
    sale_key       BIGINT IDENTITY(1,1) PRIMARY KEY,
    customer_key   BIGINT NOT NULL REFERENCES dim_customer(customer_key),
    product_key    BIGINT NOT NULL,
    date_key       INT    NOT NULL,
    revenue        NUMERIC(12,2) NOT NULL
);

CREATE TABLE fact_returns (
    return_key     BIGINT IDENTITY(1,1) PRIMARY KEY,
    customer_key   BIGINT NOT NULL REFERENCES dim_customer(customer_key),  -- conformed
    product_key    BIGINT NOT NULL,
    date_key       INT    NOT NULL,
    refund_amount  NUMERIC(12,2) NOT NULL
);

CREATE TABLE fact_support_ticket (
    ticket_key     BIGINT IDENTITY(1,1) PRIMARY KEY,
    customer_key   BIGINT NOT NULL REFERENCES dim_customer(customer_key),  -- conformed
    date_key       INT    NOT NULL,
    severity_key   BIGINT NOT NULL,
    resolution_minutes INT NOT NULL
);

Step-by-step explanation.

One dim_customer table is the single source of truth; every fact's customer_key FK references it.
All three facts agree on the customer_key surrogate; if Alice is customer_key = 12345 in sales, she is customer_key = 12345 in returns and support.
All three facts inherit the same SCD policy: when Alice's segment changes, a new dim row is inserted with a new surrogate, and future facts in all three tables join to the new key.
Cross-process queries work without effort: SELECT segment, SUM(revenue), SUM(refund_amount), AVG(resolution_minutes) FROM dim_customer d LEFT JOIN fact_sales s LEFT JOIN fact_returns r LEFT JOIN fact_support_ticket t … GROUP BY segment returns a single row per segment with all three measures.
The conformance contract is enforced by the FK + the team agreement; both layers matter (DB constraints catch the technical violation, the team agreement catches the policy violation).

Output (one row per segment, joining all three facts).

segment	total_revenue	total_refunds	avg_resolution_min
enterprise	250000.00	12500.00	45
growth	95000.00	4200.00	60
starter	38000.00	1900.00	90

Rule of thumb: if you're tempted to build dim_customer_v2 for a new team, stop — the cost of forking the dim today is paid for the next decade in cross-process analytics that don't tie out.

`kimball bus matrix` — the org-wide design view of which dims serve which processes

The Kimball bus matrix is a 2D grid with business processes as rows and conformed dimensions as columns; a checkmark in cell (process, dim) says "this process's fact table joins to this dim". The matrix is the single artefact the data platform team uses to plan, govern, and communicate dimensional modeling at enterprise scale.

The shape of a bus matrix.

business process	customer	product	date	store	employee	channel
Sales	✓	✓	✓	✓	✓	✓
Returns	✓	✓	✓	✓	–	✓
Inventory snapshot	–	✓	✓ (month)	✓	–	–
Support ticket	✓	✓	✓	–	✓	✓
Marketing campaign	✓	✓	✓	–	✓	✓
Web event	✓	✓	✓	–	–	✓
Subscription billing	✓	✓	✓	–	–	✓

How to read it.

Each row is a business process — a single subject area that produces a fact table.
Each column is a conformed dimension shared across processes.
A checkmark means "this process's fact joins to this dim".
A dash means "this dim does not apply to this process".
A "(month)" annotation means shrunken conformance — the inventory fact joins at month grain while sales joins at day grain.

Why the bus matrix is the senior-modeler deliverable.

It surfaces missing dims — if customer is checked for sales but missing for support, that's a gap analytics will pay for later.
It exposes redundant facts — if two facts cover the same process at slightly different grains, you probably have a re-grain bug.
It plans roadmap — each cell is a unit of work; add a fact, add a dim, conform a dim across processes.
It governs ownership — each column has an owner (the team that owns the dim); each row has an owner (the team that owns the process).
It travels across tools — the matrix lives in a wiki, a dbt docs page, or a Confluence page; every BI dashboard ties back to it.

Worked example — sketch a 3-row × 4-column bus matrix on a whiteboard

Detailed explanation. Whiteboard rounds love this question because it's tiny but reveals whether you actually use the bus matrix or just read about it. The drill is to design a 3-row × 4-column matrix in 60 seconds.

Question. Sketch a Kimball bus matrix for an e-commerce platform with three business processes (sales, returns, inventory snapshot) and four candidate conformed dimensions (customer, product, date, store). Mark which dims are conformed across all three, which are partial, and call out one shrunken-conformance cell.

Input. Three facts: fact_sales (transaction grain), fact_returns (transaction grain), fact_inventory_snapshot (daily snapshot, but stored monthly for cost reasons). Four candidate dims: dim_customer, dim_product, dim_date, dim_store.

Code.

-- Materialise the bus matrix as a small table for governance.
CREATE TABLE bus_matrix AS
SELECT * FROM (VALUES
    ('sales',              'customer', 'full'),
    ('sales',              'product',  'full'),
    ('sales',              'date',     'full'),
    ('sales',              'store',    'full'),
    ('returns',            'customer', 'full'),
    ('returns',            'product',  'full'),
    ('returns',            'date',     'full'),
    ('returns',            'store',    'full'),
    ('inventory_snapshot', 'customer', 'not_applicable'),
    ('inventory_snapshot', 'product',  'full'),
    ('inventory_snapshot', 'date',     'shrunken_to_month'),
    ('inventory_snapshot', 'store',    'full')
) AS t(business_process, dimension, conformance);

Step-by-step explanation.

sales and returns both share all four dims — customer, product, date, store — at full conformance; cross-process queries (refund rate per segment per region) are trivial.
inventory_snapshot does not use customer — the dim is not_applicable because inventory is product-and-store-keyed, not customer-keyed.
inventory_snapshot uses dim_date at month grain (the snapshot fact stores one row per (product, store, month)); this is the shrunken-conformance cell.
The 12-row table is the bus matrix; pivot it in a BI tool for a visual grid.
The matrix lives in version control alongside the model definitions; PR-reviewed changes to the matrix are the governance gate for adding new processes or dims.

Output (the matrix pivoted into the classic grid).

business_process	customer	product	date	store
sales	full	full	full	full
returns	full	full	full	full
inventory_snapshot	–	full	month	full

Rule of thumb: every interview-day system-design answer for an analytics platform should start with a hand-sketched bus matrix; the matrix anchors the rest of the design.

SQL
Topic — dimensional-modeling
Conformed-dimension and bus-matrix drills

Practice →

SQL
Topic — aggregation
Cross-process aggregation practice

Practice →

Solution Using a cross-process analytics query that depends on conformance

Code.

-- One query, three facts, one conformed dim_customer — the payoff of conformance.
WITH sales AS (
    SELECT customer_key, SUM(revenue) AS total_revenue
    FROM fact_sales
    WHERE date_key BETWEEN 20260101 AND 20260331
    GROUP BY customer_key
), returns AS (
    SELECT customer_key, SUM(refund_amount) AS total_refunds
    FROM fact_returns
    WHERE date_key BETWEEN 20260101 AND 20260331
    GROUP BY customer_key
), support AS (
    SELECT customer_key,
           COUNT(*)              AS ticket_count,
           AVG(resolution_minutes) AS avg_resolution_min
    FROM fact_support_ticket
    WHERE date_key BETWEEN 20260101 AND 20260331
    GROUP BY customer_key
)
SELECT
    d.segment,
    d.country,
    SUM(COALESCE(s.total_revenue, 0))      AS total_revenue,
    SUM(COALESCE(r.total_refunds, 0))      AS total_refunds,
    SUM(COALESCE(t.ticket_count, 0))       AS ticket_count,
    AVG(t.avg_resolution_min)               AS avg_resolution_min
FROM dim_customer d
LEFT JOIN sales   s ON s.customer_key = d.customer_key
LEFT JOIN returns r ON r.customer_key = d.customer_key
LEFT JOIN support t ON t.customer_key = d.customer_key
WHERE d.is_current = TRUE
GROUP BY d.segment, d.country
ORDER BY total_revenue DESC;

Step-by-step trace.

d.customer_key	d.segment	d.country	s.total_revenue	r.total_refunds	t.ticket_count
101	enterprise	US	50000.00	2500.00	8
102	growth	US	18000.00	900.00	12
103	enterprise	UK	40000.00	1800.00	5
104	starter	UK	6000.00	300.00	20

Each CTE aggregates one fact to the customer level; the grain of each CTE is (customer_key).
The main SELECT joins all three CTEs to dim_customer; LEFT JOIN preserves customers with no sales / no returns / no tickets.
WHERE d.is_current = TRUE filters the dim to one current row per customer; without it the rollup would multiply by SCD history depth.
GROUP BY d.segment, d.country collapses to one row per segment-country bucket.
The conformance contract is what makes this query possible — every fact agrees that customer_key = 101 is the same customer.

Output:

segment	country	total_revenue	total_refunds	ticket_count	avg_resolution_min
enterprise	US	50000.00	2500.00	8	45
enterprise	UK	40000.00	1800.00	5	55
growth	US	18000.00	900.00	12	60
starter	UK	6000.00	300.00	20	90

Why this works — concept by concept:

Conformed surrogate key — customer_key resolves identically in all three facts; without this, the three LEFT JOINs would silently disagree.
One CTE per fact — pre-aggregating each fact to customer-grain before joining keeps the join cardinality manageable (O(customers) not O(customers × sales × returns × tickets)).
COALESCE on outer-joined measures — customers with no sales return NULL; COALESCE(…, 0) turns nulls into zeros so the SUM is correct.
is_current filter — required because dim_customer is SCD Type 2; without it, the rollup multiplies by historical row count.
Cost — three CTE scans are O(rows in date range) each; the join is O(distinct customers); the whole query is cheap because the fact-level pre-aggregation collapses the data before the join.

5. The Kimball 4-step design process — business process → grain → dimensions → facts

`kimball methodology` — the canonical 4-step design process

The Kimball 4-step design is the recipe every dimensional model follows: (1) select the business process, (2) declare the grain, (3) choose the dimensions, (4) identify the facts. The order matters: skip a step, or do them out of order, and the model fails predictably — grain mistakes are the most expensive class of failure because they propagate through every downstream model and dashboard.

Step 1 — select the business process.

Definition — a business process is a single measurement event the OLTP source produces: placing an order, shipping a parcel, returning a product, clicking a button, posting a payment.
Rule — one business process per fact table; never combine "orders and returns" into a single fact because their grains and measures don't align.
Sanity check — write the process down as "the system measures X when Y happens"; if you can't, you haven't picked a real process yet.
Examples — "the system measures revenue when an order line is placed", "the system measures days-late when a shipment status changes", "the system measures attendance when a class session occurs".

Step 2 — declare the grain.

Definition — the grain is a single sentence defining what one row of the fact table means.
Rule — declare grain before you name a single column; defend it against finer (more atomic) and coarser (more aggregated) alternatives.
Sanity check — fill in the blank: "One row of this fact represents ____."; the sentence is the grain.
Examples — "one row per (order_id, line_id)", "one row per (account_id, day)", "one row per order, lifecycle-accumulating".

Step 3 — choose the dimensions.

Definition — the dimensions are the who / what / when / where / why contexts surrounding the grain.
Rule — pick the minimum set of dimensions the grain requires; don't drag in dimensions that aren't relevant to the process.
Sanity check — for each candidate dim, ask "if I removed this dim, can I still answer the analytical questions the PM cares about?"; if yes, drop it.
Examples — for fact_sales at order-line grain: dim_customer, dim_product, dim_date, dim_store; that's it.

Step 4 — identify the facts (measures).

Definition — the facts are the numeric measures that aggregate up the dimension hierarchies.
Rule — favour additive measures (those that SUM correctly across all dims); be wary of semi-additive (SUM only across some dims) and non-additive (ratios, percentages) measures.
Sanity check — for each candidate measure, ask "does SUM(this) GROUP BY any dim make sense?"; if no, it's not a fact.
Examples — for fact_sales: quantity, unit_price, discount_amount, revenue.

The iteration loop.

One business process per iteration — design the sales model first, ship it, then iterate into returns.
Re-declare grain when the source changes — if the OLTP team adds line-level cancellation, the grain may need to shift.
Add dims as use cases emerge — dim_promotion may not be needed on day 1 but becomes essential when marketing wants attribution.
Add facts as measures are requested — discount_pct (a derived ratio) may emerge later; store the additive components and derive the ratio in BI.

Worked example — apply the 4-step process to an e-commerce sales request

Detailed explanation. Interviews love this one because it lets the candidate demonstrate the process, not just the artefact. Below is a fully worked end-to-end design from a one-paragraph PM request.

Question. A PM says: "Our e-commerce platform sells products to customers via a web store. I want to analyse revenue by customer segment, product category, day, and store region — and drill into individual order lines." Apply the 4-step process and produce the fact_sales schema.

Input. OLTP source: orders(order_id, customer_id, order_ts, store_id) joined to order_lines(order_id, line_id, sku, qty, unit_price, discount). No other tables.

Code.

-- Step 1: business process = "online sales (order-line placement)".
-- Step 2: grain         = "one row per (order_id, line_id)".
-- Step 3: dimensions    = customer, product, date, store.
-- Step 4: facts         = quantity, unit_price, discount_amount, revenue.

CREATE TABLE fact_sales (
    sale_key       BIGINT IDENTITY(1,1) PRIMARY KEY,
    customer_key   BIGINT NOT NULL REFERENCES dim_customer(customer_key),
    product_key    BIGINT NOT NULL REFERENCES dim_product(product_key),
    date_key       INT    NOT NULL REFERENCES dim_date(date_key),
    store_key      BIGINT NOT NULL REFERENCES dim_store(store_key),
    order_id       VARCHAR(40) NOT NULL,
    line_id        INT         NOT NULL,
    quantity       INT           NOT NULL,
    unit_price     NUMERIC(12,2) NOT NULL,
    discount_amount NUMERIC(12,2) NOT NULL DEFAULT 0,
    revenue        NUMERIC(12,2) NOT NULL,
    CONSTRAINT uq_fact_sales UNIQUE (order_id, line_id)
);
-- Sanity-check the grain: COUNT(*) = COUNT(DISTINCT (order_id, line_id)).

Step-by-step explanation.

Step 1 (business process) — "placing an order line on the web store"; it is a single measurement event; not a roll-up.
Step 2 (grain) — "one row per (order_id, line_id)"; the most atomic grain the source supports; declared in the table comment and enforced by the UNIQUE constraint.
Step 3 (dimensions) — customer (who), product (what), date (when), store (where); four FKs.
Step 4 (facts) — quantity, unit_price, discount_amount, revenue; the first three come directly from the source, revenue is derived (= qty × price - discount) and stored to avoid recomputation in BI.
The order matters — process before grain before dims before facts; reversing the order (e.g. picking facts first) leads to mid-design rework when the grain doesn't support them.

Output (the schema deliverable).

step	artefact
1. business process	online sales
2. grain	one row per (order_id, line_id)
3. dimensions	customer, product, date, store
4. facts	quantity, unit_price, discount_amount, revenue
final	`fact_sales` table with 4 FKs + 2 degen dims + 4 measures

Rule of thumb: every dimensional design should ship the 4-step process as a comment block on the fact table; the comment is the design rationale that survives turnover.

`kimball methodology` — common beginner mistakes when applying the 4-step process

Skipping the business process step — jumping straight to grain or dimensions without naming the process leads to bloated facts that mix multiple processes.
Declaring grain in plural — "one row per orders" is wrong; the grain is always singular ("one row per order line").
Picking dimensions before grain — the grain constrains the dimensions; you cannot have a dim_line_item if your grain is one row per order.
Stuffing descriptive attributes into the fact — if you find yourself adding customer_name or product_category to the fact, you're modelling backwards; those belong on the dim.
Picking non-additive measures as primary facts — discount_pct and margin_pct cannot SUM; store the additive components and let BI derive the ratios.
Forgetting dim_date — every fact has a time dimension; even factless facts have one; never store dates only as DATE columns on the fact without a date_key FK.

Worked example — translate a tricky multi-process PM request into separate facts

Detailed explanation. A senior interviewer will deliberately mix processes in the PM request and see whether the candidate correctly splits them. Below is the drill.

Question. A PM says: "I want to track everything that happens to an order — when it's placed, when each line is shipped, when each line is returned. Build me one big fact." Resist the temptation; design three fact tables and explain why.

Input. OLTP sources: orders, order_lines, shipments(line_id, shipped_ts), returns(line_id, return_ts, refund_amount).

Code.

-- Three processes, three facts, three grains.

-- Process 1: order-line placement.
CREATE TABLE fact_sales (
    customer_key BIGINT, product_key BIGINT, date_key INT, store_key BIGINT,
    order_id VARCHAR(40), line_id INT,
    quantity INT, unit_price NUMERIC(12,2), revenue NUMERIC(12,2)
);
-- grain: one row per (order_id, line_id) at placement time.

-- Process 2: shipment.
CREATE TABLE fact_shipments (
    customer_key BIGINT, product_key BIGINT, ship_date_key INT, carrier_key BIGINT,
    order_id VARCHAR(40), line_id INT,
    quantity_shipped INT, days_from_order INT
);
-- grain: one row per (order_id, line_id) per shipment event.

-- Process 3: return.
CREATE TABLE fact_returns (
    customer_key BIGINT, product_key BIGINT, return_date_key INT,
    order_id VARCHAR(40), line_id INT,
    quantity_returned INT, refund_amount NUMERIC(12,2),
    days_from_ship INT
);
-- grain: one row per (order_id, line_id) per return event.

Step-by-step explanation.

The PM's "one big fact" is the trap; combining three processes into one fact gives you a wide, sparse, semi-additive mess.
Each process has its own measurement event (placed, shipped, returned) and therefore its own fact table.
Each fact has its own grain and its own set of measures; fact_sales.revenue doesn't apply to fact_shipments, and fact_shipments.days_from_order doesn't apply to fact_sales.
The three facts share customer_key, product_key, and (order_id, line_id) as a degenerate dim, so cross-process analytics (placed-to-shipped lag) is one join away.
An accumulating snapshot fact (fact_order_lifecycle) can sit on top of the three transaction facts to give BI a denormalised one-row-per-order view; the three transaction facts remain the source of truth.

Output (the three-fact design with shared conformed dims).

fact	grain	measures
fact_sales	1 row per (order_id, line_id) at placement	quantity, unit_price, revenue
fact_shipments	1 row per (order_id, line_id) per shipment	quantity_shipped, days_from_order
fact_returns	1 row per (order_id, line_id) per return	quantity_returned, refund_amount, days_from_ship

Rule of thumb: if a PM asks for "one big fact", count the measurement events in the request; each event is its own fact.

SQL
Topic — dimensional-modeling
4-step design process drills

Practice →

SQL
Topic — database
Database / schema design practice

Practice →

Solution Using a 4-step design checklist as a deliverable

Code.

-- Ship the 4-step design as a checklist row per fact table.
CREATE TABLE dim_design_checklist AS
SELECT * FROM (VALUES
    ('fact_sales',         1, 'business_process', 'online sales (order-line placement)'),
    ('fact_sales',         2, 'grain',            'one row per (order_id, line_id)'),
    ('fact_sales',         3, 'dimensions',       'customer, product, date, store'),
    ('fact_sales',         4, 'facts',            'quantity, unit_price, discount_amount, revenue'),

    ('fact_shipments',     1, 'business_process', 'shipment dispatch'),
    ('fact_shipments',     2, 'grain',            'one row per (order_id, line_id) per shipment'),
    ('fact_shipments',     3, 'dimensions',       'customer, product, ship_date, carrier'),
    ('fact_shipments',     4, 'facts',            'quantity_shipped, days_from_order'),

    ('fact_returns',       1, 'business_process', 'product return'),
    ('fact_returns',       2, 'grain',            'one row per (order_id, line_id) per return'),
    ('fact_returns',       3, 'dimensions',       'customer, product, return_date'),
    ('fact_returns',       4, 'facts',            'quantity_returned, refund_amount, days_from_ship')
) AS t(fact_table, step_no, step_name, value);

Step-by-step trace.

fact_table	step_no	step_name	value
fact_sales	1	business_process	online sales (order-line placement)
fact_sales	2	grain	one row per (order_id, line_id)
fact_sales	3	dimensions	customer, product, date, store
fact_sales	4	facts	quantity, unit_price, discount_amount, revenue
fact_shipments	1	business_process	shipment dispatch
fact_shipments	2	grain	one row per (order_id, line_id) per shipment
fact_shipments	3	dimensions	customer, product, ship_date, carrier
fact_shipments	4	facts	quantity_shipped, days_from_order
fact_returns	1	business_process	product return
fact_returns	2	grain	one row per (order_id, line_id) per return
fact_returns	3	dimensions	customer, product, return_date
fact_returns	4	facts	quantity_returned, refund_amount, days_from_ship

Each fact gets exactly four rows in the checklist — one per step; if a fact has fewer, the design is incomplete.
The step_no column enforces the canonical order; grain before dimensions before facts.
The value column is plain English (not SQL); junior engineers and PMs can read it without warehouse fluency.
The table itself becomes the design contract; PR review against the checklist catches gaps before code lands.
Three facts × four steps = 12 rows; with seven facts a real platform might have 28 rows, all in one queryable artefact.

Output.

fact_table	grain
fact_sales	one row per (order_id, line_id)
fact_shipments	one row per (order_id, line_id) per shipment
fact_returns	one row per (order_id, line_id) per return

Why this works — concept by concept:

Checklist as deliverable — the design itself is a row-per-step artefact; this is what makes Kimball governable at scale.
One row per step per fact — turns a vague "did we follow the process" question into a COUNT(*) GROUP BY fact_table HAVING COUNT(*) = 4 query.
Plain-English value column — the design has to be readable by the PM, the analyst, and the DBA; SQL syntax in the design doc is over-engineering.
Versionable in source control — the checklist lives in dbt YAML / the data catalog / a Confluence page; changes are PR-reviewed.
Cost — O(1) to read; the actual schemas built from the design cost O(N rows × N attributes) to materialise, but the design itself is constant-time recall.

Choosing the right SCD type (cheat sheet)

A one-screen cheat sheet for slowly changing dimension decisions — pick the type that matches the analytical question you'll be asked.

You want to …	SCD type	Mechanism	Row impact
Fix typos in a name	Type 1	overwrite in place	none
Update an email after a change	Type 1	overwrite in place	none
Track historical customer segment	Type 2	insert new row + close prior row	+1 row per change
Track historical region / country	Type 2	insert new row + close prior row	+1 row per change
Track historical account manager	Type 2	insert new row + close prior row	+1 row per change
Track historical credit score	Type 2	insert new row + close prior row	+1 row per change
Track just-the-last-change region rename	Type 3	add `previous_*` column	none
Provide fast current-state lookup and full history	Type 6	Type 2 + Type 1 + Type 3 hybrid	+1 row per change
Preserve everything in a never-purged audit	Type 4 (history table)	move old rows to `dim_*_history`	none in main dim
Roll back to a prior version on demand	Type 2 + retention	keep all rows; query `valid_from` window	+1 row per change
Surface "as-of" reporting at any date	Type 2 with `valid_from / valid_to`	`BETWEEN` predicate	full Type 2 cost
Audit who changed a field	Type 2 + `updated_by` audit column	every row carries updater	full Type 2 cost
Track an immutable attribute (signup_date)	NA	never updated	none
Encode a derived current-state pivot	Type 6 (current_* column)	overwrite current_* on every row	none new

Frequently asked questions

How is this Kimball deep-dive different from a generic data-modeling Q&A round-up?

A quick data modeling interview questions round-up usually covers OLTP normalisation (1NF / 2NF / 3NF), the Kimball-vs-Inmon-vs-Vault landscape, basic star vs snowflake schema vocabulary, and a few generic FAQ-style questions in one sitting — perfect for last-minute review. This deep-dive narrows the lens to Kimball dimensional modeling specifically, walking five numbered teaching sections — fact-vs-dim atoms, grain + the four SCD types with full SQL, conformed dimensions + the bus matrix, and the canonical 4-step design process — with worked examples, end-to-end schemas, and a per-attribute SCD assignment matrix. Pick the deep-dive when you have a week to prepare, want to teach dimensional modeling in a senior loop, or need the SCD MERGE statements memorised. Pick the round-up the night before. The two formats are complements, not duplicates — same family of topics, different depth.

What is the difference between a fact table and a dimension table in Kimball modeling?

Fact tables are narrow + tall + numeric — they have a handful of foreign-key columns (one per participating dimension), one or two degenerate dimensions (order_id, line_id), and a handful of additive measures (quantity, revenue, discount_amount). One row per business event; billions of rows over time. Dimension tables are short + wide + descriptive — they have a surrogate key, a business key, and dozens of descriptive text and date attributes (name, segment, country, category, valid_from, valid_to, is_current); one row per business entity per historical version. The interview-day rule of thumb: facts answer "how much"; dimensions answer "who / what / when / where / why". If you find a long text column on a fact, it's mis-modelled; if you find a numeric measure on a dim, it's mis-modelled.

What is `grain` and why is declaring it first the most important rule in Kimball?

grain is a single sentence that defines what one row of a fact table means — "one row per (order_id, line_id)", "one row per (account_id, day)", "one row per order, lifecycle-accumulating". It must be declared before any column is named, because every other modeling decision (which dimensions apply, which measures are additive, what the unique constraint is) follows from the grain. Mixing grains in the same fact table double-counts every aggregate; changing the grain after launch breaks every downstream dashboard; ambiguity about the grain produces queries that return wrong numbers silently. The Kimball discipline: write the grain in the table comment, the dbt YAML, the data-catalog entry, and the design wiki; multiple sources of truth keep it from drifting. Defending the grain in a design review — explaining why your grain isn't finer (more atomic) or coarser (more aggregated) — is the single biggest senior-modeler signal you can send.

What are the four SCD types I have to know cold for an interview?

The four canonical types are: Type 1 (overwrite) — replace the value in place; no history; cheap; use for typos, contact info, and rebrands where past values aren't analytically interesting. Type 2 (add new row) — insert a new row with a fresh surrogate key + valid_from / valid_to / is_current flags; full history; the workhorse; use for analytical attributes (segment, region, tier, account manager) where revenue-per-historical-value is a real question. Type 3 (add new column) — add a previous_* column to track one level of history per attribute; limited; use sparingly for one-time renames (region rebrand). Type 6 (hybrid 1+2+3) — the senior interview answer: layer all three patterns to give you full history and a fast current_* lookup and a per-row prior value. Memorise the SQL MERGE for each (section 3 ships all four). The interview rule: pick the SCD type per attribute, not per table — a single dim_customer can mix Type 1 on email, Type 2 on segment, Type 6 on current_segment.

What are conformed dimensions and how do they enable enterprise-scale analytics?

Conformed dimensions are dimensions designed to be shared across multiple business processes — fact_sales, fact_returns, fact_inventory, fact_support_ticket all join to the same dim_customer, dim_product, dim_date. The conformance contract specifies that the surrogate key, column set, SCD policy, and grain are identical across every fact that consumes the dim. Without conformance, every team builds their own dim_customer_sales, dim_customer_marketing, dim_customer_support, and cross-process analytics ("what % of customers who bought in Q1 returned in Q2 and opened a support ticket in Q3") becomes impossible because the definition of "customer" has diverged in five places. The Kimball bus matrix is the org-wide design artefact that surfaces conformance: business processes as rows, conformed dimensions as columns, checkmarks where the process uses the dim. Senior data modelers ship the bus matrix first as the platform's analytics blueprint; junior modelers skip it and pay for missing conformance the next decade.

Is Kimball dimensional modeling still relevant in 2026 with the lakehouse, Iceberg, and modern semantic layers?

Yes — emphatically. Snowflake, BigQuery, Databricks, and Redshift all publish reference architectures with star-schema gold-layer models. dbt is built around dimensional modeling — dim_ / fact_ naming is the de-facto convention, dbt_utils.generate_surrogate_key is universal, and dbt-expectations ships dimensional-model assertions. The lakehouse did not kill it: Iceberg, Delta, and Hudi tables still get a Kimball-shaped gold layer on top of the bronze raw + silver cleaned layers. Modern semantic layers (Cube, LookML, dbt-metricflow, Snowflake Semantic Layer) all assume a star-schema input. Data Vault complements rather than replaces — DV 2.0 increasingly handles the raw / integration layer with a Kimball star on top as the consumption layer. The reason dimensional modeling outlived every "Kimball is dead" hot take is that, underneath the storage layer, analysts still want a star schema because that is the shape SQL pivots and BI tools natively consume. In 2026, knowing Kimball cold is still the price of admission to a senior data-engineering interview at a serious analytics org.

Practice on PipeCode

PipeCode ships 450+ data-engineering interview problems — including SQL + Python drills keyed to the exact kimball data warehouse skill set this guide teaches (fact-vs-dim design, grain declaration, surrogate keys, SCD Types 1 / 2 / 3 / 6 MERGE patterns, conformed-dimension reasoning, bus-matrix governance, the 4-step design process). Whether you're drilling dimensional modeling interview questions the night before a screen or grinding the Kimball methodology vocabulary across a multi-week prep cycle, the practice library mirrors the same five-section mental model — plus the dbt, Snowflake, BigQuery, and Databricks warehouse stacks you'll wire into your production star schema.

Kick off via Explore practice →; drill the dimensional-modeling lane →; fan out into slowly-changing-data problems →; reinforce the broader SQL practice library →; rehearse aggregation patterns →; widen coverage on the full Python practice library →.

1. Why Kimball is still the dimensional-modeling interview standard

kimball data warehouse — the dimensional model that outlived every "Kimball is dead" hot take

Worked example — turn a one-sentence business request into the Kimball vocabulary

dimensional modeling — Kimball vs Inmon vs Data Vault in one minute

Solution Using a Kimball-vocabulary lookup matrix

2. Fact tables vs dimension tables — the atoms of Kimball modeling

fact table vs dimension table — the two atoms every Kimball schema is built from

Worked example — design fact_sales and dim_customer from a raw orders source

surrogate key vs business key — the rule that lets SCD history actually work

Solution Using a surrogate-key + business-key join harness

3. Grain + Slowly Changing Dimensions — Type 1, 2, 3, 6 with SQL

grain — declare it first, defend it forever

slowly changing dimension — the four types you have to know cold

SCD Type 1 — overwrite (no history)

SCD Type 2 — add a new row (full history)

SCD Type 3 — add a new column (limited history)

SCD Type 6 — hybrid (1 + 2 + 3 combined)

slowly changing dimension — beginner mistakes to avoid

Solution Using a per-attribute SCD type assignment matrix

4. Conformed dimensions + the Kimball bus matrix — modeling at enterprise scale

conformed dimensions — build dim_customer once, use it in every fact

Worked example — design dim_customer once and use it in three facts

kimball bus matrix — the org-wide design view of which dims serve which processes

Worked example — sketch a 3-row × 4-column bus matrix on a whiteboard

Solution Using a cross-process analytics query that depends on conformance

5. The Kimball 4-step design process — business process → grain → dimensions → facts

kimball methodology — the canonical 4-step design process

Worked example — apply the 4-step process to an e-commerce sales request

kimball methodology — common beginner mistakes when applying the 4-step process

Worked example — translate a tricky multi-process PM request into separate facts

Solution Using a 4-step design checklist as a deliverable

Choosing the right SCD type (cheat sheet)

Frequently asked questions

How is this Kimball deep-dive different from a generic data-modeling Q&A round-up?

What is the difference between a fact table and a dimension table in Kimball modeling?

What is grain and why is declaring it first the most important rule in Kimball?

What are the four SCD types I have to know cold for an interview?

What are conformed dimensions and how do they enable enterprise-scale analytics?

Is Kimball dimensional modeling still relevant in 2026 with the lakehouse, Iceberg, and modern semantic layers?

Practice on PipeCode

`kimball data warehouse` — the dimensional model that outlived every "Kimball is dead" hot take

`dimensional modeling` — Kimball vs Inmon vs Data Vault in one minute

`fact table` vs `dimension table` — the two atoms every Kimball schema is built from

Worked example — design `fact_sales` and `dim_customer` from a raw `orders` source

`surrogate key` vs `business key` — the rule that lets SCD history actually work

`grain` — declare it first, defend it forever

`slowly changing dimension` — the four types you have to know cold

`slowly changing dimension` — beginner mistakes to avoid

`conformed dimensions` — build `dim_customer` once, use it in every fact

Worked example — design `dim_customer` once and use it in three facts

`kimball bus matrix` — the org-wide design view of which dims serve which processes

`kimball methodology` — the canonical 4-step design process

`kimball methodology` — common beginner mistakes when applying the 4-step process

What is `grain` and why is declaring it first the most important rule in Kimball?