DEV Community

Cover image for Kimball Dimensional Modeling for Data Engineering Interviews: Facts, Dimensions, Grain & SCDs
Gowtham Potureddi
Gowtham Potureddi

Posted on

Kimball Dimensional Modeling for Data Engineering Interviews: Facts, Dimensions, Grain & SCDs

kimball data warehouse is still the gravity well every analytics interview falls back into: a fact table keyed by a handful of foreign keys, a halo of dimension table rows that describe context, a single declared grain that fixes what one row of the fact means, and a discipline for handling change over time — the four slowly changing dimension patterns (Type 1 overwrite, Type 2 new row, Type 3 new column, Type 6 hybrid). Together those primitives — plus the conformed dimensions that let fact_sales, fact_returns, and fact_inventory all share the same dim_customer — form the kimball methodology that powers Snowflake, BigQuery, Databricks, and Redshift warehouses in 2026, and the deep-dive interview track this guide walks through, end to end, in five numbered teaching sections.

This is the deep-dive companion to a tighter Q&A round-up: where a 5-section data-modeling cheat sheet ranges across OLTP normalisation, Inmon's third-normal-form warehouse, and Data Vault, this guide narrows the scope to dimensional modeling the way Ralph Kimball and Margy Ross actually teach it — fact tables vs dimension tables (the atoms), grain + SCDs (the decisions that bite you later), conformed dimensions + the bus matrix (modeling at enterprise scale), and the Kimball 4-step design process (business process → grain → dimensions → facts). Each section ends as dimensional modeling interview questions and answers: a question, a SQL or Python snippet, a traced execution, a sample output, and a concept-by-concept why this works breakdown — the exact shape kimball methodology rounds reward at FAANG, fintech, and every modern analytics shop.

PipeCode blog header for a deep-dive Kimball dimensional modeling guide — bold white headline 'Kimball Dimensional Modeling' with subtitle 'Facts · Dimensions · Grain · SCDs · Conformed' and a stylised central fact card with four dimensions in a star pattern plus a Kimball 4-step ribbon on a dark gradient with purple, orange, green, and amber accents and a small pipecode.ai attribution.

When you want hands-on reps immediately after reading, browse dimensional-modeling practice →, drill slowly-changing-data problems →, sharpen SQL practice library →, rehearse aggregation reconciliation patterns →, reinforce database problems →, or widen coverage on the full Python practice library →.


On this page


1. Why Kimball is still the dimensional-modeling interview standard

kimball data warehouse — the dimensional model that outlived every "Kimball is dead" hot take

The one-sentence invariant: kimball data warehouse is a denormalised star schema built around fact tables (narrow, tall, numeric, foreign-key heavy) and dimension tables (wide, short, descriptive, business-key plus surrogate-key), designed so that BI users can write SELECT … FROM fact JOIN dim_a JOIN dim_b GROUP BY dim_a.something, dim_b.something and get the answer back in under a second. Every "Kimball is dead" hot take since 2010 — Inmon CIF, Data Vault 2.0, "just put everything in S3", "the warehouse is the lakehouse" — has been followed by a quiet rediscovery that, underneath the storage layer, analysts still want a star schema because that is the shape SQL pivots and BI tools natively consume.

Why dimensional modeling won the BI war (and is still winning in 2026).

  • Query simplicitySELECT … FROM fact_sales f JOIN dim_customer c JOIN dim_date d is teachable to a finance analyst in 30 minutes; a 6-table normalised join graph is not.
  • Read performance — denormalised dims mean fewer joins per query; the warehouse cost model rewards wide, short tables.
  • Stable interfacedim_customer evolves (Type 2 history) without breaking the customer_key join key downstream.
  • Tool affinity — Tableau, Looker, Power BI, ThoughtSpot, and Mode are all designed against a star schema; trying to drive them off a 3NF model is a 6-month integration project.
  • Mental model fit — humans think in nouns + verbs; dimensions are nouns, facts are verbs ("the customer bought the product on the date at the store"); the schema matches the sentence.

What interviewers actually score on kimball methodology rounds.

  • Vocabulary fluency — can you crisply distinguish fact, dimension, grain, surrogate key, business key, conformed dimension, SCD Type 2, bus matrix, degenerate dimension, junk dimension, and factless fact table in one sentence each?
  • The 4-step design — given a business request, can you walk business process → grain → dimensions → facts out loud, with explicit example values at each step?
  • grain defence — given a fact-table proposal, can you state its grain in one sentence and justify why no row is finer or coarser?
  • SCD type selection per attribute — given a dim_customer schema, can you mark each column as Type 1, 2, 3, or 6 and explain why?
  • Conformed-dimension reasoning — given three business processes (sales, returns, inventory), can you identify which dimensions should be shared (conformed) and which should remain process-local?
  • The bus matrix — can you sketch a small bus matrix on a whiteboard with processes as rows and dimensions as columns?

The 5-section interview map this guide walks through.

  • Section 2 — fact tables vs dimension tables — the two atoms; what columns belong where; the FK + measure structure of facts; the surrogate + business key + attribute structure of dims; the rule of thumb (facts are tall + skinny + numeric, dims are short + wide + descriptive).
  • Section 3 — grain + SCDs — declaring grain before any column is named; the four SCD types (1, 2, 3, 6) with full SQL MERGE patterns; the cost / benefit / use-case for each.
  • Section 4 — conformed dimensions + the Kimball bus matrix — building dim_customer once and reusing it across fact_sales, fact_returns, fact_inventory; the bus matrix as the org-wide design artefact.
  • Section 5 — the Kimball 4-step design process — business process → grain → dimensions → facts, with a fully worked end-to-end example.
  • Cheat sheet + FAQ — when to pick which SCD type, plus the senior-round Q&A every loop circles back to.

Why dimensional modeling is still the interview default in 2026 (and not "old hat").

  • Snowflake, BigQuery, Databricks, and Redshift all publish reference architectures with star-schema gold-layer models.
  • dbt is built around dimensional modeling; dim_ / fact_ naming is the de-facto convention; dbt_utils ships generate_surrogate_key, and dbt-expectations ships dimensional-model assertions.
  • The lakehouse did not kill it — Iceberg / Delta / Hudi tables still get a Kimball-shaped gold layer on top.
  • Modern semantic layers — Cube, LookML, Snowflake's Semantic Layer — all assume a star-schema input.
  • Data Vault complements, not replaces — DV 2.0 is increasingly used in the raw / integration layer with a Kimball star schema on top as the consumption layer.

Worked example — turn a one-sentence business request into the Kimball vocabulary

Detailed explanation. Real interviews probe whether you can translate a vague business request into the Kimball primitives (grain, fact, dimensions) on the spot. Below is the canonical translation drill — "track our online order line revenue by customer, product, date, and store" — and how a senior modeler maps it.

Question. A finance PM asks: "I want to see daily revenue by customer segment, product category, and store region, with the ability to drill into individual order lines." In one minute, name the fact table, its grain, and the four dimensions; include the surrogate-key columns on each.

Input. No tables exist yet. The OLTP source is a single orders table with one row per checkout and an embedded line-item array. The warehouse is empty.

Code.

-- Kimball translation of the PM request.
CREATE TABLE fact_sales (
    sale_key       BIGINT       NOT NULL,  -- surrogate PK of fact
    customer_key   BIGINT       NOT NULL,  -- FK -> dim_customer
    product_key    BIGINT       NOT NULL,  -- FK -> dim_product
    date_key       INT          NOT NULL,  -- FK -> dim_date (YYYYMMDD)
    store_key      BIGINT       NOT NULL,  -- FK -> dim_store
    order_id       VARCHAR(40)  NOT NULL,  -- degenerate dim (no own table)
    line_id        INT          NOT NULL,  -- degenerate dim
    quantity       INT          NOT NULL,
    unit_price     NUMERIC(12,2) NOT NULL,
    discount_amount NUMERIC(12,2) NOT NULL,
    revenue        NUMERIC(12,2) NOT NULL  -- = quantity * unit_price - discount
);
-- Grain: one row per (order_id, line_id) — i.e. one row per ordered SKU.
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Business processonline sales; the noun + verb pair tells you the process you're modelling.
  2. Grainone row per order line; declared in the comment on the table; defended against finer (no row per scan event) and coarser (no row per order header) alternatives.
  3. Dimensionscustomer, product, date, store; one per who / what / when / where; each becomes its own dim_* table with a surrogate *_key.
  4. Facts (measures)quantity, unit_price, discount_amount, revenue; numeric, additive, aggregatable by SUM.
  5. Degenerate dimensionsorder_id and line_id live on the fact (no separate dim) because they have no descriptive attributes worth storing in their own table.

Output (the column list, grouped by role).

role columns shape
surrogate PK sale_key numeric
FK to dim customer_key, product_key, date_key, store_key numeric
degenerate dim order_id, line_id string + int
measures quantity, unit_price, discount_amount, revenue numeric, additive

Rule of thumb: every interview translation answer should explicitly name the grain in one sentence before any column is listed. Skip the grain and the rest of the model is unfounded.

dimensional modeling — Kimball vs Inmon vs Data Vault in one minute

The three competing schools (and when each wins).

  • Kimball (dimensional modeling) — denormalised star / snowflake schema, fact + dim tables, grain-first, optimised for BI query speed and analyst ergonomics; the default for the gold / consumption layer.
  • Inmon (Corporate Information Factory) — fully normalised 3NF enterprise warehouse acting as the integration layer, with downstream Kimball-style data marts hanging off it; the heavyweight enterprise pattern, less common at modern startups.
  • Data Vault 2.0 — hub / link / satellite pattern designed for source-aware audit-friendly raw integration; excellent for the raw / integration layer, frequently combined with a Kimball star on top as the consumption layer.

Why Kimball wins the interview question by default.

  • The PM asks a single business question — "show me revenue by region by month"; the answer is "join fact_sales to dim_store and dim_date".
  • Junior engineers can read it — a star schema is teachable; a 7-table Data Vault is not.
  • It composes with everything — modern stacks layer Kimball on top of Vault or on top of a raw bronze lake.
  • The vocabulary travels — every BI tool, every dbt project, every dimensional textbook uses the same fact_* / dim_* convention.

The senior signal — "Kimball + something" beats "Kimball or nothing".

  • Kimball gold + Data Vault silver — DV in the integration layer handles source heterogeneity; Kimball star on top serves analysts.
  • Kimball gold + bronze rawbronze.orders_raw lands the source untransformed; silver.orders_cleaned adds standardisation; gold.fact_sales is the Kimball star.
  • Kimball gold + semantic layer — define metrics in Cube / LookML / dbt-metricflow on top of the Kimball star; the metric definitions live above the schema.
  • Kimball gold + reverse ETL — push dim_customer Type-2 history back into Salesforce / HubSpot for marketing personalisation.

SQL
Topic — dimensional-modeling
Dimensional modeling drills

Practice →

SQL
Topic — slowly-changing-data
Slowly changing dimensions practice

Practice →

Solution Using a Kimball-vocabulary lookup matrix

Code.

-- Materialise the Kimball vocabulary as a quick-reference matrix
-- every interview answer can be grounded against.
CREATE TABLE kimball_vocabulary AS
SELECT * FROM (VALUES
    ('fact table',          'narrow + tall + numeric',   'one row per business event',         'fact_sales, fact_returns'),
    ('dimension table',     'short + wide + descriptive','one row per business entity',        'dim_customer, dim_product'),
    ('grain',               'declared sentence',          'what one row of the fact means',     '1 row = 1 order line'),
    ('surrogate key',       'numeric, system-generated', 'stable, history-aware join key',     'customer_key BIGINT'),
    ('business key',        'natural key from source',   'the OLTP identifier',                'customer_id VARCHAR(40)'),
    ('SCD Type 1',          'overwrite',                  'no history kept',                    'email change'),
    ('SCD Type 2',          'add new row',                'full history with valid_from/to',    'segment change'),
    ('SCD Type 3',          'add new column',             'limited history (current + prev)',   'sales_region rename'),
    ('SCD Type 6',          'hybrid (1+2+3)',             'full history + fast current lookup','enterprise customer dim'),
    ('conformed dimension', 'shared across fact tables', 'one dim_customer for sales+returns','dim_customer'),
    ('degenerate dim',      'on the fact, no own table', 'identifier with no attributes',      'order_id, line_id'),
    ('junk dimension',      'combine low-card flags',     'shrink fact width, group flags',     'dim_order_flags'),
    ('bridge table',        'many-to-many resolver',     'connect fact to multi-valued dim',   'bridge_account_customer'),
    ('factless fact',       'event with no measures',    'occurrence-only event log',          'fact_login, fact_class_attendance'),
    ('bus matrix',          'process x dim grid',         'org-wide dim conformance map',       'sales|returns|inventory x customer|product|date')
) AS t(term, shape, definition, example);
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

term shape definition example
fact table narrow + tall + numeric one row per business event fact_sales, fact_returns
dimension table short + wide + descriptive one row per business entity dim_customer, dim_product
grain declared sentence what one row of the fact means 1 row = 1 order line
surrogate key numeric, system-generated stable, history-aware join key customer_key BIGINT
business key natural key from source the OLTP identifier customer_id VARCHAR(40)
SCD Type 1 overwrite no history kept email change
SCD Type 2 add new row full history with valid_from/to segment change
SCD Type 3 add new column limited history (current + prev) sales_region rename
SCD Type 6 hybrid (1+2+3) full history + fast current lookup enterprise customer dim
conformed dimension shared across fact tables one dim_customer for sales+returns dim_customer
degenerate dim on the fact, no own table identifier with no attributes order_id, line_id
junk dimension combine low-card flags shrink fact width, group flags dim_order_flags
bridge table many-to-many resolver connect fact to multi-valued dim bridge_account_customer
factless fact event with no measures occurrence-only event log fact_login
bus matrix process x dim grid org-wide dim conformance map sales
  1. Rows 1–2 — the two atoms; every other term is built on top.
  2. Row 3 — grain is declared as a sentence, not a column; it constrains every later modeling decision.
  3. Rows 4–5 — every dimension has both a surrogate (system) key and a business (source) key; the surrogate joins, the business identifies.
  4. Rows 6–9 — the four SCD types; section 3 ships full SQL for each.
  5. Row 10 — conformed dimensions are the contract that lets cross-process analytics actually work; section 4 covers them in depth.
  6. Rows 11–14 — the less-common but interview-favourite primitives (degenerate, junk, bridge, factless).
  7. Row 15 — the bus matrix is the org-wide design artefact; section 4 sketches one.

Output.

term example
fact table fact_sales
dimension table dim_customer
grain 1 row = 1 order line
surrogate key customer_key
business key customer_id
SCD Type 2 segment change
conformed dimension dim_customer shared by sales + returns
bus matrix sales

Why this works — concept by concept:

  • Vocabulary matrix — turns 15 fuzzy terms into one-row definitions you can recite under pressure; interviewers reward crisp definitions over hand-waving.
  • Shape column — pairs each term with its physical characteristic (narrow + tall, short + wide, etc.); this is the senior signal that you've actually built dim models, not just read about them.
  • Definition column — one sentence per term; if you can't fit it in a sentence, you don't understand it yet.
  • Example column — grounds every abstract term in a concrete table or column name; concrete examples beat abstract definitions in every interview.
  • CostO(1) to read; the actual schemas built from this vocabulary are O(N rows) to materialise but the vocabulary itself is constant-time recall.

2. Fact tables vs dimension tables — the atoms of Kimball modeling

Visual diagram comparing fact tables vs dimension tables — a fact table card on the left showing FK columns + measures + the grain pill ('one row = one order line'), a dimension table card on the right showing descriptive attributes + surrogate key + business key + slowly-changing flag; a small bridge arrow between them; on a light PipeCode card.

fact table vs dimension table — the two atoms every Kimball schema is built from

fact table rows answer "how much"; dimension table rows answer "who / what / when / where / why". The two are physically different shapes — facts are narrow + tall + numeric (a handful of foreign-key columns plus a handful of additive measures, repeated millions of times); dims are short + wide + descriptive (one row per entity, dozens of text and date attributes, history-aware columns layered on top). Mastering Kimball is mostly mastering these two shapes and the discipline of not mixing them.

Anatomy of a fact table.

  • Foreign keys — one column per dimension that participates in the grain; named *_key (the surrogate key, not the business key).
  • Degenerate dimensions — identifiers that live on the fact because they have no descriptive attributes (order_id, line_id, transaction_id).
  • Measures — numeric columns aggregatable by SUM / COUNT / MIN / MAX / AVG; ideally fully additive across all dimensions.
  • Grain comment — a one-sentence declaration of what one row means; lives in the table comment so it can't drift from the schema.
  • Surrogate fact key — optional; many shops use the composite (order_id, line_id) as the natural PK and skip the surrogate fact key entirely.

Three flavours of fact tables (the senior interviewer will ask).

  • Transaction fact — one row per business event (one order line, one click, one payment); the most common shape; fully additive measures; example fact_sales.
  • Periodic snapshot fact — one row per (entity, time period); useful for slowly evolving balances; semi-additive over time (balance doesn't SUM across days); example fact_account_balance_daily.
  • Accumulating snapshot fact — one row per long-running process, with multiple date columns that get updated as the process advances; example fact_order_lifecycle with order_date, ship_date, deliver_date, return_date.

Anatomy of a dimension table.

  • Surrogate key — system-generated BIGINT, monotonically increasing; the only column the fact joins against; stable across SCD changes.
  • Business key — the OLTP source identifier (customer_id, product_sku); preserved for traceability but never used as a join key.
  • Descriptive attributesname, segment, country, category, sub_category; the columns BI users GROUP BY.
  • SCD columnsvalid_from, valid_to, is_current for Type 2; current_* / previous_* pairs for Type 3; both layers for Type 6.
  • Audit columnsinserted_at, updated_at, source_system; metadata that helps with reconciliation and DQ.

The rule of thumb (memorise this; recite it under pressure).

  • Facts are tall + skinny + numeric — billions of rows, ~10 columns, mostly FKs + measures.
  • Dims are short + wide + descriptive — thousands or millions of rows (rarely > 100M), 20-100 columns, mostly text + date attributes.
  • If you find yourself adding a long text column to a fact, you're modelling it wrong — that attribute belongs on a dimension.
  • If you find yourself adding a numeric measure to a dimension, you're modelling it wrong — that measure belongs on a fact.
  • The two atoms never mix — facts join to dims; dims do not join to dims (snowflake schema being the rare exception).

Worked example — design fact_sales and dim_customer from a raw orders source

Detailed explanation. Real interviews ask you to physically design both atoms from an OLTP source. Below is the canonical translation of a raw orders source into a Kimball fact + dim pair, with explicit column lists and surrogate-key wiring.

Question. Given an OLTP orders source with a customers lookup and a products lookup, design fact_sales and dim_customer end-to-end. Name every column, every type, every key, and the grain.

Input. The OLTP source has three tables: orders(order_id, customer_id, order_ts), order_lines(order_id, line_id, sku, qty, unit_price, discount), customers(customer_id, name, email, segment, country, signup_dt).

Code.

-- The fact: narrow + tall + numeric.
CREATE TABLE fact_sales (
    sale_key       BIGINT IDENTITY(1,1) PRIMARY KEY,
    customer_key   BIGINT NOT NULL REFERENCES dim_customer(customer_key),
    product_key    BIGINT NOT NULL REFERENCES dim_product(product_key),
    date_key       INT    NOT NULL REFERENCES dim_date(date_key),
    store_key      BIGINT NOT NULL REFERENCES dim_store(store_key),
    order_id       VARCHAR(40) NOT NULL,             -- degenerate dim
    line_id        INT         NOT NULL,             -- degenerate dim
    quantity       INT           NOT NULL,
    unit_price     NUMERIC(12,2) NOT NULL,
    discount_amount NUMERIC(12,2) NOT NULL DEFAULT 0,
    revenue        NUMERIC(12,2) NOT NULL,
    CONSTRAINT uq_fact_sales UNIQUE (order_id, line_id)
);
-- GRAIN: one row per (order_id, line_id) — i.e. one row per ordered SKU.

-- The dim: short + wide + descriptive + SCD-aware.
CREATE TABLE dim_customer (
    customer_key   BIGINT IDENTITY(1,1) PRIMARY KEY,   -- surrogate
    customer_id    VARCHAR(40) NOT NULL,               -- business key
    name           VARCHAR(120) NOT NULL,
    email          VARCHAR(120) NOT NULL,              -- SCD Type 1 (overwrite)
    segment        VARCHAR(32)  NOT NULL,              -- SCD Type 2 (history)
    country        VARCHAR(64)  NOT NULL,              -- SCD Type 2 (history)
    signup_date    DATE         NOT NULL,
    valid_from     TIMESTAMP    NOT NULL,
    valid_to       TIMESTAMP    NOT NULL DEFAULT '9999-12-31',
    is_current     BOOLEAN      NOT NULL DEFAULT TRUE,
    inserted_at    TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    source_system  VARCHAR(40)  NOT NULL DEFAULT 'oltp_orders'
);
CREATE INDEX ix_dim_customer_bk ON dim_customer(customer_id, is_current);
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. fact_sales.sale_key — optional system-generated PK; some teams skip it and use (order_id, line_id) directly.
  2. fact_sales.{customer,product,date,store}_key — four FKs, one per dimension; named *_key (never *_id) to signal "this is the surrogate, not the source identifier".
  3. fact_sales.{order_id, line_id} — degenerate dimensions; they live on the fact because they have no descriptive attributes worth storing in their own table.
  4. fact_sales.{quantity, unit_price, discount_amount, revenue} — additive measures; revenue is stored even though it's derivable, so that BI queries don't have to recompute it on every aggregation.
  5. dim_customer.customer_key vs customer_id — surrogate (used by the fact) vs business (preserved for traceability); the fact never references customer_id directly.
  6. dim_customer.email — SCD Type 1 (overwrite); change history of email addresses is rarely interesting and inflates row counts.
  7. dim_customer.segment and country — SCD Type 2 (full history); these are historically interesting (a customer moved from "starter" to "enterprise" in March; revenue before vs after that change is a real question).
  8. dim_customer.valid_from / valid_to / is_current — the SCD Type 2 columns; is_current is a precomputed flag so the WHERE is_current lookup is index-friendly.

Output (the fact + dim shapes side by side).

table rows columns shape
fact_sales 100M+ ~12 (4 FK + 2 degen + 4 measure + 2 admin) tall + skinny + numeric
dim_customer 1-10M ~12 (1 surr + 1 biz + 5 attr + 3 SCD + 2 audit) short + wide + descriptive

Rule of thumb: the physical difference between facts and dims is the easiest senior signal to give — "the fact is roughly 12 columns × 100M rows, the dim is roughly 12 columns × 1M rows, and the column types tell you which is which".

surrogate key vs business key — the rule that lets SCD history actually work

surrogate key is the system-generated BIGINT you stamp on every dimension row, and it is the only column the fact joins against. The business key (a.k.a. natural key) is the OLTP source identifier — customer_id = 'C-00012345', product_sku = 'SKU-RED-MEDIUM' — and you preserve it on the dim for traceability, but you never use it as a join key. The distinction matters because once you start tracking SCD Type 2 history, a single business key can map to multiple dim rows (one per historical version), so the join from fact to dim has to use the surrogate, never the business key.

The 5-rule surrogate-key discipline.

  • Surrogate is BIGINT, system-generatedIDENTITY(1,1) in SQL Server, GENERATED ALWAYS AS IDENTITY in PostgreSQL, AUTOINCREMENT in Snowflake.
  • Surrogate is opaque — never embed business meaning; customer_key = 12345 should mean nothing outside the warehouse.
  • Fact stores only the surrogate — never customer_id, always customer_key.
  • Business key + is_current = TRUE is the lookup recipe — to find the current row for a given customer: WHERE customer_id = 'C-001' AND is_current = TRUE.
  • The surrogate key remains stable when the source business key changes — if customer_id is reissued by the OLTP team, the surrogate stays put; the source change is just another SCD event.

Why this matters in interviews.

  • The Type 2 join is broken without surrogate keys — if the fact stores customer_id and dim_customer has 3 historical rows for that customer, the fact join is now 3x ambiguous.
  • Hashing replaces auto-increment in modern shopsdbt_utils.generate_surrogate_key(['customer_id', 'valid_from']) is the idiomatic Snowflake / BigQuery / dbt pattern.
  • Surrogate keys decouple the warehouse from the source — the source can renumber, re-key, or migrate; the warehouse surrogate is untouched.

SQL
Topic — dimensional-modeling
Fact and dimension design drills

Practice →

SQL
Topic — database
Database design practice

Practice →

Solution Using a surrogate-key + business-key join harness

Code.

-- Join fact_sales to dim_customer using the surrogate, with point-in-time correctness.
SELECT
    d.segment,
    d.country,
    SUM(f.revenue)         AS total_revenue,
    COUNT(DISTINCT f.order_id) AS order_count
FROM fact_sales f
JOIN dim_customer d
  ON d.customer_key = f.customer_key       -- surrogate join, never customer_id
WHERE d.is_current = TRUE                  -- current segment lookup
  AND f.date_key BETWEEN 20260101 AND 20260131
GROUP BY d.segment, d.country
ORDER BY total_revenue DESC;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

f.customer_key d.customer_key d.customer_id d.segment d.is_current f.revenue
101 101 C-001 enterprise true 5000.00
102 102 C-002 starter true 1200.00
103 103 C-003 enterprise true 8400.00
101 101 C-001 enterprise true 3300.00
  1. The fact stores customer_key = 101, not customer_id = 'C-001'; the join is d.customer_key = f.customer_key.
  2. is_current = TRUE filters to one dim row per business customer; without it the result set would multiply by SCD history depth.
  3. Rows 1 + 4 belong to the same customer (C-001); they roll up in the GROUP BY because they share the same segment + country.
  4. The grain of the result is one row per (segment, country); each cell is the SUM(revenue) and COUNT(DISTINCT order_id) for that bucket.
  5. The WHERE date_key BETWEEN clause hits the fact-side partition pruning; the dim is small enough that no partitioning is needed.

Output:

segment country total_revenue order_count
enterprise US 16700.00 3
starter UK 1200.00 1

Why this works — concept by concept:

  • Surrogate-key joind.customer_key = f.customer_key is the only valid join shape; it survives SCD Type 2 history and source re-keying.
  • is_current filter — without WHERE is_current = TRUE, the join multiplies by historical depth; with it, you get one current row per customer.
  • date_key partition pruningdate_key BETWEEN 20260101 AND 20260131 lets the warehouse skip every other partition; this is why we use INT YYYYMMDD for date keys.
  • Additive measuresSUM(revenue) is safe because revenue is fully additive across all four dims; this is the payoff for storing the derived revenue column on the fact.
  • Cost — fact scan is O(rows in matching partitions); dim join is O(distinct customers); the surrogate key makes both lookups index-friendly.

3. Grain + Slowly Changing Dimensions — Type 1, 2, 3, 6 with SQL

Visual diagram of grain + SCD types — a top grain card showing three example grains (transaction line, daily snapshot, accumulating snapshot); below it a 2x2 grid of SCD cards (Type 1 overwrite, Type 2 add new row, Type 3 add new column, Type 6 hybrid) each with a tiny mini-table illustration of how the row changes; on a light PipeCode card.

grain — declare it first, defend it forever

grain is the single sentence that defines what one row of a fact table means, declared before you name a single column, and defended for the life of the table. "One row per order line", "one row per customer per day", "one row per order, accumulated across the lifecycle" — three different grains, three different fact tables, three different physical shapes. The Kimball discipline is declare the grain first, never mix grains in the same fact table, and never change the grain after the table is built.

The three grain families.

  • Transaction grain — one row per business event; the most common; example "one row per (order_id, line_id)"; measures are fully additive.
  • Periodic snapshot grain — one row per (entity, period); example "one row per (account_id, date_key)"; measures are semi-additive over time (balance does not SUM across days).
  • Accumulating snapshot grain — one row per long-running process, updated in place as the process advances; example "one row per order, with ordered_date_key, shipped_date_key, delivered_date_key, returned_date_key"; measures track lag (days_to_ship, days_to_deliver).

Why grain has to be declared first (and never changed later).

  • The grain is the schema — the FK list, the degenerate-dim list, the measure list, and the additivity rules all follow from the grain.
  • Mixing grains corrupts every aggregate — if some rows are (order_id, line_id) and others are (order_id) alone, SUM(revenue) GROUP BY product_key double-counts on the order-level rows.
  • Changing the grain breaks every downstream model — a re-grain triggers a coordinated re-publish of every BI dashboard that consumed the prior grain.
  • The grain is the contract — write it in the table comment, the dbt model docstring, the data catalog, and the wiki; multiple sources of truth keep it from drifting.

Concrete grain examples (memorise the wording).

  • "One row per (order_id, line_id)" — transaction grain for fact_sales.
  • "One row per (account_id, snapshot_date_key)" — periodic snapshot grain for fact_account_balance_daily.
  • "One row per order, lifecycle-accumulating" — accumulating snapshot grain for fact_order_lifecycle.
  • "One row per (customer_id, day, event_name)" — semi-aggregated event grain for fact_user_event_daily.
  • "One row per class session, no measures" — factless fact for fact_class_attendance.

slowly changing dimension — the four types you have to know cold

The acronym SCD covers strategies for handling change in dimension attributes over time, and every Kimball interview will probe at least Types 1, 2, and 6. The trick is not memorising the types; it is knowing which type to pick per attribute and writing the MERGE statements from memory.

SCD Type 1 — overwrite (no history)

Detailed explanation. SCD Type 1 simply overwrites the existing value in place; no history is preserved. Use it for attributes where past values are not interesting (typos, formatting changes, contact-info updates), and where the cost of preserving history outweighs the analytical value.

Question. A customer changes their email from alice@old.com to alice@new.com. Write the SCD Type 1 MERGE that updates the dimension.

Input. Source row: customer_id='C-001', email='alice@new.com'. Existing dim row: customer_key=101, customer_id='C-001', email='alice@old.com'.

Code.

-- SCD Type 1: overwrite in place.
MERGE INTO dim_customer_t1 AS tgt
USING (
    SELECT
        'C-001'              AS customer_id,
        'alice@new.com'      AS email,
        'Alice Smith'        AS name
) AS src
ON tgt.customer_id = src.customer_id
WHEN MATCHED THEN UPDATE SET
    tgt.email = src.email,
    tgt.name  = src.name
WHEN NOT MATCHED THEN INSERT (customer_id, email, name)
VALUES (src.customer_id, src.email, src.name);
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. MERGE INTO dim_customer_t1 targets the dim table; the alias tgt is conventional.
  2. USING (SELECT …) AS src lifts the new source row into a CTE-like alias; in production this would be a CTE over the staging table.
  3. ON tgt.customer_id = src.customer_id matches on the business key; this is the only SCD type where matching on business key is safe (because there is no history).
  4. WHEN MATCHED THEN UPDATE overwrites the email + name in place; the prior values are lost forever.
  5. WHEN NOT MATCHED THEN INSERT covers the brand-new-customer case; first-time customers get a fresh row.

Output (the dim after the merge).

customer_key customer_id email name
101 C-001 alice@new.com Alice Smith

Rule of thumb: Type 1 is fast and cheap but lossy; use it for attributes nobody will ever ask "what was that on Feb 14th" about.

SCD Type 2 — add a new row (full history)

Detailed explanation. SCD Type 2 is the workhorse of dimensional modeling: when an attribute changes, insert a new row with a fresh surrogate key and stamp valid_from + valid_to + is_current on both the old and new rows. The prior row's valid_to becomes the new row's valid_from; the prior row's is_current becomes FALSE.

Question. Customer C-001 upgrades from starter to enterprise segment on 2026-04-15 10:30:00. Write the SCD Type 2 MERGE (or insert + update pair) that closes the old row and inserts the new one.

Input. Existing dim row: customer_key=101, customer_id='C-001', segment='starter', valid_from='2025-01-01', valid_to='9999-12-31', is_current=TRUE. Source change: customer_id='C-001', segment='enterprise', change_ts='2026-04-15 10:30:00'.

Code.

-- SCD Type 2: insert + update pair (the classic 2-step pattern).
BEGIN;

-- Step 1: close out the current row.
UPDATE dim_customer
SET
    valid_to   = TIMESTAMP '2026-04-15 10:30:00',
    is_current = FALSE
WHERE customer_id = 'C-001'
  AND is_current  = TRUE;

-- Step 2: insert the new current row.
INSERT INTO dim_customer (
    customer_id, name, email, segment, country, signup_date,
    valid_from, valid_to, is_current
)
SELECT
    'C-001', name, email, 'enterprise', country, signup_date,
    TIMESTAMP '2026-04-15 10:30:00',
    TIMESTAMP '9999-12-31',
    TRUE
FROM dim_customer
WHERE customer_id = 'C-001'
  AND valid_to = TIMESTAMP '2026-04-15 10:30:00'   -- the row we just closed
LIMIT 1;

COMMIT;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Step 1 closes the prior current row by stamping valid_to = change_ts and is_current = FALSE; this row now represents the historical state.
  2. Step 2 inserts a new row with a fresh surrogate key (auto-generated by IDENTITY), segment = 'enterprise', valid_from = change_ts, valid_to = '9999-12-31', is_current = TRUE.
  3. The WHERE valid_to = change_ts clause in step 2's SELECT is how we copy the immutable attributes (name, email, country, signup_date) from the prior row.
  4. The two steps run inside a transaction so a downstream reader never sees the dim with zero current rows for C-001.
  5. The surrogate key of the new row is different from the prior row's surrogate — that's the whole point; the fact table will join to whichever key matches the order's valid_from window.

Output (the dim after the merge — two rows now).

customer_key customer_id segment valid_from valid_to is_current
101 C-001 starter 2025-01-01 2026-04-15 10:30 false
132 C-001 enterprise 2026-04-15 10:30 9999-12-31 true

Rule of thumb: SCD Type 2 inflates row count but preserves full history; pick it for any attribute where "what was the value on date X" is a real analytical question.

SCD Type 3 — add a new column (limited history)

Detailed explanation. SCD Type 3 adds a new column (typically previous_*) alongside the existing one, so the dim carries both the current and the immediately prior value side by side. It tracks one level of history per attribute; older history is lost.

Question. A company renames its sales_region from 'NorthAm' to 'Americas'. Track both the current and previous region on dim_store without inserting new rows.

Input. Existing dim row: store_key=11, store_id='S-100', sales_region='NorthAm'. Source change: store_id='S-100', sales_region='Americas', change_ts='2026-03-01'.

Code.

-- SCD Type 3: shift the current value into a previous column.
ALTER TABLE dim_store
    ADD COLUMN previous_sales_region VARCHAR(64),
    ADD COLUMN sales_region_changed_at TIMESTAMP;

UPDATE dim_store
SET
    previous_sales_region   = sales_region,
    sales_region            = 'Americas',
    sales_region_changed_at = TIMESTAMP '2026-03-01'
WHERE store_id = 'S-100';
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. ALTER TABLE adds two new columns: previous_sales_region (the prior value) and sales_region_changed_at (the change timestamp).
  2. The UPDATE shifts the existing sales_region value into previous_sales_region, then overwrites sales_region with the new value.
  3. The row count of the dim is unchanged — Type 3 is in-place, no new rows.
  4. The new column lets BI write SUM(revenue) GROUP BY sales_region for the current view and SUM(revenue) GROUP BY previous_sales_region for the prior view, without rewriting the fact joins.
  5. Type 3 is brittle — if the region renames again a year later, the previous_* column now holds two-changes-ago by default; some shops add previous_previous_*, which quickly becomes silly.

Output (the dim after the merge).

store_key store_id sales_region previous_sales_region sales_region_changed_at
11 S-100 Americas NorthAm 2026-03-01

Rule of thumb: Type 3 fits "we just renamed one attribute and analysts want a side-by-side compare for a few quarters". Use sparingly; if you need full history, escalate to Type 2.

SCD Type 6 — hybrid (1 + 2 + 3 combined)

Detailed explanation. SCD Type 6 is the senior interview answer: combine Type 1 (overwrite the current attribute in every historical row), Type 2 (insert new rows for change), and Type 3 (carry the prior value on every row) into a single hybrid pattern. The result is a dim where every row carries both its own historical value and the current value, so BI can pivot on either without re-joining.

Question. Track customer segment changes with full history (Type 2) and let a query say WHERE current_segment = 'enterprise' cheaply on every historical row (Type 1) and expose previous_segment on each new row (Type 3). Write the SCD Type 6 update.

Input. Two existing dim rows for C-001: the original starter row and the enterprise row inserted above.

Code.

-- SCD Type 6: insert new row + overwrite current_segment on every historical row.
BEGIN;

-- Step 1: close out the prior current row (Type 2 mechanics).
UPDATE dim_customer_t6
SET valid_to = TIMESTAMP '2026-04-15 10:30:00',
    is_current = FALSE
WHERE customer_id = 'C-001' AND is_current = TRUE;

-- Step 2: insert the new row with previous_segment carried (Type 3 mechanics).
INSERT INTO dim_customer_t6 (
    customer_id, name, email,
    segment, previous_segment, current_segment,
    valid_from, valid_to, is_current
)
SELECT
    'C-001', name, email,
    'enterprise',              -- this row's historical segment
    'starter',                 -- the prior segment (Type 3)
    'enterprise',              -- the current segment (Type 1)
    TIMESTAMP '2026-04-15 10:30:00',
    TIMESTAMP '9999-12-31',
    TRUE
FROM dim_customer_t6
WHERE customer_id = 'C-001'
  AND valid_to = TIMESTAMP '2026-04-15 10:30:00'
LIMIT 1;

-- Step 3: overwrite current_segment on every historical row (Type 1 mechanics).
UPDATE dim_customer_t6
SET current_segment = 'enterprise'
WHERE customer_id = 'C-001';

COMMIT;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Step 1 mirrors SCD Type 2: close the prior current row by stamping valid_to + is_current = FALSE.
  2. Step 2 inserts a new row with three segment columns: segment (the historical value for this row, here 'enterprise'), previous_segment (the prior value, here 'starter', the Type 3 carry-over), and current_segment (the as-of-now value, here also 'enterprise').
  3. Step 3 mirrors SCD Type 1: overwrite current_segment on every historical row for C-001, so even the closed-out starter row now carries current_segment = 'enterprise'.
  4. The payoff: BI can write WHERE current_segment = 'enterprise' and get all historical revenue for that customer regardless of which row matches the order date; or WHERE segment = 'enterprise' to filter by historical segment-at-time-of-purchase.
  5. Type 6 is the senior answer because it solves the "we want both views" problem without two separate dim tables.

Output (the dim after the merge — two rows, both carrying current_segment = 'enterprise').

customer_key customer_id segment previous_segment current_segment is_current
101 C-001 starter NULL enterprise false
132 C-001 enterprise starter enterprise true

Rule of thumb: Type 6 is the "I want history and fast current lookup" pattern; it costs one extra column per Type-1-overwritten attribute but eliminates a whole class of join + filter complexity.

slowly changing dimension — beginner mistakes to avoid

  • Joining the fact on the business key instead of the surrogate — breaks the moment you adopt SCD Type 2; the join multiplies by history depth.
  • Forgetting is_current = TRUE — every current-state query needs it; without it the result silently sums historical rows.
  • Letting valid_to be NULL — use '9999-12-31' instead so BETWEEN valid_from AND valid_to works without IS NULL branches.
  • Updating valid_from on an open rowvalid_from is immutable once stamped; only valid_to and is_current flip during SCD updates.
  • Mixing Type 1 and Type 2 attributes in the same row without comment — every dim column should be annotated with its SCD type in the table comment or dbt YAML.
  • Picking Type 2 for every attribute "just in case" — Type 2 inflates row counts; pick the type that matches the analytical question you'll be asked.

SQL
Topic — slowly-changing-data
SCD practice problems

Practice →

SQL
Topic — dimensional-modeling
Grain-and-SCD dimensional modeling drills

Practice →

Solution Using a per-attribute SCD type assignment matrix

Code.

-- Codify the SCD type for every attribute on dim_customer.
CREATE TABLE dim_customer_scd_plan AS
SELECT * FROM (VALUES
    ('customer_id',  'business key',      'NA',     'preserved for traceability; not updated after first insert'),
    ('name',         'descriptive',       'Type 1', 'typos and rebrands; history not interesting'),
    ('email',        'descriptive',       'Type 1', 'overwrite; do not preserve email history'),
    ('phone',        'descriptive',       'Type 1', 'overwrite; do not preserve phone history'),
    ('segment',      'analytical',        'Type 2', 'revenue per historical segment is a real question'),
    ('country',      'analytical',        'Type 2', 'geo migration matters for tax + analytics'),
    ('account_mgr',  'analytical',        'Type 2', 'attribution to manager-at-time-of-sale'),
    ('credit_score', 'analytical',        'Type 2', 'risk analysis needs historical score'),
    ('signup_date',  'immutable',         'NA',     'never changes; set once at insert'),
    ('current_segment','derived',         'Type 6', 'overwrite on all rows for fast current-state lookup')
) AS t(attribute, role, scd_type, rationale);
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

attribute role scd_type rationale
customer_id business key NA preserved for traceability; not updated after first insert
name descriptive Type 1 typos and rebrands; history not interesting
email descriptive Type 1 overwrite; do not preserve email history
phone descriptive Type 1 overwrite; do not preserve phone history
segment analytical Type 2 revenue per historical segment is a real question
country analytical Type 2 geo migration matters for tax + analytics
account_mgr analytical Type 2 attribution to manager-at-time-of-sale
credit_score analytical Type 2 risk analysis needs historical score
signup_date immutable NA never changes; set once at insert
current_segment derived Type 6 overwrite on all rows for fast current-state lookup
  1. Rows 1, 9 — business key + immutable; never updated after first insert.
  2. Rows 2-4 — Type 1; overwrite; cheap; loses history; appropriate for cosmetic and contact attributes.
  3. Rows 5-8 — Type 2; the analytical attributes; revenue / risk / attribution per historical value is a real question.
  4. Row 10 — Type 6 layered on top of segment; one extra column gives BI a fast "current state" pivot without a join.
  5. The matrix is the deliverable; every senior data modeler ships a per-attribute SCD plan, not a blanket "everything is Type 2".

Output.

attribute scd_type
customer_id NA
name Type 1
email Type 1
segment Type 2
country Type 2
account_mgr Type 2
credit_score Type 2
current_segment Type 6

Why this works — concept by concept:

  • Per-attribute SCD assignment — Kimball's discipline is "pick the SCD type per attribute, not per table"; a single dim can mix Types 1, 2, and 6 across its columns.
  • Type 1 for cosmetic, Type 2 for analytical — the rule of thumb that keeps row counts down without losing analytical value; cosmetic attributes (typos, rebrands, contact info) overwrite, analytical attributes (segment, region, tier) preserve history.
  • Type 6 for derived current-state — pairing a Type 2 attribute with a Type 1 current_* column gives BI both views with zero join cost.
  • Documentation as deliverable — the assignment matrix itself is shipped as part of the model design; without it the next engineer can't tell why email is Type 1 but country is Type 2.
  • CostO(1) to read the matrix; the actual updates cost O(rows-per-change) for Type 1 (overwrite all rows for that business key) vs O(1) for Type 2 (insert one new row).

4. Conformed dimensions + the Kimball bus matrix — modeling at enterprise scale

Visual diagram of conformed dimensions and the Kimball bus matrix — three fact tables (sales, returns, inventory) sharing the same dim_customer, dim_product, dim_date dimensions on the left; a small bus matrix grid on the right with business processes as rows and dimensions as columns, cells filled with green checkmarks for shared dims; on a light PipeCode card.

conformed dimensions — build dim_customer once, use it in every fact

conformed dimensions are dimensions designed to be shared across multiple business processes — fact_sales, fact_returns, fact_inventory all join to the same dim_customer, dim_product, dim_date. The conformance contract is the heart of Kimball at enterprise scale: without it, every team builds their own dim_customer_sales, dim_customer_marketing, dim_customer_support, and cross-process analytics becomes impossible because the definition of "customer" has diverged in five places.

The conformance contract — what makes a dim "conformed".

  • Same columnsdim_customer.segment means the same thing whether you join it to fact_sales or fact_returns.
  • Same surrogate-key generationcustomer_key = 12345 resolves to the same business customer in every fact.
  • Same SCD policy — segment changes are tracked as Type 2 in every fact that uses the dim.
  • Same grain — if the dim is at customer-account level (not customer-individual level), every fact agrees on that grain.
  • Same source of truth — one team owns dim_customer; the other teams consume it, they don't fork it.

Why conformance matters in interviews.

  • Cross-process analytics depends on it — "what % of customers who bought in Q1 returned in Q2" requires fact_sales and fact_returns to share dim_customer.
  • Reconciliation breaks without it — if fact_sales.customer_key = 12345 is "Alice" but fact_returns.customer_key = 12345 is "Bob", every reconciliation query lies.
  • It is the senior signal — junior modelers build a dim per fact; senior modelers build a conformed dim and reuse it.
  • The Kimball bus matrix is the deliverable — section 4.2 walks through it.

Three flavours of conformance (with examples).

  • Identical conformance — the strongest; the dim row, the surrogate key, and every attribute match exactly across facts; example dim_date.
  • Shrunken conformance — a coarser version of the dim is used in lower-grain facts; example dim_date at month-grain (dim_month) for inventory snapshots while dim_date at day-grain serves fact_sales.
  • Subset conformance — one fact uses only a subset of the dim's rows (e.g. fact_internal_sales filters dim_customer to internal customers); attributes and keys match, but row set differs.

Worked example — design dim_customer once and use it in three facts

Detailed explanation. Real interviews ask you to demonstrate conformance by showing the same dim_customer being consumed by multiple fact_* tables. Below is the canonical three-fact pattern.

Question. Sales, returns, and customer-support tickets all need to be analysed by customer segment, country, and tier. Design dim_customer once and show how fact_sales, fact_returns, and fact_support_ticket all consume it.

Input. Three OLTP sources: oltp.orders, oltp.returns, oltp.support_tickets. The current state has three separate dim_customer_* tables, one per team; consolidate them.

Code.

-- One canonical conformed dim.
CREATE TABLE dim_customer (
    customer_key   BIGINT IDENTITY(1,1) PRIMARY KEY,
    customer_id    VARCHAR(40) NOT NULL,
    name           VARCHAR(120) NOT NULL,
    email          VARCHAR(120) NOT NULL,                -- Type 1
    segment        VARCHAR(32)  NOT NULL,                -- Type 2
    country        VARCHAR(64)  NOT NULL,                -- Type 2
    tier           VARCHAR(16)  NOT NULL,                -- Type 2
    current_segment VARCHAR(32) NOT NULL,                -- Type 6 (derived)
    valid_from     TIMESTAMP    NOT NULL,
    valid_to       TIMESTAMP    NOT NULL DEFAULT '9999-12-31',
    is_current     BOOLEAN      NOT NULL DEFAULT TRUE
);

-- Three facts, one dim.
CREATE TABLE fact_sales (
    sale_key       BIGINT IDENTITY(1,1) PRIMARY KEY,
    customer_key   BIGINT NOT NULL REFERENCES dim_customer(customer_key),
    product_key    BIGINT NOT NULL,
    date_key       INT    NOT NULL,
    revenue        NUMERIC(12,2) NOT NULL
);

CREATE TABLE fact_returns (
    return_key     BIGINT IDENTITY(1,1) PRIMARY KEY,
    customer_key   BIGINT NOT NULL REFERENCES dim_customer(customer_key),  -- conformed
    product_key    BIGINT NOT NULL,
    date_key       INT    NOT NULL,
    refund_amount  NUMERIC(12,2) NOT NULL
);

CREATE TABLE fact_support_ticket (
    ticket_key     BIGINT IDENTITY(1,1) PRIMARY KEY,
    customer_key   BIGINT NOT NULL REFERENCES dim_customer(customer_key),  -- conformed
    date_key       INT    NOT NULL,
    severity_key   BIGINT NOT NULL,
    resolution_minutes INT NOT NULL
);
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. One dim_customer table is the single source of truth; every fact's customer_key FK references it.
  2. All three facts agree on the customer_key surrogate; if Alice is customer_key = 12345 in sales, she is customer_key = 12345 in returns and support.
  3. All three facts inherit the same SCD policy: when Alice's segment changes, a new dim row is inserted with a new surrogate, and future facts in all three tables join to the new key.
  4. Cross-process queries work without effort: SELECT segment, SUM(revenue), SUM(refund_amount), AVG(resolution_minutes) FROM dim_customer d LEFT JOIN fact_sales s LEFT JOIN fact_returns r LEFT JOIN fact_support_ticket t … GROUP BY segment returns a single row per segment with all three measures.
  5. The conformance contract is enforced by the FK + the team agreement; both layers matter (DB constraints catch the technical violation, the team agreement catches the policy violation).

Output (one row per segment, joining all three facts).

segment total_revenue total_refunds avg_resolution_min
enterprise 250000.00 12500.00 45
growth 95000.00 4200.00 60
starter 38000.00 1900.00 90

Rule of thumb: if you're tempted to build dim_customer_v2 for a new team, stop — the cost of forking the dim today is paid for the next decade in cross-process analytics that don't tie out.

kimball bus matrix — the org-wide design view of which dims serve which processes

The Kimball bus matrix is a 2D grid with business processes as rows and conformed dimensions as columns; a checkmark in cell (process, dim) says "this process's fact table joins to this dim". The matrix is the single artefact the data platform team uses to plan, govern, and communicate dimensional modeling at enterprise scale.

The shape of a bus matrix.

business process customer product date store employee channel
Sales
Returns
Inventory snapshot ✓ (month)
Support ticket
Marketing campaign
Web event
Subscription billing

How to read it.

  • Each row is a business process — a single subject area that produces a fact table.
  • Each column is a conformed dimension shared across processes.
  • A checkmark means "this process's fact joins to this dim".
  • A dash means "this dim does not apply to this process".
  • A "(month)" annotation means shrunken conformance — the inventory fact joins at month grain while sales joins at day grain.

Why the bus matrix is the senior-modeler deliverable.

  • It surfaces missing dims — if customer is checked for sales but missing for support, that's a gap analytics will pay for later.
  • It exposes redundant facts — if two facts cover the same process at slightly different grains, you probably have a re-grain bug.
  • It plans roadmap — each cell is a unit of work; add a fact, add a dim, conform a dim across processes.
  • It governs ownership — each column has an owner (the team that owns the dim); each row has an owner (the team that owns the process).
  • It travels across tools — the matrix lives in a wiki, a dbt docs page, or a Confluence page; every BI dashboard ties back to it.

Worked example — sketch a 3-row × 4-column bus matrix on a whiteboard

Detailed explanation. Whiteboard rounds love this question because it's tiny but reveals whether you actually use the bus matrix or just read about it. The drill is to design a 3-row × 4-column matrix in 60 seconds.

Question. Sketch a Kimball bus matrix for an e-commerce platform with three business processes (sales, returns, inventory snapshot) and four candidate conformed dimensions (customer, product, date, store). Mark which dims are conformed across all three, which are partial, and call out one shrunken-conformance cell.

Input. Three facts: fact_sales (transaction grain), fact_returns (transaction grain), fact_inventory_snapshot (daily snapshot, but stored monthly for cost reasons). Four candidate dims: dim_customer, dim_product, dim_date, dim_store.

Code.

-- Materialise the bus matrix as a small table for governance.
CREATE TABLE bus_matrix AS
SELECT * FROM (VALUES
    ('sales',              'customer', 'full'),
    ('sales',              'product',  'full'),
    ('sales',              'date',     'full'),
    ('sales',              'store',    'full'),
    ('returns',            'customer', 'full'),
    ('returns',            'product',  'full'),
    ('returns',            'date',     'full'),
    ('returns',            'store',    'full'),
    ('inventory_snapshot', 'customer', 'not_applicable'),
    ('inventory_snapshot', 'product',  'full'),
    ('inventory_snapshot', 'date',     'shrunken_to_month'),
    ('inventory_snapshot', 'store',    'full')
) AS t(business_process, dimension, conformance);
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. sales and returns both share all four dims — customer, product, date, store — at full conformance; cross-process queries (refund rate per segment per region) are trivial.
  2. inventory_snapshot does not use customer — the dim is not_applicable because inventory is product-and-store-keyed, not customer-keyed.
  3. inventory_snapshot uses dim_date at month grain (the snapshot fact stores one row per (product, store, month)); this is the shrunken-conformance cell.
  4. The 12-row table is the bus matrix; pivot it in a BI tool for a visual grid.
  5. The matrix lives in version control alongside the model definitions; PR-reviewed changes to the matrix are the governance gate for adding new processes or dims.

Output (the matrix pivoted into the classic grid).

business_process customer product date store
sales full full full full
returns full full full full
inventory_snapshot full month full

Rule of thumb: every interview-day system-design answer for an analytics platform should start with a hand-sketched bus matrix; the matrix anchors the rest of the design.

SQL
Topic — dimensional-modeling
Conformed-dimension and bus-matrix drills

Practice →

SQL
Topic — aggregation
Cross-process aggregation practice

Practice →

Solution Using a cross-process analytics query that depends on conformance

Code.

-- One query, three facts, one conformed dim_customer — the payoff of conformance.
WITH sales AS (
    SELECT customer_key, SUM(revenue) AS total_revenue
    FROM fact_sales
    WHERE date_key BETWEEN 20260101 AND 20260331
    GROUP BY customer_key
), returns AS (
    SELECT customer_key, SUM(refund_amount) AS total_refunds
    FROM fact_returns
    WHERE date_key BETWEEN 20260101 AND 20260331
    GROUP BY customer_key
), support AS (
    SELECT customer_key,
           COUNT(*)              AS ticket_count,
           AVG(resolution_minutes) AS avg_resolution_min
    FROM fact_support_ticket
    WHERE date_key BETWEEN 20260101 AND 20260331
    GROUP BY customer_key
)
SELECT
    d.segment,
    d.country,
    SUM(COALESCE(s.total_revenue, 0))      AS total_revenue,
    SUM(COALESCE(r.total_refunds, 0))      AS total_refunds,
    SUM(COALESCE(t.ticket_count, 0))       AS ticket_count,
    AVG(t.avg_resolution_min)               AS avg_resolution_min
FROM dim_customer d
LEFT JOIN sales   s ON s.customer_key = d.customer_key
LEFT JOIN returns r ON r.customer_key = d.customer_key
LEFT JOIN support t ON t.customer_key = d.customer_key
WHERE d.is_current = TRUE
GROUP BY d.segment, d.country
ORDER BY total_revenue DESC;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

d.customer_key d.segment d.country s.total_revenue r.total_refunds t.ticket_count
101 enterprise US 50000.00 2500.00 8
102 growth US 18000.00 900.00 12
103 enterprise UK 40000.00 1800.00 5
104 starter UK 6000.00 300.00 20
  1. Each CTE aggregates one fact to the customer level; the grain of each CTE is (customer_key).
  2. The main SELECT joins all three CTEs to dim_customer; LEFT JOIN preserves customers with no sales / no returns / no tickets.
  3. WHERE d.is_current = TRUE filters the dim to one current row per customer; without it the rollup would multiply by SCD history depth.
  4. GROUP BY d.segment, d.country collapses to one row per segment-country bucket.
  5. The conformance contract is what makes this query possible — every fact agrees that customer_key = 101 is the same customer.

Output:

segment country total_revenue total_refunds ticket_count avg_resolution_min
enterprise US 50000.00 2500.00 8 45
enterprise UK 40000.00 1800.00 5 55
growth US 18000.00 900.00 12 60
starter UK 6000.00 300.00 20 90

Why this works — concept by concept:

  • Conformed surrogate keycustomer_key resolves identically in all three facts; without this, the three LEFT JOINs would silently disagree.
  • One CTE per fact — pre-aggregating each fact to customer-grain before joining keeps the join cardinality manageable (O(customers) not O(customers × sales × returns × tickets)).
  • COALESCE on outer-joined measures — customers with no sales return NULL; COALESCE(…, 0) turns nulls into zeros so the SUM is correct.
  • is_current filter — required because dim_customer is SCD Type 2; without it, the rollup multiplies by historical row count.
  • Cost — three CTE scans are O(rows in date range) each; the join is O(distinct customers); the whole query is cheap because the fact-level pre-aggregation collapses the data before the join.

5. The Kimball 4-step design process — business process → grain → dimensions → facts

Visual diagram of the Kimball 4-step design process — four numbered step cards left-to-right (Select business process → Declare grain → Choose dimensions → Identify facts) each with a tiny icon, a one-line description, and a small example pill; an arrow returns from step 4 back to step 1 to indicate iteration; on a light PipeCode card.

kimball methodology — the canonical 4-step design process

The Kimball 4-step design is the recipe every dimensional model follows: (1) select the business process, (2) declare the grain, (3) choose the dimensions, (4) identify the facts. The order matters: skip a step, or do them out of order, and the model fails predictably — grain mistakes are the most expensive class of failure because they propagate through every downstream model and dashboard.

Step 1 — select the business process.

  • Definition — a business process is a single measurement event the OLTP source produces: placing an order, shipping a parcel, returning a product, clicking a button, posting a payment.
  • Rule — one business process per fact table; never combine "orders and returns" into a single fact because their grains and measures don't align.
  • Sanity check — write the process down as "the system measures X when Y happens"; if you can't, you haven't picked a real process yet.
  • Examples"the system measures revenue when an order line is placed", "the system measures days-late when a shipment status changes", "the system measures attendance when a class session occurs".

Step 2 — declare the grain.

  • Definition — the grain is a single sentence defining what one row of the fact table means.
  • Rule — declare grain before you name a single column; defend it against finer (more atomic) and coarser (more aggregated) alternatives.
  • Sanity check — fill in the blank: "One row of this fact represents ____."; the sentence is the grain.
  • Examples"one row per (order_id, line_id)", "one row per (account_id, day)", "one row per order, lifecycle-accumulating".

Step 3 — choose the dimensions.

  • Definition — the dimensions are the who / what / when / where / why contexts surrounding the grain.
  • Rule — pick the minimum set of dimensions the grain requires; don't drag in dimensions that aren't relevant to the process.
  • Sanity check — for each candidate dim, ask "if I removed this dim, can I still answer the analytical questions the PM cares about?"; if yes, drop it.
  • Examples — for fact_sales at order-line grain: dim_customer, dim_product, dim_date, dim_store; that's it.

Step 4 — identify the facts (measures).

  • Definition — the facts are the numeric measures that aggregate up the dimension hierarchies.
  • Rule — favour additive measures (those that SUM correctly across all dims); be wary of semi-additive (SUM only across some dims) and non-additive (ratios, percentages) measures.
  • Sanity check — for each candidate measure, ask "does SUM(this) GROUP BY any dim make sense?"; if no, it's not a fact.
  • Examples — for fact_sales: quantity, unit_price, discount_amount, revenue.

The iteration loop.

  • One business process per iteration — design the sales model first, ship it, then iterate into returns.
  • Re-declare grain when the source changes — if the OLTP team adds line-level cancellation, the grain may need to shift.
  • Add dims as use cases emergedim_promotion may not be needed on day 1 but becomes essential when marketing wants attribution.
  • Add facts as measures are requesteddiscount_pct (a derived ratio) may emerge later; store the additive components and derive the ratio in BI.

Worked example — apply the 4-step process to an e-commerce sales request

Detailed explanation. Interviews love this one because it lets the candidate demonstrate the process, not just the artefact. Below is a fully worked end-to-end design from a one-paragraph PM request.

Question. A PM says: "Our e-commerce platform sells products to customers via a web store. I want to analyse revenue by customer segment, product category, day, and store region — and drill into individual order lines." Apply the 4-step process and produce the fact_sales schema.

Input. OLTP source: orders(order_id, customer_id, order_ts, store_id) joined to order_lines(order_id, line_id, sku, qty, unit_price, discount). No other tables.

Code.

-- Step 1: business process = "online sales (order-line placement)".
-- Step 2: grain         = "one row per (order_id, line_id)".
-- Step 3: dimensions    = customer, product, date, store.
-- Step 4: facts         = quantity, unit_price, discount_amount, revenue.

CREATE TABLE fact_sales (
    sale_key       BIGINT IDENTITY(1,1) PRIMARY KEY,
    customer_key   BIGINT NOT NULL REFERENCES dim_customer(customer_key),
    product_key    BIGINT NOT NULL REFERENCES dim_product(product_key),
    date_key       INT    NOT NULL REFERENCES dim_date(date_key),
    store_key      BIGINT NOT NULL REFERENCES dim_store(store_key),
    order_id       VARCHAR(40) NOT NULL,
    line_id        INT         NOT NULL,
    quantity       INT           NOT NULL,
    unit_price     NUMERIC(12,2) NOT NULL,
    discount_amount NUMERIC(12,2) NOT NULL DEFAULT 0,
    revenue        NUMERIC(12,2) NOT NULL,
    CONSTRAINT uq_fact_sales UNIQUE (order_id, line_id)
);
-- Sanity-check the grain: COUNT(*) = COUNT(DISTINCT (order_id, line_id)).
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Step 1 (business process)"placing an order line on the web store"; it is a single measurement event; not a roll-up.
  2. Step 2 (grain)"one row per (order_id, line_id)"; the most atomic grain the source supports; declared in the table comment and enforced by the UNIQUE constraint.
  3. Step 3 (dimensions)customer (who), product (what), date (when), store (where); four FKs.
  4. Step 4 (facts)quantity, unit_price, discount_amount, revenue; the first three come directly from the source, revenue is derived (= qty × price - discount) and stored to avoid recomputation in BI.
  5. The order matters — process before grain before dims before facts; reversing the order (e.g. picking facts first) leads to mid-design rework when the grain doesn't support them.

Output (the schema deliverable).

step artefact
1. business process online sales
2. grain one row per (order_id, line_id)
3. dimensions customer, product, date, store
4. facts quantity, unit_price, discount_amount, revenue
final fact_sales table with 4 FKs + 2 degen dims + 4 measures

Rule of thumb: every dimensional design should ship the 4-step process as a comment block on the fact table; the comment is the design rationale that survives turnover.

kimball methodology — common beginner mistakes when applying the 4-step process

  • Skipping the business process step — jumping straight to grain or dimensions without naming the process leads to bloated facts that mix multiple processes.
  • Declaring grain in plural"one row per orders" is wrong; the grain is always singular ("one row per order line").
  • Picking dimensions before grain — the grain constrains the dimensions; you cannot have a dim_line_item if your grain is one row per order.
  • Stuffing descriptive attributes into the fact — if you find yourself adding customer_name or product_category to the fact, you're modelling backwards; those belong on the dim.
  • Picking non-additive measures as primary factsdiscount_pct and margin_pct cannot SUM; store the additive components and let BI derive the ratios.
  • Forgetting dim_date — every fact has a time dimension; even factless facts have one; never store dates only as DATE columns on the fact without a date_key FK.

Worked example — translate a tricky multi-process PM request into separate facts

Detailed explanation. A senior interviewer will deliberately mix processes in the PM request and see whether the candidate correctly splits them. Below is the drill.

Question. A PM says: "I want to track everything that happens to an order — when it's placed, when each line is shipped, when each line is returned. Build me one big fact." Resist the temptation; design three fact tables and explain why.

Input. OLTP sources: orders, order_lines, shipments(line_id, shipped_ts), returns(line_id, return_ts, refund_amount).

Code.

-- Three processes, three facts, three grains.

-- Process 1: order-line placement.
CREATE TABLE fact_sales (
    customer_key BIGINT, product_key BIGINT, date_key INT, store_key BIGINT,
    order_id VARCHAR(40), line_id INT,
    quantity INT, unit_price NUMERIC(12,2), revenue NUMERIC(12,2)
);
-- grain: one row per (order_id, line_id) at placement time.

-- Process 2: shipment.
CREATE TABLE fact_shipments (
    customer_key BIGINT, product_key BIGINT, ship_date_key INT, carrier_key BIGINT,
    order_id VARCHAR(40), line_id INT,
    quantity_shipped INT, days_from_order INT
);
-- grain: one row per (order_id, line_id) per shipment event.

-- Process 3: return.
CREATE TABLE fact_returns (
    customer_key BIGINT, product_key BIGINT, return_date_key INT,
    order_id VARCHAR(40), line_id INT,
    quantity_returned INT, refund_amount NUMERIC(12,2),
    days_from_ship INT
);
-- grain: one row per (order_id, line_id) per return event.
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The PM's "one big fact" is the trap; combining three processes into one fact gives you a wide, sparse, semi-additive mess.
  2. Each process has its own measurement event (placed, shipped, returned) and therefore its own fact table.
  3. Each fact has its own grain and its own set of measures; fact_sales.revenue doesn't apply to fact_shipments, and fact_shipments.days_from_order doesn't apply to fact_sales.
  4. The three facts share customer_key, product_key, and (order_id, line_id) as a degenerate dim, so cross-process analytics (placed-to-shipped lag) is one join away.
  5. An accumulating snapshot fact (fact_order_lifecycle) can sit on top of the three transaction facts to give BI a denormalised one-row-per-order view; the three transaction facts remain the source of truth.

Output (the three-fact design with shared conformed dims).

fact grain measures
fact_sales 1 row per (order_id, line_id) at placement quantity, unit_price, revenue
fact_shipments 1 row per (order_id, line_id) per shipment quantity_shipped, days_from_order
fact_returns 1 row per (order_id, line_id) per return quantity_returned, refund_amount, days_from_ship

Rule of thumb: if a PM asks for "one big fact", count the measurement events in the request; each event is its own fact.

SQL
Topic — dimensional-modeling
4-step design process drills

Practice →

SQL
Topic — database
Database / schema design practice

Practice →

Solution Using a 4-step design checklist as a deliverable

Code.

-- Ship the 4-step design as a checklist row per fact table.
CREATE TABLE dim_design_checklist AS
SELECT * FROM (VALUES
    ('fact_sales',         1, 'business_process', 'online sales (order-line placement)'),
    ('fact_sales',         2, 'grain',            'one row per (order_id, line_id)'),
    ('fact_sales',         3, 'dimensions',       'customer, product, date, store'),
    ('fact_sales',         4, 'facts',            'quantity, unit_price, discount_amount, revenue'),

    ('fact_shipments',     1, 'business_process', 'shipment dispatch'),
    ('fact_shipments',     2, 'grain',            'one row per (order_id, line_id) per shipment'),
    ('fact_shipments',     3, 'dimensions',       'customer, product, ship_date, carrier'),
    ('fact_shipments',     4, 'facts',            'quantity_shipped, days_from_order'),

    ('fact_returns',       1, 'business_process', 'product return'),
    ('fact_returns',       2, 'grain',            'one row per (order_id, line_id) per return'),
    ('fact_returns',       3, 'dimensions',       'customer, product, return_date'),
    ('fact_returns',       4, 'facts',            'quantity_returned, refund_amount, days_from_ship')
) AS t(fact_table, step_no, step_name, value);
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

fact_table step_no step_name value
fact_sales 1 business_process online sales (order-line placement)
fact_sales 2 grain one row per (order_id, line_id)
fact_sales 3 dimensions customer, product, date, store
fact_sales 4 facts quantity, unit_price, discount_amount, revenue
fact_shipments 1 business_process shipment dispatch
fact_shipments 2 grain one row per (order_id, line_id) per shipment
fact_shipments 3 dimensions customer, product, ship_date, carrier
fact_shipments 4 facts quantity_shipped, days_from_order
fact_returns 1 business_process product return
fact_returns 2 grain one row per (order_id, line_id) per return
fact_returns 3 dimensions customer, product, return_date
fact_returns 4 facts quantity_returned, refund_amount, days_from_ship
  1. Each fact gets exactly four rows in the checklist — one per step; if a fact has fewer, the design is incomplete.
  2. The step_no column enforces the canonical order; grain before dimensions before facts.
  3. The value column is plain English (not SQL); junior engineers and PMs can read it without warehouse fluency.
  4. The table itself becomes the design contract; PR review against the checklist catches gaps before code lands.
  5. Three facts × four steps = 12 rows; with seven facts a real platform might have 28 rows, all in one queryable artefact.

Output.

fact_table grain
fact_sales one row per (order_id, line_id)
fact_shipments one row per (order_id, line_id) per shipment
fact_returns one row per (order_id, line_id) per return

Why this works — concept by concept:

  • Checklist as deliverable — the design itself is a row-per-step artefact; this is what makes Kimball governable at scale.
  • One row per step per fact — turns a vague "did we follow the process" question into a COUNT(*) GROUP BY fact_table HAVING COUNT(*) = 4 query.
  • Plain-English value column — the design has to be readable by the PM, the analyst, and the DBA; SQL syntax in the design doc is over-engineering.
  • Versionable in source control — the checklist lives in dbt YAML / the data catalog / a Confluence page; changes are PR-reviewed.
  • CostO(1) to read; the actual schemas built from the design cost O(N rows × N attributes) to materialise, but the design itself is constant-time recall.

Choosing the right SCD type (cheat sheet)

A one-screen cheat sheet for slowly changing dimension decisions — pick the type that matches the analytical question you'll be asked.

You want to … SCD type Mechanism Row impact
Fix typos in a name Type 1 overwrite in place none
Update an email after a change Type 1 overwrite in place none
Track historical customer segment Type 2 insert new row + close prior row +1 row per change
Track historical region / country Type 2 insert new row + close prior row +1 row per change
Track historical account manager Type 2 insert new row + close prior row +1 row per change
Track historical credit score Type 2 insert new row + close prior row +1 row per change
Track just-the-last-change region rename Type 3 add previous_* column none
Provide fast current-state lookup and full history Type 6 Type 2 + Type 1 + Type 3 hybrid +1 row per change
Preserve everything in a never-purged audit Type 4 (history table) move old rows to dim_*_history none in main dim
Roll back to a prior version on demand Type 2 + retention keep all rows; query valid_from window +1 row per change
Surface "as-of" reporting at any date Type 2 with valid_from / valid_to BETWEEN predicate full Type 2 cost
Audit who changed a field Type 2 + updated_by audit column every row carries updater full Type 2 cost
Track an immutable attribute (signup_date) NA never updated none
Encode a derived current-state pivot Type 6 (current_* column) overwrite current_* on every row none new

Frequently asked questions

How is this Kimball deep-dive different from a generic data-modeling Q&A round-up?

A quick data modeling interview questions round-up usually covers OLTP normalisation (1NF / 2NF / 3NF), the Kimball-vs-Inmon-vs-Vault landscape, basic star vs snowflake schema vocabulary, and a few generic FAQ-style questions in one sitting — perfect for last-minute review. This deep-dive narrows the lens to Kimball dimensional modeling specifically, walking five numbered teaching sections — fact-vs-dim atoms, grain + the four SCD types with full SQL, conformed dimensions + the bus matrix, and the canonical 4-step design process — with worked examples, end-to-end schemas, and a per-attribute SCD assignment matrix. Pick the deep-dive when you have a week to prepare, want to teach dimensional modeling in a senior loop, or need the SCD MERGE statements memorised. Pick the round-up the night before. The two formats are complements, not duplicates — same family of topics, different depth.

What is the difference between a fact table and a dimension table in Kimball modeling?

Fact tables are narrow + tall + numeric — they have a handful of foreign-key columns (one per participating dimension), one or two degenerate dimensions (order_id, line_id), and a handful of additive measures (quantity, revenue, discount_amount). One row per business event; billions of rows over time. Dimension tables are short + wide + descriptive — they have a surrogate key, a business key, and dozens of descriptive text and date attributes (name, segment, country, category, valid_from, valid_to, is_current); one row per business entity per historical version. The interview-day rule of thumb: facts answer "how much"; dimensions answer "who / what / when / where / why". If you find a long text column on a fact, it's mis-modelled; if you find a numeric measure on a dim, it's mis-modelled.

What is grain and why is declaring it first the most important rule in Kimball?

grain is a single sentence that defines what one row of a fact table means — "one row per (order_id, line_id)", "one row per (account_id, day)", "one row per order, lifecycle-accumulating". It must be declared before any column is named, because every other modeling decision (which dimensions apply, which measures are additive, what the unique constraint is) follows from the grain. Mixing grains in the same fact table double-counts every aggregate; changing the grain after launch breaks every downstream dashboard; ambiguity about the grain produces queries that return wrong numbers silently. The Kimball discipline: write the grain in the table comment, the dbt YAML, the data-catalog entry, and the design wiki; multiple sources of truth keep it from drifting. Defending the grain in a design review — explaining why your grain isn't finer (more atomic) or coarser (more aggregated) — is the single biggest senior-modeler signal you can send.

What are the four SCD types I have to know cold for an interview?

The four canonical types are: Type 1 (overwrite) — replace the value in place; no history; cheap; use for typos, contact info, and rebrands where past values aren't analytically interesting. Type 2 (add new row) — insert a new row with a fresh surrogate key + valid_from / valid_to / is_current flags; full history; the workhorse; use for analytical attributes (segment, region, tier, account manager) where revenue-per-historical-value is a real question. Type 3 (add new column) — add a previous_* column to track one level of history per attribute; limited; use sparingly for one-time renames (region rebrand). Type 6 (hybrid 1+2+3) — the senior interview answer: layer all three patterns to give you full history and a fast current_* lookup and a per-row prior value. Memorise the SQL MERGE for each (section 3 ships all four). The interview rule: pick the SCD type per attribute, not per table — a single dim_customer can mix Type 1 on email, Type 2 on segment, Type 6 on current_segment.

What are conformed dimensions and how do they enable enterprise-scale analytics?

Conformed dimensions are dimensions designed to be shared across multiple business processes — fact_sales, fact_returns, fact_inventory, fact_support_ticket all join to the same dim_customer, dim_product, dim_date. The conformance contract specifies that the surrogate key, column set, SCD policy, and grain are identical across every fact that consumes the dim. Without conformance, every team builds their own dim_customer_sales, dim_customer_marketing, dim_customer_support, and cross-process analytics ("what % of customers who bought in Q1 returned in Q2 and opened a support ticket in Q3") becomes impossible because the definition of "customer" has diverged in five places. The Kimball bus matrix is the org-wide design artefact that surfaces conformance: business processes as rows, conformed dimensions as columns, checkmarks where the process uses the dim. Senior data modelers ship the bus matrix first as the platform's analytics blueprint; junior modelers skip it and pay for missing conformance the next decade.

Is Kimball dimensional modeling still relevant in 2026 with the lakehouse, Iceberg, and modern semantic layers?

Yes — emphatically. Snowflake, BigQuery, Databricks, and Redshift all publish reference architectures with star-schema gold-layer models. dbt is built around dimensional modeling — dim_ / fact_ naming is the de-facto convention, dbt_utils.generate_surrogate_key is universal, and dbt-expectations ships dimensional-model assertions. The lakehouse did not kill it: Iceberg, Delta, and Hudi tables still get a Kimball-shaped gold layer on top of the bronze raw + silver cleaned layers. Modern semantic layers (Cube, LookML, dbt-metricflow, Snowflake Semantic Layer) all assume a star-schema input. Data Vault complements rather than replaces — DV 2.0 increasingly handles the raw / integration layer with a Kimball star on top as the consumption layer. The reason dimensional modeling outlived every "Kimball is dead" hot take is that, underneath the storage layer, analysts still want a star schema because that is the shape SQL pivots and BI tools natively consume. In 2026, knowing Kimball cold is still the price of admission to a senior data-engineering interview at a serious analytics org.


Practice on PipeCode

PipeCode ships 450+ data-engineering interview problems — including SQL + Python drills keyed to the exact kimball data warehouse skill set this guide teaches (fact-vs-dim design, grain declaration, surrogate keys, SCD Types 1 / 2 / 3 / 6 MERGE patterns, conformed-dimension reasoning, bus-matrix governance, the 4-step design process). Whether you're drilling dimensional modeling interview questions the night before a screen or grinding the Kimball methodology vocabulary across a multi-week prep cycle, the practice library mirrors the same five-section mental model — plus the dbt, Snowflake, BigQuery, and Databricks warehouse stacks you'll wire into your production star schema.

Kick off via Explore practice →; drill the dimensional-modeling lane →; fan out into slowly-changing-data problems →; reinforce the broader SQL practice library →; rehearse aggregation patterns →; widen coverage on the full Python practice library →.

Top comments (0)