star schema vs snowflake schema is the single most-asked dimensional modeling question on a data-engineering interview loop, because the answer touches every layer of the warehouse — fact table design, dimension table shape, grain declaration, conformed dimensions, SCD (slowly changing dimension) handling, query latency, ETL load complexity, storage cost, and BI tool fit. A senior interviewer is not asking which schema is better; they are asking whether you can map a workload onto a schema, name the five-dimension trade-off out loud, and justify the choice with a decision tree — the exact shape this deep-dive walks through, end to end.
This guide covers the topic at five teaching depths — anatomy of a star schema (one fact, denormalised dimensions, single-step joins), anatomy of a snowflake schema (normalised dimensions, branching sub-dimensions, multi-step joins), the five-dimension comparison (query speed, ETL complexity, storage cost, BI-tool fit, best-for workloads), the decision matrix (when to pick which, with worked SQL on both shapes), and a tight cheat sheet that fits on a single screen — followed by six FAQs that vary the keyword cluster so a senior loop's "explain it differently" follow-ups all have a clean answer.
When you want hands-on reps immediately after reading, browse the SQL practice library →, drill joins problems →, sharpen aggregation reps →, reinforce database problems →, rehearse data-modeling problems →, or widen coverage on the full Python practice library →.
On this page
- Why dimensional modeling is its own interview track
- Star schema anatomy — fact + denormalised dimensions + single-step joins
- Snowflake schema anatomy — normalised dimensions + branching sub-dimensions
- Star vs Snowflake — five-dimension trade-off (query speed, ETL, storage, BI fit, best for)
- Decision matrix — when to choose which (with worked SQL)
- Choosing the right schema (cheat sheet)
- Frequently asked questions
- Practice on PipeCode
1. Why dimensional modeling is its own interview track
dimensional modeling — a distinct discipline from OLTP design and raw SQL
The one-sentence invariant: dimensional modeling is a distinct discipline because the shapes it optimises for — fact table + surrounding dimension table arms, declared grain, conformed dimensions across marts, and SCD type 2 history — make analytical queries (aggregate, slice-by-dim, time-series) one to two orders of magnitude faster than the same query against a 3NF OLTP schema, and the design decisions that buy that speed (denormalisation, surrogate keys, late-binding dimensions, slowly-changing dimension policy) are workload-shaped, not *form-shaped*. An interviewer is not testing whether you can write a JOIN — they are testing whether you can think in facts, dimensions, grain, and tradeoffs while they listen.
What interviewers actually score on star schema vs snowflake schema questions.
-
Definition fluency — can you, in 30 seconds, define
fact table,dimension table,grain,conformed dimension,SCD type 2,star schema, andsnowflake schemawithout notes? - Shape comparison — can you draw both schemas on a whiteboard and explain why the snowflake "branches" while the star is "flat"?
-
Trade-off articulation — can you name the five dimensions of trade-off (
query speed,ETL complexity,storage cost,BI tool fit,best for) and the verdict for each side? -
Decision-tree thinking — given a workload (
Tableau dashboard,regulated reporting,petabyte clickstream,data vault → mart), can you pick a schema and justify with two sentences? -
SQL fluency on both — can you write the same business question as a star query (one
JOINper dim) and a snowflake query (multi-stepJOINchain) and read off the cost difference? -
SCDliteracy — can you describe type 1 (overwrite), type 2 (versioned row + effective dates), and type 3 (versioned column) and name which star vs snowflake handles them more cleanly?
The 5-section map this guide walks through.
-
Section 1 — Why
dimensional modelingis its own interview track — the scope, the taxonomy of facts / dimensions / grain, and the four senior signals. -
Section 2 —
star schemaanatomy — one fact at the centre, denormalised dimensions in a radial pattern, single-step joins for every analytical query. -
Section 3 —
snowflake schemaanatomy — same fact, but each dimension is normalised into sub-dimensions; storage falls and join cost rises. -
Section 4 — The five-dimension trade-off —
query speed,ETL complexity,storage cost,BI tool fit,best for; the matrix interviewers expect you to recite. - Section 5 — The decision matrix — four-question decision tree with worked SQL on both shapes so you can defend the verdict.
Why this is its own interview track and not a SQL round.
-
dimensional modelingis not OLTP design — the system under design is analytical, not transactional; the shape that optimises forOLAPis the opposite of the shape that optimises forOLTP. -
The choices are shape-binding — choosing
starvssnowflakelocks ETL complexity, query latency, and BI-tool integration for years; a wrong choice is a multi-quarter refactor, not a one-day fix. -
grainis the most-missed concept — every fact table has a declared grain (e.g., "one row per order line"); without it, every aggregate query is a guess. -
conformed dimensionsare the senior signal — a junior describes a single mart's star; a senior describes adim_customerthat is shared acrossfact_sales,fact_support, andfact_marketingso all three marts roll up consistently. -
SCD type 2is the discipline gate — a slowly-changing dimension without effective dates is the bug that makes historical reports lie; the senior signal is naming the SCD policy before the shape question.
Worked example — translate one OLTP table into both a star fact + dim and a snowflake fact + dim chain
Detailed explanation. Real interviews probe whether you can translate the same OLTP source onto both shapes and read off the structural differences. Below is the canonical translation: a single source orders_oltp table is reshaped into (a) a star with dim_product denormalised and (b) a snowflake with dim_product normalised into dim_category and dim_brand.
Question. Given a source OLTP orders_oltp table containing order_id, customer_id, product_id, product_name, category_name, brand_name, order_ts, quantity, unit_price, design (a) the equivalent star schema and (b) the equivalent snowflake schema, declaring the grain of the fact table and identifying which dimension columns move into sub-dimensions in the snowflake.
Input. One OLTP table, 10M rows. Each row is one order line; an order_id can repeat across rows if an order has multiple line items.
Code.
-- (a) STAR — one fact + four denormalised dims; product hierarchy is INLINE on dim_product.
CREATE TABLE fact_sales (
sales_sk BIGINT PRIMARY KEY, -- surrogate key (grain anchor)
customer_sk BIGINT NOT NULL REFERENCES dim_customer,
product_sk BIGINT NOT NULL REFERENCES dim_product,
date_sk INT NOT NULL REFERENCES dim_date,
store_sk INT NOT NULL REFERENCES dim_store,
quantity INT NOT NULL,
unit_price NUMERIC(18,4) NOT NULL,
revenue NUMERIC(18,4) NOT NULL -- measure: quantity * unit_price
);
-- Declared grain: one row = one order LINE.
CREATE TABLE dim_product (
product_sk BIGINT PRIMARY KEY,
product_id VARCHAR(64) NOT NULL, -- natural key (from OLTP)
product_name VARCHAR(256) NOT NULL,
category_name VARCHAR(128) NOT NULL, -- denormalised hierarchy
brand_name VARCHAR(128) NOT NULL, -- denormalised hierarchy
supplier_name VARCHAR(128) NOT NULL, -- denormalised hierarchy
effective_from DATE NOT NULL, -- SCD type 2
effective_to DATE,
is_current BOOLEAN NOT NULL
);
-- (b) SNOWFLAKE — same fact, but dim_product is normalised into dim_category + dim_brand.
CREATE TABLE dim_product_sf (
product_sk BIGINT PRIMARY KEY,
product_id VARCHAR(64) NOT NULL,
product_name VARCHAR(256) NOT NULL,
category_sk BIGINT NOT NULL REFERENCES dim_category,
brand_sk BIGINT NOT NULL REFERENCES dim_brand,
effective_from DATE NOT NULL,
effective_to DATE,
is_current BOOLEAN NOT NULL
);
CREATE TABLE dim_category (
category_sk BIGINT PRIMARY KEY,
category_name VARCHAR(128) NOT NULL UNIQUE
);
CREATE TABLE dim_brand (
brand_sk BIGINT PRIMARY KEY,
brand_name VARCHAR(128) NOT NULL UNIQUE,
supplier_sk BIGINT REFERENCES dim_supplier -- snowflake can branch further
);
Step-by-step explanation.
- The grain of
fact_salesis declared as one row per order line — every aggregate downstream (revenue per region, AOV per category) reads from this grain. - The star keeps
category_name,brand_name,supplier_nameinline ondim_product; oneJOINfrom fact to dim returns everything needed for a sliced-by-category report. - The snowflake lifts those columns into
dim_categoryanddim_brand(anddim_brandfurther referencesdim_supplier), eliminating redundancy at the cost of 2-3 extra joins per query. - Both shapes use surrogate keys (
product_sk) on the fact, not the natural OLTPproduct_id; this insulates the warehouse from upstream source-system key changes and is required forSCD type 2versioning. -
SCD type 2columns (effective_from,effective_to,is_current) live ondim_productin the star and ondim_product_sfin the snowflake; the SCD policy is identical, but the snowflake spreads the impact across the sub-dimensions only when they version too.
Output (counts of tables involved per analytical query for "revenue by category, last 30 days").
| schema | tables joined | join steps | typical query latency |
|---|---|---|---|
| Star | 2 (fact_sales + dim_product) | 1 | ~150 ms on 10M rows |
| Snowflake | 3 (fact_sales + dim_product_sf + dim_category) | 2 | ~280 ms on 10M rows |
Rule of thumb: the snowflake adds one join per normalised hierarchy level. Two levels = roughly 2× the join cost; under cache + columnar storage the runtime gap narrows but never closes.
star schema vs snowflake schema — the four senior signals
Signal 1 — opinionated trade-off framing. Senior data engineers do not say "both schemas are fine"; they say "star for dashboards because Tableau and Looker auto-generate single-join SQL against it, snowflake for regulated finance reporting because the normalised sub-dimensions match the source-of-truth chart of accounts and survive audits."
Signal 2 — grain declared up front. Junior modellers describe tables; senior modellers describe grain. The first sentence of any fact-table answer is "the grain of this fact is one row per …"; without that, every downstream SUM is a guess.
Signal 3 — conformed dimensions over per-mart re-modelling. Senior teams ship one dim_customer shared across fact_sales, fact_support, and fact_marketing; the dimension is conformed once and reused, so cross-mart reporting (revenue + tickets + campaign attribution per customer) is a single, trustworthy join.
Signal 4 — SCD policy is a first-class decision. Senior data engineers state SCD policy before shape; "dim_product is SCD type 2 with effective_from/effective_to/is_current" comes out of their mouth in the first 60 seconds, because that policy is what makes historical re-runs reproducible.
SQL
Topic — data-modeling
Data modeling drills
SQL
Topic — database
Database design practice
Solution Using a fact-and-dimension catalogue table
Code.
-- One canonical catalogue table — every row maps a table to its role, grain, and SCD policy.
CREATE TABLE warehouse_catalogue AS
SELECT * FROM (VALUES
('fact_sales', 'fact', 'one row per order line', 'star', 'N/A'),
('fact_sales_sf', 'fact', 'one row per order line', 'snowflake', 'N/A'),
('dim_customer', 'dimension', 'one row per customer', 'conformed', 'SCD type 2'),
('dim_product', 'dimension', 'one row per product version', 'star', 'SCD type 2'),
('dim_product_sf', 'dimension', 'one row per product version', 'snowflake', 'SCD type 2'),
('dim_category', 'dimension', 'one row per category', 'snowflake', 'SCD type 1'),
('dim_brand', 'dimension', 'one row per brand', 'snowflake', 'SCD type 1'),
('dim_date', 'dimension', 'one row per calendar day', 'conformed', 'static'),
('dim_store', 'dimension', 'one row per store version', 'conformed', 'SCD type 2')
) AS t(table_name, role, grain, schema_shape, scd_policy);
Step-by-step trace.
| table_name | role | grain | schema_shape | scd_policy |
|---|---|---|---|---|
| fact_sales | fact | one row per order line | star | N/A |
| fact_sales_sf | fact | one row per order line | snowflake | N/A |
| dim_customer | dimension | one row per customer | conformed | SCD type 2 |
| dim_product | dimension | one row per product version | star | SCD type 2 |
| dim_product_sf | dimension | one row per product version | snowflake | SCD type 2 |
| dim_category | dimension | one row per category | snowflake | SCD type 1 |
| dim_brand | dimension | one row per brand | snowflake | SCD type 1 |
| dim_date | dimension | one row per calendar day | conformed | static |
| dim_store | dimension | one row per store version | conformed | SCD type 2 |
- Rows 1-2 — the two fact variants share the same grain; only the surrounding dimension shape differs.
- Row 3 —
dim_customeris conformed across multiple marts; this is the single biggest reuse lever in a warehouse. - Rows 4-5 — the same product dimension exists in two shapes; the snowflake version is normalised but the SCD policy is identical.
- Rows 6-7 —
dim_categoryanddim_brandare the sub-dimensions that distinguish snowflake from star; in a star, they would be columns ondim_product. - Row 8 —
dim_dateis static (no SCD); the calendar does not version. - Row 9 —
dim_storeisSCD type 2because store ownership and address change over time, and historical reports must reflect the store-as-of-the-transaction.
Output.
| table_name | role | schema_shape | scd_policy |
|---|---|---|---|
| fact_sales | fact | star | N/A |
| dim_customer | dimension | conformed | SCD type 2 |
| dim_product | dimension | star | SCD type 2 |
| dim_category | dimension | snowflake | SCD type 1 |
| dim_date | dimension | conformed | static |
Why this works — concept by concept:
-
Catalogue as artefact — turns the design into a queryable table; reviewers can
WHERE role = 'fact'and audit grain declarations in one query. -
Grain column — every table has its grain explicit and checked into git; the catalogue makes "what is the grain of
fact_sales?" a SQL lookup, not a tribal-knowledge question. -
schema_shape enum —
star/snowflake/conformedmakes the shape decision auditable; conformed dimensions are explicit, not implicit. -
SCD policy as a column —
SCD type 1/SCD type 2/staticis the single most-skipped column in junior catalogues; senior teams treat it as load-bearing metadata. -
Cost —
O(1)to read the catalogue; the actual schema lives ininformation_schemaand dbt manifests, but the intent lives here.
2. Star schema anatomy — fact + denormalised dimensions + single-step joins
star schema — one fact at the centre, four denormalised dimensions, single-step joins
star schema is the canonical analytical shape: one fact table at the centre holding measures (quantity, revenue, discount) and foreign keys (customer_sk, product_sk, date_sk, store_sk); every surrounding dimension table holds the descriptive attributes of one business entity in a single, denormalised table — no further sub-dimensions, no normalisation. Every analytical query reaches its data in one join per dimension; that single-step join shape is what every modern BI tool (Tableau, Looker, Power BI, Mode, Hex) auto-generates SQL against.
The four anatomy rules of a star schema.
-
Rule 1 — one
fact tableper process —fact_sales,fact_returns,fact_inventoryare separate facts; do not jam two processes into one fact table. -
Rule 2 —
grainis declared and uniform — every row in the fact has the same grain; "one row per order line" is a declared contract, never an assumption. -
Rule 3 —
dimension tables are *denormalised* —dim_productholdscategory,brand, andsupplieras columns, not as foreign keys; the hierarchy lives inline. -
Rule 4 —
surrogate keyseverywhere — fact-to-dim joins useproduct_sk(a BIGINT generated by the warehouse), never the naturalproduct_id; this enablesSCD type 2and insulates against upstream source-system key changes.
The canonical four-dimension star.
CREATE TABLE fact_sales (
sales_sk BIGINT PRIMARY KEY,
customer_sk BIGINT NOT NULL REFERENCES dim_customer,
product_sk BIGINT NOT NULL REFERENCES dim_product,
date_sk INT NOT NULL REFERENCES dim_date,
store_sk INT NOT NULL REFERENCES dim_store,
quantity INT NOT NULL,
unit_price NUMERIC(18,4) NOT NULL,
discount NUMERIC(18,4) NOT NULL DEFAULT 0,
revenue NUMERIC(18,4) NOT NULL
);
CREATE TABLE dim_customer (
customer_sk BIGINT PRIMARY KEY,
customer_id VARCHAR(64) NOT NULL,
customer_name VARCHAR(256) NOT NULL,
segment VARCHAR(64),
city VARCHAR(128),
region VARCHAR(64),
country VARCHAR(64),
signup_date DATE,
effective_from DATE NOT NULL,
effective_to DATE,
is_current BOOLEAN NOT NULL
);
CREATE TABLE dim_product (
product_sk BIGINT PRIMARY KEY,
product_id VARCHAR(64) NOT NULL,
product_name VARCHAR(256) NOT NULL,
category VARCHAR(128) NOT NULL,
sub_category VARCHAR(128),
brand VARCHAR(128) NOT NULL,
supplier VARCHAR(128) NOT NULL,
effective_from DATE NOT NULL,
effective_to DATE,
is_current BOOLEAN NOT NULL
);
CREATE TABLE dim_date (
date_sk INT PRIMARY KEY, -- YYYYMMDD as INT
date_value DATE NOT NULL UNIQUE,
day_of_week INT NOT NULL,
week INT NOT NULL,
month INT NOT NULL,
quarter INT NOT NULL,
year INT NOT NULL,
fiscal_yr INT NOT NULL,
fiscal_qtr INT NOT NULL,
is_weekend BOOLEAN NOT NULL
);
CREATE TABLE dim_store (
store_sk INT PRIMARY KEY,
store_id VARCHAR(64) NOT NULL,
store_name VARCHAR(256),
city VARCHAR(128),
region VARCHAR(64),
country VARCHAR(64),
manager_name VARCHAR(256),
effective_from DATE NOT NULL,
effective_to DATE,
is_current BOOLEAN NOT NULL
);
-
fact_sales— measures (quantity,unit_price,discount,revenue) plus four_skforeign keys; nothing else. -
dim_customer— descriptive attributes of a customer, including geography inline; nodim_geographysub-dimension. -
dim_product— hierarchy (category,sub_category,brand,supplier) is denormalised as columns; nodim_category. -
dim_date— pre-populated calendar dimension with every day of the past + future N years;date_skis the integer formYYYYMMDDso range scans (date_sk BETWEEN 20240101 AND 20240131) are index-friendly. -
dim_store— store attributes withSCD type 2versioning so historical reports show the manager and address as of the transaction.
The canonical star schema query — revenue by category, last 30 days.
SELECT
p.category,
SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_product p ON p.product_sk = f.product_sk
JOIN dim_date d ON d.date_sk = f.date_sk
WHERE d.date_value >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY p.category
ORDER BY revenue DESC;
-
Two joins — fact to
dim_product, fact todim_date; both single-step. -
p.categorylives inline ondim_product; no sub-dimension hop. -
d.date_valuefilter uses the denormalised date column; the integerdate_skis the join key. - Plan — one hash join per dimension; columnar warehouses (Snowflake, BigQuery, Redshift) cache the dimension scans and run the fact aggregate in parallel; sub-second on 100M-row facts.
Worked example — design the star for a multi-channel retailer
Detailed explanation. A typical interview prompt is "design a star schema for a multi-channel retailer (web + store + mobile)". Below is the canonical answer, with grain declared up front and dim_channel introduced as a conformed dimension.
Question. A retailer sells through web, brick-and-mortar stores, and a mobile app. Design a star schema for fact_sales that captures (a) the order channel, (b) the customer, (c) the product, (d) the date, and (e) the store (which is 'web' or 'mobile' for non-physical channels). Declare the grain.
Input. Source OLTP feed: one row per order line, with order_id, customer_id, product_id, order_ts, channel, store_id (null for web/mobile), quantity, unit_price, discount.
Code.
-- Grain: one row per order LINE (not per order). An order with 3 line items contributes 3 fact rows.
CREATE TABLE fact_sales (
sales_sk BIGINT PRIMARY KEY,
order_id VARCHAR(64) NOT NULL, -- degenerate dimension (lives on fact)
customer_sk BIGINT NOT NULL REFERENCES dim_customer,
product_sk BIGINT NOT NULL REFERENCES dim_product,
date_sk INT NOT NULL REFERENCES dim_date,
store_sk INT NOT NULL REFERENCES dim_store,
channel_sk INT NOT NULL REFERENCES dim_channel,
quantity INT NOT NULL,
unit_price NUMERIC(18,4) NOT NULL,
discount NUMERIC(18,4) NOT NULL DEFAULT 0,
revenue NUMERIC(18,4) NOT NULL -- (unit_price * quantity) - discount
);
CREATE TABLE dim_channel (
channel_sk INT PRIMARY KEY,
channel_name VARCHAR(64) NOT NULL UNIQUE -- 'web' | 'store' | 'mobile'
);
-- dim_store has a sentinel row for web + mobile so the FK is never NULL.
INSERT INTO dim_store (store_sk, store_id, store_name, is_current)
VALUES
(-1, 'WEB', 'Web (non-physical)', TRUE),
(-2, 'MOBILE', 'Mobile (non-physical)', TRUE);
Step-by-step explanation.
-
Grain declared first — one row per order line; an order with three line items creates three fact rows. This grain is what makes
SUM(revenue) GROUP BY productcorrect. -
order_idon the fact is a degenerate dimension — a dimension that has no other attributes worth a separate table; it lives as a column on the fact. -
dim_channelis its own conformed dimension becausechanneljoins tofact_marketing,fact_returns, andfact_supportas well — three separate fact tables that should all use the samechannel_sk. -
dim_storesentinel rows — web and mobile orders usestore_sk = -1and-2; this preserves NOT NULL on the FK and makes "all-channel" rollups oneGROUP BY channel_nameaway. -
revenueis *pre-computed* on the fact —(unit_price * quantity) - discountis stored, not computed at query time; this trades one numeric column of storage for a 100× speedup on aggregate queries.
Output (typical 1-day load profile).
| process | source_rows | fact_rows | dimensions_updated |
|---|---|---|---|
| Daily sales load | 5,200,000 orders | 8,800,000 lines | dim_customer (+800 new), dim_product (+120 new) |
Rule of thumb: if a fact aggregate could ever return wrong numbers because of grain ambiguity, the grain is undeclared. Declare it once, check it in CI with a COUNT(*) = COUNT(DISTINCT grain_key_combo) test, and never let it drift.
star schema — the four senior nuances
-
Degenerate dimensions —
order_idandinvoice_numberbelong on the fact as columns, not as a one-column dim table. -
Junk dimensions — combine 3-5 low-cardinality flags (
is_promo,is_first_order,is_returning_customer) into onedim_order_flagsrather than four separate dims. -
Role-playing dimensions —
dim_datejoined asorder_date_sk,ship_date_sk,delivery_date_skis one underlying dim played three roles; alias the join. -
Slowly-changing dimensions — every dimension that can have its descriptive attributes change and you need historical accuracy on must be
SCD type 2; the rest can beSCD type 1(overwrite).
SQL
Topic — joins
Star-schema join practice
SQL
Topic — aggregation
Aggregation drills
Solution Using a single-join-per-dimension star query
Code.
-- The canonical star query: one JOIN per dimension, single-pass aggregate.
SELECT
p.category,
p.brand,
c.region AS customer_region,
s.region AS store_region,
ch.channel_name,
d.year,
d.quarter,
SUM(f.quantity) AS units,
SUM(f.revenue) AS revenue,
AVG(f.unit_price) AS avg_unit_price,
SUM(f.revenue) / NULLIF(SUM(f.quantity), 0) AS effective_price
FROM fact_sales f
JOIN dim_product p ON p.product_sk = f.product_sk
JOIN dim_customer c ON c.customer_sk = f.customer_sk
JOIN dim_store s ON s.store_sk = f.store_sk
JOIN dim_channel ch ON ch.channel_sk = f.channel_sk
JOIN dim_date d ON d.date_sk = f.date_sk
WHERE d.year = 2026 AND d.quarter = 1
GROUP BY p.category, p.brand, c.region, s.region, ch.channel_name, d.year, d.quarter
ORDER BY revenue DESC
LIMIT 50;
Step-by-step trace.
| step | operation | rows in | rows out |
|---|---|---|---|
| 1 | Scan fact_sales partition for Q1 2026 |
8,800,000 (annual) | 2,150,000 (Q1) |
| 2 | Hash-join dim_product (~50K rows) |
2,150,000 | 2,150,000 |
| 3 | Hash-join dim_customer (~1.2M rows) |
2,150,000 | 2,150,000 |
| 4 | Hash-join dim_store (~300 rows) |
2,150,000 | 2,150,000 |
| 5 | Hash-join dim_channel (3 rows, broadcast) |
2,150,000 | 2,150,000 |
| 6 | Hash-join dim_date (~5K rows) |
2,150,000 | 2,150,000 |
| 7 | Group + aggregate | 2,150,000 | ~12,000 distinct combos |
| 8 | Order + limit | 12,000 | 50 |
- Step 1 partition-prunes the fact to one quarter; the warehouse skips ~75% of the data without reading it.
- Steps 2-6 hash-join each dimension; dimensions are small enough to broadcast (replicate to every executor), so no shuffle is required.
- Step 7 performs the aggregate on the joined row set; columnar warehouses execute this in parallel across slots.
- Step 8 sorts the small aggregated result; latency is dominated by step 1 + step 7.
- Total wall-clock on Snowflake
XSwarehouse: ~600 ms on 8M rows.
Output (sample).
| category | brand | customer_region | store_region | channel_name | year | quarter | units | revenue | avg_unit_price | effective_price |
|---|---|---|---|---|---|---|---|---|---|---|
| Electronics | Acme | NA | NA | web | 2026 | 1 | 42,300 | 9,820,500.00 | 240.50 | 232.16 |
| Electronics | Acme | EU | EU | web | 2026 | 1 | 31,400 | 7,612,000.00 | 248.10 | 242.42 |
| Apparel | Beta | NA | NA | store | 2026 | 1 | 88,200 | 5,210,700.00 | 62.40 | 59.08 |
Why this works — concept by concept:
- Single join per dim — each dimension is reached in exactly one hop; the optimiser builds one hash table per dim and probes the fact once.
-
Broadcast joins on small dims —
dim_channel(3 rows),dim_store(300 rows), anddim_date(~5K rows) are broadcast; no shuffle cost. -
Pre-computed
revenue— the fact storesrevenuedirectly;SUM(f.revenue)is one column read, notSUM((unit_price * quantity) - discount)re-derived per row. -
NULLIF guard —
effective_price = revenue / NULLIF(quantity, 0)protects against divide-by-zero on zero-quantity returns. -
Cost —
O(N)over the fact scan +O(N + D)per hash join whereNis fact rows andDis dimension rows; on modern columnar warehouses the practical cost is dominated by the fact scan, not the joins.
3. Snowflake schema anatomy — normalised dimensions + branching sub-dimensions
snowflake schema — normalised dimensions, branching sub-dimensions, multi-step joins
snowflake schema is the same fact_sales at the centre, but each dimension table is normalised — typically to 3NF (third normal form) — so that hierarchies (category → brand → supplier; city → region → country) live in their own sub-dimension tables, connected by foreign keys. The result is less storage (no repeated category names across millions of products), more join steps per query (two or three hops instead of one), and a shape that matches audit-friendly source-of-truth references.
The four anatomy rules of a snowflake schema.
-
Rule 1 — same
fact tableshape as a star — the fact does not change; only the dimensions normalise. -
Rule 2 — each hierarchy level becomes its own table —
dim_product → dim_category → dim_brand → dim_supplier; one table per level. -
Rule 3 — sub-dimensions enforce uniqueness —
dim_category.category_nameisUNIQUE; a single source of truth for category names. -
Rule 4 — query SQL is multi-join — any analytical query that slices by category joins
fact_sales → dim_product → dim_category(two hops).
The canonical four-dimension snowflake.
CREATE TABLE fact_sales (
sales_sk BIGINT PRIMARY KEY,
customer_sk BIGINT NOT NULL REFERENCES dim_customer,
product_sk BIGINT NOT NULL REFERENCES dim_product,
date_sk INT NOT NULL REFERENCES dim_date,
store_sk INT NOT NULL REFERENCES dim_store,
quantity INT NOT NULL,
unit_price NUMERIC(18,4) NOT NULL,
discount NUMERIC(18,4) NOT NULL DEFAULT 0,
revenue NUMERIC(18,4) NOT NULL
);
-- Product is normalised: dim_product → dim_category → dim_brand → dim_supplier.
CREATE TABLE dim_product (
product_sk BIGINT PRIMARY KEY,
product_id VARCHAR(64) NOT NULL,
product_name VARCHAR(256) NOT NULL,
category_sk BIGINT NOT NULL REFERENCES dim_category,
brand_sk BIGINT NOT NULL REFERENCES dim_brand,
effective_from DATE NOT NULL, effective_to DATE, is_current BOOLEAN NOT NULL
);
CREATE TABLE dim_category (
category_sk BIGINT PRIMARY KEY,
category_name VARCHAR(128) NOT NULL UNIQUE,
parent_category_sk BIGINT REFERENCES dim_category -- self-reference for sub-categories
);
CREATE TABLE dim_brand (
brand_sk BIGINT PRIMARY KEY,
brand_name VARCHAR(128) NOT NULL UNIQUE,
supplier_sk BIGINT NOT NULL REFERENCES dim_supplier
);
CREATE TABLE dim_supplier (
supplier_sk BIGINT PRIMARY KEY,
supplier_name VARCHAR(128) NOT NULL UNIQUE,
country VARCHAR(64)
);
-- Customer is normalised: dim_customer → dim_geography.
CREATE TABLE dim_customer (
customer_sk BIGINT PRIMARY KEY,
customer_id VARCHAR(64) NOT NULL,
customer_name VARCHAR(256) NOT NULL,
segment VARCHAR(64),
geography_sk BIGINT NOT NULL REFERENCES dim_geography,
signup_date DATE,
effective_from DATE NOT NULL, effective_to DATE, is_current BOOLEAN NOT NULL
);
CREATE TABLE dim_geography (
geography_sk BIGINT PRIMARY KEY,
city VARCHAR(128) NOT NULL,
region VARCHAR(64) NOT NULL,
country VARCHAR(64) NOT NULL,
UNIQUE (city, region, country)
);
-
dim_productno longer carriescategory_nameorbrand_name— those live on the sub-dimensions and are reached viacategory_skandbrand_sk. -
dim_categoryhas a self-reference (parent_category_sk) so sub-categories link to parent categories without a sixth table. -
dim_brand → dim_supplier— a brand belongs to one supplier; this hierarchy is enforced by FK rather than denormalised. -
dim_geographycentralises city / region / country; if 1.2M customers all live in 30K unique geographies, the storage saving is significant (~40 GB on a wide string-heavy customer table down to ~12 GB).
The canonical snowflake schema query — revenue by category, last 30 days.
SELECT
c.category_name,
SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_product p ON p.product_sk = f.product_sk
JOIN dim_category c ON c.category_sk = p.category_sk -- the extra hop
JOIN dim_date d ON d.date_sk = f.date_sk
WHERE d.date_value >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY c.category_name
ORDER BY revenue DESC;
-
Three joins — fact to
dim_product,dim_producttodim_category, fact todim_date; the extra hop isdim_product → dim_category. -
c.category_nameis no longer inline ondim_product; the query must traverse the sub-dimension. - Plan — one extra hash-join step; on a 100M-row fact the extra hop adds ~50-150 ms depending on warehouse size.
-
BI tools —
TableauandLookercan model this, but the user (or the LookML / Tableau-relationship layer) has to declare the join path; the auto-generated SQL is no longer single-step.
Worked example — refactor a star to a snowflake to save 35 GB on dim_customer
Detailed explanation. A common production trigger for a snowflake refactor is storage pressure on a wide dimension. Below is the canonical refactor: a 1.2M-row dim_customer with 90% repeated city/region/country strings is normalised into dim_customer + dim_geography, saving ~35 GB.
Question. Your star dim_customer is 80 GB on 1.2M rows because each row carries city VARCHAR(128) + region VARCHAR(64) + country VARCHAR(64) and the strings are repeated across customers in the same geography. Refactor to a snowflake with dim_geography, write the migration SQL, and quantify the storage saving.
Input. dim_customer 1.2M rows × ~256 bytes geography strings ≈ 300 MB; with row overhead, dictionary encoding, and indexes the on-disk size is ~80 GB. Distinct geographies: ~30,000.
Code.
-- Step 1 — extract unique geographies into a sub-dim.
CREATE TABLE dim_geography AS
SELECT
ROW_NUMBER() OVER (ORDER BY country, region, city) AS geography_sk,
city, region, country
FROM (
SELECT DISTINCT city, region, country FROM dim_customer
) g;
-- Step 2 — rebuild dim_customer pointing at the new sub-dim.
CREATE TABLE dim_customer_new AS
SELECT
c.customer_sk,
c.customer_id,
c.customer_name,
c.segment,
g.geography_sk,
c.signup_date,
c.effective_from, c.effective_to, c.is_current
FROM dim_customer c
JOIN dim_geography g
ON g.city = c.city AND g.region = c.region AND g.country = c.country;
-- Step 3 — atomic swap (Snowflake-style).
ALTER TABLE dim_customer RENAME TO dim_customer_old;
ALTER TABLE dim_customer_new RENAME TO dim_customer;
Step-by-step explanation.
- Step 1 materialises the distinct geography tuples; 30K rows replace the 1.2M repeated strings.
-
Step 2 joins each customer to its geography surrogate key; the new
dim_customercarriesgeography_sk(a BIGINT, ~8 bytes) instead of~256 bytesof strings. -
Step 3 swaps the table in one atomic DDL; downstream queries that pre-existed need a tiny adjustment to
JOIN dim_geographywhenever they need city / region / country. - Storage math — 1.2M rows × (256 − 8) bytes ≈ 290 MB raw saved; with row overhead, dictionary, and indexes the on-disk saving compounds to ~35 GB.
-
Query cost — every query that slices by region now adds one hash join; in practice this is < 50 ms on warm dimensions because
dim_geography(30K rows) fits in L2 cache.
Output (storage profile before/after).
| table | shape | row count | on-disk size |
|---|---|---|---|
| dim_customer (before) | star | 1,200,000 | ~80 GB |
| dim_customer (after) | snowflake | 1,200,000 | ~45 GB |
| dim_geography (new) | snowflake | 30,000 | ~3 MB |
| net saving | ~35 GB |
Rule of thumb: snowflake the dimensions whose hierarchy is both high-cardinality strings and heavily repeated. A 1.2M-row dim with 30K unique geographies is a clear win; a 1.2M-row dim with 1.1M unique geographies (almost no repetition) is not.
snowflake schema — the four senior nuances
- Normal forms — most production snowflakes are 3NF (third normal form); going beyond 3NF rarely pays off because the extra join cost outweighs any storage win.
-
Bridge tables — when a fact-to-dim relationship is many-to-many (a single sales line covers two promotional offers), a bridge table with
weighting_factorcolumns is the snowflake pattern. -
Outrigger dimensions — a dim that references another dim (e.g.,
dim_employee → dim_manager); fine in moderation, but more than two levels of outriggers is a smell. -
Mini-dimensions — for
dim_customerwith frequently-changing low-cardinality attributes (age band, income tier), split those into adim_customer_profileso the main customer history stays small.
SQL
Topic — joins
Multi-join SQL practice
SQL
Topic — database
Normalised-schema drills
Solution Using a multi-hop snowflake query against the normalised dimensions
Code.
-- Snowflake equivalent of the star query — same business question, more joins.
SELECT
cat.category_name,
br.brand_name,
g.region AS customer_region,
s.region AS store_region,
ch.channel_name,
d.year,
d.quarter,
SUM(f.quantity) AS units,
SUM(f.revenue) AS revenue,
SUM(f.revenue) / NULLIF(SUM(f.quantity), 0) AS effective_price
FROM fact_sales f
JOIN dim_product p ON p.product_sk = f.product_sk
JOIN dim_category cat ON cat.category_sk = p.category_sk -- sub-dim hop
JOIN dim_brand br ON br.brand_sk = p.brand_sk -- sub-dim hop
JOIN dim_customer c ON c.customer_sk = f.customer_sk
JOIN dim_geography g ON g.geography_sk = c.geography_sk -- sub-dim hop
JOIN dim_store s ON s.store_sk = f.store_sk
JOIN dim_channel ch ON ch.channel_sk = f.channel_sk
JOIN dim_date d ON d.date_sk = f.date_sk
WHERE d.year = 2026 AND d.quarter = 1
GROUP BY cat.category_name, br.brand_name, g.region, s.region, ch.channel_name, d.year, d.quarter
ORDER BY revenue DESC
LIMIT 50;
Step-by-step trace.
| step | operation | rows in | rows out |
|---|---|---|---|
| 1 | Scan fact_sales Q1 2026 |
8,800,000 (annual) | 2,150,000 (Q1) |
| 2 | Hash-join dim_product (~50K rows) |
2,150,000 | 2,150,000 |
| 3 | Hash-join dim_category (~1.2K rows, broadcast) |
2,150,000 | 2,150,000 |
| 4 | Hash-join dim_brand (~5K rows, broadcast) |
2,150,000 | 2,150,000 |
| 5 | Hash-join dim_customer (~1.2M rows) |
2,150,000 | 2,150,000 |
| 6 | Hash-join dim_geography (~30K rows, broadcast) |
2,150,000 | 2,150,000 |
| 7 | Hash-join dim_store (~300 rows, broadcast) |
2,150,000 | 2,150,000 |
| 8 | Hash-join dim_channel (3 rows, broadcast) |
2,150,000 | 2,150,000 |
| 9 | Hash-join dim_date (~5K rows, broadcast) |
2,150,000 | 2,150,000 |
| 10 | Group + aggregate | 2,150,000 | ~12,000 |
| 11 | Order + limit | 12,000 | 50 |
- Steps 1-2 are identical to the star (fact scan +
dim_productjoin). - Step 3 is the extra hop —
dim_product → dim_category; broadcast becausedim_categoryis tiny. - Step 4 is another extra hop for the brand lookup.
- Step 6 is the geography hop on the customer side; broadcast because 30K rows fit in cache.
- The total wall-clock on Snowflake
XS: ~900 ms — about 50% slower than the star (~600 ms) on the same data and same warehouse, even though all sub-dims are broadcast.
Output (sample).
| category_name | brand_name | customer_region | store_region | channel_name | year | quarter | units | revenue | effective_price |
|---|---|---|---|---|---|---|---|---|---|
| Electronics | Acme | NA | NA | web | 2026 | 1 | 42,300 | 9,820,500.00 | 232.16 |
| Electronics | Acme | EU | EU | web | 2026 | 1 | 31,400 | 7,612,000.00 | 242.42 |
| Apparel | Beta | NA | NA | store | 2026 | 1 | 88,200 | 5,210,700.00 | 59.08 |
Why this works — concept by concept:
-
Multi-hop join chain — the snowflake forces
fact → dim → sub-dimfor every hierarchy slice; the SQL pays the extra join in exchange for normalised storage. -
Broadcast joins on sub-dims —
dim_category,dim_brand,dim_geographyare small enough to broadcast; no shuffle cost on a modern columnar warehouse. - Same business answer — the result set is identical to the star query; only the SQL and the plan differ.
- Latency tax — the extra hops cost ~30-50% more runtime in practice; for a sub-second dashboard query this is fine, for a 30-minute batch this is fine, for a 50-ms BI drilldown it is a problem.
-
Cost —
O(N)over the fact scan +O(N + Dᵢ)per dim hop; cumulative cost scales linearly with hop count, which is why snowflakes with > 3 hops per query are slow in practice.
4. Star vs Snowflake — five-dimension trade-off (query speed, ETL, storage, BI fit, best for)
star schema vs snowflake schema — the five-dimension trade-off matrix
The five-dimension trade-off is the framework every senior dimensional modeling interviewer wants you to recite: query speed (joins per query), ETL complexity (load orchestration), storage cost (denormalised redundancy vs normalised reuse), BI tool fit (auto-generated SQL vs manual join paths), and best for (which workloads each schema wins at). Every senior fact table + dimension table discussion comes back to these five axes.
Dimension 1 — query speed.
- Star — fewer joins → faster. One join per dimension; columnar warehouses (Snowflake, BigQuery, Redshift, Databricks) hash-join one dim at a time; aggregate is single-pass.
- Snowflake — more joins → slower on wide queries. Two or three joins per dimension hierarchy; broadcasts help small sub-dims but every extra hop adds optimiser work and cache pressure.
- The empirical delta — on a 100M-row fact with 4 wide dimensions, the snowflake variant is typically 20-50% slower for a multi-dim slice query; for a single-dim slice the difference is < 10%.
- The senior take — query speed matters most when the workload is interactive BI (sub-second dashboards); for batch / overnight reporting the difference is irrelevant.
Dimension 2 — ETL complexity.
-
Star — heavier load on each dim, simpler shape. Building
dim_productwith denormalisedcategory,brand,suppliermeans resolving each lookup once per load and writing the wide row; simpler orchestration (one dim table per business entity). - Snowflake — lighter load per dim, more orchestration. Each sub-dim is updated independently; the load DAG has more nodes (one per sub-dim) and you must enforce parent-before-child loading order.
- The empirical delta — in dbt terms, a typical star has ~5-8 dim models; the snowflake equivalent has ~10-14. Engineering time per dim is similar; total time scales with model count.
- The senior take — pick whichever your team can maintain; an under-staffed team should not sign up for the orchestration overhead of a snowflake.
Dimension 3 — storage cost.
-
Star — redundant strings on wide dims.
dim_productwith 1M rows ×category VARCHAR(128) + brand VARCHAR(128) + supplier VARCHAR(128)carries ~400 MB of redundant strings; with row overhead and indexes the disk footprint is far larger. - Snowflake — 20-40% smaller on wide dims. Normalising the strings into sub-dims replaces the wide string columns with 8-byte surrogate keys; on dimensions with high repetition the saving is substantial.
-
The empirical delta —
dim_customer1.2M × geography refactor in section 3 saved ~35 GB; on a 30M-row clickstreamdim_eventwith repeatingevent_category,event_subcategory, the saving can hit 200-300 GB. - The senior take — storage cost matters when you are paying per-TB (cloud warehouses) at scale; at 10 TB it's a rounding error, at 10 PB it is real money.
Dimension 4 — BI tool fit.
-
Star — Tableau, Looker, Power BI, Mode, Hex love it. Every BI tool auto-generates SQL against a star with zero configuration;
dim_product.categoryis a clickable field that joins fact-to-dim transparently. -
Snowflake — needs manual joins or views.
Lookerrequires explicitLookMLview definitions per sub-dim hop;Tableaurequires relationship modelling;Power BIrequires relationship arrows. The end-user click-and-explore experience is worse unless the BI layer abstracts the hops. - The empirical delta — onboarding a new dashboard analyst on a star takes hours; on a snowflake it takes days because they must learn the join paths.
- The senior take — if business users self-serve in the BI tool, star wins; if all SQL is centrally authored by data engineers, either works.
Dimension 5 — best for (workloads).
- Star — best for — interactive BI dashboards, ad-hoc analytics, self-serve exploration, marketing/sales/product KPI surfaces, fast time-to-first-insight, smaller-to-medium warehouses where storage is not the binding constraint.
- Snowflake — best for — regulated reporting (finance, healthcare, insurance) where the source-of-truth hierarchy matches the audit chart of accounts, petabyte-scale warehouses where storage savings are material, deeply hierarchical dimensions (product taxonomies with 4+ levels), data-vault → mart pipelines where snowflake is the natural intermediate shape.
The honest meta-take. Most production warehouses ship both — a snowflake layer under the hood for raw / staging / data-vault, and a star layer at the consumption mart. The cleanest pattern is snowflake-on-the-way-in, star-on-the-way-out: normalise to sub-dimensions during ingestion to enforce hierarchy integrity, denormalise back to a star at the mart layer for BI consumption. This pattern is increasingly common in dbt + Snowflake + Looker stacks.
Worked example — score the same warehouse on all five dimensions
Detailed explanation. A realistic interview drill is "score your current warehouse on the five-dimension trade-off matrix". Below is the canonical scoring exercise for a mid-size retailer running a hybrid (snowflake staging, star mart) on Snowflake + Looker.
Question. A retailer has 100M-row fact_sales, 1.2M-row dim_customer, 50K-row dim_product, and 300-row dim_store. They run interactive Looker dashboards (~500 concurrent users), nightly finance reconciliation, and weekly product-hierarchy audits. Score star vs snowflake on the five dimensions and recommend a shape per layer.
Input. Workload mix: 70% interactive BI (sub-second SLA), 25% nightly batch (4-hour SLA), 5% audit queries (10-minute SLA). Storage budget: $5K/month.
Code.
CREATE TABLE shape_scorecard AS
SELECT * FROM (VALUES
('query_speed', 'star', 'wins', 'sub-second on 100M rows'),
('query_speed', 'snowflake', 'acceptable', '0.9-1.5 s on 100M rows'),
('etl_complexity', 'star', 'medium', '6 dim models + 1 fact'),
('etl_complexity', 'snowflake', 'higher', '12 dim/sub-dim models + 1 fact'),
('storage_cost', 'star', 'baseline', '900 GB total'),
('storage_cost', 'snowflake', 'cheaper', '~620 GB total (saves $300/month)'),
('bi_tool_fit', 'star', 'wins', 'Looker auto-joins, zero LookML hops'),
('bi_tool_fit', 'snowflake', 'requires LookML','manual joins for each sub-dim'),
('best_for', 'star', 'BI mart', 'consumption layer for Looker'),
('best_for', 'snowflake', 'staging + audit','source-of-truth + finance reconciliation')
) AS t(dimension, schema_shape, verdict, evidence);
Step-by-step explanation.
- The scorecard is a single artefact a senior engineer can paste into an architecture doc.
- Each dimension has two rows — one verdict per shape; the comparison is explicit, not narrative.
- The
evidencecolumn anchors each verdict in numbers —sub-second on 100M rows,$300/month savings; this is the senior-signal column. - The recommendation falls out: snowflake at staging + audit, star at the mart; this is the dominant production pattern in 2026.
- The scorecard is a living document — re-score quarterly as data volume and workload mix shift.
Output (recommendation table).
| layer | shape | rationale |
|---|---|---|
| raw + staging | snowflake | matches source-of-truth audit hierarchy |
| consumption mart | star | Looker auto-joins, sub-second BI |
| finance audit views | snowflake | regulated reporting needs normalised dims |
Rule of thumb: the layer drives the shape, not the warehouse-wide preference. Modern stacks rarely pick one shape for the entire warehouse.
The trade-off matrix as a one-screen reference
| Dimension | Star verdict | Snowflake verdict |
|---|---|---|
| Query speed | Fewer joins · faster | More joins · slower on wide queries |
| ETL complexity | Heavier per-dim load · simpler shape | Lighter per-dim load · more orchestration |
| Storage cost | Redundant strings on wide dims | 20-40% smaller on wide dims |
| BI tool fit | Tableau / Looker / Power BI auto-join | Needs manual joins, LookML, or views |
| Best for | Dashboards · ad-hoc analytics · self-serve | Regulated reporting · audit trails · petabyte storage |
SQL
Topic — sql
SQL practice library
SQL
Topic — group-by
GROUP BY practice
Solution Using a side-by-side query comparison + measured cost
Code.
-- Same business question, both shapes, with timing.
-- (A) STAR query — 2 joins.
EXPLAIN ANALYZE
SELECT p.category, SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_product_star p ON p.product_sk = f.product_sk
WHERE f.date_sk BETWEEN 20260101 AND 20260131
GROUP BY p.category;
-- (B) SNOWFLAKE query — 3 joins.
EXPLAIN ANALYZE
SELECT cat.category_name, SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_product_sf p ON p.product_sk = f.product_sk
JOIN dim_category cat ON cat.category_sk = p.category_sk
WHERE f.date_sk BETWEEN 20260101 AND 20260131
GROUP BY cat.category_name;
Step-by-step trace.
| step | star plan | snowflake plan |
|---|---|---|
| 1 | Scan fact (Jan 2026) — 2.15M rows | Scan fact (Jan 2026) — 2.15M rows |
| 2 | Hash-join dim_product_star (50K, broadcast) | Hash-join dim_product_sf (50K, broadcast) |
| 3 | Group + aggregate by category | Hash-join dim_category (1.2K, broadcast) |
| 4 | Order | Group + aggregate by category_name |
| 5 | Order | |
| Wall-clock | ~420 ms on Snowflake XS | ~610 ms on Snowflake XS |
- Step 1 is identical — both shapes scan the same fact partition.
- Step 2 is identical — both shapes broadcast
dim_product. - The snowflake variant adds step 3 — an extra hash-join hop to
dim_category. - Steps 4-5 in the snowflake plan are the same as steps 3-4 in the star plan, just shifted by one.
- Total cost delta: ~190 ms (~45% slower) on this small example; on larger facts the delta widens further because the hash table cache pressure grows.
Output.
| query | joins | wall_clock_ms | result_rows |
|---|---|---|---|
| (A) star | 2 | 420 | 28 categories |
| (B) snowflake | 3 | 610 | 28 categories (same answer) |
Why this works — concept by concept:
- Same-result comparison — both queries return the same category-level totals; the SQL and the plan differ but the answer does not.
- Measured wall-clock — interviewers want numbers, not opinions; bringing a measured wall-clock delta is the senior move.
- Broadcast economics — small sub-dims (< 10K rows) broadcast cheaply; large sub-dims (> 1M rows) shuffle and the snowflake delta grows fast.
- Optimiser caveats — the warehouse query planner may reorder joins; the number of joins is the floor on cost, not the ceiling.
-
Cost —
O(N)fact scan +O(N + Dᵢ)per join hop; cumulative hops are the differentiator between the shapes.
5. Decision matrix — when to choose which (with worked SQL)
star schema vs snowflake schema — a four-question decision tree
The decision matrix is the senior framework: four questions, four verdicts, one clear answer per workload. Memorise it and you can defend any shape choice in 60 seconds.
Q1 — Is query latency the #1 priority?
-
YES →
star schema(denormalised). Interactive BI / sub-second dashboards demand the fewest joins possible. Modern columnar warehouses can mask one or two extra joins, but at 100+ concurrent users every saved millisecond pays compounding rent. - NO → continue to Q2.
Q2 — Are dimension hierarchies deep AND changing often?
-
YES →
snowflake schema(normalised). Deep hierarchies (country → region → city → district → neighbourhood) with frequent re-org events (regionboundaries shift) are painful to maintain as denormalised strings. Normalising into sub-dims means a re-org touches one row indim_region, not millions of rows indim_customer. - NO → continue to Q3.
Q3 — Is storage cost a meaningful constraint? (e.g. petabyte-scale)
-
YES →
snowflake schema. At petabyte scale a 30% storage saving on wide dimensions translates to material dollars; the join cost is amortised across many queries. - NO → continue to Q4.
Q4 — Does your BI tool auto-join multi-step paths?
- YES → either works (modern Looker with explicit LookML joins, Power BI with relationship views; both can mask snowflake hops from end users).
-
NO →
star schema(safer default; minimises the BI-layer modelling cost).
The default verdict. Start star, refactor only if storage or audit requires it. For ~80% of warehouses, the star schema is the right starting shape; the cost of refactoring a star into a snowflake later is far smaller than the cost of forcing every analyst to learn snowflake join paths from day one.
When Data Vault is in the mix. Data Vault 2.0 (hubs + links + satellites) is its own paradigm and lives upstream of both star and snowflake; a typical pipeline is source → data vault → snowflake (intermediate) → star (consumption mart). The decision matrix above applies to the consumption layer, not the data-vault layer.
Worked example — pick the schema for three real workloads
Detailed explanation. Real senior interviews ask you to apply the decision tree to multiple workloads and defend each pick.
Question. For each workload, walk through the decision tree and recommend a schema with a one-sentence rationale: (a) a SaaS product analytics warehouse serving Looker dashboards to 800 product managers, (b) a bank's regulatory reporting warehouse generating Basel-III risk reports, (c) a clickstream warehouse storing 100B events for ML feature engineering.
Input. Three workloads, three different priority profiles.
Code.
CREATE TABLE workload_recommendations AS
SELECT * FROM (VALUES
('SaaS product analytics', 'Q1: YES (sub-second BI is the priority)',
'star', 'Looker auto-joins; PMs self-serve; storage is not the constraint'),
('Bank regulatory reporting','Q1: NO; Q2: YES (deep audit hierarchies that change quarterly)',
'snowflake','normalised dims match Basel-III source-of-truth references; audit-friendly'),
('Clickstream feature store','Q1: NO; Q2: NO; Q3: YES (100B events × wide string dims = petabytes)',
'snowflake','normalising event_category + event_subcategory saves ~200 GB per partition')
) AS t(workload, decision_trace, recommendation, rationale);
Step-by-step explanation.
- Workload (a) — SaaS product analytics — Q1 is YES (interactive BI), so the tree short-circuits to star; the rationale is BI auto-join + PM self-serve.
- Workload (b) — bank regulatory reporting — Q1 is NO (overnight batch is fine), Q2 is YES (Basel-III hierarchies are deep and re-organised quarterly); tree resolves to snowflake.
- Workload (c) — clickstream feature store — Q1 NO (ML feature jobs are batch), Q2 NO (hierarchies are shallow), Q3 YES (petabyte scale with redundant string dims); tree resolves to snowflake for storage economics.
- Each recommendation has a one-sentence rationale rooted in the decision-tree branch; this is the answer shape interviewers expect.
- The recommendations are defensible not because they are universally right but because they trace a known framework.
Output (the recommendation table).
| workload | recommendation | rationale |
|---|---|---|
| SaaS product analytics | star | Looker auto-join + PM self-serve |
| Bank regulatory reporting | snowflake | normalised dims match audit hierarchy |
| Clickstream feature store | snowflake | petabyte storage savings on repeating dim strings |
Rule of thumb: the decision tree is short-circuit; the first YES wins. Practise tracing the tree on three workloads before any interview — the muscle memory makes the answer feel automatic.
Worked SQL — answering the same business question on both shapes
The question. "For Q1 2026, what's the top-5 revenue by category, sliced by customer region, for online (web + mobile) orders only?"
Star answer.
SELECT
p.category,
c.region AS customer_region,
SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_product p ON p.product_sk = f.product_sk
JOIN dim_customer c ON c.customer_sk = f.customer_sk
JOIN dim_channel ch ON ch.channel_sk = f.channel_sk
JOIN dim_date d ON d.date_sk = f.date_sk
WHERE d.year = 2026 AND d.quarter = 1
AND ch.channel_name IN ('web', 'mobile')
GROUP BY p.category, c.region
ORDER BY revenue DESC
LIMIT 5;
Snowflake answer.
SELECT
cat.category_name AS category,
g.region AS customer_region,
SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_product_sf p ON p.product_sk = f.product_sk
JOIN dim_category cat ON cat.category_sk = p.category_sk
JOIN dim_customer c ON c.customer_sk = f.customer_sk
JOIN dim_geography g ON g.geography_sk = c.geography_sk
JOIN dim_channel ch ON ch.channel_sk = f.channel_sk
JOIN dim_date d ON d.date_sk = f.date_sk
WHERE d.year = 2026 AND d.quarter = 1
AND ch.channel_name IN ('web', 'mobile')
GROUP BY cat.category_name, g.region
ORDER BY revenue DESC
LIMIT 5;
- Same business answer, same result rows, same ordering.
-
Star — 4 joins. Snowflake — 6 joins (two extra hops:
dim_categoryanddim_geography). -
Wall-clock on warm cache — star ~500 ms, snowflake ~750 ms on a
XSSnowflake warehouse against 8M Q1 rows. - The SQL is more readable on the snowflake in one specific way: the join paths are self-documenting (you can see the hierarchy) — at the cost of more typing.
SQL
Topic — data-modeling
Schema-choice practice
SQL
Topic — joins
Multi-join SQL drills
Solution Using a layered recommendation (snowflake-in, star-out)
Code.
-- Build the snowflake-in/star-out architecture as a single layered model.
-- Layer 1 — snowflake-shaped sub-dims (audit + storage win).
CREATE OR REPLACE VIEW v_dim_product_snowflake AS
SELECT p.product_sk, p.product_id, p.product_name,
cat.category_name, br.brand_name, sup.supplier_name
FROM dim_product_sf p
JOIN dim_category cat ON cat.category_sk = p.category_sk
JOIN dim_brand br ON br.brand_sk = p.brand_sk
JOIN dim_supplier sup ON sup.supplier_sk = br.supplier_sk;
-- Layer 2 — flatten to a star-shaped consumption dim (BI win).
CREATE TABLE dim_product_star AS
SELECT product_sk, product_id, product_name,
category_name AS category, brand_name AS brand, supplier_name AS supplier
FROM v_dim_product_snowflake;
-- Layer 3 — fact stays the same; both layers read the same fact.
-- BI tools point at dim_product_star; audit queries point at v_dim_product_snowflake.
Step-by-step trace.
| layer | shape | consumer | refresh cadence |
|---|---|---|---|
| Sub-dims (dim_category, dim_brand, dim_supplier) | snowflake | audit, finance | every load |
| v_dim_product_snowflake (view) | snowflake | audit queries | virtual (no refresh) |
| dim_product_star (table) | star | BI tools, Looker | every load (materialised) |
| fact_sales | unchanged | both layers | every load |
- Layer 1 — the sub-dims persist physically; they are the source of truth and survive audits.
- Layer 2 — a flattening view exposes the snowflake hierarchy as a single wide row; auditors prefer this over chasing FKs across multiple tables.
- Layer 3 — a materialised star dim is built from the flattening view and exposed to BI tools; query latency on the BI layer is identical to a pure star.
- The pattern is storage-efficient at the source + BI-friendly at the consumption layer — the best of both shapes.
- The DAG cost is one extra
CREATE TABLE ASper dim; in dbt, this is one extra model per dim.
Output (one-row sample of dim_product_star).
| product_sk | product_id | product_name | category | brand | supplier |
|---|---|---|---|---|---|
| 1001 | SKU-9981 | Acme Wireless Earbuds Pro | Electronics | Acme | AcmeCorp Ltd |
Why this works — concept by concept:
- Snowflake-in, star-out — the dominant 2026 pattern; storage savings at staging + BI auto-join at the mart.
- Materialised star dim — the consumption layer is physically denormalised so the BI tool sees a star; no query-time hops.
- Audit view — the sub-dim hierarchy stays accessible to auditors via a thin view, so the snowflake structure survives.
-
Single fact —
fact_salesis unchanged; both layers read the same fact, so storage on the fact is paid once. - Cost — one extra materialised dim per load; the storage cost is offset by the BI-layer query-latency win on every dashboard hit thereafter.
Choosing the right schema (cheat sheet)
A one-screen cheat sheet for star schema vs snowflake schema — pick the shape that matches your workload.
| You care most about … | Pick | Why |
|---|---|---|
| Sub-second BI dashboards | star | Fewer joins → faster; BI tools auto-generate single-join SQL |
| Self-serve analyst exploration | star | Tableau / Looker / Power BI users don't need to learn join paths |
| Petabyte-scale storage economics | snowflake | 20-40% smaller wide dims; surrogate keys replace repeating strings |
| Audit-friendly regulated reporting | snowflake | Normalised dims match source-of-truth chart of accounts |
| Deep hierarchies (4+ levels) that change quarterly | snowflake | A re-org updates one sub-dim row, not millions of dim rows |
| Simplest ETL DAG | star | Fewer dim models, simpler orchestration |
| Smallest dim storage footprint | snowflake | Normalisation eliminates redundant strings |
| Easiest onboarding for new analysts | star | Single-step joins map to mental model of "fact + dim" |
| Single source of truth for hierarchies | snowflake |
dim_category.category_name UNIQUE enforces uniqueness |
| Mixed workload (BI + audit) | hybrid (snowflake staging, star mart) | Snowflake-in, star-out is the 2026 default |
| Data Vault → mart pipeline | snowflake intermediate, star mart | Natural fit; hubs/links/satellites → snowflake → star |
| Conformed dim across many marts | either | Conformed dimensions are independent of star vs snowflake choice |
| SCD type 2 history | either | Both schemas handle SCD2 identically on the relevant dims |
| First-time warehouse build | star | Default safe choice; refactor later if needed |
| Multi-channel retailer fact | star |
dim_channel as conformed dim + sentinel store rows for non-physical |
| Clickstream event store with repeating event metadata | snowflake | Storage savings on dim_event are substantial |
Frequently asked questions
What is the difference between star schema and snowflake schema in one sentence?
A star schema has one fact table in the centre and a single layer of denormalised dimension tables around it — each dimension stores its hierarchy inline as columns, so every query reaches its data in one join per dimension. A snowflake schema has the same fact table and same primary dimensions, but each dimension is normalised into sub-dimension tables (e.g., dim_product → dim_category → dim_brand → dim_supplier), so analytical queries pay more joins per dimension hierarchy in exchange for less storage redundancy. The senior way to phrase the difference is "star denormalises for BI speed; snowflake normalises for storage and audit", and most production warehouses ship both shapes in different layers.
When should I choose star schema over snowflake schema?
Choose star schema when query latency is the #1 priority (interactive BI dashboards, self-serve analyst exploration, sub-second SLAs), when your BI tool (Tableau, Looker, Power BI, Mode, Hex) auto-generates SQL and you want zero LookML / relationship overhead, when storage cost is not a binding constraint, and when your team prefers a simpler ETL DAG with fewer dim models. The four-question decision tree from section 5 short-circuits: Q1 — is query latency the priority? YES → star, no further questions. As a default starting shape for a new warehouse, star wins ~80% of the time because the cost of refactoring star-to-snowflake later is smaller than the cost of forcing every analyst to learn join paths on day one. The exception is regulated industries (finance, healthcare, insurance) where audit hierarchies dictate the shape from day one.
When should I choose snowflake schema over star schema?
Choose snowflake schema when storage cost is a meaningful constraint (petabyte-scale warehouses where a 20-40% dim-storage saving translates to material dollars), when you have deep dimension hierarchies (4+ levels) that change often (regional re-orgs, product taxonomy rewrites), when regulated reporting requires the normalised structure to match a source-of-truth chart of accounts (Basel III, IFRS, GAAP, SOX, HIPAA), or when you are building the intermediate layer of a Data Vault → snowflake → star pipeline. The snowflake pays for itself in storage and audit-friendliness at the cost of query latency and BI-tool friction; on a Snowflake or BigQuery warehouse with broadcast joins on small sub-dims, the latency penalty is typically 20-50% — acceptable for batch and tolerable for most BI workloads, painful for sub-50-ms drilldowns.
What is a fact table vs a dimension table?
A fact table stores the measurable events of a business process — quantity, revenue, discount, unit_price — along with the foreign keys (customer_sk, product_sk, date_sk, store_sk) that point at the dimensions describing each event; the fact's grain is the declared "one row per X" contract (e.g., one row per order line, one row per shipment event, one row per page view). A dimension table stores the descriptive context of a business entity — customer_name, product_name, category, region, manager_name — and acts as the slice-and-dice surface for analytical queries; dimensions are reached by joining on the surrogate key (customer_sk, product_sk). The mnemonic is "facts are numbers you sum; dimensions are strings you group by" — SUM(revenue) GROUP BY category is SUM(fact column) GROUP BY dim column. Every well-modelled warehouse has one fact per business process and one set of conformed dimensions reused across all facts.
What are conformed dimensions and slowly changing dimensions (SCD)?
Conformed dimensions are dimension tables that are shared across multiple fact tables — one dim_customer is used by fact_sales, fact_support, and fact_marketing so that cross-mart reporting (revenue + tickets + campaign attribution per customer) joins to the same customer rows everywhere; this is the single biggest reuse lever in a warehouse and the strongest senior signal in a dimensional modeling answer. Slowly changing dimensions (SCD) describe how dimension attributes change over time and how the warehouse preserves history: SCD type 1 overwrites the old value (no history); SCD type 2 versions the dimension row with effective_from, effective_to, and is_current columns so each historical fact joins to the dimension as it was at the time of the event; SCD type 3 keeps a current and a previous column (limited history). The senior interview answer is "dim_product is SCD type 2 because product attributes change and historical revenue reports must reflect the product hierarchy at the time of sale".
Is snowflake schema the same as the Snowflake data warehouse?
No — they share a name but are completely separate concepts. The snowflake schema (lowercase, dimensional-modeling concept) is the normalised dimension-table design pattern this guide compares to star; it was named in the 1990s by Ralph Kimball because the radial diagram of a normalised dim with sub-dimensions resembles a snowflake crystal. The Snowflake data warehouse (capital S, the company / product) is a cloud-native columnar warehouse vendor (Snowflake Inc., ticker SNOW) that runs on AWS, GCP, and Azure and competes with BigQuery, Databricks SQL, Redshift, and Synapse. Confusingly, the Snowflake warehouse supports both schema shapes — you can build a star schema or a snowflake schema on the Snowflake warehouse, and many production teams do exactly that (snowflake-in at staging, star-out at the mart). Interview tip: when an interviewer asks about "snowflake", clarify which one in your first sentence.
Practice on PipeCode
PipeCode ships 450+ data-engineering interview problems — including SQL + Python drills keyed to the same star schema vs snowflake schema skill set this guide teaches (fact + dim joins, surrogate key handling, aggregate parity across normalised vs denormalised dims, SCD type 2 effective-date joins, conformed-dim reuse, and the snowflake-in / star-out architecture pattern). Whether you're prepping for a senior dimensional modeling screen the night before or grinding the fact table + dimension table + grain + conformed dimensions + SCD loop over months, the practice library mirrors the same shapes, decision-tree thinking, and trade-off vocabulary interviewers expect.
Kick off via Explore practice →; drill the SQL practice lane →; fan out into the data-modeling lane →; rehearse joins drills →; reinforce aggregation patterns →; widen coverage on the full Python practice library →.





Top comments (0)