DEV Community

Cover image for Star Schema vs Snowflake Schema: Dimensional Modeling for Data Engineering
Gowtham Potureddi
Gowtham Potureddi

Posted on

Star Schema vs Snowflake Schema: Dimensional Modeling for Data Engineering

star schema vs snowflake schema is the single most-asked dimensional modeling question on a data-engineering interview loop, because the answer touches every layer of the warehouse — fact table design, dimension table shape, grain declaration, conformed dimensions, SCD (slowly changing dimension) handling, query latency, ETL load complexity, storage cost, and BI tool fit. A senior interviewer is not asking which schema is better; they are asking whether you can map a workload onto a schema, name the five-dimension trade-off out loud, and justify the choice with a decision tree — the exact shape this deep-dive walks through, end to end.

This guide covers the topic at five teaching depths — anatomy of a star schema (one fact, denormalised dimensions, single-step joins), anatomy of a snowflake schema (normalised dimensions, branching sub-dimensions, multi-step joins), the five-dimension comparison (query speed, ETL complexity, storage cost, BI-tool fit, best-for workloads), the decision matrix (when to pick which, with worked SQL on both shapes), and a tight cheat sheet that fits on a single screen — followed by six FAQs that vary the keyword cluster so a senior loop's "explain it differently" follow-ups all have a clean answer.

PipeCode blog header for a deep-dive comparison of star schema vs snowflake schema — bold white headline 'Star vs Snowflake Schema' with subtitle 'Dimensional Modeling for Data Engineering' and two stylised mini-schemas side-by-side (a flat star on the left and a normalised snowflake on the right) on a dark gradient with purple, orange, green, and blue accents and a small pipecode.ai attribution.

When you want hands-on reps immediately after reading, browse the SQL practice library →, drill joins problems →, sharpen aggregation reps →, reinforce database problems →, rehearse data-modeling problems →, or widen coverage on the full Python practice library →.


On this page


1. Why dimensional modeling is its own interview track

dimensional modeling — a distinct discipline from OLTP design and raw SQL

The one-sentence invariant: dimensional modeling is a distinct discipline because the shapes it optimises for — fact table + surrounding dimension table arms, declared grain, conformed dimensions across marts, and SCD type 2 history — make analytical queries (aggregate, slice-by-dim, time-series) one to two orders of magnitude faster than the same query against a 3NF OLTP schema, and the design decisions that buy that speed (denormalisation, surrogate keys, late-binding dimensions, slowly-changing dimension policy) are workload-shaped, not *form-shaped*. An interviewer is not testing whether you can write a JOIN — they are testing whether you can think in facts, dimensions, grain, and tradeoffs while they listen.

What interviewers actually score on star schema vs snowflake schema questions.

  • Definition fluency — can you, in 30 seconds, define fact table, dimension table, grain, conformed dimension, SCD type 2, star schema, and snowflake schema without notes?
  • Shape comparison — can you draw both schemas on a whiteboard and explain why the snowflake "branches" while the star is "flat"?
  • Trade-off articulation — can you name the five dimensions of trade-off (query speed, ETL complexity, storage cost, BI tool fit, best for) and the verdict for each side?
  • Decision-tree thinking — given a workload (Tableau dashboard, regulated reporting, petabyte clickstream, data vault → mart), can you pick a schema and justify with two sentences?
  • SQL fluency on both — can you write the same business question as a star query (one JOIN per dim) and a snowflake query (multi-step JOIN chain) and read off the cost difference?
  • SCD literacy — can you describe type 1 (overwrite), type 2 (versioned row + effective dates), and type 3 (versioned column) and name which star vs snowflake handles them more cleanly?

The 5-section map this guide walks through.

  • Section 1 — Why dimensional modeling is its own interview track — the scope, the taxonomy of facts / dimensions / grain, and the four senior signals.
  • Section 2 — star schema anatomy — one fact at the centre, denormalised dimensions in a radial pattern, single-step joins for every analytical query.
  • Section 3 — snowflake schema anatomy — same fact, but each dimension is normalised into sub-dimensions; storage falls and join cost rises.
  • Section 4 — The five-dimension trade-offquery speed, ETL complexity, storage cost, BI tool fit, best for; the matrix interviewers expect you to recite.
  • Section 5 — The decision matrix — four-question decision tree with worked SQL on both shapes so you can defend the verdict.

Why this is its own interview track and not a SQL round.

  • dimensional modeling is not OLTP design — the system under design is analytical, not transactional; the shape that optimises for OLAP is the opposite of the shape that optimises for OLTP.
  • The choices are shape-binding — choosing star vs snowflake locks ETL complexity, query latency, and BI-tool integration for years; a wrong choice is a multi-quarter refactor, not a one-day fix.
  • grain is the most-missed concept — every fact table has a declared grain (e.g., "one row per order line"); without it, every aggregate query is a guess.
  • conformed dimensions are the senior signal — a junior describes a single mart's star; a senior describes a dim_customer that is shared across fact_sales, fact_support, and fact_marketing so all three marts roll up consistently.
  • SCD type 2 is the discipline gate — a slowly-changing dimension without effective dates is the bug that makes historical reports lie; the senior signal is naming the SCD policy before the shape question.

Worked example — translate one OLTP table into both a star fact + dim and a snowflake fact + dim chain

Detailed explanation. Real interviews probe whether you can translate the same OLTP source onto both shapes and read off the structural differences. Below is the canonical translation: a single source orders_oltp table is reshaped into (a) a star with dim_product denormalised and (b) a snowflake with dim_product normalised into dim_category and dim_brand.

Question. Given a source OLTP orders_oltp table containing order_id, customer_id, product_id, product_name, category_name, brand_name, order_ts, quantity, unit_price, design (a) the equivalent star schema and (b) the equivalent snowflake schema, declaring the grain of the fact table and identifying which dimension columns move into sub-dimensions in the snowflake.

Input. One OLTP table, 10M rows. Each row is one order line; an order_id can repeat across rows if an order has multiple line items.

Code.

-- (a) STAR — one fact + four denormalised dims; product hierarchy is INLINE on dim_product.
CREATE TABLE fact_sales (
    sales_sk       BIGINT PRIMARY KEY,            -- surrogate key (grain anchor)
    customer_sk    BIGINT NOT NULL REFERENCES dim_customer,
    product_sk     BIGINT NOT NULL REFERENCES dim_product,
    date_sk        INT    NOT NULL REFERENCES dim_date,
    store_sk       INT    NOT NULL REFERENCES dim_store,
    quantity       INT    NOT NULL,
    unit_price     NUMERIC(18,4) NOT NULL,
    revenue        NUMERIC(18,4) NOT NULL          -- measure: quantity * unit_price
);
-- Declared grain: one row = one order LINE.

CREATE TABLE dim_product (
    product_sk     BIGINT PRIMARY KEY,
    product_id     VARCHAR(64) NOT NULL,           -- natural key (from OLTP)
    product_name   VARCHAR(256) NOT NULL,
    category_name  VARCHAR(128) NOT NULL,          -- denormalised hierarchy
    brand_name     VARCHAR(128) NOT NULL,          -- denormalised hierarchy
    supplier_name  VARCHAR(128) NOT NULL,          -- denormalised hierarchy
    effective_from DATE NOT NULL,                  -- SCD type 2
    effective_to   DATE,
    is_current     BOOLEAN NOT NULL
);

-- (b) SNOWFLAKE — same fact, but dim_product is normalised into dim_category + dim_brand.
CREATE TABLE dim_product_sf (
    product_sk     BIGINT PRIMARY KEY,
    product_id     VARCHAR(64) NOT NULL,
    product_name   VARCHAR(256) NOT NULL,
    category_sk    BIGINT NOT NULL REFERENCES dim_category,
    brand_sk       BIGINT NOT NULL REFERENCES dim_brand,
    effective_from DATE NOT NULL,
    effective_to   DATE,
    is_current     BOOLEAN NOT NULL
);

CREATE TABLE dim_category (
    category_sk    BIGINT PRIMARY KEY,
    category_name  VARCHAR(128) NOT NULL UNIQUE
);

CREATE TABLE dim_brand (
    brand_sk       BIGINT PRIMARY KEY,
    brand_name     VARCHAR(128) NOT NULL UNIQUE,
    supplier_sk    BIGINT REFERENCES dim_supplier  -- snowflake can branch further
);
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The grain of fact_sales is declared as one row per order line — every aggregate downstream (revenue per region, AOV per category) reads from this grain.
  2. The star keeps category_name, brand_name, supplier_name inline on dim_product; one JOIN from fact to dim returns everything needed for a sliced-by-category report.
  3. The snowflake lifts those columns into dim_category and dim_brand (and dim_brand further references dim_supplier), eliminating redundancy at the cost of 2-3 extra joins per query.
  4. Both shapes use surrogate keys (product_sk) on the fact, not the natural OLTP product_id; this insulates the warehouse from upstream source-system key changes and is required for SCD type 2 versioning.
  5. SCD type 2 columns (effective_from, effective_to, is_current) live on dim_product in the star and on dim_product_sf in the snowflake; the SCD policy is identical, but the snowflake spreads the impact across the sub-dimensions only when they version too.

Output (counts of tables involved per analytical query for "revenue by category, last 30 days").

schema tables joined join steps typical query latency
Star 2 (fact_sales + dim_product) 1 ~150 ms on 10M rows
Snowflake 3 (fact_sales + dim_product_sf + dim_category) 2 ~280 ms on 10M rows

Rule of thumb: the snowflake adds one join per normalised hierarchy level. Two levels = roughly 2× the join cost; under cache + columnar storage the runtime gap narrows but never closes.

star schema vs snowflake schema — the four senior signals

Signal 1 — opinionated trade-off framing. Senior data engineers do not say "both schemas are fine"; they say "star for dashboards because Tableau and Looker auto-generate single-join SQL against it, snowflake for regulated finance reporting because the normalised sub-dimensions match the source-of-truth chart of accounts and survive audits."

Signal 2 — grain declared up front. Junior modellers describe tables; senior modellers describe grain. The first sentence of any fact-table answer is "the grain of this fact is one row per …"; without that, every downstream SUM is a guess.

Signal 3 — conformed dimensions over per-mart re-modelling. Senior teams ship one dim_customer shared across fact_sales, fact_support, and fact_marketing; the dimension is conformed once and reused, so cross-mart reporting (revenue + tickets + campaign attribution per customer) is a single, trustworthy join.

Signal 4 — SCD policy is a first-class decision. Senior data engineers state SCD policy before shape; "dim_product is SCD type 2 with effective_from/effective_to/is_current" comes out of their mouth in the first 60 seconds, because that policy is what makes historical re-runs reproducible.

SQL
Topic — data-modeling
Data modeling drills

Practice →

SQL
Topic — database
Database design practice

Practice →

Solution Using a fact-and-dimension catalogue table

Code.

-- One canonical catalogue table — every row maps a table to its role, grain, and SCD policy.
CREATE TABLE warehouse_catalogue AS
SELECT * FROM (VALUES
    ('fact_sales',     'fact',      'one row per order line',          'star',      'N/A'),
    ('fact_sales_sf',  'fact',      'one row per order line',          'snowflake', 'N/A'),
    ('dim_customer',   'dimension', 'one row per customer',            'conformed', 'SCD type 2'),
    ('dim_product',    'dimension', 'one row per product version',     'star',      'SCD type 2'),
    ('dim_product_sf', 'dimension', 'one row per product version',     'snowflake', 'SCD type 2'),
    ('dim_category',   'dimension', 'one row per category',            'snowflake', 'SCD type 1'),
    ('dim_brand',      'dimension', 'one row per brand',               'snowflake', 'SCD type 1'),
    ('dim_date',       'dimension', 'one row per calendar day',        'conformed', 'static'),
    ('dim_store',      'dimension', 'one row per store version',       'conformed', 'SCD type 2')
) AS t(table_name, role, grain, schema_shape, scd_policy);
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

table_name role grain schema_shape scd_policy
fact_sales fact one row per order line star N/A
fact_sales_sf fact one row per order line snowflake N/A
dim_customer dimension one row per customer conformed SCD type 2
dim_product dimension one row per product version star SCD type 2
dim_product_sf dimension one row per product version snowflake SCD type 2
dim_category dimension one row per category snowflake SCD type 1
dim_brand dimension one row per brand snowflake SCD type 1
dim_date dimension one row per calendar day conformed static
dim_store dimension one row per store version conformed SCD type 2
  1. Rows 1-2 — the two fact variants share the same grain; only the surrounding dimension shape differs.
  2. Row 3 — dim_customer is conformed across multiple marts; this is the single biggest reuse lever in a warehouse.
  3. Rows 4-5 — the same product dimension exists in two shapes; the snowflake version is normalised but the SCD policy is identical.
  4. Rows 6-7 — dim_category and dim_brand are the sub-dimensions that distinguish snowflake from star; in a star, they would be columns on dim_product.
  5. Row 8 — dim_date is static (no SCD); the calendar does not version.
  6. Row 9 — dim_store is SCD type 2 because store ownership and address change over time, and historical reports must reflect the store-as-of-the-transaction.

Output.

table_name role schema_shape scd_policy
fact_sales fact star N/A
dim_customer dimension conformed SCD type 2
dim_product dimension star SCD type 2
dim_category dimension snowflake SCD type 1
dim_date dimension conformed static

Why this works — concept by concept:

  • Catalogue as artefact — turns the design into a queryable table; reviewers can WHERE role = 'fact' and audit grain declarations in one query.
  • Grain column — every table has its grain explicit and checked into git; the catalogue makes "what is the grain of fact_sales?" a SQL lookup, not a tribal-knowledge question.
  • schema_shape enumstar / snowflake / conformed makes the shape decision auditable; conformed dimensions are explicit, not implicit.
  • SCD policy as a columnSCD type 1 / SCD type 2 / static is the single most-skipped column in junior catalogues; senior teams treat it as load-bearing metadata.
  • CostO(1) to read the catalogue; the actual schema lives in information_schema and dbt manifests, but the intent lives here.

2. Star schema anatomy — fact + denormalised dimensions + single-step joins

Visual diagram of star schema anatomy — a central fact_sales table card with four foreign-key columns and three measure columns; four dimension cards (dim_customer, dim_product, dim_date, dim_store) arranged around it in a star pattern, each connected by a single thin arrow showing the FK relationship; a small grain chip 'one row = one order line'; on a light PipeCode card.

star schema — one fact at the centre, four denormalised dimensions, single-step joins

star schema is the canonical analytical shape: one fact table at the centre holding measures (quantity, revenue, discount) and foreign keys (customer_sk, product_sk, date_sk, store_sk); every surrounding dimension table holds the descriptive attributes of one business entity in a single, denormalised table — no further sub-dimensions, no normalisation. Every analytical query reaches its data in one join per dimension; that single-step join shape is what every modern BI tool (Tableau, Looker, Power BI, Mode, Hex) auto-generates SQL against.

The four anatomy rules of a star schema.

  • Rule 1 — one fact table per processfact_sales, fact_returns, fact_inventory are separate facts; do not jam two processes into one fact table.
  • Rule 2 — grain is declared and uniform — every row in the fact has the same grain; "one row per order line" is a declared contract, never an assumption.
  • Rule 3 — dimension tables are *denormalised* — dim_product holds category, brand, and supplier as columns, not as foreign keys; the hierarchy lives inline.
  • Rule 4 — surrogate keys everywhere — fact-to-dim joins use product_sk (a BIGINT generated by the warehouse), never the natural product_id; this enables SCD type 2 and insulates against upstream source-system key changes.

The canonical four-dimension star.

CREATE TABLE fact_sales (
    sales_sk     BIGINT PRIMARY KEY,
    customer_sk  BIGINT NOT NULL REFERENCES dim_customer,
    product_sk   BIGINT NOT NULL REFERENCES dim_product,
    date_sk      INT    NOT NULL REFERENCES dim_date,
    store_sk     INT    NOT NULL REFERENCES dim_store,
    quantity     INT    NOT NULL,
    unit_price   NUMERIC(18,4) NOT NULL,
    discount     NUMERIC(18,4) NOT NULL DEFAULT 0,
    revenue      NUMERIC(18,4) NOT NULL
);

CREATE TABLE dim_customer (
    customer_sk    BIGINT PRIMARY KEY,
    customer_id    VARCHAR(64) NOT NULL,
    customer_name  VARCHAR(256) NOT NULL,
    segment        VARCHAR(64),
    city           VARCHAR(128),
    region         VARCHAR(64),
    country        VARCHAR(64),
    signup_date    DATE,
    effective_from DATE NOT NULL,
    effective_to   DATE,
    is_current     BOOLEAN NOT NULL
);

CREATE TABLE dim_product (
    product_sk     BIGINT PRIMARY KEY,
    product_id     VARCHAR(64) NOT NULL,
    product_name   VARCHAR(256) NOT NULL,
    category       VARCHAR(128) NOT NULL,
    sub_category   VARCHAR(128),
    brand          VARCHAR(128) NOT NULL,
    supplier       VARCHAR(128) NOT NULL,
    effective_from DATE NOT NULL,
    effective_to   DATE,
    is_current     BOOLEAN NOT NULL
);

CREATE TABLE dim_date (
    date_sk     INT  PRIMARY KEY,            -- YYYYMMDD as INT
    date_value  DATE NOT NULL UNIQUE,
    day_of_week INT  NOT NULL,
    week        INT  NOT NULL,
    month       INT  NOT NULL,
    quarter     INT  NOT NULL,
    year        INT  NOT NULL,
    fiscal_yr   INT  NOT NULL,
    fiscal_qtr  INT  NOT NULL,
    is_weekend  BOOLEAN NOT NULL
);

CREATE TABLE dim_store (
    store_sk       INT PRIMARY KEY,
    store_id       VARCHAR(64) NOT NULL,
    store_name     VARCHAR(256),
    city           VARCHAR(128),
    region         VARCHAR(64),
    country        VARCHAR(64),
    manager_name   VARCHAR(256),
    effective_from DATE NOT NULL,
    effective_to   DATE,
    is_current     BOOLEAN NOT NULL
);
Enter fullscreen mode Exit fullscreen mode
  • fact_sales — measures (quantity, unit_price, discount, revenue) plus four _sk foreign keys; nothing else.
  • dim_customer — descriptive attributes of a customer, including geography inline; no dim_geography sub-dimension.
  • dim_product — hierarchy (category, sub_category, brand, supplier) is denormalised as columns; no dim_category.
  • dim_date — pre-populated calendar dimension with every day of the past + future N years; date_sk is the integer form YYYYMMDD so range scans (date_sk BETWEEN 20240101 AND 20240131) are index-friendly.
  • dim_store — store attributes with SCD type 2 versioning so historical reports show the manager and address as of the transaction.

The canonical star schema query — revenue by category, last 30 days.

SELECT
    p.category,
    SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_product p ON p.product_sk = f.product_sk
JOIN dim_date    d ON d.date_sk    = f.date_sk
WHERE d.date_value >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY p.category
ORDER BY revenue DESC;
Enter fullscreen mode Exit fullscreen mode
  • Two joins — fact to dim_product, fact to dim_date; both single-step.
  • p.category lives inline on dim_product; no sub-dimension hop.
  • d.date_value filter uses the denormalised date column; the integer date_sk is the join key.
  • Plan — one hash join per dimension; columnar warehouses (Snowflake, BigQuery, Redshift) cache the dimension scans and run the fact aggregate in parallel; sub-second on 100M-row facts.

Worked example — design the star for a multi-channel retailer

Detailed explanation. A typical interview prompt is "design a star schema for a multi-channel retailer (web + store + mobile)". Below is the canonical answer, with grain declared up front and dim_channel introduced as a conformed dimension.

Question. A retailer sells through web, brick-and-mortar stores, and a mobile app. Design a star schema for fact_sales that captures (a) the order channel, (b) the customer, (c) the product, (d) the date, and (e) the store (which is 'web' or 'mobile' for non-physical channels). Declare the grain.

Input. Source OLTP feed: one row per order line, with order_id, customer_id, product_id, order_ts, channel, store_id (null for web/mobile), quantity, unit_price, discount.

Code.

-- Grain: one row per order LINE (not per order). An order with 3 line items contributes 3 fact rows.

CREATE TABLE fact_sales (
    sales_sk     BIGINT PRIMARY KEY,
    order_id     VARCHAR(64) NOT NULL,            -- degenerate dimension (lives on fact)
    customer_sk  BIGINT NOT NULL REFERENCES dim_customer,
    product_sk   BIGINT NOT NULL REFERENCES dim_product,
    date_sk      INT    NOT NULL REFERENCES dim_date,
    store_sk     INT    NOT NULL REFERENCES dim_store,
    channel_sk   INT    NOT NULL REFERENCES dim_channel,
    quantity     INT    NOT NULL,
    unit_price   NUMERIC(18,4) NOT NULL,
    discount     NUMERIC(18,4) NOT NULL DEFAULT 0,
    revenue      NUMERIC(18,4) NOT NULL            -- (unit_price * quantity) - discount
);

CREATE TABLE dim_channel (
    channel_sk   INT PRIMARY KEY,
    channel_name VARCHAR(64) NOT NULL UNIQUE       -- 'web' | 'store' | 'mobile'
);

-- dim_store has a sentinel row for web + mobile so the FK is never NULL.
INSERT INTO dim_store (store_sk, store_id, store_name, is_current)
VALUES
    (-1, 'WEB',    'Web (non-physical)',    TRUE),
    (-2, 'MOBILE', 'Mobile (non-physical)', TRUE);
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Grain declared firstone row per order line; an order with three line items creates three fact rows. This grain is what makes SUM(revenue) GROUP BY product correct.
  2. order_id on the fact is a degenerate dimension — a dimension that has no other attributes worth a separate table; it lives as a column on the fact.
  3. dim_channel is its own conformed dimension because channel joins to fact_marketing, fact_returns, and fact_support as well — three separate fact tables that should all use the same channel_sk.
  4. dim_store sentinel rows — web and mobile orders use store_sk = -1 and -2; this preserves NOT NULL on the FK and makes "all-channel" rollups one GROUP BY channel_name away.
  5. revenue is *pre-computed* on the fact — (unit_price * quantity) - discount is stored, not computed at query time; this trades one numeric column of storage for a 100× speedup on aggregate queries.

Output (typical 1-day load profile).

process source_rows fact_rows dimensions_updated
Daily sales load 5,200,000 orders 8,800,000 lines dim_customer (+800 new), dim_product (+120 new)

Rule of thumb: if a fact aggregate could ever return wrong numbers because of grain ambiguity, the grain is undeclared. Declare it once, check it in CI with a COUNT(*) = COUNT(DISTINCT grain_key_combo) test, and never let it drift.

star schema — the four senior nuances

  • Degenerate dimensionsorder_id and invoice_number belong on the fact as columns, not as a one-column dim table.
  • Junk dimensions — combine 3-5 low-cardinality flags (is_promo, is_first_order, is_returning_customer) into one dim_order_flags rather than four separate dims.
  • Role-playing dimensionsdim_date joined as order_date_sk, ship_date_sk, delivery_date_sk is one underlying dim played three roles; alias the join.
  • Slowly-changing dimensions — every dimension that can have its descriptive attributes change and you need historical accuracy on must be SCD type 2; the rest can be SCD type 1 (overwrite).

SQL
Topic — joins
Star-schema join practice

Practice →

SQL
Topic — aggregation
Aggregation drills

Practice →

Solution Using a single-join-per-dimension star query

Code.

-- The canonical star query: one JOIN per dimension, single-pass aggregate.
SELECT
    p.category,
    p.brand,
    c.region        AS customer_region,
    s.region        AS store_region,
    ch.channel_name,
    d.year,
    d.quarter,
    SUM(f.quantity)               AS units,
    SUM(f.revenue)                AS revenue,
    AVG(f.unit_price)             AS avg_unit_price,
    SUM(f.revenue) / NULLIF(SUM(f.quantity), 0) AS effective_price
FROM fact_sales      f
JOIN dim_product     p  ON p.product_sk  = f.product_sk
JOIN dim_customer    c  ON c.customer_sk = f.customer_sk
JOIN dim_store       s  ON s.store_sk    = f.store_sk
JOIN dim_channel     ch ON ch.channel_sk = f.channel_sk
JOIN dim_date        d  ON d.date_sk     = f.date_sk
WHERE d.year = 2026 AND d.quarter = 1
GROUP BY p.category, p.brand, c.region, s.region, ch.channel_name, d.year, d.quarter
ORDER BY revenue DESC
LIMIT 50;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

step operation rows in rows out
1 Scan fact_sales partition for Q1 2026 8,800,000 (annual) 2,150,000 (Q1)
2 Hash-join dim_product (~50K rows) 2,150,000 2,150,000
3 Hash-join dim_customer (~1.2M rows) 2,150,000 2,150,000
4 Hash-join dim_store (~300 rows) 2,150,000 2,150,000
5 Hash-join dim_channel (3 rows, broadcast) 2,150,000 2,150,000
6 Hash-join dim_date (~5K rows) 2,150,000 2,150,000
7 Group + aggregate 2,150,000 ~12,000 distinct combos
8 Order + limit 12,000 50
  1. Step 1 partition-prunes the fact to one quarter; the warehouse skips ~75% of the data without reading it.
  2. Steps 2-6 hash-join each dimension; dimensions are small enough to broadcast (replicate to every executor), so no shuffle is required.
  3. Step 7 performs the aggregate on the joined row set; columnar warehouses execute this in parallel across slots.
  4. Step 8 sorts the small aggregated result; latency is dominated by step 1 + step 7.
  5. Total wall-clock on Snowflake XS warehouse: ~600 ms on 8M rows.

Output (sample).

category brand customer_region store_region channel_name year quarter units revenue avg_unit_price effective_price
Electronics Acme NA NA web 2026 1 42,300 9,820,500.00 240.50 232.16
Electronics Acme EU EU web 2026 1 31,400 7,612,000.00 248.10 242.42
Apparel Beta NA NA store 2026 1 88,200 5,210,700.00 62.40 59.08

Why this works — concept by concept:

  • Single join per dim — each dimension is reached in exactly one hop; the optimiser builds one hash table per dim and probes the fact once.
  • Broadcast joins on small dimsdim_channel (3 rows), dim_store (300 rows), and dim_date (~5K rows) are broadcast; no shuffle cost.
  • Pre-computed revenue — the fact stores revenue directly; SUM(f.revenue) is one column read, not SUM((unit_price * quantity) - discount) re-derived per row.
  • NULLIF guardeffective_price = revenue / NULLIF(quantity, 0) protects against divide-by-zero on zero-quantity returns.
  • CostO(N) over the fact scan + O(N + D) per hash join where N is fact rows and D is dimension rows; on modern columnar warehouses the practical cost is dominated by the fact scan, not the joins.

3. Snowflake schema anatomy — normalised dimensions + branching sub-dimensions

Visual diagram of snowflake schema anatomy — a central fact_sales card with four foreign-key columns; four primary dimension cards around it, each branching into one or two normalised sub-dimension cards (dim_product → dim_category → dim_brand; dim_customer → dim_geography); arrows showing multi-step joins; a small 3NF chip and a 'storage save' chip; on a light PipeCode card.

snowflake schema — normalised dimensions, branching sub-dimensions, multi-step joins

snowflake schema is the same fact_sales at the centre, but each dimension table is normalised — typically to 3NF (third normal form) — so that hierarchies (category → brand → supplier; city → region → country) live in their own sub-dimension tables, connected by foreign keys. The result is less storage (no repeated category names across millions of products), more join steps per query (two or three hops instead of one), and a shape that matches audit-friendly source-of-truth references.

The four anatomy rules of a snowflake schema.

  • Rule 1 — same fact table shape as a star — the fact does not change; only the dimensions normalise.
  • Rule 2 — each hierarchy level becomes its own tabledim_product → dim_category → dim_brand → dim_supplier; one table per level.
  • Rule 3 — sub-dimensions enforce uniquenessdim_category.category_name is UNIQUE; a single source of truth for category names.
  • Rule 4 — query SQL is multi-join — any analytical query that slices by category joins fact_sales → dim_product → dim_category (two hops).

The canonical four-dimension snowflake.

CREATE TABLE fact_sales (
    sales_sk     BIGINT PRIMARY KEY,
    customer_sk  BIGINT NOT NULL REFERENCES dim_customer,
    product_sk   BIGINT NOT NULL REFERENCES dim_product,
    date_sk      INT    NOT NULL REFERENCES dim_date,
    store_sk     INT    NOT NULL REFERENCES dim_store,
    quantity     INT    NOT NULL,
    unit_price   NUMERIC(18,4) NOT NULL,
    discount     NUMERIC(18,4) NOT NULL DEFAULT 0,
    revenue      NUMERIC(18,4) NOT NULL
);

-- Product is normalised: dim_product → dim_category → dim_brand → dim_supplier.
CREATE TABLE dim_product (
    product_sk    BIGINT PRIMARY KEY,
    product_id    VARCHAR(64) NOT NULL,
    product_name  VARCHAR(256) NOT NULL,
    category_sk   BIGINT NOT NULL REFERENCES dim_category,
    brand_sk      BIGINT NOT NULL REFERENCES dim_brand,
    effective_from DATE NOT NULL, effective_to DATE, is_current BOOLEAN NOT NULL
);

CREATE TABLE dim_category (
    category_sk   BIGINT PRIMARY KEY,
    category_name VARCHAR(128) NOT NULL UNIQUE,
    parent_category_sk BIGINT REFERENCES dim_category  -- self-reference for sub-categories
);

CREATE TABLE dim_brand (
    brand_sk      BIGINT PRIMARY KEY,
    brand_name    VARCHAR(128) NOT NULL UNIQUE,
    supplier_sk   BIGINT NOT NULL REFERENCES dim_supplier
);

CREATE TABLE dim_supplier (
    supplier_sk   BIGINT PRIMARY KEY,
    supplier_name VARCHAR(128) NOT NULL UNIQUE,
    country       VARCHAR(64)
);

-- Customer is normalised: dim_customer → dim_geography.
CREATE TABLE dim_customer (
    customer_sk    BIGINT PRIMARY KEY,
    customer_id    VARCHAR(64) NOT NULL,
    customer_name  VARCHAR(256) NOT NULL,
    segment        VARCHAR(64),
    geography_sk   BIGINT NOT NULL REFERENCES dim_geography,
    signup_date    DATE,
    effective_from DATE NOT NULL, effective_to DATE, is_current BOOLEAN NOT NULL
);

CREATE TABLE dim_geography (
    geography_sk BIGINT PRIMARY KEY,
    city         VARCHAR(128) NOT NULL,
    region       VARCHAR(64)  NOT NULL,
    country      VARCHAR(64)  NOT NULL,
    UNIQUE (city, region, country)
);
Enter fullscreen mode Exit fullscreen mode
  • dim_product no longer carries category_name or brand_name — those live on the sub-dimensions and are reached via category_sk and brand_sk.
  • dim_category has a self-reference (parent_category_sk) so sub-categories link to parent categories without a sixth table.
  • dim_brand → dim_supplier — a brand belongs to one supplier; this hierarchy is enforced by FK rather than denormalised.
  • dim_geography centralises city / region / country; if 1.2M customers all live in 30K unique geographies, the storage saving is significant (~40 GB on a wide string-heavy customer table down to ~12 GB).

The canonical snowflake schema query — revenue by category, last 30 days.

SELECT
    c.category_name,
    SUM(f.revenue) AS revenue
FROM fact_sales       f
JOIN dim_product      p ON p.product_sk  = f.product_sk
JOIN dim_category     c ON c.category_sk = p.category_sk    -- the extra hop
JOIN dim_date         d ON d.date_sk     = f.date_sk
WHERE d.date_value >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY c.category_name
ORDER BY revenue DESC;
Enter fullscreen mode Exit fullscreen mode
  • Three joins — fact to dim_product, dim_product to dim_category, fact to dim_date; the extra hop is dim_product → dim_category.
  • c.category_name is no longer inline on dim_product; the query must traverse the sub-dimension.
  • Plan — one extra hash-join step; on a 100M-row fact the extra hop adds ~50-150 ms depending on warehouse size.
  • BI toolsTableau and Looker can model this, but the user (or the LookML / Tableau-relationship layer) has to declare the join path; the auto-generated SQL is no longer single-step.

Worked example — refactor a star to a snowflake to save 35 GB on dim_customer

Detailed explanation. A common production trigger for a snowflake refactor is storage pressure on a wide dimension. Below is the canonical refactor: a 1.2M-row dim_customer with 90% repeated city/region/country strings is normalised into dim_customer + dim_geography, saving ~35 GB.

Question. Your star dim_customer is 80 GB on 1.2M rows because each row carries city VARCHAR(128) + region VARCHAR(64) + country VARCHAR(64) and the strings are repeated across customers in the same geography. Refactor to a snowflake with dim_geography, write the migration SQL, and quantify the storage saving.

Input. dim_customer 1.2M rows × ~256 bytes geography strings ≈ 300 MB; with row overhead, dictionary encoding, and indexes the on-disk size is ~80 GB. Distinct geographies: ~30,000.

Code.

-- Step 1 — extract unique geographies into a sub-dim.
CREATE TABLE dim_geography AS
SELECT
    ROW_NUMBER() OVER (ORDER BY country, region, city) AS geography_sk,
    city, region, country
FROM (
    SELECT DISTINCT city, region, country FROM dim_customer
) g;

-- Step 2 — rebuild dim_customer pointing at the new sub-dim.
CREATE TABLE dim_customer_new AS
SELECT
    c.customer_sk,
    c.customer_id,
    c.customer_name,
    c.segment,
    g.geography_sk,
    c.signup_date,
    c.effective_from, c.effective_to, c.is_current
FROM dim_customer c
JOIN dim_geography g
  ON g.city = c.city AND g.region = c.region AND g.country = c.country;

-- Step 3 — atomic swap (Snowflake-style).
ALTER TABLE dim_customer        RENAME TO dim_customer_old;
ALTER TABLE dim_customer_new    RENAME TO dim_customer;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Step 1 materialises the distinct geography tuples; 30K rows replace the 1.2M repeated strings.
  2. Step 2 joins each customer to its geography surrogate key; the new dim_customer carries geography_sk (a BIGINT, ~8 bytes) instead of ~256 bytes of strings.
  3. Step 3 swaps the table in one atomic DDL; downstream queries that pre-existed need a tiny adjustment to JOIN dim_geography whenever they need city / region / country.
  4. Storage math — 1.2M rows × (256 − 8) bytes ≈ 290 MB raw saved; with row overhead, dictionary, and indexes the on-disk saving compounds to ~35 GB.
  5. Query cost — every query that slices by region now adds one hash join; in practice this is < 50 ms on warm dimensions because dim_geography (30K rows) fits in L2 cache.

Output (storage profile before/after).

table shape row count on-disk size
dim_customer (before) star 1,200,000 ~80 GB
dim_customer (after) snowflake 1,200,000 ~45 GB
dim_geography (new) snowflake 30,000 ~3 MB
net saving ~35 GB

Rule of thumb: snowflake the dimensions whose hierarchy is both high-cardinality strings and heavily repeated. A 1.2M-row dim with 30K unique geographies is a clear win; a 1.2M-row dim with 1.1M unique geographies (almost no repetition) is not.

snowflake schema — the four senior nuances

  • Normal forms — most production snowflakes are 3NF (third normal form); going beyond 3NF rarely pays off because the extra join cost outweighs any storage win.
  • Bridge tables — when a fact-to-dim relationship is many-to-many (a single sales line covers two promotional offers), a bridge table with weighting_factor columns is the snowflake pattern.
  • Outrigger dimensions — a dim that references another dim (e.g., dim_employee → dim_manager); fine in moderation, but more than two levels of outriggers is a smell.
  • Mini-dimensions — for dim_customer with frequently-changing low-cardinality attributes (age band, income tier), split those into a dim_customer_profile so the main customer history stays small.

SQL
Topic — joins
Multi-join SQL practice

Practice →

SQL
Topic — database
Normalised-schema drills

Practice →

Solution Using a multi-hop snowflake query against the normalised dimensions

Code.

-- Snowflake equivalent of the star query — same business question, more joins.
SELECT
    cat.category_name,
    br.brand_name,
    g.region        AS customer_region,
    s.region        AS store_region,
    ch.channel_name,
    d.year,
    d.quarter,
    SUM(f.quantity) AS units,
    SUM(f.revenue)  AS revenue,
    SUM(f.revenue) / NULLIF(SUM(f.quantity), 0) AS effective_price
FROM fact_sales         f
JOIN dim_product        p   ON p.product_sk    = f.product_sk
JOIN dim_category       cat ON cat.category_sk = p.category_sk    -- sub-dim hop
JOIN dim_brand          br  ON br.brand_sk     = p.brand_sk       -- sub-dim hop
JOIN dim_customer       c   ON c.customer_sk   = f.customer_sk
JOIN dim_geography      g   ON g.geography_sk  = c.geography_sk   -- sub-dim hop
JOIN dim_store          s   ON s.store_sk      = f.store_sk
JOIN dim_channel        ch  ON ch.channel_sk   = f.channel_sk
JOIN dim_date           d   ON d.date_sk       = f.date_sk
WHERE d.year = 2026 AND d.quarter = 1
GROUP BY cat.category_name, br.brand_name, g.region, s.region, ch.channel_name, d.year, d.quarter
ORDER BY revenue DESC
LIMIT 50;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

step operation rows in rows out
1 Scan fact_sales Q1 2026 8,800,000 (annual) 2,150,000 (Q1)
2 Hash-join dim_product (~50K rows) 2,150,000 2,150,000
3 Hash-join dim_category (~1.2K rows, broadcast) 2,150,000 2,150,000
4 Hash-join dim_brand (~5K rows, broadcast) 2,150,000 2,150,000
5 Hash-join dim_customer (~1.2M rows) 2,150,000 2,150,000
6 Hash-join dim_geography (~30K rows, broadcast) 2,150,000 2,150,000
7 Hash-join dim_store (~300 rows, broadcast) 2,150,000 2,150,000
8 Hash-join dim_channel (3 rows, broadcast) 2,150,000 2,150,000
9 Hash-join dim_date (~5K rows, broadcast) 2,150,000 2,150,000
10 Group + aggregate 2,150,000 ~12,000
11 Order + limit 12,000 50
  1. Steps 1-2 are identical to the star (fact scan + dim_product join).
  2. Step 3 is the extra hop — dim_product → dim_category; broadcast because dim_category is tiny.
  3. Step 4 is another extra hop for the brand lookup.
  4. Step 6 is the geography hop on the customer side; broadcast because 30K rows fit in cache.
  5. The total wall-clock on Snowflake XS: ~900 ms — about 50% slower than the star (~600 ms) on the same data and same warehouse, even though all sub-dims are broadcast.

Output (sample).

category_name brand_name customer_region store_region channel_name year quarter units revenue effective_price
Electronics Acme NA NA web 2026 1 42,300 9,820,500.00 232.16
Electronics Acme EU EU web 2026 1 31,400 7,612,000.00 242.42
Apparel Beta NA NA store 2026 1 88,200 5,210,700.00 59.08

Why this works — concept by concept:

  • Multi-hop join chain — the snowflake forces fact → dim → sub-dim for every hierarchy slice; the SQL pays the extra join in exchange for normalised storage.
  • Broadcast joins on sub-dimsdim_category, dim_brand, dim_geography are small enough to broadcast; no shuffle cost on a modern columnar warehouse.
  • Same business answer — the result set is identical to the star query; only the SQL and the plan differ.
  • Latency tax — the extra hops cost ~30-50% more runtime in practice; for a sub-second dashboard query this is fine, for a 30-minute batch this is fine, for a 50-ms BI drilldown it is a problem.
  • CostO(N) over the fact scan + O(N + Dᵢ) per dim hop; cumulative cost scales linearly with hop count, which is why snowflakes with > 3 hops per query are slow in practice.

4. Star vs Snowflake — five-dimension trade-off (query speed, ETL, storage, BI fit, best for)

Side-by-side comparison card — five rows comparing star schema (left, green) vs snowflake schema (right, blue) on Query speed, ETL complexity, Storage cost, BI tool fit, and Best for; each row has a small icon and a one-line verdict for each side; on a light PipeCode card.

star schema vs snowflake schema — the five-dimension trade-off matrix

The five-dimension trade-off is the framework every senior dimensional modeling interviewer wants you to recite: query speed (joins per query), ETL complexity (load orchestration), storage cost (denormalised redundancy vs normalised reuse), BI tool fit (auto-generated SQL vs manual join paths), and best for (which workloads each schema wins at). Every senior fact table + dimension table discussion comes back to these five axes.

Dimension 1 — query speed.

  • Starfewer joins → faster. One join per dimension; columnar warehouses (Snowflake, BigQuery, Redshift, Databricks) hash-join one dim at a time; aggregate is single-pass.
  • Snowflakemore joins → slower on wide queries. Two or three joins per dimension hierarchy; broadcasts help small sub-dims but every extra hop adds optimiser work and cache pressure.
  • The empirical delta — on a 100M-row fact with 4 wide dimensions, the snowflake variant is typically 20-50% slower for a multi-dim slice query; for a single-dim slice the difference is < 10%.
  • The senior take — query speed matters most when the workload is interactive BI (sub-second dashboards); for batch / overnight reporting the difference is irrelevant.

Dimension 2 — ETL complexity.

  • Starheavier load on each dim, simpler shape. Building dim_product with denormalised category, brand, supplier means resolving each lookup once per load and writing the wide row; simpler orchestration (one dim table per business entity).
  • Snowflakelighter load per dim, more orchestration. Each sub-dim is updated independently; the load DAG has more nodes (one per sub-dim) and you must enforce parent-before-child loading order.
  • The empirical delta — in dbt terms, a typical star has ~5-8 dim models; the snowflake equivalent has ~10-14. Engineering time per dim is similar; total time scales with model count.
  • The senior take — pick whichever your team can maintain; an under-staffed team should not sign up for the orchestration overhead of a snowflake.

Dimension 3 — storage cost.

  • Starredundant strings on wide dims. dim_product with 1M rows × category VARCHAR(128) + brand VARCHAR(128) + supplier VARCHAR(128) carries ~400 MB of redundant strings; with row overhead and indexes the disk footprint is far larger.
  • Snowflake20-40% smaller on wide dims. Normalising the strings into sub-dims replaces the wide string columns with 8-byte surrogate keys; on dimensions with high repetition the saving is substantial.
  • The empirical deltadim_customer 1.2M × geography refactor in section 3 saved ~35 GB; on a 30M-row clickstream dim_event with repeating event_category, event_subcategory, the saving can hit 200-300 GB.
  • The senior take — storage cost matters when you are paying per-TB (cloud warehouses) at scale; at 10 TB it's a rounding error, at 10 PB it is real money.

Dimension 4 — BI tool fit.

  • StarTableau, Looker, Power BI, Mode, Hex love it. Every BI tool auto-generates SQL against a star with zero configuration; dim_product.category is a clickable field that joins fact-to-dim transparently.
  • Snowflakeneeds manual joins or views. Looker requires explicit LookML view definitions per sub-dim hop; Tableau requires relationship modelling; Power BI requires relationship arrows. The end-user click-and-explore experience is worse unless the BI layer abstracts the hops.
  • The empirical delta — onboarding a new dashboard analyst on a star takes hours; on a snowflake it takes days because they must learn the join paths.
  • The senior take — if business users self-serve in the BI tool, star wins; if all SQL is centrally authored by data engineers, either works.

Dimension 5 — best for (workloads).

  • Star — best for — interactive BI dashboards, ad-hoc analytics, self-serve exploration, marketing/sales/product KPI surfaces, fast time-to-first-insight, smaller-to-medium warehouses where storage is not the binding constraint.
  • Snowflake — best for — regulated reporting (finance, healthcare, insurance) where the source-of-truth hierarchy matches the audit chart of accounts, petabyte-scale warehouses where storage savings are material, deeply hierarchical dimensions (product taxonomies with 4+ levels), data-vault → mart pipelines where snowflake is the natural intermediate shape.

The honest meta-take. Most production warehouses ship both — a snowflake layer under the hood for raw / staging / data-vault, and a star layer at the consumption mart. The cleanest pattern is snowflake-on-the-way-in, star-on-the-way-out: normalise to sub-dimensions during ingestion to enforce hierarchy integrity, denormalise back to a star at the mart layer for BI consumption. This pattern is increasingly common in dbt + Snowflake + Looker stacks.

Worked example — score the same warehouse on all five dimensions

Detailed explanation. A realistic interview drill is "score your current warehouse on the five-dimension trade-off matrix". Below is the canonical scoring exercise for a mid-size retailer running a hybrid (snowflake staging, star mart) on Snowflake + Looker.

Question. A retailer has 100M-row fact_sales, 1.2M-row dim_customer, 50K-row dim_product, and 300-row dim_store. They run interactive Looker dashboards (~500 concurrent users), nightly finance reconciliation, and weekly product-hierarchy audits. Score star vs snowflake on the five dimensions and recommend a shape per layer.

Input. Workload mix: 70% interactive BI (sub-second SLA), 25% nightly batch (4-hour SLA), 5% audit queries (10-minute SLA). Storage budget: $5K/month.

Code.

CREATE TABLE shape_scorecard AS
SELECT * FROM (VALUES
    ('query_speed',     'star',      'wins',          'sub-second on 100M rows'),
    ('query_speed',     'snowflake', 'acceptable',    '0.9-1.5 s on 100M rows'),
    ('etl_complexity',  'star',      'medium',        '6 dim models + 1 fact'),
    ('etl_complexity',  'snowflake', 'higher',        '12 dim/sub-dim models + 1 fact'),
    ('storage_cost',    'star',      'baseline',      '900 GB total'),
    ('storage_cost',    'snowflake', 'cheaper',       '~620 GB total (saves $300/month)'),
    ('bi_tool_fit',     'star',      'wins',          'Looker auto-joins, zero LookML hops'),
    ('bi_tool_fit',     'snowflake', 'requires LookML','manual joins for each sub-dim'),
    ('best_for',        'star',      'BI mart',       'consumption layer for Looker'),
    ('best_for',        'snowflake', 'staging + audit','source-of-truth + finance reconciliation')
) AS t(dimension, schema_shape, verdict, evidence);
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The scorecard is a single artefact a senior engineer can paste into an architecture doc.
  2. Each dimension has two rows — one verdict per shape; the comparison is explicit, not narrative.
  3. The evidence column anchors each verdict in numberssub-second on 100M rows, $300/month savings; this is the senior-signal column.
  4. The recommendation falls out: snowflake at staging + audit, star at the mart; this is the dominant production pattern in 2026.
  5. The scorecard is a living document — re-score quarterly as data volume and workload mix shift.

Output (recommendation table).

layer shape rationale
raw + staging snowflake matches source-of-truth audit hierarchy
consumption mart star Looker auto-joins, sub-second BI
finance audit views snowflake regulated reporting needs normalised dims

Rule of thumb: the layer drives the shape, not the warehouse-wide preference. Modern stacks rarely pick one shape for the entire warehouse.

The trade-off matrix as a one-screen reference

Dimension Star verdict Snowflake verdict
Query speed Fewer joins · faster More joins · slower on wide queries
ETL complexity Heavier per-dim load · simpler shape Lighter per-dim load · more orchestration
Storage cost Redundant strings on wide dims 20-40% smaller on wide dims
BI tool fit Tableau / Looker / Power BI auto-join Needs manual joins, LookML, or views
Best for Dashboards · ad-hoc analytics · self-serve Regulated reporting · audit trails · petabyte storage

SQL
Topic — sql
SQL practice library

Practice →

SQL
Topic — group-by
GROUP BY practice

Practice →

Solution Using a side-by-side query comparison + measured cost

Code.

-- Same business question, both shapes, with timing.
-- (A) STAR query — 2 joins.
EXPLAIN ANALYZE
SELECT p.category, SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_product_star p ON p.product_sk = f.product_sk
WHERE f.date_sk BETWEEN 20260101 AND 20260131
GROUP BY p.category;

-- (B) SNOWFLAKE query — 3 joins.
EXPLAIN ANALYZE
SELECT cat.category_name, SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_product_sf p    ON p.product_sk    = f.product_sk
JOIN dim_category   cat  ON cat.category_sk = p.category_sk
WHERE f.date_sk BETWEEN 20260101 AND 20260131
GROUP BY cat.category_name;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

step star plan snowflake plan
1 Scan fact (Jan 2026) — 2.15M rows Scan fact (Jan 2026) — 2.15M rows
2 Hash-join dim_product_star (50K, broadcast) Hash-join dim_product_sf (50K, broadcast)
3 Group + aggregate by category Hash-join dim_category (1.2K, broadcast)
4 Order Group + aggregate by category_name
5 Order
Wall-clock ~420 ms on Snowflake XS ~610 ms on Snowflake XS
  1. Step 1 is identical — both shapes scan the same fact partition.
  2. Step 2 is identical — both shapes broadcast dim_product.
  3. The snowflake variant adds step 3 — an extra hash-join hop to dim_category.
  4. Steps 4-5 in the snowflake plan are the same as steps 3-4 in the star plan, just shifted by one.
  5. Total cost delta: ~190 ms (~45% slower) on this small example; on larger facts the delta widens further because the hash table cache pressure grows.

Output.

query joins wall_clock_ms result_rows
(A) star 2 420 28 categories
(B) snowflake 3 610 28 categories (same answer)

Why this works — concept by concept:

  • Same-result comparison — both queries return the same category-level totals; the SQL and the plan differ but the answer does not.
  • Measured wall-clock — interviewers want numbers, not opinions; bringing a measured wall-clock delta is the senior move.
  • Broadcast economics — small sub-dims (< 10K rows) broadcast cheaply; large sub-dims (> 1M rows) shuffle and the snowflake delta grows fast.
  • Optimiser caveats — the warehouse query planner may reorder joins; the number of joins is the floor on cost, not the ceiling.
  • CostO(N) fact scan + O(N + Dᵢ) per join hop; cumulative hops are the differentiator between the shapes.

5. Decision matrix — when to choose which (with worked SQL)

Decision-tree diagram for choosing between star and snowflake schema — a top question 'Is query latency the top priority?' branching yes → 'Star schema' and no → next question 'Are hierarchies deep + changing often?'; each leaf is a coloured verdict card; on a light PipeCode card.

star schema vs snowflake schema — a four-question decision tree

The decision matrix is the senior framework: four questions, four verdicts, one clear answer per workload. Memorise it and you can defend any shape choice in 60 seconds.

Q1 — Is query latency the #1 priority?

  • YESstar schema (denormalised). Interactive BI / sub-second dashboards demand the fewest joins possible. Modern columnar warehouses can mask one or two extra joins, but at 100+ concurrent users every saved millisecond pays compounding rent.
  • NO → continue to Q2.

Q2 — Are dimension hierarchies deep AND changing often?

  • YESsnowflake schema (normalised). Deep hierarchies (country → region → city → district → neighbourhood) with frequent re-org events (region boundaries shift) are painful to maintain as denormalised strings. Normalising into sub-dims means a re-org touches one row in dim_region, not millions of rows in dim_customer.
  • NO → continue to Q3.

Q3 — Is storage cost a meaningful constraint? (e.g. petabyte-scale)

  • YESsnowflake schema. At petabyte scale a 30% storage saving on wide dimensions translates to material dollars; the join cost is amortised across many queries.
  • NO → continue to Q4.

Q4 — Does your BI tool auto-join multi-step paths?

  • YESeither works (modern Looker with explicit LookML joins, Power BI with relationship views; both can mask snowflake hops from end users).
  • NOstar schema (safer default; minimises the BI-layer modelling cost).

The default verdict. Start star, refactor only if storage or audit requires it. For ~80% of warehouses, the star schema is the right starting shape; the cost of refactoring a star into a snowflake later is far smaller than the cost of forcing every analyst to learn snowflake join paths from day one.

When Data Vault is in the mix. Data Vault 2.0 (hubs + links + satellites) is its own paradigm and lives upstream of both star and snowflake; a typical pipeline is source → data vault → snowflake (intermediate) → star (consumption mart). The decision matrix above applies to the consumption layer, not the data-vault layer.

Worked example — pick the schema for three real workloads

Detailed explanation. Real senior interviews ask you to apply the decision tree to multiple workloads and defend each pick.

Question. For each workload, walk through the decision tree and recommend a schema with a one-sentence rationale: (a) a SaaS product analytics warehouse serving Looker dashboards to 800 product managers, (b) a bank's regulatory reporting warehouse generating Basel-III risk reports, (c) a clickstream warehouse storing 100B events for ML feature engineering.

Input. Three workloads, three different priority profiles.

Code.

CREATE TABLE workload_recommendations AS
SELECT * FROM (VALUES
    ('SaaS product analytics',  'Q1: YES (sub-second BI is the priority)',
                                'star',     'Looker auto-joins; PMs self-serve; storage is not the constraint'),
    ('Bank regulatory reporting','Q1: NO; Q2: YES (deep audit hierarchies that change quarterly)',
                                'snowflake','normalised dims match Basel-III source-of-truth references; audit-friendly'),
    ('Clickstream feature store','Q1: NO; Q2: NO; Q3: YES (100B events × wide string dims = petabytes)',
                                'snowflake','normalising event_category + event_subcategory saves ~200 GB per partition')
) AS t(workload, decision_trace, recommendation, rationale);
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Workload (a) — SaaS product analytics — Q1 is YES (interactive BI), so the tree short-circuits to star; the rationale is BI auto-join + PM self-serve.
  2. Workload (b) — bank regulatory reporting — Q1 is NO (overnight batch is fine), Q2 is YES (Basel-III hierarchies are deep and re-organised quarterly); tree resolves to snowflake.
  3. Workload (c) — clickstream feature store — Q1 NO (ML feature jobs are batch), Q2 NO (hierarchies are shallow), Q3 YES (petabyte scale with redundant string dims); tree resolves to snowflake for storage economics.
  4. Each recommendation has a one-sentence rationale rooted in the decision-tree branch; this is the answer shape interviewers expect.
  5. The recommendations are defensible not because they are universally right but because they trace a known framework.

Output (the recommendation table).

workload recommendation rationale
SaaS product analytics star Looker auto-join + PM self-serve
Bank regulatory reporting snowflake normalised dims match audit hierarchy
Clickstream feature store snowflake petabyte storage savings on repeating dim strings

Rule of thumb: the decision tree is short-circuit; the first YES wins. Practise tracing the tree on three workloads before any interview — the muscle memory makes the answer feel automatic.

Worked SQL — answering the same business question on both shapes

The question. "For Q1 2026, what's the top-5 revenue by category, sliced by customer region, for online (web + mobile) orders only?"

Star answer.

SELECT
    p.category,
    c.region   AS customer_region,
    SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_product  p  ON p.product_sk  = f.product_sk
JOIN dim_customer c  ON c.customer_sk = f.customer_sk
JOIN dim_channel  ch ON ch.channel_sk = f.channel_sk
JOIN dim_date     d  ON d.date_sk     = f.date_sk
WHERE d.year = 2026 AND d.quarter = 1
  AND ch.channel_name IN ('web', 'mobile')
GROUP BY p.category, c.region
ORDER BY revenue DESC
LIMIT 5;
Enter fullscreen mode Exit fullscreen mode

Snowflake answer.

SELECT
    cat.category_name AS category,
    g.region          AS customer_region,
    SUM(f.revenue)    AS revenue
FROM fact_sales      f
JOIN dim_product_sf  p   ON p.product_sk    = f.product_sk
JOIN dim_category    cat ON cat.category_sk = p.category_sk
JOIN dim_customer    c   ON c.customer_sk   = f.customer_sk
JOIN dim_geography   g   ON g.geography_sk  = c.geography_sk
JOIN dim_channel     ch  ON ch.channel_sk   = f.channel_sk
JOIN dim_date        d   ON d.date_sk       = f.date_sk
WHERE d.year = 2026 AND d.quarter = 1
  AND ch.channel_name IN ('web', 'mobile')
GROUP BY cat.category_name, g.region
ORDER BY revenue DESC
LIMIT 5;
Enter fullscreen mode Exit fullscreen mode
  • Same business answer, same result rows, same ordering.
  • Star — 4 joins. Snowflake — 6 joins (two extra hops: dim_category and dim_geography).
  • Wall-clock on warm cache — star ~500 ms, snowflake ~750 ms on a XS Snowflake warehouse against 8M Q1 rows.
  • The SQL is more readable on the snowflake in one specific way: the join paths are self-documenting (you can see the hierarchy) — at the cost of more typing.

SQL
Topic — data-modeling
Schema-choice practice

Practice →

SQL
Topic — joins
Multi-join SQL drills

Practice →

Solution Using a layered recommendation (snowflake-in, star-out)

Code.

-- Build the snowflake-in/star-out architecture as a single layered model.
-- Layer 1 — snowflake-shaped sub-dims (audit + storage win).
CREATE OR REPLACE VIEW v_dim_product_snowflake AS
SELECT p.product_sk, p.product_id, p.product_name,
       cat.category_name, br.brand_name, sup.supplier_name
FROM dim_product_sf  p
JOIN dim_category    cat ON cat.category_sk = p.category_sk
JOIN dim_brand       br  ON br.brand_sk     = p.brand_sk
JOIN dim_supplier    sup ON sup.supplier_sk = br.supplier_sk;

-- Layer 2 — flatten to a star-shaped consumption dim (BI win).
CREATE TABLE dim_product_star AS
SELECT product_sk, product_id, product_name,
       category_name AS category, brand_name AS brand, supplier_name AS supplier
FROM v_dim_product_snowflake;

-- Layer 3 — fact stays the same; both layers read the same fact.
-- BI tools point at dim_product_star; audit queries point at v_dim_product_snowflake.
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

layer shape consumer refresh cadence
Sub-dims (dim_category, dim_brand, dim_supplier) snowflake audit, finance every load
v_dim_product_snowflake (view) snowflake audit queries virtual (no refresh)
dim_product_star (table) star BI tools, Looker every load (materialised)
fact_sales unchanged both layers every load
  1. Layer 1 — the sub-dims persist physically; they are the source of truth and survive audits.
  2. Layer 2 — a flattening view exposes the snowflake hierarchy as a single wide row; auditors prefer this over chasing FKs across multiple tables.
  3. Layer 3 — a materialised star dim is built from the flattening view and exposed to BI tools; query latency on the BI layer is identical to a pure star.
  4. The pattern is storage-efficient at the source + BI-friendly at the consumption layer — the best of both shapes.
  5. The DAG cost is one extra CREATE TABLE AS per dim; in dbt, this is one extra model per dim.

Output (one-row sample of dim_product_star).

product_sk product_id product_name category brand supplier
1001 SKU-9981 Acme Wireless Earbuds Pro Electronics Acme AcmeCorp Ltd

Why this works — concept by concept:

  • Snowflake-in, star-out — the dominant 2026 pattern; storage savings at staging + BI auto-join at the mart.
  • Materialised star dim — the consumption layer is physically denormalised so the BI tool sees a star; no query-time hops.
  • Audit view — the sub-dim hierarchy stays accessible to auditors via a thin view, so the snowflake structure survives.
  • Single factfact_sales is unchanged; both layers read the same fact, so storage on the fact is paid once.
  • Cost — one extra materialised dim per load; the storage cost is offset by the BI-layer query-latency win on every dashboard hit thereafter.

Choosing the right schema (cheat sheet)

A one-screen cheat sheet for star schema vs snowflake schema — pick the shape that matches your workload.

You care most about … Pick Why
Sub-second BI dashboards star Fewer joins → faster; BI tools auto-generate single-join SQL
Self-serve analyst exploration star Tableau / Looker / Power BI users don't need to learn join paths
Petabyte-scale storage economics snowflake 20-40% smaller wide dims; surrogate keys replace repeating strings
Audit-friendly regulated reporting snowflake Normalised dims match source-of-truth chart of accounts
Deep hierarchies (4+ levels) that change quarterly snowflake A re-org updates one sub-dim row, not millions of dim rows
Simplest ETL DAG star Fewer dim models, simpler orchestration
Smallest dim storage footprint snowflake Normalisation eliminates redundant strings
Easiest onboarding for new analysts star Single-step joins map to mental model of "fact + dim"
Single source of truth for hierarchies snowflake dim_category.category_name UNIQUE enforces uniqueness
Mixed workload (BI + audit) hybrid (snowflake staging, star mart) Snowflake-in, star-out is the 2026 default
Data Vault → mart pipeline snowflake intermediate, star mart Natural fit; hubs/links/satellites → snowflake → star
Conformed dim across many marts either Conformed dimensions are independent of star vs snowflake choice
SCD type 2 history either Both schemas handle SCD2 identically on the relevant dims
First-time warehouse build star Default safe choice; refactor later if needed
Multi-channel retailer fact star dim_channel as conformed dim + sentinel store rows for non-physical
Clickstream event store with repeating event metadata snowflake Storage savings on dim_event are substantial

Frequently asked questions

What is the difference between star schema and snowflake schema in one sentence?

A star schema has one fact table in the centre and a single layer of denormalised dimension tables around it — each dimension stores its hierarchy inline as columns, so every query reaches its data in one join per dimension. A snowflake schema has the same fact table and same primary dimensions, but each dimension is normalised into sub-dimension tables (e.g., dim_product → dim_category → dim_brand → dim_supplier), so analytical queries pay more joins per dimension hierarchy in exchange for less storage redundancy. The senior way to phrase the difference is "star denormalises for BI speed; snowflake normalises for storage and audit", and most production warehouses ship both shapes in different layers.

When should I choose star schema over snowflake schema?

Choose star schema when query latency is the #1 priority (interactive BI dashboards, self-serve analyst exploration, sub-second SLAs), when your BI tool (Tableau, Looker, Power BI, Mode, Hex) auto-generates SQL and you want zero LookML / relationship overhead, when storage cost is not a binding constraint, and when your team prefers a simpler ETL DAG with fewer dim models. The four-question decision tree from section 5 short-circuits: Q1 — is query latency the priority? YES → star, no further questions. As a default starting shape for a new warehouse, star wins ~80% of the time because the cost of refactoring star-to-snowflake later is smaller than the cost of forcing every analyst to learn join paths on day one. The exception is regulated industries (finance, healthcare, insurance) where audit hierarchies dictate the shape from day one.

When should I choose snowflake schema over star schema?

Choose snowflake schema when storage cost is a meaningful constraint (petabyte-scale warehouses where a 20-40% dim-storage saving translates to material dollars), when you have deep dimension hierarchies (4+ levels) that change often (regional re-orgs, product taxonomy rewrites), when regulated reporting requires the normalised structure to match a source-of-truth chart of accounts (Basel III, IFRS, GAAP, SOX, HIPAA), or when you are building the intermediate layer of a Data Vault → snowflake → star pipeline. The snowflake pays for itself in storage and audit-friendliness at the cost of query latency and BI-tool friction; on a Snowflake or BigQuery warehouse with broadcast joins on small sub-dims, the latency penalty is typically 20-50% — acceptable for batch and tolerable for most BI workloads, painful for sub-50-ms drilldowns.

What is a fact table vs a dimension table?

A fact table stores the measurable events of a business process — quantity, revenue, discount, unit_price — along with the foreign keys (customer_sk, product_sk, date_sk, store_sk) that point at the dimensions describing each event; the fact's grain is the declared "one row per X" contract (e.g., one row per order line, one row per shipment event, one row per page view). A dimension table stores the descriptive context of a business entity — customer_name, product_name, category, region, manager_name — and acts as the slice-and-dice surface for analytical queries; dimensions are reached by joining on the surrogate key (customer_sk, product_sk). The mnemonic is "facts are numbers you sum; dimensions are strings you group by"SUM(revenue) GROUP BY category is SUM(fact column) GROUP BY dim column. Every well-modelled warehouse has one fact per business process and one set of conformed dimensions reused across all facts.

What are conformed dimensions and slowly changing dimensions (SCD)?

Conformed dimensions are dimension tables that are shared across multiple fact tables — one dim_customer is used by fact_sales, fact_support, and fact_marketing so that cross-mart reporting (revenue + tickets + campaign attribution per customer) joins to the same customer rows everywhere; this is the single biggest reuse lever in a warehouse and the strongest senior signal in a dimensional modeling answer. Slowly changing dimensions (SCD) describe how dimension attributes change over time and how the warehouse preserves history: SCD type 1 overwrites the old value (no history); SCD type 2 versions the dimension row with effective_from, effective_to, and is_current columns so each historical fact joins to the dimension as it was at the time of the event; SCD type 3 keeps a current and a previous column (limited history). The senior interview answer is "dim_product is SCD type 2 because product attributes change and historical revenue reports must reflect the product hierarchy at the time of sale".

Is snowflake schema the same as the Snowflake data warehouse?

No — they share a name but are completely separate concepts. The snowflake schema (lowercase, dimensional-modeling concept) is the normalised dimension-table design pattern this guide compares to star; it was named in the 1990s by Ralph Kimball because the radial diagram of a normalised dim with sub-dimensions resembles a snowflake crystal. The Snowflake data warehouse (capital S, the company / product) is a cloud-native columnar warehouse vendor (Snowflake Inc., ticker SNOW) that runs on AWS, GCP, and Azure and competes with BigQuery, Databricks SQL, Redshift, and Synapse. Confusingly, the Snowflake warehouse supports both schema shapes — you can build a star schema or a snowflake schema on the Snowflake warehouse, and many production teams do exactly that (snowflake-in at staging, star-out at the mart). Interview tip: when an interviewer asks about "snowflake", clarify which one in your first sentence.


Practice on PipeCode

PipeCode ships 450+ data-engineering interview problems — including SQL + Python drills keyed to the same star schema vs snowflake schema skill set this guide teaches (fact + dim joins, surrogate key handling, aggregate parity across normalised vs denormalised dims, SCD type 2 effective-date joins, conformed-dim reuse, and the snowflake-in / star-out architecture pattern). Whether you're prepping for a senior dimensional modeling screen the night before or grinding the fact table + dimension table + grain + conformed dimensions + SCD loop over months, the practice library mirrors the same shapes, decision-tree thinking, and trade-off vocabulary interviewers expect.

Kick off via Explore practice →; drill the SQL practice lane →; fan out into the data-modeling lane →; rehearse joins drills →; reinforce aggregation patterns →; widen coverage on the full Python practice library →.

Top comments (0)