DEV Community

Cover image for Enterprise Data Warehouse Design (Inmon-Style): Multi-Source, Conformed, Governed
Gowtham Potureddi
Gowtham Potureddi

Posted on

Enterprise Data Warehouse Design (Inmon-Style): Multi-Source, Conformed, Governed

The enterprise data warehouse is the integration contract at the centre of every regulated, multi-source data platform — and four decades after Bill Inmon coined the term it still ships in 2026, because the problems it was designed to solve never went away. The inmon data warehouse is the trusted single-version-of-truth that sits between dozens of operational systems and dozens of downstream consumers, and the corporate information factory is the layered architecture that wraps it. When a senior data engineer is asked to design a platform for a bank, a hospital, an insurer, or any business that has to answer auditors and regulators, the 3nf data warehouse modelled the Inmon way is still the default answer.

This guide walks through the five things every senior data warehouse architecture discussion turns on — why Inmon's EDW model still ships, how the Corporate Information Factory layers data from sources to consumption, how to model the EDW in third normal form with SCD-2 history, how conformed dimensions bridge the Inmon core to Kimball-style marts, and how governance keeps the whole stack audit-ready. Each section pairs a teaching block with a Solution-Tail design answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works. By the end you'll be able to defend a 3NF EDW with surrogate keys, full SCD-2 history, conformed dim_customer, and a governance overlay that passes SOX, GDPR, and HIPAA — the exact shape edw design rounds reward when the interviewer asks "how would you build this from scratch?"

PipeCode blog header for Inmon-style enterprise data warehouse design — bold white headline 'Enterprise Data Warehouse · Inmon-Style' with subtitle 'Multi-Source · Conformed · Governed · 3NF' and a stylised 5-layer Corporate Information Factory stack on a dark gradient with purple, green, orange, and blue accents and a small pipecode.ai attribution.

When you want hands-on reps immediately after reading, drill the data modelling library →, rehearse slowly-changing dimension problems →, and warm up on ETL patterns →.


On this page


1. Why Inmon's EDW model still ships in 2026

The EDW is the one-time integration contract — every other layer is derived

The one-sentence invariant: an Inmon enterprise data warehouse is a third-normal-form, history-aware, atomic-grain store that integrates every source system once, and from which every downstream mart, BI cube, ML feature, and reverse-ETL feed is rebuilt as needed. Once you internalise that mental model — the EDW as the durable middle layer, the marts as the disposable serving layer — every enterprise data warehouse trade-off becomes a sequence of decisions about where data lives, who owns it, and how long it has to be preserved.

Inmon vs Kimball in five bullets.

  • Direction. Inmon is top-down — design the corporate-wide 3NF EDW first, then derive subject-area marts from it. Kimball is bottom-up — design dimensional marts directly off the sources, then conform them later.
  • Modelling style. Inmon uses third normal form at the EDW layer (entities separated from change). Kimball uses dimensional modelling (fact tables + dimensions) at every layer.
  • Time horizon. Inmon is built for a decade — atomic history, full audit, slow to design, slow to change. Kimball is built for a quarter — fast time-to-insight, easier for analysts to query.
  • Audit posture. Inmon's atomic grain + SCD-2 on every column gives forensic auditability — "show me what customer_segment was for customer 42 on 2024-03-15." Kimball can do this only if the mart was designed with that history.
  • Disposability. Marts in the Inmon model are derived and disposable — drop and rebuild from the EDW overnight. The Kimball-only world has no such safety net.

When the Inmon model wins.

  • Regulated industries. Banks, insurers, hospitals, utilities, telcos — anywhere audit trails, retention policies, and "what was the state on 2024-03-15?" are first-class requirements.
  • Multi-source integration. Five-plus operational systems that overlap on the same business entities (customer, product, account) — the EDW is where you reconcile them once.
  • Multi-team analytics. Sales, finance, risk, marketing each build their own marts off a single conformed EDW — no two teams disagree about who the customer was.
  • Decade-long horizon. You expect the platform to outlive every BI tool, every cloud vendor switch, and every CDO; the EDW is the durable asset.

When Kimball wins alone.

  • Single source, analytics-only. One application, one team, one set of dashboards — skip the EDW middle and go straight to a star schema.
  • Fast time-to-insight. Three-month delivery target with no audit requirements — Kimball ships marts faster.
  • Cloud lakehouse + dbt + BI. Modern startup stacks often skip Inmon entirely; that is a deliberate trade-off, not an oversight.

The 2026 reality — most enterprises run a hybrid.

  • Inmon at the core. The 3NF EDW remains the integration layer — Snowflake, BigQuery, Databricks, Redshift all host Inmon-style EDWs at scale today.
  • Kimball at the serving layer. Star-schema marts sit downstream of the EDW for BI and dashboarding — fast joins, fast aggregates, friendly column names.
  • Lakehouse as substrate. Delta / Iceberg / Hudi tables now back both the staging and EDW layers. The model is unchanged; the storage and engine are.
  • dbt as the modelling tool. Most modern enterprises express the EDW + marts as dbt models — staging → intermediate → marts is the dbt-flavoured restatement of CIF layers.

What interviewers listen for.

  • Do you say 3NF and atomic grain for the EDW layer? — senior signal.
  • Do you mention conformed dimensions when the discussion crosses subject areas? — required answer.
  • Do you reach for SCD Type-2 on every dimension attribute at the EDW grain? — Inmon-purist signal.
  • Do you separate EDW from marts as durable vs disposable? — architectural maturity signal.

Worked example — pick a model for a multi-source banking platform

Detailed explanation. Realistic EDW design starts from the business context — number of source systems, regulatory requirements, downstream consumer count, and retention horizon. A retail bank with seven core systems, SOX + GDPR + Basel obligations, three analytics teams, and a 10-year retention policy is a textbook Inmon case.

Question. A retail bank is replacing its 1990s warehouse. It has 7 source systems (core banking, cards, mortgages, loans, CRM, fraud, marketing), needs SOX + GDPR + Basel audit trails, supports 4 BI teams (finance, risk, marketing, ops), and has a 10-year retention requirement. Which warehouse model should it pick, and why?

Code (the decision logic as a checklist).

def pick_warehouse_model(spec: dict) -> str:
    # spec is the business + regulatory context
    if spec["source_systems"] >= 5 \
       and spec["regulated"] \
       and spec["analytics_teams"] >= 2 \
       and spec["retention_years"] >= 5:
        return "inmon_edw_with_kimball_marts"   # the hybrid
    if spec["source_systems"] == 1 \
       and not spec["regulated"] \
       and spec["time_to_insight_weeks"] <= 12:
        return "kimball_only"
    return "inmon_edw_with_kimball_marts"        # default — durable + serving

spec = {
    "source_systems": 7,
    "regulated": True,           # SOX, GDPR, Basel
    "analytics_teams": 4,
    "retention_years": 10,
    "time_to_insight_weeks": 26,
}

pick_warehouse_model(spec)
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Count source systems. 7 is well above the 5-source threshold where reconciling entity definitions (who is a customer?) becomes a full-time job — the EDW is the place that work happens once.
  2. Check regulatory load. SOX (financial controls), GDPR (right-to-be-forgotten, PII residency), Basel (capital and risk reporting) all demand reproducible history at the row level — Inmon's SCD-2 grain delivers this; pure Kimball marts struggle.
  3. Count analytics teams. Four downstream teams means four overlapping mart designs — without a conformed EDW each team reinvents dim_customer and disagreements multiply.
  4. Retention horizon. A decade outlives every BI tool refresh and every cloud-vendor switch — the durable layer is the EDW; the marts can be rebuilt.
  5. Output the choice. The hybrid: Inmon 3NF EDW at the core, Kimball star marts downstream for BI consumption.

Output.

Decision factor Value Implication
Source systems 7 Need the EDW as integration layer
Regulated Yes (SOX/GDPR/Basel) Full SCD-2 history required
Analytics teams 4 Conformed dims mandatory
Retention 10 years EDW as the durable asset
Recommendation Inmon EDW + Kimball marts The 2026 default

Rule of thumb. Five-plus sources + any regulatory load = Inmon EDW at the core. Single source + no regulators + fast deadline = Kimball-only. Everywhere in between, pick the hybrid.

Enterprise data warehouse design interview question

A senior interviewer often frames this as: "Walk me through how you'd decide between an Inmon EDW and a Kimball-only data mart for a multi-source bank." The candidate must articulate the integration contract, the audit posture, and the disposability of the serving layer — three pillars of mature data warehouse architecture thinking.

Solution Using the Inmon EDW + Kimball mart hybrid

# Architecture sketch (top-down)
Layer 0  Sources       core_banking | cards | mortgages | loans | crm | fraud | marketing
Layer 1  Staging       raw landing — schema-on-read, audit columns, no transformation
Layer 2  3NF EDW       customer | account | transaction | branch | product (atomic, SCD-2)
Layer 3  Conformed dim dim_customer extracted once, reused everywhere
Layer 4  Data marts    sales_mart | finance_mart | risk_mart | marketing_mart (Kimball stars)
Layer 5  Consumption   BI | ML | regulatory reports | reverse ETL
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Layer What happens Why this layer
1 Sources 7 OLTP systems push to Kafka / Debezium CDC Don't disturb operational stores
2 Staging Raw landing in object storage / lakehouse, no transforms Replay buffer — if EDW logic changes you replay from here
3 EDW 3NF model with surrogate keys + SCD-2 + audit columns The integration contract — every source reconciled here once
4 Conformed dim dim_customer derived from customer EDW table nightly One customer definition, shared across all marts
5 Marts fact_sales, fact_revenue, fact_loan_risk joined to conformed dims Fast BI queries, friendly column names, can be rebuilt
6 Consumption Looker / Tableau / SageMaker / reverse ETL to Salesforce The analytics surface — outlives by being thin

Output:

Quality Inmon EDW value Kimball mart value
Audit trail Full SCD-2 per attribute Inherited from EDW
Query speed Slower (many joins) Fast (pre-joined star)
Time-to-insight Months for first mart Weeks per new mart
Disposability Durable — never drop Disposable — rebuild overnight
Regulatory fit SOX/GDPR/Basel ready Inherits readiness from EDW
Schema change cost High — EDW is contract Low — drop the mart, rebuild

Why this works — concept by concept:

  • Single integration contract — the EDW is the only place where customer is reconciled across 7 sources. Every mart, every BI tool, every ML model reads from it. No team builds their own definition.
  • Atomic grain — the EDW stores the finest-grain facts (one row per transaction, one row per balance snapshot). Marts can always aggregate up; you can never disaggregate after the fact.
  • SCD-2 on every dimension — every change to customer_segment, branch_id, risk_band is preserved with valid_from / valid_to. Auditors can ask "what was the value on 2024-03-15?" and the EDW answers.
  • Disposable serving layer — marts are derived views of the EDW. Drop one, rebuild it overnight; the EDW is untouched. This is the operational superpower of the hybrid model.
  • Cost — EDW build time = O(months) up front; mart build time = O(days) per subject area; storage = ~3× source size for SCD-2 history + RF on the lakehouse; query latency = O(joins) for ad-hoc EDW queries, O(1) for mart queries.

SQL
Topic — data modelling
Data modelling design problems (EDW, 3NF, dimensional)

Practice →


2. The Corporate Information Factory — sources to consumption

The CIF is five layers — sources, staging, 3NF EDW, dependent marts, consumption — connected by a governance ribbon

The mental model in one line: the Corporate Information Factory is the canonical Inmon reference architecture — raw operational data enters from the left, is integrated once in the 3NF EDW in the middle, then is fanned out to derived marts and consumption tools on the right, with a governance overlay that spans every layer. Once you can name the five layers and the role each one plays, half of the corporate information factory interview question family collapses to obvious answers.

Visual diagram of the Corporate Information Factory architecture — five layers stacked horizontally left-to-right (Sources, Staging, 3NF EDW, Data marts, Consumption); each layer with example components and an arrow showing data flow; a governance ribbon overlaid across the top; on a light PipeCode card.

The five CIF layers.

  • Layer 0 — Sources. Every OLTP database, SaaS application, mainframe extract, and event stream the business produces. The EDW reads from these but never writes back.
  • Layer 1 — Staging. A schema-on-read raw landing zone. Records arrive unchanged with audit metadata (source_system, load_id, loaded_at). No business logic; no joins; no transformation. Storage = cheap object store.
  • Layer 2 — 3NF EDW. The integrated, atomic, history-aware store. Entities are separated; SCD-2 is applied to every changing attribute; surrogate keys give durable identity; business keys give traceability. This is the contract.
  • Layer 3 — Dependent data marts. Kimball star schemas derived from the EDW. Each mart is owned by a subject area (sales, finance, risk). Marts share conformed dimensions sourced from the EDW.
  • Layer 4 — Consumption. BI tools (Tableau, Looker, Power BI), ML pipelines (Spark, SageMaker), reverse-ETL (Census, Hightouch). All read from marts; none read from staging.

Why the EDW middle layer is non-negotiable.

  • One reconciliation, many readers. A customer's name appears in 5 sources — without the EDW, every mart reconciles it independently and disagreements proliferate. With the EDW, the reconciliation happens once and every reader inherits the answer.
  • Replay-safe. If business rules change, you can rebuild the EDW from staging without re-extracting from sources. The staging layer is the time-travel buffer.
  • Stable contract. Marts and consumers depend on EDW column names. Sources change shape constantly; the EDW absorbs that change so downstream readers never have to.
  • Audit-ready. A single SCD-2-enforced EDW makes "what was the value on date X?" a one-query answer. No mart-by-mart history reconstruction.

The staging layer — what it is and isn't.

  • It is a faithful, append-only copy of source data with audit metadata. Schema-on-read; no transforms.
  • It is the replay buffer — if you rebuild the EDW from scratch you start here, not from the sources.
  • It is where data-quality checks live before data hits the EDW (rejection rules, schema-evolution gates).
  • It is not a query layer for BI — staging tables have raw column names, no joins, no history.
  • It is not the system of record — sources are. Staging is a snapshot.

Refresh patterns.

  • CDC-driven (preferred). Debezium / Fivetran / native log mining streams source row changes into staging; staging triggers EDW merges; EDW changes propagate to marts.
  • Batch-driven (legacy). Nightly full extracts land in staging; EDW MERGE runs on the new snapshot; marts rebuild from EDW.
  • Hybrid. High-volume transactional sources stream via CDC; reference data (product catalogue, geo) lands as nightly batch.

Where the lakehouse fits.

  • Storage substrate. Delta Lake, Apache Iceberg, and Apache Hudi all back the staging and EDW layers in modern stacks. The 3NF model is unchanged; only the file format and the engine are.
  • Time-travel for free. Iceberg and Delta give you snapshot isolation, transactional MERGEs, and time-travel queries built into the engine — no need to reinvent SCD-2 for "what did this table look like yesterday?" though you still want SCD-2 at the row level for the audit-trail story.
  • Compute separation. The EDW data lives once in object storage; multiple engines (Snowflake, BigQuery, Databricks SQL, Trino) can read it. This is the lakehouse promise that finally makes the Inmon-vs-Kimball religious war moot — both layers fit on the same substrate.
  • Catalog as governance hub. Unity Catalog, Polaris, Hive Metastore extensions all hold the column-level lineage and tagging that the governance ribbon needs. The EDW gets these for free if you adopt a modern catalog.

Where dbt fits.

  • staging → intermediate → marts. The dbt-canonical three-layer model maps cleanly onto the CIF — staging = staging, intermediate = 3NF EDW, marts = dependent marts. Most modern enterprises express the whole CIF as a dbt project.
  • Tests as data-quality gates. dbt test runs uniqueness, not-null, accepted-value, and referential-integrity checks at every layer transition. The DQ framework integrates seamlessly with Great Expectations / Soda Core.
  • Documentation as catalog. dbt docs auto-generates a column-level lineage graph and renders it as a navigable site. Cheap, fast, and good enough for many regulated shops.

Worked example — sketch the CIF for a multi-source bank

Detailed explanation. The CIF for a real bank wires together core banking (Debezium CDC), cards (Kafka events), mortgages (nightly batch), CRM (Fivetran), and marketing (Snowpipe). The staging layer is the universal landing zone; the 3NF EDW is the integration contract; the marts are the serving layer.

Question. Sketch the five-layer CIF for a retail bank with 5 source systems. Show the load mechanism per source, the staging table per source, the 3NF EDW entity each source feeds, and the consuming marts.

Input (sources).

Source System type Load mechanism Refresh cadence
core_banking OLTP (Oracle) Debezium CDC continuous
cards OLTP (Postgres) Debezium CDC continuous
mortgages OLTP (DB2 mainframe) Nightly batch daily
crm SaaS (Salesforce) Fivetran hourly
marketing Files (S3 drops) Snowpipe hourly

Code (the CIF layer map as YAML).

# cif-layer-map.yml
sources:
  - name: core_banking
    load: debezium_cdc
    staging_table: stg.core_banking_customer
    edw_entities: [customer, account, transaction]
  - name: cards
    load: debezium_cdc
    staging_table: stg.cards_account
    edw_entities: [account, transaction]
  - name: mortgages
    load: nightly_batch
    staging_table: stg.mortgages_account
    edw_entities: [account, transaction]
  - name: crm
    load: fivetran
    staging_table: stg.crm_customer
    edw_entities: [customer]
  - name: marketing
    load: snowpipe
    staging_table: stg.marketing_campaign
    edw_entities: [campaign, customer_campaign]

edw:
  entities: [customer, account, transaction, branch, product, campaign]
  scd2_on_all_dimensions: true
  surrogate_keys: hash_sha256_business_keys
  audit_columns: [valid_from, valid_to, is_current, system_of_record, load_id]

marts:
  - name: sales_mart
    facts: [fact_sales]
    dims_from_edw: [dim_customer, dim_product, dim_date, dim_branch]
  - name: finance_mart
    facts: [fact_revenue, fact_cost]
    dims_from_edw: [dim_customer, dim_account, dim_date]
  - name: risk_mart
    facts: [fact_credit_risk]
    dims_from_edw: [dim_customer, dim_account, dim_date]
  - name: marketing_mart
    facts: [fact_campaign_response]
    dims_from_edw: [dim_customer, dim_campaign, dim_date]
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Sources push, the EDW pulls (via staging). Debezium streams CDC from core_banking and cards directly to Kafka topics, which land in stg.core_banking_customer etc. — the EDW never touches an OLTP system directly.
  2. Staging keeps a faithful copy. Every row in staging has source_system, load_id, loaded_at, and the original source primary key. No business rules apply yet.
  3. EDW merges happen by entity, not by source. customer in the EDW is built from stg.core_banking_customer + stg.crm_customer + a conform-rule that picks the system-of-record per attribute (CRM owns name, core-banking owns DOB, etc.).
  4. SCD-2 captures change. Each EDW entity row has valid_from, valid_to, is_current columns; when a source row changes, the EDW closes the old row (valid_to = now(), is_current = false) and inserts a new one.
  5. Marts derive from the EDW once a night. dim_customer in every mart is a SELECT-from-EDW view (or materialised table). Facts join to it via surrogate keys.

Output.

Layer Tables / objects Storage size (illustrative) Refresh
0 — Sources 5 OLTP / SaaS systems (operational, not counted) live
1 — Staging 18 staging tables 12 TB (raw, with history) per-source cadence
2 — EDW (3NF) 16 atomic entities 8 TB (with SCD-2 + audit) continuous merges
3 — Marts (Kimball) 4 marts × 5 dims × 1–2 facts 4 TB (denormalised) hourly to nightly
4 — Consumption BI dashboards + ML + reverse ETL (read-only) sub-minute query

Rule of thumb. Staging is cheap; never skip it. The EDW is durable; design it carefully. Marts are disposable; design them for the consumer's convenience, not the modeller's purity.

EDW architecture interview question

A senior interviewer often frames this as: "We have 5 source systems and 4 BI teams. Sketch the layers between sources and dashboards and tell me what lives where." The candidate must reach for the five-layer CIF and explain why each layer earns its keep.

Solution Using the CIF five-layer model with a governance ribbon

-- ① Staging — schema-on-read landing zone (no transforms)
CREATE TABLE stg.core_banking_customer (
  src_customer_id      STRING,
  full_name            STRING,
  date_of_birth        DATE,
  segment              STRING,
  source_system        STRING,           -- audit
  load_id              STRING,           -- audit
  loaded_at            TIMESTAMP,        -- audit
  src_row_hash         STRING            -- dedup key
);

-- ② 3NF EDW — atomic, SCD-2, surrogate-keyed
CREATE TABLE edw.customer (
  customer_key         STRING,            -- surrogate (hash of business key)
  customer_id          STRING,            -- business key
  full_name            STRING,
  date_of_birth        DATE,
  segment              STRING,
  valid_from           TIMESTAMP,
  valid_to             TIMESTAMP,
  is_current           BOOLEAN,
  system_of_record     STRING,
  load_id              STRING,
  PRIMARY KEY (customer_key)
);

-- ③ Conformed dimension — derived from EDW
CREATE OR REPLACE VIEW marts.dim_customer AS
SELECT
  customer_key,
  customer_id,
  full_name,
  segment,
  valid_from,
  valid_to,
  is_current
FROM edw.customer;

-- ④ Mart fact — joins to conformed dim by surrogate key
CREATE TABLE marts.fact_sales (
  sale_key             STRING,
  customer_key         STRING,            -- FK to dim_customer
  product_key          STRING,            -- FK to dim_product
  date_key             INT,               -- FK to dim_date
  branch_key           STRING,            -- FK to dim_branch
  amount               DECIMAL(18, 2),
  sold_at              TIMESTAMP
);
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Layer What it produces Who reads it When does it refresh
Staging Raw landing per source EDW merge jobs only per-source cadence (continuous / hourly / nightly)
3NF EDW Atomic, SCD-2, integrated entities Conformed dim views + auditors continuous (CDC) or nightly (batch)
Conformed dims Subject-agnostic dim_customer, dim_product Every mart fact join nightly rebuild from EDW
Marts Star schemas — fact + dims BI / ML / reverse-ETL hourly to nightly
Consumption Dashboards, ML features, reverse-ETL feeds End users + downstream apps sub-minute query latency

Output:

Quality Staging EDW Marts Consumption
Schema discipline schema-on-read strict 3NF denormalised star tool-shaped
Audit trail source row hash SCD-2 per attribute inherited from EDW inherited from marts
Query SLAs none — load-only minutes per ad-hoc sub-second BI sub-second
Disposability replay-safe durable forever rebuild nightly rebuild on demand

Why this works — concept by concept:

  • Staging as replay buffer — if the EDW model changes (new attribute, new entity), you replay from staging without re-extracting from sources. The bridge layer that protects production OLTP systems.
  • EDW as the integration contract — every source reconciles here once. Every downstream reader inherits the same answer to "who is customer 42?" — disagreements between marts disappear.
  • Conformed dimensionsdim_customer is derived from edw.customer exactly once and shared by every mart. No two marts can have a different dim_customer.
  • Disposable martsfact_sales is rebuilt nightly from EDW + conformed dims. Drop it, rebuild it; the EDW is untouched. This is the operational superpower.
  • Cost — staging storage = ~1× source (with retention); EDW storage = ~3× source (SCD-2 history + RF); mart storage = additive per subject area; compute = O(merge_volume) on EDW + O(join_breadth) on marts.

SQL
Topic — dimensional modelling
Dimensional modelling problems (EDW, marts, conformed dims)

Practice →


3. 3NF modelling for the EDW — entities, relationships, history

Third normal form separates entities from change — the durable shape of an Inmon EDW

The mental model in one line: the 3NF EDW captures every business entity once, every relationship once, every change forever — with surrogate keys for durable identity, business keys for traceability, and SCD-2 columns for time-travel. Once you can recite "atomic, integrated, history-aware" and back it with column names, the 3nf data warehouse interview question family becomes muscle memory.

Visual 3NF entity-relationship diagram — four atomic entities (customer, account, transaction, branch) drawn as rounded cards with surrogate-key + business-key + audit columns visible inside each; thin labelled FK arrows connecting them; a side annotation about SCD Type-2 history at the atomic grain; on a light PipeCode card.

The four 3NF design rules every Inmon EDW follows.

  • One entity, one table. customer, account, transaction, branch, product each live in their own atomic table. Don't denormalise at the EDW grain — that's the mart's job.
  • One source of truth per attribute. When two sources disagree on full_name, pick a system-of-record per attribute (e.g. CRM wins for name, core-banking wins for DOB) and document it.
  • Atomic grain. transaction stores one row per transaction event. Never store an aggregate (daily totals, monthly snapshots) at the EDW layer — marts compute those.
  • History per row. Every entity carries valid_from, valid_to, is_current so you can query "what was this entity's state at time T?".

Surrogate keys vs business keys.

  • Surrogate key (customer_key). Internally generated, immutable, opaque. Typically a hash of the business key (SHA-256) or a sequence integer. Joins use surrogate keys exclusively.
  • Business key (customer_id). Sourced from the operational system — meaningful to the business, but mutable (system reissues, re-platforming). Never join on this.
  • Why both? The surrogate gives durable identity (joins survive source-system migrations). The business key gives traceability (back to the operational record).

Audit columns every EDW entity carries.

  • valid_from, valid_to, is_current — SCD-2 history. valid_to = '9999-12-31' on the current row.
  • system_of_record — which source provided this version of the row.
  • load_id — the batch / streaming offset that wrote this version. Pins lineage to the load.
  • created_at, updated_at — when the EDW row was inserted / closed. Distinct from the source's own timestamps.
  • src_row_hash — hash of all source columns, used to detect "is this an actual change or a no-op refresh?".

The four SCD types — and which to use at EDW grain.

  • Type 0 — never updated (e.g. date_of_birth for a real person). Use sparingly.
  • Type 1 — overwrite in place. Loses history. Avoid at EDW grain.
  • Type 2 — close the old row, insert a new row with new valid_from. The Inmon default.
  • Type 3 — add a previous_value column. Keeps only one prior version. Niche.
  • Type 6 — hybrid (Type 1 + Type 2 + Type 3 columns on the same row). Used when downstream consumers want both "current" and "historical" in one view.

The atomic grain rule (Inmon's most-violated principle).

  • Transaction tables store one row per event — never one row per day-summary. Daily summaries are mart concerns.
  • Snapshot tables (e.g. monthly account balance) live in marts, not the EDW.
  • Bridge tables (many-to-many between accounts and parties) live in the EDW because they're entity-level, not aggregate-level.

Worked example — model a banking customer, account, transaction graph in 3NF

Detailed explanation. A retail bank wants to integrate customer, account, branch, and transaction data from three sources (core banking, cards, mortgages) into a single EDW. The 3NF approach normalises each entity to its own table, applies SCD-2 to changing attributes, and uses surrogate keys for joins.

Question. Design a 3NF EDW for a retail bank's customer + account + transaction + branch graph. Show the DDL with surrogate keys, business keys, SCD-2 columns, and audit columns. Include a sample MERGE that closes a Type-2 row when a customer's segment changes.

Input (sample source data).

customer_id full_name segment branch_id source extract_ts
C042 Anya Patel retail B12 core_banking 2024-01-10 09:00
C042 Anya Patel retail B12 core_banking 2024-03-15 09:00
C042 Anya Patel private B12 core_banking 2024-03-15 10:30

Code (3NF DDL + SCD-2 MERGE).

-- The 4 atomic entities
CREATE TABLE edw.customer (
  customer_key       STRING NOT NULL,           -- surrogate (SHA-256 of customer_id)
  customer_id        STRING NOT NULL,           -- business key
  full_name          STRING,
  date_of_birth      DATE,
  segment            STRING,
  valid_from         TIMESTAMP NOT NULL,
  valid_to           TIMESTAMP NOT NULL,        -- '9999-12-31' if current
  is_current         BOOLEAN NOT NULL,
  system_of_record   STRING NOT NULL,
  load_id            STRING NOT NULL,
  src_row_hash       STRING NOT NULL,
  created_at         TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (customer_key, valid_from)
);

CREATE TABLE edw.branch (
  branch_key         STRING NOT NULL,
  branch_id          STRING NOT NULL,
  city               STRING,
  region             STRING,
  opened_at          DATE,
  valid_from         TIMESTAMP NOT NULL,
  valid_to           TIMESTAMP NOT NULL,
  is_current         BOOLEAN NOT NULL,
  system_of_record   STRING NOT NULL,
  load_id            STRING NOT NULL,
  PRIMARY KEY (branch_key, valid_from)
);

CREATE TABLE edw.account (
  account_key        STRING NOT NULL,
  account_id         STRING NOT NULL,
  customer_key       STRING NOT NULL,           -- FK to edw.customer
  branch_key         STRING NOT NULL,           -- FK to edw.branch
  product_code       STRING,
  opened_at          DATE,
  status             STRING,
  valid_from         TIMESTAMP NOT NULL,
  valid_to           TIMESTAMP NOT NULL,
  is_current         BOOLEAN NOT NULL,
  system_of_record   STRING NOT NULL,
  load_id            STRING NOT NULL,
  PRIMARY KEY (account_key, valid_from)
);

CREATE TABLE edw.transaction (
  txn_key            STRING NOT NULL,
  txn_id             STRING NOT NULL,
  account_key        STRING NOT NULL,           -- FK to edw.account
  amount             DECIMAL(18, 2),
  currency           STRING,
  txn_at             TIMESTAMP NOT NULL,
  posted_at          TIMESTAMP,
  status             STRING,
  load_id            STRING NOT NULL,
  PRIMARY KEY (txn_key)                          -- transactions are immutable
);

-- SCD-2 MERGE: close old row, insert new row when segment changes
MERGE INTO edw.customer AS tgt
USING (
  SELECT
    SHA256(s.customer_id)            AS customer_key,
    s.customer_id,
    s.full_name,
    s.segment,
    s.source                          AS system_of_record,
    s.extract_ts                      AS valid_from,
    s.load_id,
    SHA256(CONCAT(s.full_name, '|', s.segment)) AS src_row_hash
  FROM stg.core_banking_customer AS s
) AS src
ON tgt.customer_key = src.customer_key
   AND tgt.is_current = TRUE
   AND tgt.src_row_hash <> src.src_row_hash       -- only act on real changes
WHEN MATCHED THEN UPDATE SET
  valid_to    = src.valid_from,
  is_current  = FALSE;

INSERT INTO edw.customer (
  customer_key, customer_id, full_name, segment,
  valid_from, valid_to, is_current,
  system_of_record, load_id, src_row_hash
)
SELECT
  SHA256(s.customer_id),
  s.customer_id,
  s.full_name,
  s.segment,
  s.extract_ts,
  TIMESTAMP '9999-12-31 00:00:00',
  TRUE,
  s.source,
  s.load_id,
  SHA256(CONCAT(s.full_name, '|', s.segment))
FROM stg.core_banking_customer AS s
LEFT JOIN edw.customer AS c
  ON c.customer_key = SHA256(s.customer_id)
 AND c.is_current   = TRUE
 AND c.src_row_hash = SHA256(CONCAT(s.full_name, '|', s.segment))
WHERE c.customer_key IS NULL;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Hash the business key. SHA256(customer_id) is the surrogate key — deterministic, durable, immune to source-system reissues.
  2. Compute the row hash. SHA256(CONCAT(full_name, '|', segment)) lets us detect "did anything actually change?" — no-op refreshes don't create new SCD-2 rows.
  3. Close the previous current row. The MERGE updates valid_to and is_current = FALSE on the prior row whose hash differs from the incoming one.
  4. Insert the new current row. The INSERT adds a new row with valid_from = extract_ts, valid_to = '9999-12-31', is_current = TRUE — but only if a row with that exact hash doesn't already exist (idempotent).
  5. Atomic grain preserved. transaction has no SCD-2 — transactions are immutable events. Their FK account_key points to whichever account row was current at the time of insert.

Output.

After three source pushes for C042, edw.customer contains:

customer_key customer_id segment valid_from valid_to is_current
sha256(C042) C042 retail 2024-01-10 09:00 2024-03-15 10:30 FALSE
sha256(C042) C042 private 2024-03-15 10:30 9999-12-31 TRUE

The duplicate retail row at 2024-03-15 09:00 was idempotent — same hash, no insert. The segment change at 10:30 closed the retail row and opened the private row.

Rule of thumb. SCD-2 every dimension attribute by default. Drop down to Type-1 only when you have an explicit business reason (e.g. "we never need to know the previous value of last_login_at"). The cost of carrying history is low; the cost of losing it is unrecoverable.

Worked example — pick the system-of-record per attribute across three sources

Detailed explanation. A retail bank's customer entity is sourced from three operational systems — core banking (Oracle), CRM (Salesforce), and the digital banking platform (Postgres). Each system has its own copy of full_name, email, phone, segment, date_of_birth. Without a system-of-record (SoR) decision per attribute, the EDW would either pick arbitrarily or, worse, alternate between sources and produce a different answer every refresh.

Question. Three sources publish customer rows for C042. Each row has the same business key but different values for some attributes. Define a system-of-record rule per attribute and write the merge logic that produces a single canonical EDW row.

Input (three source rows for customer C042).

source full_name email phone segment dob
core_banking Patel, Anya anya.patel@old.example +44 7700 900123 private 1988-04-12
crm Anya Patel anya@gmail.com +44 7700 900123 private 1988-04-12
digital Anya Patel anya@gmail.com +44 7700 900999 retail 1988-04-12

Code (the SoR map + merge logic).

-- ① System-of-record per attribute — declared in metadata
CREATE TABLE ops.system_of_record (
  entity      STRING NOT NULL,
  attribute   STRING NOT NULL,
  source      STRING NOT NULL,
  rationale   STRING,
  PRIMARY KEY (entity, attribute)
);

INSERT INTO ops.system_of_record VALUES
  ('customer', 'full_name',     'crm',         'CRM owns name (KYC workflow updates here)'),
  ('customer', 'email',         'crm',         'CRM owns contact methods'),
  ('customer', 'phone',         'digital',     'Digital banking captures latest phone'),
  ('customer', 'segment',       'core_banking','Core banking owns risk/regulatory segment'),
  ('customer', 'date_of_birth', 'core_banking','Core banking owns regulatory DOB');

-- ② Build the canonical EDW row from per-attribute SoR
WITH src_cb AS (
  SELECT customer_id, full_name, email, phone, segment, date_of_birth, 'core_banking' AS source
  FROM stg.core_banking_customer
),
src_crm AS (
  SELECT customer_id, full_name, email, phone, segment, date_of_birth, 'crm' AS source
  FROM stg.crm_customer
),
src_digital AS (
  SELECT customer_id, full_name, email, phone, segment, date_of_birth, 'digital' AS source
  FROM stg.digital_customer
),
canonical AS (
  SELECT
    COALESCE(crm.customer_id, cb.customer_id, dig.customer_id)  AS customer_id,
    crm.full_name                                                AS full_name,         -- SoR: crm
    crm.email                                                    AS email,             -- SoR: crm
    dig.phone                                                    AS phone,             -- SoR: digital
    cb.segment                                                   AS segment,           -- SoR: core_banking
    cb.date_of_birth                                             AS date_of_birth      -- SoR: core_banking
  FROM src_cb       AS cb
  FULL OUTER JOIN src_crm    AS crm ON crm.customer_id = cb.customer_id
  FULL OUTER JOIN src_digital AS dig ON dig.customer_id = COALESCE(crm.customer_id, cb.customer_id)
)
SELECT * FROM canonical;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Declare SoR per attribute in metadata. ops.system_of_record is the single source of truth for "who owns what." Auditors read this table to understand how the EDW resolves conflicts.
  2. Join all sources at the business-key level. A FULL OUTER JOIN keeps customers that exist in any source — useful for catching new customers from digital before core banking.
  3. Pick each attribute from its declared SoR. crm.full_name, dig.phone, cb.segment — the SELECT list is the SoR map expressed as SQL.
  4. Fall through when SoR is missing. If crm doesn't have a row for this customer, the COALESCE in the customer_id column ensures the row still flows; full_name then falls to the next-priority source declared in ops.system_of_record (a more complete implementation iterates a priority list).
  5. Feed the canonical row into the SCD-2 MERGE. The output of this CTE feeds the MERGE we built earlier — same SCD-2 logic, now with one canonical value per attribute.

Output (canonical row for C042).

customer_id full_name email phone segment dob
C042 Anya Patel anya@gmail.com +44 7700 900999 private 1988-04-12

A single, unambiguous version of customer C042 — the EDW's integration promise made concrete.

Rule of thumb. Always declare the system-of-record per attribute in metadata, never hard-code it in SQL. The metadata makes the rule auditable, changeable, and reviewable independently of the merge logic.

EDW 3NF modelling interview question

A senior probe usually sounds like: "A customer's segment changed from retail to private on 2024-03-15. Walk me through what your EDW does — DDL, MERGE, and how you'd answer a SOX auditor asking what the segment was on 2024-03-01." The interviewer is testing whether the candidate reaches for SCD-2 reflexively and can defend the atomic grain.

Solution Using SCD-2 MERGE with surrogate keys and row-hash idempotence

-- The "what was the segment on 2024-03-01" query — answers the SOX auditor in one shot
SELECT
  customer_id,
  segment,
  valid_from,
  valid_to
FROM edw.customer
WHERE customer_id = 'C042'
  AND valid_from <= TIMESTAMP '2024-03-01 00:00:00'
  AND valid_to   >  TIMESTAMP '2024-03-01 00:00:00';
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Operation Effect on edw.customer
1 Initial load 2024-01-10 Insert row 1 (segment=retail, valid_from=2024-01-10, valid_to=9999-12-31, is_current=true)
2 No-op refresh 2024-03-15 09:00 (same hash) MERGE skips — row_hash unchanged; row 1 untouched
3 Segment change 2024-03-15 10:30 Row 1 closes (valid_to=2024-03-15 10:30, is_current=false); row 2 inserts (segment=private, valid_from=2024-03-15 10:30, valid_to=9999-12-31, is_current=true)
4 Auditor asks "what was the segment on 2024-03-01?" Point-in-time query selects row 1 — segment=retail

Output:

customer_id segment valid_from valid_to
C042 retail 2024-01-10 09:00:00 2024-03-15 10:30:00

A single row, returned in milliseconds. The auditor's question has a definitive, reproducible answer — and the EDW will give the same answer in 2034 if asked again.

Why this works — concept by concept:

  • Surrogate key (SHA-256 of business key) — durable, immutable, opaque. Joins survive every source-system migration. The customer_id itself can be reissued; the customer_key never changes for a given customer_id.
  • Row hash for idempotenceSHA256(CONCAT(full_name, '|', segment)) ensures duplicate source refreshes don't create duplicate SCD-2 rows. A no-op load is truly a no-op.
  • SCD Type-2 historyvalid_from / valid_to / is_current lets you query the state at any historical timestamp with a <= / > predicate. Auditors get reproducible answers years later.
  • Atomic graintransaction is immutable (events don't change); customer, account, branch carry SCD-2 (descriptors do change). The grain decision is per-entity, not platform-wide.
  • system_of_record + load_id — every row records which source provided this version and which load wrote it. Auditors trace every row back to its origin in seconds.
  • Cost — storage = ~2–4× the source size for SCD-2 + audit columns; MERGE cost = O(changed_rows) per cycle (no-op refreshes are free); query cost = O(active_rows) for "current state" and O(history_rows) for point-in-time.

SQL
Topic — slowly-changing dimensions
Slowly-changing dimension problems (SCD-2 at EDW grain)

Practice →


4. Conformed dimensions and dependent marts (Inmon to Kimball bridge)

The 3NF EDW feeds Kimball star marts through conformed dimensions — one definition, many readers

The mental model in one line: dependent data marts are Kimball star schemas built on top of the Inmon EDW, sharing conformed dimensions that are extracted from the EDW exactly once and reused everywhere. Once you internalise "the mart is a derived view; the EDW is the source; the conformed dim is the contract between them" the data warehouse architecture interview around Inmon-Kimball bridges becomes a clean pattern.

Visual diagram showing the EDW → dependent marts flow — central 3NF EDW card on the left with dim_customer extracted once, feeding into two downstream Kimball mart star schemas (sales mart, finance mart) that both reuse the same conformed dim_customer; thin arrows + a 'conformed' label; on a light PipeCode card.

The conformed dimension contract.

  • One source. dim_customer is built from edw.customer exactly once per refresh cycle. No mart builds its own customer dimension.
  • One grain. Every mart sees the same row count, same customer_key set, same attributes for the same customer_id.
  • One semantic. customer_segment = 'private' means the same thing in sales_mart, finance_mart, and risk_mart. Disagreements are impossible by construction.
  • Many readers. N marts × 1 conformed dim → N joins to the same definition.

Why mart-by-mart dim_customer is the anti-pattern.

  • Drift. Each mart team chooses different attributes, different bucketing rules, different SCD types. Six months in, no two marts agree.
  • Cost. Six copies of the customer dimension cost six times the storage and six times the maintenance.
  • Audit nightmare. SOX auditor asks "what was the segment on 2024-03-15?" — three different answers from three marts.

Mart design rules.

  • Star schema first. Every mart is fact_X joined to dim_Y by surrogate keys. No snowflakes at the mart layer.
  • Conformed dims share keys. dim_customer.customer_key is the same value in every mart — derived from edw.customer.customer_key.
  • Mart-local dims. A mart can have additional dims that aren't shared (e.g. dim_campaign_channel in marketing_mart only). These don't conflict with conformed dims.
  • Aggregate marts. Pre-aggregated marts (daily revenue, hourly KPI) live downstream of the granular marts. Cheaper queries, faster dashboards.

Refresh patterns by layer.

  • EDW. Continuous (CDC) or scheduled (batch). The source of truth — refresh first.
  • Conformed dims. Rebuild nightly from EDW. They are derivations of edw.X is_current = TRUE rows.
  • Mart facts. Incremental on new transaction events; full rebuild allowed if granular and cheap.
  • Aggregate marts. Refresh after granular marts complete; can lag by hours without impacting BI.

The Inmon-Kimball bridge in three steps.

  • Step 1. Build the 3NF EDW with SCD-2 history. This is the durable contract.
  • Step 2. Extract conformed dimensions from the EDW as Kimball-style flattened views (one row per current entity, denormalised attributes for BI ease).
  • Step 3. Build mart facts that join only to conformed dims (no joins to the 3NF EDW directly).

Worked example — sales mart + finance mart sharing dim_customer

Detailed explanation. A retail bank wants two marts — sales_mart for product-line sales attribution and finance_mart for revenue and cost reporting. Both marts join to a customer dimension. The conformed-dimension approach derives dim_customer once from edw.customer and shares it.

Question. Build a sales mart and a finance mart that both join to a conformed dim_customer. Show the DDL for dim_customer, fact_sales, fact_revenue. Show a query that answers "revenue by customer segment for 2024" from finance_mart and a query that answers "units sold by customer segment for 2024" from sales_mart — both segments returning identical sums because the dimension is conformed.

Input (sample EDW rows).

customer_key customer_id segment valid_from valid_to is_current
sha(C042) C042 private 2024-03-15 9999-12-31 TRUE
sha(C099) C099 retail 2023-01-01 9999-12-31 TRUE
sha(C123) C123 smb 2022-06-01 9999-12-31 TRUE

Code (DDL for conformed dim + two marts).

-- ① Conformed dimension — built once, shared everywhere
CREATE OR REPLACE VIEW marts.dim_customer AS
SELECT
  customer_key,
  customer_id,
  full_name,
  segment,                     -- current value; for point-in-time use edw.customer directly
  date_of_birth,
  is_current,
  valid_from,
  valid_to
FROM edw.customer
WHERE is_current = TRUE;

-- ② Sales mart fact
CREATE TABLE marts.fact_sales (
  sale_key         STRING NOT NULL,
  customer_key     STRING NOT NULL,             -- FK to dim_customer
  product_key      STRING NOT NULL,
  branch_key       STRING NOT NULL,
  date_key         INT NOT NULL,
  units            INT,
  amount           DECIMAL(18, 2),
  sold_at          TIMESTAMP,
  load_id          STRING
);

-- ③ Finance mart fact
CREATE TABLE marts.fact_revenue (
  revenue_key      STRING NOT NULL,
  customer_key     STRING NOT NULL,             -- FK to dim_customer
  account_key      STRING NOT NULL,
  date_key         INT NOT NULL,
  revenue          DECIMAL(18, 2),
  cost             DECIMAL(18, 2),
  recognised_at    TIMESTAMP,
  load_id          STRING
);

-- ④ Sales-by-segment query (sales_mart)
SELECT
  d.segment,
  SUM(f.units)   AS units_2024,
  SUM(f.amount)  AS amount_2024
FROM marts.fact_sales        AS f
JOIN marts.dim_customer      AS d ON d.customer_key = f.customer_key
WHERE f.sold_at >= TIMESTAMP '2024-01-01'
  AND f.sold_at <  TIMESTAMP '2025-01-01'
GROUP BY d.segment
ORDER BY amount_2024 DESC;

-- ⑤ Revenue-by-segment query (finance_mart)
SELECT
  d.segment,
  SUM(f.revenue) AS revenue_2024,
  SUM(f.cost)    AS cost_2024
FROM marts.fact_revenue      AS f
JOIN marts.dim_customer      AS d ON d.customer_key = f.customer_key
WHERE f.recognised_at >= TIMESTAMP '2024-01-01'
  AND f.recognised_at <  TIMESTAMP '2025-01-01'
GROUP BY d.segment
ORDER BY revenue_2024 DESC;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. dim_customer is a view, not a copy. Most modern warehouses (Snowflake, BigQuery, Databricks) cache view results. The view stays in sync with edw.customer automatically — no nightly rebuild required if your engine supports it.
  2. Surrogate key is the join column. Both fact_sales.customer_key and fact_revenue.customer_key are FKs to dim_customer.customer_key. Identical surrogate keys mean identical join results across marts.
  3. Segment lookups go through dim_customer. Neither mart stores segment in the fact; the dimension is the only source. Change the segment in the EDW, every mart query reflects it immediately.
  4. Identical segment lists. The "segment by sales" and "segment by revenue" queries return the exact same segment values — { private, retail, smb } — because they share a dimension.
  5. Point-in-time correctness. If you need "what was the segment at the time of the sale?" join fact_sales.customer_key to edw.customer (full SCD-2) with valid_from <= sold_at < valid_to — Inmon's history is intact.

Output.

Sales mart query result:

segment units_2024 amount_2024
private 8,420 1,240,500.00
retail 142,300 8,910,200.00
smb 22,150 3,440,800.00

Finance mart query result:

segment revenue_2024 cost_2024
private 980,400.00 280,100.00
retail 6,420,000.00 1,920,000.00
smb 2,510,000.00 720,000.00

The segment values match perfectly across both marts. Two different aggregates, one shared customer definition.

Rule of thumb. Conformed dims are non-negotiable in a multi-mart shop. Build the conformed dim_customer, dim_product, dim_date, dim_branch first; only then start designing mart facts.

Worked example — build an aggregate mart for executive dashboards

Detailed explanation. Executive dashboards rarely need transaction-grain data — they want daily revenue per segment, monthly active accounts per region, weekly card spend per product line. An aggregate mart pre-computes these from the granular mart so the BI tool's queries hit a tiny table instead of scanning billions of fact rows.

Question. Build an aggregate mart that pre-computes daily revenue by segment + branch from marts.fact_revenue. Show the DDL, the daily refresh, and the savings ratio.

Input. marts.fact_revenue has 2.4 billion rows over 5 years; the executive dashboard needs daily revenue by (segment, branch_key) for the past 90 days.

Code.

-- ① Aggregate mart DDL — keyed by day + segment + branch
CREATE TABLE marts.agg_daily_revenue_by_segment_branch (
  date_key       INT NOT NULL,
  branch_key     STRING NOT NULL,
  segment        STRING NOT NULL,
  revenue        DECIMAL(18, 2),
  cost           DECIMAL(18, 2),
  customer_count INT,
  refreshed_at   TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (date_key, branch_key, segment)
);

-- ② Nightly refresh — incremental on the latest day only
INSERT OVERWRITE INTO marts.agg_daily_revenue_by_segment_branch
PARTITION (date_key)
SELECT
  CAST(TO_CHAR(f.recognised_at, 'YYYYMMDD') AS INT) AS date_key,
  f.branch_key,
  d.segment,
  SUM(f.revenue)                  AS revenue,
  SUM(f.cost)                     AS cost,
  COUNT(DISTINCT f.customer_key)  AS customer_count
FROM marts.fact_revenue        AS f
JOIN marts.dim_customer        AS d ON d.customer_key = f.customer_key
WHERE f.recognised_at >= CURRENT_DATE - INTERVAL '1 day'
  AND f.recognised_at <  CURRENT_DATE
GROUP BY 1, 2, 3;

-- ③ Executive dashboard query
SELECT
  segment,
  SUM(revenue) AS revenue_90d
FROM marts.agg_daily_revenue_by_segment_branch
WHERE date_key >= CAST(TO_CHAR(CURRENT_DATE - INTERVAL '90 days', 'YYYYMMDD') AS INT)
GROUP BY segment
ORDER BY revenue_90d DESC;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Aggregate grain chosen carefully. (date_key, branch_key, segment) is the finest grain the executive dashboard ever needs. Coarser grains save more storage but lose drill-down flexibility.
  2. Incremental refresh. Only the past day is recomputed each night — O(daily_rows), not O(all_rows). Idempotent if you replay yesterday's load.
  3. Granular mart untouched. marts.fact_revenue is unchanged. The aggregate sits next to it as a derived view.
  4. Dashboard hits the aggregate. A 90-day revenue-by-segment query scans 90 × 500 = 45,000 rows instead of 120 million fact rows — a 2,600× compression.
  5. Re-derivable. If the aggregate is corrupted, drop it and rebuild — marts.fact_revenue is the source.

Output.

segment revenue_90d
retail 142,500,000.00
smb 88,200,000.00
private 24,100,000.00

Query runs in 80 ms instead of 18 seconds against the granular mart. The executive dashboard refreshes instantly.

Rule of thumb. Aggregate marts pay for themselves the moment the BI tool runs the same shape of query more than a hundred times a day. Below that, just hit the granular mart and skip the maintenance.

EDW dependent mart interview question

A senior probe usually sounds like: "We have a sales mart and a finance mart. Sales says they have 142 thousand retail customers; finance says they have 138 thousand. Who's wrong?" The interviewer is testing whether the candidate reaches for the conformed dimension as the diagnostic and explains why the discrepancy is impossible if the marts derive from one EDW.

Solution Using marts.dim_customer derived once from edw.customer

-- The diagnostic query — confirms both marts use the same conformed dim
WITH sales_segments AS (
  SELECT
    d.segment,
    COUNT(DISTINCT f.customer_key) AS distinct_customers
  FROM marts.fact_sales    AS f
  JOIN marts.dim_customer  AS d ON d.customer_key = f.customer_key
  GROUP BY d.segment
),
finance_segments AS (
  SELECT
    d.segment,
    COUNT(DISTINCT f.customer_key) AS distinct_customers
  FROM marts.fact_revenue   AS f
  JOIN marts.dim_customer   AS d ON d.customer_key = f.customer_key
  GROUP BY d.segment
)
SELECT
  COALESCE(s.segment, f.segment) AS segment,
  s.distinct_customers          AS in_sales_mart,
  f.distinct_customers          AS in_finance_mart,
  ABS(COALESCE(s.distinct_customers, 0) - COALESCE(f.distinct_customers, 0)) AS gap
FROM sales_segments  AS s
FULL OUTER JOIN finance_segments AS f USING (segment)
ORDER BY gap DESC;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Stage Operation What it does
1 sales_segments CTE counts distinct customers per segment in sales_mart
2 finance_segments CTE counts distinct customers per segment in finance_mart
3 FULL OUTER JOIN on segment aligns the two counts side by side
4 ABS(diff) AS gap quantifies the discrepancy per segment
5 Interpretation gap = 0 across all segments ⇒ marts agree; gap > 0 ⇒ one mart filtered or missed customers

Output:

segment in_sales_mart in_finance_mart gap
retail 142,300 142,300 0
smb 22,150 22,150 0
private 8,420 8,420 0

Zero gaps across the board. The original "142k vs 138k" complaint must come from a mart that isn't using the conformed dim — likely a finance team that built its own customer list off a separate source. The fix is to route them through marts.dim_customer.

Why this works — concept by concept:

  • Conformed dimension as contract — both marts join to the same marts.dim_customer keyed by customer_key. The customer set is identical by construction.
  • Surrogate-key joins — fact tables carry surrogate keys; the dimension supplies attributes. Renaming a customer doesn't break the join.
  • One source of truthmarts.dim_customer is a single view over edw.customer is_current = TRUE. There is no other version.
  • Disposable marts — when finance complains about a count mismatch, you don't change the EDW. You re-derive the finance mart from the EDW + conformed dim — fix lands in one place.
  • Cost — conformed dim refresh = O(distinct_current_customers) per cycle; mart fact join = O(fact_rows) with the dim cached in memory; debugging time = bounded because the contract is explicit.

SQL
Topic — dimensional modelling
Conformed dimension problems (multi-mart stars)

Practice →


5. Governance, audit, SCDs at EDW grain

Governance is the overlay that makes an EDW shippable — lineage, audit, SCDs, compliance, PII

The mental model in one line: governance is the cross-cutting layer that turns a 3NF model into a regulator-ready warehouse — lineage from source column to consumption dashboard, audit for every row change, SCD policy per attribute, PII tagging with retention rules, and policy enforcement at query time. Once you bake these in at design time, the edw design interview question family of "how do you handle GDPR right-to-be-forgotten on a 10-year warehouse?" becomes a one-paragraph answer instead of a panic.

Visual governance overlay diagram — a wide horizontal ribbon spanning the full diagram labelled 'Governance' with four sub-sections (Lineage, Audit, SCDs, Compliance); beneath, the 5-layer EDW stack from earlier with PII tags on specific columns highlighted; a Great Expectations + GDPR badge row; on a light PipeCode card.

The four governance pillars.

  • Lineage. Track every column from source system → staging → EDW → mart → dashboard. OpenLineage, dbt's column-level lineage, and DataHub all express the same graph. Auditors ask "where does this dashboard number come from?" and the lineage graph answers in one click.
  • Audit. Every row in the EDW carries system_of_record, load_id, created_at. Every change carries valid_from / valid_to. Every access (in regulated systems) is logged to an immutable audit log. SOX, HIPAA, and Basel demand this.
  • SCDs. Define per attribute, not per table. customer.full_name might be Type-2 (preserve every name change for KYC), customer.last_login_at might be Type-1 (don't bother tracking), customer.date_of_birth might be Type-0 (immutable). Document the choice in metadata.
  • Compliance. PII tagging (which columns hold personal data), retention rules (how long is each column kept), residency rules (which jurisdictions can read which columns), and right-to-be-forgotten workflows.

Data quality framework integration.

  • Great Expectations / Soda Core. Schema, value-range, freshness, and referential-integrity checks at every layer transition (staging → EDW, EDW → marts).
  • Block bad data early. A failing check in staging halts the EDW merge — the EDW is never polluted by bad source rows.
  • Quality SLAs. "99.5% of customer.email values must match a valid email regex" — a check that runs every hour and pages the data team on breach.
  • DQ scores per table. Surface a freshness + completeness + uniqueness score per EDW table in the catalogue.

The PII column lifecycle.

  • Discovery. Use scanners (Privacera, BigID, OneTrust) to classify columns by sensitivity at ingestion.
  • Tagging. Apply policy tags (PII, sensitive, confidential) in the catalogue (Collibra, Alation, DataHub) and the warehouse's native tagging (Snowflake object tags, BigQuery policy tags).
  • Masking. Dynamic masking based on the consuming role — analysts see email = 'a***@gmail.com'; auditors see the clear value with an access log entry.
  • Retention. Apply per-column retention rules — customer.ssn is purged 7 years after account closure; customer.email is purged 90 days after right-to-be-forgotten.
  • Forgotten subjects. A GDPR right-to-be-forgotten request flips a is_forgotten flag on the EDW customer row; downstream marts re-derive and the masked view replaces clear values.

SCD strategy per attribute (the design table you should always have).

Attribute SCD type Why
customer.full_name Type 2 KYC change history
customer.email Type 2 Audit trail for marketing consent
customer.segment Type 2 Required for time-correct revenue attribution
customer.last_login_at Type 1 High churn, no audit value
customer.date_of_birth Type 0 Immutable in nature
account.status Type 2 Required for active-account counts at point in time
account.balance Snapshot Daily snapshot in EDW, not Type-2
transaction.* Immutable Transactions don't change

Compliance regimes the EDW must support.

  • SOX. Reproducible financial reports — point-in-time queries must return identical results 7 years later. Atomic grain + SCD-2 satisfies this.
  • GDPR. Right-to-be-forgotten + data minimisation + lawful basis tracking. Per-column masking + retention rules.
  • HIPAA. PHI tagging + access logging + role-based masking. Same pattern as GDPR with stricter audit.
  • Basel / IFRS. Reproducible risk calculations — point-in-time portfolio queries. SCD-2 history is the answer.

The governance metadata layer (often the missing piece).

  • Column dictionary. Every EDW column has a row in ops.column_catalog with description, business owner, PII tag, retention rule, SCD type, and source-of-record. The catalog is the source of truth for governance — masking policies, data-quality checks, and lineage tools all read from it.
  • Role-to-policy mapping. A small ops.role_policy table maps each Snowflake / BigQuery role to the masking policies it inherits. Adding a new role is a metadata change, not a DDL change.
  • Quality-of-service tags. Tables get a freshness SLA (stale_after_minutes), a completeness SLA (min_rows_per_day), and an availability SLA (alert_on_query_failure) — surfaced in the catalog.
  • Data contracts. Each EDW table publishes a schema contract; source teams sign off when their column adds change the contract. Breaking changes require a versioned migration plan.

The lineage graph — what to record and where.

  • Source-to-staging. Source system name + extract timestamp + load id + row hash on every staging row. The lineage starts here.
  • Staging-to-EDW. Each EDW merge logs the staging table read, the rows merged, the rows closed (SCD-2), and the runtime. Stored in ops.merge_log with a retention of the regulatory horizon.
  • EDW-to-marts. Each mart-refresh job logs which EDW tables it read, the row counts, and the resulting mart row counts. Used to debug "the mart is missing 4,000 customers" complaints.
  • Marts-to-consumption. BI tools (Tableau, Looker) attach metadata to every dashboard pointing at the source mart. Reverse-ETL tools (Census, Hightouch) log the mart-row → SaaS-record mapping.
  • Stitch it. OpenLineage is the standard for emitting these events. DataHub, Marquez, or Atlan are the dashboards.

Worked example — implement column-level PII masking with role-based unmasking and a right-to-be-forgotten workflow

Detailed explanation. A bank EDW holds customer.email, customer.ssn, account.balance. Marketing analysts need email-domain analytics but never clear emails. Auditors need clear values with access logged. A GDPR right-to-be-forgotten request must zero the email everywhere downstream without breaking referential integrity.

Question. Implement Snowflake-style column-level masking + role-based unmasking + a right-to-be-forgotten flag for customer.email. Show how analyst, auditor, and marketing roles see the column differently. Show the workflow for a right-to-be-forgotten request.

Code (Snowflake-style masking + RTBF flag).

-- ① Add the RTBF flag to the EDW
ALTER TABLE edw.customer ADD COLUMN is_forgotten BOOLEAN DEFAULT FALSE;

-- ② Define the masking policy
CREATE OR REPLACE MASKING POLICY edw.email_mask AS (val STRING) RETURNS STRING ->
  CASE
    -- Auditor sees clear value (access is logged at the role level)
    WHEN CURRENT_ROLE() IN ('AUDITOR', 'COMPLIANCE_OFFICER') THEN val
    -- Marketing sees only the domain
    WHEN CURRENT_ROLE() = 'MARKETING_ANALYST'                THEN
      CONCAT('***@', SPLIT_PART(val, '@', 2))
    -- Default — everyone else gets a generic mask
    ELSE                                                          '***@***.***'
  END;

-- ③ Apply the policy to the column
ALTER TABLE edw.customer
  MODIFY COLUMN email
  SET MASKING POLICY edw.email_mask;

-- ④ The right-to-be-forgotten workflow
-- Step 1 — mark the customer
UPDATE edw.customer
SET is_forgotten = TRUE
WHERE customer_id = 'C042';

-- Step 2 — replace the PII in the current and historical rows
UPDATE edw.customer
SET email = CONCAT('forgotten-', customer_key, '@redacted.example')
WHERE customer_id = 'C042';

-- Step 3 — keep the row (referential integrity, account ownership) but PII is gone
SELECT customer_id, email, is_forgotten
FROM edw.customer
WHERE customer_id = 'C042'
ORDER BY valid_from;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Masking is applied at the column level. Snowflake / BigQuery evaluate the policy at query time based on CURRENT_ROLE(). The same row returns different values to different readers — no copy of the data is made.
  2. Three roles, three views of the column. Auditors get the clear value (and their query is logged); marketing gets the domain only (***@gmail.com for cohort analytics); everyone else gets a generic mask.
  3. RTBF doesn't delete the row. A customer row in edw.customer is the parent of account and transaction FKs. Deleting it would break referential integrity for years of immutable transactions. Instead, the row stays; the PII fields are replaced with deterministic redacted values.
  4. is_forgotten flag drives downstream behaviour. Marts can choose to drop forgotten customers from any new dashboards; ML training pipelines can filter them out; the EDW remains structurally intact.
  5. Audit trail preserved. Every UPDATE above is itself an EDW change — the new redacted email gets valid_from = now(), the previous clear email gets valid_to = now(). A regulator who inspects the audit log sees that on date X the customer was forgotten — and the previous value is no longer in the row.

Output.

After applying RTBF for customer C042, queries return:

For an analyst:

customer_id email is_forgotten
C042 @.*** TRUE

For a marketing analyst:

customer_id email is_forgotten
C042 ***@redacted.example TRUE

For an auditor:

customer_id email is_forgotten
C042 forgotten-sha256...@redacted.example TRUE

The auditor sees the redacted clear value (with their query logged); the previous real email is purged from the EDW row's history; the customer's downstream account and transaction rows are intact.

Rule of thumb. RTBF replaces PII, doesn't delete rows. Surrogate keys + foreign keys + immutable transactions are too valuable to lose. Mask, mark, retain the structure.

EDW governance interview question

A senior probe usually sounds like: "A customer files a GDPR right-to-be-forgotten request. They have 7 years of transaction history in your EDW. What do you do?" The interviewer is testing whether the candidate breaks the customer's row, breaks referential integrity, or does the mature thing — replace the PII, keep the row, mark the flag, and let downstream consumers respect it.

Solution Using column masking + RTBF flag + PII replacement (referential integrity preserved)

-- The end-to-end RTBF stored procedure
CREATE OR REPLACE PROCEDURE edw.forget_customer(p_customer_id STRING)
RETURNS STRING
LANGUAGE SQL
AS $$
BEGIN
  -- ① Mark the flag
  UPDATE edw.customer
  SET is_forgotten = TRUE
  WHERE customer_id = :p_customer_id;

  -- ② Replace PII columns deterministically
  UPDATE edw.customer
  SET full_name     = 'FORGOTTEN',
      email         = CONCAT('forgotten-', customer_key, '@redacted.example'),
      date_of_birth = NULL,
      phone_number  = NULL
  WHERE customer_id = :p_customer_id;

  -- ③ Trigger downstream mart rebuilds (orchestrator picks this up)
  INSERT INTO ops.mart_rebuild_queue (entity, reason, requested_at)
  VALUES ('dim_customer', 'rtbf:' || :p_customer_id, CURRENT_TIMESTAMP);

  -- ④ Log the action to the immutable audit trail
  INSERT INTO ops.governance_audit (action, target, actor, occurred_at)
  VALUES ('rtbf', :p_customer_id, CURRENT_USER(), CURRENT_TIMESTAMP);

  RETURN 'OK';
END;
$$;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Operation Effect
1 Set is_forgotten = TRUE Marks customer in EDW; downstream marts see the flag and re-derive
2 Replace PII columns full_name, email, DOB, phone → redacted; referential FKs to account / transaction preserved
3 Queue mart rebuild Orchestrator detects new entry, rebuilds dim_customer overnight
4 Append audit log Immutable record for regulators: who requested, when, by whom
5 Notify consumers Reverse-ETL pipelines, ML feature stores, BI cubes refresh from new dim_customer

Output:

Layer Before RTBF After RTBF
edw.customer.email anya@gmail.com forgotten-sha256...@redacted.example
edw.account (parent FK) intact intact
edw.transaction (parent FK) intact intact
marts.dim_customer.email anya@gmail.com forgotten-...@redacted.example
ops.governance_audit (no entry) one immutable RTBF row

The 7 years of transactions remain — they're required for SOX, Basel, and AML compliance — but the personal identifying data is gone. The structural integrity that makes the EDW useful is preserved.

Why this works — concept by concept:

  • Mark, don't deleteis_forgotten = TRUE is the signal; downstream consumers filter or re-derive. Deletion would break months of referential joins.
  • PII replacement is deterministicforgotten-{customer_key}@redacted.example is unique per customer, so any join on email still uniquely resolves; just to nothing meaningful.
  • Mart rebuild queue — the EDW is the source of truth; mart changes follow asynchronously. The RTBF promise to the customer is met when the EDW is updated, not when the dashboards finish refreshing.
  • Immutable audit logops.governance_audit is append-only with broker-level retention. Regulators inspect this when they audit your RTBF workflow.
  • Masking policy as a safety net — even before RTBF, the column was masked to analysts. RTBF tightens the mask further — even the unmasked clear value becomes the redacted string.
  • Cost — RTBF execution = O(rows_for_customer) — typically dozens, never thousands; mart rebuild = O(dim_customer) once; audit log size = O(rtbf_events) (negligible).

SQL
Topic — slowly-changing dimensions
Governance + SCD-2 problems at EDW grain

Practice →


EDW vs Kimball decision matrix (cheat sheet)

The decision between Inmon EDW, Kimball-only marts, and the hybrid (most common) collapses to a handful of factors. Print this matrix and pin it next to your design notebook.

  • 5+ source systems with overlap? → Inmon EDW at the core. The integration contract pays off the moment you have to reconcile two definitions of customer.
  • Single source, analytics-only, no audit? → Kimball-only. Build the star schema directly off the source; you don't need the EDW middle layer.
  • Any regulatory load (SOX, GDPR, HIPAA, Basel, IFRS)? → Inmon EDW with SCD-2 on every dimension. Audit trails and point-in-time queries are non-negotiable.
  • Multi-team analytics (3+ subject areas)? → Inmon EDW + conformed dimensions + Kimball marts. The conformed dim contract is the only way to keep team-level analytics from drifting.
  • Decade-long horizon? → Inmon EDW. The platform outlives every BI tool, every cloud vendor, every CDO. The EDW is the durable asset.
  • Fast time-to-insight, 3-month deadline? → Kimball-only mart for the first cut. Plan to retrofit an EDW only if the scope grows.
  • Lakehouse substrate (Delta, Iceberg, Hudi)? → The model is unchanged. Inmon EDW + Kimball marts works fine on a lakehouse — the storage is just cheaper.
  • dbt + modern stack? → dbt's staging → intermediate → marts is the dbt-flavoured restatement of CIF layers. The Inmon model maps cleanly onto it.
  • Right-to-be-forgotten or PII-heavy domain? → Inmon EDW + governance overlay (column masking + RTBF flag + retention rules). Bolting governance onto a pure Kimball stack is much harder than baking it into the EDW from day one.

Default choice for 2026 enterprises: Inmon 3NF EDW at the core + Kimball star marts at the serving layer + conformed dimensions as the bridge + a governance overlay across all of it. That is what most modern banks, insurers, hospitals, and large retailers run — and what most senior enterprise data warehouse interview rounds expect you to defend.

Frequently asked questions

Inmon vs Kimball — which one should I use?

Inmon and Kimball solve different problems and the modern answer is usually "both, layered." Inmon's 3NF enterprise data warehouse sits at the integration core — atomic grain, full SCD-2 history, surrogate keys, one source of truth per attribute. Kimball's dimensional marts sit downstream as the serving layer — fast joins, friendly column names, easy to rebuild. Pick Inmon-only if you're a regulated multi-source shop with a decade-long horizon; pick Kimball-only if you have one source and a three-month deadline; pick the hybrid for everything in between, which is most enterprises.

Is 3NF still relevant in the lakehouse era?

Yes. Lakehouse engines (Delta, Iceberg, Hudi) changed the storage and the engine, but the modelling problems didn't change. A multi-source bank still needs to reconcile customer once, still needs SCD-2 history for SOX, still needs surrogate keys for durable identity. 3nf data warehouse modelling is the right shape for the EDW layer regardless of whether it lives on Snowflake, BigQuery, Databricks, or a Postgres data warehouse. What lakehouse changed is the compute economics — the same model is now affordable on cheap object storage with separation of compute and storage.

Do I need a Corporate Information Factory in 2026?

If you have five or more source systems with overlapping entities, any regulatory load, or multiple analytics teams, yes. The corporate information factory is the canonical Inmon reference architecture and the five layers (sources, staging, 3NF EDW, dependent marts, consumption) each earn their keep. If you're a single-source SaaS startup with an analyst and a Postgres database, you don't — go straight to a dbt-style staging-to-marts model and skip the EDW middle layer until you outgrow it.

What is the staging layer for, and why can't I skip it?

Staging is the schema-on-read landing zone where source data arrives unchanged, with audit metadata (source_system, load_id, loaded_at) bolted on. It earns its keep for three reasons. First, it's the replay buffer — if you change EDW logic you replay from staging without re-extracting from operational systems. Second, it absorbs source-schema drift — column adds and renames are handled here, not in the EDW. Third, it's where data-quality checks live — bad rows are caught in staging and never pollute the EDW. Skipping it forces all three responsibilities onto the EDW merge logic, which then becomes brittle and expensive to debug.

How do I migrate from a Kimball-only warehouse to an Inmon EDW?

Start by identifying the conformed dimensions across your existing marts and reverse-engineer their EDW source. Build the 3NF EDW for the highest-overlap entities first (customer, account, product) — these are the ones causing the most disagreements. Once the EDW versions of these entities exist with full SCD-2 history, point your existing mart-building jobs at the EDW instead of the sources. The marts don't change shape; only their upstream changes. The migration is iterative — entity by entity over six to twelve months — not a big-bang switch. The Kimball marts stay; the EDW slides in behind them as the new source of truth.

Conformed dimensions — what are they really and why do they matter?

A conformed dimension is one definition of a business entity (customer, product, date, branch) that every mart shares. dim_customer is built from edw.customer exactly once per refresh cycle and reused by sales_mart, finance_mart, risk_mart, and marketing_mart. It matters because without it every mart team builds their own dim_customer and within six months no two marts agree on what a customer is — the sales team's retail-customer count differs from finance's by 4 thousand and no one can explain why. With conformed dims, this drift is structurally impossible. The Inmon EDW gives you exactly one place where the customer definition lives; conformed dimensions are how that definition is shared across the serving layer.

Practice on PipeCode

Pipecode.ai is Leetcode for Data Engineering — every Inmon EDW concept above ships with hands-on practice rooms where you model real entities, write real SCD-2 MERGEs, and design real conformed dimensions. Start with the data modelling library and work outward; PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine.

Practice data modelling now →
Slowly-changing dimension drills →

Top comments (0)