The enterprise data warehouse is the integration contract at the centre of every regulated, multi-source data platform — and four decades after Bill Inmon coined the term it still ships in 2026, because the problems it was designed to solve never went away. The inmon data warehouse is the trusted single-version-of-truth that sits between dozens of operational systems and dozens of downstream consumers, and the corporate information factory is the layered architecture that wraps it. When a senior data engineer is asked to design a platform for a bank, a hospital, an insurer, or any business that has to answer auditors and regulators, the 3nf data warehouse modelled the Inmon way is still the default answer.
This guide walks through the five things every senior data warehouse architecture discussion turns on — why Inmon's EDW model still ships, how the Corporate Information Factory layers data from sources to consumption, how to model the EDW in third normal form with SCD-2 history, how conformed dimensions bridge the Inmon core to Kimball-style marts, and how governance keeps the whole stack audit-ready. Each section pairs a teaching block with a Solution-Tail design answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works. By the end you'll be able to defend a 3NF EDW with surrogate keys, full SCD-2 history, conformed dim_customer, and a governance overlay that passes SOX, GDPR, and HIPAA — the exact shape edw design rounds reward when the interviewer asks "how would you build this from scratch?"
When you want hands-on reps immediately after reading, drill the data modelling library →, rehearse slowly-changing dimension problems →, and warm up on ETL patterns →.
On this page
- Why Inmon's EDW model still ships in 2026
- The Corporate Information Factory — sources to consumption
- 3NF modelling for the EDW — entities, relationships, history
- Conformed dimensions and dependent marts (Inmon to Kimball bridge)
- Governance, audit, SCDs at EDW grain
- EDW vs Kimball decision matrix (cheat sheet)
- Frequently asked questions
- Practice on PipeCode
1. Why Inmon's EDW model still ships in 2026
The EDW is the one-time integration contract — every other layer is derived
The one-sentence invariant: an Inmon enterprise data warehouse is a third-normal-form, history-aware, atomic-grain store that integrates every source system once, and from which every downstream mart, BI cube, ML feature, and reverse-ETL feed is rebuilt as needed. Once you internalise that mental model — the EDW as the durable middle layer, the marts as the disposable serving layer — every enterprise data warehouse trade-off becomes a sequence of decisions about where data lives, who owns it, and how long it has to be preserved.
Inmon vs Kimball in five bullets.
- Direction. Inmon is top-down — design the corporate-wide 3NF EDW first, then derive subject-area marts from it. Kimball is bottom-up — design dimensional marts directly off the sources, then conform them later.
- Modelling style. Inmon uses third normal form at the EDW layer (entities separated from change). Kimball uses dimensional modelling (fact tables + dimensions) at every layer.
- Time horizon. Inmon is built for a decade — atomic history, full audit, slow to design, slow to change. Kimball is built for a quarter — fast time-to-insight, easier for analysts to query.
-
Audit posture. Inmon's atomic grain + SCD-2 on every column gives forensic auditability — "show me what
customer_segmentwas for customer 42 on 2024-03-15." Kimball can do this only if the mart was designed with that history. - Disposability. Marts in the Inmon model are derived and disposable — drop and rebuild from the EDW overnight. The Kimball-only world has no such safety net.
When the Inmon model wins.
- Regulated industries. Banks, insurers, hospitals, utilities, telcos — anywhere audit trails, retention policies, and "what was the state on 2024-03-15?" are first-class requirements.
- Multi-source integration. Five-plus operational systems that overlap on the same business entities (customer, product, account) — the EDW is where you reconcile them once.
- Multi-team analytics. Sales, finance, risk, marketing each build their own marts off a single conformed EDW — no two teams disagree about who the customer was.
- Decade-long horizon. You expect the platform to outlive every BI tool, every cloud vendor switch, and every CDO; the EDW is the durable asset.
When Kimball wins alone.
- Single source, analytics-only. One application, one team, one set of dashboards — skip the EDW middle and go straight to a star schema.
- Fast time-to-insight. Three-month delivery target with no audit requirements — Kimball ships marts faster.
- Cloud lakehouse + dbt + BI. Modern startup stacks often skip Inmon entirely; that is a deliberate trade-off, not an oversight.
The 2026 reality — most enterprises run a hybrid.
- Inmon at the core. The 3NF EDW remains the integration layer — Snowflake, BigQuery, Databricks, Redshift all host Inmon-style EDWs at scale today.
- Kimball at the serving layer. Star-schema marts sit downstream of the EDW for BI and dashboarding — fast joins, fast aggregates, friendly column names.
- Lakehouse as substrate. Delta / Iceberg / Hudi tables now back both the staging and EDW layers. The model is unchanged; the storage and engine are.
-
dbt as the modelling tool. Most modern enterprises express the EDW + marts as dbt models —
staging → intermediate → martsis the dbt-flavoured restatement of CIF layers.
What interviewers listen for.
- Do you say 3NF and atomic grain for the EDW layer? — senior signal.
- Do you mention conformed dimensions when the discussion crosses subject areas? — required answer.
- Do you reach for SCD Type-2 on every dimension attribute at the EDW grain? — Inmon-purist signal.
- Do you separate EDW from marts as durable vs disposable? — architectural maturity signal.
Worked example — pick a model for a multi-source banking platform
Detailed explanation. Realistic EDW design starts from the business context — number of source systems, regulatory requirements, downstream consumer count, and retention horizon. A retail bank with seven core systems, SOX + GDPR + Basel obligations, three analytics teams, and a 10-year retention policy is a textbook Inmon case.
Question. A retail bank is replacing its 1990s warehouse. It has 7 source systems (core banking, cards, mortgages, loans, CRM, fraud, marketing), needs SOX + GDPR + Basel audit trails, supports 4 BI teams (finance, risk, marketing, ops), and has a 10-year retention requirement. Which warehouse model should it pick, and why?
Code (the decision logic as a checklist).
def pick_warehouse_model(spec: dict) -> str:
# spec is the business + regulatory context
if spec["source_systems"] >= 5 \
and spec["regulated"] \
and spec["analytics_teams"] >= 2 \
and spec["retention_years"] >= 5:
return "inmon_edw_with_kimball_marts" # the hybrid
if spec["source_systems"] == 1 \
and not spec["regulated"] \
and spec["time_to_insight_weeks"] <= 12:
return "kimball_only"
return "inmon_edw_with_kimball_marts" # default — durable + serving
spec = {
"source_systems": 7,
"regulated": True, # SOX, GDPR, Basel
"analytics_teams": 4,
"retention_years": 10,
"time_to_insight_weeks": 26,
}
pick_warehouse_model(spec)
Step-by-step explanation.
-
Count source systems. 7 is well above the 5-source threshold where reconciling entity definitions (who is a
customer?) becomes a full-time job — the EDW is the place that work happens once. - Check regulatory load. SOX (financial controls), GDPR (right-to-be-forgotten, PII residency), Basel (capital and risk reporting) all demand reproducible history at the row level — Inmon's SCD-2 grain delivers this; pure Kimball marts struggle.
-
Count analytics teams. Four downstream teams means four overlapping mart designs — without a conformed EDW each team reinvents
dim_customerand disagreements multiply. - Retention horizon. A decade outlives every BI tool refresh and every cloud-vendor switch — the durable layer is the EDW; the marts can be rebuilt.
- Output the choice. The hybrid: Inmon 3NF EDW at the core, Kimball star marts downstream for BI consumption.
Output.
| Decision factor | Value | Implication |
|---|---|---|
| Source systems | 7 | Need the EDW as integration layer |
| Regulated | Yes (SOX/GDPR/Basel) | Full SCD-2 history required |
| Analytics teams | 4 | Conformed dims mandatory |
| Retention | 10 years | EDW as the durable asset |
| Recommendation | Inmon EDW + Kimball marts | The 2026 default |
Rule of thumb. Five-plus sources + any regulatory load = Inmon EDW at the core. Single source + no regulators + fast deadline = Kimball-only. Everywhere in between, pick the hybrid.
Enterprise data warehouse design interview question
A senior interviewer often frames this as: "Walk me through how you'd decide between an Inmon EDW and a Kimball-only data mart for a multi-source bank." The candidate must articulate the integration contract, the audit posture, and the disposability of the serving layer — three pillars of mature data warehouse architecture thinking.
Solution Using the Inmon EDW + Kimball mart hybrid
# Architecture sketch (top-down)
Layer 0 Sources core_banking | cards | mortgages | loans | crm | fraud | marketing
Layer 1 Staging raw landing — schema-on-read, audit columns, no transformation
Layer 2 3NF EDW customer | account | transaction | branch | product (atomic, SCD-2)
Layer 3 Conformed dim dim_customer extracted once, reused everywhere
Layer 4 Data marts sales_mart | finance_mart | risk_mart | marketing_mart (Kimball stars)
Layer 5 Consumption BI | ML | regulatory reports | reverse ETL
Step-by-step trace.
| Step | Layer | What happens | Why this layer |
|---|---|---|---|
| 1 | Sources | 7 OLTP systems push to Kafka / Debezium CDC | Don't disturb operational stores |
| 2 | Staging | Raw landing in object storage / lakehouse, no transforms | Replay buffer — if EDW logic changes you replay from here |
| 3 | EDW | 3NF model with surrogate keys + SCD-2 + audit columns | The integration contract — every source reconciled here once |
| 4 | Conformed dim |
dim_customer derived from customer EDW table nightly |
One customer definition, shared across all marts |
| 5 | Marts |
fact_sales, fact_revenue, fact_loan_risk joined to conformed dims |
Fast BI queries, friendly column names, can be rebuilt |
| 6 | Consumption | Looker / Tableau / SageMaker / reverse ETL to Salesforce | The analytics surface — outlives by being thin |
Output:
| Quality | Inmon EDW value | Kimball mart value |
|---|---|---|
| Audit trail | Full SCD-2 per attribute | Inherited from EDW |
| Query speed | Slower (many joins) | Fast (pre-joined star) |
| Time-to-insight | Months for first mart | Weeks per new mart |
| Disposability | Durable — never drop | Disposable — rebuild overnight |
| Regulatory fit | SOX/GDPR/Basel ready | Inherits readiness from EDW |
| Schema change cost | High — EDW is contract | Low — drop the mart, rebuild |
Why this works — concept by concept:
-
Single integration contract — the EDW is the only place where
customeris reconciled across 7 sources. Every mart, every BI tool, every ML model reads from it. No team builds their own definition. - Atomic grain — the EDW stores the finest-grain facts (one row per transaction, one row per balance snapshot). Marts can always aggregate up; you can never disaggregate after the fact.
-
SCD-2 on every dimension — every change to
customer_segment,branch_id,risk_bandis preserved withvalid_from/valid_to. Auditors can ask "what was the value on 2024-03-15?" and the EDW answers. - Disposable serving layer — marts are derived views of the EDW. Drop one, rebuild it overnight; the EDW is untouched. This is the operational superpower of the hybrid model.
- Cost — EDW build time = O(months) up front; mart build time = O(days) per subject area; storage = ~3× source size for SCD-2 history + RF on the lakehouse; query latency = O(joins) for ad-hoc EDW queries, O(1) for mart queries.
SQL
Topic — data modelling
Data modelling design problems (EDW, 3NF, dimensional)
2. The Corporate Information Factory — sources to consumption
The CIF is five layers — sources, staging, 3NF EDW, dependent marts, consumption — connected by a governance ribbon
The mental model in one line: the Corporate Information Factory is the canonical Inmon reference architecture — raw operational data enters from the left, is integrated once in the 3NF EDW in the middle, then is fanned out to derived marts and consumption tools on the right, with a governance overlay that spans every layer. Once you can name the five layers and the role each one plays, half of the corporate information factory interview question family collapses to obvious answers.
The five CIF layers.
- Layer 0 — Sources. Every OLTP database, SaaS application, mainframe extract, and event stream the business produces. The EDW reads from these but never writes back.
-
Layer 1 — Staging. A schema-on-read raw landing zone. Records arrive unchanged with audit metadata (
source_system,load_id,loaded_at). No business logic; no joins; no transformation. Storage = cheap object store. - Layer 2 — 3NF EDW. The integrated, atomic, history-aware store. Entities are separated; SCD-2 is applied to every changing attribute; surrogate keys give durable identity; business keys give traceability. This is the contract.
- Layer 3 — Dependent data marts. Kimball star schemas derived from the EDW. Each mart is owned by a subject area (sales, finance, risk). Marts share conformed dimensions sourced from the EDW.
- Layer 4 — Consumption. BI tools (Tableau, Looker, Power BI), ML pipelines (Spark, SageMaker), reverse-ETL (Census, Hightouch). All read from marts; none read from staging.
Why the EDW middle layer is non-negotiable.
- One reconciliation, many readers. A customer's name appears in 5 sources — without the EDW, every mart reconciles it independently and disagreements proliferate. With the EDW, the reconciliation happens once and every reader inherits the answer.
- Replay-safe. If business rules change, you can rebuild the EDW from staging without re-extracting from sources. The staging layer is the time-travel buffer.
- Stable contract. Marts and consumers depend on EDW column names. Sources change shape constantly; the EDW absorbs that change so downstream readers never have to.
- Audit-ready. A single SCD-2-enforced EDW makes "what was the value on date X?" a one-query answer. No mart-by-mart history reconstruction.
The staging layer — what it is and isn't.
- It is a faithful, append-only copy of source data with audit metadata. Schema-on-read; no transforms.
- It is the replay buffer — if you rebuild the EDW from scratch you start here, not from the sources.
- It is where data-quality checks live before data hits the EDW (rejection rules, schema-evolution gates).
- It is not a query layer for BI — staging tables have raw column names, no joins, no history.
- It is not the system of record — sources are. Staging is a snapshot.
Refresh patterns.
- CDC-driven (preferred). Debezium / Fivetran / native log mining streams source row changes into staging; staging triggers EDW merges; EDW changes propagate to marts.
- Batch-driven (legacy). Nightly full extracts land in staging; EDW MERGE runs on the new snapshot; marts rebuild from EDW.
- Hybrid. High-volume transactional sources stream via CDC; reference data (product catalogue, geo) lands as nightly batch.
Where the lakehouse fits.
- Storage substrate. Delta Lake, Apache Iceberg, and Apache Hudi all back the staging and EDW layers in modern stacks. The 3NF model is unchanged; only the file format and the engine are.
- Time-travel for free. Iceberg and Delta give you snapshot isolation, transactional MERGEs, and time-travel queries built into the engine — no need to reinvent SCD-2 for "what did this table look like yesterday?" though you still want SCD-2 at the row level for the audit-trail story.
- Compute separation. The EDW data lives once in object storage; multiple engines (Snowflake, BigQuery, Databricks SQL, Trino) can read it. This is the lakehouse promise that finally makes the Inmon-vs-Kimball religious war moot — both layers fit on the same substrate.
- Catalog as governance hub. Unity Catalog, Polaris, Hive Metastore extensions all hold the column-level lineage and tagging that the governance ribbon needs. The EDW gets these for free if you adopt a modern catalog.
Where dbt fits.
-
staging → intermediate → marts. The dbt-canonical three-layer model maps cleanly onto the CIF —staging= staging,intermediate= 3NF EDW,marts= dependent marts. Most modern enterprises express the whole CIF as a dbt project. -
Tests as data-quality gates.
dbt testruns uniqueness, not-null, accepted-value, and referential-integrity checks at every layer transition. The DQ framework integrates seamlessly with Great Expectations / Soda Core. -
Documentation as catalog.
dbt docsauto-generates a column-level lineage graph and renders it as a navigable site. Cheap, fast, and good enough for many regulated shops.
Worked example — sketch the CIF for a multi-source bank
Detailed explanation. The CIF for a real bank wires together core banking (Debezium CDC), cards (Kafka events), mortgages (nightly batch), CRM (Fivetran), and marketing (Snowpipe). The staging layer is the universal landing zone; the 3NF EDW is the integration contract; the marts are the serving layer.
Question. Sketch the five-layer CIF for a retail bank with 5 source systems. Show the load mechanism per source, the staging table per source, the 3NF EDW entity each source feeds, and the consuming marts.
Input (sources).
| Source | System type | Load mechanism | Refresh cadence |
|---|---|---|---|
| core_banking | OLTP (Oracle) | Debezium CDC | continuous |
| cards | OLTP (Postgres) | Debezium CDC | continuous |
| mortgages | OLTP (DB2 mainframe) | Nightly batch | daily |
| crm | SaaS (Salesforce) | Fivetran | hourly |
| marketing | Files (S3 drops) | Snowpipe | hourly |
Code (the CIF layer map as YAML).
# cif-layer-map.yml
sources:
- name: core_banking
load: debezium_cdc
staging_table: stg.core_banking_customer
edw_entities: [customer, account, transaction]
- name: cards
load: debezium_cdc
staging_table: stg.cards_account
edw_entities: [account, transaction]
- name: mortgages
load: nightly_batch
staging_table: stg.mortgages_account
edw_entities: [account, transaction]
- name: crm
load: fivetran
staging_table: stg.crm_customer
edw_entities: [customer]
- name: marketing
load: snowpipe
staging_table: stg.marketing_campaign
edw_entities: [campaign, customer_campaign]
edw:
entities: [customer, account, transaction, branch, product, campaign]
scd2_on_all_dimensions: true
surrogate_keys: hash_sha256_business_keys
audit_columns: [valid_from, valid_to, is_current, system_of_record, load_id]
marts:
- name: sales_mart
facts: [fact_sales]
dims_from_edw: [dim_customer, dim_product, dim_date, dim_branch]
- name: finance_mart
facts: [fact_revenue, fact_cost]
dims_from_edw: [dim_customer, dim_account, dim_date]
- name: risk_mart
facts: [fact_credit_risk]
dims_from_edw: [dim_customer, dim_account, dim_date]
- name: marketing_mart
facts: [fact_campaign_response]
dims_from_edw: [dim_customer, dim_campaign, dim_date]
Step-by-step explanation.
-
Sources push, the EDW pulls (via staging). Debezium streams CDC from
core_bankingandcardsdirectly to Kafka topics, which land instg.core_banking_customeretc. — the EDW never touches an OLTP system directly. -
Staging keeps a faithful copy. Every row in staging has
source_system,load_id,loaded_at, and the original source primary key. No business rules apply yet. -
EDW merges happen by entity, not by source.
customerin the EDW is built fromstg.core_banking_customer+stg.crm_customer+ a conform-rule that picks the system-of-record per attribute (CRM owns name, core-banking owns DOB, etc.). -
SCD-2 captures change. Each EDW entity row has
valid_from,valid_to,is_currentcolumns; when a source row changes, the EDW closes the old row (valid_to = now(),is_current = false) and inserts a new one. -
Marts derive from the EDW once a night.
dim_customerin every mart is a SELECT-from-EDW view (or materialised table). Facts join to it via surrogate keys.
Output.
| Layer | Tables / objects | Storage size (illustrative) | Refresh |
|---|---|---|---|
| 0 — Sources | 5 OLTP / SaaS systems | (operational, not counted) | live |
| 1 — Staging | 18 staging tables | 12 TB (raw, with history) | per-source cadence |
| 2 — EDW (3NF) | 16 atomic entities | 8 TB (with SCD-2 + audit) | continuous merges |
| 3 — Marts (Kimball) | 4 marts × 5 dims × 1–2 facts | 4 TB (denormalised) | hourly to nightly |
| 4 — Consumption | BI dashboards + ML + reverse ETL | (read-only) | sub-minute query |
Rule of thumb. Staging is cheap; never skip it. The EDW is durable; design it carefully. Marts are disposable; design them for the consumer's convenience, not the modeller's purity.
EDW architecture interview question
A senior interviewer often frames this as: "We have 5 source systems and 4 BI teams. Sketch the layers between sources and dashboards and tell me what lives where." The candidate must reach for the five-layer CIF and explain why each layer earns its keep.
Solution Using the CIF five-layer model with a governance ribbon
-- ① Staging — schema-on-read landing zone (no transforms)
CREATE TABLE stg.core_banking_customer (
src_customer_id STRING,
full_name STRING,
date_of_birth DATE,
segment STRING,
source_system STRING, -- audit
load_id STRING, -- audit
loaded_at TIMESTAMP, -- audit
src_row_hash STRING -- dedup key
);
-- ② 3NF EDW — atomic, SCD-2, surrogate-keyed
CREATE TABLE edw.customer (
customer_key STRING, -- surrogate (hash of business key)
customer_id STRING, -- business key
full_name STRING,
date_of_birth DATE,
segment STRING,
valid_from TIMESTAMP,
valid_to TIMESTAMP,
is_current BOOLEAN,
system_of_record STRING,
load_id STRING,
PRIMARY KEY (customer_key)
);
-- ③ Conformed dimension — derived from EDW
CREATE OR REPLACE VIEW marts.dim_customer AS
SELECT
customer_key,
customer_id,
full_name,
segment,
valid_from,
valid_to,
is_current
FROM edw.customer;
-- ④ Mart fact — joins to conformed dim by surrogate key
CREATE TABLE marts.fact_sales (
sale_key STRING,
customer_key STRING, -- FK to dim_customer
product_key STRING, -- FK to dim_product
date_key INT, -- FK to dim_date
branch_key STRING, -- FK to dim_branch
amount DECIMAL(18, 2),
sold_at TIMESTAMP
);
Step-by-step trace.
| Layer | What it produces | Who reads it | When does it refresh |
|---|---|---|---|
| Staging | Raw landing per source | EDW merge jobs only | per-source cadence (continuous / hourly / nightly) |
| 3NF EDW | Atomic, SCD-2, integrated entities | Conformed dim views + auditors | continuous (CDC) or nightly (batch) |
| Conformed dims | Subject-agnostic dim_customer, dim_product | Every mart fact join | nightly rebuild from EDW |
| Marts | Star schemas — fact + dims | BI / ML / reverse-ETL | hourly to nightly |
| Consumption | Dashboards, ML features, reverse-ETL feeds | End users + downstream apps | sub-minute query latency |
Output:
| Quality | Staging | EDW | Marts | Consumption |
|---|---|---|---|---|
| Schema discipline | schema-on-read | strict 3NF | denormalised star | tool-shaped |
| Audit trail | source row hash | SCD-2 per attribute | inherited from EDW | inherited from marts |
| Query SLAs | none — load-only | minutes per ad-hoc | sub-second BI | sub-second |
| Disposability | replay-safe | durable forever | rebuild nightly | rebuild on demand |
Why this works — concept by concept:
- Staging as replay buffer — if the EDW model changes (new attribute, new entity), you replay from staging without re-extracting from sources. The bridge layer that protects production OLTP systems.
- EDW as the integration contract — every source reconciles here once. Every downstream reader inherits the same answer to "who is customer 42?" — disagreements between marts disappear.
-
Conformed dimensions —
dim_customeris derived fromedw.customerexactly once and shared by every mart. No two marts can have a differentdim_customer. -
Disposable marts —
fact_salesis rebuilt nightly from EDW + conformed dims. Drop it, rebuild it; the EDW is untouched. This is the operational superpower. - Cost — staging storage = ~1× source (with retention); EDW storage = ~3× source (SCD-2 history + RF); mart storage = additive per subject area; compute = O(merge_volume) on EDW + O(join_breadth) on marts.
SQL
Topic — dimensional modelling
Dimensional modelling problems (EDW, marts, conformed dims)
3. 3NF modelling for the EDW — entities, relationships, history
Third normal form separates entities from change — the durable shape of an Inmon EDW
The mental model in one line: the 3NF EDW captures every business entity once, every relationship once, every change forever — with surrogate keys for durable identity, business keys for traceability, and SCD-2 columns for time-travel. Once you can recite "atomic, integrated, history-aware" and back it with column names, the 3nf data warehouse interview question family becomes muscle memory.
The four 3NF design rules every Inmon EDW follows.
-
One entity, one table.
customer,account,transaction,branch,producteach live in their own atomic table. Don't denormalise at the EDW grain — that's the mart's job. -
One source of truth per attribute. When two sources disagree on
full_name, pick a system-of-record per attribute (e.g. CRM wins for name, core-banking wins for DOB) and document it. -
Atomic grain.
transactionstores one row per transaction event. Never store an aggregate (daily totals, monthly snapshots) at the EDW layer — marts compute those. -
History per row. Every entity carries
valid_from,valid_to,is_currentso you can query "what was this entity's state at time T?".
Surrogate keys vs business keys.
-
Surrogate key (
customer_key). Internally generated, immutable, opaque. Typically a hash of the business key (SHA-256) or a sequence integer. Joins use surrogate keys exclusively. -
Business key (
customer_id). Sourced from the operational system — meaningful to the business, but mutable (system reissues, re-platforming). Never join on this. - Why both? The surrogate gives durable identity (joins survive source-system migrations). The business key gives traceability (back to the operational record).
Audit columns every EDW entity carries.
-
valid_from,valid_to,is_current— SCD-2 history.valid_to = '9999-12-31'on the current row. -
system_of_record— which source provided this version of the row. -
load_id— the batch / streaming offset that wrote this version. Pins lineage to the load. -
created_at,updated_at— when the EDW row was inserted / closed. Distinct from the source's own timestamps. -
src_row_hash— hash of all source columns, used to detect "is this an actual change or a no-op refresh?".
The four SCD types — and which to use at EDW grain.
-
Type 0 — never updated (e.g.
date_of_birthfor a real person). Use sparingly. - Type 1 — overwrite in place. Loses history. Avoid at EDW grain.
-
Type 2 — close the old row, insert a new row with new
valid_from. The Inmon default. -
Type 3 — add a
previous_valuecolumn. Keeps only one prior version. Niche. - Type 6 — hybrid (Type 1 + Type 2 + Type 3 columns on the same row). Used when downstream consumers want both "current" and "historical" in one view.
The atomic grain rule (Inmon's most-violated principle).
- Transaction tables store one row per event — never one row per day-summary. Daily summaries are mart concerns.
- Snapshot tables (e.g. monthly account balance) live in marts, not the EDW.
- Bridge tables (many-to-many between accounts and parties) live in the EDW because they're entity-level, not aggregate-level.
Worked example — model a banking customer, account, transaction graph in 3NF
Detailed explanation. A retail bank wants to integrate customer, account, branch, and transaction data from three sources (core banking, cards, mortgages) into a single EDW. The 3NF approach normalises each entity to its own table, applies SCD-2 to changing attributes, and uses surrogate keys for joins.
Question. Design a 3NF EDW for a retail bank's customer + account + transaction + branch graph. Show the DDL with surrogate keys, business keys, SCD-2 columns, and audit columns. Include a sample MERGE that closes a Type-2 row when a customer's segment changes.
Input (sample source data).
| customer_id | full_name | segment | branch_id | source | extract_ts |
|---|---|---|---|---|---|
| C042 | Anya Patel | retail | B12 | core_banking | 2024-01-10 09:00 |
| C042 | Anya Patel | retail | B12 | core_banking | 2024-03-15 09:00 |
| C042 | Anya Patel | private | B12 | core_banking | 2024-03-15 10:30 |
Code (3NF DDL + SCD-2 MERGE).
-- The 4 atomic entities
CREATE TABLE edw.customer (
customer_key STRING NOT NULL, -- surrogate (SHA-256 of customer_id)
customer_id STRING NOT NULL, -- business key
full_name STRING,
date_of_birth DATE,
segment STRING,
valid_from TIMESTAMP NOT NULL,
valid_to TIMESTAMP NOT NULL, -- '9999-12-31' if current
is_current BOOLEAN NOT NULL,
system_of_record STRING NOT NULL,
load_id STRING NOT NULL,
src_row_hash STRING NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (customer_key, valid_from)
);
CREATE TABLE edw.branch (
branch_key STRING NOT NULL,
branch_id STRING NOT NULL,
city STRING,
region STRING,
opened_at DATE,
valid_from TIMESTAMP NOT NULL,
valid_to TIMESTAMP NOT NULL,
is_current BOOLEAN NOT NULL,
system_of_record STRING NOT NULL,
load_id STRING NOT NULL,
PRIMARY KEY (branch_key, valid_from)
);
CREATE TABLE edw.account (
account_key STRING NOT NULL,
account_id STRING NOT NULL,
customer_key STRING NOT NULL, -- FK to edw.customer
branch_key STRING NOT NULL, -- FK to edw.branch
product_code STRING,
opened_at DATE,
status STRING,
valid_from TIMESTAMP NOT NULL,
valid_to TIMESTAMP NOT NULL,
is_current BOOLEAN NOT NULL,
system_of_record STRING NOT NULL,
load_id STRING NOT NULL,
PRIMARY KEY (account_key, valid_from)
);
CREATE TABLE edw.transaction (
txn_key STRING NOT NULL,
txn_id STRING NOT NULL,
account_key STRING NOT NULL, -- FK to edw.account
amount DECIMAL(18, 2),
currency STRING,
txn_at TIMESTAMP NOT NULL,
posted_at TIMESTAMP,
status STRING,
load_id STRING NOT NULL,
PRIMARY KEY (txn_key) -- transactions are immutable
);
-- SCD-2 MERGE: close old row, insert new row when segment changes
MERGE INTO edw.customer AS tgt
USING (
SELECT
SHA256(s.customer_id) AS customer_key,
s.customer_id,
s.full_name,
s.segment,
s.source AS system_of_record,
s.extract_ts AS valid_from,
s.load_id,
SHA256(CONCAT(s.full_name, '|', s.segment)) AS src_row_hash
FROM stg.core_banking_customer AS s
) AS src
ON tgt.customer_key = src.customer_key
AND tgt.is_current = TRUE
AND tgt.src_row_hash <> src.src_row_hash -- only act on real changes
WHEN MATCHED THEN UPDATE SET
valid_to = src.valid_from,
is_current = FALSE;
INSERT INTO edw.customer (
customer_key, customer_id, full_name, segment,
valid_from, valid_to, is_current,
system_of_record, load_id, src_row_hash
)
SELECT
SHA256(s.customer_id),
s.customer_id,
s.full_name,
s.segment,
s.extract_ts,
TIMESTAMP '9999-12-31 00:00:00',
TRUE,
s.source,
s.load_id,
SHA256(CONCAT(s.full_name, '|', s.segment))
FROM stg.core_banking_customer AS s
LEFT JOIN edw.customer AS c
ON c.customer_key = SHA256(s.customer_id)
AND c.is_current = TRUE
AND c.src_row_hash = SHA256(CONCAT(s.full_name, '|', s.segment))
WHERE c.customer_key IS NULL;
Step-by-step explanation.
-
Hash the business key.
SHA256(customer_id)is the surrogate key — deterministic, durable, immune to source-system reissues. -
Compute the row hash.
SHA256(CONCAT(full_name, '|', segment))lets us detect "did anything actually change?" — no-op refreshes don't create new SCD-2 rows. -
Close the previous current row. The MERGE updates
valid_toandis_current = FALSEon the prior row whose hash differs from the incoming one. -
Insert the new current row. The INSERT adds a new row with
valid_from = extract_ts,valid_to = '9999-12-31',is_current = TRUE— but only if a row with that exact hash doesn't already exist (idempotent). -
Atomic grain preserved.
transactionhas no SCD-2 — transactions are immutable events. Their FKaccount_keypoints to whicheveraccountrow was current at the time of insert.
Output.
After three source pushes for C042, edw.customer contains:
| customer_key | customer_id | segment | valid_from | valid_to | is_current |
|---|---|---|---|---|---|
| sha256(C042) | C042 | retail | 2024-01-10 09:00 | 2024-03-15 10:30 | FALSE |
| sha256(C042) | C042 | private | 2024-03-15 10:30 | 9999-12-31 | TRUE |
The duplicate retail row at 2024-03-15 09:00 was idempotent — same hash, no insert. The segment change at 10:30 closed the retail row and opened the private row.
Rule of thumb. SCD-2 every dimension attribute by default. Drop down to Type-1 only when you have an explicit business reason (e.g. "we never need to know the previous value of last_login_at"). The cost of carrying history is low; the cost of losing it is unrecoverable.
Worked example — pick the system-of-record per attribute across three sources
Detailed explanation. A retail bank's customer entity is sourced from three operational systems — core banking (Oracle), CRM (Salesforce), and the digital banking platform (Postgres). Each system has its own copy of full_name, email, phone, segment, date_of_birth. Without a system-of-record (SoR) decision per attribute, the EDW would either pick arbitrarily or, worse, alternate between sources and produce a different answer every refresh.
Question. Three sources publish customer rows for C042. Each row has the same business key but different values for some attributes. Define a system-of-record rule per attribute and write the merge logic that produces a single canonical EDW row.
Input (three source rows for customer C042).
| source | full_name | phone | segment | dob | |
|---|---|---|---|---|---|
| core_banking | Patel, Anya | anya.patel@old.example | +44 7700 900123 | private | 1988-04-12 |
| crm | Anya Patel | anya@gmail.com | +44 7700 900123 | private | 1988-04-12 |
| digital | Anya Patel | anya@gmail.com | +44 7700 900999 | retail | 1988-04-12 |
Code (the SoR map + merge logic).
-- ① System-of-record per attribute — declared in metadata
CREATE TABLE ops.system_of_record (
entity STRING NOT NULL,
attribute STRING NOT NULL,
source STRING NOT NULL,
rationale STRING,
PRIMARY KEY (entity, attribute)
);
INSERT INTO ops.system_of_record VALUES
('customer', 'full_name', 'crm', 'CRM owns name (KYC workflow updates here)'),
('customer', 'email', 'crm', 'CRM owns contact methods'),
('customer', 'phone', 'digital', 'Digital banking captures latest phone'),
('customer', 'segment', 'core_banking','Core banking owns risk/regulatory segment'),
('customer', 'date_of_birth', 'core_banking','Core banking owns regulatory DOB');
-- ② Build the canonical EDW row from per-attribute SoR
WITH src_cb AS (
SELECT customer_id, full_name, email, phone, segment, date_of_birth, 'core_banking' AS source
FROM stg.core_banking_customer
),
src_crm AS (
SELECT customer_id, full_name, email, phone, segment, date_of_birth, 'crm' AS source
FROM stg.crm_customer
),
src_digital AS (
SELECT customer_id, full_name, email, phone, segment, date_of_birth, 'digital' AS source
FROM stg.digital_customer
),
canonical AS (
SELECT
COALESCE(crm.customer_id, cb.customer_id, dig.customer_id) AS customer_id,
crm.full_name AS full_name, -- SoR: crm
crm.email AS email, -- SoR: crm
dig.phone AS phone, -- SoR: digital
cb.segment AS segment, -- SoR: core_banking
cb.date_of_birth AS date_of_birth -- SoR: core_banking
FROM src_cb AS cb
FULL OUTER JOIN src_crm AS crm ON crm.customer_id = cb.customer_id
FULL OUTER JOIN src_digital AS dig ON dig.customer_id = COALESCE(crm.customer_id, cb.customer_id)
)
SELECT * FROM canonical;
Step-by-step explanation.
-
Declare SoR per attribute in metadata.
ops.system_of_recordis the single source of truth for "who owns what." Auditors read this table to understand how the EDW resolves conflicts. - Join all sources at the business-key level. A FULL OUTER JOIN keeps customers that exist in any source — useful for catching new customers from digital before core banking.
-
Pick each attribute from its declared SoR.
crm.full_name,dig.phone,cb.segment— the SELECT list is the SoR map expressed as SQL. -
Fall through when SoR is missing. If
crmdoesn't have a row for this customer, the COALESCE in the customer_id column ensures the row still flows;full_namethen falls to the next-priority source declared inops.system_of_record(a more complete implementation iterates a priority list). - Feed the canonical row into the SCD-2 MERGE. The output of this CTE feeds the MERGE we built earlier — same SCD-2 logic, now with one canonical value per attribute.
Output (canonical row for C042).
| customer_id | full_name | phone | segment | dob | |
|---|---|---|---|---|---|
| C042 | Anya Patel | anya@gmail.com | +44 7700 900999 | private | 1988-04-12 |
A single, unambiguous version of customer C042 — the EDW's integration promise made concrete.
Rule of thumb. Always declare the system-of-record per attribute in metadata, never hard-code it in SQL. The metadata makes the rule auditable, changeable, and reviewable independently of the merge logic.
EDW 3NF modelling interview question
A senior probe usually sounds like: "A customer's segment changed from retail to private on 2024-03-15. Walk me through what your EDW does — DDL, MERGE, and how you'd answer a SOX auditor asking what the segment was on 2024-03-01." The interviewer is testing whether the candidate reaches for SCD-2 reflexively and can defend the atomic grain.
Solution Using SCD-2 MERGE with surrogate keys and row-hash idempotence
-- The "what was the segment on 2024-03-01" query — answers the SOX auditor in one shot
SELECT
customer_id,
segment,
valid_from,
valid_to
FROM edw.customer
WHERE customer_id = 'C042'
AND valid_from <= TIMESTAMP '2024-03-01 00:00:00'
AND valid_to > TIMESTAMP '2024-03-01 00:00:00';
Step-by-step trace.
| Step | Operation | Effect on edw.customer
|
|---|---|---|
| 1 | Initial load 2024-01-10 | Insert row 1 (segment=retail, valid_from=2024-01-10, valid_to=9999-12-31, is_current=true) |
| 2 | No-op refresh 2024-03-15 09:00 (same hash) | MERGE skips — row_hash unchanged; row 1 untouched |
| 3 | Segment change 2024-03-15 10:30 | Row 1 closes (valid_to=2024-03-15 10:30, is_current=false); row 2 inserts (segment=private, valid_from=2024-03-15 10:30, valid_to=9999-12-31, is_current=true) |
| 4 | Auditor asks "what was the segment on 2024-03-01?" | Point-in-time query selects row 1 — segment=retail |
Output:
| customer_id | segment | valid_from | valid_to |
|---|---|---|---|
| C042 | retail | 2024-01-10 09:00:00 | 2024-03-15 10:30:00 |
A single row, returned in milliseconds. The auditor's question has a definitive, reproducible answer — and the EDW will give the same answer in 2034 if asked again.
Why this works — concept by concept:
-
Surrogate key (SHA-256 of business key) — durable, immutable, opaque. Joins survive every source-system migration. The
customer_iditself can be reissued; thecustomer_keynever changes for a givencustomer_id. -
Row hash for idempotence —
SHA256(CONCAT(full_name, '|', segment))ensures duplicate source refreshes don't create duplicate SCD-2 rows. A no-op load is truly a no-op. -
SCD Type-2 history —
valid_from/valid_to/is_currentlets you query the state at any historical timestamp with a<=/>predicate. Auditors get reproducible answers years later. -
Atomic grain —
transactionis immutable (events don't change);customer,account,branchcarry SCD-2 (descriptors do change). The grain decision is per-entity, not platform-wide. - system_of_record + load_id — every row records which source provided this version and which load wrote it. Auditors trace every row back to its origin in seconds.
- Cost — storage = ~2–4× the source size for SCD-2 + audit columns; MERGE cost = O(changed_rows) per cycle (no-op refreshes are free); query cost = O(active_rows) for "current state" and O(history_rows) for point-in-time.
SQL
Topic — slowly-changing dimensions
Slowly-changing dimension problems (SCD-2 at EDW grain)
4. Conformed dimensions and dependent marts (Inmon to Kimball bridge)
The 3NF EDW feeds Kimball star marts through conformed dimensions — one definition, many readers
The mental model in one line: dependent data marts are Kimball star schemas built on top of the Inmon EDW, sharing conformed dimensions that are extracted from the EDW exactly once and reused everywhere. Once you internalise "the mart is a derived view; the EDW is the source; the conformed dim is the contract between them" the data warehouse architecture interview around Inmon-Kimball bridges becomes a clean pattern.
The conformed dimension contract.
-
One source.
dim_customeris built fromedw.customerexactly once per refresh cycle. No mart builds its own customer dimension. -
One grain. Every mart sees the same row count, same
customer_keyset, same attributes for the samecustomer_id. -
One semantic.
customer_segment = 'private'means the same thing insales_mart,finance_mart, andrisk_mart. Disagreements are impossible by construction. - Many readers. N marts × 1 conformed dim → N joins to the same definition.
Why mart-by-mart dim_customer is the anti-pattern.
- Drift. Each mart team chooses different attributes, different bucketing rules, different SCD types. Six months in, no two marts agree.
- Cost. Six copies of the customer dimension cost six times the storage and six times the maintenance.
- Audit nightmare. SOX auditor asks "what was the segment on 2024-03-15?" — three different answers from three marts.
Mart design rules.
-
Star schema first. Every mart is
fact_Xjoined todim_Yby surrogate keys. No snowflakes at the mart layer. -
Conformed dims share keys.
dim_customer.customer_keyis the same value in every mart — derived fromedw.customer.customer_key. -
Mart-local dims. A mart can have additional dims that aren't shared (e.g.
dim_campaign_channelin marketing_mart only). These don't conflict with conformed dims. - Aggregate marts. Pre-aggregated marts (daily revenue, hourly KPI) live downstream of the granular marts. Cheaper queries, faster dashboards.
Refresh patterns by layer.
- EDW. Continuous (CDC) or scheduled (batch). The source of truth — refresh first.
-
Conformed dims. Rebuild nightly from EDW. They are derivations of
edw.Xis_current = TRUErows. - Mart facts. Incremental on new transaction events; full rebuild allowed if granular and cheap.
- Aggregate marts. Refresh after granular marts complete; can lag by hours without impacting BI.
The Inmon-Kimball bridge in three steps.
- Step 1. Build the 3NF EDW with SCD-2 history. This is the durable contract.
- Step 2. Extract conformed dimensions from the EDW as Kimball-style flattened views (one row per current entity, denormalised attributes for BI ease).
- Step 3. Build mart facts that join only to conformed dims (no joins to the 3NF EDW directly).
Worked example — sales mart + finance mart sharing dim_customer
Detailed explanation. A retail bank wants two marts — sales_mart for product-line sales attribution and finance_mart for revenue and cost reporting. Both marts join to a customer dimension. The conformed-dimension approach derives dim_customer once from edw.customer and shares it.
Question. Build a sales mart and a finance mart that both join to a conformed dim_customer. Show the DDL for dim_customer, fact_sales, fact_revenue. Show a query that answers "revenue by customer segment for 2024" from finance_mart and a query that answers "units sold by customer segment for 2024" from sales_mart — both segments returning identical sums because the dimension is conformed.
Input (sample EDW rows).
| customer_key | customer_id | segment | valid_from | valid_to | is_current |
|---|---|---|---|---|---|
| sha(C042) | C042 | private | 2024-03-15 | 9999-12-31 | TRUE |
| sha(C099) | C099 | retail | 2023-01-01 | 9999-12-31 | TRUE |
| sha(C123) | C123 | smb | 2022-06-01 | 9999-12-31 | TRUE |
Code (DDL for conformed dim + two marts).
-- ① Conformed dimension — built once, shared everywhere
CREATE OR REPLACE VIEW marts.dim_customer AS
SELECT
customer_key,
customer_id,
full_name,
segment, -- current value; for point-in-time use edw.customer directly
date_of_birth,
is_current,
valid_from,
valid_to
FROM edw.customer
WHERE is_current = TRUE;
-- ② Sales mart fact
CREATE TABLE marts.fact_sales (
sale_key STRING NOT NULL,
customer_key STRING NOT NULL, -- FK to dim_customer
product_key STRING NOT NULL,
branch_key STRING NOT NULL,
date_key INT NOT NULL,
units INT,
amount DECIMAL(18, 2),
sold_at TIMESTAMP,
load_id STRING
);
-- ③ Finance mart fact
CREATE TABLE marts.fact_revenue (
revenue_key STRING NOT NULL,
customer_key STRING NOT NULL, -- FK to dim_customer
account_key STRING NOT NULL,
date_key INT NOT NULL,
revenue DECIMAL(18, 2),
cost DECIMAL(18, 2),
recognised_at TIMESTAMP,
load_id STRING
);
-- ④ Sales-by-segment query (sales_mart)
SELECT
d.segment,
SUM(f.units) AS units_2024,
SUM(f.amount) AS amount_2024
FROM marts.fact_sales AS f
JOIN marts.dim_customer AS d ON d.customer_key = f.customer_key
WHERE f.sold_at >= TIMESTAMP '2024-01-01'
AND f.sold_at < TIMESTAMP '2025-01-01'
GROUP BY d.segment
ORDER BY amount_2024 DESC;
-- ⑤ Revenue-by-segment query (finance_mart)
SELECT
d.segment,
SUM(f.revenue) AS revenue_2024,
SUM(f.cost) AS cost_2024
FROM marts.fact_revenue AS f
JOIN marts.dim_customer AS d ON d.customer_key = f.customer_key
WHERE f.recognised_at >= TIMESTAMP '2024-01-01'
AND f.recognised_at < TIMESTAMP '2025-01-01'
GROUP BY d.segment
ORDER BY revenue_2024 DESC;
Step-by-step explanation.
-
dim_customeris a view, not a copy. Most modern warehouses (Snowflake, BigQuery, Databricks) cache view results. The view stays in sync withedw.customerautomatically — no nightly rebuild required if your engine supports it. -
Surrogate key is the join column. Both
fact_sales.customer_keyandfact_revenue.customer_keyare FKs todim_customer.customer_key. Identical surrogate keys mean identical join results across marts. -
Segment lookups go through
dim_customer. Neither mart storessegmentin the fact; the dimension is the only source. Change the segment in the EDW, every mart query reflects it immediately. -
Identical segment lists. The "segment by sales" and "segment by revenue" queries return the exact same
segmentvalues —{ private, retail, smb }— because they share a dimension. -
Point-in-time correctness. If you need "what was the segment at the time of the sale?" join
fact_sales.customer_keytoedw.customer(full SCD-2) withvalid_from <= sold_at < valid_to— Inmon's history is intact.
Output.
Sales mart query result:
| segment | units_2024 | amount_2024 |
|---|---|---|
| private | 8,420 | 1,240,500.00 |
| retail | 142,300 | 8,910,200.00 |
| smb | 22,150 | 3,440,800.00 |
Finance mart query result:
| segment | revenue_2024 | cost_2024 |
|---|---|---|
| private | 980,400.00 | 280,100.00 |
| retail | 6,420,000.00 | 1,920,000.00 |
| smb | 2,510,000.00 | 720,000.00 |
The segment values match perfectly across both marts. Two different aggregates, one shared customer definition.
Rule of thumb. Conformed dims are non-negotiable in a multi-mart shop. Build the conformed dim_customer, dim_product, dim_date, dim_branch first; only then start designing mart facts.
Worked example — build an aggregate mart for executive dashboards
Detailed explanation. Executive dashboards rarely need transaction-grain data — they want daily revenue per segment, monthly active accounts per region, weekly card spend per product line. An aggregate mart pre-computes these from the granular mart so the BI tool's queries hit a tiny table instead of scanning billions of fact rows.
Question. Build an aggregate mart that pre-computes daily revenue by segment + branch from marts.fact_revenue. Show the DDL, the daily refresh, and the savings ratio.
Input. marts.fact_revenue has 2.4 billion rows over 5 years; the executive dashboard needs daily revenue by (segment, branch_key) for the past 90 days.
Code.
-- ① Aggregate mart DDL — keyed by day + segment + branch
CREATE TABLE marts.agg_daily_revenue_by_segment_branch (
date_key INT NOT NULL,
branch_key STRING NOT NULL,
segment STRING NOT NULL,
revenue DECIMAL(18, 2),
cost DECIMAL(18, 2),
customer_count INT,
refreshed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (date_key, branch_key, segment)
);
-- ② Nightly refresh — incremental on the latest day only
INSERT OVERWRITE INTO marts.agg_daily_revenue_by_segment_branch
PARTITION (date_key)
SELECT
CAST(TO_CHAR(f.recognised_at, 'YYYYMMDD') AS INT) AS date_key,
f.branch_key,
d.segment,
SUM(f.revenue) AS revenue,
SUM(f.cost) AS cost,
COUNT(DISTINCT f.customer_key) AS customer_count
FROM marts.fact_revenue AS f
JOIN marts.dim_customer AS d ON d.customer_key = f.customer_key
WHERE f.recognised_at >= CURRENT_DATE - INTERVAL '1 day'
AND f.recognised_at < CURRENT_DATE
GROUP BY 1, 2, 3;
-- ③ Executive dashboard query
SELECT
segment,
SUM(revenue) AS revenue_90d
FROM marts.agg_daily_revenue_by_segment_branch
WHERE date_key >= CAST(TO_CHAR(CURRENT_DATE - INTERVAL '90 days', 'YYYYMMDD') AS INT)
GROUP BY segment
ORDER BY revenue_90d DESC;
Step-by-step explanation.
-
Aggregate grain chosen carefully.
(date_key, branch_key, segment)is the finest grain the executive dashboard ever needs. Coarser grains save more storage but lose drill-down flexibility. -
Incremental refresh. Only the past day is recomputed each night —
O(daily_rows), notO(all_rows). Idempotent if you replay yesterday's load. -
Granular mart untouched.
marts.fact_revenueis unchanged. The aggregate sits next to it as a derived view. -
Dashboard hits the aggregate. A 90-day revenue-by-segment query scans
90 × 500 = 45,000rows instead of120 millionfact rows — a 2,600× compression. -
Re-derivable. If the aggregate is corrupted, drop it and rebuild —
marts.fact_revenueis the source.
Output.
| segment | revenue_90d |
|---|---|
| retail | 142,500,000.00 |
| smb | 88,200,000.00 |
| private | 24,100,000.00 |
Query runs in 80 ms instead of 18 seconds against the granular mart. The executive dashboard refreshes instantly.
Rule of thumb. Aggregate marts pay for themselves the moment the BI tool runs the same shape of query more than a hundred times a day. Below that, just hit the granular mart and skip the maintenance.
EDW dependent mart interview question
A senior probe usually sounds like: "We have a sales mart and a finance mart. Sales says they have 142 thousand retail customers; finance says they have 138 thousand. Who's wrong?" The interviewer is testing whether the candidate reaches for the conformed dimension as the diagnostic and explains why the discrepancy is impossible if the marts derive from one EDW.
Solution Using marts.dim_customer derived once from edw.customer
-- The diagnostic query — confirms both marts use the same conformed dim
WITH sales_segments AS (
SELECT
d.segment,
COUNT(DISTINCT f.customer_key) AS distinct_customers
FROM marts.fact_sales AS f
JOIN marts.dim_customer AS d ON d.customer_key = f.customer_key
GROUP BY d.segment
),
finance_segments AS (
SELECT
d.segment,
COUNT(DISTINCT f.customer_key) AS distinct_customers
FROM marts.fact_revenue AS f
JOIN marts.dim_customer AS d ON d.customer_key = f.customer_key
GROUP BY d.segment
)
SELECT
COALESCE(s.segment, f.segment) AS segment,
s.distinct_customers AS in_sales_mart,
f.distinct_customers AS in_finance_mart,
ABS(COALESCE(s.distinct_customers, 0) - COALESCE(f.distinct_customers, 0)) AS gap
FROM sales_segments AS s
FULL OUTER JOIN finance_segments AS f USING (segment)
ORDER BY gap DESC;
Step-by-step trace.
| Stage | Operation | What it does |
|---|---|---|
| 1 | sales_segments CTE | counts distinct customers per segment in sales_mart |
| 2 | finance_segments CTE | counts distinct customers per segment in finance_mart |
| 3 | FULL OUTER JOIN on segment | aligns the two counts side by side |
| 4 | ABS(diff) AS gap | quantifies the discrepancy per segment |
| 5 | Interpretation | gap = 0 across all segments ⇒ marts agree; gap > 0 ⇒ one mart filtered or missed customers |
Output:
| segment | in_sales_mart | in_finance_mart | gap |
|---|---|---|---|
| retail | 142,300 | 142,300 | 0 |
| smb | 22,150 | 22,150 | 0 |
| private | 8,420 | 8,420 | 0 |
Zero gaps across the board. The original "142k vs 138k" complaint must come from a mart that isn't using the conformed dim — likely a finance team that built its own customer list off a separate source. The fix is to route them through marts.dim_customer.
Why this works — concept by concept:
-
Conformed dimension as contract — both marts join to the same
marts.dim_customerkeyed bycustomer_key. The customer set is identical by construction. - Surrogate-key joins — fact tables carry surrogate keys; the dimension supplies attributes. Renaming a customer doesn't break the join.
-
One source of truth —
marts.dim_customeris a single view overedw.customer is_current = TRUE. There is no other version. - Disposable marts — when finance complains about a count mismatch, you don't change the EDW. You re-derive the finance mart from the EDW + conformed dim — fix lands in one place.
- Cost — conformed dim refresh = O(distinct_current_customers) per cycle; mart fact join = O(fact_rows) with the dim cached in memory; debugging time = bounded because the contract is explicit.
SQL
Topic — dimensional modelling
Conformed dimension problems (multi-mart stars)
5. Governance, audit, SCDs at EDW grain
Governance is the overlay that makes an EDW shippable — lineage, audit, SCDs, compliance, PII
The mental model in one line: governance is the cross-cutting layer that turns a 3NF model into a regulator-ready warehouse — lineage from source column to consumption dashboard, audit for every row change, SCD policy per attribute, PII tagging with retention rules, and policy enforcement at query time. Once you bake these in at design time, the edw design interview question family of "how do you handle GDPR right-to-be-forgotten on a 10-year warehouse?" becomes a one-paragraph answer instead of a panic.
The four governance pillars.
- Lineage. Track every column from source system → staging → EDW → mart → dashboard. OpenLineage, dbt's column-level lineage, and DataHub all express the same graph. Auditors ask "where does this dashboard number come from?" and the lineage graph answers in one click.
-
Audit. Every row in the EDW carries
system_of_record,load_id,created_at. Every change carriesvalid_from/valid_to. Every access (in regulated systems) is logged to an immutable audit log. SOX, HIPAA, and Basel demand this. -
SCDs. Define per attribute, not per table.
customer.full_namemight be Type-2 (preserve every name change for KYC),customer.last_login_atmight be Type-1 (don't bother tracking),customer.date_of_birthmight be Type-0 (immutable). Document the choice in metadata. - Compliance. PII tagging (which columns hold personal data), retention rules (how long is each column kept), residency rules (which jurisdictions can read which columns), and right-to-be-forgotten workflows.
Data quality framework integration.
- Great Expectations / Soda Core. Schema, value-range, freshness, and referential-integrity checks at every layer transition (staging → EDW, EDW → marts).
- Block bad data early. A failing check in staging halts the EDW merge — the EDW is never polluted by bad source rows.
-
Quality SLAs. "99.5% of
customer.emailvalues must match a valid email regex" — a check that runs every hour and pages the data team on breach. - DQ scores per table. Surface a freshness + completeness + uniqueness score per EDW table in the catalogue.
The PII column lifecycle.
- Discovery. Use scanners (Privacera, BigID, OneTrust) to classify columns by sensitivity at ingestion.
-
Tagging. Apply policy tags (
PII,sensitive,confidential) in the catalogue (Collibra, Alation, DataHub) and the warehouse's native tagging (Snowflake object tags, BigQuery policy tags). -
Masking. Dynamic masking based on the consuming role — analysts see
email = 'a***@gmail.com'; auditors see the clear value with an access log entry. -
Retention. Apply per-column retention rules —
customer.ssnis purged 7 years after account closure;customer.emailis purged 90 days after right-to-be-forgotten. -
Forgotten subjects. A GDPR right-to-be-forgotten request flips a
is_forgottenflag on the EDW customer row; downstream marts re-derive and the masked view replaces clear values.
SCD strategy per attribute (the design table you should always have).
| Attribute | SCD type | Why |
|---|---|---|
customer.full_name |
Type 2 | KYC change history |
customer.email |
Type 2 | Audit trail for marketing consent |
customer.segment |
Type 2 | Required for time-correct revenue attribution |
customer.last_login_at |
Type 1 | High churn, no audit value |
customer.date_of_birth |
Type 0 | Immutable in nature |
account.status |
Type 2 | Required for active-account counts at point in time |
account.balance |
Snapshot | Daily snapshot in EDW, not Type-2 |
transaction.* |
Immutable | Transactions don't change |
Compliance regimes the EDW must support.
- SOX. Reproducible financial reports — point-in-time queries must return identical results 7 years later. Atomic grain + SCD-2 satisfies this.
- GDPR. Right-to-be-forgotten + data minimisation + lawful basis tracking. Per-column masking + retention rules.
- HIPAA. PHI tagging + access logging + role-based masking. Same pattern as GDPR with stricter audit.
- Basel / IFRS. Reproducible risk calculations — point-in-time portfolio queries. SCD-2 history is the answer.
The governance metadata layer (often the missing piece).
-
Column dictionary. Every EDW column has a row in
ops.column_catalogwith description, business owner, PII tag, retention rule, SCD type, and source-of-record. The catalog is the source of truth for governance — masking policies, data-quality checks, and lineage tools all read from it. -
Role-to-policy mapping. A small
ops.role_policytable maps each Snowflake / BigQuery role to the masking policies it inherits. Adding a new role is a metadata change, not a DDL change. -
Quality-of-service tags. Tables get a freshness SLA (
stale_after_minutes), a completeness SLA (min_rows_per_day), and an availability SLA (alert_on_query_failure) — surfaced in the catalog. - Data contracts. Each EDW table publishes a schema contract; source teams sign off when their column adds change the contract. Breaking changes require a versioned migration plan.
The lineage graph — what to record and where.
- Source-to-staging. Source system name + extract timestamp + load id + row hash on every staging row. The lineage starts here.
-
Staging-to-EDW. Each EDW merge logs the staging table read, the rows merged, the rows closed (SCD-2), and the runtime. Stored in
ops.merge_logwith a retention of the regulatory horizon. - EDW-to-marts. Each mart-refresh job logs which EDW tables it read, the row counts, and the resulting mart row counts. Used to debug "the mart is missing 4,000 customers" complaints.
- Marts-to-consumption. BI tools (Tableau, Looker) attach metadata to every dashboard pointing at the source mart. Reverse-ETL tools (Census, Hightouch) log the mart-row → SaaS-record mapping.
- Stitch it. OpenLineage is the standard for emitting these events. DataHub, Marquez, or Atlan are the dashboards.
Worked example — implement column-level PII masking with role-based unmasking and a right-to-be-forgotten workflow
Detailed explanation. A bank EDW holds customer.email, customer.ssn, account.balance. Marketing analysts need email-domain analytics but never clear emails. Auditors need clear values with access logged. A GDPR right-to-be-forgotten request must zero the email everywhere downstream without breaking referential integrity.
Question. Implement Snowflake-style column-level masking + role-based unmasking + a right-to-be-forgotten flag for customer.email. Show how analyst, auditor, and marketing roles see the column differently. Show the workflow for a right-to-be-forgotten request.
Code (Snowflake-style masking + RTBF flag).
-- ① Add the RTBF flag to the EDW
ALTER TABLE edw.customer ADD COLUMN is_forgotten BOOLEAN DEFAULT FALSE;
-- ② Define the masking policy
CREATE OR REPLACE MASKING POLICY edw.email_mask AS (val STRING) RETURNS STRING ->
CASE
-- Auditor sees clear value (access is logged at the role level)
WHEN CURRENT_ROLE() IN ('AUDITOR', 'COMPLIANCE_OFFICER') THEN val
-- Marketing sees only the domain
WHEN CURRENT_ROLE() = 'MARKETING_ANALYST' THEN
CONCAT('***@', SPLIT_PART(val, '@', 2))
-- Default — everyone else gets a generic mask
ELSE '***@***.***'
END;
-- ③ Apply the policy to the column
ALTER TABLE edw.customer
MODIFY COLUMN email
SET MASKING POLICY edw.email_mask;
-- ④ The right-to-be-forgotten workflow
-- Step 1 — mark the customer
UPDATE edw.customer
SET is_forgotten = TRUE
WHERE customer_id = 'C042';
-- Step 2 — replace the PII in the current and historical rows
UPDATE edw.customer
SET email = CONCAT('forgotten-', customer_key, '@redacted.example')
WHERE customer_id = 'C042';
-- Step 3 — keep the row (referential integrity, account ownership) but PII is gone
SELECT customer_id, email, is_forgotten
FROM edw.customer
WHERE customer_id = 'C042'
ORDER BY valid_from;
Step-by-step explanation.
-
Masking is applied at the column level. Snowflake / BigQuery evaluate the policy at query time based on
CURRENT_ROLE(). The same row returns different values to different readers — no copy of the data is made. -
Three roles, three views of the column. Auditors get the clear value (and their query is logged); marketing gets the domain only (
***@gmail.comfor cohort analytics); everyone else gets a generic mask. -
RTBF doesn't delete the row. A customer row in
edw.customeris the parent ofaccountandtransactionFKs. Deleting it would break referential integrity for years of immutable transactions. Instead, the row stays; the PII fields are replaced with deterministic redacted values. -
is_forgottenflag drives downstream behaviour. Marts can choose to drop forgotten customers from any new dashboards; ML training pipelines can filter them out; the EDW remains structurally intact. -
Audit trail preserved. Every UPDATE above is itself an EDW change — the new redacted email gets
valid_from = now(), the previous clear email getsvalid_to = now(). A regulator who inspects the audit log sees that on date X the customer was forgotten — and the previous value is no longer in the row.
Output.
After applying RTBF for customer C042, queries return:
For an analyst:
| customer_id | is_forgotten | |
|---|---|---|
| C042 | @.*** | TRUE |
For a marketing analyst:
| customer_id | is_forgotten | |
|---|---|---|
| C042 | ***@redacted.example | TRUE |
For an auditor:
| customer_id | is_forgotten | |
|---|---|---|
| C042 | forgotten-sha256...@redacted.example | TRUE |
The auditor sees the redacted clear value (with their query logged); the previous real email is purged from the EDW row's history; the customer's downstream account and transaction rows are intact.
Rule of thumb. RTBF replaces PII, doesn't delete rows. Surrogate keys + foreign keys + immutable transactions are too valuable to lose. Mask, mark, retain the structure.
EDW governance interview question
A senior probe usually sounds like: "A customer files a GDPR right-to-be-forgotten request. They have 7 years of transaction history in your EDW. What do you do?" The interviewer is testing whether the candidate breaks the customer's row, breaks referential integrity, or does the mature thing — replace the PII, keep the row, mark the flag, and let downstream consumers respect it.
Solution Using column masking + RTBF flag + PII replacement (referential integrity preserved)
-- The end-to-end RTBF stored procedure
CREATE OR REPLACE PROCEDURE edw.forget_customer(p_customer_id STRING)
RETURNS STRING
LANGUAGE SQL
AS $$
BEGIN
-- ① Mark the flag
UPDATE edw.customer
SET is_forgotten = TRUE
WHERE customer_id = :p_customer_id;
-- ② Replace PII columns deterministically
UPDATE edw.customer
SET full_name = 'FORGOTTEN',
email = CONCAT('forgotten-', customer_key, '@redacted.example'),
date_of_birth = NULL,
phone_number = NULL
WHERE customer_id = :p_customer_id;
-- ③ Trigger downstream mart rebuilds (orchestrator picks this up)
INSERT INTO ops.mart_rebuild_queue (entity, reason, requested_at)
VALUES ('dim_customer', 'rtbf:' || :p_customer_id, CURRENT_TIMESTAMP);
-- ④ Log the action to the immutable audit trail
INSERT INTO ops.governance_audit (action, target, actor, occurred_at)
VALUES ('rtbf', :p_customer_id, CURRENT_USER(), CURRENT_TIMESTAMP);
RETURN 'OK';
END;
$$;
Step-by-step trace.
| Step | Operation | Effect |
|---|---|---|
| 1 | Set is_forgotten = TRUE
|
Marks customer in EDW; downstream marts see the flag and re-derive |
| 2 | Replace PII columns |
full_name, email, DOB, phone → redacted; referential FKs to account / transaction preserved |
| 3 | Queue mart rebuild | Orchestrator detects new entry, rebuilds dim_customer overnight |
| 4 | Append audit log | Immutable record for regulators: who requested, when, by whom |
| 5 | Notify consumers | Reverse-ETL pipelines, ML feature stores, BI cubes refresh from new dim_customer |
Output:
| Layer | Before RTBF | After RTBF |
|---|---|---|
edw.customer.email |
anya@gmail.com |
forgotten-sha256...@redacted.example |
edw.account (parent FK) |
intact | intact |
edw.transaction (parent FK) |
intact | intact |
marts.dim_customer.email |
anya@gmail.com |
forgotten-...@redacted.example |
ops.governance_audit |
(no entry) | one immutable RTBF row |
The 7 years of transactions remain — they're required for SOX, Basel, and AML compliance — but the personal identifying data is gone. The structural integrity that makes the EDW useful is preserved.
Why this works — concept by concept:
-
Mark, don't delete —
is_forgotten = TRUEis the signal; downstream consumers filter or re-derive. Deletion would break months of referential joins. -
PII replacement is deterministic —
forgotten-{customer_key}@redacted.exampleis unique per customer, so any join on email still uniquely resolves; just to nothing meaningful. - Mart rebuild queue — the EDW is the source of truth; mart changes follow asynchronously. The RTBF promise to the customer is met when the EDW is updated, not when the dashboards finish refreshing.
-
Immutable audit log —
ops.governance_auditis append-only with broker-level retention. Regulators inspect this when they audit your RTBF workflow. - Masking policy as a safety net — even before RTBF, the column was masked to analysts. RTBF tightens the mask further — even the unmasked clear value becomes the redacted string.
- Cost — RTBF execution = O(rows_for_customer) — typically dozens, never thousands; mart rebuild = O(dim_customer) once; audit log size = O(rtbf_events) (negligible).
SQL
Topic — slowly-changing dimensions
Governance + SCD-2 problems at EDW grain
EDW vs Kimball decision matrix (cheat sheet)
The decision between Inmon EDW, Kimball-only marts, and the hybrid (most common) collapses to a handful of factors. Print this matrix and pin it next to your design notebook.
-
5+ source systems with overlap? → Inmon EDW at the core. The integration contract pays off the moment you have to reconcile two definitions of
customer. - Single source, analytics-only, no audit? → Kimball-only. Build the star schema directly off the source; you don't need the EDW middle layer.
- Any regulatory load (SOX, GDPR, HIPAA, Basel, IFRS)? → Inmon EDW with SCD-2 on every dimension. Audit trails and point-in-time queries are non-negotiable.
- Multi-team analytics (3+ subject areas)? → Inmon EDW + conformed dimensions + Kimball marts. The conformed dim contract is the only way to keep team-level analytics from drifting.
- Decade-long horizon? → Inmon EDW. The platform outlives every BI tool, every cloud vendor, every CDO. The EDW is the durable asset.
- Fast time-to-insight, 3-month deadline? → Kimball-only mart for the first cut. Plan to retrofit an EDW only if the scope grows.
- Lakehouse substrate (Delta, Iceberg, Hudi)? → The model is unchanged. Inmon EDW + Kimball marts works fine on a lakehouse — the storage is just cheaper.
-
dbt + modern stack? → dbt's
staging → intermediate → martsis the dbt-flavoured restatement of CIF layers. The Inmon model maps cleanly onto it. - Right-to-be-forgotten or PII-heavy domain? → Inmon EDW + governance overlay (column masking + RTBF flag + retention rules). Bolting governance onto a pure Kimball stack is much harder than baking it into the EDW from day one.
Default choice for 2026 enterprises: Inmon 3NF EDW at the core + Kimball star marts at the serving layer + conformed dimensions as the bridge + a governance overlay across all of it. That is what most modern banks, insurers, hospitals, and large retailers run — and what most senior enterprise data warehouse interview rounds expect you to defend.
Frequently asked questions
Inmon vs Kimball — which one should I use?
Inmon and Kimball solve different problems and the modern answer is usually "both, layered." Inmon's 3NF enterprise data warehouse sits at the integration core — atomic grain, full SCD-2 history, surrogate keys, one source of truth per attribute. Kimball's dimensional marts sit downstream as the serving layer — fast joins, friendly column names, easy to rebuild. Pick Inmon-only if you're a regulated multi-source shop with a decade-long horizon; pick Kimball-only if you have one source and a three-month deadline; pick the hybrid for everything in between, which is most enterprises.
Is 3NF still relevant in the lakehouse era?
Yes. Lakehouse engines (Delta, Iceberg, Hudi) changed the storage and the engine, but the modelling problems didn't change. A multi-source bank still needs to reconcile customer once, still needs SCD-2 history for SOX, still needs surrogate keys for durable identity. 3nf data warehouse modelling is the right shape for the EDW layer regardless of whether it lives on Snowflake, BigQuery, Databricks, or a Postgres data warehouse. What lakehouse changed is the compute economics — the same model is now affordable on cheap object storage with separation of compute and storage.
Do I need a Corporate Information Factory in 2026?
If you have five or more source systems with overlapping entities, any regulatory load, or multiple analytics teams, yes. The corporate information factory is the canonical Inmon reference architecture and the five layers (sources, staging, 3NF EDW, dependent marts, consumption) each earn their keep. If you're a single-source SaaS startup with an analyst and a Postgres database, you don't — go straight to a dbt-style staging-to-marts model and skip the EDW middle layer until you outgrow it.
What is the staging layer for, and why can't I skip it?
Staging is the schema-on-read landing zone where source data arrives unchanged, with audit metadata (source_system, load_id, loaded_at) bolted on. It earns its keep for three reasons. First, it's the replay buffer — if you change EDW logic you replay from staging without re-extracting from operational systems. Second, it absorbs source-schema drift — column adds and renames are handled here, not in the EDW. Third, it's where data-quality checks live — bad rows are caught in staging and never pollute the EDW. Skipping it forces all three responsibilities onto the EDW merge logic, which then becomes brittle and expensive to debug.
How do I migrate from a Kimball-only warehouse to an Inmon EDW?
Start by identifying the conformed dimensions across your existing marts and reverse-engineer their EDW source. Build the 3NF EDW for the highest-overlap entities first (customer, account, product) — these are the ones causing the most disagreements. Once the EDW versions of these entities exist with full SCD-2 history, point your existing mart-building jobs at the EDW instead of the sources. The marts don't change shape; only their upstream changes. The migration is iterative — entity by entity over six to twelve months — not a big-bang switch. The Kimball marts stay; the EDW slides in behind them as the new source of truth.
Conformed dimensions — what are they really and why do they matter?
A conformed dimension is one definition of a business entity (customer, product, date, branch) that every mart shares. dim_customer is built from edw.customer exactly once per refresh cycle and reused by sales_mart, finance_mart, risk_mart, and marketing_mart. It matters because without it every mart team builds their own dim_customer and within six months no two marts agree on what a customer is — the sales team's retail-customer count differs from finance's by 4 thousand and no one can explain why. With conformed dims, this drift is structurally impossible. The Inmon EDW gives you exactly one place where the customer definition lives; conformed dimensions are how that definition is shared across the serving layer.
Practice on PipeCode
- Drill the data modelling practice library → for EDW design, 3NF normalisation, and conformed-dim problems.
- Rehearse slowly-changing dimension drills → for SCD-1/2/3/6 at EDW grain.
- Sharpen ETL pattern problems → when the interviewer wants the load mechanism, not just the schema.
- Stack the SQL muscles with aggregation drills → and window-function problems → — both heavily used in mart design.
- For the broader interview surface, read top data engineering interview questions →.
- Stack the prerequisites with the only 5 skills you need to become a data engineer →.
- Reinforce the modelling muscle with data modelling for DE interviews →.
- For the design-round muscles, work through ETL system design for DE interviews →.
- Reinforce the SQL side with SQL for DE interviews — from zero to FAANG →.
Pipecode.ai is Leetcode for Data Engineering — every Inmon EDW concept above ships with hands-on practice rooms where you model real entities, write real SCD-2 MERGEs, and design real conformed dimensions. Start with the data modelling library and work outward; PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine.
Practice data modelling now →
Slowly-changing dimension drills →





Top comments (0)