DEV Community

Cover image for Data Lakehouse vs Data Warehouse vs Data Lake: Which Architecture Wins
Gowtham Potureddi
Gowtham Potureddi

Posted on

Data Lakehouse vs Data Warehouse vs Data Lake: Which Architecture Wins

The data lakehouse vs data warehouse debate is the architecture decision every modern data team makes — and it does not have a single winner, only the right answer per workload. The three architectures — data warehouse, data lake, lakehouse — each evolved to solve a specific failure mode of the one that came before, and each one still wins inside its lane: warehouses dominate BI and dashboards, lakes dominate cheap raw storage and ML, lakehouses dominate mixed workloads that need both. The right way to compare them is not "which is best" but rather "which storage layer, which compute engine, and which transactional guarantees fit my workload — and what does the migration path between them actually cost".

This guide walks the three architectures end-to-end at deep-guide depth — data lake vs data warehouse at the storage / ingest / schema / governance layer, lakehouse architecture at the open-table layer (Delta, Iceberg, Hudi), and data warehouse architecture vs data lake architecture at the engine and cost-profile layer — with a five-dimension decision matrix, three worked migration scenarios, and SQL / Python snippets that match the exact shapes panelists ask in senior data-platform interviews. By the end you will be able to defend any of the three on the right workload, name the failure mode each was invented to solve, quote the cost-and-ACID tradeoffs from memory, and walk through a real migration without hand-waving.

PipeCode blog header for a deep-dive comparison of data lakehouse vs data warehouse vs data lake — bold white headline 'Lakehouse vs Warehouse vs Lake' with subtitle 'Which architecture wins for which workload' and three stylised mini-architecture cards side-by-side on a dark gradient with purple, orange, blue, and green accents and a small pipecode.ai attribution.

When you want hands-on reps immediately after reading, browse data-modeling practice →, drill ETL pipeline problems →, sharpen dimensional-modeling drills →, rehearse aggregation patterns for BI workloads →, reinforce database design problems →, or widen coverage on the full SQL practice library →.


On this page


1. Why the three-architecture comparison matters in 2026

data lakehouse vs data warehouse — three architectures, three failure modes, one decision per workload

The one-sentence invariant: the three analytical architectures are not competitors — they are a historical sequence, each one invented to solve the failure mode of the one before, and the modern stack in 2026 typically runs at least two of them side by side. A senior data engineer does not say "warehouses are dead, lakehouses won"; they say "warehouses still serve BI fastest, lakes still archive raw cheapest, and lakehouses bridge both with open table formats — pick by workload, not by hype-cycle".

The historical sequence at a glance.

  • 1980s-2010s — data warehouse era. Teradata, Oracle Exadata, then Redshift / Snowflake / BigQuery / Synapse. Won at: BI, dashboards, structured SQL, ACID guarantees, fine-grained governance. Failed at: cheap raw storage, semi-structured data (JSON / Avro), ML feature pipelines, multi-engine flexibility, ingestion velocity.
  • 2010s-2020s — data lake era. Hadoop HDFS, then S3 + Glue + Athena, ADLS Gen2, GCS. Won at: cheap storage at any scale, raw archival, any file format, ML training data, schema-on-read flexibility. Failed at: ACID transactions, schema enforcement, BI consistency, fine-grained updates, governance maturity.
  • 2020s-now — lakehouse era. Databricks Delta Lake, Apache Iceberg, Apache Hudi, Snowflake Iceberg tables, BigLake, Microsoft Fabric. Won at: lake economics + warehouse reliability, ACID on object storage, multi-engine reads of the same tables, open formats, unified catalog. Trade-offs: still maturing tooling, table-format choice is a long-term commitment, governance bolt-on requires extra effort.

What changed in 2026 that makes this comparison different from 2018.

  • Open table formats matured. Delta Lake 3.x with UniForm reads as Iceberg; Iceberg v3 ships in Snowflake, BigQuery, Redshift, and Athena; Hudi 1.0 finalised its Streamer API. Open tables are no longer a Databricks-only story.
  • Warehouses embraced lake formats. Snowflake reads and writes Iceberg; BigQuery has BigLake and Iceberg native tables; Redshift queries Iceberg-on-S3 directly. The warehouse vs lake wall fell.
  • Lakes got ACID. Before Delta / Iceberg, an UPDATE on a lake meant rewriting a partition by hand; today, UPDATE, DELETE, MERGE, and time-travel are first-class on object storage.
  • Compute fully separated from storage. Spark, Trino, Presto, Flink, DuckDB, Snowflake, BigQuery, Athena, ClickHouse — multiple engines read the same Iceberg table from the same S3 bucket with the same governance.
  • Cost pressure forced honesty. Warehouses still bundle compute + storage (or charge a premium for storage); lake / lakehouse stacks decouple them. At petabyte scale the difference is six figures a year.

Who should read which comparison.

  • data lake vs data warehouse — read section 2 + section 3; the classic 2015-2020 debate, still relevant when a team is choosing its first analytical platform.
  • data lakehouse vs data warehouse — read section 2 + section 4; the 2022-now debate, relevant when migrating off Redshift / Synapse for cost or flexibility reasons.
  • data lake vs data lakehouse — read section 3 + section 4; the 2021-now debate, relevant when an existing lake's lack of ACID and BI consistency starts hurting.
  • All three at once — read the full guide; the modern reality is hybrid, and senior interviews expect you to defend the choice across all three lanes.

Worked example — map a single workload onto all three architectures

Detailed explanation. A canonical interview prompt is "a marketplace wants daily GMV dashboards, monthly cohort retention, and real-time fraud scoring — design the data platform". The honest answer touches all three architectures, and the worked example below walks the mapping cell by cell.

Question. A marketplace ships 3 TB / day of clickstream events, 80 GB / day of OLTP CDC, and needs (a) an executive GMV dashboard refreshed every 15 minutes, (b) monthly cohort retention reports run by analysts, and (c) a fraud-scoring ML pipeline that retrains nightly on 6 months of raw events. Which architecture serves each workload, and how do they share data?

Input. Three workloads, three SLAs, one storage layer. Source systems: PostgreSQL OLTP (CDC via Debezium), Kafka clickstream (1 M events / sec peak), and the SaaS billing API (hourly REST pulls).

Code.

-- A canonical workload-to-architecture mapping table.
CREATE TABLE workload_architecture_map AS
SELECT * FROM (VALUES
    ('exec_gmv_dashboard',        '15 min',  'warehouse_or_lakehouse', 'star-schema fact_orders',          'BI engine'),
    ('monthly_cohort_retention',  '1 day',   'lakehouse',              'iceberg fact_events + dim_user',   'spark sql'),
    ('fraud_ml_training',         '1 day',   'lake_or_lakehouse',      'parquet partitioned by event_dt',  'spark mllib'),
    ('raw_event_archive_7y',      'n/a',     'lake',                   'parquet glacier-tiered',           'cold storage')
) AS t(workload, sla, architecture, storage_layout, engine);
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. exec_gmv_dashboard lives in the warehouse lane or the lakehouse lane; either serves star-schema BI at 15-minute latency. The warehouse wins on raw query speed; the lakehouse wins on cost-per-TB if the data already lives in S3.
  2. monthly_cohort_retention lives in the lakehouse lane; analysts can query the same Iceberg table the GMV dashboard reads, plus historical depth that would be prohibitive to keep in the warehouse.
  3. fraud_ml_training lives in the lake lane or the lakehouse lane; ML engineers need raw Parquet partitioned by event_dt, and Spark MLlib reads it directly without going through a warehouse engine.
  4. raw_event_archive_7y lives in the lake lane with cold-tier S3 Glacier; warehouses charge real money to keep 7 years of clickstream that is read twice a year.
  5. The shared storage layer is the punchline — S3 + Iceberg lets all four workloads sit on top of the same files with different engines.

Output (the workload map).

workload sla architecture engine
exec_gmv_dashboard 15 min warehouse_or_lakehouse BI engine
monthly_cohort_retention 1 day lakehouse spark sql
fraud_ml_training 1 day lake_or_lakehouse spark mllib
raw_event_archive_7y n/a lake cold storage

Rule of thumb: never force one architecture to serve all workloads — the senior answer is "lakehouse as the storage spine + a warehouse for the BI hot path + the lake's cold tier for archive".

data lake vs data warehouse — the four senior signals that separate hype from substance

Signal 1 — opinionated workload mapping, not blanket claims. Senior engineers do not say "lakehouses replace warehouses"; they say "lakehouses replace the warehouse's archival and ML lanes, but a real-time BI dashboard on 500 concurrent users still benefits from a warehouse's query engine and result cache".

Signal 2 — quoting the open-table-format tradeoffs, not just naming them. Junior answers list Delta, Iceberg, Hudi without distinction. Senior answers say "Delta has the strongest ecosystem inside Databricks; Iceberg has the strongest cross-engine support and is winning on neutrality; Hudi has the best record-level upsert and CDC story but a smaller community".

Signal 3 — cost-and-egress reasoning, not feature checklists. Senior engineers reason about storage cost per TB-month, compute cost per TB-scanned, egress between regions, and the hidden cost of keeping data in the warehouse format (Snowflake's storage premium over raw S3 is ~5-10x). Junior engineers compare feature lists.

Signal 4 — migration realism. When asked "how would you migrate from Redshift to a lakehouse", junior engineers say "copy the tables to S3 as Iceberg". Senior engineers say "unload to S3 as Parquet, convert to Iceberg in place, dual-write for two weeks while the BI tools point at the warehouse, cut BI over to a Trino-on-Iceberg endpoint, retire Redshift compute, keep storage tier for one quarter as rollback insurance".

SQL
Topic — etl
ETL pipeline drills

Practice →

Data modeling
Lane — data-modeling
Data modeling practice library

Practice →

Solution Using a five-dimension architecture scorecard

Code.

-- One canonical scorecard — every architecture scored on five dimensions.
CREATE TABLE architecture_scorecard AS
SELECT * FROM (VALUES
    ('warehouse', 'best_workload',     'BI / dashboards / SQL'),
    ('warehouse', 'format_support',    'structured + JSON'),
    ('warehouse', 'acid_guarantees',   'full ACID'),
    ('warehouse', 'cost_profile',      'compute + storage bundled'),
    ('warehouse', 'maturity',          '30+ years'),
    ('lake',      'best_workload',     'ML / raw archive / semi-structured'),
    ('lake',      'format_support',    'any format'),
    ('lake',      'acid_guarantees',   'none by default'),
    ('lake',      'cost_profile',      'cheapest storage'),
    ('lake',      'maturity',          '15+ years'),
    ('lakehouse', 'best_workload',     'mixed BI + ML + streaming'),
    ('lakehouse', 'format_support',    'any format + open tables'),
    ('lakehouse', 'acid_guarantees',   'ACID via Delta / Iceberg / Hudi'),
    ('lakehouse', 'cost_profile',      'cheap storage + pay per engine'),
    ('lakehouse', 'maturity',          'modern + fast-evolving')
) AS t(architecture, dimension, verdict);
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

architecture dimension verdict
warehouse best_workload BI / dashboards / SQL
warehouse format_support structured + JSON
warehouse acid_guarantees full ACID
warehouse cost_profile compute + storage bundled
warehouse maturity 30+ years
lake best_workload ML / raw archive / semi-structured
lake format_support any format
lake acid_guarantees none by default
lake cost_profile cheapest storage
lake maturity 15+ years
lakehouse best_workload mixed BI + ML + streaming
lakehouse format_support any format + open tables
lakehouse acid_guarantees ACID via Delta / Iceberg / Hudi
lakehouse cost_profile cheap storage + pay per engine
lakehouse maturity modern + fast-evolving
  1. Row 1-5 (warehouse) — five clean wins on BI, format-strict, full ACID; pay the cost-profile premium for those.
  2. Row 6-10 (lake) — cheapest storage, every format, but ACID is on you to enforce; great for ML, dangerous for BI.
  3. Row 11-15 (lakehouse) — bridges both lanes; the cost-profile is "cheap storage + you pay per engine", which is the senior tradeoff every CFO asks about.
  4. The matrix is the artefact you draw on the whiteboard when someone asks "compare warehouse vs lake vs lakehouse".
  5. Memorise the 15 cells; senior interviewers expect you to recite the row for any dimension on demand.

Output.

architecture best_workload acid_guarantees cost_profile
warehouse BI / dashboards / SQL full ACID compute + storage bundled
lake ML / raw archive / semi-structured none by default cheapest storage
lakehouse mixed BI + ML + streaming ACID via Delta / Iceberg / Hudi cheap storage + pay per engine

Why this works — concept by concept:

  • Five-dimension scorecard — turns a fuzzy "which is best" question into 15 scored cells; interviewers love a tester who can recite the matrix instead of waving.
  • Best-workload binding — pairs each architecture with the workload it wins at, not the workloads it tolerates; this is the discipline that separates senior answers from blog summaries.
  • ACID column — explicit on which architectures ship full ACID by default; the lake row's "none by default" is the single most consequential cell in the whole matrix.
  • Cost profile — exposes the unbundled-storage reality; modern stacks live or die on whether storage is bundled with compute.
  • CostO(1) to read the scorecard; the actual workloads have their own runtime costs but the decision itself is constant-time.

2. Data warehouse architecture — schema-on-write, ETL, star schema, BI-first

Visual diagram of a classic data warehouse architecture — sources on the left, a central ETL block, a star-schema warehouse in the middle with three coloured layers (staging, ODS, marts), and BI consumers on the right; a tight governance ribbon overlaid; on a light PipeCode card.

data warehouse architecture — schema-on-write, ETL, ODS, marts, BI

data warehouse architecture is the architecture that defined analytics for thirty years and still wins on BI workloads today. The defining property is schema-on-write: data is shaped before it lands. Every column is typed, every constraint is enforced, every row passes ACID. The pipeline is ETL (extract → transform → load) — transformations happen before the warehouse, not after — and the canonical layout is staging → ODS → star-schema marts with BI tools (Power BI, Tableau, Looker) reading the marts.

The four pillars of warehouse architecture.

  • schema-on-write — every column type, nullability, PK, and FK is enforced on write; an attempted insert with the wrong type fails. The cost: ingestion is slower; the win: every downstream query sees a clean shape.
  • ETL pipeline — transformations happen in a dedicated tool (Informatica, Talend, dbt, hand-rolled Python / SQL) before data lands in the warehouse. Compare to ELT in lakes, where data lands raw and is transformed later.
  • star schema — fact tables (events) joined to dimension tables (entities) via surrogate keys; fact_orders joins dim_customer, dim_product, dim_date. Optimised for the GROUP BY ... SUM(...) ... JOIN dim_x shape that 90% of BI queries take.
  • ACID + governance — full transactional semantics (INSERT, UPDATE, DELETE are atomic), plus row- and column-level access control, audit logs, and lineage. The warehouse is the most trustworthy data surface in the company.

The canonical layered layout.

  • Layer 1 — staging tables. Raw extracts from sources, typed but not modelled. Truncate-and-reload daily. Owned by ingestion engineers.
  • Layer 2 — ODS / EDW (Operational Data Store / Enterprise Data Warehouse). Normalised in 3NF; one row per real-world entity. Owned by data engineers.
  • Layer 3 — marts. Denormalised star or snowflake schemas keyed by analytic subject area (finance_mart, marketing_mart, product_mart). Owned by analytics engineers.
  • Consumers. BI tools, operational reports, embedded analytics, and dbt macros that compose mart-level metrics.

The big-name implementations in 2026.

  • Snowflake — cloud-native, separation of compute and storage inside a closed format, virtual warehouses (clusters) per workload, multi-cluster auto-scaling. Most popular in 2026.
  • BigQuery — serverless, scan-based pricing, Capacitor columnar format, decoupled storage in Google Cloud Storage. Strongest on ad-hoc analytical SQL.
  • Redshift — AWS-native, recently added RA3 (decoupled storage), Spectrum (S3 query), and Iceberg table support. Still common in AWS-only shops.
  • Synapse — Azure-native, blended SQL pool + Spark pool, now folded into Microsoft Fabric (which is itself moving toward lakehouse).
  • Teradata / Oracle Exadata — on-prem incumbents; still dominant in banking + telco; the systems that defined the term "data warehouse".

Where warehouses still win.

  • BI workloads with strict latency. A Tableau dashboard serving 500 concurrent users needs sub-second response on cached aggregations; the warehouse's result cache and BI-vendor integrations make this trivial.
  • Strictly structured + small JSON. When all data is relational and JSON is the occasional column, warehouses serve it with full ACID and SQL semantics. Once JSON is the primary shape, lakes win.
  • Fine-grained governance. Column masking, row-level security, audit trails — mature in warehouses, still bolt-on in lake stacks.
  • Financial close + regulatory reporting. SOX / GAAP-grade auditability needs ACID + immutable history + lineage — the warehouse heritage.

Where warehouses struggle.

  • Petabyte-scale raw archive. Storing 7 years of clickstream at Snowflake list price is six figures a month; the same data on S3 cold tier is four figures.
  • Semi-structured / unstructured data. Logs, images, PDFs, IoT payloads — possible in warehouses but expensive and awkward.
  • ML feature engineering. Spark, Ray, and PyTorch want to read raw Parquet directly; pulling through a warehouse adds latency and cost.
  • Multi-engine flexibility. A warehouse is one engine; you cannot point Trino, Spark, and DuckDB at the same warehouse table without paying for additional compute (or moving data).

Worked example — design a star schema for an e-commerce GMV mart

Detailed explanation. Real interviews ask you to lay out the star schema for a specific subject area. Below is the canonical e-commerce fact_orders mart with three dimension tables — the shape that 90% of warehouse BI queries take.

Question. Design a fact_orders star-schema mart for an e-commerce business. Include the fact table, three dimension tables (dim_customer, dim_product, dim_date), and a representative BI query that computes daily GMV by region for the last 30 days.

Input. Source staging.orders has columns order_id, customer_id, product_id, order_ts, quantity, unit_price, discount, currency. Source staging.customers and staging.products provide the dimension rows.

Code.

-- Dimension tables (denormalised, surrogate-keyed)
CREATE TABLE dim_customer (
    customer_sk     BIGINT PRIMARY KEY,
    customer_id     VARCHAR(64) NOT NULL,
    region          VARCHAR(32),
    signup_date     DATE,
    customer_tier   VARCHAR(16)
);

CREATE TABLE dim_product (
    product_sk      BIGINT PRIMARY KEY,
    product_id      VARCHAR(64) NOT NULL,
    category        VARCHAR(64),
    brand           VARCHAR(64),
    list_price_usd  NUMERIC(10,2)
);

CREATE TABLE dim_date (
    date_sk         INT PRIMARY KEY,
    full_date       DATE NOT NULL,
    day_of_week     VARCHAR(10),
    is_weekend      BOOLEAN,
    fiscal_quarter  VARCHAR(8)
);

-- Fact table (narrow, additive metrics, surrogate FKs)
CREATE TABLE fact_orders (
    order_sk        BIGINT PRIMARY KEY,
    order_id        VARCHAR(64) NOT NULL,
    customer_sk     BIGINT REFERENCES dim_customer(customer_sk),
    product_sk      BIGINT REFERENCES dim_product(product_sk),
    date_sk         INT    REFERENCES dim_date(date_sk),
    quantity        INT,
    unit_price_usd  NUMERIC(10,2),
    discount_usd    NUMERIC(10,2),
    gmv_usd         NUMERIC(12,2)
);

-- The canonical BI query: daily GMV by region, last 30 days
SELECT
    d.full_date,
    c.region,
    SUM(f.gmv_usd) AS gmv
FROM fact_orders f
JOIN dim_customer c ON f.customer_sk = c.customer_sk
JOIN dim_date     d ON f.date_sk     = d.date_sk
WHERE d.full_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY d.full_date, c.region
ORDER BY d.full_date, c.region;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. dim_customer holds one row per customer; surrogate customer_sk decouples from the source customer_id so SCDs can be modelled without rewriting facts.
  2. dim_product holds one row per product; same surrogate-key pattern.
  3. dim_date is the canonical date dimension — generated once, joined to every fact. Holds is_weekend, fiscal_quarter, holiday flags.
  4. fact_orders is narrow — every column is either a surrogate FK or an additive metric (quantity, unit_price_usd, discount_usd, gmv_usd).
  5. The BI query is the canonical star-join shape: filter on dim_date, group by dim_date + dim_customer.region, sum fact_orders.gmv_usd. Sub-second on a warehouse with the right clustering.

Output (truncated to 3 rows).

full_date region gmv
2026-05-01 EMEA 1245678.90
2026-05-01 NA 2891234.50
2026-05-01 APAC 987654.32

Rule of thumb: star schemas are narrow facts + denormalised dims — never the other way around. Wide facts kill scan cost; normalised dims kill BI tools.

data warehouse architecture — the four senior signals

Signal 1 — explicit on schema-on-write vs schema-on-read. Senior engineers state the property by name; junior engineers say "the warehouse is structured". Schema-on-write is the property; structured is the outcome.

Signal 2 — naming the BI hot-path optimisations. "Snowflake clusters on (order_date, region), the BI tool's result cache lives in the SQL workbench, and partition pruning shrinks scans from 30 TB to 200 GB" — this is the senior answer.

Signal 3 — owning the cost model. "Snowflake billed in credits; one X-Small warehouse = 1 credit / hour ≈ $2-4. A 200-user dashboard concurrency burst spins up a 2X-Large = 32 credits / hour. Storage is on top at $23 / TB / month for compressed." — senior cost fluency.

Signal 4 — explicit on what not to put in the warehouse. "7 years of raw clickstream goes in S3 cold tier, not Snowflake. ML features get materialised to Parquet on S3, not into Snowflake tables. Image / PDF / audio payloads never enter the warehouse at all."

Data modeling
Topic — dimensional-modeling
Star-schema dimensional modeling

Practice →

SQL
Topic — aggregation
Aggregation patterns for BI workloads

Practice →

Solution Using a slowly-changing dimension type 2 + a narrow fact

Code.

-- Type-2 SCD on dim_customer: track region history without losing the past.
CREATE TABLE dim_customer (
    customer_sk     BIGINT PRIMARY KEY,
    customer_id     VARCHAR(64) NOT NULL,
    region          VARCHAR(32),
    signup_date     DATE,
    customer_tier   VARCHAR(16),
    valid_from      TIMESTAMP NOT NULL,
    valid_to        TIMESTAMP,                 -- NULL = currently active
    is_current      BOOLEAN   NOT NULL DEFAULT TRUE
);

-- Insert: close the previous row, insert a new row
WITH src AS (
    SELECT customer_id, region, customer_tier
    FROM staging.customers_today
), changed AS (
    SELECT s.*
    FROM src s
    LEFT JOIN dim_customer d
      ON  d.customer_id = s.customer_id AND d.is_current
    WHERE d.customer_id IS NULL                 -- net new customer
       OR d.region        != s.region           -- region changed
       OR d.customer_tier != s.customer_tier    -- tier changed
)
-- close the prior current row for any changed customer
UPDATE dim_customer d
SET valid_to = CURRENT_TIMESTAMP, is_current = FALSE
FROM changed c
WHERE d.customer_id = c.customer_id AND d.is_current;

INSERT INTO dim_customer (
    customer_sk, customer_id, region, customer_tier,
    valid_from, valid_to, is_current
)
SELECT
    nextval('dim_customer_sk_seq'),
    c.customer_id, c.region, c.customer_tier,
    CURRENT_TIMESTAMP, NULL, TRUE
FROM changed c;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

customer_id region customer_tier valid_from valid_to is_current
C001 EMEA gold 2025-01-01 2026-05-29 false
C001 NA gold 2026-05-29 NULL true
  1. src materialises today's customer snapshot from staging.
  2. changed LEFT JOINs against the current row in dim_customer; new + changed customers fall out.
  3. The UPDATE closes the prior current row by setting valid_to and flipping is_current.
  4. The INSERT writes a new surrogate-keyed row for each changed customer.
  5. Facts written before the region change still reference the old customer_sk; facts after reference the new one. This is the whole point of SCD type 2.

Output (one row after a region change).

customer_id region valid_from valid_to is_current
C001 NA 2026-05-29 NULL true

Why this works — concept by concept:

  • SCD type 2 — keeps a full history of dimension changes; without it, last quarter's GMV-by-region report rewrites itself when a customer moves regions.
  • Surrogate keyscustomer_sk decouples facts from natural keys; SCD type 2 only works because the SK is per-version, not per-customer.
  • is_current + valid_to — two complementary indicators; is_current is fast for BI lookups, valid_to is precise for point-in-time queries.
  • Narrow factfact_orders carries surrogate FKs, not denormalised columns; this is why the fact stays small even as dims grow rich.
  • CostO(N) per load over the changed-customers slice; on a million-row dimension with 0.5% daily churn, that is 5k row writes — trivial for any warehouse.

3. Data lake architecture — schema-on-read, ELT, open formats, cheap raw storage

Visual diagram of a data lake architecture — sources on the left, ELT into a multi-zone object store (raw / curated / sandbox) in the middle, with a catalog + permissions ribbon overlaid, and downstream consumers (Spark ML, query engines, exploratory notebooks) on the right; a small 'no ACID by default' warning chip; on a light PipeCode card.

data lake architecture — schema-on-read, ELT, multi-zone object storage

data lake architecture flips every warehouse assumption: data lands raw, fast, and cheap, and shape is imposed at read time, not write time. The defining property is schema-on-read. The pipeline is ELT (extract → load → transform — note the order). The storage layer is object storage (S3, ADLS Gen2, GCS, or on-prem HDFS), organised into zones (raw / curated / sandbox), and the file format is open (Parquet, Avro, ORC, JSON, CSV, plus raw blobs like images and PDFs). Compute is decoupled: any engine — Spark, Presto / Trino, Athena, Dremio, DuckDB — can read the files.

The four pillars of lake architecture.

  • schema-on-read — schema is imposed by the query engine at read time, not enforced at write. The cost: bad data lands; the win: ingestion is fast, format-agnostic, and survives upstream schema drift.
  • ELT pipeline — data lands raw, then gets transformed in place by Spark / dbt / SQL. Inverts the warehouse's ETL order.
  • multi-zone layout — raw / curated / sandbox; each zone has its own SLA, owner, and retention policy. The lake is not a swamp because of this discipline.
  • open file formatsParquet for columnar analytics, Avro for row-oriented streaming, ORC for Hive-era pipelines, plus raw JSON / CSV / images / PDFs. The format choice is yours, not the platform's.

The canonical zone layout.

  • Raw zone (raw/). Untouched extracts. One subfolder per source. Daily partitions by ingest date. No transformations. Owned by ingestion. Retention: 7+ years (compliance archive).
  • Curated zone (curated/). Cleansed, deduplicated, type-coerced. Owned by data engineering. The "trusted" lake surface that ML and SQL engines read.
  • Sandbox zone (sandbox/). Data scientist scratch space. Read access to curated; write access to personal subfolder. Auto-expires after 90 days.

The big-name implementations in 2026.

  • Amazon S3 + AWS Glue + Athena — the canonical AWS lake stack; Glue is the catalog, Athena the serverless SQL engine, S3 the storage. Pay-per-scan economics.
  • Azure Data Lake Storage Gen2 — hierarchical namespace over Blob Storage; query via Synapse Serverless, Databricks, or Microsoft Fabric.
  • Google Cloud Storage + BigLake — GCS for storage, BigLake for the federated catalog and IAM; query via BigQuery external tables or Dataproc Spark.
  • Hadoop HDFS — the on-prem incumbent; declining but still real in financial services, telco, and government. Often migrating to S3 / MinIO / Ozone.

Where lakes still win.

  • Cheap storage at petabyte scale. S3 Standard is $23 / TB / month; Glacier Deep Archive is $1 / TB / month. A warehouse cannot match this even before egress.
  • Any format. Parquet, Avro, ORC, JSON, CSV, MP4, JPEG, PDF, PCAP — the lake is format-agnostic.
  • ML training data. Spark, PyTorch, Ray, TensorFlow all read Parquet directly from S3 — no warehouse hop, no transformation pass.
  • Streaming sinks. Kafka → S3 via Kafka Connect or Flink is the canonical lake-landing pattern; millions of events per second land in raw zone.

Where lakes struggle.

  • No ACID by default. An UPDATE is "rewrite the partition". A concurrent reader during a write sees a half-rewritten partition. Mid-2010s lake outages were all this bug.
  • No schema enforcement. Parquet remembers the schema of the row group, not the table. Schema drift across files is your problem to detect.
  • BI consistency is shaky. "Why does the dashboard change while I'm reading it?" — because a partition was overwritten mid-query.
  • Small-file problem. Streaming sinks create thousands of small files per partition; query performance degrades; periodic compaction is a real operational tax.
  • Governance is bolt-on. IAM + Lake Formation + Ranger + Glue work, but require deliberate setup; warehouses ship governance by default.

Worked example — partition + file-format design for a clickstream lake

Detailed explanation. Real interviews ask "design the storage layout for 3 TB / day of clickstream". The answer is partitioning + file format + compaction policy — three decisions that determine whether the lake serves queries in 2 seconds or 2 hours.

Question. Design the S3 layout for a 3 TB / day clickstream pipeline that needs to support (a) Athena ad-hoc queries by event_date + country, (b) nightly Spark ML feature pipelines reading 90 days of history, and (c) 7-year compliance retention.

Input. Kafka → Kafka Connect S3 Sink → S3, ~30M events / sec peak. Each event is ~200 bytes JSON.

Code.

# S3 layout — partition by event_date and country; Parquet + Snappy.
s3://co-data-lake/raw/clickstream/
    event_date=2026-05-29/
        country=US/
            events_2026-05-29_US_001.parquet     # ~512 MB target
            events_2026-05-29_US_002.parquet
        country=GB/
            events_2026-05-29_GB_001.parquet
        country=IN/
            events_2026-05-29_IN_001.parquet
    event_date=2026-05-30/
        ...

# Daily compaction job — merge 100s of small files into 512 MB targets.
df = (spark.read.parquet("s3://co-data-lake/raw/clickstream/event_date=2026-05-29/")
              .repartition("country"))
(df.write
   .mode("overwrite")
   .partitionBy("event_date", "country")
   .option("maxRecordsPerFile", 5_000_000)
   .parquet("s3://co-data-lake/curated/clickstream/"))

# Lifecycle policy — auto-tier to Glacier after 90 days, expire after 7 years.
{
  "Rules": [
    {
      "Id": "clickstream-glacier",
      "Filter": {"Prefix": "raw/clickstream/"},
      "Status": "Enabled",
      "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
      "Expiration": {"Days": 2555}
    }
  ]
}
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Partition by event_date then country. Athena's predicate pushdown turns "WHERE event_date = '2026-05-29' AND country = 'US'" into reading one folder, not the whole lake.
  2. Parquet + Snappy. Parquet is columnar (4-10x smaller than JSON); Snappy is fast to decompress; together they make Athena scans cheap.
  3. 512 MB file target. S3 + Athena hate millions of 1 MB files; compaction merges them. The 512 MB target is the sweet spot for parallel-read engines.
  4. partitionBy("event_date", "country") — the Spark write fans out into the right folder structure.
  5. Lifecycle policy — auto-tier to Glacier after 90 days saves real money; expire after 7 years matches compliance.

Output (the resulting S3 listing for one day, one country).

key size storage_class
raw/clickstream/event_date=2026-05-29/country=US/events_001.parquet 512 MB STANDARD
raw/clickstream/event_date=2026-05-29/country=US/events_002.parquet 489 MB STANDARD
raw/clickstream/event_date=2026-05-29/country=US/events_003.parquet 503 MB STANDARD

Rule of thumb: every lake design boils down to partition for predicate pushdown, file size for parallel reads, lifecycle for cost — get those three right and the lake stays performant for years.

data lake architecture — the four senior signals

Signal 1 — partitioning explicit and bounded. Senior engineers know that partitioning by user_id creates millions of folders and kills the lake; partitioning by event_date + country creates ~1k folders per day and works. The rule: partition cardinality should be bounded and predicate-aligned.

Signal 2 — file size matters more than format. A 1 GB Parquet file outperforms a 1 MB Parquet file by 100x on a typical Athena scan. Senior engineers always own a compaction job; junior engineers ignore the small-file problem until it costs them a SEV-2.

Signal 3 — explicit on ACID gaps. Senior engineers state "the lake has no ACID without a table format on top — that's why we added Iceberg / Delta". Junior engineers either don't know or don't say.

Signal 4 — governance discipline, not just tooling. Senior engineers describe the IAM + Lake Formation + Glue policy stack and explain how column-level masking is enforced. Junior engineers say "S3 has IAM" and move on.

SQL
Topic — etl
ETL + ELT lake pipeline drills

Practice →

Streaming
Topic — streaming
Streaming + landing-zone drills

Practice →

Solution Using a three-zone lake with a Glue catalog + Athena

Code.

-- Glue catalog: register the curated zone as an external Athena table.
CREATE EXTERNAL TABLE curated.fact_clickstream (
    event_id        STRING,
    user_id         STRING,
    session_id      STRING,
    event_name      STRING,
    event_ts        TIMESTAMP,
    page_url        STRING,
    user_agent      STRING,
    revenue_usd     DECIMAL(10,2)
)
PARTITIONED BY (
    event_date      DATE,
    country         STRING
)
STORED AS PARQUET
LOCATION 's3://co-data-lake/curated/clickstream/'
TBLPROPERTIES (
    'parquet.compression' = 'SNAPPY',
    'projection.enabled'  = 'true',
    'projection.event_date.type'   = 'date',
    'projection.event_date.format' = 'yyyy-MM-dd',
    'projection.event_date.range'  = '2024-01-01,NOW',
    'projection.country.type'      = 'enum',
    'projection.country.values'    = 'US,GB,IN,DE,FR,BR,JP,AU'
);

-- The canonical analyst query: revenue by day + country, last 7 days.
SELECT
    event_date,
    country,
    SUM(revenue_usd) AS revenue
FROM curated.fact_clickstream
WHERE event_date >= CURRENT_DATE - INTERVAL '7' DAY
GROUP BY event_date, country
ORDER BY event_date, country;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

event_date country revenue
2026-05-23 US 1234567.89
2026-05-23 GB 234567.12
2026-05-23 IN 198765.43
2026-05-24 US 1298765.40
2026-05-24 GB 245678.90
  1. CREATE EXTERNAL TABLE registers an Athena view over the S3 prefix; no data is moved.
  2. PARTITIONED BY (event_date, country) matches the on-disk folder layout; Athena prunes accordingly.
  3. Partition projection (the projection.* properties) tells Athena to generate partitions from the schema instead of querying Glue per scan — turns 2-minute query startups into 500 ms.
  4. parquet.compression = SNAPPY is the default for Athena-on-S3; tradeoff favours decompression speed.
  5. The analyst query reads exactly the 56 partitions (7 days × 8 countries); Athena scans ~10% of the lake instead of the full 90 TB.

Output.

event_date country revenue
2026-05-23 US 1234567.89
2026-05-24 US 1298765.40
2026-05-25 US 1310987.65

Why this works — concept by concept:

  • Schema-on-read — the table definition lives in Glue, not on the files; you can swap the schema (add a column) without rewriting the lake.
  • External table — Athena owns no storage; it queries the open Parquet files in place. Compare to a warehouse, which owns both the format and the storage.
  • Partition projection — eliminates the Glue API roundtrip; cuts query startup from seconds to milliseconds on partitioned tables.
  • Snappy + Parquet — columnar + cheap decompression; the canonical lake format for analytical SQL.
  • CostO(P × S) where P = pruned partitions and S = scan size per partition; Athena bills $5 / TB scanned, so partition pruning directly = cost reduction.

4. Lakehouse architecture — open table formats (Delta/Iceberg/Hudi) + multi-engine compute

Visual diagram of a lakehouse architecture — a three-layer stack (object storage at the bottom, open table format Delta/Iceberg/Hudi in the middle, multi-engine compute on top); a unified catalog ribbon overlaid on the right; arrows from BI, SQL, streaming, and ML engines all reading from the same Delta tables; on a light PipeCode card.

lakehouse architecture — the three-layer stack that bridges lake economics + warehouse reliability

lakehouse architecture is the architecture that fixes the lake's biggest flaws — no ACID, no schema enforcement, no efficient UPDATE / DELETE, no time travel — without giving up cheap object storage or the multi-engine flexibility. The trick is a table format that sits on top of Parquet and adds a metadata log describing which files belong to which version of the table. The three open table formats that matter — Delta Lake (Databricks-origin), Apache Iceberg (Netflix-origin), Apache Hudi (Uber-origin) — all solve the same problem with different tradeoffs.

The three-layer lakehouse stack.

  • Layer 1 — object storage. Same S3 / ADLS Gen2 / GCS you'd use for a plain lake. The files are still Parquet; the lakehouse is additive, not a replacement.
  • Layer 2 — open table format. Delta / Iceberg / Hudi. Stores a transaction log + snapshot history alongside the data files; lets engines read a consistent version of the table even while another engine is writing.
  • Layer 3 — compute engines. Spark, Trino, Presto, Flink, DuckDB, Snowflake (Iceberg), BigQuery (BigLake / Iceberg), Redshift (Iceberg), Athena (Iceberg). All read the same tables; no data movement.

What the table format actually adds.

  • ACID transactionsINSERT, UPDATE, DELETE, MERGE are atomic; concurrent readers always see a consistent snapshot.
  • Schema enforcement + evolution — adding a column is metadata-only; dropping a column is supported; type promotion is bounded.
  • Time travelSELECT * FROM table VERSION AS OF 5 or TIMESTAMP AS OF '2026-05-01 00:00:00'; instant rollback and audit.
  • Hidden partitioningIceberg partitions on day(event_ts) without exposing a partition_date column; partition layout can evolve without rewriting facts.
  • Compaction + vacuum — built-in OPTIMIZE / VACUUM commands; no hand-rolled compaction job.
  • Statistics for query pruning — min/max/null-count per column per file; engines skip files without scanning them.

The three open table formats — strengths and trade-offs.

  • Delta Lake — strongest ecosystem inside Databricks; first-class on Databricks Unity Catalog; recently shipped UniForm so Delta tables read as Iceberg from other engines. Strength: deepest tooling on Databricks; trade-off: best cross-engine support requires UniForm.
  • Apache Iceberg — strongest cross-engine support; first-class in Snowflake, BigQuery, Redshift, Athena, Trino, Spark, Flink. Strength: vendor-neutrality (won the 2024-2026 format war on this axis); trade-off: less tightly integrated with any single platform than Delta is with Databricks.
  • Apache Hudi — strongest record-level upsert + CDC story; designed around incremental processing from day one; powers many of Uber's pipelines. Strength: best for streaming + CDC ingestion; trade-off: smaller community + ecosystem than Delta or Iceberg.

The big-name implementations in 2026.

  • Databricks (Delta + Unity Catalog) — the original lakehouse vendor; canonical end-to-end stack; deepest tooling around Delta.
  • Snowflake Iceberg tables — Snowflake reads and writes Iceberg; lets you store in your own S3 bucket while paying for Snowflake compute.
  • Microsoft Fabric + OneLake — Microsoft's lakehouse play; Delta-formatted, single-tenant lake per org, integrated with Power BI.
  • Google BigLake + Iceberg native tables — GCP's bridge between BigQuery storage and external lake / lakehouse; reads Iceberg / Delta on GCS.
  • Open OSS stackMinIO (or S3) + Iceberg + Nessie catalog + Trino / Spark + dbt; pure open source, no vendor lock.

Where lakehouses win — the modern default.

  • Mixed BI + ML + streaming on one storage layer. BI hits Iceberg via Trino; ML reads the same Iceberg via Spark; streaming writes via Flink — all on the same files.
  • Cost-effective at scale. Storage on S3 is cheap; compute is per-engine, per-workload, so you pay only for what runs.
  • Multi-engine flexibility. Cannot afford lock-in? Iceberg is the safest choice; the format is open and supported across all major engines.
  • Open formats + governance maturity. Unity Catalog, Nessie, and Polaris are converging on a real cross-engine catalog story; column masking + row filtering work across engines.

Where lakehouses still struggle.

  • Sub-second BI on 500-user dashboards. A warehouse's result cache still beats Trino-on-Iceberg on the BI hot path; many shops keep the warehouse as a serving layer in front of the lakehouse.
  • Tooling maturity for governance. Closing the gap fast, but warehouse-grade row-level security is still more mature on Snowflake / BigQuery than on Iceberg-via-Trino.
  • Operational complexity. Three layers (storage, table format, engine) means three places to debug. Warehouses are simpler.
  • Format choice is a long-term commitment. Picking Delta vs Iceberg vs Hudi at year 0 binds you for a decade.

Worked example — create an Iceberg table and run an ACID MERGE

Detailed explanation. Real interviews ask you to write the lakehouse equivalent of a warehouse MERGE. Below is the canonical Iceberg table + a MERGE INTO that performs an idempotent upsert — the shape every modern CDC pipeline takes.

Question. Create an Iceberg table for fact_orders on S3, then write an idempotent MERGE INTO that upserts a daily batch of new + updated orders from a Spark-loaded staging.orders_today view.

Input. Source staging.orders_today has 1.2M rows (98% net new, 2% updates to prior-day rows). Target fact_orders Iceberg table holds 600M rows across 24 months of history.

Code.

-- Create the Iceberg table on S3 with hidden partitioning by day(order_ts)
CREATE TABLE prod.fact_orders (
    order_id        BIGINT,
    customer_id     BIGINT,
    product_id      BIGINT,
    order_ts        TIMESTAMP,
    quantity        INT,
    unit_price_usd  DECIMAL(10,2),
    discount_usd    DECIMAL(10,2),
    gmv_usd         DECIMAL(12,2)
)
USING ICEBERG
PARTITIONED BY (days(order_ts))
LOCATION 's3://co-lakehouse/prod/fact_orders/'
TBLPROPERTIES (
    'write.format.default'         = 'parquet',
    'write.parquet.compression-codec' = 'zstd',
    'write.target-file-size-bytes' = '536870912'   -- 512 MB
);

-- Idempotent upsert: insert net-new, update changed, keep history intact
MERGE INTO prod.fact_orders AS tgt
USING staging.orders_today AS src
   ON tgt.order_id = src.order_id
WHEN MATCHED AND (
       tgt.quantity      != src.quantity
    OR tgt.unit_price_usd != src.unit_price_usd
    OR tgt.discount_usd   != src.discount_usd
    OR tgt.gmv_usd        != src.gmv_usd
) THEN UPDATE SET
    quantity       = src.quantity,
    unit_price_usd = src.unit_price_usd,
    discount_usd   = src.discount_usd,
    gmv_usd        = src.gmv_usd
WHEN NOT MATCHED THEN INSERT (
    order_id, customer_id, product_id, order_ts,
    quantity, unit_price_usd, discount_usd, gmv_usd
) VALUES (
    src.order_id, src.customer_id, src.product_id, src.order_ts,
    src.quantity, src.unit_price_usd, src.discount_usd, src.gmv_usd
);
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. USING ICEBERG tells Spark / Trino / Snowflake to use the Iceberg table format; the underlying files are still Parquet.
  2. PARTITIONED BY (days(order_ts)) is hidden partitioning — no explicit order_date column; Iceberg derives the partition value from order_ts automatically.
  3. write.target-file-size-bytes = 512 MB sets the engine's compaction target; files are rewritten to hit this size during OPTIMIZE.
  4. MERGE INTO is the canonical idempotent upsert; safe to re-run; atomic; ACID.
  5. WHEN MATCHED AND ... clause skips no-op updates — only rewrites files whose rows actually changed; this is the optimization that keeps daily MERGE jobs from rewriting the whole table.

Output (after the MERGE on a 1.2M-row batch).

outcome rows
inserted 1176000
updated 24000
files_rewritten 47
snapshot_id 8125094521

Rule of thumb: lakehouse MERGE is the modern equivalent of a warehouse UPSERT; once you can write it, you can run ACID CDC into a lake at warehouse-grade reliability.

lakehouse architecture — the four senior signals

Signal 1 — opinionated on format choice. "I default to Iceberg for multi-engine neutrality; Delta if the org is Databricks-first; Hudi only when record-level upsert at streaming velocity is the dominant requirement" — senior phrasing.

Signal 2 — quoting time-travel use cases. Time travel is not a party trick — it's how you recover from a bad transformation. "We rolled back the bad PR by RESTORE TABLE fact_orders TO VERSION AS OF 47; took 10 seconds; would have been a 4-hour restore on Redshift."

Signal 3 — owning compaction + vacuum cadence. "OPTIMIZE runs nightly to compact small files; VACUUM runs weekly with 7-day retention to keep storage bounded; both are idempotent and re-runnable."

Signal 4 — multi-engine reasoning, not single-vendor. "BI uses Trino-on-Iceberg for sub-second latency on cached aggregates; Spark runs the nightly ML pipeline on the same tables; Flink writes streaming CDC into the same Iceberg with MERGE. One storage layer, three engines."

Data modeling
Topic — slowly-changing-data
SCD + lakehouse upsert practice

Practice →

Company
Company — databricks
Databricks interview practice

Practice →

Solution Using an Iceberg snapshot + a time-travel rollback

Code.

-- 1) Discover the snapshot history (the audit trail every Iceberg table ships with).
SELECT
    snapshot_id,
    parent_id,
    operation,
    committed_at,
    summary['added-records']     AS added,
    summary['deleted-records']   AS deleted,
    summary['changed-partition-count'] AS changed_parts
FROM prod.fact_orders.snapshots
ORDER BY committed_at DESC
LIMIT 10;

-- 2) Query the table at a prior version (time travel).
SELECT COUNT(*)
FROM prod.fact_orders VERSION AS OF 47;

-- 3) Restore the table to the prior snapshot in a single transaction.
CALL system.rollback_to_snapshot('prod.fact_orders', 47);
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

snapshot_id operation committed_at added deleted
8125094521 append 2026-05-29 02:14 1176000 0
8125094520 overwrite 2026-05-28 02:11 1198432 1198432
8125094519 append 2026-05-27 02:09 1184502 0
47 append 2026-05-26 02:13 1167789 0
  1. prod.fact_orders.snapshots is a metadata table that ships with every Iceberg table — instant audit trail with zero extra plumbing.
  2. The VERSION AS OF clause reads the table as it existed at snapshot 47; no data was moved, no extra storage burned.
  3. The rollback_to_snapshot procedure rewrites only the metadata pointer — O(1) operation, atomic, ACID-safe.
  4. Concurrent readers continue reading the prior current snapshot until the rollback commits; no half-state visible.
  5. The rollback is itself a new snapshot — fully auditable; you can roll forward again if needed.

Output.

snapshot_id operation committed_at
8125094522 rollback_to_snapshot 2026-05-29 02:30

Why this works — concept by concept:

  • Snapshot metadata — Iceberg writes every commit as a new snapshot; the chain is the table's full history, zero extra cost.
  • Time travelVERSION AS OF lets you debug, audit, and rollback without restoring from backup; the warehouse equivalent is a multi-hour restore.
  • O(1) rollback — only the metadata pointer moves; underlying files are untouched until VACUUM cleans up orphans.
  • ACID across engines — Spark, Trino, and Snowflake all see the same snapshot consistently; lakehouse's biggest win over plain lakes.
  • CostO(1) metadata read for snapshot history; O(1) rollback; O(N) only on VACUUM. The math is why Iceberg / Delta dominate modern lakehouses.

5. Decision matrix — pick the right architecture per workload (with worked migration scenarios)

Three-column decision matrix comparing Warehouse, Lake, and Lakehouse across five rows — Best workload, Format support, ACID guarantees, Cost profile, Maturity; each cell is a colour-coded verdict pill; on a light PipeCode card.

data lake vs data warehouse vs lakehouse — the five-dimension decision matrix

This is the matrix you should be able to draw on a whiteboard from memory in any senior interview. Five dimensions × three architectures = fifteen cells; the verdict in each cell is the one-line answer interviewers reward.

The five-dimension decision matrix.

Dimension Warehouse Lake Lakehouse
Best workload BI / dashboards / SQL ML / raw archive / semi-structured Mixed BI + ML + streaming
Format support Structured + JSON Any format Any format + open tables
ACID guarantees Full ACID None by default ACID via Delta / Iceberg / Hudi
Cost profile Compute + storage bundled Cheapest storage Cheap storage + pay per engine
Maturity 30+ years (proven) 15+ years (proven) Modern + fast-evolving

Reading the matrix — three canonical decisions.

  • "My only workload is a BI dashboard for 500 concurrent users on structured SQL."Warehouse wins. ACID + result cache + BI integrations + governance maturity are all warehouse strengths. Snowflake, BigQuery, Redshift, or Synapse — pick by cloud.
  • "My only workload is ML training on 5 PB of raw clickstream + image data."Lake wins. Cheapest storage + any format + direct read from Spark / PyTorch. S3 + Glue + Athena, or ADLS + Synapse Serverless.
  • "I have mixed BI + ML + CDC + streaming on overlapping data."Lakehouse wins. Open Iceberg / Delta tables let every engine read the same files; storage stays cheap; ACID stays solid; format stays open.

The four-question decision tree (the senior shorthand).

  • Q1 — Is your workload 100% BI on structured SQL? Yes → warehouse. No → continue.
  • Q2 — Do you need ACID guarantees on lake-scale storage? Yes → lakehouse. No → continue.
  • Q3 — Do you need to share data across many compute engines without copying? Yes → lakehouse. No → continue.
  • Q4 — Default for everything else → lake (and revisit when ACID or BI consistency starts hurting).

Worked example — three real migration scenarios with cost + risk

Detailed explanation. Real interviews don't ask "which architecture" — they ask "how would you migrate". Below are three canonical migration scenarios with the steps, the order, and the rollback strategy each one ships with.

Question. Walk through three migrations end-to-end: (A) Redshift warehouse → Iceberg lakehouse on S3 + Trino; (B) S3 + Glue lake → Iceberg lakehouse + Snowflake serving; (C) Databricks Delta lakehouse → multi-engine Iceberg via UniForm.

Input. Each migration has a 50-100 TB starting footprint and a 90-day timeline. The success criterion is zero downtime for BI consumers and full cost parity within 6 months.

Code.

# Migration A — Redshift warehouse → Iceberg lakehouse on S3 + Trino
# (90-day plan; the most common 2026 migration)

migration_a_steps = [
    ("week_1",  "audit_redshift_tables",       "list top 200 tables by query volume + size"),
    ("week_2",  "unload_to_parquet_on_s3",     "UNLOAD ('SELECT ...') TO 's3://...' FORMAT PARQUET"),
    ("week_3",  "convert_parquet_to_iceberg",  "CALL system.add_files_from_table('parquet_table')"),
    ("week_4",  "stand_up_trino_endpoint",     "deploy Trino cluster with iceberg catalog"),
    ("week_5",  "dual_write_via_dbt",          "every model writes to both Redshift and Iceberg"),
    ("week_6",  "row_count_parity_tests",      "dbt tests on COUNT(*) + SUM(amount) for top 50 tables"),
    ("week_7",  "point_bi_at_trino",           "Tableau / Looker switch endpoint; smoke-test on top 20 dashboards"),
    ("week_8",  "monitor_2_weeks",             "watch query latency, cost, error rates"),
    ("week_9",  "cut_redshift_compute",        "pause cluster; keep storage tier for 30 days as rollback"),
    ("week_10", "decommission",                "drop Redshift cluster; finalise cost report"),
]

# Migration B — S3 + Glue lake → Iceberg lakehouse + Snowflake serving
migration_b_steps = [
    ("week_1",  "audit_glue_catalog",          "list tables, partitions, file counts"),
    ("week_2",  "convert_external_to_iceberg", "CREATE TABLE iceberg.x AS SELECT * FROM parquet.x"),
    ("week_3",  "switch_compaction_to_optimize","replace manual compaction with Iceberg OPTIMIZE"),
    ("week_4",  "configure_snowflake_iceberg", "CREATE EXTERNAL VOLUME + CREATE ICEBERG TABLE"),
    ("week_5",  "expose_iceberg_to_bi",        "Snowflake serves Iceberg to Power BI / Looker"),
    ("week_6",  "decommission_glue_metastore", "keep Glue for legacy Athena; new tables Iceberg-only"),
]

# Migration C — Databricks Delta → multi-engine Iceberg via UniForm
migration_c_steps = [
    ("week_1",  "enable_uniform_on_delta",     "ALTER TABLE x SET TBLPROPERTIES ('delta.universalFormat.enabledFormats'='iceberg')"),
    ("week_2",  "register_in_unity_catalog",   "tables now readable as Iceberg from external engines"),
    ("week_3",  "point_external_trino_at_uc",  "Trino reads via Iceberg catalog; same files, no copy"),
    ("week_4",  "validate_external_reads",     "row-count + checksum parity Databricks vs Trino"),
    ("week_5",  "open_data_to_partners",       "external partners read Iceberg without buying Databricks seats"),
]
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Migration A is the most common 2026 path — Redshift cost pressure + multi-engine requirements + cheap storage demand all point toward Iceberg on S3 + Trino. The 10-week plan is conservative; aggressive teams compress it to 6 weeks.
  2. Migration B is the "lake-grew-up" path — an existing S3 + Glue lake adds Iceberg for ACID + schema evolution, then uses Snowflake as a serving layer in front (Snowflake reads Iceberg natively as of 2024).
  3. Migration C is the "open the format" path — a Databricks shop enables UniForm so Delta tables also expose an Iceberg interface; external Trino / Snowflake / BigQuery clients read the same files without buying Databricks seats.
  4. Common pattern — every migration includes a dual-write window + parity tests + a rollback tier kept for one quarter. The single biggest mistake is cutting the old system before parity is proven.
  5. Cost reality — migrations A and B typically pay back inside 6-12 months; migration C is mostly a feature-unlock, not a cost play.

Output (the migration tracker for Migration A at week 7).

step status parity_pass cost_so_far_usd
audit_redshift_tables done n/a 0
unload_to_parquet_on_s3 done n/a 8400
convert_parquet_to_iceberg done n/a 1200
stand_up_trino_endpoint done n/a 4500/mo
dual_write_via_dbt done yes (50/50) 2200/mo
row_count_parity_tests done yes 0
point_bi_at_trino in_progress n/a 0

Rule of thumb: migrations are won by dual-writing + parity tests + a rollback tier, not by clever code. Every senior plan includes all three.

lakehouse architecture — the four senior migration signals

Signal 1 — explicit dual-write window. "We dual-wrote for two weeks while BI still pointed at Redshift; cut over only after row-count + SUM parity passed on 50 critical tables." Senior teams never cut over without a parity gate.

Signal 2 — keep the old system as a rollback tier. "We paused the Redshift cluster but kept the storage tier for 30 days; cost was $X / month for insurance; we never needed it but the option mattered." Senior teams budget for the rollback.

Signal 3 — migration order matters. "Migrate cold tables first (low risk), warm tables second (medium risk), hot BI tables last (highest risk)." Senior teams sequence by blast radius.

Signal 4 — measurable success criterion. "Success = cost parity within 6 months + zero downtime for BI + 100% of top-50 dashboards passing smoke tests." Junior teams say "migrate the data"; senior teams say "hit these three numbers".

SQL
Topic — etl
ETL + migration pipeline drills

Practice →

Company
Company — snowflake
Snowflake interview practice

Practice →

Solution Using a workload-to-architecture decision tree + parity-gated migration

Code.

# A reusable decision-tree + migration-gate harness.
# Inputs: workload spec; outputs: architecture verdict + migration steps.

WORKLOADS = [
    {"name": "exec_dashboard",  "consumers": 500, "latency_ms": 800, "data_tb": 5,   "formats": ["sql"]},
    {"name": "ml_training",     "consumers": 8,   "latency_ms": 5000, "data_tb": 50, "formats": ["parquet","jpeg"]},
    {"name": "cdc_ingest",      "consumers": 4,   "latency_ms": 60_000, "data_tb": 80, "formats": ["json","parquet"]},
    {"name": "regulatory_archive","consumers":1,  "latency_ms": 600_000,"data_tb":300,"formats": ["parquet"]},
]

def pick_architecture(w):
    if w["consumers"] > 100 and "sql" in w["formats"] and w["latency_ms"] < 1500:
        return "warehouse_or_lakehouse"
    if w["data_tb"] > 100 and w["latency_ms"] > 60_000:
        return "lake"
    if any(f in w["formats"] for f in ("parquet","json","jpeg")) and w["latency_ms"] >= 5000:
        return "lake_or_lakehouse"
    return "lakehouse"

def parity_check(src_table, tgt_table):
    # Row-count + SUM(amount) parity within 0.01% tolerance
    sql = f"""
    SELECT
        ABS((SELECT COUNT(*) FROM {src_table}) - (SELECT COUNT(*) FROM {tgt_table})) AS row_delta,
        ABS((SELECT COALESCE(SUM(amount),0) FROM {src_table})
            - (SELECT COALESCE(SUM(amount),0) FROM {tgt_table}))
        / NULLIF((SELECT COALESCE(SUM(amount),0) FROM {src_table}), 0) AS rel_amt_delta
    """
    row = engine.execute(sql).first()
    return row.row_delta == 0 and row.rel_amt_delta < 0.0001

for w in WORKLOADS:
    print(w["name"], "->", pick_architecture(w))
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

workload consumers latency_ms data_tb verdict
exec_dashboard 500 800 5 warehouse_or_lakehouse
ml_training 8 5000 50 lake_or_lakehouse
cdc_ingest 4 60000 80 lakehouse
regulatory_archive 1 600000 300 lake
  1. pick_architecture codifies the four-question decision tree as Python; one branch per workload class.
  2. exec_dashboard lands in warehouse_or_lakehouse — many consumers + sub-second latency demand a hot query engine.
  3. ml_training lands in lake_or_lakehouse — non-SQL formats + tolerance for 5-second latency means the lake's economics win.
  4. cdc_ingest lands in lakehouse — mixed formats + need for ACID upserts at minute-level latency means the lakehouse is the only architecture that fits.
  5. regulatory_archive lands in lake — cold storage + minute-tier latency tolerance + single consumer means even a lakehouse is overkill.
  6. parity_check is the gate every migration step runs before promoting; the 0.0001 tolerance band tolerates floating-point noise without masking real drift.

Output.

workload verdict
exec_dashboard warehouse_or_lakehouse
ml_training lake_or_lakehouse
cdc_ingest lakehouse
regulatory_archive lake

Why this works — concept by concept:

  • Workload-spec inputs — turns architecture choice into a function of (consumers, latency, data size, formats); senior answers always tie the choice to measurable workload properties.
  • Decision tree — codifies the four-question shorthand so every team-mate gets the same answer for the same workload.
  • Parity-gated migration — every step is conditional on row-count + value parity passing; no step ships without proof.
  • Tolerance band0.0001 is the senior-grade default; raw equality would block on harmless floating-point noise.
  • CostO(1) per workload to run pick_architecture; O(N) per table for parity_check; the function is the artefact you point at when someone asks "why did we pick X for workload Y".

Choosing the right architecture (cheat sheet)

A one-screen cheat sheet for data lakehouse vs data warehouse vs data lake — pick the architecture that matches the workload you actually have.

You want to support … Architecture Canonical stack Why
Sub-second BI on structured SQL, 500 concurrent users Warehouse Snowflake / BigQuery / Redshift / Synapse Result cache + BI vendor integrations + ACID maturity
Cheap petabyte-scale raw archive Lake S3 + Glacier + Glue catalog $1-23 / TB / month; no other architecture comes close
ML training on raw multi-format data Lake or Lakehouse S3 + Spark + (optional Iceberg) Spark / PyTorch read Parquet directly; lakehouse adds ACID for shared tables
Mixed BI + ML + CDC + streaming on one storage layer Lakehouse S3 + Iceberg + Trino + Spark + Flink One storage, many engines, ACID across all
Multi-engine reads without data copies Lakehouse Iceberg + Unity / Polaris / Nessie Open format + cross-engine catalog
ACID upserts at lake economics Lakehouse Iceberg or Delta + Spark MERGE Atomic MERGE INTO on object storage
Time travel + auditable rollback Lakehouse Iceberg / Delta snapshot history VERSION AS OF instead of restoring from backup
Fine-grained governance (row + column security) Warehouse first, Lakehouse if open is mandatory Snowflake masking policies / Unity Catalog Warehouse-grade governance still slightly ahead
Sub-millisecond OLTP transactional reads Neither (use OLTP DB) PostgreSQL / MySQL / DynamoDB None of the three analytical architectures fit OLTP
Real-time fraud scoring on streaming events Lakehouse + streaming engine Iceberg + Flink + feature store Stream into Iceberg; consume with Flink ML pipeline
Cross-cloud portability of data Lakehouse Iceberg on S3 / ADLS / GCS Open format avoids vendor lock-in on the storage layer
Mature 7-year regulatory archive Lake (cold tier) S3 Glacier + Glue catalog $1 / TB / month + queryable on-demand via Athena
Migration off Teradata / Oracle Warehouse-first, then Lakehouse Snowflake / BigQuery, later Iceberg Land in modern warehouse first; open the format later
Cost-pressure relief on existing Snowflake / Redshift Lakehouse migration Iceberg on S3 + Trino + Snowflake-as-serving Cuts storage cost 5-10x without losing BI surface

Frequently asked questions

What is the difference between a data lakehouse, a data warehouse, and a data lake?

A data warehouse is a structured, ACID-compliant, schema-on-write store optimised for BI and SQL analytics (Snowflake, BigQuery, Redshift, Synapse); a data lake is a cheap, schema-on-read object store that lands data in any format and lets ML / SQL engines read it directly (S3, ADLS Gen2, GCS + Glue / Athena); a lakehouse is a lake plus an open table format (Delta, Iceberg, Hudi) that adds ACID, schema enforcement, time travel, and efficient UPDATE / DELETE so the same storage layer can serve BI and ML and streaming. In 2026 most enterprises run all three side by side — a warehouse for the BI hot path, a lake for cold archive and raw ML data, and a lakehouse as the shared storage spine. The right architecture is always per workload, not blanket.

When should I use a lakehouse instead of a data warehouse?

Use a lakehouse when (a) you need multi-engine flexibility — Spark, Trino, Snowflake, and BigQuery all reading the same tables; (b) your storage cost is dominated by cold or semi-structured data that the warehouse charges a premium for; (c) you have mixed BI + ML + streaming workloads that want to share data without copying; or (d) vendor neutrality on the storage layer is a strategic requirement. Use a warehouse when your workload is 100% structured BI on a small concurrency-heavy set of dashboards and sub-second latency matters more than storage cost. The hybrid pattern most teams adopt in 2026 is lakehouse as the shared storage spine + warehouse as the BI serving layer in front — best of both, no architecture forced to serve every workload.

What are the main lakehouse table formats and how do I choose between Delta, Iceberg, and Hudi?

The three open table formats are Delta Lake (Databricks-origin), Apache Iceberg (Netflix-origin), and Apache Hudi (Uber-origin) — all add ACID, schema evolution, time travel, and efficient MERGE on top of Parquet files on object storage. Pick Iceberg as the default if you want cross-engine neutrality — Snowflake, BigQuery, Redshift, Athena, Trino, Spark, and Flink all read Iceberg natively. Pick Delta if you are Databricks-first — the tooling, performance optimisations, and Unity Catalog integrations are deepest there (and UniForm lets Delta tables read as Iceberg from external engines). Pick Hudi when record-level upsert + CDC at streaming velocity is the dominant requirement — its Streamer API and merge-on-read storage type were designed for that case. The 2026 community trend: Iceberg won the neutrality race, Delta won the Databricks ecosystem, Hudi remains best-in-class for streaming CDC.

Does a lakehouse really replace a data warehouse for BI workloads?

For most BI workloads, yes — Trino or Databricks SQL on an Iceberg / Delta table delivers the dashboards, ACID, and partition pruning that a warehouse does. For high-concurrency, sub-second BI on cached aggregations (think: 500-user executive dashboards), warehouses still have an edge because of the result cache and purpose-built BI vendor integrations. The pragmatic pattern is lakehouse as the storage spine + warehouse as the BI serving layer — store data once in Iceberg, then load (or live-query via external table) the hot aggregates into Snowflake / BigQuery for the dashboard front-end. This hybrid gives you lake economics on the cold + raw data and warehouse performance on the BI hot path, without forcing one architecture to do everything.

How does ETL change between a warehouse, a lake, and a lakehouse?

In a warehouse the pipeline is classic ETL — extract from sources, transform in a dedicated tool (Informatica, Talend, dbt, or hand-rolled), load clean data into staging → ODS → marts. Schema-on-write means transformations must succeed before data lands. In a lake the pipeline inverts to ELT — extract, load raw, then transform later with Spark / dbt / SQL on the raw zone; schema-on-read means bad data lands and is filtered downstream. In a lakehouse the pipeline is also ELT but with ACID atopMERGE INTO iceberg_table USING staging is the idempotent canonical pattern; you keep lake flexibility and warehouse-grade transactional guarantees. The senior takeaway: ELT into a lakehouse with MERGE is the modern default; pure ETL into a warehouse is still right for narrow BI-only workloads; pure ELT into a raw lake is still right for ML and archival.

What is the typical cost difference between a data lake, a data warehouse, and a lakehouse at petabyte scale?

At petabyte scale, storage cost dominates and the ranking is fairly stable. A lake on S3 Standard costs roughly $23 / TB / month; with cold-tier (Glacier Deep Archive) the cold portion drops to ~$1 / TB / month. A lakehouse (Iceberg on S3 + Trino / Spark) costs the same storage as the lake, plus pay-per-use compute on whichever engines you run (typically $20-60 / TB scanned via Trino or Athena). A warehouse like Snowflake or Redshift charges a storage premium of 5-10x over raw S3 ($40-80 / TB / month for compressed) and bundles compute via virtual-warehouse credits ($2-4 / credit-hour for an X-Small, scaling up). In practice teams migrating from a 1 PB Redshift footprint to Iceberg on S3 + Trino report 40-70% cost reduction with no loss of BI surface — exact numbers depend on workload mix, concurrency, and how aggressively cold data is tiered.


Practice on PipeCode

PipeCode ships 450+ data-engineering interview problems — including SQL + Python + data-modeling drills keyed to the exact data lakehouse vs data warehouse skill set this guide teaches (star-schema design, partition + file-format choice on lakes, Delta / Iceberg / Hudi upsert patterns, multi-engine ACID, BI vs ML workload mapping, migration parity tests). Whether you're drilling data lake vs data warehouse questions the night before a screen or grinding the architecture-selection decision tree over 12 weeks of prep, the practice library mirrors the same five-dimension mental model — plus the Spark, Trino, Snowflake, Databricks, BigQuery, and Redshift tooling you'll wire into a real production lakehouse.

Kick off via Explore practice →; drill the data-modeling practice lane →; fan out into the ETL pipeline drills →; rehearse dimensional-modeling patterns →; reinforce aggregation reconciliation drills →; widen coverage on the full SQL practice library →; or stress-test with Databricks-specific drills → and Snowflake-specific drills →.

Top comments (0)