The data lakehouse vs data warehouse debate is the architecture decision every modern data team makes — and it does not have a single winner, only the right answer per workload. The three architectures — data warehouse, data lake, lakehouse — each evolved to solve a specific failure mode of the one that came before, and each one still wins inside its lane: warehouses dominate BI and dashboards, lakes dominate cheap raw storage and ML, lakehouses dominate mixed workloads that need both. The right way to compare them is not "which is best" but rather "which storage layer, which compute engine, and which transactional guarantees fit my workload — and what does the migration path between them actually cost".
This guide walks the three architectures end-to-end at deep-guide depth — data lake vs data warehouse at the storage / ingest / schema / governance layer, lakehouse architecture at the open-table layer (Delta, Iceberg, Hudi), and data warehouse architecture vs data lake architecture at the engine and cost-profile layer — with a five-dimension decision matrix, three worked migration scenarios, and SQL / Python snippets that match the exact shapes panelists ask in senior data-platform interviews. By the end you will be able to defend any of the three on the right workload, name the failure mode each was invented to solve, quote the cost-and-ACID tradeoffs from memory, and walk through a real migration without hand-waving.
When you want hands-on reps immediately after reading, browse data-modeling practice →, drill ETL pipeline problems →, sharpen dimensional-modeling drills →, rehearse aggregation patterns for BI workloads →, reinforce database design problems →, or widen coverage on the full SQL practice library →.
On this page
- Why the three-architecture comparison matters in 2026
- Data warehouse architecture — schema-on-write, ETL, star schema, BI-first
- Data lake architecture — schema-on-read, ELT, open formats, cheap raw storage
- Lakehouse architecture — open table formats (Delta/Iceberg/Hudi) + multi-engine compute
- Decision matrix — pick the right architecture per workload (with worked migration scenarios)
- Choosing the right architecture (cheat sheet)
- Frequently asked questions
- Practice on PipeCode
1. Why the three-architecture comparison matters in 2026
data lakehouse vs data warehouse — three architectures, three failure modes, one decision per workload
The one-sentence invariant: the three analytical architectures are not competitors — they are a historical sequence, each one invented to solve the failure mode of the one before, and the modern stack in 2026 typically runs at least two of them side by side. A senior data engineer does not say "warehouses are dead, lakehouses won"; they say "warehouses still serve BI fastest, lakes still archive raw cheapest, and lakehouses bridge both with open table formats — pick by workload, not by hype-cycle".
The historical sequence at a glance.
-
1980s-2010s — data warehouse era.
Teradata,Oracle Exadata, thenRedshift/Snowflake/BigQuery/Synapse. Won at: BI, dashboards, structured SQL, ACID guarantees, fine-grained governance. Failed at: cheap raw storage, semi-structured data (JSON / Avro), ML feature pipelines, multi-engine flexibility, ingestion velocity. -
2010s-2020s — data lake era.
Hadoop HDFS, thenS3+Glue+Athena,ADLS Gen2,GCS. Won at: cheap storage at any scale, raw archival, any file format, ML training data, schema-on-read flexibility. Failed at: ACID transactions, schema enforcement, BI consistency, fine-grained updates, governance maturity. -
2020s-now — lakehouse era.
Databricks Delta Lake,Apache Iceberg,Apache Hudi,Snowflake Iceberg tables,BigLake,Microsoft Fabric. Won at: lake economics + warehouse reliability, ACID on object storage, multi-engine reads of the same tables, open formats, unified catalog. Trade-offs: still maturing tooling, table-format choice is a long-term commitment, governance bolt-on requires extra effort.
What changed in 2026 that makes this comparison different from 2018.
-
Open table formats matured.
Delta Lake3.x withUniFormreads asIceberg;Icebergv3 ships inSnowflake,BigQuery,Redshift, andAthena;Hudi1.0 finalised itsStreamerAPI. Open tables are no longer a Databricks-only story. -
Warehouses embraced lake formats.
Snowflakereads and writesIceberg;BigQueryhasBigLakeandIcebergnative tables;RedshiftqueriesIceberg-on-S3directly. The warehouse vs lake wall fell. -
Lakes got ACID. Before
Delta/Iceberg, anUPDATEon a lake meant rewriting a partition by hand; today,UPDATE,DELETE,MERGE, and time-travel are first-class on object storage. -
Compute fully separated from storage. Spark, Trino, Presto, Flink, DuckDB, Snowflake, BigQuery, Athena, ClickHouse — multiple engines read the same
Icebergtable from the sameS3bucket with the same governance. - Cost pressure forced honesty. Warehouses still bundle compute + storage (or charge a premium for storage); lake / lakehouse stacks decouple them. At petabyte scale the difference is six figures a year.
Who should read which comparison.
-
data lake vs data warehouse— read section 2 + section 3; the classic 2015-2020 debate, still relevant when a team is choosing its first analytical platform. -
data lakehouse vs data warehouse— read section 2 + section 4; the 2022-now debate, relevant when migrating offRedshift/Synapsefor cost or flexibility reasons. -
data lake vs data lakehouse— read section 3 + section 4; the 2021-now debate, relevant when an existing lake's lack of ACID and BI consistency starts hurting. - All three at once — read the full guide; the modern reality is hybrid, and senior interviews expect you to defend the choice across all three lanes.
Worked example — map a single workload onto all three architectures
Detailed explanation. A canonical interview prompt is "a marketplace wants daily GMV dashboards, monthly cohort retention, and real-time fraud scoring — design the data platform". The honest answer touches all three architectures, and the worked example below walks the mapping cell by cell.
Question. A marketplace ships 3 TB / day of clickstream events, 80 GB / day of OLTP CDC, and needs (a) an executive GMV dashboard refreshed every 15 minutes, (b) monthly cohort retention reports run by analysts, and (c) a fraud-scoring ML pipeline that retrains nightly on 6 months of raw events. Which architecture serves each workload, and how do they share data?
Input. Three workloads, three SLAs, one storage layer. Source systems: PostgreSQL OLTP (CDC via Debezium), Kafka clickstream (1 M events / sec peak), and the SaaS billing API (hourly REST pulls).
Code.
-- A canonical workload-to-architecture mapping table.
CREATE TABLE workload_architecture_map AS
SELECT * FROM (VALUES
('exec_gmv_dashboard', '15 min', 'warehouse_or_lakehouse', 'star-schema fact_orders', 'BI engine'),
('monthly_cohort_retention', '1 day', 'lakehouse', 'iceberg fact_events + dim_user', 'spark sql'),
('fraud_ml_training', '1 day', 'lake_or_lakehouse', 'parquet partitioned by event_dt', 'spark mllib'),
('raw_event_archive_7y', 'n/a', 'lake', 'parquet glacier-tiered', 'cold storage')
) AS t(workload, sla, architecture, storage_layout, engine);
Step-by-step explanation.
-
exec_gmv_dashboardlives in the warehouse lane or the lakehouse lane; either serves star-schema BI at 15-minute latency. The warehouse wins on raw query speed; the lakehouse wins on cost-per-TB if the data already lives inS3. -
monthly_cohort_retentionlives in the lakehouse lane; analysts can query the sameIcebergtable the GMV dashboard reads, plus historical depth that would be prohibitive to keep in the warehouse. -
fraud_ml_traininglives in the lake lane or the lakehouse lane; ML engineers need rawParquetpartitioned byevent_dt, andSpark MLlibreads it directly without going through a warehouse engine. -
raw_event_archive_7ylives in the lake lane with cold-tierS3 Glacier; warehouses charge real money to keep 7 years of clickstream that is read twice a year. - The shared storage layer is the punchline —
S3+Iceberglets all four workloads sit on top of the same files with different engines.
Output (the workload map).
| workload | sla | architecture | engine |
|---|---|---|---|
| exec_gmv_dashboard | 15 min | warehouse_or_lakehouse | BI engine |
| monthly_cohort_retention | 1 day | lakehouse | spark sql |
| fraud_ml_training | 1 day | lake_or_lakehouse | spark mllib |
| raw_event_archive_7y | n/a | lake | cold storage |
Rule of thumb: never force one architecture to serve all workloads — the senior answer is "lakehouse as the storage spine + a warehouse for the BI hot path + the lake's cold tier for archive".
data lake vs data warehouse — the four senior signals that separate hype from substance
Signal 1 — opinionated workload mapping, not blanket claims. Senior engineers do not say "lakehouses replace warehouses"; they say "lakehouses replace the warehouse's archival and ML lanes, but a real-time BI dashboard on 500 concurrent users still benefits from a warehouse's query engine and result cache".
Signal 2 — quoting the open-table-format tradeoffs, not just naming them. Junior answers list Delta, Iceberg, Hudi without distinction. Senior answers say "Delta has the strongest ecosystem inside Databricks; Iceberg has the strongest cross-engine support and is winning on neutrality; Hudi has the best record-level upsert and CDC story but a smaller community".
Signal 3 — cost-and-egress reasoning, not feature checklists. Senior engineers reason about storage cost per TB-month, compute cost per TB-scanned, egress between regions, and the hidden cost of keeping data in the warehouse format (Snowflake's storage premium over raw S3 is ~5-10x). Junior engineers compare feature lists.
Signal 4 — migration realism. When asked "how would you migrate from Redshift to a lakehouse", junior engineers say "copy the tables to S3 as Iceberg". Senior engineers say "unload to S3 as Parquet, convert to Iceberg in place, dual-write for two weeks while the BI tools point at the warehouse, cut BI over to a Trino-on-Iceberg endpoint, retire Redshift compute, keep storage tier for one quarter as rollback insurance".
SQL
Topic — etl
ETL pipeline drills
Data modeling
Lane — data-modeling
Data modeling practice library
Solution Using a five-dimension architecture scorecard
Code.
-- One canonical scorecard — every architecture scored on five dimensions.
CREATE TABLE architecture_scorecard AS
SELECT * FROM (VALUES
('warehouse', 'best_workload', 'BI / dashboards / SQL'),
('warehouse', 'format_support', 'structured + JSON'),
('warehouse', 'acid_guarantees', 'full ACID'),
('warehouse', 'cost_profile', 'compute + storage bundled'),
('warehouse', 'maturity', '30+ years'),
('lake', 'best_workload', 'ML / raw archive / semi-structured'),
('lake', 'format_support', 'any format'),
('lake', 'acid_guarantees', 'none by default'),
('lake', 'cost_profile', 'cheapest storage'),
('lake', 'maturity', '15+ years'),
('lakehouse', 'best_workload', 'mixed BI + ML + streaming'),
('lakehouse', 'format_support', 'any format + open tables'),
('lakehouse', 'acid_guarantees', 'ACID via Delta / Iceberg / Hudi'),
('lakehouse', 'cost_profile', 'cheap storage + pay per engine'),
('lakehouse', 'maturity', 'modern + fast-evolving')
) AS t(architecture, dimension, verdict);
Step-by-step trace.
| architecture | dimension | verdict |
|---|---|---|
| warehouse | best_workload | BI / dashboards / SQL |
| warehouse | format_support | structured + JSON |
| warehouse | acid_guarantees | full ACID |
| warehouse | cost_profile | compute + storage bundled |
| warehouse | maturity | 30+ years |
| lake | best_workload | ML / raw archive / semi-structured |
| lake | format_support | any format |
| lake | acid_guarantees | none by default |
| lake | cost_profile | cheapest storage |
| lake | maturity | 15+ years |
| lakehouse | best_workload | mixed BI + ML + streaming |
| lakehouse | format_support | any format + open tables |
| lakehouse | acid_guarantees | ACID via Delta / Iceberg / Hudi |
| lakehouse | cost_profile | cheap storage + pay per engine |
| lakehouse | maturity | modern + fast-evolving |
- Row 1-5 (warehouse) — five clean wins on BI, format-strict, full ACID; pay the cost-profile premium for those.
- Row 6-10 (lake) — cheapest storage, every format, but ACID is on you to enforce; great for ML, dangerous for BI.
- Row 11-15 (lakehouse) — bridges both lanes; the cost-profile is "cheap storage + you pay per engine", which is the senior tradeoff every CFO asks about.
- The matrix is the artefact you draw on the whiteboard when someone asks "compare warehouse vs lake vs lakehouse".
- Memorise the 15 cells; senior interviewers expect you to recite the row for any dimension on demand.
Output.
| architecture | best_workload | acid_guarantees | cost_profile |
|---|---|---|---|
| warehouse | BI / dashboards / SQL | full ACID | compute + storage bundled |
| lake | ML / raw archive / semi-structured | none by default | cheapest storage |
| lakehouse | mixed BI + ML + streaming | ACID via Delta / Iceberg / Hudi | cheap storage + pay per engine |
Why this works — concept by concept:
- Five-dimension scorecard — turns a fuzzy "which is best" question into 15 scored cells; interviewers love a tester who can recite the matrix instead of waving.
- Best-workload binding — pairs each architecture with the workload it wins at, not the workloads it tolerates; this is the discipline that separates senior answers from blog summaries.
- ACID column — explicit on which architectures ship full ACID by default; the lake row's "none by default" is the single most consequential cell in the whole matrix.
- Cost profile — exposes the unbundled-storage reality; modern stacks live or die on whether storage is bundled with compute.
-
Cost —
O(1)to read the scorecard; the actual workloads have their own runtime costs but the decision itself is constant-time.
2. Data warehouse architecture — schema-on-write, ETL, star schema, BI-first
data warehouse architecture — schema-on-write, ETL, ODS, marts, BI
data warehouse architecture is the architecture that defined analytics for thirty years and still wins on BI workloads today. The defining property is schema-on-write: data is shaped before it lands. Every column is typed, every constraint is enforced, every row passes ACID. The pipeline is ETL (extract → transform → load) — transformations happen before the warehouse, not after — and the canonical layout is staging → ODS → star-schema marts with BI tools (Power BI, Tableau, Looker) reading the marts.
The four pillars of warehouse architecture.
-
schema-on-write— every column type, nullability, PK, and FK is enforced on write; an attempted insert with the wrong type fails. The cost: ingestion is slower; the win: every downstream query sees a clean shape. -
ETL pipeline— transformations happen in a dedicated tool (Informatica,Talend,dbt, hand-rolled Python / SQL) before data lands in the warehouse. Compare to ELT in lakes, where data lands raw and is transformed later. -
star schema— fact tables (events) joined to dimension tables (entities) via surrogate keys;fact_ordersjoinsdim_customer,dim_product,dim_date. Optimised for theGROUP BY ... SUM(...) ... JOIN dim_xshape that 90% of BI queries take. -
ACID + governance— full transactional semantics (INSERT,UPDATE,DELETEare atomic), plus row- and column-level access control, audit logs, and lineage. The warehouse is the most trustworthy data surface in the company.
The canonical layered layout.
- Layer 1 — staging tables. Raw extracts from sources, typed but not modelled. Truncate-and-reload daily. Owned by ingestion engineers.
- Layer 2 — ODS / EDW (Operational Data Store / Enterprise Data Warehouse). Normalised in 3NF; one row per real-world entity. Owned by data engineers.
-
Layer 3 — marts. Denormalised star or snowflake schemas keyed by analytic subject area (
finance_mart,marketing_mart,product_mart). Owned by analytics engineers. -
Consumers. BI tools, operational reports, embedded analytics, and
dbtmacros that compose mart-level metrics.
The big-name implementations in 2026.
-
Snowflake— cloud-native, separation of compute and storage inside a closed format, virtual warehouses (clusters) per workload, multi-cluster auto-scaling. Most popular in 2026. -
BigQuery— serverless, scan-based pricing,Capacitorcolumnar format, decoupled storage inGoogle Cloud Storage. Strongest on ad-hoc analytical SQL. -
Redshift— AWS-native, recently addedRA3(decoupled storage),Spectrum(S3 query), andIcebergtable support. Still common in AWS-only shops. -
Synapse— Azure-native, blended SQL pool + Spark pool, now folded intoMicrosoft Fabric(which is itself moving toward lakehouse). -
Teradata/Oracle Exadata— on-prem incumbents; still dominant in banking + telco; the systems that defined the term "data warehouse".
Where warehouses still win.
-
BI workloads with strict latency. A
Tableaudashboard serving 500 concurrent users needs sub-second response on cached aggregations; the warehouse's result cache and BI-vendor integrations make this trivial. - Strictly structured + small JSON. When all data is relational and JSON is the occasional column, warehouses serve it with full ACID and SQL semantics. Once JSON is the primary shape, lakes win.
- Fine-grained governance. Column masking, row-level security, audit trails — mature in warehouses, still bolt-on in lake stacks.
- Financial close + regulatory reporting. SOX / GAAP-grade auditability needs ACID + immutable history + lineage — the warehouse heritage.
Where warehouses struggle.
- Petabyte-scale raw archive. Storing 7 years of clickstream at Snowflake list price is six figures a month; the same data on S3 cold tier is four figures.
- Semi-structured / unstructured data. Logs, images, PDFs, IoT payloads — possible in warehouses but expensive and awkward.
-
ML feature engineering.
Spark,Ray, andPyTorchwant to read rawParquetdirectly; pulling through a warehouse adds latency and cost. -
Multi-engine flexibility. A warehouse is one engine; you cannot point
Trino,Spark, andDuckDBat the same warehouse table without paying for additional compute (or moving data).
Worked example — design a star schema for an e-commerce GMV mart
Detailed explanation. Real interviews ask you to lay out the star schema for a specific subject area. Below is the canonical e-commerce fact_orders mart with three dimension tables — the shape that 90% of warehouse BI queries take.
Question. Design a fact_orders star-schema mart for an e-commerce business. Include the fact table, three dimension tables (dim_customer, dim_product, dim_date), and a representative BI query that computes daily GMV by region for the last 30 days.
Input. Source staging.orders has columns order_id, customer_id, product_id, order_ts, quantity, unit_price, discount, currency. Source staging.customers and staging.products provide the dimension rows.
Code.
-- Dimension tables (denormalised, surrogate-keyed)
CREATE TABLE dim_customer (
customer_sk BIGINT PRIMARY KEY,
customer_id VARCHAR(64) NOT NULL,
region VARCHAR(32),
signup_date DATE,
customer_tier VARCHAR(16)
);
CREATE TABLE dim_product (
product_sk BIGINT PRIMARY KEY,
product_id VARCHAR(64) NOT NULL,
category VARCHAR(64),
brand VARCHAR(64),
list_price_usd NUMERIC(10,2)
);
CREATE TABLE dim_date (
date_sk INT PRIMARY KEY,
full_date DATE NOT NULL,
day_of_week VARCHAR(10),
is_weekend BOOLEAN,
fiscal_quarter VARCHAR(8)
);
-- Fact table (narrow, additive metrics, surrogate FKs)
CREATE TABLE fact_orders (
order_sk BIGINT PRIMARY KEY,
order_id VARCHAR(64) NOT NULL,
customer_sk BIGINT REFERENCES dim_customer(customer_sk),
product_sk BIGINT REFERENCES dim_product(product_sk),
date_sk INT REFERENCES dim_date(date_sk),
quantity INT,
unit_price_usd NUMERIC(10,2),
discount_usd NUMERIC(10,2),
gmv_usd NUMERIC(12,2)
);
-- The canonical BI query: daily GMV by region, last 30 days
SELECT
d.full_date,
c.region,
SUM(f.gmv_usd) AS gmv
FROM fact_orders f
JOIN dim_customer c ON f.customer_sk = c.customer_sk
JOIN dim_date d ON f.date_sk = d.date_sk
WHERE d.full_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY d.full_date, c.region
ORDER BY d.full_date, c.region;
Step-by-step explanation.
-
dim_customerholds one row per customer; surrogatecustomer_skdecouples from the sourcecustomer_idso SCDs can be modelled without rewriting facts. -
dim_productholds one row per product; same surrogate-key pattern. -
dim_dateis the canonical date dimension — generated once, joined to every fact. Holdsis_weekend,fiscal_quarter, holiday flags. -
fact_ordersis narrow — every column is either a surrogate FK or an additive metric (quantity,unit_price_usd,discount_usd,gmv_usd). - The BI query is the canonical star-join shape: filter on
dim_date, group bydim_date+dim_customer.region, sumfact_orders.gmv_usd. Sub-second on a warehouse with the right clustering.
Output (truncated to 3 rows).
| full_date | region | gmv |
|---|---|---|
| 2026-05-01 | EMEA | 1245678.90 |
| 2026-05-01 | NA | 2891234.50 |
| 2026-05-01 | APAC | 987654.32 |
Rule of thumb: star schemas are narrow facts + denormalised dims — never the other way around. Wide facts kill scan cost; normalised dims kill BI tools.
data warehouse architecture — the four senior signals
Signal 1 — explicit on schema-on-write vs schema-on-read. Senior engineers state the property by name; junior engineers say "the warehouse is structured". Schema-on-write is the property; structured is the outcome.
Signal 2 — naming the BI hot-path optimisations. "Snowflake clusters on (order_date, region), the BI tool's result cache lives in the SQL workbench, and partition pruning shrinks scans from 30 TB to 200 GB" — this is the senior answer.
Signal 3 — owning the cost model. "Snowflake billed in credits; one X-Small warehouse = 1 credit / hour ≈ $2-4. A 200-user dashboard concurrency burst spins up a 2X-Large = 32 credits / hour. Storage is on top at $23 / TB / month for compressed." — senior cost fluency.
Signal 4 — explicit on what not to put in the warehouse. "7 years of raw clickstream goes in S3 cold tier, not Snowflake. ML features get materialised to Parquet on S3, not into Snowflake tables. Image / PDF / audio payloads never enter the warehouse at all."
Data modeling
Topic — dimensional-modeling
Star-schema dimensional modeling
SQL
Topic — aggregation
Aggregation patterns for BI workloads
Solution Using a slowly-changing dimension type 2 + a narrow fact
Code.
-- Type-2 SCD on dim_customer: track region history without losing the past.
CREATE TABLE dim_customer (
customer_sk BIGINT PRIMARY KEY,
customer_id VARCHAR(64) NOT NULL,
region VARCHAR(32),
signup_date DATE,
customer_tier VARCHAR(16),
valid_from TIMESTAMP NOT NULL,
valid_to TIMESTAMP, -- NULL = currently active
is_current BOOLEAN NOT NULL DEFAULT TRUE
);
-- Insert: close the previous row, insert a new row
WITH src AS (
SELECT customer_id, region, customer_tier
FROM staging.customers_today
), changed AS (
SELECT s.*
FROM src s
LEFT JOIN dim_customer d
ON d.customer_id = s.customer_id AND d.is_current
WHERE d.customer_id IS NULL -- net new customer
OR d.region != s.region -- region changed
OR d.customer_tier != s.customer_tier -- tier changed
)
-- close the prior current row for any changed customer
UPDATE dim_customer d
SET valid_to = CURRENT_TIMESTAMP, is_current = FALSE
FROM changed c
WHERE d.customer_id = c.customer_id AND d.is_current;
INSERT INTO dim_customer (
customer_sk, customer_id, region, customer_tier,
valid_from, valid_to, is_current
)
SELECT
nextval('dim_customer_sk_seq'),
c.customer_id, c.region, c.customer_tier,
CURRENT_TIMESTAMP, NULL, TRUE
FROM changed c;
Step-by-step trace.
| customer_id | region | customer_tier | valid_from | valid_to | is_current |
|---|---|---|---|---|---|
| C001 | EMEA | gold | 2025-01-01 | 2026-05-29 | false |
| C001 | NA | gold | 2026-05-29 | NULL | true |
-
srcmaterialises today's customer snapshot from staging. -
changedLEFT JOINs against the current row indim_customer; new + changed customers fall out. - The
UPDATEcloses the prior current row by settingvalid_toand flippingis_current. - The
INSERTwrites a new surrogate-keyed row for each changed customer. - Facts written before the region change still reference the old
customer_sk; facts after reference the new one. This is the whole point of SCD type 2.
Output (one row after a region change).
| customer_id | region | valid_from | valid_to | is_current |
|---|---|---|---|---|
| C001 | NA | 2026-05-29 | NULL | true |
Why this works — concept by concept:
- SCD type 2 — keeps a full history of dimension changes; without it, last quarter's GMV-by-region report rewrites itself when a customer moves regions.
-
Surrogate keys —
customer_skdecouples facts from natural keys; SCD type 2 only works because the SK is per-version, not per-customer. -
is_current + valid_to — two complementary indicators;
is_currentis fast for BI lookups,valid_tois precise for point-in-time queries. -
Narrow fact —
fact_orderscarries surrogate FKs, not denormalised columns; this is why the fact stays small even as dims grow rich. -
Cost —
O(N)per load over the changed-customers slice; on a million-row dimension with 0.5% daily churn, that is 5k row writes — trivial for any warehouse.
3. Data lake architecture — schema-on-read, ELT, open formats, cheap raw storage
data lake architecture — schema-on-read, ELT, multi-zone object storage
data lake architecture flips every warehouse assumption: data lands raw, fast, and cheap, and shape is imposed at read time, not write time. The defining property is schema-on-read. The pipeline is ELT (extract → load → transform — note the order). The storage layer is object storage (S3, ADLS Gen2, GCS, or on-prem HDFS), organised into zones (raw / curated / sandbox), and the file format is open (Parquet, Avro, ORC, JSON, CSV, plus raw blobs like images and PDFs). Compute is decoupled: any engine — Spark, Presto / Trino, Athena, Dremio, DuckDB — can read the files.
The four pillars of lake architecture.
-
schema-on-read— schema is imposed by the query engine at read time, not enforced at write. The cost: bad data lands; the win: ingestion is fast, format-agnostic, and survives upstream schema drift. -
ELT pipeline— data lands raw, then gets transformed in place bySpark/dbt/ SQL. Inverts the warehouse's ETL order. -
multi-zone layout— raw / curated / sandbox; each zone has its own SLA, owner, and retention policy. The lake is not a swamp because of this discipline. -
open file formats—Parquetfor columnar analytics,Avrofor row-oriented streaming,ORCfor Hive-era pipelines, plus rawJSON/CSV/ images / PDFs. The format choice is yours, not the platform's.
The canonical zone layout.
-
Raw zone (
raw/). Untouched extracts. One subfolder per source. Daily partitions by ingest date. No transformations. Owned by ingestion. Retention: 7+ years (compliance archive). -
Curated zone (
curated/). Cleansed, deduplicated, type-coerced. Owned by data engineering. The "trusted" lake surface that ML and SQL engines read. -
Sandbox zone (
sandbox/). Data scientist scratch space. Read access to curated; write access to personal subfolder. Auto-expires after 90 days.
The big-name implementations in 2026.
-
Amazon S3+AWS Glue+Athena— the canonical AWS lake stack; Glue is the catalog, Athena the serverless SQL engine, S3 the storage. Pay-per-scan economics. -
Azure Data Lake Storage Gen2— hierarchical namespace over Blob Storage; query viaSynapse Serverless,Databricks, orMicrosoft Fabric. -
Google Cloud Storage+BigLake— GCS for storage, BigLake for the federated catalog and IAM; query viaBigQueryexternal tables orDataproc Spark. -
Hadoop HDFS— the on-prem incumbent; declining but still real in financial services, telco, and government. Often migrating toS3/MinIO/Ozone.
Where lakes still win.
- Cheap storage at petabyte scale. S3 Standard is $23 / TB / month; Glacier Deep Archive is $1 / TB / month. A warehouse cannot match this even before egress.
- Any format. Parquet, Avro, ORC, JSON, CSV, MP4, JPEG, PDF, PCAP — the lake is format-agnostic.
-
ML training data. Spark, PyTorch, Ray, TensorFlow all read
Parquetdirectly from S3 — no warehouse hop, no transformation pass. -
Streaming sinks. Kafka → S3 via
Kafka ConnectorFlinkis the canonical lake-landing pattern; millions of events per second land in raw zone.
Where lakes struggle.
-
No ACID by default. An
UPDATEis "rewrite the partition". A concurrent reader during a write sees a half-rewritten partition. Mid-2010s lake outages were all this bug. - No schema enforcement. Parquet remembers the schema of the row group, not the table. Schema drift across files is your problem to detect.
- BI consistency is shaky. "Why does the dashboard change while I'm reading it?" — because a partition was overwritten mid-query.
- Small-file problem. Streaming sinks create thousands of small files per partition; query performance degrades; periodic compaction is a real operational tax.
-
Governance is bolt-on. IAM +
Lake Formation+Ranger+Gluework, but require deliberate setup; warehouses ship governance by default.
Worked example — partition + file-format design for a clickstream lake
Detailed explanation. Real interviews ask "design the storage layout for 3 TB / day of clickstream". The answer is partitioning + file format + compaction policy — three decisions that determine whether the lake serves queries in 2 seconds or 2 hours.
Question. Design the S3 layout for a 3 TB / day clickstream pipeline that needs to support (a) Athena ad-hoc queries by event_date + country, (b) nightly Spark ML feature pipelines reading 90 days of history, and (c) 7-year compliance retention.
Input. Kafka → Kafka Connect S3 Sink → S3, ~30M events / sec peak. Each event is ~200 bytes JSON.
Code.
# S3 layout — partition by event_date and country; Parquet + Snappy.
s3://co-data-lake/raw/clickstream/
event_date=2026-05-29/
country=US/
events_2026-05-29_US_001.parquet # ~512 MB target
events_2026-05-29_US_002.parquet
country=GB/
events_2026-05-29_GB_001.parquet
country=IN/
events_2026-05-29_IN_001.parquet
event_date=2026-05-30/
...
# Daily compaction job — merge 100s of small files into 512 MB targets.
df = (spark.read.parquet("s3://co-data-lake/raw/clickstream/event_date=2026-05-29/")
.repartition("country"))
(df.write
.mode("overwrite")
.partitionBy("event_date", "country")
.option("maxRecordsPerFile", 5_000_000)
.parquet("s3://co-data-lake/curated/clickstream/"))
# Lifecycle policy — auto-tier to Glacier after 90 days, expire after 7 years.
{
"Rules": [
{
"Id": "clickstream-glacier",
"Filter": {"Prefix": "raw/clickstream/"},
"Status": "Enabled",
"Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
"Expiration": {"Days": 2555}
}
]
}
Step-by-step explanation.
-
Partition by
event_datethencountry. Athena's predicate pushdown turns "WHERE event_date = '2026-05-29' AND country = 'US'" into reading one folder, not the whole lake. - Parquet + Snappy. Parquet is columnar (4-10x smaller than JSON); Snappy is fast to decompress; together they make Athena scans cheap.
- 512 MB file target. S3 + Athena hate millions of 1 MB files; compaction merges them. The 512 MB target is the sweet spot for parallel-read engines.
-
partitionBy("event_date", "country")— the Spark write fans out into the right folder structure. - Lifecycle policy — auto-tier to Glacier after 90 days saves real money; expire after 7 years matches compliance.
Output (the resulting S3 listing for one day, one country).
| key | size | storage_class |
|---|---|---|
| raw/clickstream/event_date=2026-05-29/country=US/events_001.parquet | 512 MB | STANDARD |
| raw/clickstream/event_date=2026-05-29/country=US/events_002.parquet | 489 MB | STANDARD |
| raw/clickstream/event_date=2026-05-29/country=US/events_003.parquet | 503 MB | STANDARD |
Rule of thumb: every lake design boils down to partition for predicate pushdown, file size for parallel reads, lifecycle for cost — get those three right and the lake stays performant for years.
data lake architecture — the four senior signals
Signal 1 — partitioning explicit and bounded. Senior engineers know that partitioning by user_id creates millions of folders and kills the lake; partitioning by event_date + country creates ~1k folders per day and works. The rule: partition cardinality should be bounded and predicate-aligned.
Signal 2 — file size matters more than format. A 1 GB Parquet file outperforms a 1 MB Parquet file by 100x on a typical Athena scan. Senior engineers always own a compaction job; junior engineers ignore the small-file problem until it costs them a SEV-2.
Signal 3 — explicit on ACID gaps. Senior engineers state "the lake has no ACID without a table format on top — that's why we added Iceberg / Delta". Junior engineers either don't know or don't say.
Signal 4 — governance discipline, not just tooling. Senior engineers describe the IAM + Lake Formation + Glue policy stack and explain how column-level masking is enforced. Junior engineers say "S3 has IAM" and move on.
SQL
Topic — etl
ETL + ELT lake pipeline drills
Streaming
Topic — streaming
Streaming + landing-zone drills
Solution Using a three-zone lake with a Glue catalog + Athena
Code.
-- Glue catalog: register the curated zone as an external Athena table.
CREATE EXTERNAL TABLE curated.fact_clickstream (
event_id STRING,
user_id STRING,
session_id STRING,
event_name STRING,
event_ts TIMESTAMP,
page_url STRING,
user_agent STRING,
revenue_usd DECIMAL(10,2)
)
PARTITIONED BY (
event_date DATE,
country STRING
)
STORED AS PARQUET
LOCATION 's3://co-data-lake/curated/clickstream/'
TBLPROPERTIES (
'parquet.compression' = 'SNAPPY',
'projection.enabled' = 'true',
'projection.event_date.type' = 'date',
'projection.event_date.format' = 'yyyy-MM-dd',
'projection.event_date.range' = '2024-01-01,NOW',
'projection.country.type' = 'enum',
'projection.country.values' = 'US,GB,IN,DE,FR,BR,JP,AU'
);
-- The canonical analyst query: revenue by day + country, last 7 days.
SELECT
event_date,
country,
SUM(revenue_usd) AS revenue
FROM curated.fact_clickstream
WHERE event_date >= CURRENT_DATE - INTERVAL '7' DAY
GROUP BY event_date, country
ORDER BY event_date, country;
Step-by-step trace.
| event_date | country | revenue |
|---|---|---|
| 2026-05-23 | US | 1234567.89 |
| 2026-05-23 | GB | 234567.12 |
| 2026-05-23 | IN | 198765.43 |
| 2026-05-24 | US | 1298765.40 |
| 2026-05-24 | GB | 245678.90 |
-
CREATE EXTERNAL TABLEregisters an Athena view over the S3 prefix; no data is moved. -
PARTITIONED BY (event_date, country)matches the on-disk folder layout; Athena prunes accordingly. -
Partition projection (the
projection.*properties) tells Athena to generate partitions from the schema instead of querying Glue per scan — turns 2-minute query startups into 500 ms. -
parquet.compression = SNAPPYis the default for Athena-on-S3; tradeoff favours decompression speed. - The analyst query reads exactly the 56 partitions (7 days × 8 countries); Athena scans ~10% of the lake instead of the full 90 TB.
Output.
| event_date | country | revenue |
|---|---|---|
| 2026-05-23 | US | 1234567.89 |
| 2026-05-24 | US | 1298765.40 |
| 2026-05-25 | US | 1310987.65 |
Why this works — concept by concept:
- Schema-on-read — the table definition lives in Glue, not on the files; you can swap the schema (add a column) without rewriting the lake.
- External table — Athena owns no storage; it queries the open Parquet files in place. Compare to a warehouse, which owns both the format and the storage.
- Partition projection — eliminates the Glue API roundtrip; cuts query startup from seconds to milliseconds on partitioned tables.
- Snappy + Parquet — columnar + cheap decompression; the canonical lake format for analytical SQL.
-
Cost —
O(P × S)whereP= pruned partitions andS= scan size per partition; Athena bills $5 / TB scanned, so partition pruning directly = cost reduction.
4. Lakehouse architecture — open table formats (Delta/Iceberg/Hudi) + multi-engine compute
lakehouse architecture — the three-layer stack that bridges lake economics + warehouse reliability
lakehouse architecture is the architecture that fixes the lake's biggest flaws — no ACID, no schema enforcement, no efficient UPDATE / DELETE, no time travel — without giving up cheap object storage or the multi-engine flexibility. The trick is a table format that sits on top of Parquet and adds a metadata log describing which files belong to which version of the table. The three open table formats that matter — Delta Lake (Databricks-origin), Apache Iceberg (Netflix-origin), Apache Hudi (Uber-origin) — all solve the same problem with different tradeoffs.
The three-layer lakehouse stack.
-
Layer 1 — object storage. Same
S3/ADLS Gen2/GCSyou'd use for a plain lake. The files are stillParquet; the lakehouse is additive, not a replacement. -
Layer 2 — open table format.
Delta/Iceberg/Hudi. Stores a transaction log + snapshot history alongside the data files; lets engines read a consistent version of the table even while another engine is writing. -
Layer 3 — compute engines.
Spark,Trino,Presto,Flink,DuckDB,Snowflake(Iceberg),BigQuery(BigLake / Iceberg),Redshift(Iceberg),Athena(Iceberg). All read the same tables; no data movement.
What the table format actually adds.
-
ACID transactions —
INSERT,UPDATE,DELETE,MERGEare atomic; concurrent readers always see a consistent snapshot. - Schema enforcement + evolution — adding a column is metadata-only; dropping a column is supported; type promotion is bounded.
-
Time travel —
SELECT * FROM table VERSION AS OF 5orTIMESTAMP AS OF '2026-05-01 00:00:00'; instant rollback and audit. -
Hidden partitioning —
Icebergpartitions onday(event_ts)without exposing apartition_datecolumn; partition layout can evolve without rewriting facts. -
Compaction + vacuum — built-in
OPTIMIZE/VACUUMcommands; no hand-rolled compaction job. - Statistics for query pruning — min/max/null-count per column per file; engines skip files without scanning them.
The three open table formats — strengths and trade-offs.
-
Delta Lake— strongest ecosystem inside Databricks; first-class on Databricks Unity Catalog; recently shippedUniFormso Delta tables read as Iceberg from other engines. Strength: deepest tooling on Databricks; trade-off: best cross-engine support requires UniForm. -
Apache Iceberg— strongest cross-engine support; first-class in Snowflake, BigQuery, Redshift, Athena, Trino, Spark, Flink. Strength: vendor-neutrality (won the 2024-2026 format war on this axis); trade-off: less tightly integrated with any single platform than Delta is with Databricks. -
Apache Hudi— strongest record-level upsert + CDC story; designed around incremental processing from day one; powers many of Uber's pipelines. Strength: best for streaming + CDC ingestion; trade-off: smaller community + ecosystem than Delta or Iceberg.
The big-name implementations in 2026.
-
Databricks(Delta + Unity Catalog) — the original lakehouse vendor; canonical end-to-end stack; deepest tooling around Delta. -
Snowflake Iceberg tables— Snowflake reads and writes Iceberg; lets you store in your own S3 bucket while paying for Snowflake compute. -
Microsoft Fabric+OneLake— Microsoft's lakehouse play; Delta-formatted, single-tenant lake per org, integrated with Power BI. -
Google BigLake+Icebergnative tables — GCP's bridge between BigQuery storage and external lake / lakehouse; reads Iceberg / Delta on GCS. -
Open OSS stack —
MinIO(orS3) +Iceberg+Nessiecatalog +Trino/Spark+dbt; pure open source, no vendor lock.
Where lakehouses win — the modern default.
- Mixed BI + ML + streaming on one storage layer. BI hits Iceberg via Trino; ML reads the same Iceberg via Spark; streaming writes via Flink — all on the same files.
- Cost-effective at scale. Storage on S3 is cheap; compute is per-engine, per-workload, so you pay only for what runs.
- Multi-engine flexibility. Cannot afford lock-in? Iceberg is the safest choice; the format is open and supported across all major engines.
- Open formats + governance maturity. Unity Catalog, Nessie, and Polaris are converging on a real cross-engine catalog story; column masking + row filtering work across engines.
Where lakehouses still struggle.
- Sub-second BI on 500-user dashboards. A warehouse's result cache still beats Trino-on-Iceberg on the BI hot path; many shops keep the warehouse as a serving layer in front of the lakehouse.
- Tooling maturity for governance. Closing the gap fast, but warehouse-grade row-level security is still more mature on Snowflake / BigQuery than on Iceberg-via-Trino.
- Operational complexity. Three layers (storage, table format, engine) means three places to debug. Warehouses are simpler.
- Format choice is a long-term commitment. Picking Delta vs Iceberg vs Hudi at year 0 binds you for a decade.
Worked example — create an Iceberg table and run an ACID MERGE
Detailed explanation. Real interviews ask you to write the lakehouse equivalent of a warehouse MERGE. Below is the canonical Iceberg table + a MERGE INTO that performs an idempotent upsert — the shape every modern CDC pipeline takes.
Question. Create an Iceberg table for fact_orders on S3, then write an idempotent MERGE INTO that upserts a daily batch of new + updated orders from a Spark-loaded staging.orders_today view.
Input. Source staging.orders_today has 1.2M rows (98% net new, 2% updates to prior-day rows). Target fact_orders Iceberg table holds 600M rows across 24 months of history.
Code.
-- Create the Iceberg table on S3 with hidden partitioning by day(order_ts)
CREATE TABLE prod.fact_orders (
order_id BIGINT,
customer_id BIGINT,
product_id BIGINT,
order_ts TIMESTAMP,
quantity INT,
unit_price_usd DECIMAL(10,2),
discount_usd DECIMAL(10,2),
gmv_usd DECIMAL(12,2)
)
USING ICEBERG
PARTITIONED BY (days(order_ts))
LOCATION 's3://co-lakehouse/prod/fact_orders/'
TBLPROPERTIES (
'write.format.default' = 'parquet',
'write.parquet.compression-codec' = 'zstd',
'write.target-file-size-bytes' = '536870912' -- 512 MB
);
-- Idempotent upsert: insert net-new, update changed, keep history intact
MERGE INTO prod.fact_orders AS tgt
USING staging.orders_today AS src
ON tgt.order_id = src.order_id
WHEN MATCHED AND (
tgt.quantity != src.quantity
OR tgt.unit_price_usd != src.unit_price_usd
OR tgt.discount_usd != src.discount_usd
OR tgt.gmv_usd != src.gmv_usd
) THEN UPDATE SET
quantity = src.quantity,
unit_price_usd = src.unit_price_usd,
discount_usd = src.discount_usd,
gmv_usd = src.gmv_usd
WHEN NOT MATCHED THEN INSERT (
order_id, customer_id, product_id, order_ts,
quantity, unit_price_usd, discount_usd, gmv_usd
) VALUES (
src.order_id, src.customer_id, src.product_id, src.order_ts,
src.quantity, src.unit_price_usd, src.discount_usd, src.gmv_usd
);
Step-by-step explanation.
-
USING ICEBERGtells Spark / Trino / Snowflake to use the Iceberg table format; the underlying files are still Parquet. -
PARTITIONED BY (days(order_ts))is hidden partitioning — no explicitorder_datecolumn; Iceberg derives the partition value fromorder_tsautomatically. -
write.target-file-size-bytes = 512 MBsets the engine's compaction target; files are rewritten to hit this size duringOPTIMIZE. -
MERGE INTOis the canonical idempotent upsert; safe to re-run; atomic; ACID. -
WHEN MATCHED AND ...clause skips no-op updates — only rewrites files whose rows actually changed; this is the optimization that keeps daily MERGE jobs from rewriting the whole table.
Output (after the MERGE on a 1.2M-row batch).
| outcome | rows |
|---|---|
| inserted | 1176000 |
| updated | 24000 |
| files_rewritten | 47 |
| snapshot_id | 8125094521 |
Rule of thumb: lakehouse MERGE is the modern equivalent of a warehouse UPSERT; once you can write it, you can run ACID CDC into a lake at warehouse-grade reliability.
lakehouse architecture — the four senior signals
Signal 1 — opinionated on format choice. "I default to Iceberg for multi-engine neutrality; Delta if the org is Databricks-first; Hudi only when record-level upsert at streaming velocity is the dominant requirement" — senior phrasing.
Signal 2 — quoting time-travel use cases. Time travel is not a party trick — it's how you recover from a bad transformation. "We rolled back the bad PR by RESTORE TABLE fact_orders TO VERSION AS OF 47; took 10 seconds; would have been a 4-hour restore on Redshift."
Signal 3 — owning compaction + vacuum cadence. "OPTIMIZE runs nightly to compact small files; VACUUM runs weekly with 7-day retention to keep storage bounded; both are idempotent and re-runnable."
Signal 4 — multi-engine reasoning, not single-vendor. "BI uses Trino-on-Iceberg for sub-second latency on cached aggregates; Spark runs the nightly ML pipeline on the same tables; Flink writes streaming CDC into the same Iceberg with MERGE. One storage layer, three engines."
Data modeling
Topic — slowly-changing-data
SCD + lakehouse upsert practice
Company
Company — databricks
Databricks interview practice
Solution Using an Iceberg snapshot + a time-travel rollback
Code.
-- 1) Discover the snapshot history (the audit trail every Iceberg table ships with).
SELECT
snapshot_id,
parent_id,
operation,
committed_at,
summary['added-records'] AS added,
summary['deleted-records'] AS deleted,
summary['changed-partition-count'] AS changed_parts
FROM prod.fact_orders.snapshots
ORDER BY committed_at DESC
LIMIT 10;
-- 2) Query the table at a prior version (time travel).
SELECT COUNT(*)
FROM prod.fact_orders VERSION AS OF 47;
-- 3) Restore the table to the prior snapshot in a single transaction.
CALL system.rollback_to_snapshot('prod.fact_orders', 47);
Step-by-step trace.
| snapshot_id | operation | committed_at | added | deleted |
|---|---|---|---|---|
| 8125094521 | append | 2026-05-29 02:14 | 1176000 | 0 |
| 8125094520 | overwrite | 2026-05-28 02:11 | 1198432 | 1198432 |
| 8125094519 | append | 2026-05-27 02:09 | 1184502 | 0 |
| 47 | append | 2026-05-26 02:13 | 1167789 | 0 |
-
prod.fact_orders.snapshotsis a metadata table that ships with every Iceberg table — instant audit trail with zero extra plumbing. - The
VERSION AS OFclause reads the table as it existed at snapshot 47; no data was moved, no extra storage burned. - The
rollback_to_snapshotprocedure rewrites only the metadata pointer —O(1)operation, atomic, ACID-safe. - Concurrent readers continue reading the prior current snapshot until the rollback commits; no half-state visible.
- The rollback is itself a new snapshot — fully auditable; you can roll forward again if needed.
Output.
| snapshot_id | operation | committed_at |
|---|---|---|
| 8125094522 | rollback_to_snapshot | 2026-05-29 02:30 |
Why this works — concept by concept:
- Snapshot metadata — Iceberg writes every commit as a new snapshot; the chain is the table's full history, zero extra cost.
-
Time travel —
VERSION AS OFlets you debug, audit, and rollback without restoring from backup; the warehouse equivalent is a multi-hour restore. -
O(1) rollback — only the metadata pointer moves; underlying files are untouched until
VACUUMcleans up orphans. - ACID across engines — Spark, Trino, and Snowflake all see the same snapshot consistently; lakehouse's biggest win over plain lakes.
-
Cost —
O(1)metadata read for snapshot history;O(1)rollback;O(N)only onVACUUM. The math is why Iceberg / Delta dominate modern lakehouses.
5. Decision matrix — pick the right architecture per workload (with worked migration scenarios)
data lake vs data warehouse vs lakehouse — the five-dimension decision matrix
This is the matrix you should be able to draw on a whiteboard from memory in any senior interview. Five dimensions × three architectures = fifteen cells; the verdict in each cell is the one-line answer interviewers reward.
The five-dimension decision matrix.
| Dimension | Warehouse | Lake | Lakehouse |
|---|---|---|---|
| Best workload | BI / dashboards / SQL | ML / raw archive / semi-structured | Mixed BI + ML + streaming |
| Format support | Structured + JSON | Any format | Any format + open tables |
| ACID guarantees | Full ACID | None by default | ACID via Delta / Iceberg / Hudi |
| Cost profile | Compute + storage bundled | Cheapest storage | Cheap storage + pay per engine |
| Maturity | 30+ years (proven) | 15+ years (proven) | Modern + fast-evolving |
Reading the matrix — three canonical decisions.
- "My only workload is a BI dashboard for 500 concurrent users on structured SQL." → Warehouse wins. ACID + result cache + BI integrations + governance maturity are all warehouse strengths. Snowflake, BigQuery, Redshift, or Synapse — pick by cloud.
- "My only workload is ML training on 5 PB of raw clickstream + image data." → Lake wins. Cheapest storage + any format + direct read from Spark / PyTorch. S3 + Glue + Athena, or ADLS + Synapse Serverless.
- "I have mixed BI + ML + CDC + streaming on overlapping data." → Lakehouse wins. Open Iceberg / Delta tables let every engine read the same files; storage stays cheap; ACID stays solid; format stays open.
The four-question decision tree (the senior shorthand).
- Q1 — Is your workload 100% BI on structured SQL? Yes → warehouse. No → continue.
- Q2 — Do you need ACID guarantees on lake-scale storage? Yes → lakehouse. No → continue.
- Q3 — Do you need to share data across many compute engines without copying? Yes → lakehouse. No → continue.
- Q4 — Default for everything else → lake (and revisit when ACID or BI consistency starts hurting).
Worked example — three real migration scenarios with cost + risk
Detailed explanation. Real interviews don't ask "which architecture" — they ask "how would you migrate". Below are three canonical migration scenarios with the steps, the order, and the rollback strategy each one ships with.
Question. Walk through three migrations end-to-end: (A) Redshift warehouse → Iceberg lakehouse on S3 + Trino; (B) S3 + Glue lake → Iceberg lakehouse + Snowflake serving; (C) Databricks Delta lakehouse → multi-engine Iceberg via UniForm.
Input. Each migration has a 50-100 TB starting footprint and a 90-day timeline. The success criterion is zero downtime for BI consumers and full cost parity within 6 months.
Code.
# Migration A — Redshift warehouse → Iceberg lakehouse on S3 + Trino
# (90-day plan; the most common 2026 migration)
migration_a_steps = [
("week_1", "audit_redshift_tables", "list top 200 tables by query volume + size"),
("week_2", "unload_to_parquet_on_s3", "UNLOAD ('SELECT ...') TO 's3://...' FORMAT PARQUET"),
("week_3", "convert_parquet_to_iceberg", "CALL system.add_files_from_table('parquet_table')"),
("week_4", "stand_up_trino_endpoint", "deploy Trino cluster with iceberg catalog"),
("week_5", "dual_write_via_dbt", "every model writes to both Redshift and Iceberg"),
("week_6", "row_count_parity_tests", "dbt tests on COUNT(*) + SUM(amount) for top 50 tables"),
("week_7", "point_bi_at_trino", "Tableau / Looker switch endpoint; smoke-test on top 20 dashboards"),
("week_8", "monitor_2_weeks", "watch query latency, cost, error rates"),
("week_9", "cut_redshift_compute", "pause cluster; keep storage tier for 30 days as rollback"),
("week_10", "decommission", "drop Redshift cluster; finalise cost report"),
]
# Migration B — S3 + Glue lake → Iceberg lakehouse + Snowflake serving
migration_b_steps = [
("week_1", "audit_glue_catalog", "list tables, partitions, file counts"),
("week_2", "convert_external_to_iceberg", "CREATE TABLE iceberg.x AS SELECT * FROM parquet.x"),
("week_3", "switch_compaction_to_optimize","replace manual compaction with Iceberg OPTIMIZE"),
("week_4", "configure_snowflake_iceberg", "CREATE EXTERNAL VOLUME + CREATE ICEBERG TABLE"),
("week_5", "expose_iceberg_to_bi", "Snowflake serves Iceberg to Power BI / Looker"),
("week_6", "decommission_glue_metastore", "keep Glue for legacy Athena; new tables Iceberg-only"),
]
# Migration C — Databricks Delta → multi-engine Iceberg via UniForm
migration_c_steps = [
("week_1", "enable_uniform_on_delta", "ALTER TABLE x SET TBLPROPERTIES ('delta.universalFormat.enabledFormats'='iceberg')"),
("week_2", "register_in_unity_catalog", "tables now readable as Iceberg from external engines"),
("week_3", "point_external_trino_at_uc", "Trino reads via Iceberg catalog; same files, no copy"),
("week_4", "validate_external_reads", "row-count + checksum parity Databricks vs Trino"),
("week_5", "open_data_to_partners", "external partners read Iceberg without buying Databricks seats"),
]
Step-by-step explanation.
- Migration A is the most common 2026 path — Redshift cost pressure + multi-engine requirements + cheap storage demand all point toward Iceberg on S3 + Trino. The 10-week plan is conservative; aggressive teams compress it to 6 weeks.
- Migration B is the "lake-grew-up" path — an existing S3 + Glue lake adds Iceberg for ACID + schema evolution, then uses Snowflake as a serving layer in front (Snowflake reads Iceberg natively as of 2024).
-
Migration C is the "open the format" path — a Databricks shop enables
UniFormso Delta tables also expose an Iceberg interface; external Trino / Snowflake / BigQuery clients read the same files without buying Databricks seats. - Common pattern — every migration includes a dual-write window + parity tests + a rollback tier kept for one quarter. The single biggest mistake is cutting the old system before parity is proven.
- Cost reality — migrations A and B typically pay back inside 6-12 months; migration C is mostly a feature-unlock, not a cost play.
Output (the migration tracker for Migration A at week 7).
| step | status | parity_pass | cost_so_far_usd |
|---|---|---|---|
| audit_redshift_tables | done | n/a | 0 |
| unload_to_parquet_on_s3 | done | n/a | 8400 |
| convert_parquet_to_iceberg | done | n/a | 1200 |
| stand_up_trino_endpoint | done | n/a | 4500/mo |
| dual_write_via_dbt | done | yes (50/50) | 2200/mo |
| row_count_parity_tests | done | yes | 0 |
| point_bi_at_trino | in_progress | n/a | 0 |
Rule of thumb: migrations are won by dual-writing + parity tests + a rollback tier, not by clever code. Every senior plan includes all three.
lakehouse architecture — the four senior migration signals
Signal 1 — explicit dual-write window. "We dual-wrote for two weeks while BI still pointed at Redshift; cut over only after row-count + SUM parity passed on 50 critical tables." Senior teams never cut over without a parity gate.
Signal 2 — keep the old system as a rollback tier. "We paused the Redshift cluster but kept the storage tier for 30 days; cost was $X / month for insurance; we never needed it but the option mattered." Senior teams budget for the rollback.
Signal 3 — migration order matters. "Migrate cold tables first (low risk), warm tables second (medium risk), hot BI tables last (highest risk)." Senior teams sequence by blast radius.
Signal 4 — measurable success criterion. "Success = cost parity within 6 months + zero downtime for BI + 100% of top-50 dashboards passing smoke tests." Junior teams say "migrate the data"; senior teams say "hit these three numbers".
SQL
Topic — etl
ETL + migration pipeline drills
Company
Company — snowflake
Snowflake interview practice
Solution Using a workload-to-architecture decision tree + parity-gated migration
Code.
# A reusable decision-tree + migration-gate harness.
# Inputs: workload spec; outputs: architecture verdict + migration steps.
WORKLOADS = [
{"name": "exec_dashboard", "consumers": 500, "latency_ms": 800, "data_tb": 5, "formats": ["sql"]},
{"name": "ml_training", "consumers": 8, "latency_ms": 5000, "data_tb": 50, "formats": ["parquet","jpeg"]},
{"name": "cdc_ingest", "consumers": 4, "latency_ms": 60_000, "data_tb": 80, "formats": ["json","parquet"]},
{"name": "regulatory_archive","consumers":1, "latency_ms": 600_000,"data_tb":300,"formats": ["parquet"]},
]
def pick_architecture(w):
if w["consumers"] > 100 and "sql" in w["formats"] and w["latency_ms"] < 1500:
return "warehouse_or_lakehouse"
if w["data_tb"] > 100 and w["latency_ms"] > 60_000:
return "lake"
if any(f in w["formats"] for f in ("parquet","json","jpeg")) and w["latency_ms"] >= 5000:
return "lake_or_lakehouse"
return "lakehouse"
def parity_check(src_table, tgt_table):
# Row-count + SUM(amount) parity within 0.01% tolerance
sql = f"""
SELECT
ABS((SELECT COUNT(*) FROM {src_table}) - (SELECT COUNT(*) FROM {tgt_table})) AS row_delta,
ABS((SELECT COALESCE(SUM(amount),0) FROM {src_table})
- (SELECT COALESCE(SUM(amount),0) FROM {tgt_table}))
/ NULLIF((SELECT COALESCE(SUM(amount),0) FROM {src_table}), 0) AS rel_amt_delta
"""
row = engine.execute(sql).first()
return row.row_delta == 0 and row.rel_amt_delta < 0.0001
for w in WORKLOADS:
print(w["name"], "->", pick_architecture(w))
Step-by-step trace.
| workload | consumers | latency_ms | data_tb | verdict |
|---|---|---|---|---|
| exec_dashboard | 500 | 800 | 5 | warehouse_or_lakehouse |
| ml_training | 8 | 5000 | 50 | lake_or_lakehouse |
| cdc_ingest | 4 | 60000 | 80 | lakehouse |
| regulatory_archive | 1 | 600000 | 300 | lake |
-
pick_architecturecodifies the four-question decision tree as Python; one branch per workload class. -
exec_dashboardlands inwarehouse_or_lakehouse— many consumers + sub-second latency demand a hot query engine. -
ml_traininglands inlake_or_lakehouse— non-SQL formats + tolerance for 5-second latency means the lake's economics win. -
cdc_ingestlands inlakehouse— mixed formats + need for ACID upserts at minute-level latency means the lakehouse is the only architecture that fits. -
regulatory_archivelands inlake— cold storage + minute-tier latency tolerance + single consumer means even a lakehouse is overkill. -
parity_checkis the gate every migration step runs before promoting; the0.0001tolerance band tolerates floating-point noise without masking real drift.
Output.
| workload | verdict |
|---|---|
| exec_dashboard | warehouse_or_lakehouse |
| ml_training | lake_or_lakehouse |
| cdc_ingest | lakehouse |
| regulatory_archive | lake |
Why this works — concept by concept:
-
Workload-spec inputs — turns architecture choice into a function of
(consumers, latency, data size, formats); senior answers always tie the choice to measurable workload properties. - Decision tree — codifies the four-question shorthand so every team-mate gets the same answer for the same workload.
- Parity-gated migration — every step is conditional on row-count + value parity passing; no step ships without proof.
-
Tolerance band —
0.0001is the senior-grade default; raw equality would block on harmless floating-point noise. -
Cost —
O(1)per workload to runpick_architecture;O(N)per table forparity_check; the function is the artefact you point at when someone asks "why did we pick X for workload Y".
Choosing the right architecture (cheat sheet)
A one-screen cheat sheet for data lakehouse vs data warehouse vs data lake — pick the architecture that matches the workload you actually have.
| You want to support … | Architecture | Canonical stack | Why |
|---|---|---|---|
| Sub-second BI on structured SQL, 500 concurrent users | Warehouse | Snowflake / BigQuery / Redshift / Synapse | Result cache + BI vendor integrations + ACID maturity |
| Cheap petabyte-scale raw archive | Lake | S3 + Glacier + Glue catalog | $1-23 / TB / month; no other architecture comes close |
| ML training on raw multi-format data | Lake or Lakehouse | S3 + Spark + (optional Iceberg) | Spark / PyTorch read Parquet directly; lakehouse adds ACID for shared tables |
| Mixed BI + ML + CDC + streaming on one storage layer | Lakehouse | S3 + Iceberg + Trino + Spark + Flink | One storage, many engines, ACID across all |
| Multi-engine reads without data copies | Lakehouse | Iceberg + Unity / Polaris / Nessie | Open format + cross-engine catalog |
| ACID upserts at lake economics | Lakehouse | Iceberg or Delta + Spark MERGE | Atomic MERGE INTO on object storage |
| Time travel + auditable rollback | Lakehouse | Iceberg / Delta snapshot history |
VERSION AS OF instead of restoring from backup |
| Fine-grained governance (row + column security) | Warehouse first, Lakehouse if open is mandatory | Snowflake masking policies / Unity Catalog | Warehouse-grade governance still slightly ahead |
| Sub-millisecond OLTP transactional reads | Neither (use OLTP DB) | PostgreSQL / MySQL / DynamoDB | None of the three analytical architectures fit OLTP |
| Real-time fraud scoring on streaming events | Lakehouse + streaming engine | Iceberg + Flink + feature store | Stream into Iceberg; consume with Flink ML pipeline |
| Cross-cloud portability of data | Lakehouse | Iceberg on S3 / ADLS / GCS | Open format avoids vendor lock-in on the storage layer |
| Mature 7-year regulatory archive | Lake (cold tier) | S3 Glacier + Glue catalog | $1 / TB / month + queryable on-demand via Athena |
| Migration off Teradata / Oracle | Warehouse-first, then Lakehouse | Snowflake / BigQuery, later Iceberg | Land in modern warehouse first; open the format later |
| Cost-pressure relief on existing Snowflake / Redshift | Lakehouse migration | Iceberg on S3 + Trino + Snowflake-as-serving | Cuts storage cost 5-10x without losing BI surface |
Frequently asked questions
What is the difference between a data lakehouse, a data warehouse, and a data lake?
A data warehouse is a structured, ACID-compliant, schema-on-write store optimised for BI and SQL analytics (Snowflake, BigQuery, Redshift, Synapse); a data lake is a cheap, schema-on-read object store that lands data in any format and lets ML / SQL engines read it directly (S3, ADLS Gen2, GCS + Glue / Athena); a lakehouse is a lake plus an open table format (Delta, Iceberg, Hudi) that adds ACID, schema enforcement, time travel, and efficient UPDATE / DELETE so the same storage layer can serve BI and ML and streaming. In 2026 most enterprises run all three side by side — a warehouse for the BI hot path, a lake for cold archive and raw ML data, and a lakehouse as the shared storage spine. The right architecture is always per workload, not blanket.
When should I use a lakehouse instead of a data warehouse?
Use a lakehouse when (a) you need multi-engine flexibility — Spark, Trino, Snowflake, and BigQuery all reading the same tables; (b) your storage cost is dominated by cold or semi-structured data that the warehouse charges a premium for; (c) you have mixed BI + ML + streaming workloads that want to share data without copying; or (d) vendor neutrality on the storage layer is a strategic requirement. Use a warehouse when your workload is 100% structured BI on a small concurrency-heavy set of dashboards and sub-second latency matters more than storage cost. The hybrid pattern most teams adopt in 2026 is lakehouse as the shared storage spine + warehouse as the BI serving layer in front — best of both, no architecture forced to serve every workload.
What are the main lakehouse table formats and how do I choose between Delta, Iceberg, and Hudi?
The three open table formats are Delta Lake (Databricks-origin), Apache Iceberg (Netflix-origin), and Apache Hudi (Uber-origin) — all add ACID, schema evolution, time travel, and efficient MERGE on top of Parquet files on object storage. Pick Iceberg as the default if you want cross-engine neutrality — Snowflake, BigQuery, Redshift, Athena, Trino, Spark, and Flink all read Iceberg natively. Pick Delta if you are Databricks-first — the tooling, performance optimisations, and Unity Catalog integrations are deepest there (and UniForm lets Delta tables read as Iceberg from external engines). Pick Hudi when record-level upsert + CDC at streaming velocity is the dominant requirement — its Streamer API and merge-on-read storage type were designed for that case. The 2026 community trend: Iceberg won the neutrality race, Delta won the Databricks ecosystem, Hudi remains best-in-class for streaming CDC.
Does a lakehouse really replace a data warehouse for BI workloads?
For most BI workloads, yes — Trino or Databricks SQL on an Iceberg / Delta table delivers the dashboards, ACID, and partition pruning that a warehouse does. For high-concurrency, sub-second BI on cached aggregations (think: 500-user executive dashboards), warehouses still have an edge because of the result cache and purpose-built BI vendor integrations. The pragmatic pattern is lakehouse as the storage spine + warehouse as the BI serving layer — store data once in Iceberg, then load (or live-query via external table) the hot aggregates into Snowflake / BigQuery for the dashboard front-end. This hybrid gives you lake economics on the cold + raw data and warehouse performance on the BI hot path, without forcing one architecture to do everything.
How does ETL change between a warehouse, a lake, and a lakehouse?
In a warehouse the pipeline is classic ETL — extract from sources, transform in a dedicated tool (Informatica, Talend, dbt, or hand-rolled), load clean data into staging → ODS → marts. Schema-on-write means transformations must succeed before data lands. In a lake the pipeline inverts to ELT — extract, load raw, then transform later with Spark / dbt / SQL on the raw zone; schema-on-read means bad data lands and is filtered downstream. In a lakehouse the pipeline is also ELT but with ACID atop — MERGE INTO iceberg_table USING staging is the idempotent canonical pattern; you keep lake flexibility and warehouse-grade transactional guarantees. The senior takeaway: ELT into a lakehouse with MERGE is the modern default; pure ETL into a warehouse is still right for narrow BI-only workloads; pure ELT into a raw lake is still right for ML and archival.
What is the typical cost difference between a data lake, a data warehouse, and a lakehouse at petabyte scale?
At petabyte scale, storage cost dominates and the ranking is fairly stable. A lake on S3 Standard costs roughly $23 / TB / month; with cold-tier (Glacier Deep Archive) the cold portion drops to ~$1 / TB / month. A lakehouse (Iceberg on S3 + Trino / Spark) costs the same storage as the lake, plus pay-per-use compute on whichever engines you run (typically $20-60 / TB scanned via Trino or Athena). A warehouse like Snowflake or Redshift charges a storage premium of 5-10x over raw S3 ($40-80 / TB / month for compressed) and bundles compute via virtual-warehouse credits ($2-4 / credit-hour for an X-Small, scaling up). In practice teams migrating from a 1 PB Redshift footprint to Iceberg on S3 + Trino report 40-70% cost reduction with no loss of BI surface — exact numbers depend on workload mix, concurrency, and how aggressively cold data is tiered.
Practice on PipeCode
PipeCode ships 450+ data-engineering interview problems — including SQL + Python + data-modeling drills keyed to the exact data lakehouse vs data warehouse skill set this guide teaches (star-schema design, partition + file-format choice on lakes, Delta / Iceberg / Hudi upsert patterns, multi-engine ACID, BI vs ML workload mapping, migration parity tests). Whether you're drilling data lake vs data warehouse questions the night before a screen or grinding the architecture-selection decision tree over 12 weeks of prep, the practice library mirrors the same five-dimension mental model — plus the Spark, Trino, Snowflake, Databricks, BigQuery, and Redshift tooling you'll wire into a real production lakehouse.
Kick off via Explore practice →; drill the data-modeling practice lane →; fan out into the ETL pipeline drills →; rehearse dimensional-modeling patterns →; reinforce aggregation reconciliation drills →; widen coverage on the full SQL practice library →; or stress-test with Databricks-specific drills → and Snowflake-specific drills →.





Top comments (0)