Data lake architecture questions in data-engineering interviews almost always reduce to four primitives: medallion zones (bronze → silver → gold) for progressive refinement, an ingestion → metadata catalog → compute flow on object storage, the lake vs cloud warehouse vs lakehouse decision driven by open table formats (Iceberg, Delta, Hudi), and a disciplined answer shape that covers grain, idempotency, lineage, and aggregate reconciliation. Whether the prompt is "design our analytics lake from scratch", "how would you land CDC from Postgres into the lake", "when would you pick a lakehouse over a warehouse", or "why do counts drift between the lake and the source app", interviewers grade the same handful of mental models — and candidates who skip straight to vendor names without naming the primitives lose the round.
This guide walks four topic clusters end-to-end, each with a detailed topic explanation, per-sub-topic explanation with a worked example and its solution, common beginner mistakes, and an interview-style scenario with a full answer that traces the design step by step. Every section ends with a concept-by-concept breakdown that explains why the design works, what it costs, and where beginners typically slip. Storage examples assume an S3-style object store on the cloud, but every primitive transfers to GCS, Azure Blob / ADLS, or any other modern object backend.
Top data lake architecture interview topics
The four numbered sections below follow this topic map — one row per H2, every row expanded into a full section with sub-topics, a worked scenario, an interview-style design question, and a step-by-step solution:
| # | Topic | Why it shows up in DE interviews |
|---|---|---|
| 1 | Bronze / silver / gold medallion zones | Progressive refinement is the single biggest lake-architecture concept; interviewers grade whether you know which transformations belong in landing/bronze vs refined/silver vs curated/gold and how SLAs differ per layer. |
| 2 | Ingestion → catalog → compute flow on object storage | Sources land into S3/GCS/ABS, register in a Hive/Glue/Unity catalog, and are queried by Spark, Trino, or warehouse external tables; the small-file problem, partition pruning, and schema evolution all live here. |
| 3 | Lake vs cloud warehouse vs lakehouse — and Iceberg / Delta / Hudi | The pattern-selection question is canonical; open table formats are what turn a lake into a lakehouse and bring ACID, time travel, and partition evolution to object storage. |
| 4 | Interview answer shape — grain, idempotency, lineage, reconciliation | Even system-design rounds reduce to a five-step template: clarify grain, separate landing from conformed, make loads idempotent, attach lineage keys, and reconcile aggregates against the source. |
Beginner-friendly framing: A data lake is cheap, durable object storage plus conventions for layout, metadata, and processing. The "lake vs warehouse" decision is rarely binary — most large organizations run a blend, with the lake handling flexible high-volume ingestion and ML feature stores while a warehouse or lakehouse handles curated SQL analytics. Interviews test whether you can place each workload on the right side of that line and explain the trade-offs without reaching for vendor names as a substitute for first principles.
1. Bronze / Silver / Gold Medallion Zones for Data Lake Architecture
Progressive refinement through landing/bronze, refined/silver, and curated/gold zones
"Walk me through how you would lay out an analytics lake from scratch" is the signature opening prompt — and the cleanest answer is medallion architecture with three numbered zones. The mental model: landing/bronze is an append-only mirror of the source payloads with minimal transformation; refined/silver applies dedup, type coercion, and conformed business keys; curated/gold publishes subject-area tables and star-schema facts/dims that downstream applications and BI tools consume. Each zone has a different SLA, different read/write permissions, and different retention. The names vary across vendors — Databricks coined "bronze/silver/gold", AWS uses "raw/curated/consumption", Microsoft uses "landing/refined/analytics" — but the three-tier shape is universal.
Pro tip: When you whiteboard the medallion zones, label each box with who writes, who reads, and what breaks if the job reruns. Idempotent writes and clear grain matter as much in a lake as they do in a warehouse — interviewers grade the candidate who naturally adds these annotations without prompting.
Landing / bronze — append-only mirror of source payloads
The landing-zone invariant: bronze is an append-only, immutable copy of the source payload with minimal transformation; the schema is captured but not enforced; partitioning is by `ingest_date` (or `ingest_hour` for high-frequency sources); **replays are safe because writes never overwrite**. The zone optimizes for fidelity and replay, not query performance.
- **Append-only writes** — every batch produces a new file under a date-partitioned prefix; `MERGE` and `UPDATE` are forbidden.
- **Source-payload fidelity** — store the raw shape (JSON, Avro, CSV, Parquet snapshot) plus an `ingest_id` and `source_ts` per row.
- **Partition by `ingest_date`** — makes back-fill, replay, and audit trivially scoped.
- Retention — keep at least 30-90 days; audits and reconciliations need historical bronze.
Worked example. A Postgres CDC pipeline lands daily JSON snapshots into s3://analytics-lake/bronze/orders/.
| prefix | files | purpose |
|---|---|---|
| `bronze/orders/ingest_date=2026-04-11/` | `part-00000.json` | Apr 11 snapshot |
| `bronze/orders/ingest_date=2026-04-12/` | `part-00000.json` | Apr 12 snapshot |
| `bronze/orders/ingest_date=2026-04-13/` | `part-00000.json` | Apr 13 snapshot |
Step-by-step explanation.
- The source app emits one JSON snapshot per day at 02:00 UTC.
- The ingestion job lands each snapshot under a calendar-keyed prefix `bronze/orders/ingest_date=YYYY-MM-DD/` so partition pruning works for any date filter downstream.
- Each batch is also stamped with a unique `ingest_id` (timestamp + UUID) sub-prefix so retries write fresh files instead of overwriting a previous attempt.
- Files inside a partition are append-only `part-NNNNN.json`; bronze never edits a written file — corrected payloads land as new files under a new `ingest_id`.
- After three days you have three day-partitions; each is independently re-readable with `WHERE ingest_date = 'YYYY-MM-DD'`, and any single day can be replayed without touching the others.
Worked-example solution. A landing-zone object layout:
```
s3://analytics-lake/bronze/orders/
  ingest_date=2026-04-13/
    ingest_id=20260413T0200Z/
      part-00000.json
      part-00001.json
```
Rule of thumb: never edit a bronze file. If a payload is wrong, drop a corrected file under a new ingest_id and let the silver-layer dedup logic resolve it; never overwrite history.
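To make replay concrete, here is a minimal sketch of registering that bronze layout as a queryable external table, assuming Spark SQL and the example bucket; the `payload` column is a hypothetical raw-document field, not part of the article's schema:

```sql
-- Register bronze as an external JSON table so replay is a WHERE filter.
CREATE TABLE bronze.orders (
  order_id    BIGINT,
  payload     STRING,     -- hypothetical: raw document kept alongside parsed keys
  source_ts   TIMESTAMP,
  ingest_date STRING,
  ingest_id   STRING
)
USING json
PARTITIONED BY (ingest_date, ingest_id)
LOCATION 's3://analytics-lake/bronze/orders/';

-- Pick up partitions written by the ingestion job.
MSCK REPAIR TABLE bronze.orders;

-- Replay one day without touching the others.
SELECT * FROM bronze.orders WHERE ingest_date = '2026-04-13';
```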
Refined / silver — deduped, typed, conformed business keys
The refined-zone invariant: silver applies dedup against natural or business keys, coerces types to a canonical schema, conforms key columns across sources, and may emit slowly-changing-dimension (SCD) history; the zone is the single source of truth for downstream application code and most analyst SQL. Idempotency at the silver layer is non-negotiable — re-running a daily job must produce byte-identical output.
- **Dedup on `(business_key, source_ts)`** — `ROW_NUMBER() OVER (PARTITION BY business_key ORDER BY source_ts DESC) = 1`.
- **Type coercion** — JSON strings → typed columns; epoch ms → `TIMESTAMP`; cents → `DECIMAL(18,2)`.
- **Conformed dimensions** — `customer_id`, `product_id`, `geo_id` mapped to one canonical form across every source.
- **SCD type 2** — emit `(valid_from, valid_to, is_current)` columns when downstream consumers need point-in-time joins.
Worked example. Bronze orders rows arrive twice for order_id=448 due to a CDC retry; silver dedup keeps the latest.
| order_id | source_ts | bronze_rn |
|---|---|---|
| 448 | 2026-04-12 09:30:00 | 2 |
| 448 | 2026-04-12 09:30:15 | 1 (kept) |
| 449 | 2026-04-12 10:00:00 | 1 (kept) |
Step-by-step explanation.
- Bronze contains three rows for `ingest_date = 2026-04-12`: two for `order_id = 448` (a CDC retry produced two payloads at 09:30:00 and 09:30:15) and one for `order_id = 449`.
- `ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY source_ts DESC)` numbers rows independently inside each `order_id` group, with the latest `source_ts` getting `rn = 1`.
- For `order_id = 448`: the row at 09:30:15 is later, so it gets `rn = 1`; the 09:30:00 row gets `rn = 2`.
- For `order_id = 449`: only one row, so it gets `rn = 1` automatically.
- The outer `WHERE rn = 1` keeps two rows — the latest `order_id = 448` and the only `order_id = 449` — and silently drops the duplicate, producing a deterministic single-row-per-business-key silver table.
Worked-example solution.
```sql
WITH ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY order_id
           ORDER BY source_ts DESC
         ) AS rn
  FROM bronze.orders
  WHERE ingest_date = DATE '2026-04-12'
)
SELECT order_id,
       customer_id,
       amount::DECIMAL(18,2) AS amount,
       order_status,
       source_ts AS as_of_ts,
       CURRENT_TIMESTAMP AS silver_loaded_at
FROM ranked
WHERE rn = 1;
```
Rule of thumb: the silver zone is where ETL bugs hide — invest in unit-tested dedup logic, schema-evolution tests, and aggregate reconciliation against bronze totals before promoting to gold.
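One hedged way to make "unit-tested dedup" concrete: an assertion query that must return zero rows before any promotion to gold (names from the worked example above):

```sql
-- Dedup invariant: silver holds exactly one row per business key.
-- Any returned row fails the pipeline run.
SELECT order_id, COUNT(*) AS dup_rows
FROM silver.orders
GROUP BY order_id
HAVING COUNT(*) > 1;
```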
Curated / gold — subject-area tables and star schemas
The curated-zone invariant: gold publishes tables shaped for downstream consumption: dimensional models (fact tables + conformed dimensions), subject-area marts, or one-big-table (OBT) flattenings; SLAs are stricter, freshness is tracked, and consumer contracts are explicit. Each gold table maps to exactly one consumer class — analysts, dashboards, ML feature pipelines, or reverse-ETL into operational systems.
- **Star schema** — `fact_orders` joined to `dim_customer`, `dim_product`, `dim_date`; one row per business event.
- **Subject-area marts** — domain-scoped denormalized tables (e.g., `mart_marketing_attribution`).
- OBT flattening — when consumers prefer one wide table over a join (Looker, Power BI dashboards).
- **Consumer contracts** — column types, refresh cadence, breakage policy declared in dbt-style metadata.
Worked example. A gold star schema for the orders subject area.
| table | grain | example columns |
|---|---|---|
| `fact_orders` | one row per order line | `order_id, line_id, customer_key, product_key, date_key, qty, revenue` |
| `dim_customer` | one row per customer (SCD2) | `customer_key, customer_id, name, valid_from, valid_to` |
| `dim_product` | one row per product | `product_key, product_id, category` |
| `dim_date` | one row per calendar date | `date_key, date, iso_week, is_weekend` |
Step-by-step explanation.
- `fact_orders` is the central transactional table at order-line grain — one row per line item, with numeric measures (`qty`, `revenue`) and foreign-key columns to every dimension.
- `dim_customer` is an SCD2 dimension: a single real-world customer can appear in multiple rows over time, each with `valid_from` / `valid_to` / `is_current` columns to capture historical attribute changes.
- `dim_product` is a simpler Type-1 dimension: one row per product, current state only — overwrites on update with no history.
- `dim_date` is the conformed date dimension: one row per calendar date with pre-computed week, month, quarter, year, and `is_weekend` columns so dashboards never have to compute date math at query time.
- Joins from `fact_orders` to each dimension use the surrogate keys (`customer_key`, `product_key`, `date_key`) — never the natural business IDs — so SCD2 history is preserved when the same customer's row evolves over time.
Worked-example solution.
```sql
CREATE TABLE gold.fact_orders AS
SELECT
    s.order_id,
    s.line_id,
    dc.customer_key,
    dp.product_key,
    dd.date_key,
    s.qty,
    s.qty * s.unit_price AS revenue
FROM silver.order_lines s
JOIN gold.dim_customer dc
  ON dc.customer_id = s.customer_id
 AND s.order_ts BETWEEN dc.valid_from AND dc.valid_to
JOIN gold.dim_product dp ON dp.product_id = s.product_id
JOIN gold.dim_date dd ON dd.date = s.order_ts::DATE;
```
Rule of thumb: gold is the only zone consumers should reference by name; if a dashboard reads silver directly, your contract is leaking. Use views or feature-flagged exposures rather than letting consumers couple to interim grains.
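A sketch of the views-over-interim-grains advice, with hypothetical names (`currency_code` is an invented column for illustration):

```sql
-- Consumers bind to a stable, versioned exposure; the physical table evolves behind it.
CREATE VIEW gold.fact_orders_v1 AS
SELECT order_id, line_id, customer_key, product_key, date_key, qty, revenue
FROM gold.fact_orders;

-- A breaking change ships as a new exposure instead of mutating v1.
CREATE VIEW gold.fact_orders_v2 AS
SELECT order_id, line_id, customer_key, product_key, date_key, qty, revenue,
       currency_code               -- hypothetical added column
FROM gold.fact_orders;
```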
Common beginner mistakes
- Treating bronze as a junk drawer with no `ingest_date` partitioning — replay and audit become impossible.
- Doing dedup at gold instead of silver — every downstream job has to repeat the work and answers diverge.
- Letting consumers query silver directly — silver schemas can change without notice; gold contracts are explicit.
- Skipping `ingest_id` and `source_ts` lineage columns — when counts drift, you have no way to reconstruct what landed when.
- Mixing batch and streaming writes into the same bronze prefix without a partition key for write-mode — late arrivals overwrite eager batches.
Data Lake Interview Question on Designing Layered Zones
A team dumps daily JSON exports of orders into a single S3 prefix. Analysts complain that order counts drift versus the source application by 0.5–2% on most days. Design a three-zone medallion layout that fixes the drift, makes the discrepancy investigable, and supports daily reruns without producing duplicates.
Solution Using Bronze (append-only) + Silver (dedup) + Gold (star schema)
```
1. Move existing daily dumps into:
     s3://analytics-lake/bronze/orders/ingest_date=YYYY-MM-DD/ingest_id=<batch>/
   Append-only; never overwrite a date partition.

2. Build silver/orders as a daily MERGE that:
   - Dedups bronze rows by ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY source_ts DESC) = 1
   - Coerces JSON fields to a typed schema
   - Joins against dim_customer / dim_product on conformed keys
   - Carries ingest_id + source_ts as lineage columns

3. Promote to gold/fact_orders only after a silver-vs-source aggregate-reconciliation job
   passes a tolerance threshold (e.g., |silver_count - source_count| / source_count < 0.001).

4. Surface a row-count + revenue-sum dashboard sourced from BOTH bronze and the source app's
   replica, so any future drift surfaces within one ingest cycle.
```
Why this works: the append-only bronze layer makes the discrepancy investigable — every historical payload is preserved with ingest_id and ingest_date, so analysts can replay any day's source state; the silver dedup converts CDC retries and late-arriving rows into a deterministic single row per order_id; the gold layer is gated by an aggregate-reconciliation step that catches drift before it reaches dashboards; and the dual-source row-count dashboard surfaces residual drift immediately. The combination addresses both the prevention (idempotent dedup) and detection (reconciliation + dashboard) sides of the failure mode.
Step-by-step trace for the drift scenario on 2026-04-12:
| step | action | observation |
|---|---|---|
| 1 | bronze ingests `ingest_date=2026-04-12` | 12,847 raw rows including 12 CDC retries |
| 2 | silver dedup keeps `rn = 1` | 12,835 unique `order_id`s |
| 3 | source-app replica reports | 12,835 orders for 2026-04-12 |
| 4 | reconciliation passes | drift = 0 / 12,835 = 0.0% |
| 5 | promote to gold/fact_orders | 12,835 fact rows; counts match dashboard |
Output: the fixed-state contract per ingest day:
| metric | bronze | silver | source | gold | drift |
|---|---|---|---|---|---|
| row count | 12,847 | 12,835 | 12,835 | 12,835 | 0.0% |
| total revenue | $4,128,931 | $4,128,931 | $4,128,931 | $4,128,931 | 0.0% |
Why this works — concept by concept:
- **Append-only bronze with `ingest_date` partitioning** — every payload is preserved and addressable; replay is a `WHERE ingest_date = ...` filter rather than a re-ingest.
- **Silver dedup via `ROW_NUMBER` over `(order_id ORDER BY source_ts DESC)`** — collapses CDC retries to a deterministic single row per business key; idempotent on rerun.
- **Lineage columns `ingest_id` + `source_ts`** — every silver row points back to a specific bronze file and source moment; forensic debugging is one join away.
- **Aggregate reconciliation gate before gold** — drift cannot reach dashboards because gold is gated on `|silver - source| / source < threshold`; failures page the on-call rather than silently corrupt the BI tool.
- Dual-source dashboard — surfaces drift instantly even when reconciliation isn't perfect; the early-warning loop pays for itself the first incident.
- **O(|bronze|) time per day** — single linear scan + window for dedup; reconciliation adds one aggregate per zone, negligible compared to ingest cost.
Drill the ETL practice page for medallion-zone problems and the dimensional modeling practice page for star-schema patterns at the gold layer.
2. Ingestion → Catalog → Compute Flow on Object Storage
Sources to query engines through metadata catalogs in data lake architecture
"How does data physically get from a Postgres source into a query engine like Spark or Trino on the lake?" is the signature design follow-up — and the cleanest answer is the ingest → register → query flow with three distinct components. The mental model: sources (databases, APIs, streaming platforms, file feeds) ingest into object storage as files; a metadata catalog (Hive Metastore, AWS Glue, Unity Catalog, Polaris, Iceberg REST catalog) maps logical tables to physical file paths and column schemas; compute engines (Spark, Trino, Presto, DuckDB-in-the-cloud, Snowflake external tables) read the catalog to discover tables and read the object store to fetch data. The decoupling is the entire value proposition — many engines can read the same footprint, and storage scales independently from compute.
Pro tip: When the interviewer asks "where does Spark get the schema from?", the answer is the catalog, not the file. Files (Parquet, ORC, Avro) carry their own schema in the footer, but the catalog is what makes a logical table addressable across sessions and engines. State this distinction explicitly — it separates candidates who learned data lake architecture by reading docs from those who learned by debugging production.
Object storage as the storage layer — S3, GCS, ADLS
The object-store invariant: modern lakes use cloud object storage (Amazon S3, Google Cloud Storage, Azure Data Lake Storage / ADLS Gen2) rather than HDFS; storage is **scalable without practical limit, durable, and decoupled from compute, and it offers no atomic multi-file commit, a gap the table format is responsible for masking**. Files are typically Parquet (columnar, compressed) or ORC; Avro shows up in streaming pipelines.
- **Hive-style partitioning** — `s3://bucket/table/col=value/file.parquet` for partition pruning at query time.
- File sizes — target 128MB-1GB per file; smaller files trigger the small-file problem (excessive metadata, slow planning).
- Compaction — periodic batch jobs that rewrite many small files into fewer large ones.
- Eventual consistency — S3 was eventually consistent for many years (it has been strongly consistent since late 2020); the table format still owns the retry / commit semantics that keep concurrent writes and retries safe.
Worked example. A Hive-style partition layout for a daily-loaded orders table in silver.
| prefix | role |
|---|---|
| `s3://analytics-lake/silver/orders/` | table root |
| `…/ingest_date=2026-04-13/` | partition value |
| `…/ingest_date=2026-04-13/part-00000.parquet` | data file |
| `…/_delta_log/` or `…/metadata/` | table-format metadata (if Delta/Iceberg) |
Step-by-step explanation.
- The table root `s3://analytics-lake/silver/orders/` is the registered location in the catalog; everything under it belongs to one logical table.
- Each child prefix `ingest_date=YYYY-MM-DD/` is one Hive partition value; the `key=value` syntax is the convention every engine (Spark, Trino, Athena, Snowflake) recognizes.
- Inside each partition, multiple Parquet files (~180MB each) split the data so a Spark reader can fetch them in parallel; the file count is bounded by your micro-batch size.
- The `_delta_log/` (Delta) or `metadata/` (Iceberg) prefix holds the table-format commit log — a sequence of JSON files describing every transaction, which is what gives you ACID and time travel on top of plain object storage.
- A query with `WHERE ingest_date = '2026-04-13'` triggers partition pruning: the planner reads only files under that one prefix, skipping every other day's files entirely — the difference between 200ms and 60s.
Worked-example solution. Object layout for a partitioned silver table:
```
s3://analytics-lake/silver/orders/
  ingest_date=2026-04-13/
    part-00000.parquet   (180MB, 1.2M rows)
    part-00001.parquet   (165MB, 1.1M rows)
  ingest_date=2026-04-12/
    part-00000.parquet   (175MB)
  _delta_log/            # Delta Lake commit log
    00000000000000000001.json
    00000000000000000002.json
```
Rule of thumb: if your average file size is below 50MB, schedule a daily compaction job; if it's above 1GB, your partitions are too coarse. Both extremes hurt query latency.
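What "schedule a daily compaction job" can look like in SQL; the syntax depends on the table format, so treat both statements as sketches to verify against your versions:

```sql
-- Delta Lake: rewrite small files in yesterday's partition.
OPTIMIZE silver.orders WHERE ingest_date = '2026-04-12';

-- Iceberg (Spark stored procedure); 'spark_catalog' is an assumed catalog name.
CALL spark_catalog.system.rewrite_data_files(
  table => 'silver.orders',
  where => "ingest_date = '2026-04-12'"
);
```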
Metadata catalog — Hive Metastore, AWS Glue, Unity Catalog
The catalog invariant: a metadata catalog maps logical names (silver.orders) to physical locations (s3://analytics-lake/silver/orders), column schemas, partition definitions, and table properties; it is the single source of truth for "what tables exist" across every compute engine that reads the lake. The catalog can be a long-running service (Hive Metastore, AWS Glue Data Catalog, Databricks Unity Catalog) or a REST API on top of files (Iceberg REST catalog, Polaris).
- **Logical → physical mapping** — `silver.orders` → `s3://...`; column names, types, partition keys.
- Engine-agnostic — Spark, Trino, Presto, Snowflake external tables, Athena, DuckDB all read the same catalog.
- Schema evolution — add column, widen type, rename (with caveats); the catalog records the evolution history (see the sketch after this list).
- Permissions — many catalogs (Unity, Glue with Lake Formation) carry table/column-level access policies.
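The schema-evolution bullet deserves a concrete shape: a sketch assuming Spark SQL against an Iceberg-backed table, where renames are safe because columns are tracked by ID:

```sql
-- Adding a column is a metadata-only commit; existing files read it as NULL.
ALTER TABLE silver.orders ADD COLUMNS (discount DECIMAL(18,2));

-- Rename is the 'with caveats' case: safe on Iceberg (column IDs),
-- risky on plain Hive tables (name-based resolution can break readers).
ALTER TABLE silver.orders RENAME COLUMN amount TO gross_amount;
```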
Worked example. Registering a partitioned silver.orders table in Glue.
| field | value |
|---|---|
| logical name | silver.orders |
| location | s3://analytics-lake/silver/orders/ |
| input format | parquet |
| partition keys | ingest_date STRING |
| schema | order_id BIGINT, customer_id BIGINT, amount DECIMAL(18,2), source_ts TIMESTAMP |
Step-by-step explanation.
- `CREATE EXTERNAL TABLE silver.orders` declares a logical name in the catalog without copying or moving any data files.
- The column list (`order_id BIGINT, …`) declares the schema the engine should expect; Parquet files store their own schema in the footer, but the catalog is the canonical answer the planner trusts.
- `PARTITIONED BY (ingest_date STRING)` declares the partition column; this column is derived from the prefix path, not stored in the data files, which keeps each partition lean.
- `LOCATION 's3://analytics-lake/silver/orders/'` is the prefix the engine scans when reading; data files must already exist at this location.
- `MSCK REPAIR TABLE silver.orders` walks the S3 prefix, discovers any partition values it doesn't yet know about, and registers them; without this command after a backfill, the planner returns zero rows for the new dates.
Worked-example solution.
```sql
CREATE EXTERNAL TABLE silver.orders (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DECIMAL(18,2),
  source_ts   TIMESTAMP,
  ingest_id   STRING
)
PARTITIONED BY (ingest_date STRING)
STORED AS PARQUET
LOCATION 's3://analytics-lake/silver/orders/';

MSCK REPAIR TABLE silver.orders;
```
Rule of thumb: always run MSCK REPAIR TABLE (or the engine equivalent) after a backfill that adds new partition prefixes; otherwise the catalog won't know about them and the partition predicate will return zero rows.
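The "engine equivalent" varies; a few common forms, hedged because availability depends on engine and connector versions:

```sql
-- Hive / Athena / Spark SQL
MSCK REPAIR TABLE silver.orders;

-- Spark SQL alternative form
ALTER TABLE silver.orders RECOVER PARTITIONS;

-- Trino with the Hive connector (run against the hive catalog)
CALL system.sync_partition_metadata('silver', 'orders', 'ADD');
```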
Compute engines — Spark, Trino, Presto, DuckDB
The compute invariant: compute engines read the catalog to discover tables, plan queries with partition pruning and predicate pushdown, then read the relevant Parquet/ORC files from object storage; storage and compute scale independently and the same data can be queried by multiple engines simultaneously. Spark dominates for batch + streaming pipelines; Trino/Presto dominate for interactive SQL; DuckDB is rising for single-node analytics.
- Spark — JVM, batch + streaming, rich ecosystem (Iceberg/Delta connectors, Spark SQL, MLlib, Structured Streaming).
- Trino / Presto — interactive SQL across many catalogs; great for federated queries across lake + warehouse.
- DuckDB — single-node, embeddable, blazing fast for sub-TB analytics; popular for ad-hoc + notebooks.
- Snowflake / BigQuery / Redshift external tables — read lake data from inside a managed warehouse.
Worked example. A Spark SQL query against silver.orders with partition pruning.
| layer | action | data scanned |
|---|---|---|
| catalog | resolve `silver.orders` → `s3://...` | metadata only |
| planner | prune partitions for `ingest_date = '2026-04-13'` | one partition |
| Spark workers | read Parquet column-block for `amount` | ~50MB |
| executor | aggregate `SUM(amount)` | local |
Step-by-step explanation.
- Spark resolves `silver.orders` against the catalog — pure metadata fetch, zero data scanned, returns the location plus the partition schema.
- The planner sees `WHERE ingest_date = '2026-04-13'` and prunes the partition list to a single value, so workers only need to list files under one S3 prefix instead of all of them.
- Workers issue an S3 `LIST` for that single partition, fetching a list of ~one to ten Parquet file paths.
- Each Parquet reader uses footer metadata to skip every column except `amount`, then streams just that column block — typically 50MB instead of the full 1GB Parquet file.
- Each task computes a partial `SUM(amount)` locally; a final shuffle sums the partial values to one number — the entire query is O(rows in one partition) and runs in sub-second time.
Worked-example solution.
```sql
SELECT SUM(amount) AS daily_revenue
FROM silver.orders
WHERE ingest_date = '2026-04-13';
```
Rule of thumb: always include the partition key in your WHERE clause to enable partition pruning; without it, the planner reads every partition (terabytes), and your query goes from 500ms to 50 seconds.
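One subtlety worth stating in the interview: the predicate has to hit the partition column directly; wrapping it in a function defeats pruning on many planners. A sketch:

```sql
-- Pruned: the planner lists exactly one prefix.
SELECT SUM(amount) FROM silver.orders
WHERE ingest_date = '2026-04-13';

-- Often NOT pruned: a function over the partition column can force a full
-- listing (engine-dependent; check with EXPLAIN).
SELECT SUM(amount) FROM silver.orders
WHERE substr(ingest_date, 1, 7) = '2026-04';

-- Range form that still prunes:
SELECT SUM(amount) FROM silver.orders
WHERE ingest_date BETWEEN '2026-04-01' AND '2026-04-30';
```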
Common beginner mistakes
- Skipping the catalog and reading raw S3 paths in every job — schemas drift, no central source of truth, no permissions.
- Ignoring file-size budgets — millions of 5KB files (the small-file problem) make Spark planning slower than the actual scan.
- Not declaring partition keys — full-table scans on every query, costs balloon by 100x.
- Mixing file formats inside one logical table (some Parquet, some JSON) — the planner can't push predicates and queries error out.
- Forgetting to refresh the catalog after a backfill — `MSCK REPAIR TABLE` or `REFRESH TABLE` is the single most-forgotten command.
Data Lake Interview Question on CDC Ingestion from Postgres
Design a near-real-time ingestion pipeline that lands changes from a 10TB Postgres database into the lake, registers them in a catalog, and exposes them to Spark and Trino with sub-five-minute freshness.
Solution Using Debezium → Kafka → Iceberg with Hive Metastore
```
Postgres (with logical replication enabled)
        │
        ▼
Debezium connector (CDC reader, emits change events)
        │
        ▼
Kafka topic per table (key = primary key; value = before/after JSON or Avro)
        │
        ▼
Spark Structured Streaming job (1-minute trigger):
  - Reads Kafka topic
  - Writes to bronze.orders_cdc as append-only Iceberg files (partitioned by event_date)
        │
        ▼
Hive Metastore / Glue catalog:
  - bronze.orders_cdc registered with Iceberg metadata
  - silver.orders_current registered as a Spark MERGE-on-read view
        │
        ▼
Compute consumers:
  - Trino: SELECT * FROM silver.orders_current WHERE event_date = today
  - Spark batch: nightly compaction + table-maintenance
```
Why this works: Debezium reads the Postgres write-ahead log (WAL) directly via logical replication, so it captures every insert/update/delete with no impact on the source; Kafka decouples the producer from the consumer and absorbs traffic spikes; the Spark Structured Streaming job runs with a one-minute trigger, so the lake is at most one minute behind; Iceberg's ACID transactions make concurrent micro-batch writes safe; the Hive Metastore registers the table once, and both Trino and Spark see the same schema; partitioning by event_date enables prune-friendly time-window queries; nightly compaction keeps file sizes in the 128MB-1GB sweet spot.
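A hedged sketch of the `silver.orders_current` maintenance as a MERGE over Debezium-style events; the `op` column and the `QUALIFY` clause are assumptions about the payload shape and SQL dialect, not details from the pipeline above (on engines without `QUALIFY`, wrap the ranking in a subquery):

```sql
-- Apply the newest change event per key from bronze CDC into current state.
MERGE INTO silver.orders_current t
USING (
  SELECT *
  FROM bronze.orders_cdc
  WHERE event_date = '2026-04-13'
  QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY source_ts DESC) = 1
) s
ON t.order_id = s.order_id
WHEN MATCHED AND s.op = 'd' THEN DELETE           -- Debezium delete marker
WHEN MATCHED THEN UPDATE SET *                    -- update / upsert
WHEN NOT MATCHED AND s.op <> 'd' THEN INSERT *;   -- brand-new key
```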
Step-by-step trace for an order update at 09:30:00.000:
| time | component | action |
|---|---|---|
| 09:30:00.000 | Postgres | UPDATE orders SET status='shipped' WHERE order_id=448 |
| 09:30:00.150 | Debezium | reads WAL, emits change event to Kafka |
| 09:30:00.300 | Kafka | persists change event to topic `orders.cdc` |
| 09:30:30.000 | Spark Streaming | next 1-min trigger; reads change events |
| 09:30:35.000 | Spark Streaming | writes Parquet to `bronze.orders_cdc/event_date=2026-04-13/` |
| 09:30:35.500 | Iceberg | commits new snapshot; catalog updated |
| 09:30:40.000 | Trino | `SELECT … FROM silver.orders_current WHERE order_id=448` returns updated row |
End-to-end latency: ~40 seconds. Well within the five-minute SLA.
Output: the consumer-visible contract per minute:
| metric | target | actual |
|---|---|---|
| freshness (P50) | < 5 min | ~40 sec |
| freshness (P99) | < 5 min | ~2 min |
| dropped events | 0 | 0 |
| schema-drift incidents | < 1/quarter | 0 last quarter |
Why this works — concept by concept:
- Postgres logical replication + Debezium — captures every row change at the WAL layer; no impact on source query performance; no missed events.
- Kafka as the decoupler — handles backpressure, replays, and multiple downstream consumers; lake outages don't lose source events.
- Spark Structured Streaming with 1-minute trigger — micro-batch sweet spot; latency vs throughput trade-off favors throughput here.
- Iceberg table format — ACID commits make concurrent micro-batch writes safe; time travel makes "what did the table look like at 09:30?" a one-line query.
- Hive Metastore as the unified catalog — Spark and Trino see the same schema; no per-engine duplication.
- **`event_date` partitioning + nightly compaction** — bounds query scan size and keeps file count manageable; both maintenance jobs are idempotent.
- End-to-end latency ~40s — well inside the 5-min SLA; the 4.5-min headroom absorbs Kafka rebalances and Spark micro-batch jitter without alerting.
Drill the streaming practice page for Kafka + micro-batch problems and the Python practice page for PySpark Structured Streaming patterns. Course: PySpark Fundamentals.
3. Lake vs Cloud Warehouse vs Lakehouse — Iceberg, Delta, Hudi
Pattern selection and open table formats in data lake architecture
"When would you pick a lakehouse over a warehouse?" and "what is the difference between Iceberg, Delta Lake, and Hudi?" are the two signature pattern-selection prompts — and they share one mental model: a data lake is files + a catalog; a cloud warehouse is a managed ACID SQL system with proprietary storage; a lakehouse is a lake plus an open table format that adds ACID, time travel, partition evolution, and concurrent writers — bringing warehouse-like semantics to object storage. Iceberg, Delta Lake, and Hudi are the three dominant open table formats, each with slightly different trade-offs.
Pro tip: Most large organizations run a blend: lake for flexible high-volume ingestion and ML feature stores, warehouse or lakehouse SQL for curated analytics. Don't propose a single-pattern solution to a system-design question — describe the boundary between the two and the contracts that flow across it. That's the senior signal.
Data lake — files on object storage with a catalog
The lake invariant: a data lake is object storage (S3/GCS/ADLS) plus a metadata catalog plus open file formats (Parquet, ORC, Avro) plus convention-based partitioning; reads are cheap and parallel, writes are eventual-consistent unless wrapped in a table format, and the cost model is storage + compute-at-query-time. Lakes shine when data shapes are diverse and high-volume.
- Strengths — accepts any data format; massive scale; cheap storage; many engines can read.
- Watch-outs — no ACID without a table format; no time travel; concurrent writes can corrupt the table; small-file problem.
- Best fit — ML feature stores, log archives, raw event data, ingestion landing zones.
- Cost — storage ~$0.023/GB/month (S3 Standard); compute pay-per-query.
Worked example. A 50TB clickstream feature store in S3 + Glue.
| attribute | value |
|---|---|
| storage | S3 Standard, ~$1,150/month for 50TB |
| catalog | AWS Glue (free for first million objects) |
| compute | Athena, ~$5/TB scanned |
| typical query | scan 100GB → ~$0.50 |
Step-by-step explanation.
- Storage line: 50TB × 1024GB × $0.023/GB/month (S3 Standard pricing) ≈ $1,150/month — this is the floor regardless of query activity.
- Catalog line: AWS Glue is free for the first million metadata objects; a 50TB clickstream table partitioned by year/month/day fits comfortably under that limit.
- Compute line: Athena charges per TB scanned, not per query — write efficient SQL (use the partition predicate, project only needed columns) and you pay only for what you actually read.
- Typical query: a partition-pruned + column-projected scan touches ~100GB → 0.1 TB × $5/TB ≈ $0.50; an unpruned full-table scan would touch 50TB → $250 per query.
- Net at this scale: storage dominates the monthly bill (~$1,150) and compute scales linearly with query discipline — bad queries cost real money, good queries are nearly free.
Worked-example solution. A lake-first deployment for clickstream:
```
s3://feature-lake/raw_events/year=2026/month=04/day=13/
  part-00000.parquet
  part-00001.parquet
  …

Glue catalog: feature_lake.raw_events
```
```sql
-- Athena query (the partition columns in this layout are year / month / day)
SELECT user_id, COUNT(*) AS events
FROM feature_lake.raw_events
WHERE year = '2026' AND month = '04' AND day = '13'
GROUP BY user_id;
```
Rule of thumb: a pure lake is the right answer when data is high-volume, schema-flexible, and primarily consumed by ML or batch analytics; reach for a lakehouse the moment you need ACID or concurrent writers.
Cloud warehouse — managed ACID SQL on proprietary storage
The warehouse invariant: a cloud warehouse (Snowflake, BigQuery, Redshift, Synapse) is a managed system that owns both storage and compute, exposes SQL as the primary interface, provides ACID transactions out of the box, and handles indexing, statistics, and query optimization automatically. Warehouses shine when data is structured and the primary consumer is analyst SQL.
- Strengths — mature SQL; ACID; managed governance products (RBAC, masking); workload management.
- Watch-outs — proprietary storage = vendor lock; cost at huge semi-structured scale; less flexible for non-tabular data.
- Best fit — curated analytics, BI dashboards, financial reporting, dimensional models.
- Cost — ~$2-5 per credit-hour or per-TB-scanned; storage ~$0.02-0.04/GB/month (compressed).
Worked example. A 5TB curated finance mart in Snowflake.
| attribute | value |
|---|---|
| storage | Snowflake, ~$200/month for 5TB compressed |
| compute | Small warehouse, ~$2/credit-hour |
| typical query | dashboard refresh in ~30 seconds |
| ACID | full transactions across multi-table updates |
Step-by-step explanation.
- Storage line: Snowflake compresses raw data 3-5x, so 5TB raw becomes ~1-1.5TB stored at ~$23-46/TB/month, landing around $200/month total.
- Compute line: a Small warehouse runs at ~$2/credit-hour; nightly ELT jobs plus business-hours dashboards consume ~50-200 credits/month for a finance mart of this size.
- Typical query: dashboard refresh hits a sub-30-second target because data is co-located with compute and the planner has full statistics.
- ACID guarantee: multi-table updates within a `BEGIN ... COMMIT` block are atomic (sketched below) — the finance close cannot land half-updated, which is the whole reason finance reports run on a warehouse rather than a raw lake.
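The atomicity bullet, sketched as warehouse SQL; `finance.close_status` is a hypothetical bookkeeping table, and the sketch assumes `silver.order_lines` carries a `date_key`:

```sql
BEGIN;
DELETE FROM finance.fact_revenue WHERE date_key = 20260413;
INSERT INTO finance.fact_revenue
SELECT date_key, region_key, product_key, SUM(amount) AS revenue
FROM silver.order_lines
WHERE date_key = 20260413
GROUP BY 1, 2, 3;
UPDATE finance.close_status SET state = 'closed' WHERE period = '2026-04';
COMMIT;  -- all three statements land together or not at all
```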
Worked-example solution. A curated star schema in Snowflake:
```sql
CREATE OR REPLACE TABLE finance.fact_revenue AS
SELECT date_key, region_key, product_key, SUM(amount) AS revenue
FROM silver.order_lines
GROUP BY 1, 2, 3;
```
Rule of thumb: warehouses are the right answer when SQL ergonomics and ACID matter more than format flexibility; reach for a lakehouse when you need both and the ability to query the same data from outside the warehouse.
Lakehouse with Iceberg / Delta / Hudi — ACID on object storage
The lakehouse invariant: a lakehouse is an open-table-format layer (Apache Iceberg, Delta Lake, Apache Hudi) on top of object storage that adds ACID transactions, schema evolution, partition evolution, time travel, and safe concurrent writers; the data sits in standard Parquet files but is governed by a JSON/Avro commit log that any engine can read. Lakehouse architectures combine lake scale with warehouse-like semantics.
- Apache Iceberg — table format invented at Netflix; broad engine support (Spark, Trino, Snowflake, BigQuery, Dremio); REST catalog spec.
- Delta Lake — invented at Databricks; strong Spark integration; commit log in `_delta_log/`; OSS Delta works across engines.
- Apache Hudi — invented at Uber; optimized for upsert-heavy CDC workloads; merge-on-read and copy-on-write modes.
- All three — provide ACID, time travel, schema evolution, and partition pruning; pick by ecosystem and team skill.
Worked example. A 50TB lakehouse on S3 + Iceberg + Spark/Trino.
| dimension | data lake | warehouse | lakehouse (Iceberg) |
|---|---|---|---|
| storage cost | ✓ cheap | ✗ expensive | ✓ cheap |
| ACID transactions | ✗ | ✓ | ✓ |
| concurrent writers | ✗ | ✓ | ✓ |
| time travel | ✗ | depends | ✓ |
| schema evolution | manual | managed | managed |
| vendor lock | none | high | low (open standard) |
| ML / Python access | direct | via connector | direct |
Step-by-step explanation.
- Storage cost row: lake and lakehouse both win because data sits in cheap object storage; warehouse loses at scale because storage is bundled with managed compute.
- ACID + concurrent writers rows: warehouse and lakehouse both provide them out of the box; pure lake does not — concurrent writers can corrupt a lake table without an open table format on top.
- Time travel row: only the lakehouse exposes it natively via the Iceberg/Delta snapshot log; some warehouses offer it as a managed feature; pure lake has no concept.
- Schema evolution row: lakehouse and warehouse both manage adding/widening columns as a metadata commit; pure-lake users do it manually with file rewrites.
- Vendor lock + ML/Python rows: pure lake is open standard; lakehouse is open standard with a richer feature set; warehouse is proprietary and ML access usually requires connectors that copy data back out — which is why ML teams gravitate to lake/lakehouse for feature stores.
Worked-example solution. Creating an Iceberg table via Spark SQL:
```sql
CREATE TABLE lakehouse.orders (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DECIMAL(18,2),
  order_date  DATE
)
USING iceberg
PARTITIONED BY (days(order_date))
LOCATION 's3://lakehouse-bucket/orders/';

MERGE INTO lakehouse.orders t
USING staging.orders_delta s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```
Rule of thumb: the lakehouse pattern is the right answer when you need ACID + time travel + concurrent writers + the ability to query from multiple engines; pick Iceberg for the broadest engine support, Delta for tightest Databricks/Spark integration, Hudi for upsert-heavy CDC.
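Time travel, which the rule of thumb leans on, is a one-liner once the format is in place; Spark SQL syntax shown, and the snapshot id is illustrative:

```sql
-- Read the table as of a wall-clock instant (Spark SQL on Iceberg/Delta).
SELECT COUNT(*) FROM lakehouse.orders TIMESTAMP AS OF '2026-04-12 09:30:00';

-- Or pin to an exact snapshot/version.
SELECT COUNT(*) FROM lakehouse.orders VERSION AS OF 4512;

-- Iceberg also exposes the commit history as a metadata table.
SELECT snapshot_id, committed_at, operation
FROM lakehouse.orders.snapshots;
```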
Common beginner mistakes
- Conflating "lake" with "Hadoop / HDFS" — modern lakes are object storage; HDFS is the legacy on-prem variant.
- Picking a lakehouse "because it's modern" without matching it to the workload — for pure curated SQL analytics, a warehouse is often simpler and cheaper.
- Treating Iceberg / Delta / Hudi as interchangeable — Hudi is upsert-tuned; Delta is Spark-tightest; Iceberg is most engine-agnostic. The choice has long-term implications.
- Forgetting that lakehouses still need governance — IAM, lineage, quality tests, contracts; the "open" part is the storage, not the operational discipline.
- Underestimating the operational cost of running an open lakehouse vs a managed warehouse — engineering time matters.
Data Lake Interview Question on Pattern Selection
A retail company stores 200TB of clickstream events plus a 5TB curated finance mart and a 1TB ML feature store. Should they run on a pure data lake, a cloud warehouse, a lakehouse, or a hybrid? Walk through your decision.
Solution Using a Hybrid — Lakehouse for Clickstream + Features, Warehouse for Finance
```
Workload              Volume   Pattern recommended     Why
────────────────────  ───────  ──────────────────────  ─────────────────────────────────────────
Clickstream events    200 TB   Lakehouse (Iceberg)     Volume + schema flexibility + ML access
ML feature store      1 TB     Lakehouse (Iceberg)     Same engine, same catalog as clickstream
Curated finance mart  5 TB     Cloud warehouse         SQL ergonomics, ACID across many tables,
                               (Snowflake / BigQuery   BI tools
                               / Redshift)

Boundary contract:
- Clickstream + features stay in S3 + Iceberg
- Finance mart loads nightly from Iceberg via Snowflake external tables
- Reverse-ETL syncs finance summaries back into the lakehouse for ML feature joins
```
Why this works: clickstream at 200TB is the workload that justifies the cheaper object-storage cost model; the lakehouse table format adds ACID and time travel that the team will need for replays and audits; the ML feature store sits on the same engine + catalog so feature engineers can JOIN against clickstream without a cross-system data hop; the finance mart at only 5TB is small enough that warehouse storage cost is negligible, and the team's BI tools and analyst SQL ergonomics dominate the decision; the boundary contract (Snowflake external tables) lets finance read curated lake tables without copying them, and reverse-ETL closes the loop for ML.
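A sketch of the boundary contract on the Snowflake side; externally managed Iceberg tables require a pre-created external volume and catalog integration, and every name below is a placeholder to verify against current Snowflake docs:

```sql
-- Zero-copy read of the lakehouse table (names hypothetical).
CREATE ICEBERG TABLE finance.clickstream_curated
  EXTERNAL_VOLUME    = 'lake_s3_volume'
  CATALOG            = 'glue_catalog_integration'
  CATALOG_TABLE_NAME = 'clickstream_curated';

-- Finance analysts then query it like any native table.
SELECT region, SUM(revenue) AS revenue
FROM finance.clickstream_curated
GROUP BY region;
```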
Step-by-step trace of the decision:
| step | question | answer |
|---|---|---|
| 1 | Is volume > 50TB? | yes (clickstream) → lake or lakehouse |
| 2 | Need ACID + concurrent writers + time travel? | yes (CDC + ML feature recomputation) → lakehouse, not pure lake |
| 3 | Pick a table format | Iceberg (broadest engine support across Spark, Trino, Snowflake) |
| 4 | Is the curated SQL workload < 10TB? | yes (finance, 5TB) → warehouse is fine |
| 5 | Pick a warehouse | Snowflake (ergonomics + multi-cloud + Iceberg external table support) |
| 6 | Boundary contract | Snowflake external tables on Iceberg; reverse-ETL nightly job |
Output: the recommended architecture summary:
| zone | technology | volume | primary consumer |
|---|---|---|---|
| Clickstream lakehouse | S3 + Iceberg + Spark/Trino | 200 TB | ML pipelines, analyst SQL via Trino |
| ML feature store | S3 + Iceberg + Spark | 1 TB | ML training + serving |
| Finance warehouse | Snowflake (managed) | 5 TB | Finance analysts, BI dashboards |
| Boundary | Snowflake external tables on Iceberg | — | finance reads curated lake data zero-copy |
Why this works — concept by concept:
- Volume-driven storage choice — 200TB at warehouse storage cost ($0.02-0.04/GB/month) = ~$5K/month; same data on S3 = ~$4.6K/month and available to ML directly. The cost gap widens with growth.
- Lakehouse for ACID + time travel — clickstream replays and ML feature recomputation need transactional snapshots; a pure lake without Iceberg cannot give you that.
- Warehouse for curated SQL — finance analysts live in BI tools; warehouse SQL ergonomics + ACID across multi-table updates dominates the cost-per-query argument at 5TB scale.
- Iceberg as the open boundary — Snowflake reads Iceberg tables natively via external tables; no nightly copy job, no schema drift between systems.
- **Reverse-ETL closes the loop** — finance summaries flow back to the lakehouse so ML features can `JOIN` against revenue without leaving the lake stack.
- Operational cost trade-off — running both a lakehouse and a warehouse is more engineering than a single managed warehouse; the cost is justified at this volume mix but not at 5TB total.
More SQL practice problems for warehouse-style queries and data modeling practice for star-schema and OBT patterns. Course: Data Modeling for Data Engineering Interviews.
4. Interview Answer Shape — Grain, Idempotency, Lineage, Reconciliation
A five-step template for data lake design rounds
"Design our company's analytics data lake" is the canonical open-ended system-design prompt — and the cleanest answer is a five-step template that walks the interviewer through the load-bearing decisions in a fixed order. The mental model: clarify grain → separate landing from conformed → make loads idempotent → attach lineage keys → reconcile aggregates against source. Following this template demonstrates that you have shipped data pipelines before, and it gives the interviewer five concrete spots to drill deeper. Candidates who jump straight to vendor names or who skip the grain question lose the round, regardless of how many tools they can name.
Pro tip: State the template out loud at the start: "I'd answer this in five steps — first clarify grain, then separate landing from conformed, then make loads idempotent, then attach lineage keys, then explain how I'd reconcile aggregates against the source." This gives the interviewer a road map and makes it easy for them to interrupt at any step with "tell me more about X" — which is exactly the signal you want.
Step 1 — Clarify the grain and the metric definition
The grain invariant: the grain of a fact table is the business event one row represents — orders, order lines, shipments, page views, user-day, user-session — and ambiguous grain is the single most common bug in data engineering. Ask the interviewer "are we counting orders or order lines?" before drawing a box. The answer changes joins, group-bys, and reconciliation totals.
- **Order grain** — one row per order; `COUNT(*)` = number of orders.
- **Order-line grain** — one row per line item; `COUNT(DISTINCT order_id)` = number of orders.
- **User-day grain** — one row per user per day; `SUM(events)` = events per user per day.
- **Session grain** — one row per session; rolling `LAG` over events to define session boundaries.
Worked example. "How many orders did we ship last week?" against fact_shipments.
| grain candidate | implied metric |
|---|---|
| order grain | COUNT(*) WHERE shipped_date BETWEEN ... |
| order-line grain | COUNT(DISTINCT order_id) WHERE shipped_date BETWEEN ... |
| shipment grain | COUNT(DISTINCT order_id) WHERE shipment_event = 'shipped' |
Step-by-step explanation.
- If the table is at order grain (one row per order),
COUNT(*) WHERE shipped_date BETWEEN ...directly counts orders shipped — clean and simple. - If the table is at order-line grain (one row per item per order),
COUNT(*)over-counts every multi-item order; the right answer becomesCOUNT(DISTINCT order_id). - If the table is at shipment grain (one row per shipment event per line, including partial shipments and cancellations), filter by
event_type = 'shipped'first and thenCOUNT(DISTINCT order_id). - Without naming the grain, the same SQL can produce three different "right" numbers — and the analyst, dashboard, and source-of-truth Slack thread will each pick a different one.
- Stating the grain in the first sentence of every interview answer prevents this entire class of bug — and the same rule applies in production: every fact table should have its grain documented in the catalog comment.
Worked-example solution. Always state grain explicitly:
"This fact_shipments table has shipment grain — one row per shipment_event per order_line.
For 'orders shipped last week' I'll do COUNT(DISTINCT order_id) where event_type = 'shipped'
and shipped_date BETWEEN start_of_week AND end_of_week."
Rule of thumb: the first sentence of every interview answer should name the grain. Even if the interviewer doesn't ask, declaring grain demonstrates senior intent.
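When documentation is missing, a quick empirical check settles the grain question; a sketch against the example table:

```sql
-- Equal counts → order grain; n_rows > n_orders → line or shipment grain,
-- and COUNT(*) would over-count orders.
SELECT COUNT(*)                 AS n_rows,
       COUNT(DISTINCT order_id) AS n_orders
FROM fact_shipments;
```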
Step 2 — Separate landing from conformed (bronze vs silver)
The separation invariant: landing is what the source sent; conformed is what the business agrees to call truth; never let analysts query landing directly because schemas change without notice. The bronze/silver split is the architectural manifestation of this rule.
- **Landing / bronze** — append-only, source-fidelity, partitioned by `ingest_date`.
- Conformed / silver — deduplicated, typed, with conformed business keys.
- Curated / gold — subject-area marts and dimensional models for downstream consumption.
- Boundary — only the silver and gold layers carry consumer contracts; bronze is for re-processors only.
Worked example. A pipeline lands daily JSON snapshots; without a separation layer, analysts join directly against bronze.orders and break every time the source adds a column.
| layer | who reads | breakage tolerance |
|---|---|---|
| bronze.orders | re-processors only | high (re-process on demand) |
| silver.orders | analyst ad-hoc, ML | low (contract change ≥ 30 days notice) |
| gold.fact_orders | dashboards, BI | zero (versioned column contracts) |
Step-by-step explanation.
- Bronze is owned by the re-processors only — no SLA, no consumer contract; analysts who query it get whatever the source app emitted today, including freshly-renamed columns and broken types.
- Silver is the contract layer — analyst ad-hoc SQL, ML feature pipelines, and reverse-ETL all read it; breakage requires ≥30-day notice so consumers can adapt.
- Gold has zero breakage tolerance — dashboards and BI tools couple to specific column names + types; any change requires explicit version bumping (`gold.fact_orders_v2`) so old dashboards keep working.
- With these boundaries, the silver-layer owner absorbs the upstream change inside the dedup logic, gold contracts stay intact, and the dashboard never breaks — the architecture has done its job.
Worked-example solution. State the layer boundaries explicitly:
"I'd split the platform into three layers — bronze for raw landing, silver for conformed,
gold for analytics-ready. Bronze is for re-processors only; analysts and dashboards read
silver and gold. The boundary contract is documented and breakage requires 30-day notice
for silver and explicit version bumping for gold."
Rule of thumb: any answer that allows analysts to query the landing zone has a hidden bug-factory; the bronze/silver split is what prevents source-schema chaos from cascading into BI.
Step 3 — Idempotent loads — same input → same output, every time
The idempotency invariant: a daily load is idempotent if re-running it (after any failure, manual intervention, or backfill) produces byte-identical output; without idempotency, retries cause duplicates and counts drift silently. Idempotency is achieved through MERGE instead of INSERT, partition-overwrite semantics, or table-format ACID transactions.
- `MERGE` on a business key — `WHEN MATCHED UPDATE SET *` + `WHEN NOT MATCHED INSERT *`.
- **Partition overwrite** — `INSERT OVERWRITE TABLE silver.orders PARTITION (ingest_date='2026-04-13')` (see the sketch after this list).
- **Iceberg / Delta `MERGE`** — ACID transaction; safe for concurrent writers.
- **Functional idempotency** — pure transformations whose output depends only on inputs, never on `NOW()` or random.
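The partition-overwrite variant sketched in an engine-neutral form without `QUALIFY`; rerunning the statement converges to the same final state:

```sql
-- Replaces the day's partition wholesale, so retries and backfills are safe.
INSERT OVERWRITE TABLE silver.orders PARTITION (ingest_date = '2026-04-13')
SELECT order_id, customer_id, amount, source_ts, ingest_id
FROM (
  SELECT b.*,
         ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY source_ts DESC) AS rn
  FROM bronze.orders b
  WHERE ingest_date = '2026-04-13'
) d
WHERE rn = 1;
```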
Worked example. A retry on a half-completed daily load should produce the same final state as the original successful run.
| run | rows in silver before | rows after |
|---|---|---|
| original | 0 | 12,835 |
| retry (after partial failure) | 12,401 | 12,835 (no duplicates) |
| backfill 2026-04-12 a week later | 12,820 | 12,820 (overwritten cleanly) |
Step-by-step explanation.
- Original run: silver starts at 0 rows; the `MERGE` writes 12,835 unique rows after dedup; final count = 12,835.
- Retry after a partial failure: silver already has 12,401 rows (the partial write that crashed); the `MERGE` updates the existing rows and inserts only the missing 434; final count = 12,835 — no duplicates.
- Backfill 2026-04-12 a week later: partition-overwrite semantics drop the existing 12,820 rows for that date and replace them with the freshly recomputed 12,820; final count = 12,820 — clean.
- The key invariant: every rerun produces the same final state regardless of the starting state — that's what idempotency means.
- Without idempotency, a plain INSERT retry would have appended the full 12,835-row batch on top of the 12,401 rows already present, duplicating all of them; the backfill would have either errored on a unique constraint or silently created shadow data that broke the next dashboard refresh.
Worked-example solution.
```sql
MERGE INTO silver.orders t
USING (
  SELECT * FROM bronze.orders
  WHERE ingest_date = '2026-04-13'
  QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY source_ts DESC) = 1
) s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```
Rule of thumb: if your interviewer asks "what happens if this job runs twice", and your answer involves any kind of cleanup script, you don't have idempotency — restructure.
Step 4 — Attach row-level lineage keys — ingest_id, source_ts, pipeline_version
The lineage invariant: every silver and gold row carries the columns that let you reconstruct which source payload produced it and which pipeline version transformed it; without lineage, debugging "why does this row look wrong" is forensic archaeology.
- `ingest_id` — unique identifier of the bronze batch (e.g., timestamp + UUID).
- `source_ts` — timestamp from the source system (CDC) for ordering.
- `pipeline_version` — git SHA or version tag of the transformation code.
- `silver_loaded_at` — when the row entered silver; useful for SLA metrics.
Worked example. Analysts notice that revenue for order_id=448 is wrong; with lineage, they can trace it back to the exact bronze file and pipeline version that produced it.
| field | value |
|---|---|
| order_id | 448 |
| revenue | $99.00 (wrong; should be $999.00) |
| ingest_id | 20260412T0200Z_a3f2 |
| source_ts | 2026-04-12 09:30:15 |
| pipeline_version | `v2.1.7` (commit `b3a4d72`) |
| silver_loaded_at | 2026-04-12 02:15:32 |
Step-by-step explanation.
- An analyst notices `order_id = 448` shows revenue $99 instead of the expected $999 in the BI dashboard.
- They look the row up in silver: `SELECT ingest_id, source_ts, pipeline_version FROM silver.orders WHERE order_id = 448`.
- The result tells them exactly which bronze batch produced this row (`ingest_id = '20260412T0200Z_a3f2'`), the source moment (`source_ts = 2026-04-12 09:30:15`), and the pipeline version that ran (`v2.1.7`).
- They open the bronze file at that `ingest_id`. If the source payload already shows $99, it's a source bug — file a ticket with the upstream team and replay from a known-good `source_ts`.
- If the bronze payload shows $999 but silver shows $99, the bug is in the pipeline. Run `git log v2.1.7` to find the exact commit, fix the transformation, deploy `v2.1.8`, and backfill the affected `ingest_date` partition — total recovery time ~30 minutes instead of multi-day forensic SQL.
Worked-example solution. Carry lineage in every silver row:
```sql
SELECT order_id,
       customer_id,
       amount::DECIMAL(18,2) AS amount,
       source_ts,
       ingest_id,
       'v2.1.7' AS pipeline_version,
       CURRENT_TIMESTAMP AS silver_loaded_at
FROM (deduped bronze rows);  -- placeholder: the ranked / rn = 1 subquery from Step 3
```
Rule of thumb: if the dashboard shows a wrong number and you can't answer "which source file produced this row?" in under five minutes, your lineage isn't strong enough.
Step 5 — Aggregate reconciliation against the source
The reconciliation invariant: a daily job compares aggregate metrics (row counts, sums, distinct counts) between the lake and the source system, alerts on drift above a tolerance, and blocks promotion to gold until the drift is investigated. Reconciliation is the difference between "we trust the lake" and "we hope the lake is right".
- **Row count** — `COUNT(*)` in lake vs source for the same time window.
- **Sum reconciliation** — `SUM(amount)` in lake vs source.
- **Distinct count** — `COUNT(DISTINCT user_id)` to catch dedup bugs.
- Tolerance threshold — typically 0.1% for high-volume facts, 0.01% for finance.
Worked example. Daily reconciliation between silver and source-app replica.
| metric | silver | source | drift | passes? |
|---|---|---|---|---|
| row count | 12,835 | 12,835 | 0.0% | ✓ |
| sum(amount) | $4,128,931 | $4,128,931 | 0.0% | ✓ |
| count(distinct user_id) | 8,712 | 8,712 | 0.0% | ✓ |
Step-by-step explanation.
- The daily reconciliation job runs after the silver load completes for the prior day.
- It computes three metrics over `silver.orders`: `COUNT(*)`, `SUM(amount)`, and `COUNT(DISTINCT user_id)` for the same date.
- It computes the same three metrics over `source_replica.orders` (a read-only replica of the source-app database) for the same date.
- For each metric, drift is calculated as `ABS(silver - source) / source`; the gate passes only if every metric is below the tolerance (0.001 = 0.1% for facts; 0.0001 for finance).
- If all three pass: silver promotes to gold and the dashboard refresh proceeds. If any fail: the gate blocks promotion, pages the on-call engineer, and emits the failing metric to a drift dashboard for investigation — the BI team never sees stale or wrong numbers because they see no refresh.
Worked-example solution. A reconciliation gate before gold promotion, covering all three metrics from the step-by-step above:

```sql
WITH lake AS (
    SELECT COUNT(*) AS n, SUM(amount) AS s, COUNT(DISTINCT user_id) AS u
    FROM silver.orders
    WHERE source_ts::DATE = '2026-04-13'
),
src AS (
    SELECT COUNT(*) AS n, SUM(amount) AS s, COUNT(DISTINCT user_id) AS u
    FROM source_replica.orders
    WHERE order_ts::DATE = '2026-04-13'
)
SELECT
    ABS(lake.n - src.n) * 1.0 / src.n AS row_drift,
    ABS(lake.s - src.s) * 1.0 / src.s AS sum_drift,
    ABS(lake.u - src.u) * 1.0 / src.u AS distinct_drift,
    CASE
        WHEN ABS(lake.n - src.n) * 1.0 / src.n < 0.001
         AND ABS(lake.s - src.s) * 1.0 / src.s < 0.001
         AND ABS(lake.u - src.u) * 1.0 / src.u < 0.001
        THEN 'PASS'
        ELSE 'FAIL'
    END AS gate
FROM lake, src;
```
Rule of thumb: never promote to gold without a reconciliation gate; the BI team will discover any drift the hard way otherwise, and trust takes years to rebuild.
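To power the drift dashboard mentioned above, each run can also append its metrics to a history table. A hypothetical sketch for one metric (repeat per metric, or unpivot all three in one statement); ops.reconciliation_history is an assumed table name:

```sql
-- Append one row per run per metric so drift trends are visible over time.
INSERT INTO ops.reconciliation_history
    (run_date, table_name, metric, lake_value, source_value, drift)
SELECT DATE '2026-04-13', 'silver.orders', 'row_count', lake.n, src.n,
       ABS(lake.n - src.n) * 1.0 / src.n
FROM (SELECT COUNT(*) AS n FROM silver.orders
      WHERE source_ts::DATE = '2026-04-13') AS lake,
     (SELECT COUNT(*) AS n FROM source_replica.orders
      WHERE order_ts::DATE = '2026-04-13') AS src;
```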
Common beginner mistakes
- Skipping Step 1 (grain) and going straight to architecture — every downstream answer is wrong if grain is wrong.
- Letting analysts query the bronze zone directly — schema drift cascades into BI dashboards.
- "Idempotent" loads that depend on NOW() — re-runs produce different rows; not actually idempotent.
- Lineage limited to the pipeline level (not the row level) — debugging "this row is wrong" is a multi-day forensic effort.
- Reconciliation that only checks row counts and not sums — COUNT(*) can still match when the dedup kept the wrong row, so the sum drifts while the count looks clean.
Data Lake Interview Question on a Full System-Design Walkthrough
Walk through your end-to-end answer to "design our company's analytics data lake" using the five-step template.
Solution Using the Five-Step Template
```
1. CLARIFY GRAIN
   "Before I draw any boxes — what's the canonical fact event? Orders, order lines, shipments?
    What's the metric we ultimately care about? Revenue, user counts, latency?"
   → assume: order grain; canonical metric = daily revenue per region.

2. SEPARATE LANDING FROM CONFORMED
   bronze.orders     ← S3 append-only daily JSON, partitioned by ingest_date
   silver.orders     ← deduped + typed + conformed customer_key/region_key
   gold.fact_orders  ← star schema with dim_customer, dim_region, dim_date

3. IDEMPOTENT LOADS
   - bronze: append-only writes by ingest_id (never overwrite)
   - silver: MERGE on order_id with QUALIFY ROW_NUMBER() = 1 dedup (sketched below)
   - gold:   INSERT OVERWRITE PARTITION (date_key) for the affected day(s)
   Re-runs produce byte-identical output.

4. ROW-LEVEL LINEAGE
   Carry ingest_id, source_ts, pipeline_version, silver_loaded_at on every silver row.
   Carry silver_loaded_at and pipeline_version on every gold row.
   Forensic queries: "show me every silver.orders row where pipeline_version = 'v2.1.6'."

5. AGGREGATE RECONCILIATION
   Daily SQL job: compare COUNT(*), SUM(amount), COUNT(DISTINCT user_id) between
   silver.orders and the source-app replica for the prior day. Drift > 0.1% blocks
   gold promotion and pages on-call. Drift dashboard surfaces history at a glance.
```
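If the interviewer asks to see the Step 3 artifact concretely, here is a minimal sketch of the silver MERGE (Databricks/Delta-style syntax; the table and column names are the hypothetical ones from the template):

```sql
-- Upsert the latest payload per order_id into silver; re-runs converge to
-- the same state, so retries cannot create duplicates.
MERGE INTO silver.orders AS t
USING (
    SELECT *
    FROM bronze.orders
    WHERE ingest_date = '2026-04-13'
    QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY source_ts DESC) = 1
) AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET
    t.amount    = s.amount,
    t.source_ts = s.source_ts,
    t.ingest_id = s.ingest_id
WHEN NOT MATCHED THEN INSERT (order_id, amount, source_ts, ingest_id)
    VALUES (s.order_id, s.amount, s.source_ts, s.ingest_id);
```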
Why this works:
- The template gives the interviewer a clear road map (so they know where to drill) while demonstrating that the candidate has shipped this kind of pipeline before.
- Each step addresses a specific failure mode: grain ambiguity, schema drift, retry duplicates, debugging dead-ends, silent data corruption.
- The order is non-arbitrary — Step N depends on Step N-1, and skipping any step weakens the foundation.
- Every step has a concrete artifact (a layer, a SQL pattern, a column, a job), so the interviewer can ask "show me what that looks like" and get a specific answer.
Step-by-step trace through a sample interview round:
| time (min) | step | candidate output |
|---|---|---|
| 0-2 | grain | "Are we counting orders or order lines? Confirmed: orders." |
| 2-7 | landing vs conformed | drew bronze/silver/gold split with ownership boxes |
| 7-12 | idempotency | walked through silver MERGE; named QUALIFY ROW_NUMBER dedup |
| 12-15 | lineage | listed ingest_id, source_ts, pipeline_version columns |
| 15-20 | reconciliation | sketched daily-reconciliation SQL job + drift dashboard |
| 20-25 | open questions | streaming variant, schema evolution, multi-region replication |
Output: the recommended interview-round shape:
| step | minutes | failure mode addressed |
|---|---|---|
| 1 — grain | 0-2 | ambiguous metric → wrong joins |
| 2 — landing vs conformed | 2-7 | source-schema drift → BI breakage |
| 3 — idempotency | 7-12 | retries → duplicates → drift |
| 4 — lineage | 12-15 | "why is this row wrong" → forensic dead-end |
| 5 — reconciliation | 15-20 | silent corruption → trust loss |
Why this works — concept by concept:
- Step 1 anchors the conversation in business semantics — grain is the foundation; getting it right makes Steps 2-5 simpler, getting it wrong makes them all moot.
- Step 2 turns architecture into ownership — naming the layer boundary makes it easy to talk about who reads what, who's allowed to break what, and what notice consumers get.
- Step 3 prevents the most common production incident — non-idempotent loads are the #1 source of "duplicate row" bug reports; demonstrating the MERGE + QUALIFY ROW_NUMBER pattern signals senior fluency.
- Step 4 turns debugging from hours to minutes — lineage columns are the difference between "I can fix this in 10 min" and "I'll get back to you tomorrow."
- Step 5 is the operational backstop — even with steps 1-4 done well, you need reconciliation to catch the failures you didn't anticipate; the gate-before-promotion pattern blocks drift before consumers see it.
- The template's value compounds — each step makes the next one easier, and skipping any step weakens the foundation that the later steps build on.
More practice: the ETL practice problems for end-to-end pipeline design and the data modeling practice page for grain and dimensional patterns. Course: ETL System Design for Data Engineering Interviews.
Tips to crack data lake architecture interviews
Master the four primitives — zones, ingest flow, pattern selection, answer template
If you can draw the bronze/silver/gold zones with ownership labels, walk the ingest → catalog → compute flow without skipping the catalog, articulate when a lakehouse beats a warehouse and when it doesn't, and structure your answer using the five-step grain → landing-vs-conformed → idempotency → lineage → reconciliation template, you can clear roughly 80% of data-engineering system-design rounds. The remaining 20% is dialect-specific (Spark vs Snowflake idioms, Iceberg vs Delta semantics) and behavioral.
Always state grain in the first sentence
Before drawing any boxes, name the grain: "this is order-line grain" or "this is user-day grain". Every wrong answer in a system-design round can be traced back to a grain ambiguity that nobody named. Stating grain explicitly costs five seconds and saves the entire round.
Pick Iceberg unless you have a reason not to
Iceberg has the broadest engine support (Spark, Trino, Snowflake, BigQuery, Dremio, Athena) and is the most engine-agnostic of the three open table formats. Pick Delta if your stack is Databricks-centric and Spark-only. Pick Hudi if your workload is upsert-heavy CDC. State the choice and the reason out loud — "I'd pick Iceberg for engine portability" — interviewers grade the reasoning more than the choice.
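If asked what "picking Iceberg" looks like in practice, a hypothetical Spark SQL declaration (this assumes the Iceberg runtime is attached and a catalog named lake is configured):

```sql
-- Hidden partitioning on a transform of source_ts; the partition spec can
-- evolve later without rewriting existing data files.
CREATE TABLE lake.silver.orders (
    order_id  BIGINT,
    amount    DECIMAL(18,2),
    source_ts TIMESTAMP
)
USING iceberg
PARTITIONED BY (days(source_ts));
```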
Treat idempotency as table stakes, not advanced
MERGE instead of INSERT, partition-overwrite for backfill, and pure transformations whose output depends only on inputs (never NOW() or random) — these are baseline expectations, not advanced techniques. If you forget to mention idempotency in a system-design round, the interviewer will assume you have not shipped a production pipeline.
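A quick contrast that lands well in interviews, using hypothetical tables and columns:

```sql
-- Not idempotent: output depends on when the job happens to run.
SELECT order_id, NOW() AS processed_date FROM bronze.orders;

-- Idempotent: stamp the logical run date injected by the orchestrator,
-- so a re-run of the same logical date produces identical rows.
SELECT order_id, DATE '2026-04-13' AS processed_date FROM bronze.orders;
```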
Use Spark for batch, Trino for interactive, DuckDB for ad-hoc
Spark dominates batch + streaming with the richest connector ecosystem; Trino dominates federated interactive SQL across many catalogs; DuckDB is rising fast for single-node ad-hoc analytics under 1TB. Naming the right tool for the workload (without over-explaining) signals breadth.
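For the DuckDB point, the draw is that ad-hoc analysis needs no cluster at all. A hedged sketch, assuming a hypothetical bucket layout and DuckDB's httpfs extension (S3 credentials configured separately):

```sql
INSTALL httpfs;
LOAD httpfs;

-- Scan silver Parquet straight off S3 from a laptop; the partition path in
-- the glob prunes the read to a single day.
SELECT customer_id, SUM(amount) AS revenue
FROM read_parquet('s3://lake/silver/orders/ingest_date=2026-04-13/*.parquet')
GROUP BY customer_id
ORDER BY revenue DESC
LIMIT 10;
```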
Reconciliation is what separates "we trust the lake" from "we hope the lake is right"
Always include a reconciliation step that compares aggregate metrics between the lake and the source system, alerts on drift above a tolerance, and blocks gold promotion until drift is investigated. The five seconds it takes to mention reconciliation is the difference between a senior signal and a mid-level signal.
Where to practice on PipeCode
Start with the ETL practice page for medallion-zone and end-to-end pipeline problems. Drill the related topic pages: streaming, dimensional modeling, SQL practice, Python practice, data modeling practice. The interview-first courses page bundles structured curricula — start with ETL System Design for Data Engineering Interviews, Data Modeling for Data Engineering Interviews, or PySpark Fundamentals. For broader coverage, browse by topic or read the SQL interview questions for data engineering and top data engineering interview questions 2026 blogs.
Frequently Asked Questions
What is data lake architecture?
Data lake architecture is the set of conventions — layered zones (bronze/silver/gold), an ingest → catalog → compute flow on object storage, an open table format for ACID semantics, and disciplined ownership and quality contracts — that turn raw object storage into a trustworthy analytics platform. Without these conventions, a "data lake" devolves into a data swamp where nobody can trust the numbers.
What is the difference between a data lake, a data warehouse, and a lakehouse?
A data lake is cheap, flexible object storage with file-based reads and no built-in ACID; a cloud warehouse (Snowflake, BigQuery, Redshift) is a managed system with proprietary storage, full ACID, and SQL-first ergonomics; a lakehouse is a lake plus an open table format (Iceberg, Delta Lake, Hudi) that adds ACID, time travel, schema evolution, and concurrent writers — bringing warehouse-like semantics to object storage. Most organizations run a hybrid: lake/lakehouse for high-volume + ML workloads, warehouse for curated SQL.
What are bronze, silver, and gold layers?
Bronze (or landing/raw) is an append-only mirror of source payloads with minimal transformation. Silver (or refined/conformed) applies dedup, type coercion, and conformed business keys; this is the source of truth for downstream applications. Gold (or curated/consumption) publishes subject-area marts and star-schema fact + dim tables for analyst SQL and BI dashboards. The names vary across vendors but the three-tier shape is universal.
Do I need Iceberg, Delta Lake, or Hudi for every project?
No. Small teams can start with well-partitioned Parquet and strict naming conventions. Reach for an open table format when you need ACID transactions, concurrent writers, partition evolution, time travel, or simpler upserts and deletes. Pick Iceberg for the broadest engine support, Delta for tightest Databricks/Spark integration, Hudi for upsert-heavy CDC workloads.
What is the small-file problem?
When a lake table accumulates millions of small files (e.g., 5KB each from frequent micro-batch writes), query planning spends more time listing files in the catalog and metastore than actually scanning data — a Spark or Trino query that should take 500ms can take 50 seconds. The fix is scheduled compaction jobs that rewrite many small files into fewer 128MB-1GB files, plus targeting larger micro-batch sizes upstream.
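A hedged sketch of what scheduled compaction can look like, assuming an Iceberg table queried through Spark with Iceberg's stored procedures enabled (catalog and table names are hypothetical):

```sql
-- Rewrite many small files into ~512MB files.
CALL lake.system.rewrite_data_files(
    table   => 'silver.orders',
    options => map('target-file-size-bytes', '536870912')
);

-- Databricks / Delta Lake equivalent:
-- OPTIMIZE silver.orders;
```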
How do I handle schema evolution in a data lake?
Open table formats handle schema evolution gracefully — adding a column or widening a type is a single metadata commit. Without a table format, schema evolution requires rewriting partitions or carrying a column-version field on every row. Either way, the silver layer should be the schema-stability boundary: bronze accepts whatever the source sends, silver enforces a canonical schema, and changes to silver require explicit consumer notice (typically 30 days).
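For instance, with an open table format the evolution is one DDL statement (hypothetical column, Spark SQL on Iceberg/Delta):

```sql
-- Metadata-only commit: no data files are rewritten; rows written before
-- the change read the new column as NULL.
ALTER TABLE silver.orders ADD COLUMNS (coupon_code STRING);
```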
How does this connect to data engineering interviews on PipeCode?
System-design questions still reduce to SQL queries, Python data transforms, and dimensional modeling decisions. PipeCode focuses on those signals with 450+ problems — drill SQL aggregations and joins, Python pipeline patterns, and dimensional models, then layer on system-design depth via the courses. Use Practice once you can draw the medallion zones and the ingest → catalog → compute flow confidently.