databricks lakehouse is the architecture every modern data-engineering interview now anchors on: one copy of data on cheap object storage, a transactional delta lake layer on top, multi-engine compute (Photon SQL, Spark batch, Structured Streaming, ML notebooks) underneath one unity catalog governance plane — and the medallion architecture (Bronze raw → Silver cleansed → Gold business) is the canonical layering pattern that organises every table inside it. Together those two ideas — lakehouse architecture + bronze silver gold — are the single most-asked combination in 2026 Databricks loops, and the curriculum this guide walks through, end to end, in five numbered teaching sections.
This is the deep-dive companion to a quick "what is a lakehouse?" explainer: where a one-screen overview names the three medallion layers and the Delta table format, this guide widens the surface into five full teaching sections — lakehouse anatomy (storage + transactional layer + compute + governance), medallion architecture (Bronze ingest + Silver cleanse + Gold serve, with the exact transforms that bind each pair), delta lake mechanics (ACID via the _delta_log, time travel, schema evolution, OPTIMIZE + Z-ORDER, VACUUM), an end-to-end production lakehouse pipeline (sources → Auto Loader → Bronze → Silver via Spark or delta live tables → Gold → BI / ML / reverse ETL), and a cheat sheet that maps every interview question to one of the three layers. Each section ends as a real interview answer: a question, a SQL / PySpark / Delta snippet, a traced execution, a sample output, and a concept-by-concept why this works breakdown — the exact shape databricks medallion rounds reward.
When you want hands-on reps immediately after reading, browse the SQL practice library →, drill ETL pipeline problems →, sharpen aggregation reconciliation patterns →, rehearse joins drills →, warm up on data-validation problems →, or widen coverage on the full Python practice library →.
On this page
- Why the lakehouse + medallion model is the modern DE interview baseline
- Lakehouse anatomy — storage + transactional layer + multi-engine compute + Unity Catalog
- Medallion architecture — Bronze raw → Silver cleansed → Gold business marts
- Delta Lake mechanics — ACID + time travel + OPTIMIZE + Z-ORDER
- End-to-end production lakehouse pipeline (sources → Bronze → Silver → Gold → BI/ML)
- Choosing the right layer (cheat sheet)
- Frequently asked questions
- Practice on PipeCode
1. Why the lakehouse + medallion model is the modern DE interview baseline
databricks lakehouse — why the warehouse-plus-lake duplex collapsed
The one-sentence invariant: the lakehouse is the architecture that replaced the warehouse-plus-lake duplex by putting a transactional layer (delta lake) on top of cheap object storage, so one copy of data can serve BI, ML, and streaming through many engines under one governance plane. Before 2020, every serious data team ran two systems — a data lake on S3/ADLS/GCS for ML and raw event capture, and a data warehouse (Snowflake / Redshift / BigQuery) for BI and SQL — and copied data between them with brittle ETL. The lakehouse removes the copy: same Parquet files on the same bucket, but a JSON transaction log makes them ACID, schema-enforced, and queryable by every engine.
What interviewers actually score on databricks lakehouse questions.
- Architecture fluency — can you name the four layers (object storage, transactional Delta, compute engines, Unity Catalog governance) and explain why each one is necessary?
-
Why Delta exists — can you explain what the
_delta_logdoes and why plain Parquet on S3 is not transactional? -
The medallion layering — can you map a raw OLTP
orderstable onto Bronze → Silver → Gold and name the transforms between each pair? - Streaming + batch unification — can you explain why Structured Streaming and batch jobs write to the same Delta table?
-
Cost + perf intuition — can you reason about
OPTIMIZE(small-file compaction),Z-ORDER(multi-dim clustering),VACUUM(tombstone cleanup), andPhoton(vectorised SQL engine)? - Governance — can you say one sentence about Unity Catalog — three-level namespace, fine-grained ACLs, lineage, audit log?
The five-stage map this guide walks through.
-
Stage 1 —
lakehouse anatomy— storage (S3/ADLS/GCS) + transactional Delta + compute engines (Photon, Spark, Streaming, ML) + Unity Catalog governance. -
Stage 2 —
medallion architecture— Bronze (raw, append-only audit trail), Silver (cleansed + conformed), Gold (business marts + BI surfaces). -
Stage 3 —
delta lake mechanics— ACID via_delta_log,MERGE INTO, time travel (VERSION AS OF), schema enforcement + evolution,OPTIMIZE+Z-ORDER,VACUUM. -
Stage 4 —
production pipeline— sources (Kafka, CDC, S3 drops) → Auto Loader → Bronze → Spark / DLT → Silver → aggregate + join → Gold → BI / ML / reverse-ETL consumers. -
Stage 5 —
cheat sheet— pick the right layer for every interview prompt; pick the right Delta feature for every failure mode.
Why this is the new interview baseline and not "just another tool" question.
-
lakehouse architectureis a fundamental shift — the warehouse-plus-lake duplex is not a hardware choice; it is a cost + governance + freshness tradeoff that the lakehouse genuinely resolves. -
The bugs are different — small-file explosions, schema drift in raw Bronze,
MERGEdeadlocks,VACUUMretention violations are all Delta-specific failure modes that don't exist in pure warehouses. -
delta live tableschanges the contract — declarative pipelines with expectations and autoscale replace the imperative Airflow-DAG-of-Spark-jobs you had in 2019. - Streaming and batch share one table — Structured Streaming writes to a Delta table that a batch SQL query reads, atomically, with no Lambda-architecture duplication.
-
Unity Catalog is the governance answer — one catalog across workspaces, with row + column ACLs, lineage, and audit; replaces the per-workspace
hive_metastoreof the 2018 era.
Worked example — map a single orders table onto the lakehouse + medallion model
Detailed explanation. Real interviews probe whether you can think across the lakehouse stack and the medallion layers on a single canonical table. Below is the walkthrough for a daily orders OLTP feed landing in a databricks lakehouse and surfacing as a gold_daily_revenue_mart BI table.
Question. A daily OLTP feed of orders is dropped as JSON to s3://bucket/raw/orders/dt=YYYY-MM-DD/. The BI team wants daily_revenue_by_region refreshed by 06:00 each day. Map the journey of one row onto Bronze, Silver, and Gold; name the transforms; name the Delta features that make each step safe.
Input. Raw orders JSON: {"order_id":1001,"customer_id":42,"region":"US","amount":"99.50","order_ts":"2026-05-28T22:31:09Z","currency":null}.
Code.
-- Bronze (raw append-only ingest, schema-on-read)
CREATE TABLE bronze.raw_orders AS
SELECT *,
_metadata.file_name AS source_file,
current_timestamp() AS ingest_ts
FROM read_files('s3://bucket/raw/orders/', format => 'json');
-- Silver (cleansed + typed + deduplicated)
CREATE OR REPLACE TABLE silver.orders_clean AS
SELECT order_id::BIGINT AS order_id,
customer_id::BIGINT AS customer_id,
upper(region) AS region,
amount::DECIMAL(18,4) AS amount,
order_ts::TIMESTAMP AS order_ts,
coalesce(currency, 'USD') AS currency
FROM bronze.raw_orders
WHERE order_id IS NOT NULL
QUALIFY row_number() OVER (PARTITION BY order_id ORDER BY ingest_ts DESC) = 1;
-- Gold (business mart, aggregated, BI-ready)
CREATE OR REPLACE TABLE gold.daily_revenue_by_region AS
SELECT date(order_ts) AS order_date,
region,
count(*) AS order_count,
sum(amount) AS gross_revenue
FROM silver.orders_clean
GROUP BY date(order_ts), region;
Step-by-step explanation.
-
Bronze (raw) —
read_files(Auto Loader under the hood) ingests every JSON drop as-is; we addsource_file+ingest_tsmetadata; we do not change types or drop rows. Bronze is the audit trail. -
Silver (cleansed) — we cast
amounttoDECIMAL(18,4)(no floating-point money), normaliseregionto upper case, defaultcurrencytoUSD, droporder_id IS NULL, and deduplicate viarow_number()so re-ingests are idempotent. -
Gold (business) — we aggregate to the grain the BI team consumes — one row per
(date, region)— and write to a small, fast, partition-pruned table that powers a Power BI dashboard or a SQL Warehouse endpoint. -
Delta safety net — every
CREATE OR REPLACE TABLEis atomic because of the_delta_log; readers see either yesterday's full table or today's full table, never a half-loaded mess.
Output (the Gold table's first 3 rows).
| order_date | region | order_count | gross_revenue |
|---|---|---|---|
| 2026-05-28 | US | 42137 | 1289450.7500 |
| 2026-05-28 | EU | 18204 | 612900.3300 |
| 2026-05-28 | APAC | 9810 | 287113.9000 |
Rule of thumb: one table threads three layers — Bronze keeps it forever, Silver makes it correct, Gold makes it useful. Senior engineers reason at all three layers on every prompt.
medallion architecture — the four senior signals interviewers chase
Signal 1 — Bronze is append-only, not overwrite. Junior engineers say "Bronze is the raw zone"; senior engineers say "Bronze is the immutable audit trail — every ingest is appended, schema-on-read, never overwritten, because the day you need to re-derive Silver and Gold from a bug fix, only an append-only Bronze can replay history."
Signal 2 — Silver is where contracts live. Junior engineers conflate Silver and Gold; senior engineers say "Silver is the conformed warehouse layer — types are real, deduplication is enforced, late-arriving data is merged, business keys are unique. Silver is the table I'd run a dbt test against."
Signal 3 — Gold is read-optimised, denormalised, and aggregated. Junior engineers leave Gold normalised; senior engineers say "Gold is whatever shape the consumer wants — usually a wide, denormalised, partition-pruned, often-pre-aggregated table built to answer one question fast; we accept duplication because read latency wins."
Signal 4 — every layer is a Delta table. Junior engineers think Bronze is "files" and Gold is "tables"; senior engineers say "all three layers are Delta tables — same _delta_log, same ACID guarantees, same time travel — the difference is contract and grain, not technology."
SQL
Topic — etl
ETL pipeline drills
Company
Databricks interview set
Databricks interview practice
Solution Using a 5-stage lakehouse coverage matrix
Code.
-- One canonical coverage matrix — every row maps a lakehouse stage to an artefact.
CREATE TABLE lakehouse_coverage_matrix AS
SELECT * FROM (VALUES
(1, 'anatomy', 'object_storage', 's3 / adls / gcs', 'always-on'),
(1, 'anatomy', 'delta_transactional', '_delta_log + parquet', 'always-on'),
(1, 'anatomy', 'compute_engines', 'photon + spark + streaming + ml', 'on-demand'),
(1, 'anatomy', 'unity_catalog', 'authn + authz + lineage + audit', 'always-on'),
(2, 'medallion', 'bronze_raw', 'append-only + schema-on-read', 'every load'),
(2, 'medallion', 'silver_cleansed', 'typed + deduped + conformed', 'every load'),
(2, 'medallion', 'gold_business', 'aggregated + denormalised + wide', 'every load'),
(3, 'delta', 'acid', 'merge into target using updates', 'every write'),
(3, 'delta', 'time_travel', 'select ... version as of N', 'on-demand'),
(3, 'delta', 'optimize_z_order', 'compact files + cluster columns', 'nightly'),
(4, 'pipeline', 'auto_loader_ingest', 'incremental file detection', 'continuous'),
(4, 'pipeline', 'dlt_declarative', 'expectations + autoscale', 'continuous'),
(5, 'governance', 'expectations', 'expect / drop / fail on bad rows', 'every load')
) AS t(stage_id, stage_name, artefact_name, primitive, cadence);
Step-by-step trace.
| stage_id | stage_name | artefact_name | primitive | cadence |
|---|---|---|---|---|
| 1 | anatomy | object_storage | s3 / adls / gcs | always-on |
| 1 | anatomy | delta_transactional | _delta_log + parquet | always-on |
| 1 | anatomy | compute_engines | photon + spark + streaming + ml | on-demand |
| 1 | anatomy | unity_catalog | authn + authz + lineage + audit | always-on |
| 2 | medallion | bronze_raw | append-only + schema-on-read | every load |
| 2 | medallion | silver_cleansed | typed + deduped + conformed | every load |
| 2 | medallion | gold_business | aggregated + denormalised + wide | every load |
| 3 | delta | acid | merge into target using updates | every write |
| 3 | delta | time_travel | select ... version as of N | on-demand |
| 3 | delta | optimize_z_order | compact files + cluster columns | nightly |
| 4 | pipeline | auto_loader_ingest | incremental file detection | continuous |
| 4 | pipeline | dlt_declarative | expectations + autoscale | continuous |
| 5 | governance | expectations | expect / drop / fail on bad rows | every load |
- Row 1 —
object_storageis the cheapest, infinitely scalable substrate; everything else stacks on top. - Row 2 —
delta_transactionalis what makes the lakehouse possible — without the_delta_log, you have a data lake, not a lakehouse. - Rows 3-4 —
compute_engines+unity_catalogcomplete the four-layer stack; one storage, many engines, one governance. - Rows 5-7 — the medallion layers map content to grain; Bronze keeps everything, Silver makes it correct, Gold makes it consumable.
- Rows 8-10 —
deltamechanics are the physics — ACID, time travel, OPTIMIZE — every senior question touches one of them. - Rows 11-12 — the production pipeline glue — Auto Loader for ingest, DLT for declarative orchestration.
- Row 13 — DLT
expectationsare the QA layer; every load asserts data quality before it advances.
Output.
| stage_id | stage_name | artefact_name | cadence |
|---|---|---|---|
| 1 | anatomy | object_storage | always-on |
| 2 | medallion | bronze_raw | every load |
| 2 | medallion | silver_cleansed | every load |
| 2 | medallion | gold_business | every load |
| 3 | delta | acid | every write |
| 3 | delta | optimize_z_order | nightly |
| 4 | pipeline | auto_loader_ingest | continuous |
| 4 | pipeline | dlt_declarative | continuous |
Why this works — concept by concept:
- Stage coverage matrix — turns the 5-stage map into an auditable artefact; every architectural decision is owned by exactly one stage, so you can talk to coverage gaps in one query.
-
Cadence binding — pairs each artefact with its run cadence (
always-on,every load,nightly,continuous); senior engineers explicitly assign cadence per artefact. -
Primitive column — codifies the implementation of the artefact (
merge into,_delta_log,expect / drop / fail); interviewers love a candidate who can name the primitive, not just the artefact. - Stage 3 is the differentiator — the four Delta mechanics (ACID, time travel, OPTIMIZE, VACUUM) are the answers that distinguish lakehouse fluency from generic Spark fluency.
-
Cost —
O(1)to read the coverage matrix; the actual artefacts areO(N)over the underlying tables but parallelisable across the five stages.
2. Lakehouse anatomy — storage + transactional layer + multi-engine compute + Unity Catalog
lakehouse architecture — four layers, one platform
lakehouse architecture is best understood as a four-layer stack stacked vertically and read top-down: at the bottom, cheap object storage (S3, ADLS, GCS) holds the actual bytes; in the middle, a transactional layer (Delta Lake, Apache Iceberg, or Apache Hudi) gives those bytes ACID semantics through a JSON-encoded transaction log; on top, multiple compute engines (Photon SQL, Spark batch, Structured Streaming, ML notebooks, BI tools) read and write the same tables; and threaded through all three, a governance plane (Unity Catalog on Databricks) handles permissions, lineage, and audit. The interview test is whether you can explain each layer in one sentence and say why removing any one of them collapses the model back to either a warehouse or a lake.
Layer 1 — object storage (the cheap, infinite substrate).
-
s3 / adls / gcs— the substrate; pay-per-GB, eleven nines of durability, infinite scale, schema-agnostic. -
open formats— Parquet (columnar), JSON (raw), CSV (legacy); the lakehouse never locks data inside a proprietary format. -
bucket organisation— typicallys3://bucket/<env>/<medallion_layer>/<table_name>/<partition_cols>/; one bucket per workspace is common. -
why this layer exists— warehouses store data on expensive coupled storage; the lakehouse decouples storage from compute and pays warehouse-grade only for the brief minutes a cluster runs.
Layer 2 — Delta Lake (the transactional layer).
-
_delta_log— a sub-directory next to the data files containing one JSON file per commit; this log is the source of truth, not the Parquet files. -
ACID— atomic / consistent / isolated / durable writes; concurrent writers serialise via the log, never via a database server. -
schema enforcement + evolution— bad rows are rejected at write time; intentional schema changes are explicit (ALTER TABLE). -
time travel—SELECT * FROM tbl VERSION AS OF 42orTIMESTAMP AS OF '2026-05-01'; the log retains every version up to a retention window. -
why this layer exists— plain Parquet on S3 has no commits, no rollback, no concurrency control; the lakehouse needs warehouse-grade reliability on lake-grade storage, and the log delivers it.
Layer 3 — compute engines (the polyglot layer).
-
Photon— Databricks' vectorised C++ SQL engine; up to 10x faster than open-source Spark on common BI workloads. -
Spark batch— the workhorse for medallion ETL;spark.read.format('delta')+df.write.format('delta'). -
Structured Streaming— the same DataFrame API for streams; reads Kafka / Kinesis / Auto Loader, writes to Delta with exactly-once. -
SQL Warehouses— serverless SQL endpoints for BI tools; auto-suspend, auto-scale, Photon-backed. -
ML runtimes— pre-baked images with PyTorch, TensorFlow, XGBoost, scikit-learn; notebooks query the same Gold tables BI consumes.
Layer 4 — Unity Catalog (the governance plane).
-
three-level namespace—catalog.schema.tablereplaces the flat 2-levelhive_metastore.database.table. -
fine-grained ACLs— GRANT / REVOKE on catalogs, schemas, tables, rows, and columns (via row-filter + column-mask functions). -
lineage— Unity Catalog tracks which table fed which downstream table, all the way to dashboards and ML models. -
audit log— every read, write, GRANT, REVOKE is captured to system tables; SOC2 / HIPAA / GDPR ready. -
cross-workspace— one catalog spans all workspaces in the account; no more per-workspacehive_metastoreduplication.
Worked example — write the four-layer stack as a Spark notebook
Detailed explanation. Real interviews ask you to show that you can invoke each lakehouse layer in code. Below is the canonical four-cell notebook that touches storage (layer 1), Delta transactions (layer 2), multi-engine compute (layer 3), and Unity Catalog (layer 4).
Question. Write a 4-cell PySpark notebook that (a) reads raw JSON from S3, (b) writes it to a Delta table with a schema, (c) queries the Delta table from SQL, (d) grants SELECT on the table to an analyst group via Unity Catalog.
Input. s3://acme-lakehouse/raw/orders/dt=2026-05-28/*.json and a Unity Catalog analyst_group already created at the account level.
Code.
# Cell 1 — Layer 1: read from object storage
raw_df = (
spark.read
.format("json")
.option("multiLine", "false")
.load("s3://acme-lakehouse/raw/orders/dt=2026-05-28/")
)
# Cell 2 — Layer 2: write to a Delta table (transactional)
(
raw_df.write
.format("delta")
.mode("append")
.option("mergeSchema", "true")
.saveAsTable("acme.bronze.raw_orders")
)
-- Cell 3 — Layer 3: query the same table from SQL (Photon-backed)
SELECT region, count(*) AS orders, sum(amount) AS revenue
FROM acme.bronze.raw_orders
WHERE date(order_ts) = '2026-05-28'
GROUP BY region;
-- Cell 4 — Layer 4: grant SELECT via Unity Catalog
GRANT SELECT ON TABLE acme.bronze.raw_orders TO `analyst_group`;
Step-by-step explanation.
-
Cell 1 —
spark.read.format("json")against ans3://path uses Layer 1 (object storage) directly; no warehouse compute needed. -
Cell 2 —
.format("delta").mode("append")writes Parquet files plus a new_delta_log/00000000000000000001.jsoncommit; this is Layer 2 in action. - Cell 3 — the same physical Delta table is queryable from SQL through Photon; the engine is different from the writer but the data is the same — that is Layer 3's multi-engine promise.
-
Cell 4 —
GRANT SELECTto a group goes through Layer 4 (Unity Catalog); every subsequent read by anyone inanalyst_groupis recorded in the audit log.
Output (Cell 3 result).
| region | orders | revenue |
|---|---|---|
| US | 42137 | 1289450.75 |
| EU | 18204 | 612900.33 |
| APAC | 9810 | 287113.90 |
Rule of thumb: every layer is invokable in one line of code. Junior engineers think the lakehouse is "Spark + S3"; senior engineers can write the four cells above without looking it up.
lakehouse vs data warehouse vs data lake — the three senior tradeoffs
-
cost— warehouses charge for coupled storage + compute (~$23/TB/month for Snowflake storage alone); lakehouses pay $0.023/TB/month for S3 plus per-second compute. -
schema enforcement— warehouses enforce schema on write (strict); lakes enforce schema on read (loose); lakehouses enforce schema on write via Delta but allow safe evolution. -
workload coverage— warehouses do BI great, ML poorly; lakes do ML great, BI poorly; lakehouses do both — same Delta table feeds Power BI and a PyTorch DataLoader. -
governance— warehouses ship strong governance out of the box; lakes ship none; lakehouses ship Unity Catalog which closed the gap in 2022-2024. -
vendor lock-in— warehouses lock data in proprietary formats; lakes and lakehouses keep open Parquet that any engine can read.
SQL
Topic — database
Database design drills
SQL
Topic — design
System design problems
Solution Using a one-table comparison matrix
Code.
-- A single comparison matrix; row = decision criterion, columns = the three architectures.
CREATE TABLE lakehouse_vs_warehouse_vs_lake AS
SELECT * FROM (VALUES
('storage_cost_per_tb', '$0.023 / mo (S3)', '$23 / mo (Snowflake)', '$0.023 / mo (S3)'),
('schema_enforcement', 'on read (loose)', 'on write (strict)', 'on write (Delta strict)'),
('acid', 'no', 'yes', 'yes (delta_log)'),
('time_travel', 'no', 'limited (fail-safe)', 'yes (version as of)'),
('bi_latency_ms', '> 10000 (cold)', '< 500', '< 500 (Photon)'),
('ml_workload', 'native', 'awkward', 'native'),
('streaming', 'awkward', 'awkward', 'native (Structured Streaming)'),
('vendor_lock_in', 'low', 'high', 'low (open Parquet)'),
('governance', 'none', 'strong', 'strong (Unity Catalog)')
) AS t(criterion, data_lake, data_warehouse, lakehouse);
Step-by-step trace.
| criterion | data_lake | data_warehouse | lakehouse |
|---|---|---|---|
| storage_cost_per_tb | $0.023 / mo (S3) | $23 / mo (Snowflake) | $0.023 / mo (S3) |
| schema_enforcement | on read (loose) | on write (strict) | on write (Delta strict) |
| acid | no | yes | yes (delta_log) |
| time_travel | no | limited (fail-safe) | yes (version as of) |
| bi_latency_ms | > 10000 (cold) | < 500 | < 500 (Photon) |
| ml_workload | native | awkward | native |
| streaming | awkward | awkward | native (Structured Streaming) |
| vendor_lock_in | low | high | low (open Parquet) |
| governance | none | strong | strong (Unity Catalog) |
- Storage cost — the lakehouse inherits the lake's 1000x cheaper storage; this is the single biggest economic reason teams migrate.
-
Schema + ACID — the lakehouse inherits the warehouse's reliability; the
_delta_logis the mechanism. - BI latency — Photon on Delta competes with Snowflake / Redshift on common dashboards; the gap that existed in 2021 has closed.
- ML + streaming — only the lakehouse handles both natively; warehouses bolt them on through external services.
- Vendor lock-in — Parquet is portable; if Databricks went away tomorrow, your Delta tables remain readable.
- Governance — Unity Catalog is the 2022-2024 development that finally let the lakehouse win on this dimension.
Output.
| criterion | data_lake | data_warehouse | lakehouse |
|---|---|---|---|
| storage_cost_per_tb | $0.023 / mo | $23 / mo | $0.023 / mo |
| acid | no | yes | yes |
| time_travel | no | limited | yes |
| ml_workload | native | awkward | native |
| streaming | awkward | awkward | native |
| governance | none | strong | strong |
Why this works — concept by concept:
- Single matrix — interviewers love a one-table answer that shows you can compare three architectures on the same axes; it is the structural signal of senior thinking.
- Cost row first — economics drive the migration; lead with the 1000x storage delta and the rest follows.
- ACID + time travel — the two rows that explain why the lakehouse isn't just a re-branded data lake.
- Streaming + ML — the two workloads where warehouses lose decisively; calling them out preempts the "but Snowflake also does ML now" follow-up.
- Governance — the 2022-2024 closing argument; Unity Catalog removed the last warehouse advantage on governance.
-
Cost —
O(1)to read the matrix; the actual architectural decisions cascade intoO(P)migrations whereP= pipeline count.
3. Medallion architecture — Bronze raw → Silver cleansed → Gold business marts
medallion architecture — three layers, two transforms, one contract
medallion architecture is the canonical layering pattern Databricks recommends for organising every table inside a lakehouse: Bronze holds raw data exactly as it arrived, Silver holds cleansed, conformed, deduplicated data with real types, and Gold holds business-ready aggregates and denormalised marts shaped for BI / ML consumption. The interview test is whether you can name what belongs in each layer, the two transforms that bind each pair (Bronze→Silver is cleanse + conform + dedupe; Silver→Gold is aggregate + join + denormalise), and one contract that each layer must honour to the next.
Bronze — the raw audit trail.
-
bronze.raw_orders— every row from every ingest run, appended forever; same schema as the source. -
schema-on-read— the table absorbs whatever the source emits; we cast types at read time, not write time. -
append-only— never overwrite; if today's load was buggy we re-run Silver and Gold from Bronze, never re-ingest. -
source-of-truth— Bronze is the artefact of record; everything downstream is derivable from Bronze + the transformation code. -
metadata columns—_metadata.file_name,ingest_ts,pipeline_run_id— added at ingest, never sourced upstream.
Silver — the cleansed warehouse layer.
-
silver.orders_clean— typed, deduplicated, conformed; one row per business key, types match the contract. -
cleansing transforms— cast toDECIMAL(18,4), normalise text case, fill nullable defaults, parse timestamps. -
deduplication—QUALIFY row_number() OVER (PARTITION BY business_key ORDER BY ingest_ts DESC) = 1; replays are idempotent. -
enrichment + joins— join Bronze sources together; bring in dimension lookups (e.g. customer dim, region dim). -
expectations— DLTexpect(col IS NOT NULL)/expect_or_drop/expect_or_fail; the layer where DQ lives.
Gold — the business mart.
-
gold.daily_revenue_by_region— aggregated to the grain BI asks for; partitioned by date for prune-friendly queries. -
denormalised— wide tables that fold dimensional joins into one row per fact; BI tools love them. -
aggregations—count,sum,avg, distinct counts; the SLA target is< 1 secquery latency from a SQL Warehouse. -
one Gold per consumer— different dashboards can have different Gold tables; we trade storage for read speed. -
reverse ETL feed— Gold tables often feed Hightouch / Census back into Salesforce, HubSpot, Iterable.
The two transforms — the verbs that move data between layers.
-
Bronze → Silver— cleanse + conform + dedupe + enrich; the verb is "make it correct". -
Silver → Gold— aggregate + join + denormalise; the verb is "make it useful". - The contract — Silver must be idempotent re-derivable from Bronze, Gold must be idempotent re-derivable from Silver; the medallion is then a replay-safe DAG.
Worked example — write the three medallion tables for a clickstream source
Detailed explanation. Real interviews want you to walk a non-orders example (so you cannot rely on muscle memory) and produce all three layers. Below is the canonical clickstream walkthrough.
Question. Raw web clickstream lands in s3://bucket/raw/clicks/ as JSON every 5 minutes. Build Bronze, Silver, and Gold so the marketing team can query daily_sessions_by_country on a SQL Warehouse with sub-second latency.
Input. Raw click JSON: {"event_id":"abc-123","user_id":42,"url":"/home","country":null,"ts":"2026-05-28T22:31:09Z","ua":"Mozilla/5.0..."}. Roughly 200M rows per day, ~10% duplicates from retries.
Code.
-- Bronze — schema-on-read, append-only audit trail
CREATE TABLE bronze.raw_clicks
USING DELTA
LOCATION 's3://acme-lakehouse/bronze/raw_clicks/'
AS SELECT *,
_metadata.file_name AS source_file,
current_timestamp() AS ingest_ts
FROM read_files('s3://bucket/raw/clicks/', format => 'json');
-- Silver — typed, deduplicated, country defaulted, sessions assigned
CREATE OR REPLACE TABLE silver.clicks_clean
USING DELTA
PARTITIONED BY (event_date)
AS SELECT event_id,
user_id::BIGINT AS user_id,
url,
coalesce(country, 'UNKNOWN') AS country,
ts::TIMESTAMP AS event_ts,
date(ts::TIMESTAMP) AS event_date,
session_id_from_ua_user(ua, user_id) AS session_id
FROM bronze.raw_clicks
WHERE event_id IS NOT NULL
QUALIFY row_number() OVER (PARTITION BY event_id ORDER BY ingest_ts DESC) = 1;
-- Gold — sessions aggregated by day + country, BI-ready
CREATE OR REPLACE TABLE gold.daily_sessions_by_country
USING DELTA
PARTITIONED BY (event_date)
AS SELECT event_date,
country,
count(DISTINCT session_id) AS sessions,
count(*) AS page_views,
count(DISTINCT user_id) AS unique_users
FROM silver.clicks_clean
GROUP BY event_date, country;
Step-by-step explanation.
-
Bronze —
read_filesis Auto Loader sugar; it incrementally tracks already-ingested files and only loads new ones. We addsource_file+ingest_tsso a Silver bug can be replayed against the right Bronze partition. -
Silver — we cast
user_idtoBIGINT, parsetstoTIMESTAMP, defaultcountrytoUNKNOWN(never propagate nulls into aGROUP BY), assign a deterministicsession_id, and dedupe byevent_id. Theevent_datepartition column lets us prune Gold queries cheaply. -
Gold — we aggregate to the grain marketing actually queries (
event_date, country) and compute three metrics. Because the table is small (one row per(date, country)) and partitioned, a Power BI dashboard query returns in well under a second. -
Replay safety — if the
session_idalgorithm has a bug, we re-derive Silver and Gold from the existing Bronze; we never re-ingest from the source.
Output (Gold table, first 3 rows).
| event_date | country | sessions | page_views | unique_users |
|---|---|---|---|---|
| 2026-05-28 | US | 1820411 | 18204110 | 743192 |
| 2026-05-28 | UNKNOWN | 412037 | 4120370 | 165823 |
| 2026-05-28 | DE | 198440 | 1984400 | 84112 |
Rule of thumb: every medallion stack has the same three verbs — ingest, cleanse, aggregate. Senior engineers can write all three SQL blocks above on a whiteboard in under five minutes.
bronze silver gold — the four senior gotchas
-
Don't
MERGEinto Bronze. Bronze is append-only. The day youMERGEinto Bronze you lose the audit trail and replay safety; do allMERGEs in Silver. -
Silver is where deduplication lives. Duplicate
event_ids from at-least-once delivery are normal in Bronze; Silver'srow_number() = 1filter is the only place dedup belongs. - Gold is denormalised by design. Resist the SQL purist instinct to keep Gold normalised; the storage cost is trivial and the query-time join cost is enormous.
-
Layer per consumer is fine. One BI team can own
gold.daily_revenue_by_region, another can owngold.weekly_revenue_by_product; both derive from the same Silver. Storage is cheap.
SQL
Topic — aggregation
Aggregation pattern drills
SQL
Topic — etl
ETL pipeline practice
Solution Using a single Bronze → Silver → Gold DAG with explicit contracts
Code.
-- One declarative DAG; the contract column says what the next layer expects.
CREATE TABLE medallion_contract_orders AS
SELECT * FROM (VALUES
(1, 'bronze.raw_orders', 'append_only', 'every column as-string + ingest metadata', 'read_files(json)'),
(2, 'silver.orders_clean', 'overwrite', 'order_id BIGINT NOT NULL UNIQUE; amount DECIMAL(18,4); region NOT NULL; deduped by order_id', 'CTAS from bronze + row_number=1'),
(3, 'gold.daily_revenue', 'overwrite', 'one row per (date,region); count(*) + sum(amount); partitioned by date', 'CTAS from silver + GROUP BY'),
(4, 'gold.user_segments', 'overwrite', 'one row per user_id; LTV bucket + activity tier; partitioned by snapshot_date', 'CTAS from silver + windowed scoring'),
(5, 'gold.exec_dashboard', 'overwrite', 'wide one-row-per-day denormalised mart; powers exec PBI dashboard', 'CTAS from multiple silver + gold tables')
) AS t(layer_order, table_name, write_mode, contract, transform);
Step-by-step trace.
| layer_order | table_name | write_mode | contract | transform |
|---|---|---|---|---|
| 1 | bronze.raw_orders | append_only | every column as-string + ingest metadata | read_files(json) |
| 2 | silver.orders_clean | overwrite | order_id BIGINT NOT NULL UNIQUE; amount DECIMAL(18,4); region NOT NULL; deduped | CTAS from bronze + row_number=1 |
| 3 | gold.daily_revenue | overwrite | one row per (date,region); count(*) + sum(amount); partitioned by date | CTAS from silver + GROUP BY |
| 4 | gold.user_segments | overwrite | one row per user_id; LTV bucket + activity tier | CTAS from silver + windowed scoring |
| 5 | gold.exec_dashboard | overwrite | wide one-row-per-day denormalised mart | CTAS from multiple silver + gold |
- Row 1 — Bronze write mode is
append_only— every load adds rows, never overwrites; this is the single most-violated medallion rule in junior code. - Row 2 — Silver write mode is
overwrite(orMERGEfor incremental) — the table is idempotent re-derivable from Bronze + transformation code. - Rows 3-5 — Gold has multiple tables — one per consumer / dashboard; storage cost is trivial, query latency wins.
- The contract column codifies what the next layer expects; junior engineers store this in Confluence, senior engineers store it in DDL constraints + DLT expectations.
- The transform column codifies the verb between layers; this is the column reviewers actually inspect.
Output.
| layer_order | table_name | write_mode | contract |
|---|---|---|---|
| 1 | bronze.raw_orders | append_only | every column as-string + ingest metadata |
| 2 | silver.orders_clean | overwrite | order_id BIGINT NOT NULL UNIQUE; deduped |
| 3 | gold.daily_revenue | overwrite | one row per (date,region) |
| 4 | gold.user_segments | overwrite | one row per user_id |
| 5 | gold.exec_dashboard | overwrite | wide denormalised mart |
Why this works — concept by concept:
- Append-only Bronze — the single rule that makes replay possible; once you overwrite Bronze, the history is gone forever.
- Contract column — codifies what the next layer assumes; this is the artefact reviewers can audit at PR time.
- Overwrite Silver — idempotency comes from "Silver = pure function of Bronze + code"; rebuilds are safe.
- Multi-Gold — different consumers get different shapes; the alternative (one mega-Gold) becomes a coordination nightmare.
- Layer order — interviewers love seeing the dependency order encoded explicitly; it signals you think in DAGs.
-
Cost —
O(1)to read the matrix; the actual DAG isO(N · M)forNrows acrossMlayers, but every step is parallelisable per partition.
4. Delta Lake mechanics — ACID + time travel + OPTIMIZE + Z-ORDER
delta lake mechanics — Parquet + transaction log + four headline features
delta lake is, at the file level, just a directory of Parquet data files plus a sibling _delta_log/ directory that contains one JSON file per commit. That tiny piece of metadata — one JSON file per commit — is the entire magic: it gives plain Parquet on S3 the four headline features warehouses charge for — ACID transactions, time travel, schema enforcement + evolution, and performance optimisations (OPTIMIZE + Z-ORDER). The interview test is whether you can name what the _delta_log does, write a MERGE INTO from memory, query a previous version with VERSION AS OF, and reason about small-file compaction and multi-dim clustering.
The _delta_log — one JSON per commit.
-
00000000000000000000.json— the initial commit; contains the metadata action ({"metaData":{"schemaString":...}}) and a list of added files ({"add":{"path":"part-0000.parquet",...}}). -
00000000000000000001.json— the next commit; contains added + removed file actions; older Parquet files are tombstoned but not deleted (untilVACUUM). -
_last_checkpoint— a pointer file; every 10 commits Delta writes a Parquet checkpoint that consolidates the log so readers don't replay 10,000 JSONs. -
Why JSON, not a DB?— JSON is human-readable, debuggable, and replicates trivially across regions; the price isO(commits)read cost without checkpoints.
Feature 1 — ACID via the log.
-
Atomic — a commit is the appearance of a new JSON file in
_delta_log/; either fully written or not at all. - Consistent — every reader picks the most recent committed version; partial writes are invisible.
-
Isolated —
optimistic concurrency control— writers detect concurrent commits and retry; serialisable isolation by default. - Durable — the JSON log lives on S3's 11-nines storage; once committed, the version exists forever (until intentional truncation).
-- ACID example: a MERGE that's safe under concurrent writes.
MERGE INTO silver.orders_clean AS t
USING bronze_changes AS s
ON t.order_id = s.order_id
WHEN MATCHED AND s.op = 'U' THEN UPDATE SET amount = s.amount, updated_ts = current_timestamp()
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN NOT MATCHED AND s.op = 'I' THEN INSERT (order_id, amount, region, order_ts)
VALUES (s.order_id, s.amount, s.region, s.order_ts);
Feature 2 — time travel.
-
VERSION AS OF—SELECT * FROM silver.orders_clean VERSION AS OF 42returns the table as of commit 42. -
TIMESTAMP AS OF—SELECT * FROM silver.orders_clean TIMESTAMP AS OF '2026-05-28 06:00:00'returns the table as of that wall-clock. -
DESCRIBE HISTORY—DESCRIBE HISTORY silver.orders_cleanlists every commit, user, operation, and metrics. -
RESTORE—RESTORE silver.orders_clean TO VERSION AS OF 42is the atomic rollback of a bad write. -
Retention — controlled by
delta.deletedFileRetentionDuration(default 7 days); after that,VACUUMcan purge.
Feature 3 — schema enforcement + evolution.
- Enforcement — by default, a write with a new column fails; data is rejected, not silently dropped.
-
Evolution —
mergeSchema=trueon a write allows adding (only adding) columns; existing rows getNULLfor the new column. -
ALTER TABLE—ALTER TABLE ... ADD COLUMNS / RENAME COLUMN / DROP COLUMNfor explicit governance. -
Type widening — Delta 3.0+ supports safe type widening (
INT → BIGINT); narrowing requires a rewrite.
Feature 4 — OPTIMIZE + Z-ORDER + VACUUM.
-
OPTIMIZE— coalesces many small Parquet files into fewer ~1 GB files; massive read-perf win. -
OPTIMIZE ... ZORDER BY (col1, col2)— multi-dimensional clustering; files are organised so prediates oncol1andcol2prune efficiently. -
VACUUM— deletes tombstoned Parquet files older than the retention window; reclaims S3 cost. -
Liquid Clustering— the 2024 replacement forZORDER; one-timeCLUSTER BY (col1, col2)DDL, auto-maintained.
Worked example — implement MERGE INTO + time-travel rollback + OPTIMIZE ZORDER
Detailed explanation. Real interviews ask you to write the full Delta mechanics flow on a single table: incremental MERGE, a time-travel rollback after a bad commit, and a maintenance OPTIMIZE ZORDER. Below is the canonical block.
Question. A nightly CDC stream bronze.cdc_orders lands as (order_id, op, amount, order_ts) with op IN ('I','U','D'). Write (a) the MERGE INTO silver.orders_clean, (b) the rollback after a bad release, and (c) the maintenance step that keeps silver.orders_clean fast.
Input. silver.orders_clean has 100M rows; bronze.cdc_orders adds ~500K daily changes; the table is queried frequently by (customer_id, order_ts) predicates.
Code.
-- (a) Idempotent MERGE — the canonical Silver upsert
MERGE INTO silver.orders_clean AS t
USING (
SELECT order_id, op, amount, region, order_ts
FROM bronze.cdc_orders
WHERE ingest_date = current_date()
) AS s
ON t.order_id = s.order_id
WHEN MATCHED AND s.op = 'U' THEN UPDATE SET amount = s.amount, region = s.region, updated_ts = current_timestamp()
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN NOT MATCHED AND s.op = 'I' THEN INSERT (order_id, amount, region, order_ts, ingest_ts)
VALUES (s.order_id, s.amount, s.region, s.order_ts, current_timestamp());
-- (b) Bad release rollback — restore to the last-known-good version
DESCRIBE HISTORY silver.orders_clean; -- inspect commits
RESTORE TABLE silver.orders_clean TO VERSION AS OF 1337; -- atomic rollback
-- (c) Maintenance — compact small files and cluster by hot predicates
OPTIMIZE silver.orders_clean
ZORDER BY (customer_id, order_ts);
Step-by-step explanation.
-
MERGE— one statement does insert / update / delete based onop; the_delta_logrecords the whole thing as a single commit, so readers either see the full batch or none of it. -
DESCRIBE HISTORY— lists every commit with version number, user, operation, and metrics; this is the artefact yougit blamefor tables. -
RESTORE TO VERSION AS OF 1337— atomic rollback; the next commit is a new version that contains the old version's contents. -
OPTIMIZE ... ZORDER BY (customer_id, order_ts)— rewrites the data files so rows that sharecustomer_idand similarorder_tsend up in the same file; predicates likeWHERE customer_id = 42 AND order_ts > '2026-05-01'can skip most files entirely.
Output (DESCRIBE HISTORY excerpt).
| version | timestamp | userName | operation | operationMetrics |
|---|---|---|---|---|
| 1336 | 2026-05-28 06:00:00 | etl_user | MERGE | {numOutputRows: 482103, numUpdatedRows: 18204} |
| 1337 | 2026-05-28 06:30:00 | etl_user | OPTIMIZE | {numFilesAdded: 142, numFilesRemoved: 9810} |
| 1338 | 2026-05-29 06:00:00 | etl_user | MERGE | {numOutputRows: 503112, BAD_RELEASE: true} |
| 1339 | 2026-05-29 06:45:00 | oncall | RESTORE | {restoredToVersion: 1337} |
Rule of thumb: MERGE is the verb for Silver; RESTORE is the verb for incidents; OPTIMIZE ... ZORDER BY is the verb for performance. Senior engineers can write all three on a whiteboard in under three minutes.
delta lake — the four senior gotchas
-
Don't
VACUUMaggressively. The default retention is 7 days for a reason — time travel depends on the tombstoned files being kept.VACUUM RETAIN 0 HOURSdeletes the very files you'dRESTOREfrom. -
MERGEisO(matched files). AMERGEthat touches one partition rewrites only that partition; partitioning the target Silver table on a hot predicate (event_date) keepsMERGEcheap. -
ZORDERis multi-dim, partitioning is single-dim. Use partitioning on low-cardinality time columns (event_date); useZORDERfor the 2-4 high-cardinality predicates BI runs against. -
Schema enforcement is on by default. A producer adding a column with no coordination will fail the write — this is the desired behaviour.
mergeSchema=trueis opt-in, never default.
SQL
Topic — data-validation
Data-validation practice
SQL
Topic — optimization
Optimization drills
Solution Using a single Delta mechanics cheat-table
Code.
-- One canonical cheat-table; row = mechanic, columns = primitive + when + caveats.
CREATE TABLE delta_mechanics_cheatsheet AS
SELECT * FROM (VALUES
('acid', 'MERGE INTO / INSERT / DELETE', 'every write', 'OCC retries on conflict'),
('time_travel', 'SELECT ... VERSION AS OF N', 'incident triage', 'bounded by retention window'),
('restore', 'RESTORE TABLE t TO VERSION AS OF N', 'bad-release rollback', 'atomic, creates new version'),
('schema_enforce', 'default on write', 'every write', 'mergeSchema=true to evolve'),
('schema_evolve', 'ALTER TABLE ... ADD COLUMNS', 'planned changes', 'add-only is safe; drop is rewrite'),
('optimize', 'OPTIMIZE t', 'nightly', 'compacts small files'),
('z_order', 'OPTIMIZE t ZORDER BY (a, b)', 'after OPTIMIZE', 'best on 2-4 high-cardinality cols'),
('liquid_clustering', 'ALTER TABLE t CLUSTER BY (a, b)', 'one-time DDL', '2024+ replacement for ZORDER'),
('vacuum', 'VACUUM t RETAIN 168 HOURS', 'weekly', 'do not lower retention below 7d')
) AS t(mechanic, primitive, cadence, caveat);
Step-by-step trace.
| mechanic | primitive | cadence | caveat |
|---|---|---|---|
| acid | MERGE INTO / INSERT / DELETE | every write | OCC retries on conflict |
| time_travel | SELECT ... VERSION AS OF N | incident triage | bounded by retention window |
| restore | RESTORE TABLE t TO VERSION AS OF N | bad-release rollback | atomic, creates new version |
| schema_enforce | default on write | every write | mergeSchema=true to evolve |
| schema_evolve | ALTER TABLE ... ADD COLUMNS | planned changes | add-only is safe; drop is rewrite |
| optimize | OPTIMIZE t | nightly | compacts small files |
| z_order | OPTIMIZE t ZORDER BY (a, b) | after OPTIMIZE | best on 2-4 high-cardinality cols |
| liquid_clustering | ALTER TABLE t CLUSTER BY (a, b) | one-time DDL | 2024+ replacement for ZORDER |
| vacuum | VACUUM t RETAIN 168 HOURS | weekly | do not lower retention below 7d |
-
acid— the foundation; every other mechanic assumes ACID semantics. -
time_travel + restore— two sides of the same coin; one for inspection, one for rollback. -
schema_enforce + evolve— enforcement is the default safety net; evolution is the opt-in escape hatch. -
optimize + z_order + liquid_clustering— performance trio; small-file compaction first, then clustering on hot predicates. -
vacuum— the only destructive operation; the cheatsheet pairs it with a don't go below 7 days caveat. - The cheatsheet collapses to: MERGE to write, VERSION AS OF to inspect, RESTORE to undo, OPTIMIZE to speed up, VACUUM rarely.
Output.
| mechanic | primitive | cadence |
|---|---|---|
| acid | MERGE INTO | every write |
| time_travel | VERSION AS OF | incident |
| restore | RESTORE | rollback |
| optimize | OPTIMIZE | nightly |
| z_order | ZORDER BY | after OPTIMIZE |
| vacuum | VACUUM 168 HOURS | weekly |
Why this works — concept by concept:
- Single cheat-table — interviewers love a one-table answer where you can name primitive + cadence + caveat; this collapses three follow-ups into one artefact.
- OCC mention — optimistic concurrency control is the specific mechanism Delta uses; calling it out is a senior signal.
-
Liquid Clustering — naming the 2024+ replacement for
ZORDERshows you're on the current Delta version. - VACUUM caveat — the most common production foot-gun; pairing it with the 7-day rule preempts the obvious follow-up.
-
MERGE as default —
MERGEis the verb for any Silver / Gold write where rows can be updated; calling it out as the default is a senior signal. -
Cost —
O(1)to read the cheatsheet; the actual mechanics areO(matched files)forMERGE,O(N)forOPTIMIZE,O(commits)for time travel.
5. End-to-end production lakehouse pipeline (sources → Bronze → Silver → Gold → BI/ML)
databricks medallion in production — sources, ingest, transform, serve
A production databricks lakehouse pipeline is a left-to-right pipeline with five concrete bands: sources (Kafka, RDBMS CDC, S3 file drops), ingest (Auto Loader, Kafka Structured Streaming, Debezium connectors), transform (Spark batch jobs or delta live tables declarative pipelines), serve (Gold tables behind a SQL Warehouse or Delta Sharing), and consumers (Power BI / Tableau, SQL endpoints, ML notebooks, reverse ETL). Threading through all five is Unity Catalog for permissions + lineage + audit. The interview test is whether you can draw all five bands on a whiteboard and name one concrete primitive in each.
Band 1 — sources.
-
Kafka / Kinesis / Event Hubs— high-throughput append streams; usually JSON or Avro encoded. -
S3 / ADLS / GCS file drops— vendor CSVs, partner Parquet, mobile-SDK JSON dumps. -
RDBMS CDC— Debezium / Fivetran / native Lakehouse Federation read change feeds from Postgres / MySQL / SQL Server. -
SaaS APIs— Salesforce / HubSpot / Stripe via Fivetran / Airbyte; landed as Parquet in the raw bucket.
Band 2 — ingest.
-
Auto Loader—spark.readStream.format("cloudFiles").option("cloudFiles.format", "json")...; incremental file detection withoutlistObjectsscans. -
Kafka Structured Streaming—spark.readStream.format("kafka").option("subscribe", "orders").load(); exactly-once into Delta. -
Debezium / Lakehouse Federation— read CDC feeds directly; land asbronze.cdc_orderswithopcolumn. -
Streaming + batch unified— the same DataFrame API for both; the writer to Delta is identical.
Band 3 — transform (Bronze → Silver → Gold).
-
Spark batch jobs— Airflow / Workflows orchestrate Python notebooks or JAR jobs; the legacy default. -
Delta Live Tables (DLT)— declarative pipelines:@dlt.table+@dlt.expect_or_drop; the framework handles orchestration, retries, autoscale. -
Workflows— Databricks' built-in scheduler; replaces a lot of Airflow for Databricks-only DAGs. -
Job clusters vs serverless— job clusters spin up per run; serverless compute starts in seconds and is the 2024+ default for shared workloads.
Band 4 — serve.
-
SQL Warehouses— serverless or pro endpoints; Photon-backed; auto-suspend; per-second billing. -
Delta Sharing— open protocol to share Delta tables with external consumers (other workspaces, other vendors). -
Materialized views— pre-computed Gold queries; refreshed declaratively. -
Streaming tables— continuously-updated Gold-grade tables for real-time dashboards.
Band 5 — consumers.
-
Power BI / Tableau / Looker— connect to a SQL Warehouse endpoint; queries hit Gold tables directly. -
ML notebooks—spark.read.format("delta").load(...)against Silver or Gold; the same tables BI consumes. -
Reverse ETL— Hightouch / Census push Gold rows back into Salesforce, HubSpot, Iterable. -
Apps / APIs— Databricks SQL Driver, JDBC, REST APIs; product features can read Gold directly.
Worked example — assemble a production pipeline as a Delta Live Tables file
Detailed explanation. Real interviews increasingly ask you to write a DLT file because it shows that you can think in declarative pipelines rather than imperative Airflow DAGs. Below is a complete (compact) DLT module that ingests Kafka orders, builds Silver, and aggregates Gold — with expectations gating each step.
Question. Write a Delta Live Tables Python module that (a) ingests orders from Kafka into a Bronze streaming table, (b) cleans and dedupes into a Silver streaming table with a not_null(order_id) expectation, (c) aggregates into a Gold materialized view of daily_revenue_by_region, and (d) runs continuously with autoscale.
Input. Kafka topic orders (JSON payload), Unity Catalog acme.bronze / acme.silver / acme.gold schemas already created.
Code.
import dlt
from pyspark.sql.functions import col, to_timestamp, upper, coalesce, lit, date, row_number, current_timestamp
from pyspark.sql.window import Window
# (a) Bronze — streaming ingest from Kafka, schema-on-read, append-only
@dlt.table(
name="bronze_raw_orders",
table_properties={"delta.appendOnly": "true"},
comment="Raw orders from Kafka — append-only audit trail",
)
def bronze_raw_orders():
return (
spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "kafka.acme:9092")
.option("subscribe", "orders")
.load()
.selectExpr(
"CAST(value AS STRING) AS payload_json",
"topic", "partition", "offset", "timestamp AS kafka_ts"
)
.withColumn("ingest_ts", current_timestamp())
)
# (b) Silver — typed, deduped, expectations enforced
@dlt.table(name="silver_orders_clean", comment="Cleansed orders ready for analytics")
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
@dlt.expect_or_drop("positive_amount", "amount > 0")
@dlt.expect("region_known", "region IN ('US','EU','APAC','LATAM','UNKNOWN')")
def silver_orders_clean():
parsed = (
dlt.read_stream("bronze_raw_orders")
.selectExpr(
"get_json_object(payload_json, '$.order_id')::BIGINT AS order_id",
"get_json_object(payload_json, '$.customer_id')::BIGINT AS customer_id",
"upper(get_json_object(payload_json, '$.region')) AS region",
"get_json_object(payload_json, '$.amount')::DECIMAL(18,4) AS amount",
"to_timestamp(get_json_object(payload_json, '$.order_ts')) AS order_ts",
"coalesce(get_json_object(payload_json, '$.currency'), 'USD') AS currency",
"ingest_ts",
)
)
w = Window.partitionBy("order_id").orderBy(col("ingest_ts").desc())
return parsed.withColumn("rn", row_number().over(w)).filter(col("rn") == 1).drop("rn")
# (c) Gold — aggregated business mart, materialised
@dlt.table(name="gold_daily_revenue_by_region", comment="BI surface — daily revenue")
def gold_daily_revenue_by_region():
return (
dlt.read("silver_orders_clean")
.groupBy(date(col("order_ts")).alias("order_date"), col("region"))
.agg({"order_id": "count", "amount": "sum"})
.withColumnRenamed("count(order_id)", "order_count")
.withColumnRenamed("sum(amount)", "gross_revenue")
)
Step-by-step explanation.
-
Bronze —
readStream.format("kafka")streams the Kafka topic; we capture the payload as a string plus Kafka metadata;delta.appendOnly=trueenforces the audit-trail rule at the table level. -
Silver — we parse JSON columns with
get_json_object, cast to real types, upper-caseregion, defaultcurrency. The three@dlt.expect*decorators gate data quality:expect_or_dropquietly removes bad rows,expectrecords the violation count but allows the row through. -
Dedup — the
Window+row_number() = 1filter ensures eachorder_idkeeps only its latest version; replays are idempotent. -
Gold — a simple
groupBy().agg()materialises the daily-revenue mart; DLT decides whether to refresh it as a stream or batch based on configuration. - Autoscale + orchestration — DLT handles cluster sizing, retries, lineage, and event logs without us writing a single Airflow operator.
Output (Gold view, first 3 rows).
| order_date | region | order_count | gross_revenue |
|---|---|---|---|
| 2026-05-28 | US | 42137 | 1289450.7500 |
| 2026-05-28 | EU | 18204 | 612900.3300 |
| 2026-05-28 | APAC | 9810 | 287113.9000 |
Rule of thumb: DLT collapses 200 lines of Airflow + Spark plumbing into ~60 lines of declarative Python. Senior engineers reach for DLT for any new lakehouse pipeline; legacy Spark-batch-on-Airflow remains for migrations.
delta live tables + auto loader + unity catalog — the four senior production patterns
-
Auto Loader, not
listObjects.cloudFilesuses S3 notifications + a tracking store; it scales to billions of files. Never usespark.read.json(s3_path)in production — thelistObjectsscan blows up at scale. - DLT expectations, not post-hoc tests. Expectations are gates at write time. They publish to the DLT event log so SRE dashboards can chart violation counts per release.
- One DLT pipeline per medallion stack. Bronze + Silver + Gold for a single domain (orders, clicks, payments) belong in one DLT pipeline; the framework computes the DAG and runs it.
-
Unity Catalog GRANTs are per-table.
GRANT SELECT ON acme.gold.daily_revenue TO analystsdoesn't leak into Bronze or Silver; the three-level namespace is the security boundary.
SQL
Topic — streaming
Streaming pattern drills
Python
Language — python
Python practice library
Solution Using a declarative DLT pipeline + Unity Catalog governance
Code.
-- The end-to-end pipeline encoded as a single registry table.
CREATE TABLE production_lakehouse_pipeline AS
SELECT * FROM (VALUES
(1, 'source', 'kafka.orders', 'streaming', 'JSON value column'),
(2, 'ingest', 'auto_loader OR kafka_ss', 'continuous', 'incremental + exactly-once'),
(3, 'bronze', 'acme.bronze.raw_orders', 'append', 'schema-on-read + ingest_ts metadata'),
(4, 'silver', 'acme.silver.orders_clean','merge', 'typed + deduped + expectations'),
(5, 'gold', 'acme.gold.daily_revenue', 'overwrite', 'aggregated mart, partitioned by date'),
(6, 'serve', 'sql_warehouse', 'serverless', 'Photon + auto-suspend'),
(7, 'consume_bi', 'powerbi_dashboard', 'pull', 'queries Gold via JDBC'),
(8, 'consume_ml', 'notebook_train.ipynb', 'pull', 'reads Silver for features'),
(9, 'consume_rev','hightouch_to_salesforce', 'push', 'syncs Gold rows back into CRM'),
(10,'govern', 'unity_catalog', 'always-on', 'three-level namespace + ACL + lineage')
) AS t(band_order, band_name, artefact, mode, primitive);
Step-by-step trace.
| band_order | band_name | artefact | mode | primitive |
|---|---|---|---|---|
| 1 | source | kafka.orders | streaming | JSON value column |
| 2 | ingest | auto_loader OR kafka_ss | continuous | incremental + exactly-once |
| 3 | bronze | acme.bronze.raw_orders | append | schema-on-read + ingest_ts metadata |
| 4 | silver | acme.silver.orders_clean | merge | typed + deduped + expectations |
| 5 | gold | acme.gold.daily_revenue | overwrite | aggregated mart, partitioned by date |
| 6 | serve | sql_warehouse | serverless | Photon + auto-suspend |
| 7 | consume_bi | powerbi_dashboard | pull | queries Gold via JDBC |
| 8 | consume_ml | notebook_train.ipynb | pull | reads Silver for features |
| 9 | consume_rev | hightouch_to_salesforce | push | syncs Gold rows back into CRM |
| 10 | govern | unity_catalog | always-on | three-level namespace + ACL + lineage |
- Rows 1-2 — source + ingest are the streaming entry point; Auto Loader for files, Kafka SS for queues.
- Rows 3-5 — the medallion spine; Bronze append, Silver
MERGE, Gold overwrite is the canonical write-mode triple. - Rows 6-9 — serve + consume is where the lakehouse multi-engine promise pays off; BI, ML, and reverse ETL all read the same Delta tables.
- Row 10 — Unity Catalog is the always-on thread; it doesn't sit between two bands, it spans all of them.
- Note
consume_mlreads Silver, not Gold — ML wants the granular, per-row table; BI wants the aggregated Gold. - Note
consume_revpushes Gold into operational systems; this is the closing of the analytics → operations loop that lakehouses enable cheaply.
Output.
| band_order | band_name | artefact | mode |
|---|---|---|---|
| 1 | source | kafka.orders | streaming |
| 3 | bronze | acme.bronze.raw_orders | append |
| 4 | silver | acme.silver.orders_clean | merge |
| 5 | gold | acme.gold.daily_revenue | overwrite |
| 6 | serve | sql_warehouse | serverless |
| 10 | govern | unity_catalog | always-on |
Why this works — concept by concept:
- Five-band model — sources → ingest → transform → serve → consume is the whole pipeline; collapsing it into one table makes the architecture auditable.
-
Write-mode triple —
appendfor Bronze,mergefor Silver,overwritefor Gold is the senior shorthand for the medallion contract. - Multi-consumer — BI, ML, and reverse ETL all reading the same Delta tables is the lakehouse's headline benefit; calling out all three preempts "but where does ML fit?" follow-ups.
- Serverless SQL Warehouse — the 2024+ default for serving; auto-suspend keeps cost near zero between queries.
- Unity Catalog as thread — governance isn't a band, it's the warp the entire weave passes through; this is the senior framing.
-
Cost —
O(1)to read the registry; the actual pipeline isO(N · M)forNrows acrossMbands, with per-band horizontal scaling.
Choosing the right layer (cheat sheet)
A one-screen cheat sheet for databricks lakehouse and medallion architecture — pick the layer and the primitive that match the failure mode you're worried about.
| You want to … | Layer | Canonical primitive | When |
|---|---|---|---|
| Capture raw source bytes forever | Bronze |
read_files / Auto Loader → append Delta |
every ingest |
Add ingest_ts + source_file metadata |
Bronze |
_metadata.file_name + current_timestamp()
|
every ingest |
| Cast strings to real types | Silver |
::DECIMAL(18,4), ::BIGINT, ::TIMESTAMP
|
every load |
| Dedupe at-least-once duplicates | Silver | QUALIFY row_number() OVER (PARTITION BY k ORDER BY ingest_ts DESC) = 1 |
every load |
| Apply business rules + drop bad rows | Silver | DLT @dlt.expect_or_drop
|
every load |
| Update / delete rows in-place | Silver | MERGE INTO |
every CDC load |
| Aggregate to BI grain | Gold | GROUP BY ...; sum / count / avg |
every load |
| Denormalise for fast dashboard reads | Gold | wide CTAS with joined dims | every load |
| Partition for prune-friendly queries | Silver / Gold | PARTITIONED BY (event_date) |
DDL |
| Cluster by hot predicate columns | Silver / Gold |
OPTIMIZE ... ZORDER BY (a,b) or CLUSTER BY
|
nightly |
| Rollback a bad release | Delta | RESTORE TABLE t TO VERSION AS OF N |
incident |
| Inspect a table as of yesterday | Delta | SELECT ... FROM t TIMESTAMP AS OF '2026-05-27' |
incident triage |
| Compact small files | Delta | OPTIMIZE t |
nightly |
| Reclaim S3 from tombstones | Delta | VACUUM t RETAIN 168 HOURS |
weekly |
| Grant table access to a group | UC | GRANT SELECT ON ... TO group |
every onboarding |
| Track row-level lineage | UC | system.access.table_lineage |
always-on |
| Stream ingest with exactly-once | Ingest |
readStream.format("cloudFiles") / kafka + Delta sink |
continuous |
| Replace Airflow plumbing | Pipeline | Delta Live Tables @dlt.table
|
new pipelines |
| Share Delta tables externally | Serve | Delta Sharing | partner data sales |
Frequently asked questions
What is the databricks lakehouse in one sentence, and why does it matter for interviews?
A databricks lakehouse is cheap object storage + a transactional layer (Delta Lake) + many compute engines + one governance plane (Unity Catalog), designed so a single copy of data on S3 / ADLS / GCS can serve BI, ML, streaming, and SQL through the same Delta tables under one set of permissions and lineage. It matters for interviews because in 2026 it is the baseline architecture every data engineer is expected to reason about — the warehouse-plus-lake duplex has collapsed, and the questions panels now ask are "how would you build Bronze / Silver / Gold for this?" and "why MERGE here instead of overwrite?" rather than "warehouse or lake?". Memorise the four layers and the three medallion stages; almost every question maps to one of them.
How does medallion architecture differ from a classical Kimball star schema?
medallion architecture is a physical layering (Bronze raw → Silver cleansed → Gold business) that says what shape data should be in at each step of a pipeline. Kimball is a logical modelling discipline (facts + conformed dimensions) that says how to design the tables a BI tool consumes. The two are complementary: Silver typically holds normalised, dedup'd, dimensional-style tables you'd recognise from Kimball, and Gold then denormalises and aggregates those Silver tables into wide marts (which still respect Kimball conformed dims). A common pattern is Bronze = raw, Silver = Kimball-style normalised facts + dims, Gold = wide aggregated marts per consumer. The medallion is the pipeline contract; Kimball is the modelling philosophy.
What is the _delta_log, and how does it make Parquet files transactional?
The _delta_log is a sub-directory next to your Parquet data files that contains one JSON file per commit (and periodic Parquet checkpoints to keep read cost bounded). Each JSON file lists add (new file added), remove (file tombstoned), metaData (schema), and commitInfo (operation + metrics) actions. Because the appearance of a new JSON file is atomic on object storage, the entire commit is atomic; concurrent writers serialise via optimistic concurrency control — they detect that another commit landed first and retry. That single piece of metadata is what gives plain Parquet on S3 the ACID guarantees, time travel (you can read the table as of any past commit), and schema enforcement that warehouses charge for. Without the _delta_log, the same Parquet files are just a data lake.
bronze silver gold vs raw / staging / mart — are they the same thing?
They are very close and largely interchangeable in conversation, but with two nuances. Bronze is stricter than raw — it must be append-only and Delta-formatted, with metadata columns like ingest_ts; many raw zones in legacy stacks are overwriting CSV dumps that violate replay safety. Silver maps almost exactly to staging — typed, conformed, deduped — but the medallion explicitly expects expectations / DQ gates at the Silver write. Gold maps to mart — aggregated, denormalised, BI-ready — but the medallion encourages multiple Gold tables per domain (one per consumer or dashboard), whereas some mart layers try to enforce a single canonical mart per business unit. If you adopt medallion, you inherit the append-only Bronze and expectations on Silver discipline that vanilla raw / staging / mart doesn't enforce.
When should I use OPTIMIZE ZORDER BY versus partitioning versus Liquid Clustering?
Partition on low-cardinality columns that filter every query — event_date is the canonical example; partitions become folder prefixes that the scanner skips entirely. OPTIMIZE ... ZORDER BY (a, b) is for 2-4 high-cardinality columns that frequently appear in WHERE predicates (e.g. customer_id, order_ts); Z-ORDER co-locates rows with similar values into the same Parquet files, so file-skipping is cheap. Liquid Clustering (Delta 3.0+, generally available in 2024) is the modern replacement for both partition and ZORDER on a single dimension: one CLUSTER BY (a, b) DDL, auto-maintained, no daily OPTIMIZE job, and it adapts as data shape evolves. The interview-grade rule is: partition on date, ZORDER on hot predicates, migrate to Liquid Clustering when your runtime supports it. Never ZORDER on a column you don't filter on — it costs compute and gives no read benefit.
What is delta live tables and when should I use it over plain Spark + Airflow?
delta live tables (DLT) is Databricks' declarative pipeline framework: you write @dlt.table functions that return DataFrames, attach @dlt.expect_or_drop / @dlt.expect_or_fail data-quality decorators, and DLT computes the DAG, runs it, retries on failure, autoscales the cluster, and publishes lineage + an event log. Use DLT for any new lakehouse pipeline where the team owns the whole stack and wants to delete a lot of Airflow + Spark plumbing — typically saving 60-70% of the boilerplate. Keep plain Spark + Airflow when (a) the DAG spans non-Databricks systems (Snowflake, GCS, Salesforce), (b) you need exotic non-Delta sinks, or (c) you're mid-migration and the cost of rewriting outweighs the saving. The interview-grade answer is: DLT for greenfield lakehouse pipelines, Workflows for Databricks-only orchestration, Airflow for multi-system DAGs.
How does Unity Catalog change governance versus the old hive_metastore?
The legacy hive_metastore lives per workspace, uses a two-level namespace (database.table), and has coarse-grained ACLs (table-level GRANTs at best). Unity Catalog lives per account (so one catalog spans all workspaces), uses a three-level namespace (catalog.schema.table), and adds row filters, column masks, fine-grained ACLs, automated lineage, audit logs to system tables, and Delta Sharing for external consumers. The migration path is to create a Unity Catalog metastore at the account level, link workspaces to it, and either move tables (with UPGRADE) or leave the old hive_metastore for legacy reads while writing all new tables into UC. For interviews, the headline answer is: Unity Catalog is one catalog across the account, three-level namespace, fine-grained ACL, automatic lineage, audit — and it replaces the per-workspace hive_metastore.
Practice on PipeCode
PipeCode ships 450+ data-engineering interview problems — including SQL + Python + Spark drills keyed to the same databricks lakehouse and medallion architecture skill set this guide teaches (Bronze append-only ingest, Silver MERGE + expectations, Gold aggregation, delta lake mechanics, DLT pipelines, Auto Loader patterns, Unity Catalog governance). Whether you're prepping for a Databricks loop, a senior data-engineer round at any FAANG / fintech, or grinding the migration from a warehouse-plus-lake duplex to a lakehouse over the next quarter, the practice library mirrors the same five-band production pipeline — plus the delta live tables + unity catalog + photon tooling you'll wire into your real production lakehouse.
Kick off via the SQL practice library →; fan out into the ETL pipeline lane →; rehearse aggregation reconciliation patterns →; drill the Databricks company set →; sharpen joins drills →; reinforce data-validation problems →; widen coverage on the full Python practice library →.





Top comments (0)