Gowtham Potureddi

Posted on May 30

Databricks Lakehouse + Medallion Architecture: Bronze, Silver, Gold with Delta

#python #sql #interview #dataengineering

databricks lakehouse is the architecture every modern data-engineering interview now anchors on: one copy of data on cheap object storage, a transactional delta lake layer on top, multi-engine compute (Photon SQL, Spark batch, Structured Streaming, ML notebooks) underneath one unity catalog governance plane — and the medallion architecture (Bronze raw → Silver cleansed → Gold business) is the canonical layering pattern that organises every table inside it. Together those two ideas — lakehouse architecture + bronze silver gold — are the single most-asked combination in 2026 Databricks loops, and the curriculum this guide walks through, end to end, in five numbered teaching sections.

This is the deep-dive companion to a quick "what is a lakehouse?" explainer: where a one-screen overview names the three medallion layers and the Delta table format, this guide widens the surface into five full teaching sections — lakehouse anatomy (storage + transactional layer + compute + governance), medallion architecture (Bronze ingest + Silver cleanse + Gold serve, with the exact transforms that bind each pair), delta lake mechanics (ACID via the _delta_log, time travel, schema evolution, OPTIMIZE + Z-ORDER, VACUUM), an end-to-end production lakehouse pipeline (sources → Auto Loader → Bronze → Silver via Spark or delta live tables → Gold → BI / ML / reverse ETL), and a cheat sheet that maps every interview question to one of the three layers. Each section ends as a real interview answer: a question, a SQL / PySpark / Delta snippet, a traced execution, a sample output, and a concept-by-concept why this works breakdown — the exact shape databricks medallion rounds reward.

When you want hands-on reps immediately after reading, browse the SQL practice library →, drill ETL pipeline problems →, sharpen aggregation reconciliation patterns →, rehearse joins drills →, warm up on data-validation problems →, or widen coverage on the full Python practice library →.

On this page

Why the lakehouse + medallion model is the modern DE interview baseline
Lakehouse anatomy — storage + transactional layer + multi-engine compute + Unity Catalog
Medallion architecture — Bronze raw → Silver cleansed → Gold business marts
Delta Lake mechanics — ACID + time travel + OPTIMIZE + Z-ORDER
End-to-end production lakehouse pipeline (sources → Bronze → Silver → Gold → BI/ML)
Choosing the right layer (cheat sheet)
Frequently asked questions
Practice on PipeCode

1. Why the lakehouse + medallion model is the modern DE interview baseline

`databricks lakehouse` — why the warehouse-plus-lake duplex collapsed

The one-sentence invariant: the lakehouse is the architecture that replaced the warehouse-plus-lake duplex by putting a transactional layer (delta lake) on top of cheap object storage, so one copy of data can serve BI, ML, and streaming through many engines under one governance plane. Before 2020, every serious data team ran two systems — a data lake on S3/ADLS/GCS for ML and raw event capture, and a data warehouse (Snowflake / Redshift / BigQuery) for BI and SQL — and copied data between them with brittle ETL. The lakehouse removes the copy: same Parquet files on the same bucket, but a JSON transaction log makes them ACID, schema-enforced, and queryable by every engine.

What interviewers actually score on databricks lakehouse questions.

Architecture fluency — can you name the four layers (object storage, transactional Delta, compute engines, Unity Catalog governance) and explain why each one is necessary?
Why Delta exists — can you explain what the _delta_log does and why plain Parquet on S3 is not transactional?
The medallion layering — can you map a raw OLTP orders table onto Bronze → Silver → Gold and name the transforms between each pair?
Streaming + batch unification — can you explain why Structured Streaming and batch jobs write to the same Delta table?
Cost + perf intuition — can you reason about OPTIMIZE (small-file compaction), Z-ORDER (multi-dim clustering), VACUUM (tombstone cleanup), and Photon (vectorised SQL engine)?
Governance — can you say one sentence about Unity Catalog — three-level namespace, fine-grained ACLs, lineage, audit log?

The five-stage map this guide walks through.

Stage 1 — lakehouse anatomy — storage (S3/ADLS/GCS) + transactional Delta + compute engines (Photon, Spark, Streaming, ML) + Unity Catalog governance.
Stage 2 — medallion architecture — Bronze (raw, append-only audit trail), Silver (cleansed + conformed), Gold (business marts + BI surfaces).
Stage 3 — delta lake mechanics — ACID via _delta_log, MERGE INTO, time travel (VERSION AS OF), schema enforcement + evolution, OPTIMIZE + Z-ORDER, VACUUM.
Stage 4 — production pipeline — sources (Kafka, CDC, S3 drops) → Auto Loader → Bronze → Spark / DLT → Silver → aggregate + join → Gold → BI / ML / reverse-ETL consumers.
Stage 5 — cheat sheet — pick the right layer for every interview prompt; pick the right Delta feature for every failure mode.

Why this is the new interview baseline and not "just another tool" question.

lakehouse architecture is a fundamental shift — the warehouse-plus-lake duplex is not a hardware choice; it is a cost + governance + freshness tradeoff that the lakehouse genuinely resolves.
The bugs are different — small-file explosions, schema drift in raw Bronze, MERGE deadlocks, VACUUM retention violations are all Delta-specific failure modes that don't exist in pure warehouses.
delta live tables changes the contract — declarative pipelines with expectations and autoscale replace the imperative Airflow-DAG-of-Spark-jobs you had in 2019.
Streaming and batch share one table — Structured Streaming writes to a Delta table that a batch SQL query reads, atomically, with no Lambda-architecture duplication.
Unity Catalog is the governance answer — one catalog across workspaces, with row + column ACLs, lineage, and audit; replaces the per-workspace hive_metastore of the 2018 era.

Worked example — map a single `orders` table onto the lakehouse + medallion model

Detailed explanation. Real interviews probe whether you can think across the lakehouse stack and the medallion layers on a single canonical table. Below is the walkthrough for a daily orders OLTP feed landing in a databricks lakehouse and surfacing as a gold_daily_revenue_mart BI table.

Question. A daily OLTP feed of orders is dropped as JSON to s3://bucket/raw/orders/dt=YYYY-MM-DD/. The BI team wants daily_revenue_by_region refreshed by 06:00 each day. Map the journey of one row onto Bronze, Silver, and Gold; name the transforms; name the Delta features that make each step safe.

Input. Raw orders JSON: {"order_id":1001,"customer_id":42,"region":"US","amount":"99.50","order_ts":"2026-05-28T22:31:09Z","currency":null}.

Code.

-- Bronze (raw append-only ingest, schema-on-read)
CREATE TABLE bronze.raw_orders AS
SELECT *,
       _metadata.file_name        AS source_file,
       current_timestamp()        AS ingest_ts
FROM   read_files('s3://bucket/raw/orders/', format => 'json');

-- Silver (cleansed + typed + deduplicated)
CREATE OR REPLACE TABLE silver.orders_clean AS
SELECT order_id::BIGINT          AS order_id,
       customer_id::BIGINT       AS customer_id,
       upper(region)             AS region,
       amount::DECIMAL(18,4)     AS amount,
       order_ts::TIMESTAMP       AS order_ts,
       coalesce(currency, 'USD') AS currency
FROM   bronze.raw_orders
WHERE  order_id IS NOT NULL
QUALIFY row_number() OVER (PARTITION BY order_id ORDER BY ingest_ts DESC) = 1;

-- Gold (business mart, aggregated, BI-ready)
CREATE OR REPLACE TABLE gold.daily_revenue_by_region AS
SELECT date(order_ts)         AS order_date,
       region,
       count(*)               AS order_count,
       sum(amount)            AS gross_revenue
FROM   silver.orders_clean
GROUP  BY date(order_ts), region;

Step-by-step explanation.

Bronze (raw) — read_files (Auto Loader under the hood) ingests every JSON drop as-is; we add source_file + ingest_ts metadata; we do not change types or drop rows. Bronze is the audit trail.
Silver (cleansed) — we cast amount to DECIMAL(18,4) (no floating-point money), normalise region to upper case, default currency to USD, drop order_id IS NULL, and deduplicate via row_number() so re-ingests are idempotent.
Gold (business) — we aggregate to the grain the BI team consumes — one row per (date, region) — and write to a small, fast, partition-pruned table that powers a Power BI dashboard or a SQL Warehouse endpoint.
Delta safety net — every CREATE OR REPLACE TABLE is atomic because of the _delta_log; readers see either yesterday's full table or today's full table, never a half-loaded mess.

Output (the Gold table's first 3 rows).

order_date	region	order_count	gross_revenue
2026-05-28	US	42137	1289450.7500
2026-05-28	EU	18204	612900.3300
2026-05-28	APAC	9810	287113.9000

Rule of thumb: one table threads three layers — Bronze keeps it forever, Silver makes it correct, Gold makes it useful. Senior engineers reason at all three layers on every prompt.

`medallion architecture` — the four senior signals interviewers chase

Signal 1 — Bronze is append-only, not overwrite. Junior engineers say "Bronze is the raw zone"; senior engineers say "Bronze is the immutable audit trail — every ingest is appended, schema-on-read, never overwritten, because the day you need to re-derive Silver and Gold from a bug fix, only an append-only Bronze can replay history."

Signal 2 — Silver is where contracts live. Junior engineers conflate Silver and Gold; senior engineers say "Silver is the conformed warehouse layer — types are real, deduplication is enforced, late-arriving data is merged, business keys are unique. Silver is the table I'd run a dbt test against."

Signal 3 — Gold is read-optimised, denormalised, and aggregated. Junior engineers leave Gold normalised; senior engineers say "Gold is whatever shape the consumer wants — usually a wide, denormalised, partition-pruned, often-pre-aggregated table built to answer one question fast; we accept duplication because read latency wins."

Signal 4 — every layer is a Delta table. Junior engineers think Bronze is "files" and Gold is "tables"; senior engineers say "all three layers are Delta tables — same _delta_log, same ACID guarantees, same time travel — the difference is contract and grain, not technology."

SQL
Topic — etl
ETL pipeline drills

Practice →

Company
Databricks interview set
Databricks interview practice

Practice →

Solution Using a 5-stage lakehouse coverage matrix

Code.

-- One canonical coverage matrix — every row maps a lakehouse stage to an artefact.
CREATE TABLE lakehouse_coverage_matrix AS
SELECT * FROM (VALUES
    (1, 'anatomy',       'object_storage',         's3 / adls / gcs',                  'always-on'),
    (1, 'anatomy',       'delta_transactional',    '_delta_log + parquet',             'always-on'),
    (1, 'anatomy',       'compute_engines',        'photon + spark + streaming + ml',  'on-demand'),
    (1, 'anatomy',       'unity_catalog',          'authn + authz + lineage + audit',  'always-on'),
    (2, 'medallion',     'bronze_raw',             'append-only + schema-on-read',     'every load'),
    (2, 'medallion',     'silver_cleansed',        'typed + deduped + conformed',      'every load'),
    (2, 'medallion',     'gold_business',          'aggregated + denormalised + wide', 'every load'),
    (3, 'delta',         'acid',                   'merge into target using updates',  'every write'),
    (3, 'delta',         'time_travel',            'select ... version as of N',       'on-demand'),
    (3, 'delta',         'optimize_z_order',       'compact files + cluster columns',  'nightly'),
    (4, 'pipeline',      'auto_loader_ingest',     'incremental file detection',       'continuous'),
    (4, 'pipeline',      'dlt_declarative',        'expectations + autoscale',         'continuous'),
    (5, 'governance',    'expectations',           'expect / drop / fail on bad rows', 'every load')
) AS t(stage_id, stage_name, artefact_name, primitive, cadence);

Step-by-step trace.

stage_id	stage_name	artefact_name	primitive	cadence
1	anatomy	object_storage	s3 / adls / gcs	always-on
1	anatomy	delta_transactional	_delta_log + parquet	always-on
1	anatomy	compute_engines	photon + spark + streaming + ml	on-demand
1	anatomy	unity_catalog	authn + authz + lineage + audit	always-on
2	medallion	bronze_raw	append-only + schema-on-read	every load
2	medallion	silver_cleansed	typed + deduped + conformed	every load
2	medallion	gold_business	aggregated + denormalised + wide	every load
3	delta	acid	merge into target using updates	every write
3	delta	time_travel	select ... version as of N	on-demand
3	delta	optimize_z_order	compact files + cluster columns	nightly
4	pipeline	auto_loader_ingest	incremental file detection	continuous
4	pipeline	dlt_declarative	expectations + autoscale	continuous
5	governance	expectations	expect / drop / fail on bad rows	every load

Row 1 — object_storage is the cheapest, infinitely scalable substrate; everything else stacks on top.
Row 2 — delta_transactional is what makes the lakehouse possible — without the _delta_log, you have a data lake, not a lakehouse.
Rows 3-4 — compute_engines + unity_catalog complete the four-layer stack; one storage, many engines, one governance.
Rows 5-7 — the medallion layers map content to grain; Bronze keeps everything, Silver makes it correct, Gold makes it consumable.
Rows 8-10 — delta mechanics are the physics — ACID, time travel, OPTIMIZE — every senior question touches one of them.
Rows 11-12 — the production pipeline glue — Auto Loader for ingest, DLT for declarative orchestration.
Row 13 — DLT expectations are the QA layer; every load asserts data quality before it advances.

Output.

stage_id	stage_name	artefact_name	cadence
1	anatomy	object_storage	always-on
2	medallion	bronze_raw	every load
2	medallion	silver_cleansed	every load
2	medallion	gold_business	every load
3	delta	acid	every write
3	delta	optimize_z_order	nightly
4	pipeline	auto_loader_ingest	continuous
4	pipeline	dlt_declarative	continuous

Why this works — concept by concept:

Stage coverage matrix — turns the 5-stage map into an auditable artefact; every architectural decision is owned by exactly one stage, so you can talk to coverage gaps in one query.
Cadence binding — pairs each artefact with its run cadence (always-on, every load, nightly, continuous); senior engineers explicitly assign cadence per artefact.
Primitive column — codifies the implementation of the artefact (merge into, _delta_log, expect / drop / fail); interviewers love a candidate who can name the primitive, not just the artefact.
Stage 3 is the differentiator — the four Delta mechanics (ACID, time travel, OPTIMIZE, VACUUM) are the answers that distinguish lakehouse fluency from generic Spark fluency.
Cost — O(1) to read the coverage matrix; the actual artefacts are O(N) over the underlying tables but parallelisable across the five stages.

2. Lakehouse anatomy — storage + transactional layer + multi-engine compute + Unity Catalog

`lakehouse architecture` — four layers, one platform

lakehouse architecture is best understood as a four-layer stack stacked vertically and read top-down: at the bottom, cheap object storage (S3, ADLS, GCS) holds the actual bytes; in the middle, a transactional layer (Delta Lake, Apache Iceberg, or Apache Hudi) gives those bytes ACID semantics through a JSON-encoded transaction log; on top, multiple compute engines (Photon SQL, Spark batch, Structured Streaming, ML notebooks, BI tools) read and write the same tables; and threaded through all three, a governance plane (Unity Catalog on Databricks) handles permissions, lineage, and audit. The interview test is whether you can explain each layer in one sentence and say why removing any one of them collapses the model back to either a warehouse or a lake.

Layer 1 — object storage (the cheap, infinite substrate).

s3 / adls / gcs — the substrate; pay-per-GB, eleven nines of durability, infinite scale, schema-agnostic.
open formats — Parquet (columnar), JSON (raw), CSV (legacy); the lakehouse never locks data inside a proprietary format.
bucket organisation — typically s3://bucket/<env>/<medallion_layer>/<table_name>/<partition_cols>/; one bucket per workspace is common.
why this layer exists — warehouses store data on expensive coupled storage; the lakehouse decouples storage from compute and pays warehouse-grade only for the brief minutes a cluster runs.

Layer 2 — Delta Lake (the transactional layer).

_delta_log — a sub-directory next to the data files containing one JSON file per commit; this log is the source of truth, not the Parquet files.
ACID — atomic / consistent / isolated / durable writes; concurrent writers serialise via the log, never via a database server.
schema enforcement + evolution — bad rows are rejected at write time; intentional schema changes are explicit (ALTER TABLE).
time travel — SELECT * FROM tbl VERSION AS OF 42 or TIMESTAMP AS OF '2026-05-01'; the log retains every version up to a retention window.
why this layer exists — plain Parquet on S3 has no commits, no rollback, no concurrency control; the lakehouse needs warehouse-grade reliability on lake-grade storage, and the log delivers it.

Layer 3 — compute engines (the polyglot layer).

Photon — Databricks' vectorised C++ SQL engine; up to 10x faster than open-source Spark on common BI workloads.
Spark batch — the workhorse for medallion ETL; spark.read.format('delta') + df.write.format('delta').
Structured Streaming — the same DataFrame API for streams; reads Kafka / Kinesis / Auto Loader, writes to Delta with exactly-once.
SQL Warehouses — serverless SQL endpoints for BI tools; auto-suspend, auto-scale, Photon-backed.
ML runtimes — pre-baked images with PyTorch, TensorFlow, XGBoost, scikit-learn; notebooks query the same Gold tables BI consumes.

Layer 4 — Unity Catalog (the governance plane).

three-level namespace — catalog.schema.table replaces the flat 2-level hive_metastore.database.table.
fine-grained ACLs — GRANT / REVOKE on catalogs, schemas, tables, rows, and columns (via row-filter + column-mask functions).
lineage — Unity Catalog tracks which table fed which downstream table, all the way to dashboards and ML models.
audit log — every read, write, GRANT, REVOKE is captured to system tables; SOC2 / HIPAA / GDPR ready.
cross-workspace — one catalog spans all workspaces in the account; no more per-workspace hive_metastore duplication.

Worked example — write the four-layer stack as a Spark notebook

Detailed explanation. Real interviews ask you to show that you can invoke each lakehouse layer in code. Below is the canonical four-cell notebook that touches storage (layer 1), Delta transactions (layer 2), multi-engine compute (layer 3), and Unity Catalog (layer 4).

Question. Write a 4-cell PySpark notebook that (a) reads raw JSON from S3, (b) writes it to a Delta table with a schema, (c) queries the Delta table from SQL, (d) grants SELECT on the table to an analyst group via Unity Catalog.

Input. s3://acme-lakehouse/raw/orders/dt=2026-05-28/*.json and a Unity Catalog analyst_group already created at the account level.

Code.

# Cell 1 — Layer 1: read from object storage
raw_df = (
    spark.read
         .format("json")
         .option("multiLine", "false")
         .load("s3://acme-lakehouse/raw/orders/dt=2026-05-28/")
)

# Cell 2 — Layer 2: write to a Delta table (transactional)
(
    raw_df.write
          .format("delta")
          .mode("append")
          .option("mergeSchema", "true")
          .saveAsTable("acme.bronze.raw_orders")
)

-- Cell 3 — Layer 3: query the same table from SQL (Photon-backed)
SELECT region, count(*) AS orders, sum(amount) AS revenue
FROM   acme.bronze.raw_orders
WHERE  date(order_ts) = '2026-05-28'
GROUP  BY region;

-- Cell 4 — Layer 4: grant SELECT via Unity Catalog
GRANT SELECT ON TABLE acme.bronze.raw_orders TO `analyst_group`;

Step-by-step explanation.

Cell 1 — spark.read.format("json") against an s3:// path uses Layer 1 (object storage) directly; no warehouse compute needed.
Cell 2 — .format("delta").mode("append") writes Parquet files plus a new _delta_log/00000000000000000001.json commit; this is Layer 2 in action.
Cell 3 — the same physical Delta table is queryable from SQL through Photon; the engine is different from the writer but the data is the same — that is Layer 3's multi-engine promise.
Cell 4 — GRANT SELECT to a group goes through Layer 4 (Unity Catalog); every subsequent read by anyone in analyst_group is recorded in the audit log.

Output (Cell 3 result).

region	orders	revenue
US	42137	1289450.75
EU	18204	612900.33
APAC	9810	287113.90

Rule of thumb: every layer is invokable in one line of code. Junior engineers think the lakehouse is "Spark + S3"; senior engineers can write the four cells above without looking it up.

`lakehouse vs data warehouse vs data lake` — the three senior tradeoffs

cost — warehouses charge for coupled storage + compute (~$23/TB/month for Snowflake storage alone); lakehouses pay $0.023/TB/month for S3 plus per-second compute.
schema enforcement — warehouses enforce schema on write (strict); lakes enforce schema on read (loose); lakehouses enforce schema on write via Delta but allow safe evolution.
workload coverage — warehouses do BI great, ML poorly; lakes do ML great, BI poorly; lakehouses do both — same Delta table feeds Power BI and a PyTorch DataLoader.
governance — warehouses ship strong governance out of the box; lakes ship none; lakehouses ship Unity Catalog which closed the gap in 2022-2024.
vendor lock-in — warehouses lock data in proprietary formats; lakes and lakehouses keep open Parquet that any engine can read.

SQL
Topic — database
Database design drills

Practice →

SQL
Topic — design
System design problems

Practice →

Solution Using a one-table comparison matrix

Code.

-- A single comparison matrix; row = decision criterion, columns = the three architectures.
CREATE TABLE lakehouse_vs_warehouse_vs_lake AS
SELECT * FROM (VALUES
    ('storage_cost_per_tb',  '$0.023 / mo (S3)',     '$23 / mo (Snowflake)',  '$0.023 / mo (S3)'),
    ('schema_enforcement',   'on read (loose)',      'on write (strict)',     'on write (Delta strict)'),
    ('acid',                 'no',                   'yes',                   'yes (delta_log)'),
    ('time_travel',          'no',                   'limited (fail-safe)',   'yes (version as of)'),
    ('bi_latency_ms',        '> 10000 (cold)',       '< 500',                 '< 500 (Photon)'),
    ('ml_workload',          'native',               'awkward',               'native'),
    ('streaming',            'awkward',              'awkward',               'native (Structured Streaming)'),
    ('vendor_lock_in',       'low',                  'high',                  'low (open Parquet)'),
    ('governance',           'none',                 'strong',                'strong (Unity Catalog)')
) AS t(criterion, data_lake, data_warehouse, lakehouse);

Step-by-step trace.

criterion	data_lake	data_warehouse	lakehouse
storage_cost_per_tb	$0.023 / mo (S3)	$23 / mo (Snowflake)	$0.023 / mo (S3)
schema_enforcement	on read (loose)	on write (strict)	on write (Delta strict)
acid	no	yes	yes (delta_log)
time_travel	no	limited (fail-safe)	yes (version as of)
bi_latency_ms	> 10000 (cold)	< 500	< 500 (Photon)
ml_workload	native	awkward	native
streaming	awkward	awkward	native (Structured Streaming)
vendor_lock_in	low	high	low (open Parquet)
governance	none	strong	strong (Unity Catalog)

Storage cost — the lakehouse inherits the lake's 1000x cheaper storage; this is the single biggest economic reason teams migrate.
Schema + ACID — the lakehouse inherits the warehouse's reliability; the _delta_log is the mechanism.
BI latency — Photon on Delta competes with Snowflake / Redshift on common dashboards; the gap that existed in 2021 has closed.
ML + streaming — only the lakehouse handles both natively; warehouses bolt them on through external services.
Vendor lock-in — Parquet is portable; if Databricks went away tomorrow, your Delta tables remain readable.
Governance — Unity Catalog is the 2022-2024 development that finally let the lakehouse win on this dimension.

Output.

criterion	data_lake	data_warehouse	lakehouse
storage_cost_per_tb	$0.023 / mo	$23 / mo	$0.023 / mo
acid	no	yes	yes
time_travel	no	limited	yes
ml_workload	native	awkward	native
streaming	awkward	awkward	native
governance	none	strong	strong

Why this works — concept by concept:

Single matrix — interviewers love a one-table answer that shows you can compare three architectures on the same axes; it is the structural signal of senior thinking.
Cost row first — economics drive the migration; lead with the 1000x storage delta and the rest follows.
ACID + time travel — the two rows that explain why the lakehouse isn't just a re-branded data lake.
Streaming + ML — the two workloads where warehouses lose decisively; calling them out preempts the "but Snowflake also does ML now" follow-up.
Governance — the 2022-2024 closing argument; Unity Catalog removed the last warehouse advantage on governance.
Cost — O(1) to read the matrix; the actual architectural decisions cascade into O(P) migrations where P = pipeline count.

3. Medallion architecture — Bronze raw → Silver cleansed → Gold business marts

`medallion architecture` — three layers, two transforms, one contract

medallion architecture is the canonical layering pattern Databricks recommends for organising every table inside a lakehouse: Bronze holds raw data exactly as it arrived, Silver holds cleansed, conformed, deduplicated data with real types, and Gold holds business-ready aggregates and denormalised marts shaped for BI / ML consumption. The interview test is whether you can name what belongs in each layer, the two transforms that bind each pair (Bronze→Silver is cleanse + conform + dedupe; Silver→Gold is aggregate + join + denormalise), and one contract that each layer must honour to the next.

Bronze — the raw audit trail.

bronze.raw_orders — every row from every ingest run, appended forever; same schema as the source.
schema-on-read — the table absorbs whatever the source emits; we cast types at read time, not write time.
append-only — never overwrite; if today's load was buggy we re-run Silver and Gold from Bronze, never re-ingest.
source-of-truth — Bronze is the artefact of record; everything downstream is derivable from Bronze + the transformation code.
metadata columns — _metadata.file_name, ingest_ts, pipeline_run_id — added at ingest, never sourced upstream.

Silver — the cleansed warehouse layer.

silver.orders_clean — typed, deduplicated, conformed; one row per business key, types match the contract.
cleansing transforms — cast to DECIMAL(18,4), normalise text case, fill nullable defaults, parse timestamps.
deduplication — QUALIFY row_number() OVER (PARTITION BY business_key ORDER BY ingest_ts DESC) = 1; replays are idempotent.
enrichment + joins — join Bronze sources together; bring in dimension lookups (e.g. customer dim, region dim).
expectations — DLT expect(col IS NOT NULL) / expect_or_drop / expect_or_fail; the layer where DQ lives.

Gold — the business mart.

gold.daily_revenue_by_region — aggregated to the grain BI asks for; partitioned by date for prune-friendly queries.
denormalised — wide tables that fold dimensional joins into one row per fact; BI tools love them.
aggregations — count, sum, avg, distinct counts; the SLA target is < 1 sec query latency from a SQL Warehouse.
one Gold per consumer — different dashboards can have different Gold tables; we trade storage for read speed.
reverse ETL feed — Gold tables often feed Hightouch / Census back into Salesforce, HubSpot, Iterable.

The two transforms — the verbs that move data between layers.

Bronze → Silver — cleanse + conform + dedupe + enrich; the verb is "make it correct".
Silver → Gold — aggregate + join + denormalise; the verb is "make it useful".
The contract — Silver must be idempotent re-derivable from Bronze, Gold must be idempotent re-derivable from Silver; the medallion is then a replay-safe DAG.

Worked example — write the three medallion tables for a `clickstream` source

Detailed explanation. Real interviews want you to walk a non-orders example (so you cannot rely on muscle memory) and produce all three layers. Below is the canonical clickstream walkthrough.

Question. Raw web clickstream lands in s3://bucket/raw/clicks/ as JSON every 5 minutes. Build Bronze, Silver, and Gold so the marketing team can query daily_sessions_by_country on a SQL Warehouse with sub-second latency.

Input. Raw click JSON: {"event_id":"abc-123","user_id":42,"url":"/home","country":null,"ts":"2026-05-28T22:31:09Z","ua":"Mozilla/5.0..."}. Roughly 200M rows per day, ~10% duplicates from retries.

Code.

-- Bronze — schema-on-read, append-only audit trail
CREATE TABLE bronze.raw_clicks
USING DELTA
LOCATION 's3://acme-lakehouse/bronze/raw_clicks/'
AS SELECT *,
          _metadata.file_name AS source_file,
          current_timestamp() AS ingest_ts
   FROM   read_files('s3://bucket/raw/clicks/', format => 'json');

-- Silver — typed, deduplicated, country defaulted, sessions assigned
CREATE OR REPLACE TABLE silver.clicks_clean
USING DELTA
PARTITIONED BY (event_date)
AS SELECT event_id,
          user_id::BIGINT                   AS user_id,
          url,
          coalesce(country, 'UNKNOWN')      AS country,
          ts::TIMESTAMP                     AS event_ts,
          date(ts::TIMESTAMP)               AS event_date,
          session_id_from_ua_user(ua, user_id) AS session_id
   FROM   bronze.raw_clicks
   WHERE  event_id IS NOT NULL
   QUALIFY row_number() OVER (PARTITION BY event_id ORDER BY ingest_ts DESC) = 1;

-- Gold — sessions aggregated by day + country, BI-ready
CREATE OR REPLACE TABLE gold.daily_sessions_by_country
USING DELTA
PARTITIONED BY (event_date)
AS SELECT event_date,
          country,
          count(DISTINCT session_id) AS sessions,
          count(*)                   AS page_views,
          count(DISTINCT user_id)    AS unique_users
   FROM   silver.clicks_clean
   GROUP  BY event_date, country;

Step-by-step explanation.

Bronze — read_files is Auto Loader sugar; it incrementally tracks already-ingested files and only loads new ones. We add source_file + ingest_ts so a Silver bug can be replayed against the right Bronze partition.
Silver — we cast user_id to BIGINT, parse ts to TIMESTAMP, default country to UNKNOWN (never propagate nulls into a GROUP BY), assign a deterministic session_id, and dedupe by event_id. The event_date partition column lets us prune Gold queries cheaply.
Gold — we aggregate to the grain marketing actually queries (event_date, country) and compute three metrics. Because the table is small (one row per (date, country)) and partitioned, a Power BI dashboard query returns in well under a second.
Replay safety — if the session_id algorithm has a bug, we re-derive Silver and Gold from the existing Bronze; we never re-ingest from the source.

Output (Gold table, first 3 rows).

event_date	country	sessions	page_views	unique_users
2026-05-28	US	1820411	18204110	743192
2026-05-28	UNKNOWN	412037	4120370	165823
2026-05-28	DE	198440	1984400	84112

Rule of thumb: every medallion stack has the same three verbs — ingest, cleanse, aggregate. Senior engineers can write all three SQL blocks above on a whiteboard in under five minutes.

`bronze silver gold` — the four senior gotchas

Don't MERGE into Bronze. Bronze is append-only. The day you MERGE into Bronze you lose the audit trail and replay safety; do all MERGEs in Silver.
Silver is where deduplication lives. Duplicate event_ids from at-least-once delivery are normal in Bronze; Silver's row_number() = 1 filter is the only place dedup belongs.
Gold is denormalised by design. Resist the SQL purist instinct to keep Gold normalised; the storage cost is trivial and the query-time join cost is enormous.
Layer per consumer is fine. One BI team can own gold.daily_revenue_by_region, another can own gold.weekly_revenue_by_product; both derive from the same Silver. Storage is cheap.

SQL
Topic — aggregation
Aggregation pattern drills

Practice →

SQL
Topic — etl
ETL pipeline practice

Practice →

Solution Using a single Bronze → Silver → Gold DAG with explicit contracts

Code.

-- One declarative DAG; the contract column says what the next layer expects.
CREATE TABLE medallion_contract_orders AS
SELECT * FROM (VALUES
    (1, 'bronze.raw_orders',     'append_only',  'every column as-string + ingest metadata', 'read_files(json)'),
    (2, 'silver.orders_clean',   'overwrite',    'order_id BIGINT NOT NULL UNIQUE; amount DECIMAL(18,4); region NOT NULL; deduped by order_id', 'CTAS from bronze + row_number=1'),
    (3, 'gold.daily_revenue',    'overwrite',    'one row per (date,region); count(*) + sum(amount); partitioned by date', 'CTAS from silver + GROUP BY'),
    (4, 'gold.user_segments',    'overwrite',    'one row per user_id; LTV bucket + activity tier; partitioned by snapshot_date', 'CTAS from silver + windowed scoring'),
    (5, 'gold.exec_dashboard',   'overwrite',    'wide one-row-per-day denormalised mart; powers exec PBI dashboard', 'CTAS from multiple silver + gold tables')
) AS t(layer_order, table_name, write_mode, contract, transform);

Step-by-step trace.

layer_order	table_name	write_mode	contract	transform
1	bronze.raw_orders	append_only	every column as-string + ingest metadata	read_files(json)
2	silver.orders_clean	overwrite	order_id BIGINT NOT NULL UNIQUE; amount DECIMAL(18,4); region NOT NULL; deduped	CTAS from bronze + row_number=1
3	gold.daily_revenue	overwrite	one row per (date,region); count(*) + sum(amount); partitioned by date	CTAS from silver + GROUP BY
4	gold.user_segments	overwrite	one row per user_id; LTV bucket + activity tier	CTAS from silver + windowed scoring
5	gold.exec_dashboard	overwrite	wide one-row-per-day denormalised mart	CTAS from multiple silver + gold

Row 1 — Bronze write mode is append_only — every load adds rows, never overwrites; this is the single most-violated medallion rule in junior code.
Row 2 — Silver write mode is overwrite (or MERGE for incremental) — the table is idempotent re-derivable from Bronze + transformation code.
Rows 3-5 — Gold has multiple tables — one per consumer / dashboard; storage cost is trivial, query latency wins.
The contract column codifies what the next layer expects; junior engineers store this in Confluence, senior engineers store it in DDL constraints + DLT expectations.
The transform column codifies the verb between layers; this is the column reviewers actually inspect.

Output.

layer_order	table_name	write_mode	contract
1	bronze.raw_orders	append_only	every column as-string + ingest metadata
2	silver.orders_clean	overwrite	order_id BIGINT NOT NULL UNIQUE; deduped
3	gold.daily_revenue	overwrite	one row per (date,region)
4	gold.user_segments	overwrite	one row per user_id
5	gold.exec_dashboard	overwrite	wide denormalised mart

Why this works — concept by concept:

Append-only Bronze — the single rule that makes replay possible; once you overwrite Bronze, the history is gone forever.
Contract column — codifies what the next layer assumes; this is the artefact reviewers can audit at PR time.
Overwrite Silver — idempotency comes from "Silver = pure function of Bronze + code"; rebuilds are safe.
Multi-Gold — different consumers get different shapes; the alternative (one mega-Gold) becomes a coordination nightmare.
Layer order — interviewers love seeing the dependency order encoded explicitly; it signals you think in DAGs.
Cost — O(1) to read the matrix; the actual DAG is O(N · M) for N rows across M layers, but every step is parallelisable per partition.

4. Delta Lake mechanics — ACID + time travel + OPTIMIZE + Z-ORDER

`delta lake` mechanics — Parquet + transaction log + four headline features

delta lake is, at the file level, just a directory of Parquet data files plus a sibling _delta_log/ directory that contains one JSON file per commit. That tiny piece of metadata — one JSON file per commit — is the entire magic: it gives plain Parquet on S3 the four headline features warehouses charge for — ACID transactions, time travel, schema enforcement + evolution, and performance optimisations (OPTIMIZE + Z-ORDER). The interview test is whether you can name what the _delta_log does, write a MERGE INTO from memory, query a previous version with VERSION AS OF, and reason about small-file compaction and multi-dim clustering.

The _delta_log — one JSON per commit.

00000000000000000000.json — the initial commit; contains the metadata action ({"metaData":{"schemaString":...}}) and a list of added files ({"add":{"path":"part-0000.parquet",...}}).
00000000000000000001.json — the next commit; contains added + removed file actions; older Parquet files are tombstoned but not deleted (until VACUUM).
_last_checkpoint — a pointer file; every 10 commits Delta writes a Parquet checkpoint that consolidates the log so readers don't replay 10,000 JSONs.
Why JSON, not a DB? — JSON is human-readable, debuggable, and replicates trivially across regions; the price is O(commits) read cost without checkpoints.

Feature 1 — ACID via the log.

Atomic — a commit is the appearance of a new JSON file in _delta_log/; either fully written or not at all.
Consistent — every reader picks the most recent committed version; partial writes are invisible.
Isolated — optimistic concurrency control — writers detect concurrent commits and retry; serialisable isolation by default.
Durable — the JSON log lives on S3's 11-nines storage; once committed, the version exists forever (until intentional truncation).

-- ACID example: a MERGE that's safe under concurrent writes.
MERGE INTO silver.orders_clean AS t
USING bronze_changes AS s
ON t.order_id = s.order_id
WHEN MATCHED AND s.op = 'U' THEN UPDATE SET amount = s.amount, updated_ts = current_timestamp()
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN NOT MATCHED AND s.op = 'I' THEN INSERT (order_id, amount, region, order_ts)
                                  VALUES (s.order_id, s.amount, s.region, s.order_ts);

Feature 2 — time travel.

VERSION AS OF — SELECT * FROM silver.orders_clean VERSION AS OF 42 returns the table as of commit 42.
TIMESTAMP AS OF — SELECT * FROM silver.orders_clean TIMESTAMP AS OF '2026-05-28 06:00:00' returns the table as of that wall-clock.
DESCRIBE HISTORY — DESCRIBE HISTORY silver.orders_clean lists every commit, user, operation, and metrics.
RESTORE — RESTORE silver.orders_clean TO VERSION AS OF 42 is the atomic rollback of a bad write.
Retention — controlled by delta.deletedFileRetentionDuration (default 7 days); after that, VACUUM can purge.

Feature 3 — schema enforcement + evolution.

Enforcement — by default, a write with a new column fails; data is rejected, not silently dropped.
Evolution — mergeSchema=true on a write allows adding (only adding) columns; existing rows get NULL for the new column.
ALTER TABLE — ALTER TABLE ... ADD COLUMNS / RENAME COLUMN / DROP COLUMN for explicit governance.
Type widening — Delta 3.0+ supports safe type widening (INT → BIGINT); narrowing requires a rewrite.

Feature 4 — OPTIMIZE + Z-ORDER + VACUUM.

OPTIMIZE — coalesces many small Parquet files into fewer ~1 GB files; massive read-perf win.
OPTIMIZE ... ZORDER BY (col1, col2) — multi-dimensional clustering; files are organised so prediates on col1 and col2 prune efficiently.
VACUUM — deletes tombstoned Parquet files older than the retention window; reclaims S3 cost.
Liquid Clustering — the 2024 replacement for ZORDER; one-time CLUSTER BY (col1, col2) DDL, auto-maintained.

Worked example — implement `MERGE INTO` + time-travel rollback + `OPTIMIZE ZORDER`

Detailed explanation. Real interviews ask you to write the full Delta mechanics flow on a single table: incremental MERGE, a time-travel rollback after a bad commit, and a maintenance OPTIMIZE ZORDER. Below is the canonical block.

Question. A nightly CDC stream bronze.cdc_orders lands as (order_id, op, amount, order_ts) with op IN ('I','U','D'). Write (a) the MERGE INTO silver.orders_clean, (b) the rollback after a bad release, and (c) the maintenance step that keeps silver.orders_clean fast.

Input. silver.orders_clean has 100M rows; bronze.cdc_orders adds ~500K daily changes; the table is queried frequently by (customer_id, order_ts) predicates.

Code.

-- (a) Idempotent MERGE — the canonical Silver upsert
MERGE INTO silver.orders_clean AS t
USING (
    SELECT order_id, op, amount, region, order_ts
    FROM   bronze.cdc_orders
    WHERE  ingest_date = current_date()
) AS s
ON t.order_id = s.order_id
WHEN MATCHED AND s.op = 'U' THEN UPDATE SET amount = s.amount, region = s.region, updated_ts = current_timestamp()
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN NOT MATCHED AND s.op = 'I' THEN INSERT (order_id, amount, region, order_ts, ingest_ts)
                                  VALUES (s.order_id, s.amount, s.region, s.order_ts, current_timestamp());

-- (b) Bad release rollback — restore to the last-known-good version
DESCRIBE HISTORY silver.orders_clean;          -- inspect commits
RESTORE TABLE silver.orders_clean TO VERSION AS OF 1337;  -- atomic rollback

-- (c) Maintenance — compact small files and cluster by hot predicates
OPTIMIZE silver.orders_clean
ZORDER BY (customer_id, order_ts);

Step-by-step explanation.

MERGE — one statement does insert / update / delete based on op; the _delta_log records the whole thing as a single commit, so readers either see the full batch or none of it.
DESCRIBE HISTORY — lists every commit with version number, user, operation, and metrics; this is the artefact you git blame for tables.
RESTORE TO VERSION AS OF 1337 — atomic rollback; the next commit is a new version that contains the old version's contents.
OPTIMIZE ... ZORDER BY (customer_id, order_ts) — rewrites the data files so rows that share customer_id and similar order_ts end up in the same file; predicates like WHERE customer_id = 42 AND order_ts > '2026-05-01' can skip most files entirely.

Output (DESCRIBE HISTORY excerpt).

version	timestamp	userName	operation	operationMetrics
1336	2026-05-28 06:00:00	etl_user	MERGE	{numOutputRows: 482103, numUpdatedRows: 18204}
1337	2026-05-28 06:30:00	etl_user	OPTIMIZE	{numFilesAdded: 142, numFilesRemoved: 9810}
1338	2026-05-29 06:00:00	etl_user	MERGE	{numOutputRows: 503112, BAD_RELEASE: true}
1339	2026-05-29 06:45:00	oncall	RESTORE	{restoredToVersion: 1337}

Rule of thumb: MERGE is the verb for Silver; RESTORE is the verb for incidents; OPTIMIZE ... ZORDER BY is the verb for performance. Senior engineers can write all three on a whiteboard in under three minutes.

`delta lake` — the four senior gotchas

Don't VACUUM aggressively. The default retention is 7 days for a reason — time travel depends on the tombstoned files being kept. VACUUM RETAIN 0 HOURS deletes the very files you'd RESTORE from.
MERGE is O(matched files). A MERGE that touches one partition rewrites only that partition; partitioning the target Silver table on a hot predicate (event_date) keeps MERGE cheap.
ZORDER is multi-dim, partitioning is single-dim. Use partitioning on low-cardinality time columns (event_date); use ZORDER for the 2-4 high-cardinality predicates BI runs against.
Schema enforcement is on by default. A producer adding a column with no coordination will fail the write — this is the desired behaviour. mergeSchema=true is opt-in, never default.

SQL
Topic — data-validation
Data-validation practice

Practice →

SQL
Topic — optimization
Optimization drills

Practice →

Solution Using a single Delta mechanics cheat-table

Code.

-- One canonical cheat-table; row = mechanic, columns = primitive + when + caveats.
CREATE TABLE delta_mechanics_cheatsheet AS
SELECT * FROM (VALUES
    ('acid',              'MERGE INTO / INSERT / DELETE',           'every write',          'OCC retries on conflict'),
    ('time_travel',       'SELECT ... VERSION AS OF N',             'incident triage',      'bounded by retention window'),
    ('restore',           'RESTORE TABLE t TO VERSION AS OF N',     'bad-release rollback', 'atomic, creates new version'),
    ('schema_enforce',    'default on write',                       'every write',          'mergeSchema=true to evolve'),
    ('schema_evolve',     'ALTER TABLE ... ADD COLUMNS',            'planned changes',      'add-only is safe; drop is rewrite'),
    ('optimize',          'OPTIMIZE t',                             'nightly',              'compacts small files'),
    ('z_order',           'OPTIMIZE t ZORDER BY (a, b)',            'after OPTIMIZE',       'best on 2-4 high-cardinality cols'),
    ('liquid_clustering', 'ALTER TABLE t CLUSTER BY (a, b)',        'one-time DDL',         '2024+ replacement for ZORDER'),
    ('vacuum',            'VACUUM t RETAIN 168 HOURS',              'weekly',               'do not lower retention below 7d')
) AS t(mechanic, primitive, cadence, caveat);

Step-by-step trace.

mechanic	primitive	cadence	caveat
acid	MERGE INTO / INSERT / DELETE	every write	OCC retries on conflict
time_travel	SELECT ... VERSION AS OF N	incident triage	bounded by retention window
restore	RESTORE TABLE t TO VERSION AS OF N	bad-release rollback	atomic, creates new version
schema_enforce	default on write	every write	mergeSchema=true to evolve
schema_evolve	ALTER TABLE ... ADD COLUMNS	planned changes	add-only is safe; drop is rewrite
optimize	OPTIMIZE t	nightly	compacts small files
z_order	OPTIMIZE t ZORDER BY (a, b)	after OPTIMIZE	best on 2-4 high-cardinality cols
liquid_clustering	ALTER TABLE t CLUSTER BY (a, b)	one-time DDL	2024+ replacement for ZORDER
vacuum	VACUUM t RETAIN 168 HOURS	weekly	do not lower retention below 7d

acid — the foundation; every other mechanic assumes ACID semantics.
time_travel + restore — two sides of the same coin; one for inspection, one for rollback.
schema_enforce + evolve — enforcement is the default safety net; evolution is the opt-in escape hatch.
optimize + z_order + liquid_clustering — performance trio; small-file compaction first, then clustering on hot predicates.
vacuum — the only destructive operation; the cheatsheet pairs it with a don't go below 7 days caveat.
The cheatsheet collapses to: MERGE to write, VERSION AS OF to inspect, RESTORE to undo, OPTIMIZE to speed up, VACUUM rarely.

Output.

mechanic	primitive	cadence
acid	MERGE INTO	every write
time_travel	VERSION AS OF	incident
restore	RESTORE	rollback
optimize	OPTIMIZE	nightly
z_order	ZORDER BY	after OPTIMIZE
vacuum	VACUUM 168 HOURS	weekly

Why this works — concept by concept:

Single cheat-table — interviewers love a one-table answer where you can name primitive + cadence + caveat; this collapses three follow-ups into one artefact.
OCC mention — optimistic concurrency control is the specific mechanism Delta uses; calling it out is a senior signal.
Liquid Clustering — naming the 2024+ replacement for ZORDER shows you're on the current Delta version.
VACUUM caveat — the most common production foot-gun; pairing it with the 7-day rule preempts the obvious follow-up.
MERGE as default — MERGE is the verb for any Silver / Gold write where rows can be updated; calling it out as the default is a senior signal.
Cost — O(1) to read the cheatsheet; the actual mechanics are O(matched files) for MERGE, O(N) for OPTIMIZE, O(commits) for time travel.

5. End-to-end production lakehouse pipeline (sources → Bronze → Silver → Gold → BI/ML)

`databricks medallion` in production — sources, ingest, transform, serve

A production databricks lakehouse pipeline is a left-to-right pipeline with five concrete bands: sources (Kafka, RDBMS CDC, S3 file drops), ingest (Auto Loader, Kafka Structured Streaming, Debezium connectors), transform (Spark batch jobs or delta live tables declarative pipelines), serve (Gold tables behind a SQL Warehouse or Delta Sharing), and consumers (Power BI / Tableau, SQL endpoints, ML notebooks, reverse ETL). Threading through all five is Unity Catalog for permissions + lineage + audit. The interview test is whether you can draw all five bands on a whiteboard and name one concrete primitive in each.

Band 1 — sources.

Kafka / Kinesis / Event Hubs — high-throughput append streams; usually JSON or Avro encoded.
S3 / ADLS / GCS file drops — vendor CSVs, partner Parquet, mobile-SDK JSON dumps.
RDBMS CDC — Debezium / Fivetran / native Lakehouse Federation read change feeds from Postgres / MySQL / SQL Server.
SaaS APIs — Salesforce / HubSpot / Stripe via Fivetran / Airbyte; landed as Parquet in the raw bucket.

Band 2 — ingest.

Auto Loader — spark.readStream.format("cloudFiles").option("cloudFiles.format", "json")...; incremental file detection without listObjects scans.
Kafka Structured Streaming — spark.readStream.format("kafka").option("subscribe", "orders").load(); exactly-once into Delta.
Debezium / Lakehouse Federation — read CDC feeds directly; land as bronze.cdc_orders with op column.
Streaming + batch unified — the same DataFrame API for both; the writer to Delta is identical.

Band 3 — transform (Bronze → Silver → Gold).

Spark batch jobs — Airflow / Workflows orchestrate Python notebooks or JAR jobs; the legacy default.
Delta Live Tables (DLT) — declarative pipelines: @dlt.table + @dlt.expect_or_drop; the framework handles orchestration, retries, autoscale.
Workflows — Databricks' built-in scheduler; replaces a lot of Airflow for Databricks-only DAGs.
Job clusters vs serverless — job clusters spin up per run; serverless compute starts in seconds and is the 2024+ default for shared workloads.

Band 4 — serve.

SQL Warehouses — serverless or pro endpoints; Photon-backed; auto-suspend; per-second billing.
Delta Sharing — open protocol to share Delta tables with external consumers (other workspaces, other vendors).
Materialized views — pre-computed Gold queries; refreshed declaratively.
Streaming tables — continuously-updated Gold-grade tables for real-time dashboards.

Band 5 — consumers.

Power BI / Tableau / Looker — connect to a SQL Warehouse endpoint; queries hit Gold tables directly.
ML notebooks — spark.read.format("delta").load(...) against Silver or Gold; the same tables BI consumes.
Reverse ETL — Hightouch / Census push Gold rows back into Salesforce, HubSpot, Iterable.
Apps / APIs — Databricks SQL Driver, JDBC, REST APIs; product features can read Gold directly.

Worked example — assemble a production pipeline as a Delta Live Tables file

Detailed explanation. Real interviews increasingly ask you to write a DLT file because it shows that you can think in declarative pipelines rather than imperative Airflow DAGs. Below is a complete (compact) DLT module that ingests Kafka orders, builds Silver, and aggregates Gold — with expectations gating each step.

Question. Write a Delta Live Tables Python module that (a) ingests orders from Kafka into a Bronze streaming table, (b) cleans and dedupes into a Silver streaming table with a not_null(order_id) expectation, (c) aggregates into a Gold materialized view of daily_revenue_by_region, and (d) runs continuously with autoscale.

Input. Kafka topic orders (JSON payload), Unity Catalog acme.bronze / acme.silver / acme.gold schemas already created.

Code.

import dlt
from pyspark.sql.functions import col, to_timestamp, upper, coalesce, lit, date, row_number, current_timestamp
from pyspark.sql.window import Window

# (a) Bronze — streaming ingest from Kafka, schema-on-read, append-only
@dlt.table(
    name="bronze_raw_orders",
    table_properties={"delta.appendOnly": "true"},
    comment="Raw orders from Kafka — append-only audit trail",
)
def bronze_raw_orders():
    return (
        spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "kafka.acme:9092")
             .option("subscribe", "orders")
             .load()
             .selectExpr(
                 "CAST(value AS STRING) AS payload_json",
                 "topic", "partition", "offset", "timestamp AS kafka_ts"
             )
             .withColumn("ingest_ts", current_timestamp())
    )

# (b) Silver — typed, deduped, expectations enforced
@dlt.table(name="silver_orders_clean", comment="Cleansed orders ready for analytics")
@dlt.expect_or_drop("valid_order_id",   "order_id IS NOT NULL")
@dlt.expect_or_drop("positive_amount",  "amount > 0")
@dlt.expect("region_known",             "region IN ('US','EU','APAC','LATAM','UNKNOWN')")
def silver_orders_clean():
    parsed = (
        dlt.read_stream("bronze_raw_orders")
           .selectExpr(
               "get_json_object(payload_json, '$.order_id')::BIGINT     AS order_id",
               "get_json_object(payload_json, '$.customer_id')::BIGINT  AS customer_id",
               "upper(get_json_object(payload_json, '$.region'))        AS region",
               "get_json_object(payload_json, '$.amount')::DECIMAL(18,4) AS amount",
               "to_timestamp(get_json_object(payload_json, '$.order_ts')) AS order_ts",
               "coalesce(get_json_object(payload_json, '$.currency'), 'USD') AS currency",
               "ingest_ts",
           )
    )
    w = Window.partitionBy("order_id").orderBy(col("ingest_ts").desc())
    return parsed.withColumn("rn", row_number().over(w)).filter(col("rn") == 1).drop("rn")

# (c) Gold — aggregated business mart, materialised
@dlt.table(name="gold_daily_revenue_by_region", comment="BI surface — daily revenue")
def gold_daily_revenue_by_region():
    return (
        dlt.read("silver_orders_clean")
           .groupBy(date(col("order_ts")).alias("order_date"), col("region"))
           .agg({"order_id": "count", "amount": "sum"})
           .withColumnRenamed("count(order_id)", "order_count")
           .withColumnRenamed("sum(amount)",     "gross_revenue")
    )

Step-by-step explanation.

Bronze — readStream.format("kafka") streams the Kafka topic; we capture the payload as a string plus Kafka metadata; delta.appendOnly=true enforces the audit-trail rule at the table level.
Silver — we parse JSON columns with get_json_object, cast to real types, upper-case region, default currency. The three @dlt.expect* decorators gate data quality: expect_or_drop quietly removes bad rows, expect records the violation count but allows the row through.
Dedup — the Window + row_number() = 1 filter ensures each order_id keeps only its latest version; replays are idempotent.
Gold — a simple groupBy().agg() materialises the daily-revenue mart; DLT decides whether to refresh it as a stream or batch based on configuration.
Autoscale + orchestration — DLT handles cluster sizing, retries, lineage, and event logs without us writing a single Airflow operator.

Output (Gold view, first 3 rows).

order_date	region	order_count	gross_revenue
2026-05-28	US	42137	1289450.7500
2026-05-28	EU	18204	612900.3300
2026-05-28	APAC	9810	287113.9000

Rule of thumb: DLT collapses 200 lines of Airflow + Spark plumbing into ~60 lines of declarative Python. Senior engineers reach for DLT for any new lakehouse pipeline; legacy Spark-batch-on-Airflow remains for migrations.

`delta live tables` + `auto loader` + `unity catalog` — the four senior production patterns

Auto Loader, not listObjects. cloudFiles uses S3 notifications + a tracking store; it scales to billions of files. Never use spark.read.json(s3_path) in production — the listObjects scan blows up at scale.
DLT expectations, not post-hoc tests. Expectations are gates at write time. They publish to the DLT event log so SRE dashboards can chart violation counts per release.
One DLT pipeline per medallion stack. Bronze + Silver + Gold for a single domain (orders, clicks, payments) belong in one DLT pipeline; the framework computes the DAG and runs it.
Unity Catalog GRANTs are per-table. GRANT SELECT ON acme.gold.daily_revenue TO analysts doesn't leak into Bronze or Silver; the three-level namespace is the security boundary.

SQL
Topic — streaming
Streaming pattern drills

Practice →

Python
Language — python
Python practice library

Practice →

Solution Using a declarative DLT pipeline + Unity Catalog governance

Code.

-- The end-to-end pipeline encoded as a single registry table.
CREATE TABLE production_lakehouse_pipeline AS
SELECT * FROM (VALUES
    (1, 'source',     'kafka.orders',            'streaming',  'JSON value column'),
    (2, 'ingest',     'auto_loader OR kafka_ss', 'continuous', 'incremental + exactly-once'),
    (3, 'bronze',     'acme.bronze.raw_orders',  'append',     'schema-on-read + ingest_ts metadata'),
    (4, 'silver',     'acme.silver.orders_clean','merge',      'typed + deduped + expectations'),
    (5, 'gold',       'acme.gold.daily_revenue', 'overwrite',  'aggregated mart, partitioned by date'),
    (6, 'serve',      'sql_warehouse',           'serverless', 'Photon + auto-suspend'),
    (7, 'consume_bi', 'powerbi_dashboard',       'pull',       'queries Gold via JDBC'),
    (8, 'consume_ml', 'notebook_train.ipynb',    'pull',       'reads Silver for features'),
    (9, 'consume_rev','hightouch_to_salesforce', 'push',       'syncs Gold rows back into CRM'),
    (10,'govern',     'unity_catalog',           'always-on',  'three-level namespace + ACL + lineage')
) AS t(band_order, band_name, artefact, mode, primitive);

Step-by-step trace.

band_order	band_name	artefact	mode	primitive
1	source	kafka.orders	streaming	JSON value column
2	ingest	auto_loader OR kafka_ss	continuous	incremental + exactly-once
3	bronze	acme.bronze.raw_orders	append	schema-on-read + ingest_ts metadata
4	silver	acme.silver.orders_clean	merge	typed + deduped + expectations
5	gold	acme.gold.daily_revenue	overwrite	aggregated mart, partitioned by date
6	serve	sql_warehouse	serverless	Photon + auto-suspend
7	consume_bi	powerbi_dashboard	pull	queries Gold via JDBC
8	consume_ml	notebook_train.ipynb	pull	reads Silver for features
9	consume_rev	hightouch_to_salesforce	push	syncs Gold rows back into CRM
10	govern	unity_catalog	always-on	three-level namespace + ACL + lineage

Rows 1-2 — source + ingest are the streaming entry point; Auto Loader for files, Kafka SS for queues.
Rows 3-5 — the medallion spine; Bronze append, Silver MERGE, Gold overwrite is the canonical write-mode triple.
Rows 6-9 — serve + consume is where the lakehouse multi-engine promise pays off; BI, ML, and reverse ETL all read the same Delta tables.
Row 10 — Unity Catalog is the always-on thread; it doesn't sit between two bands, it spans all of them.
Note consume_ml reads Silver, not Gold — ML wants the granular, per-row table; BI wants the aggregated Gold.
Note consume_rev pushes Gold into operational systems; this is the closing of the analytics → operations loop that lakehouses enable cheaply.

Output.

band_order	band_name	artefact	mode
1	source	kafka.orders	streaming
3	bronze	acme.bronze.raw_orders	append
4	silver	acme.silver.orders_clean	merge
5	gold	acme.gold.daily_revenue	overwrite
6	serve	sql_warehouse	serverless
10	govern	unity_catalog	always-on

Why this works — concept by concept:

Five-band model — sources → ingest → transform → serve → consume is the whole pipeline; collapsing it into one table makes the architecture auditable.
Write-mode triple — append for Bronze, merge for Silver, overwrite for Gold is the senior shorthand for the medallion contract.
Multi-consumer — BI, ML, and reverse ETL all reading the same Delta tables is the lakehouse's headline benefit; calling out all three preempts "but where does ML fit?" follow-ups.
Serverless SQL Warehouse — the 2024+ default for serving; auto-suspend keeps cost near zero between queries.
Unity Catalog as thread — governance isn't a band, it's the warp the entire weave passes through; this is the senior framing.
Cost — O(1) to read the registry; the actual pipeline is O(N · M) for N rows across M bands, with per-band horizontal scaling.

Choosing the right layer (cheat sheet)

A one-screen cheat sheet for databricks lakehouse and medallion architecture — pick the layer and the primitive that match the failure mode you're worried about.

You want to …	Layer	Canonical primitive	When
Capture raw source bytes forever	Bronze	`read_files` / Auto Loader → append Delta	every ingest
Add `ingest_ts` + `source_file` metadata	Bronze	`_metadata.file_name` + `current_timestamp()`	every ingest
Cast strings to real types	Silver	`::DECIMAL(18,4)`, `::BIGINT`, `::TIMESTAMP`	every load
Dedupe at-least-once duplicates	Silver	`QUALIFY row_number() OVER (PARTITION BY k ORDER BY ingest_ts DESC) = 1`	every load
Apply business rules + drop bad rows	Silver	DLT `@dlt.expect_or_drop`	every load
Update / delete rows in-place	Silver	`MERGE INTO`	every CDC load
Aggregate to BI grain	Gold	`GROUP BY ...; sum / count / avg`	every load
Denormalise for fast dashboard reads	Gold	wide CTAS with joined dims	every load
Partition for prune-friendly queries	Silver / Gold	`PARTITIONED BY (event_date)`	DDL
Cluster by hot predicate columns	Silver / Gold	`OPTIMIZE ... ZORDER BY (a,b)` or `CLUSTER BY`	nightly
Rollback a bad release	Delta	`RESTORE TABLE t TO VERSION AS OF N`	incident
Inspect a table as of yesterday	Delta	`SELECT ... FROM t TIMESTAMP AS OF '2026-05-27'`	incident triage
Compact small files	Delta	`OPTIMIZE t`	nightly
Reclaim S3 from tombstones	Delta	`VACUUM t RETAIN 168 HOURS`	weekly
Grant table access to a group	UC	`GRANT SELECT ON ... TO group`	every onboarding
Track row-level lineage	UC	`system.access.table_lineage`	always-on
Stream ingest with exactly-once	Ingest	`readStream.format("cloudFiles")` / `kafka` + Delta sink	continuous
Replace Airflow plumbing	Pipeline	Delta Live Tables `@dlt.table`	new pipelines
Share Delta tables externally	Serve	Delta Sharing	partner data sales

Frequently asked questions

What is the `databricks lakehouse` in one sentence, and why does it matter for interviews?

A databricks lakehouse is cheap object storage + a transactional layer (Delta Lake) + many compute engines + one governance plane (Unity Catalog), designed so a single copy of data on S3 / ADLS / GCS can serve BI, ML, streaming, and SQL through the same Delta tables under one set of permissions and lineage. It matters for interviews because in 2026 it is the baseline architecture every data engineer is expected to reason about — the warehouse-plus-lake duplex has collapsed, and the questions panels now ask are "how would you build Bronze / Silver / Gold for this?" and "why MERGE here instead of overwrite?" rather than "warehouse or lake?". Memorise the four layers and the three medallion stages; almost every question maps to one of them.

How does `medallion architecture` differ from a classical Kimball star schema?

medallion architecture is a physical layering (Bronze raw → Silver cleansed → Gold business) that says what shape data should be in at each step of a pipeline. Kimball is a logical modelling discipline (facts + conformed dimensions) that says how to design the tables a BI tool consumes. The two are complementary: Silver typically holds normalised, dedup'd, dimensional-style tables you'd recognise from Kimball, and Gold then denormalises and aggregates those Silver tables into wide marts (which still respect Kimball conformed dims). A common pattern is Bronze = raw, Silver = Kimball-style normalised facts + dims, Gold = wide aggregated marts per consumer. The medallion is the pipeline contract; Kimball is the modelling philosophy.

What is the `_delta_log`, and how does it make Parquet files transactional?

The _delta_log is a sub-directory next to your Parquet data files that contains one JSON file per commit (and periodic Parquet checkpoints to keep read cost bounded). Each JSON file lists add (new file added), remove (file tombstoned), metaData (schema), and commitInfo (operation + metrics) actions. Because the appearance of a new JSON file is atomic on object storage, the entire commit is atomic; concurrent writers serialise via optimistic concurrency control — they detect that another commit landed first and retry. That single piece of metadata is what gives plain Parquet on S3 the ACID guarantees, time travel (you can read the table as of any past commit), and schema enforcement that warehouses charge for. Without the _delta_log, the same Parquet files are just a data lake.

`bronze silver gold` vs `raw / staging / mart` — are they the same thing?

They are very close and largely interchangeable in conversation, but with two nuances. Bronze is stricter than raw — it must be append-only and Delta-formatted, with metadata columns like ingest_ts; many raw zones in legacy stacks are overwriting CSV dumps that violate replay safety. Silver maps almost exactly to staging — typed, conformed, deduped — but the medallion explicitly expects expectations / DQ gates at the Silver write. Gold maps to mart — aggregated, denormalised, BI-ready — but the medallion encourages multiple Gold tables per domain (one per consumer or dashboard), whereas some mart layers try to enforce a single canonical mart per business unit. If you adopt medallion, you inherit the append-only Bronze and expectations on Silver discipline that vanilla raw / staging / mart doesn't enforce.

When should I use `OPTIMIZE ZORDER BY` versus partitioning versus Liquid Clustering?

Partition on low-cardinality columns that filter every query — event_date is the canonical example; partitions become folder prefixes that the scanner skips entirely. OPTIMIZE ... ZORDER BY (a, b) is for 2-4 high-cardinality columns that frequently appear in WHERE predicates (e.g. customer_id, order_ts); Z-ORDER co-locates rows with similar values into the same Parquet files, so file-skipping is cheap. Liquid Clustering (Delta 3.0+, generally available in 2024) is the modern replacement for both partition and ZORDER on a single dimension: one CLUSTER BY (a, b) DDL, auto-maintained, no daily OPTIMIZE job, and it adapts as data shape evolves. The interview-grade rule is: partition on date, ZORDER on hot predicates, migrate to Liquid Clustering when your runtime supports it. Never ZORDER on a column you don't filter on — it costs compute and gives no read benefit.

What is `delta live tables` and when should I use it over plain Spark + Airflow?

delta live tables (DLT) is Databricks' declarative pipeline framework: you write @dlt.table functions that return DataFrames, attach @dlt.expect_or_drop / @dlt.expect_or_fail data-quality decorators, and DLT computes the DAG, runs it, retries on failure, autoscales the cluster, and publishes lineage + an event log. Use DLT for any new lakehouse pipeline where the team owns the whole stack and wants to delete a lot of Airflow + Spark plumbing — typically saving 60-70% of the boilerplate. Keep plain Spark + Airflow when (a) the DAG spans non-Databricks systems (Snowflake, GCS, Salesforce), (b) you need exotic non-Delta sinks, or (c) you're mid-migration and the cost of rewriting outweighs the saving. The interview-grade answer is: DLT for greenfield lakehouse pipelines, Workflows for Databricks-only orchestration, Airflow for multi-system DAGs.

How does Unity Catalog change governance versus the old `hive_metastore`?

The legacy hive_metastore lives per workspace, uses a two-level namespace (database.table), and has coarse-grained ACLs (table-level GRANTs at best). Unity Catalog lives per account (so one catalog spans all workspaces), uses a three-level namespace (catalog.schema.table), and adds row filters, column masks, fine-grained ACLs, automated lineage, audit logs to system tables, and Delta Sharing for external consumers. The migration path is to create a Unity Catalog metastore at the account level, link workspaces to it, and either move tables (with UPGRADE) or leave the old hive_metastore for legacy reads while writing all new tables into UC. For interviews, the headline answer is: Unity Catalog is one catalog across the account, three-level namespace, fine-grained ACL, automatic lineage, audit — and it replaces the per-workspace hive_metastore.

Practice on PipeCode

PipeCode ships 450+ data-engineering interview problems — including SQL + Python + Spark drills keyed to the same databricks lakehouse and medallion architecture skill set this guide teaches (Bronze append-only ingest, Silver MERGE + expectations, Gold aggregation, delta lake mechanics, DLT pipelines, Auto Loader patterns, Unity Catalog governance). Whether you're prepping for a Databricks loop, a senior data-engineer round at any FAANG / fintech, or grinding the migration from a warehouse-plus-lake duplex to a lakehouse over the next quarter, the practice library mirrors the same five-band production pipeline — plus the delta live tables + unity catalog + photon tooling you'll wire into your real production lakehouse.

Kick off via the SQL practice library →; fan out into the ETL pipeline lane →; rehearse aggregation reconciliation patterns →; drill the Databricks company set →; sharpen joins drills →; reinforce data-validation problems →; widen coverage on the full Python practice library →.

DEV Community

Databricks Lakehouse + Medallion Architecture: Bronze, Silver, Gold with Delta

1. Why the lakehouse + medallion model is the modern DE interview baseline

`databricks lakehouse` — why the warehouse-plus-lake duplex collapsed

Worked example — map a single `orders` table onto the lakehouse + medallion model

`medallion architecture` — the four senior signals interviewers chase

Solution Using a 5-stage lakehouse coverage matrix

2. Lakehouse anatomy — storage + transactional layer + multi-engine compute + Unity Catalog

`lakehouse architecture` — four layers, one platform

Worked example — write the four-layer stack as a Spark notebook

`lakehouse vs data warehouse vs data lake` — the three senior tradeoffs

Solution Using a one-table comparison matrix

3. Medallion architecture — Bronze raw → Silver cleansed → Gold business marts

`medallion architecture` — three layers, two transforms, one contract

Worked example — write the three medallion tables for a `clickstream` source

`bronze silver gold` — the four senior gotchas

Solution Using a single Bronze → Silver → Gold DAG with explicit contracts

4. Delta Lake mechanics — ACID + time travel + OPTIMIZE + Z-ORDER

`delta lake` mechanics — Parquet + transaction log + four headline features

Worked example — implement `MERGE INTO` + time-travel rollback + `OPTIMIZE ZORDER`

`delta lake` — the four senior gotchas

Solution Using a single Delta mechanics cheat-table

5. End-to-end production lakehouse pipeline (sources → Bronze → Silver → Gold → BI/ML)

`databricks medallion` in production — sources, ingest, transform, serve

Worked example — assemble a production pipeline as a Delta Live Tables file

`delta live tables` + `auto loader` + `unity catalog` — the four senior production patterns

Solution Using a declarative DLT pipeline + Unity Catalog governance

Choosing the right layer (cheat sheet)

Frequently asked questions

What is the `databricks lakehouse` in one sentence, and why does it matter for interviews?

How does `medallion architecture` differ from a classical Kimball star schema?

What is the `_delta_log`, and how does it make Parquet files transactional?

`bronze silver gold` vs `raw / staging / mart` — are they the same thing?

When should I use `OPTIMIZE ZORDER BY` versus partitioning versus Liquid Clustering?

What is `delta live tables` and when should I use it over plain Spark + Airflow?

How does Unity Catalog change governance versus the old `hive_metastore`?

Practice on PipeCode

Top comments (0)

1. Why the lakehouse + medallion model is the modern DE interview baseline

databricks lakehouse — why the warehouse-plus-lake duplex collapsed

Worked example — map a single orders table onto the lakehouse + medallion model

medallion architecture — the four senior signals interviewers chase

Solution Using a 5-stage lakehouse coverage matrix

2. Lakehouse anatomy — storage + transactional layer + multi-engine compute + Unity Catalog

lakehouse architecture — four layers, one platform

Worked example — write the four-layer stack as a Spark notebook

lakehouse vs data warehouse vs data lake — the three senior tradeoffs

Solution Using a one-table comparison matrix

3. Medallion architecture — Bronze raw → Silver cleansed → Gold business marts

medallion architecture — three layers, two transforms, one contract

Worked example — write the three medallion tables for a clickstream source

bronze silver gold — the four senior gotchas

Solution Using a single Bronze → Silver → Gold DAG with explicit contracts

4. Delta Lake mechanics — ACID + time travel + OPTIMIZE + Z-ORDER

delta lake mechanics — Parquet + transaction log + four headline features

Worked example — implement MERGE INTO + time-travel rollback + OPTIMIZE ZORDER

delta lake — the four senior gotchas

Solution Using a single Delta mechanics cheat-table

5. End-to-end production lakehouse pipeline (sources → Bronze → Silver → Gold → BI/ML)

databricks medallion in production — sources, ingest, transform, serve

Worked example — assemble a production pipeline as a Delta Live Tables file

delta live tables + auto loader + unity catalog — the four senior production patterns

Solution Using a declarative DLT pipeline + Unity Catalog governance

Choosing the right layer (cheat sheet)

Frequently asked questions

What is the databricks lakehouse in one sentence, and why does it matter for interviews?

How does medallion architecture differ from a classical Kimball star schema?

What is the _delta_log, and how does it make Parquet files transactional?

bronze silver gold vs raw / staging / mart — are they the same thing?

When should I use OPTIMIZE ZORDER BY versus partitioning versus Liquid Clustering?

What is delta live tables and when should I use it over plain Spark + Airflow?

How does Unity Catalog change governance versus the old hive_metastore?

Practice on PipeCode

`databricks lakehouse` — why the warehouse-plus-lake duplex collapsed

Worked example — map a single `orders` table onto the lakehouse + medallion model

`medallion architecture` — the four senior signals interviewers chase

`lakehouse architecture` — four layers, one platform

`lakehouse vs data warehouse vs data lake` — the three senior tradeoffs

`medallion architecture` — three layers, two transforms, one contract

Worked example — write the three medallion tables for a `clickstream` source

`bronze silver gold` — the four senior gotchas

`delta lake` mechanics — Parquet + transaction log + four headline features

Worked example — implement `MERGE INTO` + time-travel rollback + `OPTIMIZE ZORDER`

`delta lake` — the four senior gotchas

`databricks medallion` in production — sources, ingest, transform, serve

`delta live tables` + `auto loader` + `unity catalog` — the four senior production patterns

What is the `databricks lakehouse` in one sentence, and why does it matter for interviews?

How does `medallion architecture` differ from a classical Kimball star schema?

What is the `_delta_log`, and how does it make Parquet files transactional?

`bronze silver gold` vs `raw / staging / mart` — are they the same thing?

When should I use `OPTIMIZE ZORDER BY` versus partitioning versus Liquid Clustering?

What is `delta live tables` and when should I use it over plain Spark + Airflow?

How does Unity Catalog change governance versus the old `hive_metastore`?