apache iceberg vs delta lake is the table-format question every modern data engineering team has to answer, and the third contender — apache hudi — quietly powers more streaming-upsert pipelines than the headlines suggest. All three are open table formats that turn raw Parquet on object storage into a real, ACID, time-traveling, schema-evolving warehouse — but they get there with three different metadata layouts, three different catalog stories, and three different opinions about how writers and readers should split the work. This deep-dive walks the same territory delta lake vs iceberg comparisons usually skim — iceberg snapshot trees, the delta transaction log, hudi copy on write and hudi merge on read — at the depth a senior interview round and a real architecture-review meeting actually demand.
This guide is the architectural companion to the spec-by-spec table that most blogs ship: where a short comparison post drops a five-column feature grid and calls it done, this one walks the five-layer anatomy of each format — Iceberg's catalog → snapshots → manifest list → manifests → data files, Delta's Parquet + _delta_log/ JSON + checkpoints, and Hudi's CoW vs MoR + compaction + timeline, then collapses the three stacks into a five-dimension decision matrix (engine reach, schema / partition evolution, streaming upserts, catalog story, best-fit use case) you can hand to an architecture review. Each section ends with a hands-on open table formats worked example — a question, a SQL or Python / PySpark snippet, a traced execution, a sample output, and a concept-by-concept why this works breakdown — the exact shape interview rounds, RFC docs, and senior lakehouse decisions reward.
When you want hands-on reps immediately after reading, browse SQL practice library →, drill ETL pipeline problems →, sharpen aggregation reconciliation patterns →, rehearse streaming drills →, reinforce database problems →, or widen coverage on the full Python practice library →.
On this page
- Why open table formats are the modern lakehouse foundation
- Apache Iceberg anatomy — catalog, snapshots, manifest list, manifests, data files
- Delta Lake anatomy — Parquet, transaction log, checkpoints, time travel
- Apache Hudi anatomy — Copy-on-Write vs Merge-on-Read, compaction, streaming upserts
- Decision matrix — Iceberg vs Delta vs Hudi by engine reach, catalog story, streaming needs
- Choosing the right table format (cheat sheet)
- Frequently asked questions
- Practice on PipeCode
1. Why open table formats are the modern lakehouse foundation
open table formats — the missing layer between Parquet and a real warehouse
The one-sentence invariant: a table format is the metadata layer that turns a bag of immutable Parquet files in object storage into a real, ACID, time-traveling, schema-evolving table — without giving up the engine pluggability and storage economics that made the data lake attractive in the first place. Before Iceberg, Delta, and Hudi, the data-lake model was just a folder of Parquet files, and every operation that a warehouse takes for granted — atomic appends, in-place updates, deletes, schema changes, partition evolution, snapshot reads, concurrent writers — either failed silently or required a heroic re-write at the application layer.
What a table format actually adds on top of Parquet.
- ACID transactions — writers commit atomically; readers never see half-written files; concurrent writers are serialised via optimistic concurrency.
- Snapshot isolation + time travel — every commit produces a new immutable snapshot; readers can pin a query to any historical snapshot for audit, debugging, or backfill.
- Schema evolution — add, drop, rename, or reorder columns without rewriting data files; the metadata layer maps logical columns to physical Parquet columns by ID.
- Partition evolution (Iceberg-specific) — change the partition scheme over time without re-partitioning historical data; old data keeps its old layout, new data uses the new one.
-
Hidden / declarative partitioning — engines compute partition values automatically from columns; users never write
WHERE partition_date = '2026-05-29'by hand. -
Upserts and deletes (
MERGE INTO) — row-level mutations on append-only object storage, implemented via copy-on-write rewrites or merge-on-read delta logs. - Statistics + data skipping — file-level min / max / null-count / row-count statistics let engines prune entire Parquet files before they're opened.
Why the three formats arrived at the same time (~2017–2019).
-
apache iceberg vs delta lakevsapache hudiare the three production answers to the same problem — the warehouse-on-object-storage problem — invented inside the three biggest companies running that workload. -
Iceberg was incubated inside Netflix (2017) because Hive's
_SUCCESS+ folder-listing model broke at petabyte scale; the design goal was a spec, not a single implementation. - Delta Lake was open-sourced by Databricks (2019) to commercialise a format they had been using internally since 2017; the design goal was Spark-native ACID on S3.
- Hudi (Hadoop Upserts Deletes and Incrementals) was built at Uber (2017) for streaming upserts into a warehouse that needed minute-level freshness on a billion-row trip-ledger; the design goal was incremental writes, not just batch reads.
The three-way landscape, one paragraph each.
- Iceberg — the most engine-neutral; the catalog story (REST · Glue · Nessie · Polaris) is the strongest of the three; partition evolution is a unique super-power; Snowflake, BigQuery, Athena, Trino, Spark, and Flink all read and write it natively.
-
Delta Lake — the simplest to reason about (one folder, one log, one truth); Spark-first by birth but increasingly engine-neutral via Delta UniForm + Delta Kernel; the default at Databricks and Synapse;
MERGE INTO,OPTIMIZE,Z-ORDER, andVACUUMare first-class commands. -
Hudi — the original streaming-first format; Copy-on-Write and Merge-on-Read tables let you choose the write-cost / read-freshness trade-off per pipeline; native upserts (
UPSERT,INSERT,BULK_INSERT) and a built-in compaction service make it the strongest fit for CDC sinks and minute-level streaming ingest.
The market signal — convergence, not consolidation.
-
All three formats now ship
MERGE INTO, schema evolution, time travel, and ACID writes — the headline features are at parity. - Iceberg has the broadest engine support — Snowflake, BigQuery, Databricks (read), Athena, Trino, Spark, Flink, ClickHouse, StarRocks all read it; this is the fastest-growing dimension.
- Delta is dominant inside the Databricks ecosystem — and Delta UniForm + Delta Kernel are closing the engine-reach gap year over year.
- Hudi is dominant for streaming-upsert workloads — Onehouse (the Hudi-backing company) is pushing a "universal" runtime that writes Hudi natively and exports to Iceberg / Delta metadata.
- The honest 2026 answer — pick by engine alignment and catalog story, not by the spec; all three will land your bytes safely on S3 / GCS / ADLS.
Worked example — map a single warehouse workload onto all three formats
Detailed explanation. Real architecture reviews start with a workload, not a format. Below is a canonical workload — daily-batch + hourly-incremental + CDC-streaming into a single fact_orders table — and how each of the three formats would land that workload, end to end.
Question. A retailer wants fact_orders to land 200M new rows/day (batch), 1M late-arriving updates/hour (incremental), and a 50k-event/minute CDC stream from the OLTP source. Which of the three table formats fits this workload, and how does the metadata model differ?
Input. Three writers (a Spark batch job, an hourly Spark job, a Flink CDC streaming job), one warehouse table fact_orders, and three readers (Trino BI dashboards, a Snowflake feature store, an Athena ad-hoc lane).
Code.
-- Iceberg — one CREATE TABLE, partition by hour, all three writers append + MERGE.
CREATE TABLE warehouse.fact_orders (
order_id BIGINT,
customer_id BIGINT,
region STRING,
amount DECIMAL(18, 4),
order_ts TIMESTAMP
) USING ICEBERG
PARTITIONED BY (hours(order_ts));
-- Delta — same shape, _delta_log/ tracks every commit.
CREATE TABLE warehouse.fact_orders (
order_id BIGINT,
customer_id BIGINT,
region STRING,
amount DECIMAL(18, 4),
order_ts TIMESTAMP
) USING DELTA
PARTITIONED BY (DATE(order_ts));
-- Hudi — MoR for the streaming writer, CoW would slow the CDC sink.
CREATE TABLE warehouse.fact_orders (
order_id BIGINT,
customer_id BIGINT,
region STRING,
amount DECIMAL(18, 4),
order_ts TIMESTAMP
) USING HUDI
TBLPROPERTIES (
type = 'mor',
primaryKey = 'order_id',
preCombineField = 'order_ts'
)
PARTITIONED BY (DATE(order_ts));
Step-by-step explanation.
-
Iceberg — one table, three writers commit via optimistic concurrency; partition by
hours(order_ts)is a hidden partition transform; engines auto-prune without users writingWHERE. -
Delta — same shape; partition by
DATE(order_ts)is a physical directory layout; the_delta_log/JSON log tracks every commit and the streaming writer usesMERGE INTOfor upserts. - Hudi MoR — the streaming writer appends delta logs next to the base Parquet (no Parquet rewrite per event); async compaction merges the logs every N minutes; readers see Parquet + log on the fly until compaction catches up.
- All three satisfy the workload — the differentiator is where the cost lands: Iceberg / Delta pay it at write (rewrite Parquet on MERGE); Hudi MoR pays it at read (or asynchronously, during compaction).
- The choice reduces to engine alignment — Trino-heavy → Iceberg; Databricks-heavy → Delta; streaming-CDC-heavy → Hudi MoR.
Output (a one-row workload-fit matrix).
| writer | iceberg | delta | hudi (mor) |
|---|---|---|---|
| batch (200M/day) | append (atomic snapshot) | append (atomic commit) | bulk_insert |
| incremental (1M/hr) | MERGE INTO | MERGE INTO | UPSERT |
| cdc stream (50k/min) | append + MERGE (v2) | structured-streaming MERGE | UPSERT (native, MoR-optimised) |
Rule of thumb: the workload picks the format. Streaming-heavy → Hudi MoR. Databricks-native → Delta. Multi-engine open lakehouse → Iceberg.
delta lake vs iceberg vs Hudi — the four senior architecture signals
Signal 1 — opinionated engine alignment. Senior data engineers do not say "any of the three works"; they say "we read 80% of this table from Snowflake and Athena, so Iceberg is the cheapest choice — Snowflake reads it natively, Athena has zero setup, and the REST catalog gives us one source of truth".
Signal 2 — catalog before format. Junior architects pick the file format; senior architects pick the catalog first (Glue, REST, Polaris, Unity, Nessie) and let that constrain the format. The catalog owns identity, versioning, and access control; the format is downstream.
Signal 3 — write-pattern awareness. Senior architects ask "how often will this table be updated, and at what row volume?" before they pick. Append-only batch → any of the three. Hourly upserts of < 5% of rows → Iceberg or Delta MERGE. Per-second upserts on > 10% of rows → Hudi MoR.
Signal 4 — incident reasoning, not spec recitation. When a snapshot expires, a manifest file is corrupted, or a checkpoint lags, junior engineers report "the table is broken". Senior engineers report "the table is on snapshot 12345 from 02:14 UTC, the corrupt manifest is m_0003.avro, the rollback to snapshot 12344 is a one-line CALL system.rollback_to_snapshot('db.t', 12344), and here's the new alert that pages on manifest-write failures".
SQL
Topic — etl
Lakehouse ETL drills
SQL
Topic — database
Database / warehouse practice
Solution Using a workload-to-format mapping table
Code.
-- One canonical workload-to-format matrix — every row maps a workload pattern to its best-fit format.
CREATE TABLE lakehouse_format_choice AS
SELECT * FROM (VALUES
(1, 'append-only batch (S3/GCS)', 'any', 'pick by engine reach', 'low'),
(2, 'multi-engine reads (Snowflake + Trino + Athena)', 'iceberg', 'broadest open ecosystem', 'low'),
(3, 'Databricks-native + Spark-first', 'delta', 'first-class MERGE/OPTIMIZE/Z-ORDER', 'low'),
(4, 'hourly upserts < 5% of rows', 'iceberg or delta', 'MERGE INTO cost is fine', 'medium'),
(5, 'streaming CDC > 50k events/min', 'hudi mor','append delta logs, async compaction', 'medium'),
(6, 'partition scheme must evolve over time', 'iceberg','partition evolution is unique', 'medium'),
(7, 'time-travel for audit + GDPR backfill', 'any', 'all three support time travel', 'low'),
(8, 'feature store + ML reads with low latency','iceberg or delta','data skipping + Z-ORDER', 'medium')
) AS t(workload_id, workload_pattern, best_fit, why, setup_cost);
Step-by-step trace.
| workload_id | workload_pattern | best_fit | why | setup_cost |
|---|---|---|---|---|
| 1 | append-only batch (S3/GCS) | any | pick by engine reach | low |
| 2 | multi-engine reads (Snowflake + Trino + Athena) | iceberg | broadest open ecosystem | low |
| 3 | Databricks-native + Spark-first | delta | first-class MERGE/OPTIMIZE/Z-ORDER | low |
| 4 | hourly upserts < 5% of rows | iceberg or delta | MERGE INTO cost is fine | medium |
| 5 | streaming CDC > 50k events/min | hudi mor | append delta logs, async compaction | medium |
| 6 | partition scheme must evolve over time | iceberg | partition evolution is unique | medium |
| 7 | time-travel for audit + GDPR backfill | any | all three support time travel | low |
| 8 | feature store + ML reads with low latency | iceberg or delta | data skipping + Z-ORDER | medium |
- Row 1 — append-only batch is the easiest case; pick the format that matches your reader engines, not the spec.
- Row 2 — multi-engine reads is the Iceberg killer feature in 2026; no other format has Snowflake + BigQuery + Athena + Trino + Spark + Flink native support.
- Row 3 — Databricks-native shops are Delta-native shops; the toolchain (
OPTIMIZE,Z-ORDER, Photon, Unity Catalog) is the moat. - Row 4 —
MERGE INTOis fine on Iceberg or Delta when the update fraction is low; both rewrite the affected files at commit time. - Row 5 — high-throughput streaming upserts is the Hudi MoR killer feature; appending delta logs is orders of magnitude cheaper than rewriting Parquet.
- Row 6 — partition evolution is unique to Iceberg; Delta and Hudi require a backfill if the partition scheme changes.
- Row 7 — all three support time travel; differences are at the syntax level (
VERSION AS OFfor Delta,AS OFfor Iceberg, instant time for Hudi). - Row 8 — feature stores benefit from data-skipping stats + clustering; Iceberg (sort + Z-ORDER-style ordering) and Delta (
Z-ORDER) both deliver.
Output.
| workload_id | workload_pattern | best_fit |
|---|---|---|
| 1 | append-only batch | any |
| 2 | multi-engine reads | iceberg |
| 3 | Databricks-native | delta |
| 4 | hourly upserts | iceberg or delta |
| 5 | streaming CDC | hudi mor |
| 6 | partition evolution | iceberg |
| 7 | time-travel audit | any |
| 8 | feature store / ML | iceberg or delta |
Why this works — concept by concept:
- Workload-to-format mapping — turns a vague "which format?" into a one-row lookup; senior architects pick by workload pattern, not by spec.
- Engine reach is the dominant axis — most teams are reader-heavy; the format that all your readers support natively wins, regardless of write-side features.
-
Catalog before format — the
setup_costcolumn folds catalog-onboarding into the decision; Iceberg's REST catalog is cheap, Hudi's Hive Metastore is medium, Delta's Unity Catalog is negligible inside Databricks but medium outside. - No-loser framing — the table never says "X is best"; it says "X is best **for this workload"; senior architects refuse one-size-fits-all answers.
-
Cost —
O(1)to read the matrix; the actual format adoption isO(table count)of migration work but happens once.
2. Apache Iceberg anatomy — catalog, snapshots, manifest list, manifests, data files
apache iceberg metadata — five layers, one open spec
apache iceberg is the most engine-neutral of the three open table formats, and its metadata model is a five-layer indirection that is purpose-built for that neutrality. Every read traces the path catalog → metadata.json → snapshot → manifest list → manifest → data files, and every layer is an open file format (JSON / Avro / Parquet) that any engine can parse without an Iceberg client at all. The result is a format where Snowflake, BigQuery, Athena, Trino, Spark, Flink, ClickHouse, and StarRocks all read the same physical bytes — and that engine reach is the single biggest reason apache iceberg vs delta lake debates often end in Iceberg's favour for multi-engine lakehouses.
The five layers of the iceberg snapshot tree.
-
Layer 1 —
catalog— owns the current pointer (e.g. "current metadata.json for db.t is at s3://.../metadata-v123.json"); Glue, Nessie, Polaris, REST, Hive Metastore, JDBC all implement this contract. -
Layer 2 —
metadata.json— the table-level manifest: schema, partition spec, sort order, snapshot history, current snapshot id, properties. -
Layer 3 —
snapshot— one immutable snapshot per commit; references a single manifest list file; carries summary stats (added-rows, deleted-rows, parent-snapshot-id, timestamp). -
Layer 4 —
manifest list— an Avro file listing every manifest file in the snapshot, with per-manifest summary stats (partition bounds, added-files, deleted-files); the engine prunes manifests at this layer before opening any of them. -
Layer 5 —
manifest files— Avro files; each lists a batch of data files (Parquet / ORC / Avro) with per-file stats (row count, file size, lower / upper bounds per column, null counts, NaN counts); the engine prunes data files at this layer before opening any of them.
Why five layers instead of two (Delta's flat log).
- Compactable metadata — manifests can be rewritten without rewriting the data files they reference; metadata size stays bounded as table size grows.
-
File-level statistics co-located with file paths — engines do
min/maxpruning at the manifest layer, then file-level pruning at the file layer; two pruning passes, both cheap. - Snapshot isolation as a first-class citizen — readers pin to a snapshot; writers append new snapshots; no shared mutable state.
- Catalog-pluggable identity — the catalog owns the current pointer; the rest of the metadata is in object storage; this is why Iceberg is the format with the most catalog options.
The Iceberg snapshot lifecycle, in one paragraph.
A writer appends new data files to object storage, then writes a new manifest, then writes a new manifest list that includes that manifest plus all the carry-forward manifests from the previous snapshot, then writes a new metadata.json that references the new snapshot, then atomically updates the catalog pointer to the new metadata.json. The atomic step is a single catalog operation — Glue UpdateTable, Nessie commit, REST PUT — which is why Iceberg works even on storages without a compare-and-swap primitive. Readers always start from the catalog, follow the pointer to the current metadata.json, pick the snapshot they want (current, time-travel, or specific snapshot id), and walk down the tree.
iceberg snapshot operations every senior engineer knows.
-- Read the table at a specific snapshot id
SELECT * FROM warehouse.fact_orders FOR SYSTEM_VERSION AS OF 6543210987;
-- Read the table at a specific timestamp
SELECT * FROM warehouse.fact_orders FOR SYSTEM_TIME AS OF TIMESTAMP '2026-05-28 02:00:00';
-- Roll back to a prior snapshot (Spark procedure)
CALL system.rollback_to_snapshot('warehouse.fact_orders', 6543210987);
-- Expire old snapshots to reclaim metadata (and eventually data, after orphan-file cleanup)
CALL system.expire_snapshots('warehouse.fact_orders', TIMESTAMP '2026-04-29 00:00:00');
-- Rewrite small files into bigger ones; rewrites *data files*, leaves metadata layout intact
CALL system.rewrite_data_files('warehouse.fact_orders');
-- Rewrite manifests for better pruning; rewrites *manifests*, leaves data files alone
CALL system.rewrite_manifests('warehouse.fact_orders');
- Snapshot id + timestamp — both forms of time travel; the snapshot id is cheaper for repeated reads, the timestamp is friendlier for ad-hoc audit.
-
rollback_to_snapshot— instant; just flips the catalog pointer back; no data is rewritten. -
expire_snapshots— bounded by retention policy; this is the maintenance cron job every Iceberg deployment runs. -
rewrite_data_files+rewrite_manifests— the two compaction primitives; the equivalent of Delta'sOPTIMIZE.
Schema evolution and partition evolution — the two Iceberg super-powers.
- Schema evolution — add / drop / rename / reorder columns by column id, not by name or position; old Parquet files keep their physical schema; reads map physical → logical via the id.
-
Partition evolution — change the partition spec (
days(ts) → hours(ts), or add a new partition column) without rewriting historical data; the metadata layer tracks which partition spec was in force when each file was written. -
Hidden partitioning — users write
WHERE order_ts > '2026-05-29'and Iceberg computes the partition predicate automatically; noWHERE partition_date = ...boilerplate. - No partition columns in the data files — the partition value is in metadata, not in Parquet; dropping a partition column is a metadata-only operation.
Worked example — walk the Iceberg metadata tree from catalog to data file
Detailed explanation. Real interviews ask you to draw the Iceberg tree from a catalog pointer down to a Parquet file. Below is the canonical walk, with a single new commit landing.
Question. A writer commits 1,200 new rows to warehouse.fact_orders (snapshot s2). Walk the path a reader takes from the catalog to one of the new data files, naming every artefact it touches.
Input. Catalog entry warehouse.fact_orders → metadata-v124.json, prior snapshot s1 (manifest list mlist_s1.avro, two manifests, ten data files), new snapshot s2 (manifest list mlist_s2.avro, three manifests, eleven data files).
Code.
# Pseudo-code for the reader path; real engines (Trino/Spark/Snowflake) implement this in their connectors.
def read_iceberg_table(catalog, namespace, table_name):
# 1. Catalog: resolve the current metadata.json
table_loc = catalog.load_table(namespace, table_name)
metadata_json_path = table_loc.current_metadata_location
# 2. metadata.json: pick the current snapshot
md = read_json(metadata_json_path)
snapshot = md["snapshots"][-1] # s2
# 3. snapshot: follow the manifest list pointer
manifest_list_path = snapshot["manifest-list"]
# 4. manifest list: list manifest files (with per-manifest stats for pruning)
manifests = read_avro(manifest_list_path)
pruned_manifests = [m for m in manifests if matches_query_predicate(m)]
# 5. manifests: list data files (with per-file stats for pruning)
data_files = []
for m in pruned_manifests:
for df in read_avro(m["manifest_path"]):
if matches_query_predicate(df):
data_files.append(df["file_path"])
# 6. data files: actually open the Parquet
return [read_parquet(p) for p in data_files]
Step-by-step explanation.
- The catalog resolves
warehouse.fact_orderstometadata-v124.json— a singleGETagainst Glue / REST / Nessie. -
metadata-v124.jsonlists every snapshot; the reader pickss2(the current one). -
s2referencesmlist_s2.avro; the reader reads that file once. -
mlist_s2.avrolists three manifest files with per-manifest partition bounds; pruning drops any whose bounds don't overlap the query. - Surviving manifests are read; each lists data files with per-file column min / max; pruning drops files whose column bounds don't overlap the predicate.
- Only the surviving Parquet data files are actually opened — typically a tiny fraction of the table.
Output (artefacts opened to satisfy the query).
| step | artefact | bytes | engine cost |
|---|---|---|---|
| 1 | catalog entry | < 1 KB | 1 catalog RPC |
| 2 | metadata-v124.json | 50 KB | 1 object read |
| 3 | mlist_s2.avro | 2 KB | 1 object read |
| 4 | manifest_m_03.avro | 5 KB | 1 object read (after pruning 2/3) |
| 5 | part-00007.parquet (only) | 12 MB | 1 file scan |
| 6 | query result | — | rows returned |
Rule of thumb: Iceberg's two-stage pruning (manifest then data file) is what makes it the format of choice for huge tables with selective queries; the manifest layer kills the table-scan cost.
apache iceberg catalogs — REST, Glue, Nessie, Polaris
- REST catalog — the spec; vendor-neutral; the path most platforms (Tabular, Databricks, Snowflake Open Catalog, Polaris) implement.
- AWS Glue — the default if you're on AWS; integrates with Athena, EMR, Redshift Spectrum out of the box.
- Nessie — git-style branching for data; experiment on a branch, merge to main; the strongest catalog for ML / experimentation workloads.
- Polaris — Snowflake's open-source REST catalog; designed for multi-engine sharing; emerging as the cross-vendor default.
- Hive Metastore — legacy; works but lacks the multi-table atomic commits that REST and Nessie support.
SQL
Topic — database
Database / catalog drills
SQL
Topic — etl
Iceberg-style ETL practice
Solution Using iceberg_snapshots + a time-travel audit query
Code.
-- Iceberg ships system tables that surface every snapshot; the canonical audit-trail query.
SELECT
snapshot_id,
committed_at,
operation,
summary['added-records'] AS added,
summary['deleted-records'] AS deleted,
summary['total-records'] AS total_after,
parent_id
FROM warehouse.fact_orders.snapshots
ORDER BY committed_at DESC
LIMIT 10;
Step-by-step trace.
| snapshot_id | committed_at | operation | added | deleted | total_after |
|---|---|---|---|---|---|
| 6543210989 | 2026-05-29 02:14:11 | append | 1200 | 0 | 1234567 |
| 6543210988 | 2026-05-29 01:14:08 | append | 1180 | 0 | 1233367 |
| 6543210987 | 2026-05-29 00:14:05 | overwrite | 0 | 200 | 1232187 |
| 6543210986 | 2026-05-28 23:14:02 | append | 1240 | 0 | 1232387 |
- The
snapshotssystem table is always available on every Iceberg table — no extra setup, no maintenance. - Each row is one commit;
operationisappend,overwrite,delete, orreplace. -
summary['added-records']andsummary['deleted-records']give you the row delta without scanning the data. -
parent_idis the prior snapshot; you can reconstruct the full snapshot graph from this column. - The audit-trail query is what you paste into the incident channel when reconciliation fails: "snapshot 6543210989 added 1200 rows at 02:14 UTC; the missing 165 rows are in 6543210987 (overwrite)".
Output.
| snapshot_id | committed_at | operation | added | deleted |
|---|---|---|---|---|
| 6543210989 | 2026-05-29 02:14:11 | append | 1200 | 0 |
| 6543210988 | 2026-05-29 01:14:08 | append | 1180 | 0 |
| 6543210987 | 2026-05-29 00:14:05 | overwrite | 0 | 200 |
| 6543210986 | 2026-05-28 23:14:02 | append | 1240 | 0 |
Why this works — concept by concept:
-
System tables — Iceberg exposes
.snapshots,.history,.files,.manifests,.partitionsas queryable tables; you debug a table with SQL, not by spelunking S3. -
Per-commit row deltas —
added-records/deleted-recordsare in the snapshot summary; no need to diff two snapshots to compute them. - Parent-id graph — the snapshot history is a DAG; you can roll back to any node without rewriting any data file.
-
Operation column —
appendvsoverwritevsdeletemakes incident triage trivial; you see the intent of the commit, not just the row count. -
Cost —
O(snapshot count)to read the audit trail; typically < 100 snapshots after the dailyexpire_snapshotsjob runs.
3. Delta Lake anatomy — Parquet, transaction log, checkpoints, time travel
Delta Lake — Parquet + _delta_log/ = ACID on object storage
Delta Lake is the simplest of the three open table formats to reason about: a Delta table is one folder containing a stack of Parquet data files plus a _delta_log/ sub-folder containing a numbered sequence of JSON files (one per commit) and an occasional Parquet checkpoint file. There is no catalog-pointer indirection, no manifest list, no manifest file — the delta transaction log is the single, append-only source of truth, and a reader reconstructs the current table state by replaying the log from the latest checkpoint forward.
The Delta Lake folder layout, in one mental image.
my_table/
├── part-00000-...-c000.snappy.parquet ← data files (immutable)
├── part-00001-...-c000.snappy.parquet
├── part-00002-...-c000.snappy.parquet
└── _delta_log/
├── 00000000000000000000.json ← commit v0 (CREATE TABLE, add 3 files)
├── 00000000000000000001.json ← commit v1 (add 2 files, remove 1)
├── 00000000000000000002.json ← commit v2 (MERGE INTO; add 1 file, remove 2)
├── ...
└── 00000000000000000010.checkpoint.parquet ← cumulative state at v10
- Data files — standard Snappy Parquet; immutable; never modified in place.
-
_delta_log/NNNNN.json— one JSON file per commit; contains actions (add,remove,metaData,protocol,commitInfo,txn). -
_delta_log/NNNNN.checkpoint.parquet— cumulative snapshot of the table state at versionN, written every 10 commits by default; readers replay the log only from the latest checkpoint forward, never from version 0. - Atomic commit — the JSON file is written with a file-name-as-version-number convention; only one writer can successfully create the next-numbered JSON (object-storage atomic-create semantics), giving Delta its single-writer-at-a-time concurrency control.
What each commit JSON contains, action by action.
{"commitInfo": {
"timestamp": 1716950000000,
"operation": "MERGE",
"operationParameters": {"predicate": "[\"(target.order_id = source.order_id)\"]"},
"readVersion": 1,
"isolationLevel": "WriteSerializable",
"isBlindAppend": false,
"operationMetrics": {"numTargetRowsInserted": "100", "numTargetRowsUpdated": "50"}
}}
{"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}}
{"metaData": {"id": "...", "schemaString": "...", "partitionColumns": ["order_date"]}}
{"add": {"path": "part-00003-...-c000.snappy.parquet", "size": 12345678, "stats": "{\"numRecords\":150,\"minValues\":{...},\"maxValues\":{...},\"nullCount\":{...}}"}}
{"remove": {"path": "part-00001-...-c000.snappy.parquet", "deletionTimestamp": 1716950000000, "dataChange": true}}
-
commitInfo— audit trail; who, when, why, with what operation metrics. -
protocol— minimum reader / writer versions; clients refuse to read tables that require a newer protocol. -
metaData— schema, partition columns, properties; written onCREATE TABLEandALTER TABLE. -
add— file path + size + per-column statistics (min, max, null count); statistics enable file-skipping at query time. -
remove— file path + tombstone timestamp; the file is not deleted from object storage untilVACUUMruns past the retention horizon.
Checkpoints — why Delta tables stay fast at version 100,000.
-
A reader replays the log to reconstruct table state — without checkpoints, the cost is
O(N)in the version count. -
Every 10 commits (default), Delta writes a
NNNNN.checkpoint.parquet— a snapshot of the cumulativeadd/removeset at that version. - Readers replay only from the latest checkpoint forward — usually 1–10 JSON files, never thousands.
-
_last_checkpoint— a tiny pointer file that tells readers which checkpoint is the latest; avoids listing_delta_log/.
delta transaction log operations every senior engineer knows.
-- Time travel by version
SELECT * FROM warehouse.fact_orders VERSION AS OF 42;
-- Time travel by timestamp
SELECT * FROM warehouse.fact_orders TIMESTAMP AS OF '2026-05-28 02:00:00';
-- MERGE INTO — upsert, the Delta workhorse
MERGE INTO warehouse.fact_orders AS t
USING warehouse.fact_orders_incoming AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
-- OPTIMIZE — compact small files; rewrite into ~1 GB bins
OPTIMIZE warehouse.fact_orders;
-- Z-ORDER — co-locate data on high-cardinality columns for skipping
OPTIMIZE warehouse.fact_orders ZORDER BY (customer_id);
-- VACUUM — physically delete tombstoned files past the retention horizon
VACUUM warehouse.fact_orders RETAIN 168 HOURS;
-- DESCRIBE HISTORY — the Delta audit trail
DESCRIBE HISTORY warehouse.fact_orders;
-
VERSION AS OF+TIMESTAMP AS OF— time-travel reads; the timestamp form is friendlier for ad-hoc audit. -
MERGE INTO— the first-class Delta upsert; rewrites the affected Parquet files (copy-on-write). -
OPTIMIZE— compaction; small-file consolidation; the maintenance command every Delta deployment runs. -
Z-ORDER— multi-column locality sort; readers prune more aggressively on the Z-ordered columns. -
VACUUM— deletes tombstoned files; default 7-day retention preserves time travel.
Schema evolution — what Delta supports today.
-
Add column —
ALTER TABLE … ADD COLUMNS (new_col STRING); metadata-only. -
Rename / drop column — supported via
delta.columnMapping.mode = 'name'(newer Delta protocols); older tables require a rewrite. -
Type widening —
INT → BIGINT,FLOAT → DOUBLE; supported via metadata after protocol upgrade. -
No partition evolution — changing the partition column requires a
CREATE TABLE AS SELECTrewrite; this is the gap vs Iceberg.
Worked example — read commit JSON, reconstruct table state, time-travel
Detailed explanation. Real interviews ask you to read a single commit JSON and reconstruct what the table looked like at that version. Below is the canonical walk.
Question. Given a Delta table at version 2 with the three commit JSONs above (v0: create + add 3 files, v1: add 2 + remove 1, v2: MERGE: add 1 + remove 2), reconstruct the current file set and write the time-travel query that returns the v1 state.
Input. Three _delta_log/NNNNN.json files; no checkpoints yet (table is too young).
Code.
import json, glob
# Replay the transaction log from version 0 forward.
active_files: set[str] = set()
for path in sorted(glob.glob("my_table/_delta_log/*.json")):
for line in open(path):
action = json.loads(line)
if "add" in action:
active_files.add(action["add"]["path"])
elif "remove" in action:
active_files.discard(action["remove"]["path"])
print("v2 active files:", sorted(active_files))
-- Time-travel to v1 reads only the first two commits, ignoring v2's add/remove.
SELECT * FROM warehouse.fact_orders VERSION AS OF 1;
Step-by-step explanation.
-
v0adds three files:part-00000,part-00001,part-00002. Active set after v0 = {0, 1, 2}. -
v1adds two files (part-00003,part-00004) and removespart-00001. Active set after v1 = {0, 2, 3, 4}. -
v2is aMERGEthat addspart-00005and removespart-00000andpart-00002. Active set after v2 = {3, 4, 5}. - The time-travel query
VERSION AS OF 1stops replay afterv1— the reader sees the active set as of v1: {0, 2, 3, 4}. - Time travel is free (no rewrite); it's a pure replay of the log up to the requested version.
Output (v2 active file set + v1 time-travel target).
| version | active files | row count (illustrative) |
|---|---|---|
| 0 | part-00000, part-00001, part-00002 | 300 |
| 1 | part-00000, part-00002, part-00003, part-00004 | 400 |
| 2 | part-00003, part-00004, part-00005 | 350 |
time-travel VERSION AS OF 1
|
part-00000, part-00002, part-00003, part-00004 | 400 |
Rule of thumb: the Delta _delta_log/ is a replay-to-reconstruct model; checkpoints exist solely to bound replay cost; time travel is a substring of the replay.
delta lake vs iceberg — three architectural deltas
- Catalog story. Delta uses a folder + log model; the "catalog" is just the filesystem path. Unity Catalog (Databricks) adds identity, access control, and lineage on top. Iceberg uses a catalog-first model with pluggable backends (REST, Glue, Nessie, Polaris).
- Engine reach. Delta is Spark-native; Databricks SQL + Trino + Synapse + Athena (via Delta Lake UniForm) read it. Iceberg is read natively by Snowflake, BigQuery, Athena, Trino, Spark, Flink, ClickHouse, StarRocks.
- Concurrency model. Delta uses single-writer-at-a-time with object-storage atomic-create; high-throughput writers serialise. Iceberg uses optimistic concurrency — multiple writers can succeed simultaneously if their file sets don't overlap.
Delta UniForm + Delta Kernel — closing the engine-reach gap
- Delta Lake UniForm — writes Iceberg metadata alongside Delta metadata; Delta-only writers + Iceberg-only readers can share one table.
- Delta Kernel — a Java/Rust library that lets any engine read Delta without Spark; the foundation of Trino / Presto / Synapse Delta support.
- The signal — Databricks is hedging; UniForm + Kernel + open-sourcing more of Delta is the response to Iceberg's engine-reach lead.
SQL
Topic — etl
Delta-style ETL drills
SQL
Topic — aggregation
Aggregation / OPTIMIZE practice
Solution Using DESCRIBE HISTORY + a single MERGE round-trip
Code.
-- The canonical Delta upsert + audit pattern.
MERGE INTO warehouse.fact_orders AS t
USING warehouse.fact_orders_incoming AS s
ON t.order_id = s.order_id
WHEN MATCHED AND s.order_ts > t.order_ts THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
-- Confirm what just happened
DESCRIBE HISTORY warehouse.fact_orders LIMIT 5;
Step-by-step trace.
| version | timestamp | operation | numTargetRowsInserted | numTargetRowsUpdated | numTargetRowsDeleted |
|---|---|---|---|---|---|
| 43 | 2026-05-29 02:14:11 | MERGE | 100 | 50 | 0 |
| 42 | 2026-05-29 01:14:08 | MERGE | 95 | 48 | 0 |
| 41 | 2026-05-29 00:14:05 | DELETE | 0 | 0 | 12 |
| 40 | 2026-05-28 23:14:02 | OPTIMIZE | 0 | 0 | 0 |
| 39 | 2026-05-28 22:14:01 | WRITE | 1200 | 0 | 0 |
- The
MERGEmatches onorder_id; rows that exist are updated only if the incomingorder_tsis newer (late-arriving-data guard). - Rows that don't exist are inserted.
- Delta computes the affected file set, rewrites those Parquet files with the merged rows, and appends a new commit JSON.
-
DESCRIBE HISTORYreturns one row per commit with operation metrics; this is your audit trail without any extra setup. - The
OPTIMIZEat version 40 is the routine compaction; it rewrites Parquet bins but adds zero logical rows.
Output.
| version | operation | inserted | updated |
|---|---|---|---|
| 43 | MERGE | 100 | 50 |
| 42 | MERGE | 95 | 48 |
| 41 | DELETE | 0 | 0 |
| 40 | OPTIMIZE | 0 | 0 |
| 39 | WRITE | 1200 | 0 |
Why this works — concept by concept:
- MERGE INTO is the workhorse — one statement covers insert + update + late-arrival guard; this is the strongest reason teams stay on Delta once they're on Databricks.
-
Late-arrival guard via
s.order_ts > t.order_ts— protects against out-of-order CDC events; without it, an old event would overwrite a newer one. -
Copy-on-write commit — Delta rewrites the affected Parquet files entirely; the
_delta_log/JSON records the add / remove deltas atomically. - DESCRIBE HISTORY is free audit — no separate logging service; the operation metrics live next to the data; one query for incident triage.
-
Cost —
O(N_affected_files)to rewrite, whereN_affected_filesis the number of Parquet bins the merge touches; this is whyOPTIMIZE(bigger bins) helps merge cost.
4. Apache Hudi anatomy — Copy-on-Write vs Merge-on-Read, compaction, streaming upserts
apache hudi — two table types, one streaming-first opinion
apache hudi is the streaming-first of the three open table formats: it was built at Uber to handle minute-level upserts into a billion-row trip-ledger, and its dual table-type model (hudi copy on write and hudi merge on read) is the single biggest architectural difference between Hudi and the other two. Where Iceberg and Delta both default to rewrite the affected Parquet files on every update (a copy-on-write model), Hudi MoR gives writers an alternative — append a tiny Avro delta log next to the base Parquet, and let an async compaction service merge them later — that makes high-throughput streaming upserts an order of magnitude cheaper.
Hudi's two table types, in one mental image.
-
hudi copy on write(CoW) — every update rewrites the affected Parquet file in full; readers see only Parquet (fast); writers pay the rewrite cost (slow on high-frequency updates). -
hudi merge on read(MoR) — every update appends a tiny Avro delta log next to the base Parquet; readers merge Parquet + log on the fly (slower); writers append cheaply (fast); a background compaction service merges logs back into Parquet on a schedule. - The choice is per table, not per cluster — a Hudi deployment can have CoW tables (read-heavy dashboards) and MoR tables (CDC sinks) side by side.
- The compaction dial — async compaction frequency is the operator's knob for trading read latency vs storage / write latency.
Hudi's metadata layout — the .hoodie/ folder.
my_table/
├── order_date=2026-05-29/
│ ├── 8c4a9b00-...-0_1-20-30_20260529021411.parquet ← base file (CoW + MoR)
│ ├── 8c4a9b00-...-0_1-20-30_20260529021411.log.1_0-21-31 ← MoR delta log
│ └── 8c4a9b00-...-0_1-20-30_20260529021411.log.2_0-22-32 ← MoR delta log
└── .hoodie/
├── 20260529021411.commit ← CoW commit (txn metadata)
├── 20260529021411.deltacommit ← MoR delta commit
├── 20260529021411.compaction.requested ← async compaction request
├── 20260529021411.compaction.inflight ← in-progress
├── 20260529021411.compaction.commit ← completed compaction
├── 20260529021411.clean.requested ← cleaner request
└── hoodie.properties ← table-level config
-
.commit— emitted on CoW writes; full Parquet snapshot of the file. -
.deltacommit— emitted on MoR writes; the new delta log file(s). -
.compaction.{requested,inflight,commit}— three-phase async compaction state machine. -
.clean.requested— cleaner removes old versions of files past the retention window. -
The Hudi timeline — every action lands as a file under
.hoodie/; the timeline is the source of truth.
The four canonical Hudi write operations.
-
UPSERT— the default; index-aware upsert; looks up the record key against the Hudi index (Bloom / HBase / Bucket); inserts new records, updates existing ones; the most expensive write but the most common in streaming pipelines. -
INSERT— append-only; skips the index lookup; cheaper thanUPSERTbut allows duplicate keys. -
BULK_INSERT— bypasses the index entirely; used for initial loads; the cheapest write but never use for ongoing pipelines. -
DELETE— soft or hard delete by record key; the hard delete removes the row, the soft delete writes a tombstone.
Hudi indexes — why upserts are fast.
-
Bloom-filter index (default) — Hudi maintains a Bloom filter per Parquet file; on
UPSERT, the writer probes the filter to identify the affected files; only those files are rewritten (CoW) or get a new delta log (MoR). - HBase index — external HBase keeps the record-key → file mapping; faster lookups at the cost of an HBase cluster.
- Bucket index — hash-based; record keys deterministically map to buckets; no lookup needed; the fastest at the cost of bucket count being fixed.
- The interview signal — when asked "why is Hudi fast for upserts?", say "the index avoids a full table scan; the writer rewrites or appends only to the affected files".
hudi copy on write vs hudi merge on read — when to pick which.
| dimension | CoW | MoR |
|---|---|---|
| write cost (per update) | high (rewrites base file) | low (appends delta log) |
| read cost (per query) | low (Parquet only) | medium (Parquet + log merge) |
| freshness | high (committed on write) | high (committed on write) |
| storage overhead | low | medium (logs + base) |
| compaction | not needed | required (async) |
| best for | read-heavy + low-frequency writes | write-heavy streaming + CDC |
| analytics latency | sub-second | seconds (real-time view) |
hudi merge on read query views.
-
snapshotview (default) — reader merges Parquet + uncompacted delta logs on the fly; sees the latest committed state. -
read_optimizedview — reader sees only Parquet (skips delta logs); fastest but data is stale by up to the compaction interval. -
incrementalview — reader pulls only rows that changed since a given instant (begin_instant,end_instant); the Hudi-native CDC export pattern. - Pick
snapshotfor freshness,read_optimizedfor speed,incrementalfor downstream CDC.
The compaction service — the dial that rebalances MoR.
-
inlinecompaction — runs synchronously after every Nth deltacommit; predictable but blocks the writer. -
asynccompaction — runs in a separate Spark / Flink job; doesn't block the writer; the production default. - Compaction frequency — the operator's knob; every 10 deltacommits is a common starting point.
-
Compaction cost —
O(N_delta_logs)per file group; cheaper than rewriting every base file on every update, more expensive than no compaction at all.
Worked example — write a MoR table with PySpark + a streaming UPSERT loop
Detailed explanation. Real Hudi pipelines are wired with the hudi-spark-bundle and a streaming writeStream block. Below is the canonical PySpark loop that lands CDC events into a MoR table.
Question. Write a PySpark structured-streaming job that ingests a Kafka topic of order events into a Hudi MoR fact_orders table with record key order_id, pre-combine field order_ts, async compaction every 10 deltacommits, and asserts that the snapshot view reflects the latest committed instant.
Input. Kafka topic orders.cdc (JSON events {order_id, customer_id, amount, order_ts, op}); a Hudi MoR table at s3://warehouse/fact_orders.
Code.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, LongType, TimestampType, DecimalType
spark = SparkSession.builder.appName("hudi-cdc-sink").getOrCreate()
schema = (
StructType()
.add("order_id", LongType())
.add("customer_id", LongType())
.add("amount", DecimalType(18, 4))
.add("order_ts", TimestampType())
.add("op", StringType())
)
events = (
spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "kafka:9092")
.option("subscribe", "orders.cdc")
.load()
.select(from_json(col("value").cast("string"), schema).alias("e"))
.select("e.*")
)
hudi_opts = {
"hoodie.table.name": "fact_orders",
"hoodie.datasource.write.table.type": "MERGE_ON_READ",
"hoodie.datasource.write.recordkey.field": "order_id",
"hoodie.datasource.write.precombine.field": "order_ts",
"hoodie.datasource.write.operation": "upsert",
"hoodie.compact.inline": "false",
"hoodie.compact.inline.max.delta.commits": "10",
"hoodie.cleaner.commits.retained": "20",
}
query = (
events.writeStream
.format("hudi")
.options(**hudi_opts)
.option("checkpointLocation", "s3://warehouse/_chk/fact_orders")
.outputMode("append")
.start("s3://warehouse/fact_orders")
)
query.awaitTermination()
Step-by-step explanation.
- The stream reads JSON events off Kafka and parses them against the order schema.
-
MERGE_ON_READselects the MoR table type; writes will append delta logs, not rewrite base Parquet. -
recordkey.field=order_idtells Hudi which column identifies a row for upsert purposes; the Bloom index uses this. -
precombine.field=order_tsresolves duplicate keys within the same batch — the row with the largestorder_tswins (the late-arrival guard). -
hoodie.compact.inline=false+hoodie.compact.inline.max.delta.commits=10runs async compaction every 10 deltacommits — the production default. -
checkpointLocationis the Spark streaming checkpoint; on restart, the job resumes from the last committed Kafka offset.
Output (a single deltacommit produced by one Spark micro-batch).
| artefact | type | bytes |
|---|---|---|
.hoodie/20260529021411.deltacommit |
metadata | 4 KB |
order_date=2026-05-29/8c4a9b00-...0.log.1_0-21-31 |
delta log | 1.2 MB |
.hoodie/20260529021411.deltacommit.inflight |
(deleted on commit) | — |
s3://warehouse/_chk/fact_orders/offsets/... |
spark checkpoint | < 1 KB |
Rule of thumb: if you're writing > 10k upserts/sec to a Hudi table, pick MoR. If you're writing < 1k upserts/sec, CoW is simpler and reads are faster.
apache hudi — incremental queries are the secret super-power
-
Incremental query —
SELECT * FROM fact_orders WHERE _hoodie_commit_time > '20260529020000'returns only rows changed since that instant; this is the Hudi-native CDC export. - Used by downstream consumers — feature stores, ML training pipelines, search-index sinks all consume incrementals instead of full snapshots.
-
Cheaper than
MERGE INTOon the consumer side — the consumer reads only the deltas, never the full table. - The senior-interview answer to "how does Hudi differ from Delta?" — incremental queries are first-class; Delta and Iceberg both ship CDC features but Hudi was designed around them from day one.
SQL
Topic — streaming
Streaming + CDC drills
Python
Topic — etl
Hudi-style upsert pipeline practice
Solution Using an MoR table + async compaction + an incremental query
Code.
# Two follow-up actions every Hudi MoR pipeline needs:
# (1) periodic async compaction job
# (2) a downstream incremental-query reader.
# (1) async compaction job (runs in its own Spark job, separate from the streaming writer).
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("hudi-async-compact").getOrCreate()
spark.read.format("hudi").load("s3://warehouse/fact_orders") # bootstrap metadata
spark.sql("""
CALL run_compaction(
op => 'run',
table => 'fact_orders'
)
""")
# (2) downstream incremental query (feature-store / ML / sink consumer).
last_instant = "20260529020000"
incr_df = (
spark.read.format("hudi")
.option("hoodie.datasource.query.type", "incremental")
.option("hoodie.datasource.read.begin.instanttime", last_instant)
.load("s3://warehouse/fact_orders")
)
print(f"rows since {last_instant}: {incr_df.count()}")
Step-by-step trace.
| step | action | observed |
|---|---|---|
| 1 | streaming writer commits 10 deltacommits | 10 .deltacommit files, 10 log files per file group |
| 2 | async compaction runs | new base Parquet emitted, .compaction.commit written |
| 3 | downstream reader runs incremental query at begin_instant=20260529020000
|
1,200 changed rows returned |
| 4 | next incremental tick at begin_instant=20260529021411
|
new 1,180 changed rows returned |
- The streaming writer accumulates 10 deltacommits in
~/.hoodie/; each is a few KB of metadata plus an Avro log file in the partition directory. - The async compaction job detects the threshold and rewrites the base Parquet for the affected file group, then writes a
.compaction.commit. - The downstream reader uses
query.type=incrementalwithbegin.instanttime=20260529020000to pull only the new rows since that point. - The reader bumps
begin_instanton every tick; this is the Hudi-native CDC export pattern. - No extra Kafka, no extra Debezium — the table itself is the CDC source.
Output.
| consumer tick | begin_instant | rows returned |
|---|---|---|
| 1 | 20260529020000 | 1200 |
| 2 | 20260529021411 | 1180 |
| 3 | 20260529022500 | 1340 |
| 4 | 20260529023700 | 1100 |
Why this works — concept by concept:
- MoR + async compaction — the writer never pays Parquet-rewrite cost on each event; the compaction service amortises the cost over many commits.
- Incremental query — turns the table itself into a CDC source; downstream consumers don't need a separate event stream.
- begin_instanttime — the canonical CDC checkpoint; consumers persist this instant and resume from it on restart.
-
Operator-tunable compaction —
inline.max.delta.commitsis the dial; tighter = lower read latency + higher write cost; looser = higher read latency + lower write cost. -
Cost — write is
O(batch_size)(append-only), compaction isO(affected_file_groups)(rewrite), incremental query isO(rows_changed)(not table size).
5. Decision matrix — Iceberg vs Delta vs Hudi by engine reach, catalog story, streaming needs
apache iceberg vs delta lake vs Hudi — the five-dimension verdict
Every senior open table formats decision collapses to five dimensions — engine reach, schema / partition evolution, streaming upserts, catalog story, and best-fit use case. All three formats are converging on parity at the spec level; the deciding factor in 2026 is which dimensions matter most for your stack. This section walks each dimension at depth and ends with a Python decision script you can paste into an RFC.
Dimension 1 — engine reach.
- Iceberg — broadest. Snowflake (read + write via Polaris), BigQuery (read + write via BigLake), Databricks (read via UniForm), Athena (native), Trino / Presto (native), Spark (native), Flink (native), ClickHouse, StarRocks, DuckDB (experimental).
- Delta — Spark-first, expanding. Databricks SQL (native), Spark (native), Trino / Presto (via Delta Kernel), Synapse (limited), Athena (via Delta UniForm), BigQuery (via BigLake).
- Hudi — Spark + Flink + Presto. Spark (native write + read), Flink (native write + read), Presto (read), Trino (read), Hive (read); Snowflake / BigQuery / Athena support is limited.
- 2026 verdict — if you read from > 2 engines, Iceberg wins by a wide margin; if you live inside Databricks, Delta wins; if your writers are Flink CDC streamers, Hudi wins.
Dimension 2 — schema / partition evolution.
- Iceberg — best-in-class. Schema evolution (add / drop / rename / reorder) is metadata-only via column id; partition evolution (change the partition spec without rewriting data) is unique to Iceberg.
-
Delta — strong schema, no partition evolution. Schema add / drop / rename via
delta.columnMapping.mode='name'; partition changes requireCREATE TABLE AS SELECTrewrite. - Hudi — schema evolve, partition evolution limited. Schema evolution is supported (add / rename); partition evolution is limited and typically requires a rewrite.
- 2026 verdict — if your table partition scheme is uncertain or expected to change, pick Iceberg; the partition-evolution feature alone is worth the migration.
Dimension 3 — streaming upserts.
-
Iceberg — improving (v2 spec). Position deletes + equality deletes (the v2 spec); Flink + Spark streaming writers;
MERGE INTOworks but pays full copy-on-write cost. -
Delta — first-class streaming + MERGE. Structured streaming source + sink;
MERGE INTOis the workhorse; change data feed (CDF) exposes row-level deltas downstream. - Hudi — native upserts, MoR-optimised. Built for streaming upserts from day one; MoR avoids rewrite-on-update; incremental queries are first-class CDC sources.
- 2026 verdict — if you ingest > 10k upserts/sec or run CDC sinks, pick Hudi MoR. For < 10k upserts/sec, Delta + structured streaming is a tighter fit if you're Spark-native; Iceberg is fine if you're not.
Dimension 4 — catalog story.
- Iceberg — most catalog options. REST (vendor-neutral spec), AWS Glue (the AWS default), Nessie (git-style branching), Polaris (Snowflake's open REST cat), Hive Metastore, JDBC; all interoperable.
- Delta — Unity Catalog (Databricks) or Hive Metastore. Unity is the strongest catalog inside Databricks (lineage, ACL, governance); outside Databricks, Hive Metastore is the fallback.
- Hudi — Hive Metastore + DataHub. Native Hive Metastore integration; DataHub for lineage / discovery; less catalog optionality than Iceberg.
- 2026 verdict — multi-engine or open-spec? Iceberg + REST/Polaris. Databricks-native? Delta + Unity. Streaming-CDC + DataHub? Hudi + HMS + DataHub.
Dimension 5 — best-fit use case.
- Iceberg → multi-engine open lakehouse. The default when no single engine dominates; the spec is the moat.
- Delta → Databricks-first lakehouse. The default when Databricks is the platform; UniForm + Kernel narrow the gap for other engines.
- Hudi → streaming upserts + CDC sinks. The default when minute-level freshness on high-throughput upserts is the workload.
The 2026 honest read.
-
All three formats now have ACID, time travel, schema evolution,
MERGE INTO— picking by feature checklist is a 2022 mistake. - The real decision axis is engine alignment + catalog story — both of which are external to the format.
- Migration between formats is increasingly cheap — Apache XTable + Delta UniForm + OneTable can present a single physical table as Iceberg / Delta / Hudi metadata simultaneously.
- The senior interview answer — "we picked X because [engine] reads it natively and [catalog] is our identity store; the other two would have worked but cost us [Y] in operations".
Worked example — write the decision script you'd paste into an RFC
Detailed explanation. Real architecture-review meetings end with a script you can paste into a doc, not a vibe. Below is the canonical Python decision function.
Question. Write a Python function that takes a stack profile (engines, write_pattern, catalog, partition_stability) and returns the recommended table format with a one-sentence justification.
Input. A profile dict like {"engines": ["snowflake", "trino", "athena"], "write_pattern": "batch", "catalog": "polaris", "partition_stability": "stable"}.
Code.
def pick_table_format(profile: dict) -> tuple[str, str]:
engines = set(e.lower() for e in profile["engines"])
write_pattern = profile["write_pattern"].lower() # batch | incremental | streaming
catalog = profile["catalog"].lower() # unity | polaris | glue | nessie | hms | rest
partition_change = profile.get("partition_stability", "stable").lower() # stable | evolving
# Rule 1 — high-throughput streaming upserts always favour Hudi MoR.
if write_pattern == "streaming":
return ("hudi (mor)", "streaming upserts at high TPS; MoR avoids Parquet rewrite per event")
# Rule 2 — Databricks-only / Unity catalog favours Delta.
if engines == {"databricks"} or catalog == "unity":
return ("delta", "Databricks-native + Unity Catalog gives Delta first-class tooling")
# Rule 3 — multi-engine reads favour Iceberg.
if len(engines & {"snowflake", "trino", "athena", "bigquery", "flink", "spark"}) >= 2:
return ("iceberg", "broadest open engine reach; multiple engines read it natively")
# Rule 4 — partition scheme expected to change favours Iceberg.
if partition_change == "evolving":
return ("iceberg", "partition evolution is unique to Iceberg; avoids future rewrites")
# Default — Iceberg as the safe modern default.
return ("iceberg", "safe modern default: open spec, broad engine reach, REST/Polaris catalog")
# Three sample profiles
profiles = [
{"engines": ["snowflake", "trino", "athena"], "write_pattern": "batch", "catalog": "polaris"},
{"engines": ["databricks"], "write_pattern": "incremental", "catalog": "unity"},
{"engines": ["spark", "flink"], "write_pattern": "streaming", "catalog": "hms"},
]
for p in profiles:
fmt, why = pick_table_format(p)
print(f"{fmt:<12} ← {why} ({p['engines']})")
Step-by-step explanation.
- The function evaluates four ordered rules; the first match wins.
-
Rule 1 —
write_pattern == "streaming"is the strongest signal; Hudi MoR is the right answer regardless of catalog. -
Rule 2 —
engines == {"databricks"}orcatalog == "unity"short-circuits to Delta; the tooling story dominates everything else. - Rule 3 — multi-engine reads (≥ 2 of Snowflake / Trino / Athena / BigQuery / Flink / Spark) favours Iceberg; this is the most common modern case.
- Rule 4 — if the partition scheme is expected to change, Iceberg's partition-evolution feature is the deciding factor.
- Default — Iceberg as the modern safe pick; the spec is open, the engine reach is broadest, the catalog options are widest.
Output (running the three sample profiles).
iceberg ← broadest open engine reach; multiple engines read it natively (['snowflake', 'trino', 'athena'])
delta ← Databricks-native + Unity Catalog gives Delta first-class tooling (['databricks'])
hudi (mor) ← streaming upserts at high TPS; MoR avoids Parquet rewrite per event (['spark', 'flink'])
Rule of thumb: a 30-line decision function captures most production architecture-review verdicts. The order of the rules matters — write pattern first, then ecosystem, then catalog, then partition evolution.
delta lake vs iceberg vs Hudi — the three failure modes to avoid
- Failure mode 1 — picking the format before the catalog. The catalog owns identity, ACL, and lineage. If you can't deploy Polaris / Unity / Nessie, your format choice is constrained.
- Failure mode 2 — picking the format before the engines. If Snowflake is your BI engine and you pick Hudi, you'll spend the next 12 months building bridge tables; pick Iceberg or Delta UniForm instead.
-
Failure mode 3 — picking the format without sizing the write pattern. Hourly batch into Hudi is wasted overhead; per-minute upserts into Iceberg are needless
MERGErewrites.
Migration paths — XTable, UniForm, OneTable
- Apache XTable (formerly OneTable) — writes one physical Parquet set with three sets of metadata (Iceberg + Delta + Hudi); readers in any format see the same table.
- Delta Lake UniForm — Delta-writer + Iceberg-reader interop; the Databricks-led answer to multi-engine reads.
- Migration in practice — most teams pick one format and live with it; XTable / UniForm exist for the few teams that genuinely need multi-format access.
- Interview signal — naming XTable + UniForm in a comparison answer is a senior signal; most candidates don't know they exist.
Python
Topic — etl
Lakehouse decision drills
SQL
Topic — database
Warehouse / catalog practice
Solution Using a five-dimension verdict table + a one-paragraph defense
Code.
-- A canonical 5-dimension verdict matrix you can paste into any RFC.
CREATE TABLE table_format_verdict AS
SELECT * FROM (VALUES
('engine reach', 'iceberg', 'best', 'Snowflake, BigQuery, Athena, Trino, Spark, Flink read natively'),
('engine reach', 'delta', 'good', 'Spark, Databricks SQL native; Trino via Delta Kernel; UniForm closes the gap'),
('engine reach', 'hudi', 'ok', 'Spark, Flink, Presto, Trino read; Snowflake / BQ support limited'),
('schema / partition evolve', 'iceberg', 'best', 'schema by column id; partition evolution unique'),
('schema / partition evolve', 'delta', 'good', 'schema add/drop/rename; no partition evolution'),
('schema / partition evolve', 'hudi', 'ok', 'schema evolution; partition evolution limited'),
('streaming upserts', 'iceberg', 'ok', 'v2 deletes; Flink + Spark; MERGE pays CoW cost'),
('streaming upserts', 'delta', 'best', 'MERGE INTO + structured streaming + CDF'),
('streaming upserts', 'hudi', 'best', 'native UPSERT; MoR avoids rewrite per event'),
('catalog story', 'iceberg', 'best', 'REST, Glue, Nessie, Polaris, HMS, JDBC interoperable'),
('catalog story', 'delta', 'good', 'Unity Catalog inside Databricks; HMS outside'),
('catalog story', 'hudi', 'ok', 'Hive Metastore + DataHub'),
('best-fit use case', 'iceberg', 'multi-engine open lakehouse', '—'),
('best-fit use case', 'delta', 'Databricks-first lakehouse', '—'),
('best-fit use case', 'hudi', 'streaming upserts + CDC sinks','—')
) AS t(dimension, format, verdict, notes);
Step-by-step trace.
| dimension | iceberg | delta | hudi |
|---|---|---|---|
| engine reach | best | good | ok |
| schema / partition evolve | best | good | ok |
| streaming upserts | ok | best | best |
| catalog story | best | good | ok |
| best-fit use case | multi-engine open lakehouse | Databricks-first lakehouse | streaming + CDC |
- Row 1 — engine reach is the dominant axis for most 2026 lakehouses; Iceberg wins because it reads natively from Snowflake / BigQuery / Athena.
- Row 2 — schema + partition evolution is a power-feature row; only Iceberg ships partition evolution.
- Row 3 — streaming upserts split between Delta (Spark-native) and Hudi (MoR-optimised); both beat Iceberg here.
- Row 4 — catalog story is the second strongest axis; Iceberg's catalog optionality is the moat.
- Row 5 — the best-fit use case row is the summary; one line per format.
Output.
| dimension | iceberg | delta | hudi |
|---|---|---|---|
| engine reach | best | good | ok |
| schema / partition evolve | best | good | ok |
| streaming upserts | ok | best | best |
| catalog story | best | good | ok |
| best-fit use case | multi-engine | Databricks | streaming |
Why this works — concept by concept:
- Dimension-by-dimension verdict — replaces vague "X is better" with "X is better on dimension D"; senior architects always score per dimension.
- No one-winner framing — every format wins at something; the matrix forces you to acknowledge tradeoffs.
- Best-fit use case row — the summary; one sentence per format that you can quote in a one-pager.
- Notes column — embeds the why next to the verdict; reviewers can audit each cell without follow-up questions.
-
Cost —
O(1)to read; the underlying migration cost (if you change format) isO(table count × data size)but happens once.
Choosing the right table format (cheat sheet)
A one-screen cheat sheet for apache iceberg vs delta lake vs apache hudi — pick the format that matches the workload, engine mix, and catalog story you actually have.
| You want to … | Pick | Why | Catalog default |
|---|---|---|---|
| Read from > 2 engines (Snowflake + Trino + Athena) | Iceberg | broadest open engine reach | Polaris / Glue / REST |
| Live inside Databricks + Spark | Delta | first-class MERGE / OPTIMIZE / Z-ORDER + Unity | Unity Catalog |
| Run CDC sink at > 10k upserts/sec | Hudi (MoR) | append delta logs; async compaction; native UPSERT | Hive Metastore + DataHub |
| Evolve partition scheme without rewriting history | Iceberg | partition evolution is unique | any |
| Time-travel + audit GDPR backfill | any | all three support time travel | — |
| Open spec, vendor-neutral, multi-cloud | Iceberg | the format with no single vendor owner | Polaris / Nessie |
| Build a feature store on a Spark stack | Delta | Z-ORDER + Photon + structured streaming | Unity |
| Incremental query as a CDC source for downstream | Hudi | incremental queries are first-class | HMS |
| Start fresh on AWS with Athena + Glue | Iceberg | Glue + Athena native; zero new infra | Glue |
| Migrate one table to read from all three at once | XTable / UniForm | dual-metadata interop layer | any |
| Bound metadata cost on a billion-row table | Iceberg | two-stage manifest pruning | any |
| Rewrite small files / compact | OPTIMIZE (Delta) · rewrite_data_files (Iceberg) · compaction (Hudi) | per-format compaction commands | — |
| Roll back a bad write in one command | Iceberg rollback_to_snapshot
|
flip the catalog pointer; no data rewrite | any |
| Run a free, open-source DQ layer | dbt tests + Great Expectations | works against any of the three | — |
| Share one table across vendor silos | XTable / UniForm | one physical Parquet, three metadata views | any |
Frequently asked questions
How is apache iceberg vs delta lake different from a generic Iceberg-only or Delta-only deep dive?
A single-format deep dive answers "how does X work?" — this guide answers "which of X, Y, Z fits my workload, and why?" The five sections walk the anatomy of each format (Iceberg's catalog → snapshots → manifest list → manifests → data files; Delta's Parquet + _delta_log/ + checkpoints; Hudi's CoW vs MoR + compaction timeline), then collapse the three stacks into a five-dimension decision matrix (engine reach, schema / partition evolution, streaming upserts, catalog story, best-fit use case) plus a Python pick_table_format() script you can paste into an RFC. Pick the single-format deep-dive when you've already picked your format and want to master it; pick this comparison guide when you're about to pick or about to justify the pick to a senior architecture review.
What's the real difference between delta lake vs iceberg for a multi-engine team?
The biggest practical difference in 2026 is engine reach and catalog story, not the on-disk format. Iceberg is read natively by Snowflake (via Polaris), BigQuery (via BigLake), Athena, Trino, Spark, Flink, ClickHouse, StarRocks, and DuckDB; Delta is Spark-first and is read by Databricks SQL natively, by Trino via Delta Kernel, by Synapse with caveats, and by Athena via Delta UniForm. If you read from > 2 engines, Iceberg wins; if you're inside Databricks, Delta wins. The second-biggest difference is the catalog story — Iceberg has pluggable backends (REST, Glue, Nessie, Polaris, HMS), Delta is best with Unity Catalog inside Databricks. Both formats now ship ACID, time travel, schema evolution, and MERGE INTO; the headline features are at parity.
When should I pick apache hudi over Iceberg or Delta?
Pick Hudi when your write pattern is streaming upserts at high TPS — typically > 10,000 upserts/second from a CDC source like Debezium, a Flink job, or a Kafka stream. Hudi's hudi merge on read table type appends a small Avro delta log next to the base Parquet file rather than rewriting the Parquet on every update; an async compaction service merges the logs back periodically. This makes Hudi MoR an order of magnitude cheaper for high-throughput upserts than Iceberg or Delta's copy-on-write MERGE INTO. Hudi's other Hudi-native super-power is incremental queries — SELECT * FROM t WHERE _hoodie_commit_time > '...' returns only rows changed since an instant, which is the canonical Hudi-native CDC export pattern. If your write pattern is hourly batch or daily batch, Hudi is over-engineering; pick Iceberg or Delta instead.
What is an iceberg snapshot, and why are there five metadata layers?
An iceberg snapshot is a single immutable commit of a table — every write produces a new snapshot, and readers pin queries to a specific snapshot for consistent results. The five metadata layers (catalog → metadata.json → snapshot → manifest list → manifests → data files) exist because each layer is independently compactable and independently prunable. The catalog owns the current-pointer; metadata.json carries schema + snapshot history; each snapshot references one manifest list; the manifest list lists manifest files with per-manifest partition bounds (engines prune at this layer first); each manifest lists data files with per-file column statistics (engines prune at this layer second); only the surviving data files are actually opened. This two-stage pruning is what makes Iceberg fast on huge tables with selective queries — most reads open < 1% of the table.
What does the delta transaction log look like, and how does time travel work?
The delta transaction log lives in _delta_log/ under the table folder; it's a numbered sequence of JSON files (one per commit) plus an occasional Parquet checkpoint. Each JSON contains actions — add (a new Parquet file with stats), remove (a tombstoned file), metaData (schema), protocol (reader/writer versions), and commitInfo (audit metadata). A reader reconstructs the current file set by replaying the log; checkpoints (written every 10 commits by default) collapse the cumulative state into a single Parquet so replay is bounded. Time travel is a substring of the same replay — SELECT * FROM t VERSION AS OF 42 replays the log only up to version 42 and stops; SELECT * FROM t TIMESTAMP AS OF '...' does the same with a timestamp lookup. Time travel is free (no rewrite); the only cost is the bounded log replay.
What's the difference between hudi copy on write and hudi merge on read?
hudi copy on write (CoW) rewrites the affected Parquet file in full on every update; readers see only Parquet, so reads are fast; writers pay the rewrite cost, so write throughput is limited on high-frequency updates. hudi merge on read (MoR) appends a small Avro delta log next to the base Parquet on every update; readers merge Parquet + uncompacted log on the fly, so reads are slower; writers append cheaply, so write throughput is much higher; an async compaction service merges logs back into Parquet on a schedule to keep read cost bounded. Pick CoW for read-heavy + low-frequency-update workloads (analytics dashboards, feature stores). Pick MoR for write-heavy streaming workloads (CDC sinks, Kafka-to-warehouse pipelines). The choice is per-table, not per-cluster — most Hudi deployments mix both.
Can I switch table formats later — Iceberg → Delta or vice versa?
Yes, increasingly cheaply. Three migration paths exist in 2026. Apache XTable (formerly OneTable) writes one physical Parquet set with three sets of metadata so the same files appear as an Iceberg table, a Delta table, and a Hudi table simultaneously; readers in any format see the same data. Delta Lake UniForm writes Iceberg metadata alongside Delta metadata so Delta writers and Iceberg readers can share one table without duplication. Full migration is also possible: tools like Iceberg's migrate procedure, Delta's CONVERT TO DELTA, and Hudi's bootstrap operation can flip an existing Parquet directory to a managed table format in-place. Most teams pick one format and live with it; the dual-metadata layers exist for the few teams that genuinely need cross-format reads. The senior interview signal is naming XTable + UniForm — most candidates don't know they exist.
Practice on PipeCode
PipeCode ships 450+ data-engineering interview problems — including SQL + Python drills keyed to the same lakehouse mental model this guide teaches (snapshot anatomy, transaction-log replay, copy-on-write vs merge-on-read trade-offs, partition evolution, MERGE upserts, incremental queries, and catalog-led architecture decisions). Whether you're prepping for an apache iceberg vs delta lake architecture round, drilling Hudi streaming upserts the week before a Flink interview, or rehearsing the five-dimension decision matrix for an RFC, the practice library mirrors the same five-section structure — plus the Snowflake + BigQuery + Athena + Trino + Spark + Flink engine surfaces you'll wire into your production lakehouse.
Kick off via Explore practice →; drill the SQL practice lane →; fan out into the ETL drills →; rehearse streaming + CDC practice →; reinforce aggregation reconciliation patterns →; widen coverage on the full Python practice library →.





Top comments (0)