apache iceberg rewrote what "a table" means on object storage. A decade of "files in S3 with a _SUCCESS marker" gave us cheap storage and unbounded scale, but every meaningful invariant a SQL engineer expects — atomicity, isolation, time travel, schema evolution, safe rollback — had to be re-invented on top of immutable blobs. Iceberg's answer is the snapshot manifest: a tiny pointer-swap that makes a multi-gigabyte write atomic, a 90-day rollback trivial, and a "compare last good prod table to today's run" diff a one-line query.
This guide is the production patterns playbook for senior data engineers who already know what Iceberg is and now have to ship it. It walks through the iceberg snapshots model, the apache iceberg vs delta lake decision, the iceberg branching lifecycle (createBranch, fastForward, cherryPick), the iceberg tagging API for immutable compliance pins, the iceberg wap (write audit publish) pattern as the modern replacement for spark.wap.id, and the iceberg time travel plus maintenance cadence (snapshot expiration, OPTIMIZE, rewrite_manifests, schema evolution) that keep a multi-petabyte lakehouse fast and cheap. Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.
When you want hands-on reps immediately after reading, drill the ETL practice library →, rehearse on database design problems →, and stack the streaming-side muscles with streaming problems →.
On this page
- Why open table formats — snapshot atomicity, ACID, time travel
- Iceberg branches — createBranch / fastForward / cherry-pick, isolated dev
- Iceberg tags — immutable labels for compliance and audit
- WAP pattern — write to audit branch, validate, fast-forward to main
- Production ops — snapshot expiration, OPTIMIZE, schema evolution
- Cheat sheet — Iceberg branching, tagging, and WAP recipes
- Frequently asked questions
- Practice on PipeCode
1. Why open table formats — snapshot atomicity, ACID, time travel
The "files in S3" era broke at scale — the snapshot manifest is the primitive that fixed every invariant a SQL engineer expects
The mental model in one line: an Iceberg table is a tiny JSON pointer that names the current snapshot; every write produces a new snapshot and the table commit is a single conditional pointer swap. Once you internalise "tables are pointer-swaps over immutable files," every Iceberg feature — atomic commits, time travel, branches, tags, WAP — falls out of the same primitive.
Why the pre-Iceberg world broke.
-
Hive-style partitioned directories told the reader "everything under
/sales/dt=2026-06-12/is the partition for that day." There was no commit record, so a job that wrote 12 of 24 files before crashing left partial data visible to every downstream reader. -
_SUCCESSmarkers were a convention, not a contract — Spark wrote them, but Trino did not always honour them, and a successful marker plus a half-rewritten Parquet file could still corrupt a query. - No schema-evolution invariant — a column rename in Hive was a rewrite of every Parquet file. A column drop was a code review and a prayer.
- No transactional semantics — concurrent writers raced on the same partition and the last-writer-wins resolution silently dropped commits.
-
Listing-bound planning — query planning required
LISTagainst S3 for every partition, which was both slow and a wallet-burn at scale.
The Iceberg primitive.
- The snapshot. Every commit produces one snapshot. A snapshot is a manifest list — a tiny file that points to a set of manifest files, which point to the actual Parquet/ORC data files. The current table state is the current snapshot.
-
The atomic swap. A commit succeeds when the catalog (REST / Glue / Nessie / Polaris / Unity) flips its "current snapshot id" pointer from
snap-Atosnap-Bin a single conditional write. If the swap fails (because someone else committed first), the writer retries the manifest against the new base. No partial reads, no half-written tables. - ACID without locks. Atomicity comes from the pointer swap; consistency from the manifest validation; isolation from immutable snapshots (readers pin a snapshot id at planning time and never see writes that arrive after); durability from S3 itself.
-
Time travel for free. Every historical snapshot is queryable via
AS OF VERSIONorAS OF TIMESTAMPuntil you explicitly expire it. Rollback is a single API call. -
No-list planning. The manifest tree carries column-level stats (min/max, null counts) per file, so the planner prunes 99% of files before any
LISTorGETis issued. Planning latency drops from minutes to seconds on petabyte tables.
Apache Iceberg vs Delta Lake — the two-format reality.
- Both ship snapshot-based ACID on object storage. Both support time travel, schema evolution, and atomic commits.
- Iceberg is catalog-pluggable — the table identity lives in the catalog (REST, Glue, Nessie, Polaris, Snowflake-managed, Unity). Multi-engine reads are first-class: Spark, Trino, Flink, Snowflake, DuckDB, ClickHouse, StarRocks, Athena, BigQuery all read Iceberg natively.
-
Delta Lake has its own commit log living next to the data (
_delta_log/). Engines like Trino and Flink can read Delta but the writer ecosystem outside Spark/Databricks is thinner. Delta UniForm shrinks this gap by writing Iceberg-compatible metadata alongside Delta. - Iceberg branches and tags ship as core API (Spark, Flink, Trino procedures). Delta's equivalent is "shallow clone" plus the Delta UniForm bridge — semantically similar but a different surface.
- The 2026 rule of thumb. For a new lakehouse with Snowflake / Databricks / Trino / Flink all reading and writing: pick Iceberg. For a Databricks-first shop with limited multi-engine writers: Delta with UniForm is operationally simpler. For "I might switch engines in 2 years": Iceberg is the safer bet.
What interviewers listen for.
- Do you say "a commit is a pointer swap" when asked what makes Iceberg atomic? — senior signal.
- Do you reach for
snapshotsmetadata table to inspect history before reaching forgit log-style mental models? — required answer. - Do you mention catalog choice (REST vs Glue vs Nessie vs Polaris) as a first-class production decision? — senior signal.
- Do you know that
time travelreads pin the snapshot at planning time, so a long-running query is unaffected by concurrent commits? — required answer.
Worked example — list snapshots and inspect commit history
Detailed explanation. Every Iceberg table exposes a metadata sub-table called .snapshots that lists every commit. Reading it is the first move when a downstream report changed unexpectedly — you can see exactly which snapshot landed, who wrote it, and how many files were added or deleted.
Question. Given a production fact table db.sales, list the last five snapshots with commit timestamp, operation, and the file-level delta (added / deleted). How would you use the result to investigate a "row count jumped by 8% overnight" alert?
Input.
| snapshot_id | committed_at | operation | added_data_files | deleted_data_files |
|---|---|---|---|---|
| s-101 | 2026-06-12 02:00 | append | 24 | 0 |
| s-102 | 2026-06-12 04:00 | append | 12 | 0 |
| s-103 | 2026-06-12 06:00 | overwrite | 240 | 36 |
| s-104 | 2026-06-12 08:00 | append | 6 | 0 |
| s-105 | 2026-06-12 09:30 | append | 4 | 0 |
Code.
-- Spark / Trino / Snowflake-Iceberg syntax
SELECT
snapshot_id,
committed_at,
operation,
summary['added-data-files'] AS added_data_files,
summary['deleted-data-files'] AS deleted_data_files,
summary['total-records'] AS total_records
FROM db.sales.snapshots
ORDER BY committed_at DESC
LIMIT 5;
Step-by-step explanation.
-
db.sales.snapshotsis the metadata sub-table for thesalestable — every Iceberg table exposes it automatically. -
summaryis aMAP<STRING,STRING>of commit-level counters: added/deleted data files, added/deleted records, total records, partitioning summary, and any caller-attached properties (spark.app.id, dbt run id, Airflow dag run id). - The
overwriterow at s-103 is the suspicious commit — it deleted 36 files and added 240, which is the textbook signature of an OPTIMIZE / rewrite_data_files run or a backfill that replaced a partition. Cross-check thewap.idorspark.app.idinsummaryto know which. - With the suspect snapshot identified, an investigator can
time travel(SELECT ... FROM db.sales VERSION AS OF s-102vss-103) and diff the row counts per partition to confirm whether the jump came from the overwrite.
Output.
| snapshot_id | committed_at | operation | added | deleted | total_records |
|---|---|---|---|---|---|
| s-105 | 2026-06-12 09:30 | append | 4 | 0 | 12_004_200 |
| s-104 | 2026-06-12 08:00 | append | 6 | 0 | 12_001_800 |
| s-103 | 2026-06-12 06:00 | overwrite | 240 | 36 | 11_995_400 |
| s-102 | 2026-06-12 04:00 | append | 12 | 0 | 11_096_200 |
| s-101 | 2026-06-12 02:00 | append | 24 | 0 | 11_088_400 |
Rule of thumb. When a downstream metric changes unexpectedly, your first SQL should always be SELECT * FROM <table>.snapshots ORDER BY committed_at DESC LIMIT N. The snapshots table is the audit log every other format had to bolt on later.
Worked example — time travel to query the table at a past timestamp
Detailed explanation. Iceberg supports both VERSION AS OF <snapshot_id> (exact snapshot) and TIMESTAMP AS OF <ts> (the snapshot that was current at that wall-clock time). Time travel reads pin the snapshot at planning time, so a long-running analytical query is immune to concurrent writes.
Question. A daily revenue report ran at 06:00 and produced a number the business is disputing. Re-run the same query against the table as it existed at 06:00 and confirm the historical value.
Input.
| snapshot_id | committed_at |
|---|---|
| s-102 | 2026-06-12 04:00 |
| s-103 | 2026-06-12 06:00 |
| s-104 | 2026-06-12 08:00 |
Code.
-- Reproduce the report exactly as it ran at 06:00
SELECT
region,
SUM(amount) AS revenue
FROM db.sales TIMESTAMP AS OF '2026-06-12 06:00:00'
WHERE order_date = DATE '2026-06-11'
GROUP BY region
ORDER BY region;
-- Or pin the exact snapshot id from the snapshots audit
SELECT
region,
SUM(amount) AS revenue
FROM db.sales VERSION AS OF 103
WHERE order_date = DATE '2026-06-11'
GROUP BY region
ORDER BY region;
Step-by-step explanation.
-
TIMESTAMP AS OF '2026-06-12 06:00:00'resolves to the snapshot that was current at exactly that moment — in this table, snapshot s-103 (the 06:00 overwrite). The planner uses the manifest tree from s-103, not the current pointer. -
VERSION AS OF 103is the unambiguous form — if downstream auditors need to reproduce a number, the snapshot id is the contract. - Both forms cost the same to plan and execute as a regular read; the only difference is which snapshot manifest the planner walks.
- Because the snapshot is pinned at planning time, no concurrent writer can affect this query's result. The number you compute is the historical truth.
Output.
| region | revenue (at 06:00) | revenue (now) |
|---|---|---|
| EU | 1_240_500 | 1_310_200 |
| US | 980_400 | 1_045_900 |
| APAC | 420_900 | 444_100 |
Rule of thumb. Every disputed metric should be re-runnable with a TIMESTAMP AS OF or VERSION AS OF clause against the original snapshot. If your downstream pipeline forgets to capture the snapshot id, you lose audit reproducibility — capture it in every run's lineage metadata.
Worked example — atomic rollback to a previous snapshot
Detailed explanation. If a bad commit corrupts the table — wrong partition overwritten, accidental DELETE, schema-violating insert — rollback_to_snapshot flips the catalog pointer back to the prior snapshot in a single atomic operation. The bad snapshot still exists on disk (until expired) but is no longer the current state.
Question. A nightly job overwrote the eu partition with bad data in snapshot s-103. Roll the table back to snapshot s-102 atomically, so every reader returns the pre-s-103 state immediately.
Input.
| snapshot_id | committed_at | operation |
|---|---|---|
| s-101 | 2026-06-12 02:00 | append |
| s-102 | 2026-06-12 04:00 | append |
| s-103 | 2026-06-12 06:00 | overwrite (BAD) |
Code.
-- Spark / Trino procedure form
CALL system.rollback_to_snapshot('db.sales', 102);
-- Or rollback to the snapshot that was current at a wall-clock time
CALL system.rollback_to_timestamp('db.sales', TIMESTAMP '2026-06-12 05:00:00');
Step-by-step explanation.
- The procedure writes a new snapshot s-104 whose state is identical to s-102 (the pointer is reset). The bad snapshot s-103 is still on disk but is no longer in the chain of "current" snapshots.
- The pointer swap is atomic — between the millisecond before and after, no reader can see a half-rolled-back table.
- Concurrent readers that started before the rollback continue against their pinned snapshot; new readers see the rolled-back state.
- The bad snapshot remains queryable via
VERSION AS OF 103for post-mortem analysis until you explicitlyexpire_snapshotspast it.
Output.
| step | current snapshot | state visible to readers |
|---|---|---|
| before | s-103 | bad EU data |
| after | s-104 (= s-102 state) | pre-bad EU data |
Rule of thumb. Treat rollback_to_snapshot as your "undo button" — but always pair it with a snapshot inspection (SELECT * FROM <table>.snapshots) so you roll back to the correct prior snapshot. Rolling back to the wrong snapshot is its own outage.
SQL interview question on Iceberg atomicity and time travel
A senior interviewer often opens with: "Walk me through what happens at the catalog when Spark commits an append to an Iceberg table — what is atomic, what is not, and how does a concurrent reader see (or not see) the new rows?" It probes whether the candidate understands snapshot isolation, the pointer-swap commit, and the read-pinning behaviour.
Solution Using snapshot isolation and the manifest-pointer commit
-- 1) Inspect the current pointer and the in-flight commit candidate
SELECT
name,
snapshot_id,
parent_id,
is_current_ancestor
FROM db.sales.refs; -- branches + tags + main
SELECT
snapshot_id,
parent_id,
operation,
committed_at,
summary['added-data-files'] AS added,
summary['spark.app.id'] AS writer
FROM db.sales.snapshots
ORDER BY committed_at DESC
LIMIT 3;
-- 2) Reader query that pins a snapshot at planning time
SELECT
region,
COUNT(*) AS row_count
FROM db.sales VERSION AS OF (
SELECT MAX(snapshot_id) FROM db.sales.snapshots
WHERE committed_at <= TIMESTAMP '2026-06-12 06:00:00'
)
GROUP BY region;
Step-by-step trace.
| step | catalog state | reader state | invariant |
|---|---|---|---|
| 1 | current = s-102 | new reader plans against s-102 | snapshot pinned |
| 2 | writer prepares s-103 manifest | s-102 still current | no swap yet |
| 3 | writer calls catalog.commit(s-102→s-103) | atomic conditional swap | last-writer-wins or retry |
| 4 | current = s-103 | original reader still sees s-102 | snapshot isolation |
| 5 | second reader plans now | sees s-103 | no partial reads |
The trace shows the central guarantee: between steps 1 and 4, the original reader's view is locked to s-102 — it cannot accidentally see half of s-103's writes. The new reader at step 5 sees the fully committed s-103. There is no in-between visibility state.
Output:
| reader | snapshot pinned | row_count visible |
|---|---|---|
| reader A (started at t1) | s-102 | 11_096_200 |
| reader B (started at t5) | s-103 | 11_995_400 |
| reader C (TIMESTAMP AS OF '06:00') | s-103 (was current at 06:00) | 11_995_400 |
Why this works — concept by concept:
- Snapshot — a manifest list pointing to a set of data files; immutable once written. Every write produces a new snapshot.
-
Catalog pointer swap — the table's
current_snapshot_idlives in the catalog (REST, Glue, Nessie, Polaris, Unity, or a metadata file pointer in HDFS-catalog mode). Commit = conditional update of that single field. - Snapshot isolation — readers capture the current snapshot id at planning time and use it throughout the query. Concurrent commits never affect an in-flight read.
- Last-writer-wins with retry — two concurrent writers race to swap the pointer; the loser re-bases its manifest against the winner's snapshot and retries. No locks, no deadlocks.
- Cost — pointer swap is one catalog write; the manifest tree for the new snapshot is O(changed manifests), not O(table). Atomicity is essentially free.
SQL
Topic — database
Database design problems
2. Iceberg branches — createBranch / fastForward / cherry-pick, isolated dev
iceberg branching makes the table look like a Git repo — branches are named snapshot refs, not table copies
The mental model in one line: an Iceberg branch is a named pointer to a snapshot — exactly like a Git branch is a named pointer to a commit — so creating one is O(1) metadata, not an O(table) copy. Once you say "branches are refs, not clones," the entire branching API (createBranch, fastForward, cherryPick) reads like Git porcelain over the same snapshot DAG.
The branch primitive in five bullets.
- A branch is a named ref. The catalog stores one row per branch (and per tag) — name, snapshot_id, max_ref_age_ms, min_snapshots_to_keep, max_snapshot_age_ms. The data is shared with main; only the ref pointer is new.
-
mainis the implicit branch every table starts with. Every commit on the table updatesmainunless you explicitly target a different branch. -
Branch writes are isolated — a writer that targets
devdoes not touch themainpointer, so readers onmainare completely unaffected by experimental work. -
fastForwardresolves a branch into main when it is strictly ahead. Ifdevwas created from snapshot s-102 and has commits s-d1 → s-d2 → s-d3, andmainis still at s-102, thenfastForward('main', 'dev')moves themainpointer to s-d3 in a single atomic operation. The data files ondevsimply become the files onmain. -
cherryPickbrings a single snapshot from one branch into another as a new snapshot. The picked snapshot's data files are re-committed under main's lineage; the original branch is untouched.
Branch lifecycle API.
-
createBranch(name, snapshot_id?)— creates a new branch pointing tosnapshot_id(or current if omitted). O(1) metadata write. -
createBranch(name, snapshot_id, retention?)— optional retention controls protect old snapshots referenced by the branch fromexpire_snapshots. -
fastForward(target, source)— fast-forward thetargetbranch tosource. Fails iftargethas commits not reachable fromsource(would require a real merge, which Iceberg does not implement — you mustcherryPickorrewrite). -
cherryPick(snapshot_id)— copy a single snapshot's changes onto the current branch as a new commit. Conflicts with concurrent writes resolve via re-application. -
dropBranch(name)— removes the branch ref; snapshots become eligible for expiration if no other ref protects them.
When branches are the right answer.
-
Sandbox a backfill. Create
backfill-2026-Q1from main, run the rewrite against it, validate, fast-forward into main. Production reads of main never see in-progress backfill state. -
Long-lived experiment. Data scientists join a new feature into a
feat-customer360branch and query it; main continues to serve production. If the experiment ships, fast-forward; if it does not, drop the branch. - Multi-tenant staging. Each downstream team gets a branch with its own data-correction overlays; main remains canonical.
- WAP (next section). The audit branch is created per run, written to, validated, then fast-forwarded into main — the modern WAP pattern.
When branches are the wrong answer.
- As a substitute for partitioning. A branch per tenant is a metadata explosion; use a tenant column and row-level access control instead.
- As a replacement for snapshots. If you only need "the table as of yesterday," a snapshot id or a tag is cheaper than a branch.
-
As a soft DELETE. Iceberg supports
DELETE FROMwith merge-on-read or copy-on-write at the table level; branches are not for hiding rows.
Common interview probes on branches.
- "What is the difference between an Iceberg branch and a Git branch?" — semantically nearly identical (both are named refs to a commit DAG), but Iceberg branches do not implement three-way merge — fast-forward and cherry-pick are the only merge primitives.
- "Can two writers commit to the same branch concurrently?" — yes, same last-writer-wins retry as on main. Branches do not serialise writes; they isolate one branch's commits from another's.
- "How are branches stored?" — as rows in the table's
refsmetadata sub-table. The catalog stores the current snapshot id for each ref; the data files are shared across branches. - "Does dropping a branch delete the data?" — no. Data files referenced by the dropped branch survive until the next
expire_snapshotsrun determines they are unreachable from every remaining ref.
Worked example — create a dev branch, mutate, and fast-forward into main
Detailed explanation. A data engineer wants to test a partition rewrite against a copy of db.sales without affecting production readers. She creates a dev branch from the current main snapshot, runs the rewrite against dev, validates the row counts, then fast-forwards main to the new state in a single atomic call.
Question. Given db.sales at snapshot s-105, create a dev branch, write a rewrite snapshot to it that compacts small files in the eu partition, validate row counts, and fast-forward main to the new dev head.
Input.
| step | branch | snapshot pointer |
|---|---|---|
| start | main | s-105 |
| start | dev | (not created) |
Code.
-- 1) Create the dev branch from current main
ALTER TABLE db.sales CREATE BRANCH `dev`;
-- 2) Write the rewrite on the dev branch
CALL system.rewrite_data_files(
table => 'db.sales',
options => map('target-file-size-bytes', '536870912'),
where => "region = 'EU'",
branch => 'dev'
);
-- 3) Validate row counts match between main and dev for EU
SELECT
'main' AS branch, COUNT(*) AS row_count FROM db.sales VERSION AS OF (SELECT snapshot_id FROM db.sales.refs WHERE name='main')
WHERE region = 'EU'
UNION ALL
SELECT
'dev', COUNT(*) FROM db.sales VERSION AS OF (SELECT snapshot_id FROM db.sales.refs WHERE name='dev')
WHERE region = 'EU';
-- 4) Fast-forward main to dev's head atomically
CALL system.fast_forward('db.sales', 'main', 'dev');
-- 5) Drop the dev branch (data files now live under main's lineage)
ALTER TABLE db.sales DROP BRANCH `dev`;
Step-by-step explanation.
-
CREATE BRANCH devwrites one new ref row to the catalog pointing to s-105. No data is copied. O(1). -
rewrite_data_files(..., branch => 'dev')runs the compaction targetingdev. The new compacted snapshot s-d1 lands ondev;mainis still at s-105 and production readers see no change. - The validation query reads both branches via
VERSION AS OFagainst the ref-resolved snapshot ids. Row counts must match — compaction does not change semantics. -
fast_forward('main', 'dev')moves themainpointer to s-d1 atomically. The transition is a single catalog write; no reader sees an in-between state. The cost is zero new data files — the files are already there, only the pointer changes. -
DROP BRANCH devremoves the ref row. The files it referenced are still reachable from main (because of the fast-forward), so nothing is orphaned.
Output.
| step | main pointer | dev pointer | EU files | EU rows |
|---|---|---|---|---|
| 1 | s-105 | s-105 | 120 small | 4.2M |
| 2 | s-105 | s-d1 | main:120, dev:18 large | 4.2M / 4.2M |
| 3 | s-105 | s-d1 | unchanged | counts match |
| 4 | s-d1 | s-d1 | 18 large | 4.2M |
| 5 | s-d1 | (dropped) | 18 large | 4.2M |
Rule of thumb. Every multi-step mutation that could be risky (rewrite, schema evolution, partition reshuffle) belongs on its own branch. The branch is your safety net — if validation fails, you drop the branch and nothing on main changed.
Worked example — cherry-pick a single fix snapshot from a hotfix branch
Detailed explanation. A nightly job ran a multi-step transformation on the feat-2026-Q2 branch. One of the snapshots in the chain is a legitimate bug fix that should be applied to main; the rest are still in progress. Cherry-pick brings just that one snapshot into main without merging the entire branch.
Question. Given feat-2026-Q2 with snapshots s-f1 (refactor), s-f2 (bug fix to NULL handling), s-f3 (unfinished new column), cherry-pick s-f2 into main without bringing s-f1 or s-f3.
Input.
| branch | snapshots |
|---|---|
| main | s-101 → ... → s-105 |
| feat-2026-Q2 | s-105 → s-f1 → s-f2 → s-f3 |
Code.
-- Cherry-pick the bug-fix snapshot only
CALL system.cherrypick_snapshot('db.sales', '<s-f2-id>');
-- Verify the new main snapshot inherited the fix
SELECT
snapshot_id, parent_id, operation,
summary['source-snapshot-id'] AS cherry_picked_from
FROM db.sales.snapshots
ORDER BY committed_at DESC
LIMIT 2;
Step-by-step explanation.
-
cherrypick_snapshotreads the data-file delta ofs-f2(the bug fix) and re-applies it as a new commits-106on main. The data files referenced by s-f2 are linked under main's manifest tree. - The new snapshot's
summaryrecordssource-snapshot-id = s-f2so the audit trail captures the cherry-pick origin. -
feat-2026-Q2is untouched —s-f1,s-f2,s-f3remain on the branch for the team's continued work. - If
s-f2had touched the same files as a concurrent main commit, cherry-pick would fail with a conflict; the operator would then either rebases-f2against the new main head and retry, or rewrite the fix as a fresh commit on main.
Output.
| snapshot_id | parent_id | operation | cherry_picked_from |
|---|---|---|---|
| s-106 | s-105 | overwrite | s-f2 |
| s-105 | s-104 | append | (none) |
Rule of thumb. Use cherryPick when you need a single fix from a long-lived branch without merging everything else. Use fastForward when the entire branch is ready to land. Use neither — and reach for a fresh INSERT INTO on main — when the change is trivial enough that the branch overhead is not worth it.
Worked example — long-lived experiment branch with concurrent main writes
Detailed explanation. A data-science team works on a feat-customer360 branch for three weeks. During those three weeks, the production ETL keeps committing to main. The team can periodically rebase their branch against the new main head — or they can keep working on the old snapshot and reconcile at the end.
Question. Show the snapshot DAG state at three points in time: branch creation, two weeks in, and at fast-forward time. Explain why concurrent main writes do not block the branch.
Input.
| time | main | feat-customer360 |
|---|---|---|
| t0 | s-105 | (created at s-105) |
| t1 (week 1) | s-110 | s-105 → s-c1 |
| t2 (week 2) | s-115 | s-105 → s-c1 → s-c2 |
| t3 (week 3) | s-120 | s-105 → s-c1 → s-c2 → s-c3 |
Code.
-- At t0: create the branch
ALTER TABLE db.customers CREATE BRANCH `feat-customer360`;
-- At t1, t2, t3: append to the branch (writes do not touch main)
INSERT INTO db.customers.`branch_feat-customer360`
SELECT * FROM staging.customer360_t1;
-- At t3: rebase or cherry-pick to land the work
-- Option A: cherry-pick each branch snapshot in order onto main
CALL system.cherrypick_snapshot('db.customers', '<s-c1-id>');
CALL system.cherrypick_snapshot('db.customers', '<s-c2-id>');
CALL system.cherrypick_snapshot('db.customers', '<s-c3-id>');
-- Option B: validate and fast-forward only if the branch is a clean superset
-- (rare for a 3-week-old branch; usually some main commits conflict)
CALL system.fast_forward('db.customers', 'main', 'feat-customer360');
Step-by-step explanation.
- At t0, both refs point to s-105. No data is duplicated; only the ref row is new.
- Between t0 and t3, the ETL commits 15 new snapshots to main (s-106 → s-120). Each commit only touches the
mainref; thefeat-customer360ref is unaffected. - The DS team commits 3 snapshots to the branch (s-c1, s-c2, s-c3). Each touches only the branch ref;
mainis unaffected. - At t3, the team wants to land the work. Because main has moved (s-105 → s-120) and the branch is based on s-105, a plain
fast_forwardwill fail — the branch is not strictly ahead of main. - The team's choice is cherry-pick each branch snapshot in order onto main, resolving conflicts where the branch touched a file that main also touched, or rebase by re-running the transformations against the current main head on a new short-lived branch and then fast-forwarding.
Output.
| time | main snapshot | branch snapshot | data files shared? |
|---|---|---|---|
| t0 | s-105 | s-105 | yes (identical) |
| t1 | s-110 | s-c1 | partly — s-c1's files unique to branch |
| t2 | s-115 | s-c2 | partly |
| t3 | s-120 | s-c3 | partly |
| after cherry-picks | s-123 | s-c3 | branch files now also reachable from main |
Rule of thumb. Long-lived branches accumulate divergence. If a branch lives longer than the typical main commit cadence, plan a rebase or cherry-pick strategy from day one. Treat the branch like a feature branch in Git: rebase early, rebase often, or accept that the final landing will be a multi-step cherry-pick.
SQL interview question on sandboxing a backfill
A senior interviewer might frame this as: "You need to backfill 90 days of bad partitions on a 50TB fact table. The downstream consumers cannot tolerate seeing partial backfill state. How would you structure the work using Iceberg branches, and what is your rollback plan?" It probes branch lifecycle, atomic publication, and operator hygiene.
Solution Using a backfill branch, validation gate, and fast-forward publish
-- 1) Create the backfill branch from the current production head
ALTER TABLE db.fact_orders CREATE BRANCH `backfill-2026-Q1`
AS OF VERSION (SELECT snapshot_id FROM db.fact_orders.refs WHERE name='main');
-- 2) Run the backfill targeting the branch
INSERT OVERWRITE TABLE db.fact_orders.`branch_backfill-2026-Q1`
SELECT * FROM staging.corrected_q1_orders;
-- 3) Validation gate — row count, null rate, and per-partition delta vs main
WITH gate AS (
SELECT
COUNT(*) AS row_count_branch,
SUM(CASE WHEN amount IS NULL THEN 1 END) AS null_amount_branch
FROM db.fact_orders VERSION AS OF (
SELECT snapshot_id FROM db.fact_orders.refs WHERE name='backfill-2026-Q1'
)
WHERE order_date BETWEEN DATE '2026-01-01' AND DATE '2026-03-31'
)
SELECT
*,
CASE
WHEN row_count_branch < 12000000 THEN 'FAIL — too few rows'
WHEN null_amount_branch > 0 THEN 'FAIL — null amounts'
ELSE 'PASS'
END AS gate_decision
FROM gate;
-- 4) Publish: fast-forward main to the validated branch head
CALL system.fast_forward('db.fact_orders', 'main', 'backfill-2026-Q1');
-- 5) On failure: drop the branch; main is untouched
-- ALTER TABLE db.fact_orders DROP BRANCH `backfill-2026-Q1`;
Step-by-step trace.
| step | main | backfill branch | gate state |
|---|---|---|---|
| 1 | s-200 | s-200 | not started |
| 2 | s-200 | s-b1 (overwrite Q1) | not started |
| 3 | s-200 | s-b1 | PASS (rows=12.4M, null=0) |
| 4 | s-b1 | s-b1 | published atomically |
| 5 (fail path) | s-200 | dropped | main untouched |
The trace illustrates the central guarantee: between step 2 and step 4, production readers see the old Q1 data on main. Between step 4 (the moment of fast_forward) and step 4 + 1 microsecond, every reader transitions to the new Q1 state. There is no intermediate visibility — exactly the contract the consumers require.
Output:
| metric | value |
|---|---|
| main snapshot before | s-200 |
| main snapshot after | s-b1 |
| backfill rows written | 12_400_000 |
| readers that saw partial state | 0 |
| rollback path on gate failure | drop branch, no further action |
Why this works — concept by concept:
-
Branch isolation — every write targets the
backfill-2026-Q1ref; main is untouched until the explicitfast_forward. Concurrent main commits proceed normally. - Validation gate — runs entirely on the branch snapshot, so the gate is testing exactly the bytes that will be published. No staging-vs-prod skew.
-
Atomic publish —
fast_forwardis a single catalog write that flips the main pointer. No reader sees a half-published state. -
Rollback by dropping a branch — if the gate fails, the branch is dropped, main never moves, and the operator has zero cleanup. The bad data files are reclaimed by the next
expire_snapshotsrun. - No table copy — the branch is O(1) metadata at creation; the data files are written incrementally as the backfill commits. The total cost is "one backfill" plus a handful of ref-row writes.
- Cost — branch create / drop / fast_forward are all O(1) catalog writes. The backfill itself is O(corrected_data) regardless of whether you use a branch.
SQL
Topic — ETL
ETL pipeline problems
3. Iceberg tags — immutable labels for compliance and audit
iceberg tagging pins a snapshot under a stable name — the audit-friendly equivalent of "the version regulators asked for"
The mental model in one line: a tag is an immutable named ref to a specific snapshot — you cannot move it, only create or drop it — which makes tags the right primitive for compliance pins, golden datasets, and "last known good" markers. Once you say "tags do not move; branches do," the entire tagging API and its retention semantics fall out.
The tag primitive in five bullets.
- A tag is a named ref to a single snapshot. Same catalog row shape as a branch; the only difference is the API does not let you move the pointer.
-
Tags are immutable in the move sense, not in the existence sense. You can
dropTagand recreate it pointing to a different snapshot, but you cannot atomically retarget an existing tag. -
Tags protect snapshots from expiration. As long as a tag references a snapshot, that snapshot survives
expire_snapshots— even when the retention window otherwise would have evicted it. -
Tags carry optional retention metadata —
max_ref_age_mslets a tag self-expire after, say, 7 years for SOX compliance. -
Tags are queryable like branches —
SELECT * FROM db.sales VERSION AS OF 'eom-2025-12'reads the snapshot the tag points at.
Tag lifecycle API.
-
createTag(name, snapshot_id?)— pinsnapshot_id(or current head) undername. Fails ifnamealready exists. -
createTag(name, snapshot_id, retention)— attach an explicit lifetime; common for compliance ("keep for 7 years"). -
dropTag(name)— remove the tag ref. The snapshot becomes eligible for expiration if no other ref protects it. -
replaceTag(name, snapshot_id)(rarely available; check engine version) — atomically retarget. Most engines require drop + create. -
AS OF '<tag-name>'— query syntax for tag-aware time travel.
When tags are the right answer.
- Compliance and audit pins — "tag the last good snapshot before the quarterly close" — regulators can re-query the exact bytes that produced the filed numbers.
-
Golden datasets —
gold-v1,gold-v2, ... tag releases of a curated dataset so downstream consumers can pin a known-good version even as you keep iterating on main. -
"Last known good" markers — tag the last snapshot that passed all data-quality checks so a rollback is a one-line
rollback_to_snapshotlookup against the tag. - Disaster-recovery anchors — tag every successful daily run; a corrupted commit can be undone by rolling back to yesterday's tag.
When tags are the wrong answer.
- As a substitute for branches. Tags do not isolate writes; they label snapshots. Use a branch for in-progress work.
- As a substitute for snapshot retention policies. Tags do not change the table-level retention; they pin individual snapshots. The right way to retain "all snapshots for 90 days" is the table retention config, not a tag per snapshot.
- As a per-row label. Tags name table-level points in time, not row-level conditions. Use a column for row labels.
Tag-aware retention — the subtle part.
-
expire_snapshotswalks every ref (branches + tags + main) and computes the set of reachable snapshots. Any snapshot reachable from any ref is kept; everything else is eligible for expiration. -
A tag at snapshot s-090 protects s-090 and its parent chain up to the root, unless you also configure
min_snapshots_to_keepormax_snapshot_age_msat the tag level — in which case the tag protects only s-090 and snapshots within the configured retention. -
Result. Tags are the safest way to keep a specific historical state queryable without ballooning the snapshot count. Pin the snapshots you need; let
expire_snapshotsreclaim the rest.
Compliance use case in one paragraph. Suppose a regulator asks: "Reproduce the revenue number you filed on 2026-01-31." Without tags, you must hope the snapshot from that date is still within your retention window. With tags, you ran createTag('filed-2026-01', <snapshot_at_eom>) the moment you filed, and the snapshot is guaranteed to survive for the configured retention (typically 7 years for SOX). SELECT ... FROM db.sales VERSION AS OF 'filed-2026-01' reproduces the number exactly. That is the entire audit reproducibility story in one tag.
Worked example — tag the last good snapshot before a known-bad release
Detailed explanation. A team is about to deploy a new ETL version that they expect to be safe but want a one-call rollback path. Before the deploy, they tag the current production snapshot as pre-deploy-2026-06-12. If the new ETL produces bad data, rollback is a single command using the tag.
Question. Given db.sales at snapshot s-105 (last good), tag the snapshot as pre-deploy-2026-06-12. Show how to roll back to the tag if the next commit corrupts the table.
Input.
| snapshot_id | committed_at |
|---|---|
| s-104 | 2026-06-12 08:00 |
| s-105 | 2026-06-12 09:30 (last good) |
Code.
-- 1) Tag the last good snapshot before deploy
ALTER TABLE db.sales CREATE TAG `pre-deploy-2026-06-12`
AS OF VERSION (SELECT snapshot_id FROM db.sales.refs WHERE name='main');
-- 2) (Deploy happens, suppose s-106 corrupts the table)
-- 3) Roll back to the tagged snapshot atomically
CALL system.rollback_to_snapshot(
'db.sales',
(SELECT snapshot_id FROM db.sales.refs WHERE name='pre-deploy-2026-06-12')
);
-- 4) Inspect the refs table to confirm the rollback landed
SELECT name, snapshot_id, type FROM db.sales.refs ORDER BY name;
Step-by-step explanation.
-
CREATE TAGwrites a ref row pointing to s-105. The catalog now has two refs to that snapshot:mainandpre-deploy-2026-06-12. - After the deploy commits s-106 (bad), main = s-106 and the tag = s-105.
-
rollback_to_snapshotflips main back to s-105 atomically. The bad s-106 still exists but is no longer current; readers see the s-105 state immediately. - The refs table confirms:
main= s-105 (post rollback) andpre-deploy-2026-06-12= s-105 (tag, unchanged). - The tag remains pinned at s-105 forever (or until you drop it). Even if you later expire s-106 (the bad snapshot), s-105 is permanently protected by the tag.
Output.
| name | snapshot_id | type |
|---|---|---|
| main | s-105 | BRANCH |
| pre-deploy-2026-06-12 | s-105 | TAG |
Rule of thumb. Tag the production snapshot before every deploy, dbt release, or pipeline change. The cost is one ref row; the benefit is a one-command rollback that always lands on a known-good state.
Worked example — query "AS OF" a tag from Spark, Trino, and Snowflake
Detailed explanation. Iceberg tags are first-class in every major reader. The SQL syntax to query "as of" a tag varies slightly per engine but the semantics are identical: pin the table state to the tagged snapshot and run the query.
Question. Given db.sales with a tag gold-v1 pinned at snapshot s-093, write the "as of tag" query in Spark / Trino / Snowflake / Flink syntax. Show that all four return identical results.
Input.
| tag name | snapshot_id |
|---|---|
| gold-v1 | s-093 |
Code.
-- Spark SQL
SELECT region, SUM(amount) AS revenue
FROM db.sales VERSION AS OF 'gold-v1'
WHERE order_date = DATE '2026-05-01'
GROUP BY region;
-- Trino / Presto
SELECT region, SUM(amount) AS revenue
FROM db.sales FOR VERSION AS OF 'gold-v1'
WHERE order_date = DATE '2026-05-01'
GROUP BY region;
-- Snowflake (Iceberg tables)
SELECT region, SUM(amount) AS revenue
FROM db.sales AT (BRANCH => 'gold-v1') -- tag works through the BRANCH/TAG ref API
WHERE order_date = DATE '2026-05-01'
GROUP BY region;
-- Flink SQL
SELECT region, SUM(amount) AS revenue
FROM db.sales /*+ OPTIONS('snapshot-id'='<resolved-from-tag>') */
WHERE order_date = DATE '2026-05-01'
GROUP BY region;
Step-by-step explanation.
- Each engine has slightly different
AS OFsyntax, but all four resolvegold-v1to snapshot s-093 by reading therefsmetadata sub-table. - The planner then walks the s-093 manifest tree to compute the file list, exactly as for any other snapshot read.
- The query returns the same revenue numbers in every engine because the underlying data files are identical — they are addressed by the snapshot id, not the engine.
- This is the multi-engine reader contract Iceberg promises: Spark, Trino, Snowflake, Flink, DuckDB, ClickHouse, StarRocks, Athena, BigQuery all see the same view of the same snapshot.
Output (identical across engines).
| region | revenue |
|---|---|
| EU | 1_240_500 |
| US | 980_400 |
| APAC | 420_900 |
Rule of thumb. When you publish a "golden" dataset version, give it a tag. Consumers on any engine can pin their reads to the tag with one syntax change, and the engine's planner does the rest. Tags are the cross-engine cross-team contract.
Worked example — protect a tag from snapshot expiration
Detailed explanation. A common production failure: the snapshot retention is 7 days, an auditor asks for the month-end snapshot 30 days later, and expire_snapshots has already reclaimed it. Tags solve this by pinning the snapshot independently of the retention window.
Question. Configure db.sales so that the month-end snapshot is preserved for 7 years (SOX requirement) even though the default snapshot retention is 7 days.
Input.
| config | value |
|---|---|
| default snapshot retention | 7 days |
| month-end snapshot id | s-093 |
| requirement | keep s-093 for 7 years |
Code.
-- 1) Create the tag with explicit retention
ALTER TABLE db.sales CREATE TAG `eom-2026-05`
AS OF VERSION 93
RETAIN 2557 DAYS; -- ~7 years
-- 2) Configure the table-level retention defaults (apply to non-tagged snapshots)
ALTER TABLE db.sales SET TBLPROPERTIES (
'history.expire.max-snapshot-age-ms' = '604800000', -- 7 days
'history.expire.min-snapshots-to-keep' = '5'
);
-- 3) Run the periodic expire job (typically Airflow daily)
CALL system.expire_snapshots(
table => 'db.sales',
older_than => TIMESTAMP '2026-06-05 00:00:00',
retain_last => 5
);
-- 4) Verify the tagged snapshot survives
SELECT snapshot_id, committed_at
FROM db.sales.snapshots
WHERE snapshot_id = 93;
Step-by-step explanation.
- The tag
eom-2026-05pins s-093 and carries its ownRETAIN 2557 DAYSretention, overriding the table-level "expire after 7 days" rule for this specific snapshot. - The table-level config is unchanged for other snapshots — appends and daily commits still age out after 7 days.
- When
expire_snapshotsruns, it walks every ref (main, tags, branches) and computes "reachable snapshots." s-093 is reachable from the tag, so it is kept. - Seven years later, when the auditor queries
db.sales VERSION AS OF 'eom-2026-05', the snapshot is still there. The query reproduces the month-end revenue numbers exactly.
Output.
| snapshot_id | committed_at | retained because |
|---|---|---|
| s-093 | 2026-05-31 23:55 | tag eom-2026-05 (RETAIN 7 years) |
Rule of thumb. Compliance pins go in tags with explicit RETAIN clauses. Never rely on table-level retention for "must-survive-for-N-years" snapshots — tags are the contract.
SQL interview question on tag-based audit reproducibility
A senior interviewer often asks: "How would you guarantee that any historical revenue number you filed with the regulator is byte-reproducible for 7 years, even though your team runs OPTIMIZE and expire_snapshots daily?" It probes tag retention, the interaction between tags and expiration, and the audit-trail story.
Solution Using compliance tags with explicit retention
-- 1) Tag the filed snapshot the moment the filing is finalised
ALTER TABLE finance.revenue CREATE TAG `filed-2026-Q1`
AS OF VERSION (SELECT snapshot_id FROM finance.revenue.refs WHERE name='main')
RETAIN 2557 DAYS; -- 7 years SOX retention
-- 2) Record the tag plus the snapshot id in the filings audit ledger
INSERT INTO finance.filings_audit (filing_id, table_name, snapshot_id, tag_name, filed_at)
SELECT
'Q1-2026-revenue',
'finance.revenue',
r.snapshot_id,
r.name,
CURRENT_TIMESTAMP
FROM finance.revenue.refs r
WHERE name = 'filed-2026-Q1';
-- 3) Years later: reproduce the filed number
SELECT
region,
SUM(amount) AS filed_revenue
FROM finance.revenue VERSION AS OF 'filed-2026-Q1'
WHERE filing_period = 'Q1-2026'
GROUP BY region;
Step-by-step trace.
| step | action | catalog state |
|---|---|---|
| 1 | createTag filed-2026-Q1 RETAIN 7y | new ref protecting s-093 |
| 2 | insert into audit ledger | external row capturing tag |
| 3 (1 year later) | expire_snapshots(older_than=300d, retain_last=5) | s-093 protected; other s ≤ 300d gone |
| 4 (7 years later) | SELECT VERSION AS OF 'filed-2026-Q1' | snapshot s-093 still queryable |
| 5 (7y + 1d later) | tag self-expires per RETAIN clause | snapshot s-093 eligible to expire |
The trace shows the contract: the tag's RETAIN 7 years clause makes s-093 immune to expire_snapshots for the full 7 years, after which the tag itself expires and the snapshot becomes a normal candidate for the table's default retention.
Output:
| audit query | result | reproducible? |
|---|---|---|
| Q1-2026 revenue at filing time | 4_205_300 | yes |
| Q1-2026 revenue 1 year later | 4_205_300 | yes (tag intact) |
| Q1-2026 revenue 7 years later | 4_205_300 | yes (tag still intact) |
Why this works — concept by concept:
-
Tag as an immutable pin — the tag binds s-093 to a stable name; the tag's
RETAINclause overrides table-level retention. - Audit ledger as external evidence — recording the tag name + snapshot id in an external ledger means even if the Iceberg metadata is somehow lost, the audit trail names the exact snapshot.
- expire_snapshots respects refs — the expiration job walks every ref (branches + tags + main) and only reclaims snapshots not reachable from any ref. Tags guarantee reachability.
-
VERSION AS OF resolves the tag at read time — the engine reads the
refssub-table, finds the snapshot id, and walks the manifest tree. No filename lookups, no path hacks. - Multi-engine reproducibility — because the data files are immutable Parquet referenced by the manifest tree, Spark, Trino, Snowflake, and Flink all return identical results when reading the same tag.
- Cost — one ref row plus the protected snapshot's files. The 7-year retention is a storage decision, not a compute decision; the read cost is identical to any other snapshot read.
SQL
Topic — design
Design problems
4. WAP pattern — write to audit branch, validate, fast-forward to main
iceberg wap is the production-grade write audit publish pattern — branch isolation makes the staging table obsolete
The mental model in one line: WAP writes the new snapshot to an isolated audit branch, runs validation against that branch, and on pass fastForwards main to the validated snapshot — making "no consumer ever sees half-published data" a structural guarantee rather than a hope. Once you say "the audit branch IS the staging table — same files, isolated ref," WAP becomes a 5-line ritual instead of a 100-line copy-then-swap.
The WAP contract in five bullets.
-
Write. The transform writes a new snapshot to an audit branch (e.g.
audit-<runid>), not to main. Main remains at the previous good state. - Audit. Validation checks run against the audit branch — row count vs expected, null rates, freshness, business-rule predicates. The audit reads the exact bytes that will publish.
-
Publish. If audit passes,
fastForward('main', 'audit-<runid>')is the single atomic operation that makes the new snapshot live. Readers transition without seeing a partial state. -
Drop on failure. If audit fails, the audit branch is dropped. Main never moves; the bad data files are reclaimed by the next
expire_snapshots. -
Multi-engine reader isolation. During the audit window, Spark / Trino / Snowflake readers on
mainsee the prior snapshot; only the audit job (which explicitly reads theaudit-<runid>branch) sees the new state.
Pre-branches WAP (the legacy pattern).
-
spark.wap.id— a Spark session property. When set, every commit attaches the WAP id as astagedsnapshot rather than landing as the new current snapshot. -
Publish — a separate procedure call (
publish_changes) atomically promotes the staged snapshot to current. - Drawback. Only Spark wrote staged snapshots; Trino / Flink could not participate. Readers on other engines either saw the staged snapshot (depending on engine version) or were blocked from reading altogether.
Modern branch-based WAP.
- Engine-agnostic. Any engine that supports branches (Spark, Trino, Flink, Snowflake, Polars + Iceberg) can write to and read from the audit branch.
- Composable. The audit branch is a real branch — you can branch-from-branch, cherry-pick, or run multiple audits in parallel.
-
No special session property. The branch ref is the entire mechanism; no
spark.wap.idneeded. -
The 2026 rule. All new WAP implementations should use branches. The
spark.wap.idproperty is legacy.
Audit checks that belong in a WAP gate.
-
Row count vs expected —
(new_rows / prior_rows)within [0.8, 1.2] for daily appends; tighter for slowly-changing dimensions. -
Null rates per critical column —
(null_count / row_count)below a threshold (e.g. < 1% forcustomer_id, < 5% foremail). -
Freshness —
MAX(event_time)within the SLA window (typically>= now() - 2hfor hourly tables). -
Business rules —
SUM(amount)per partition above a floor (catches "the integration silently zeroed out a column"). - Schema — match the contract: every required column present, types unchanged. Use a JSON Schema or dbt contract for portability.
-
Referential integrity — every
customer_idin the new snapshot exists indim_customer. Detects orphans before they hit BI.
Tools that play well with branch-WAP.
-
dbt —
--targetflag pluspre_hook/post_hookto wire branch creation, model build, and fast-forward. dbt-Iceberg adapter handles the branch syntax. -
Great Expectations — runs as the audit gate; pointed at the audit branch via
VERSION AS OF. - Soda Core — declarative checks; supports Iceberg via Spark/Trino connections.
-
Airflow / Dagster — orchestrate the create-branch → run-transform → audit → publish-or-drop sequence as discrete tasks. The publish task is a 1-line
fast_forwardcall.
Common interview probes on WAP.
- "What is WAP and why does it matter?" — write-audit-publish, the contract that no consumer ever sees a partial or invalid commit. Matters because object-storage tables would otherwise expose half-written state.
- "How does branch-based WAP differ from
spark.wap.id?" — branches are engine-agnostic, composable, and do not require a session property.spark.wap.idwas Spark-only. - "What happens to the bad data on a WAP failure?" — the audit branch is dropped; the files are reclaimed by the next
expire_snapshots. Main never moves. - "Can multiple WAP runs proceed in parallel?" — yes, one branch per run id. They publish in their respective fast-forward order. Conflicts between runs are detected by the catalog's last-writer-wins retry.
Worked example — Spark write to the audit branch
Detailed explanation. A daily ETL appends new orders to db.fact_orders. Under branch-WAP, the writer targets the audit-20260612 branch instead of main; the audit gate then reads from that branch.
Question. Show the Spark / SQL code that creates an audit branch from the current main head, appends today's orders to the branch, and registers the branch in a run-id ledger for later audit.
Input.
| run_id | source rows | target table | branch ref |
|---|---|---|---|
| 20260612 | 124_000 | db.fact_orders | audit-20260612 |
Code.
-- 1) Create the audit branch from current main
ALTER TABLE db.fact_orders CREATE BRANCH `audit-20260612`
AS OF VERSION (SELECT snapshot_id FROM db.fact_orders.refs WHERE name='main');
-- 2) Append today's orders to the branch
INSERT INTO db.fact_orders.`branch_audit-20260612`
SELECT
order_id, customer_id, order_date, amount, region
FROM staging.orders_20260612;
-- 3) Register the audit run in the lineage ledger
INSERT INTO etl.runs (run_id, table_name, branch_name, started_at)
VALUES ('20260612', 'db.fact_orders', 'audit-20260612', CURRENT_TIMESTAMP);
Step-by-step explanation.
-
CREATE BRANCH audit-20260612writes one ref row pointing at main's current snapshot. No data is copied. -
INSERT INTO ...branch_audit-20260612appends to the branch. The new files land in S3 under the table's data prefix and are referenced by the branch's manifest tree. Main is untouched. - The ledger row captures the run id and the branch name so downstream auditors and the publish task can find the right branch by run id rather than guessing the name.
- At this point, production readers on main see the previous day's data; only readers that explicitly target
branch_audit-20260612see today's appends.
Output.
| step | main snapshot | audit-20260612 snapshot | rows visible on main |
|---|---|---|---|
| pre | s-200 | (not created) | 12_000_000 |
| 1 | s-200 | s-200 | 12_000_000 |
| 2 | s-200 | s-a1 | 12_000_000 |
| 3 | s-200 | s-a1 | 12_000_000 (ledger row added) |
Rule of thumb. Every WAP writer targets a branch named after the run id. The branch name is the link between the orchestrator's run id, the audit gate, the publish task, and the post-mortem audit trail. Never let the writer commit directly to main during WAP.
Worked example — dbt-style audit gate (row counts, null rate, freshness)
Detailed explanation. The audit gate reads the audit branch and verifies three invariants: row count is within tolerance, the critical column has a null rate below threshold, and freshness is within SLA. If any check fails, the gate returns FAIL and the publish task is skipped.
Question. Write the audit gate SQL for db.fact_orders on the audit-20260612 branch. Pass condition: rows in [80%, 120%] of yesterday's count, null_customer_id rate < 1%, MAX(order_date) = today.
Input.
| metric | expected |
|---|---|
| yesterday row count | 120_000 |
| min today rows | 96_000 |
| max today rows | 144_000 |
| max null_customer_id rate | 0.01 |
| freshness | MAX(order_date) = today |
Code.
WITH branch_state AS (
SELECT snapshot_id FROM db.fact_orders.refs WHERE name = 'audit-20260612'
),
branch_metrics AS (
SELECT
COUNT(*) AS row_count,
SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_customer,
MAX(order_date) AS max_order_date
FROM db.fact_orders VERSION AS OF (SELECT snapshot_id FROM branch_state)
WHERE order_date = CURRENT_DATE
),
yesterday AS (
SELECT COUNT(*) AS row_count
FROM db.fact_orders
WHERE order_date = CURRENT_DATE - INTERVAL '1' DAY
)
SELECT
bm.row_count,
bm.null_customer,
bm.max_order_date,
y.row_count AS y_row_count,
CASE
WHEN bm.row_count < 0.8 * y.row_count THEN 'FAIL — too few rows'
WHEN bm.row_count > 1.2 * y.row_count THEN 'FAIL — too many rows'
WHEN bm.null_customer > 0.01 * bm.row_count THEN 'FAIL — null_customer_id rate too high'
WHEN bm.max_order_date <> CURRENT_DATE THEN 'FAIL — stale data'
ELSE 'PASS'
END AS gate_decision
FROM branch_metrics bm
CROSS JOIN yesterday y;
Step-by-step explanation.
-
branch_stateresolves the run-id branch to its current snapshot id by reading therefssub-table. -
branch_metricsreads the audit branch at that snapshot viaVERSION AS OF— exactly the bytes that will publish if the gate passes. -
yesterdayreads main (the unqualified table reference) for the comparison baseline. Note: it reads main, not the audit branch — that is the whole point of the isolation. - The final SELECT computes the three pass conditions; the first failing check wins. The result is a single-row decision that the orchestrator inspects.
- On
PASS, the orchestrator callsfast_forward. OnFAIL, it callsDROP BRANCHand pages the on-call.
Output (pass case).
| row_count | null_customer | max_order_date | y_row_count | gate_decision |
|---|---|---|---|---|
| 124_000 | 80 | 2026-06-12 | 120_000 | PASS |
Output (fail case — too few rows).
| row_count | null_customer | max_order_date | y_row_count | gate_decision |
|---|---|---|---|---|
| 60_000 | 40 | 2026-06-12 | 120_000 | FAIL — too few rows |
Rule of thumb. The gate must read the audit branch (via VERSION AS OF), not main. If you accidentally read main, you are checking the previous day's data and the gate will pass on every run — a silent data outage.
Worked example — fast-forward publish vs rollback on failure
Detailed explanation. The publish step is a single API call. Branch-based WAP turns "did the publish succeed?" into "did the catalog write succeed?" — a binary, retriable signal.
Question. Show the publish-or-rollback orchestrator logic for db.fact_orders after the audit gate returns PASS or FAIL.
Input.
| audit result | action |
|---|---|
| PASS | fast_forward main → audit-20260612 |
| FAIL | drop branch, alert |
Code.
# Pseudocode for the orchestrator (Airflow / Dagster task)
def publish_or_rollback(table, branch, audit_result):
if audit_result == "PASS":
spark.sql(f"CALL system.fast_forward('{table}', 'main', '{branch}')")
spark.sql(f"ALTER TABLE {table} DROP BRANCH `{branch}`")
return "published"
else:
spark.sql(f"ALTER TABLE {table} DROP BRANCH `{branch}`")
page_oncall(
severity="P2",
message=f"WAP gate failed on {table} branch={branch}: {audit_result}",
)
return "rolled_back"
Step-by-step explanation.
- On PASS,
fast_forwardflips main's pointer to the audit branch's head in one atomic catalog write. Every reader transitions to the new state. - After fast-forward, the audit branch ref still points at the (now-shared) head snapshot. Dropping it cleans up the ref; the data files are already part of main's lineage.
- On FAIL, dropping the branch removes the only ref pointing to the bad snapshot. The files become unreachable and will be reclaimed by the next
expire_snapshotsjob. - The page-on-call call is what makes WAP a real production pattern — silent failures still need humans.
Output.
| outcome | main pointer | branch ref | data files |
|---|---|---|---|
| PASS | moved to s-a1 | dropped | bad files become main files |
| FAIL | unchanged at s-200 | dropped | bad files orphaned, reclaimed later |
Rule of thumb. Treat the publish step as an idempotent operation — if the orchestrator crashes mid-step, retry the fast-forward. The catalog's atomic semantics make the retry safe even if the fast-forward partially succeeded.
Worked example — multi-engine reader isolation during the audit window
Detailed explanation. During the audit window, Spark / Trino / Snowflake readers all see the prior good main snapshot. Only the audit gate (which explicitly targets the branch) sees the in-flight data.
Question. Show what each of three readers sees during the audit window for db.fact_orders: a Trino dashboard query, a Snowflake batch job, and the Spark audit gate.
Input.
| engine | query target | expects to see |
|---|---|---|
| Trino | main (default) | prior good snapshot |
| Snowflake | main (default) | prior good snapshot |
| Spark audit | audit-20260612 branch | in-flight snapshot |
Code.
-- Trino — production dashboard (main, prior good)
SELECT region, SUM(amount) AS revenue
FROM iceberg.db.fact_orders
WHERE order_date = CURRENT_DATE - INTERVAL '1' DAY
GROUP BY region;
-- Snowflake — overnight finance batch (main, prior good)
SELECT region, SUM(amount) AS revenue
FROM db.fact_orders
WHERE order_date = CURRENT_DATE - INTERVAL '1 DAY'
GROUP BY region;
-- Spark — audit gate reads the audit branch only
SELECT region, SUM(amount) AS revenue
FROM db.fact_orders VERSION AS OF (
SELECT snapshot_id FROM db.fact_orders.refs WHERE name='audit-20260612'
)
WHERE order_date = CURRENT_DATE
GROUP BY region;
Step-by-step explanation.
- Trino's planner resolves
iceberg.db.fact_ordersagainst its catalog, findsmain's snapshot pointer at s-200, and reads that snapshot. The in-flight audit branch is invisible. - Snowflake's planner does the same against its catalog binding. The audit branch is also invisible to Snowflake.
- Spark's audit gate explicitly resolves the audit branch from
refsand pins the snapshot for the gate query. It sees the in-flight bytes — exactly what will publish. - The isolation is provided by the catalog's ref model. No engine sees the audit branch unless it explicitly asks for it.
Output.
| engine | snapshot read | data seen |
|---|---|---|
| Trino dashboard | s-200 (main) | prior good |
| Snowflake batch | s-200 (main) | prior good |
| Spark audit gate | s-a1 (audit) | in-flight |
Rule of thumb. Branch-based WAP provides multi-engine reader isolation out of the box. You do not need to coordinate engines, freeze readers, or run a maintenance window — the catalog ref model does it for you.
SQL interview question on designing WAP for a 100TB fact table
A senior interviewer might say: "Design a WAP pattern for a 100TB fact table written every 15 minutes by 8 parallel Spark jobs, read by Trino dashboards every second, and audited by a Great Expectations suite. What does the catalog look like, and what is your failure-mode story?" It probes orchestrator design, branch naming, retention, and the multi-engine isolation guarantee.
Solution Using one audit branch per run-id with fast-forward publish
-- The contract:
-- 1. Writer task creates branch audit-<run_id> from current main
-- 2. Writer commits to the branch (parallel writers retry on last-writer-wins)
-- 3. Audit task runs validations against the branch
-- 4. Publish task fast-forwards main on PASS, drops branch on FAIL
-- 5. Daily maintenance expires orphaned files from FAIL runs
-- 1) Writer (one of N parallel jobs)
ALTER TABLE db.fact_orders CREATE BRANCH IF NOT EXISTS `audit-${run_id}`
AS OF VERSION (SELECT snapshot_id FROM db.fact_orders.refs WHERE name='main');
INSERT INTO db.fact_orders.`branch_audit-${run_id}`
SELECT * FROM staging.orders_partition_${shard_id};
-- 2) Audit gate (Great Expectations / Soda / dbt-test pointed at the branch)
-- (returns PASS / FAIL)
-- 3) Publish (orchestrator)
CALL system.fast_forward('db.fact_orders', 'main', 'audit-${run_id}');
ALTER TABLE db.fact_orders DROP BRANCH `audit-${run_id}`;
-- 4) Daily maintenance
CALL system.expire_snapshots(
table => 'db.fact_orders',
older_than => CURRENT_TIMESTAMP - INTERVAL '7' DAY,
retain_last => 50
);
CALL system.remove_orphan_files(
table => 'db.fact_orders',
older_than => CURRENT_TIMESTAMP - INTERVAL '3' DAY
);
Step-by-step trace.
| time | event | main pointer | audit branch | gate |
|---|---|---|---|---|
| t0 | run starts | s-200 | (create at s-200) | pending |
| t0 + 5m | 8 parallel writers commit to branch | s-200 | s-a8 | pending |
| t0 + 8m | audit reads branch | s-200 | s-a8 | running |
| t0 + 10m | gate PASS | s-200 | s-a8 | PASS |
| t0 + 10m + 1ms | fast_forward | s-a8 | s-a8 | published |
| t0 + 10m + 2ms | drop branch | s-a8 | (dropped) | done |
| next day 02:00 | maintenance | s-a14 (more runs) | -- | -- |
The trace shows the central guarantee: during the entire t0 → t0+10m window, the 1000 Trino dashboard readers per second see s-200 — completely consistent prior state. At t0+10m+1ms, every new reader sees s-a8. There is no transition state.
Output:
| metric | value |
|---|---|
| writes per day | 96 |
| audit-branch refs created per day | 96 |
| audit-branch refs alive at end of day | 0 (all dropped) |
| storage overhead from WAP | < 1% (only failed-audit files) |
| readers that saw partial state | 0 |
| catalog write per publish | 1 (the fast_forward) |
Why this works — concept by concept:
- Branch per run-id — names the audit window uniquely and prevents collisions between parallel runs. The orchestrator's run id is the universal handle.
- Multiple writers on one branch — the catalog's last-writer-wins retry handles parallel writers natively. Each writer commits its shard; the final branch snapshot is the union.
- Audit reads the branch — exactly the bytes that will publish. No staging-vs-prod skew.
- fast_forward is one catalog write — atomic publish. Every reader transitions at the same instant.
- Drop branch on success — keeps the catalog clean; data files survive under main's lineage.
- expire_snapshots reclaims orphans — failed-audit branches' files become unreachable and are reclaimed by the daily maintenance.
- Cost — proportional to data (writes), not table size. The branch-create / drop / fast-forward are O(1) catalog writes. Maintenance is O(snapshots + files) and runs off-peak.
SQL
Topic — streaming
Streaming ingestion patterns
5. Production ops — snapshot expiration, OPTIMIZE, schema evolution
Iceberg health is a maintenance cadence — expire_snapshots, rewrite_data_files, rewrite_manifests, and remove_orphan_files are the four sectors of the production wheel
The mental model in one line: a healthy Iceberg table needs four periodic ops — expire snapshots (control history), OPTIMIZE / rewrite_data_files (bin-pack small files), rewrite_manifests (keep planning fast), and remove_orphan_files (reclaim storage from failed writes) — together they keep planning latency low and S3 cost bounded. Once you say "Iceberg is a self-managing format only if you actually run the management," the production cadence becomes a daily, weekly, monthly checklist.
The four maintenance sectors.
-
expire_snapshots— removes old snapshot manifests beyond the retention window, freeing the data files they uniquely referenced. Tags and branches protect their snapshots. Daily cadence on hot tables, weekly on slow ones. -
rewrite_data_files(OPTIMIZE) — compacts small data files into bin-packed target-sized files. Cuts planning time and read latency. Weekly cadence on append-heavy tables; on-demand after large overwrite jobs. -
rewrite_manifests— re-buckets the manifest list to keep manifests balanced by partition and size. Critical for tables with many commits per day. Monthly cadence usually suffices. -
remove_orphan_files— finds files in the table's S3 prefix not referenced by any snapshot (results of failed writes, crashed jobs, dropped branches) and deletes them. Weekly cadence; run only afterexpire_snapshotshas settled.
Schema evolution — the safe parts and the unsafe parts.
- Add column. Always safe — new column appended, existing rows have NULL for the new column. No file rewrite.
- Drop column. Safe — column is removed from the schema; underlying files keep the column data but readers stop seeing it. No file rewrite.
- Rename column. Safe — Iceberg uses field-ids, not column names, so a rename is a metadata-only operation. No file rewrite. (Compare to Hive, where a rename was a full rewrite.)
- Reorder columns. Safe — the schema is field-id-keyed, so column order is a presentation choice.
-
Widen type. Safe in many cases —
INT → BIGINT,FLOAT → DOUBLE,DECIMAL(p, s) → DECIMAL(p', s)withp' > pand sames. - Narrow type. Unsafe — would lose precision or range. Iceberg refuses.
-
Change type incompatibly (
STRING → INT). Unsafe — requires a backfill via a new column + cutover. - Promote nullable → required. Unsafe — would require validating every existing row. Iceberg refuses unless the table is empty.
Partition evolution.
- Iceberg supports changing the partition spec without rewriting old data — a textbook feature no other format offers.
-
Example. A table starts partitioned by
month(event_time); after a year, traffic grows and you want daily partitions.ALTER TABLE ... ADD PARTITION FIELD day(event_time)switches the spec. New data is partitioned daily; old data stays in monthly partitions. - Readers handle it transparently — the manifest tree records which partition spec each file was written under; the planner prunes per-spec.
- Caveat. Partition evolution is most useful as a future-proofing escape hatch, not a routine knob. Switching specs every quarter is a sign the original spec was wrong; pick a defensive spec at design time.
Catalog choice — the production decision that shapes everything.
- REST catalog (e.g. Tabular, Polaris, Lakekeeper) — engine-agnostic, vendor-neutral, audited via HTTP. The right default for a multi-engine shop.
- AWS Glue — managed, low-friction in AWS, but locks you to AWS for catalog operations. Cheap and easy at small-to-medium scale.
- Nessie — Git-style catalog with branching at the catalog level (not just the table level). Powerful for multi-table transactions; richer than the per-table branches.
- Snowflake-managed — Snowflake owns the catalog; other engines read through Snowflake's Iceberg APIs. Great if Snowflake is your primary engine.
- Unity Catalog — Databricks' catalog; Iceberg compatibility is improving. Right answer for Databricks-heavy shops moving toward open formats.
- Polaris — Snowflake's open-source REST catalog implementation. Multi-engine friendly with first-class tag/branch support.
Multi-engine reader matrix (2026).
- Spark — read + write + admin (full).
- Trino / Presto — read + write + admin (full).
- Flink — read + write (full streaming + batch).
- Snowflake — read + write (Iceberg-managed or external).
- DuckDB — read (via the iceberg extension); write support landing in 2026.
- ClickHouse — read (via the Iceberg engine); write support nascent.
- StarRocks — read (full); write support landing.
- Athena (AWS) — read + write.
- BigQuery — read (via BigLake / Iceberg external tables); write maturing.
Common interview probes on production ops.
- "How often should I run
expire_snapshots?" — daily on tables with frequent commits; weekly is fine on slow tables. Always after a large OPTIMIZE. - "What is the difference between
rewrite_data_filesandrewrite_manifests?" — the first compacts data files; the second compacts metadata. Both reduce planning time, but for different reasons. - "How do you know when OPTIMIZE is needed?" — average data-file size below ~128MB on a partition, or
db.<table>.filesshowing > 1000 files per active partition. Both are signs of read-time scan amplification. - "Can I rename a column on a 100TB Iceberg table without rewriting data?" — yes. Rename is a metadata-only operation because Iceberg uses field-ids.
Worked example — daily snapshot expiration job protecting tags
Detailed explanation. A daily Airflow task expires snapshots older than 7 days while protecting every tagged snapshot. Tags marked for SOX (e.g. eom-*, filed-*) survive indefinitely until their tag is explicitly dropped.
Question. Write the expire_snapshots call for db.fact_orders with a 7-day retention, keeping the last 30 snapshots minimum, and verifying that tagged snapshots survive.
Input.
| config | value |
|---|---|
| older_than | 7 days ago |
| retain_last | 30 |
| protected tags | eom-2026-05, filed-2026-Q1 |
Code.
-- 1) Inspect refs to confirm tag protection
SELECT name, snapshot_id, type
FROM db.fact_orders.refs
WHERE type = 'TAG';
-- 2) Run the expire job
CALL system.expire_snapshots(
table => 'db.fact_orders',
older_than => CURRENT_TIMESTAMP - INTERVAL '7' DAY,
retain_last => 30
);
-- 3) Verify tagged snapshots still exist
SELECT
r.name AS tag_name,
r.snapshot_id,
s.committed_at,
CASE WHEN s.snapshot_id IS NULL THEN 'MISSING' ELSE 'PRESENT' END AS status
FROM db.fact_orders.refs r
LEFT JOIN db.fact_orders.snapshots s ON r.snapshot_id = s.snapshot_id
WHERE r.type = 'TAG';
Step-by-step explanation.
- The refs pre-check captures the tags that must survive — useful as a baseline for the post-check.
-
expire_snapshotscomputes the set of reachable snapshots: every snapshot referenced (directly or via the parent chain) by any ref (main, branches, tags). Snapshots not reachable AND older than 7 days AND not in the last 30 are deleted. - The post-check confirms that every tag still resolves to a present snapshot. If any tag shows
MISSING, an alert fires (this should never happen in correctly configured tables). - The data files freed by expired snapshots are not deleted immediately — they are simply unreferenced. The next
remove_orphan_filescall reclaims them.
Output.
| tag_name | snapshot_id | committed_at | status |
|---|---|---|---|
| eom-2026-05 | 93 | 2026-05-31 23:55 | PRESENT |
| filed-2026-Q1 | 87 | 2026-04-15 09:00 | PRESENT |
Rule of thumb. Run expire_snapshots daily on hot tables. Tags are the only contract that guarantees a snapshot will survive — table-level retention can change, but tags are explicit.
Worked example — OPTIMIZE (rewrite_data_files) with bin-pack target
Detailed explanation. A table with hundreds of thousands of small Parquet files per partition plans queries slowly because the planner must open every file's footer. rewrite_data_files reads small files in a partition and writes them out as fewer, larger bin-packed files (typically 512MB target). Read latency drops by 5-10x on heavy partitions.
Question. Run rewrite_data_files on the EU partition of db.fact_orders, targeting 512MB output files, and verify that the data-file count dropped while the row count stayed identical.
Input.
| partition | files before | avg size before | rows before |
|---|---|---|---|
| EU | 1240 | 12 MB | 4_200_000 |
Code.
-- 1) Inspect data-file distribution for the EU partition
SELECT
file_path,
file_size_in_bytes,
record_count
FROM db.fact_orders.files
WHERE partition.region = 'EU'
ORDER BY file_size_in_bytes ASC
LIMIT 5;
-- 2) Run bin-pack rewrite on the EU partition
CALL system.rewrite_data_files(
table => 'db.fact_orders',
options => map(
'target-file-size-bytes', '536870912', -- 512 MB
'min-input-files', '5',
'max-concurrent-file-group-rewrites', '8'
),
where => "region = 'EU'"
);
-- 3) Verify file count dropped and row count is identical
SELECT
COUNT(*) AS file_count,
SUM(file_size_in_bytes) / 1e9 AS total_gb,
SUM(record_count) AS row_count
FROM db.fact_orders.files
WHERE partition.region = 'EU';
Step-by-step explanation.
- The pre-check confirms the symptom: ~1240 files averaging 12MB — far below the 128MB-512MB sweet spot most engines prefer.
-
rewrite_data_filesreads the small files inwhere-matched partitions and writes new files oftarget-file-size-bytes. The procedure commits a new snapshot whoseadded_data_filesare the new large files anddeleted_data_filesare the old small ones. Row counts are preserved by construction. - The post-check shows the new file count (~12 files of ~500MB each), same total bytes (a tiny overhead from format headers), same row count.
- Planning time on the next query against the EU partition drops by an order of magnitude because the planner now opens 12 footers instead of 1240.
Output.
| metric | before | after |
|---|---|---|
| file_count | 1240 | 12 |
| total_gb | 14.5 | 14.5 |
| row_count | 4_200_000 | 4_200_000 |
| planning latency (typical) | 6.4 s | 0.5 s |
Rule of thumb. Run weekly rewrite_data_files on append-heavy tables. Target 256-512MB per file. Skip partitions whose average file size is already in range — the procedure's min-input-files option lets you do this automatically.
Worked example — schema evolution: add column, drop column, rename
Detailed explanation. Iceberg's field-id-keyed schema makes column ops metadata-only. Adding, dropping, or renaming a column does not rewrite any data files — it updates the schema metadata and changes how the readers map files to columns.
Question. On db.fact_orders, add a currency column (default NULL), rename amount to amount_local, drop the obsolete legacy_flag column. Verify zero file rewrites.
Input.
| operation | column |
|---|---|
| add | currency STRING |
| rename | amount → amount_local |
| drop | legacy_flag |
Code.
-- 1) Inspect current schema
DESCRIBE TABLE db.fact_orders;
-- 2) Schema evolution — all three operations are metadata-only
ALTER TABLE db.fact_orders ADD COLUMN currency STRING;
ALTER TABLE db.fact_orders RENAME COLUMN amount TO amount_local;
ALTER TABLE db.fact_orders DROP COLUMN legacy_flag;
-- 3) Verify no data files were rewritten by the schema change
SELECT
snapshot_id, operation,
summary['added-data-files'] AS added,
summary['deleted-data-files'] AS deleted
FROM db.fact_orders.snapshots
ORDER BY committed_at DESC
LIMIT 4;
Step-by-step explanation.
- The schema pre-check captures the current columns.
amount,legacy_flag, etc. - Each
ALTER TABLEwrites a new snapshot whoseoperationisreplace(schema update) and whosesummaryshowsadded_data_files = 0,deleted_data_files = 0— the data is untouched. - The
RENAMEworks because Iceberg stores column field-ids in the manifest tree; the rename changes the schema's name → field-id mapping without touching the files. - The
DROP COLUMNremoves the column from the schema; existing files still contain the data, but readers no longer expose it. The disk usage is unchanged until a futurerewrite_data_fileswrites new files without the dropped column. - The post-check confirms three new snapshots, each with zero data-file delta. Schema evolution at scale, for free.
Output.
| snapshot_id | operation | added | deleted |
|---|---|---|---|
| s-201 | replace (add currency) | 0 | 0 |
| s-202 | replace (rename amount) | 0 | 0 |
| s-203 | replace (drop legacy_flag) | 0 | 0 |
| s-200 | append (prior) | 12 | 0 |
Rule of thumb. Schema evolution on Iceberg is a metadata operation, not a rewrite. The cost is one catalog write per change. Compared to Hive, where a column rename was a full table copy, this is the single most-quoted operational win of the format.
Worked example — multi-engine read of the same Iceberg table
Detailed explanation. A typical 2026 stack has Spark for batch transforms, Trino for ad-hoc analytics, Snowflake for finance reporting, and Flink for streaming. All four read the same Iceberg table through the catalog; none of them needs format-specific glue.
Question. Show the canonical SELECT against db.fact_orders in Spark / Trino / Snowflake / Flink. Confirm that the result is identical.
Input.
| engine | catalog binding |
|---|---|
| Spark | spark_catalog → REST |
| Trino | iceberg.db.fact_orders → REST |
| Snowflake | db.fact_orders (Iceberg external) |
| Flink | iceberg_catalog.db.fact_orders → REST |
Code.
-- Spark
SELECT region, SUM(amount_local) AS revenue
FROM db.fact_orders
WHERE order_date = DATE '2026-06-12'
GROUP BY region;
-- Trino
SELECT region, SUM(amount_local) AS revenue
FROM iceberg.db.fact_orders
WHERE order_date = DATE '2026-06-12'
GROUP BY region;
-- Snowflake (Iceberg external table)
SELECT region, SUM(amount_local) AS revenue
FROM db.fact_orders
WHERE order_date = '2026-06-12'
GROUP BY region;
-- Flink SQL
SELECT region, SUM(amount_local) AS revenue
FROM iceberg_catalog.db.fact_orders
WHERE order_date = DATE '2026-06-12'
GROUP BY region;
Step-by-step explanation.
- Each engine resolves the table identifier against its catalog binding. All four catalogs point to the same Iceberg metadata location.
- Each engine's planner reads the current snapshot's manifest tree from the same S3 prefix and prunes files using column-level stats.
- Each engine executes the same SQL semantics over the same data files. The result is byte-identical because the data is the same Parquet.
- The only differences are engine-specific syntax (e.g. Trino's three-part identifier vs Spark's two-part) and connector configuration.
Output (identical across engines).
| region | revenue |
|---|---|
| EU | 1_310_200 |
| US | 1_045_900 |
| APAC | 444_100 |
Rule of thumb. When designing a new lakehouse, list every engine that needs to read it; pick a catalog (REST, Polaris, Nessie, Glue) that all of them support; let Iceberg's format do the rest. The "one format, every engine" promise is only as strong as the catalog you pick.
SQL interview question on a production Iceberg maintenance cadence
A senior interviewer often opens with: "Design the weekly maintenance schedule for a 200TB Iceberg fact table with 96 commits per day and 12 active downstream readers. What runs daily, what runs weekly, what runs monthly, and how do you protect compliance tags?" It probes operational hygiene, retention design, and the interaction between expiration and tags.
Solution Using a tag-protected daily / weekly / monthly cadence
-- DAILY — runs at 02:00 in low-traffic window
CALL system.expire_snapshots(
table => 'db.fact_orders',
older_than => CURRENT_TIMESTAMP - INTERVAL '7' DAY,
retain_last => 50
);
-- WEEKLY — runs Sunday 02:30
CALL system.rewrite_data_files(
table => 'db.fact_orders',
options => map(
'target-file-size-bytes', '536870912',
'min-input-files', '5'
)
);
CALL system.remove_orphan_files(
table => 'db.fact_orders',
older_than => CURRENT_TIMESTAMP - INTERVAL '3' DAY
);
-- MONTHLY — runs first Sunday at 03:00
CALL system.rewrite_manifests(
table => 'db.fact_orders'
);
-- ONGOING — tagging step in the publish task (already covered in WAP)
ALTER TABLE db.fact_orders CREATE TAG `eom-${YYYY-MM}`
AS OF VERSION (SELECT snapshot_id FROM db.fact_orders.refs WHERE name='main')
RETAIN 2557 DAYS;
Step-by-step trace.
| cadence | op | input | output |
|---|---|---|---|
| daily | expire_snapshots(7d, retain 50) | ~700 snapshots, 50 tags | snapshots > 7d AND not in last 50 AND not tagged → eligible to expire |
| weekly | rewrite_data_files | partitions with avg file < 128MB | bin-packed to 512MB |
| weekly | remove_orphan_files | S3 files not in any snapshot, > 3d old | deleted |
| monthly | rewrite_manifests | imbalanced manifests | rebalanced |
| event | createTag eom-YYYY-MM | last good monthly snapshot | tag protects for 7 years |
The trace shows how the four operations layer: daily expiration controls snapshot count; weekly OPTIMIZE controls file count; weekly orphan cleanup controls S3 cost; monthly rewrite_manifests controls planning time; per-month tagging controls compliance pinning.
Output:
| metric | before | after |
|---|---|---|
| avg planning latency | 4.2 s | 0.6 s |
| avg files per partition | 800 | 28 |
| S3 storage (table) | 240 TB | 200 TB |
| snapshot count | ~1200 | ~50 + tags |
| compliance tags preserved | yes | yes |
Why this works — concept by concept:
- expire_snapshots respects refs — tags and branches protect their snapshots. The retention window controls only un-referenced snapshots.
- rewrite_data_files bin-packs — compacts small files into 256-512MB targets; reduces planning latency by an order of magnitude.
- remove_orphan_files reclaims S3 — files left by failed writes or dropped branches; runs after expire_snapshots to avoid races.
- rewrite_manifests rebalances metadata — keeps the manifest tree shallow and partition-aligned for fast planning at large commit counts.
-
Per-month tagging — the audit trail and compliance pin in one tag; the
RETAIN 7 YEARSclause overrides the daily 7-day expiration for tagged snapshots. - Cost — daily expire is O(snapshots); weekly OPTIMIZE is O(small files); orphan cleanup is O(S3 listing); rewrite_manifests is O(manifest size). All four are bounded by table change rate, not table size — so the cadence scales sublinearly.
SQL
Topic — optimization
Optimization problems
Cheat sheet — Iceberg branching, tagging, and WAP recipes
-
Inspect commit history.
SELECT * FROM <table>.snapshots ORDER BY committed_at DESC LIMIT N— the audit log every other format had to bolt on. -
List every named ref.
SELECT name, snapshot_id, type FROM <table>.refs— branches plus tags plus main, in one query. -
Time travel to a timestamp.
SELECT ... FROM <table> TIMESTAMP AS OF '2026-06-12 06:00:00'— pin the planner to the snapshot that was current then. -
Time travel to a snapshot id.
SELECT ... FROM <table> VERSION AS OF 103— unambiguous, audit-friendly. -
Time travel to a tag.
SELECT ... FROM <table> VERSION AS OF 'gold-v1'— engine resolves the tag from the refs table. -
Atomic rollback.
CALL system.rollback_to_snapshot('<table>', <prior_id>)— flip the main pointer back; bad snapshot stays on disk until expired. -
Create a branch.
ALTER TABLE <table> CREATE BRANCH '<name>' AS OF VERSION (SELECT snapshot_id FROM <table>.refs WHERE name='main'). -
Drop a branch.
ALTER TABLE <table> DROP BRANCH '<name>'— ref removed; data files freed only if no other ref references them. -
Fast-forward a branch into main.
CALL system.fast_forward('<table>', 'main', '<branch>')— single atomic publish. -
Cherry-pick a single snapshot.
CALL system.cherrypick_snapshot('<table>', <source_snapshot_id>)— bring one fix from a branch into main. -
Create a tag (with retention).
ALTER TABLE <table> CREATE TAG '<name>' AS OF VERSION <id> RETAIN 2557 DAYS— 7-year SOX-style pin. -
Drop a tag.
ALTER TABLE <table> DROP TAG '<name>'— pinned snapshot becomes eligible to expire if no other ref protects it. -
WAP write (branch-based).
ALTER TABLE ... CREATE BRANCH 'audit-<runid>'; INSERT INTO ...branch_audit-<runid> SELECT ...; gate; fast_forward; drop branch. -
Bin-pack OPTIMIZE.
CALL system.rewrite_data_files(table => '<table>', options => map('target-file-size-bytes','536870912'), where => "<predicate>"). -
Daily snapshot expiration.
CALL system.expire_snapshots(table => '<table>', older_than => CURRENT_TIMESTAMP - INTERVAL '7' DAY, retain_last => 50)— tags and branches protect their snapshots automatically. -
Rewrite manifests (monthly).
CALL system.rewrite_manifests(table => '<table>')— re-balance metadata to keep planning fast. -
Reclaim orphan files.
CALL system.remove_orphan_files(table => '<table>', older_than => CURRENT_TIMESTAMP - INTERVAL '3' DAY)— run after expire_snapshots. -
Schema evolution (metadata only).
ALTER TABLE ... ADD COLUMN ...,RENAME COLUMN ... TO ...,DROP COLUMN ...— all are zero data-file rewrites because Iceberg uses field-ids. -
Partition evolution.
ALTER TABLE ... ADD PARTITION FIELD day(event_time)— change spec without rewriting old data; readers handle both specs transparently. -
Inspect file count per partition.
SELECT partition, COUNT(*) FROM <table>.files GROUP BY partition ORDER BY COUNT(*) DESC— identify OPTIMIZE candidates. - Apache Iceberg vs Delta Lake quick rule. Multi-engine multi-vendor lakehouse: Iceberg. Databricks-first shop: Delta with UniForm. Migrating in 2 years: Iceberg.
Frequently asked questions
What is the difference between an Iceberg branch and a tag?
A branch is a movable named ref to a snapshot — writes targeting the branch advance its pointer to new snapshots. A tag is an immutable named ref — once created, its snapshot id does not change (you can only drop and recreate it). Branches are for in-progress isolated work (WAP, dev sandboxes, long-lived experiments); tags are for compliance pins, golden-dataset releases, and "last known good" markers. Both protect their snapshots from expire_snapshots, but tags carry an explicit RETAIN clause and are the right primitive for "must survive for 7 years" audit requirements.
Apache Iceberg vs Delta Lake — which should I pick for a new lakehouse?
For a multi-engine lakehouse with Snowflake, Databricks, Trino, Flink, and ad-hoc DuckDB readers, pick Iceberg — its catalog-pluggable design and first-class branch/tag APIs are universally supported. For a Databricks-first shop with limited multi-engine writers and a strong Spark / Photon investment, Delta with UniForm (which writes Iceberg-compatible metadata alongside Delta) is operationally simpler. For "I might switch engines in two years" or "I want to avoid vendor lock-in," Iceberg is the safer long-term bet. Both formats now ship snapshot atomicity, ACID, time travel, and schema evolution — the differences are catalog model, branching API, and the writer ecosystem outside Spark.
How does WAP differ from a staging table?
A staging table is a physical copy: you write to staging.fact_orders, validate, then INSERT INTO production.fact_orders SELECT * FROM staging.fact_orders — two copies of the data exist briefly and the publish is a multi-statement transaction (or worse, two separate transactions on object storage). Branch-based WAP writes the new snapshot to an isolated branch on the same table; the data files are shared; the publish is a single atomic catalog write that flips the main ref. WAP eliminates the copy, eliminates the two-step publish, and provides the same isolation guarantee with zero storage overhead beyond the new data itself.
Do Iceberg tags prevent snapshot expiration?
Yes. expire_snapshots walks every ref (main, branches, tags) and computes the set of reachable snapshots. Any snapshot reachable from any ref is kept regardless of the older_than clause. Tags can also carry an explicit RETAIN clause that self-expires the tag after a fixed duration (e.g. 7 years for SOX compliance), after which the protected snapshot becomes eligible for normal expiration. The combined contract — "tagged snapshots survive until the tag is dropped or self-expires" — is what makes tags the right primitive for compliance and audit reproducibility.
Can I read an Iceberg branch from Trino or Flink, not just Spark?
Yes. Branch-based WAP works because branches are catalog-level refs, not Spark-specific state. Trino reads a branch via SELECT ... FROM iceberg.db.table FOR VERSION AS OF '<branch>'; Flink via the OPTIONS('snapshot-id'='<resolved-from-branch>') hint; Snowflake via AT(BRANCH => '<branch>'). Every engine that has Iceberg catalog support resolves the branch's snapshot id through the refs metadata sub-table and reads it identically to any other snapshot. This multi-engine isolation is what made the legacy spark.wap.id property obsolete and why branch-based WAP is the 2026 default.
How do I roll back a bad production commit safely?
Two paths. First, if a tag protects the last-good snapshot, CALL system.rollback_to_snapshot('<table>', (SELECT snapshot_id FROM <table>.refs WHERE name='<tag>')) flips the main pointer back atomically. Every reader transitions immediately; the bad snapshot remains on disk for post-mortem until expired. Second, if no tag exists, query <table>.snapshots ORDER BY committed_at DESC LIMIT 5 to identify the prior good snapshot id, then call rollback_to_snapshot('<table>', <good_snapshot_id>). In both cases, the catalog's atomic pointer swap guarantees no reader sees a half-rolled-back state. The lesson: tag every production deploy beforehand — rollback then becomes a one-line lookup instead of a forensic exercise.
Practice on PipeCode
- Drill the ETL practice library → for end-to-end pipeline design and the patterns Iceberg encodes (idempotent writes, atomic publish, schema evolution).
- Rehearse database design problems → for the snapshot-isolation and ACID-on-object-storage reasoning interviewers want.
- Sharpen optimization drills → when the interviewer wants planning-time and bin-pack reasoning.
- Layer streaming problems → for CDC-into-Iceberg patterns and the WAP-on-streams design probe.
- Stack joins practice → for the "ref-table vs fact-table" decisions every lakehouse forces.
- Layer aggregation problems → for the read-side metrics that ride on top of Iceberg fact tables.
- For the broader interview surface, read top data engineering interview questions →.
- Stack the prerequisites with the only 5 skills you need to become a data engineer →.
- Sharpen the system-design axis with the ETL system design course →.
- For long-form schema craft, work through data modelling for DE interviews →.
- For the engine internals that drive Iceberg writers and readers, study the Apache Spark internals course →.
Pipecode.ai is Leetcode for Data Engineering — every Iceberg pattern above ships with hands-on practice rooms where you create the audit branch, run the validation gate, fast-forward main, and protect compliance tags against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your branch-based WAP actually behaves the same on Trino as it does on Spark — you can prove it on graded reps.





Top comments (0)