DEV Community

Cover image for Apache Iceberg Branching, Tagging & WAP: Production Patterns
Gowtham Potureddi
Gowtham Potureddi

Posted on

Apache Iceberg Branching, Tagging & WAP: Production Patterns

apache iceberg rewrote what "a table" means on object storage. A decade of "files in S3 with a _SUCCESS marker" gave us cheap storage and unbounded scale, but every meaningful invariant a SQL engineer expects — atomicity, isolation, time travel, schema evolution, safe rollback — had to be re-invented on top of immutable blobs. Iceberg's answer is the snapshot manifest: a tiny pointer-swap that makes a multi-gigabyte write atomic, a 90-day rollback trivial, and a "compare last good prod table to today's run" diff a one-line query.

This guide is the production patterns playbook for senior data engineers who already know what Iceberg is and now have to ship it. It walks through the iceberg snapshots model, the apache iceberg vs delta lake decision, the iceberg branching lifecycle (createBranch, fastForward, cherryPick), the iceberg tagging API for immutable compliance pins, the iceberg wap (write audit publish) pattern as the modern replacement for spark.wap.id, and the iceberg time travel plus maintenance cadence (snapshot expiration, OPTIMIZE, rewrite_manifests, schema evolution) that keep a multi-petabyte lakehouse fast and cheap. Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.

PipeCode blog header for Apache Iceberg branching, tagging and write-audit-publish — bold white headline 'Apache Iceberg · Branching · Tagging · WAP' with subtitle 'snapshots · branches · tags · write-audit-publish' over a stylised glacier scene with a tall central snapshot-iceberg labelled 'main' and smaller branching side-icebergs in green and orange connected by thin glowing lines, on a dark gradient with purple, green, orange, and blue accents and a small pipecode.ai attribution.

When you want hands-on reps immediately after reading, drill the ETL practice library →, rehearse on database design problems →, and stack the streaming-side muscles with streaming problems →.


On this page


1. Why open table formats — snapshot atomicity, ACID, time travel

The "files in S3" era broke at scale — the snapshot manifest is the primitive that fixed every invariant a SQL engineer expects

The mental model in one line: an Iceberg table is a tiny JSON pointer that names the current snapshot; every write produces a new snapshot and the table commit is a single conditional pointer swap. Once you internalise "tables are pointer-swaps over immutable files," every Iceberg feature — atomic commits, time travel, branches, tags, WAP — falls out of the same primitive.

Why the pre-Iceberg world broke.

  • Hive-style partitioned directories told the reader "everything under /sales/dt=2026-06-12/ is the partition for that day." There was no commit record, so a job that wrote 12 of 24 files before crashing left partial data visible to every downstream reader.
  • _SUCCESS markers were a convention, not a contract — Spark wrote them, but Trino did not always honour them, and a successful marker plus a half-rewritten Parquet file could still corrupt a query.
  • No schema-evolution invariant — a column rename in Hive was a rewrite of every Parquet file. A column drop was a code review and a prayer.
  • No transactional semantics — concurrent writers raced on the same partition and the last-writer-wins resolution silently dropped commits.
  • Listing-bound planning — query planning required LIST against S3 for every partition, which was both slow and a wallet-burn at scale.

The Iceberg primitive.

  • The snapshot. Every commit produces one snapshot. A snapshot is a manifest list — a tiny file that points to a set of manifest files, which point to the actual Parquet/ORC data files. The current table state is the current snapshot.
  • The atomic swap. A commit succeeds when the catalog (REST / Glue / Nessie / Polaris / Unity) flips its "current snapshot id" pointer from snap-A to snap-B in a single conditional write. If the swap fails (because someone else committed first), the writer retries the manifest against the new base. No partial reads, no half-written tables.
  • ACID without locks. Atomicity comes from the pointer swap; consistency from the manifest validation; isolation from immutable snapshots (readers pin a snapshot id at planning time and never see writes that arrive after); durability from S3 itself.
  • Time travel for free. Every historical snapshot is queryable via AS OF VERSION or AS OF TIMESTAMP until you explicitly expire it. Rollback is a single API call.
  • No-list planning. The manifest tree carries column-level stats (min/max, null counts) per file, so the planner prunes 99% of files before any LIST or GET is issued. Planning latency drops from minutes to seconds on petabyte tables.

Apache Iceberg vs Delta Lake — the two-format reality.

  • Both ship snapshot-based ACID on object storage. Both support time travel, schema evolution, and atomic commits.
  • Iceberg is catalog-pluggable — the table identity lives in the catalog (REST, Glue, Nessie, Polaris, Snowflake-managed, Unity). Multi-engine reads are first-class: Spark, Trino, Flink, Snowflake, DuckDB, ClickHouse, StarRocks, Athena, BigQuery all read Iceberg natively.
  • Delta Lake has its own commit log living next to the data (_delta_log/). Engines like Trino and Flink can read Delta but the writer ecosystem outside Spark/Databricks is thinner. Delta UniForm shrinks this gap by writing Iceberg-compatible metadata alongside Delta.
  • Iceberg branches and tags ship as core API (Spark, Flink, Trino procedures). Delta's equivalent is "shallow clone" plus the Delta UniForm bridge — semantically similar but a different surface.
  • The 2026 rule of thumb. For a new lakehouse with Snowflake / Databricks / Trino / Flink all reading and writing: pick Iceberg. For a Databricks-first shop with limited multi-engine writers: Delta with UniForm is operationally simpler. For "I might switch engines in 2 years": Iceberg is the safer bet.

What interviewers listen for.

  • Do you say "a commit is a pointer swap" when asked what makes Iceberg atomic? — senior signal.
  • Do you reach for snapshots metadata table to inspect history before reaching for git log-style mental models? — required answer.
  • Do you mention catalog choice (REST vs Glue vs Nessie vs Polaris) as a first-class production decision? — senior signal.
  • Do you know that time travel reads pin the snapshot at planning time, so a long-running query is unaffected by concurrent commits? — required answer.

Worked example — list snapshots and inspect commit history

Detailed explanation. Every Iceberg table exposes a metadata sub-table called .snapshots that lists every commit. Reading it is the first move when a downstream report changed unexpectedly — you can see exactly which snapshot landed, who wrote it, and how many files were added or deleted.

Question. Given a production fact table db.sales, list the last five snapshots with commit timestamp, operation, and the file-level delta (added / deleted). How would you use the result to investigate a "row count jumped by 8% overnight" alert?

Input.

snapshot_id committed_at operation added_data_files deleted_data_files
s-101 2026-06-12 02:00 append 24 0
s-102 2026-06-12 04:00 append 12 0
s-103 2026-06-12 06:00 overwrite 240 36
s-104 2026-06-12 08:00 append 6 0
s-105 2026-06-12 09:30 append 4 0

Code.

-- Spark / Trino / Snowflake-Iceberg syntax
SELECT
    snapshot_id,
    committed_at,
    operation,
    summary['added-data-files']   AS added_data_files,
    summary['deleted-data-files'] AS deleted_data_files,
    summary['total-records']      AS total_records
FROM db.sales.snapshots
ORDER BY committed_at DESC
LIMIT 5;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. db.sales.snapshots is the metadata sub-table for the sales table — every Iceberg table exposes it automatically.
  2. summary is a MAP<STRING,STRING> of commit-level counters: added/deleted data files, added/deleted records, total records, partitioning summary, and any caller-attached properties (spark.app.id, dbt run id, Airflow dag run id).
  3. The overwrite row at s-103 is the suspicious commit — it deleted 36 files and added 240, which is the textbook signature of an OPTIMIZE / rewrite_data_files run or a backfill that replaced a partition. Cross-check the wap.id or spark.app.id in summary to know which.
  4. With the suspect snapshot identified, an investigator can time travel (SELECT ... FROM db.sales VERSION AS OF s-102 vs s-103) and diff the row counts per partition to confirm whether the jump came from the overwrite.

Output.

snapshot_id committed_at operation added deleted total_records
s-105 2026-06-12 09:30 append 4 0 12_004_200
s-104 2026-06-12 08:00 append 6 0 12_001_800
s-103 2026-06-12 06:00 overwrite 240 36 11_995_400
s-102 2026-06-12 04:00 append 12 0 11_096_200
s-101 2026-06-12 02:00 append 24 0 11_088_400

Rule of thumb. When a downstream metric changes unexpectedly, your first SQL should always be SELECT * FROM <table>.snapshots ORDER BY committed_at DESC LIMIT N. The snapshots table is the audit log every other format had to bolt on later.

Worked example — time travel to query the table at a past timestamp

Detailed explanation. Iceberg supports both VERSION AS OF <snapshot_id> (exact snapshot) and TIMESTAMP AS OF <ts> (the snapshot that was current at that wall-clock time). Time travel reads pin the snapshot at planning time, so a long-running analytical query is immune to concurrent writes.

Question. A daily revenue report ran at 06:00 and produced a number the business is disputing. Re-run the same query against the table as it existed at 06:00 and confirm the historical value.

Input.

snapshot_id committed_at
s-102 2026-06-12 04:00
s-103 2026-06-12 06:00
s-104 2026-06-12 08:00

Code.

-- Reproduce the report exactly as it ran at 06:00
SELECT
    region,
    SUM(amount) AS revenue
FROM db.sales TIMESTAMP AS OF '2026-06-12 06:00:00'
WHERE order_date = DATE '2026-06-11'
GROUP BY region
ORDER BY region;

-- Or pin the exact snapshot id from the snapshots audit
SELECT
    region,
    SUM(amount) AS revenue
FROM db.sales VERSION AS OF 103
WHERE order_date = DATE '2026-06-11'
GROUP BY region
ORDER BY region;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. TIMESTAMP AS OF '2026-06-12 06:00:00' resolves to the snapshot that was current at exactly that moment — in this table, snapshot s-103 (the 06:00 overwrite). The planner uses the manifest tree from s-103, not the current pointer.
  2. VERSION AS OF 103 is the unambiguous form — if downstream auditors need to reproduce a number, the snapshot id is the contract.
  3. Both forms cost the same to plan and execute as a regular read; the only difference is which snapshot manifest the planner walks.
  4. Because the snapshot is pinned at planning time, no concurrent writer can affect this query's result. The number you compute is the historical truth.

Output.

region revenue (at 06:00) revenue (now)
EU 1_240_500 1_310_200
US 980_400 1_045_900
APAC 420_900 444_100

Rule of thumb. Every disputed metric should be re-runnable with a TIMESTAMP AS OF or VERSION AS OF clause against the original snapshot. If your downstream pipeline forgets to capture the snapshot id, you lose audit reproducibility — capture it in every run's lineage metadata.

Worked example — atomic rollback to a previous snapshot

Detailed explanation. If a bad commit corrupts the table — wrong partition overwritten, accidental DELETE, schema-violating insert — rollback_to_snapshot flips the catalog pointer back to the prior snapshot in a single atomic operation. The bad snapshot still exists on disk (until expired) but is no longer the current state.

Question. A nightly job overwrote the eu partition with bad data in snapshot s-103. Roll the table back to snapshot s-102 atomically, so every reader returns the pre-s-103 state immediately.

Input.

snapshot_id committed_at operation
s-101 2026-06-12 02:00 append
s-102 2026-06-12 04:00 append
s-103 2026-06-12 06:00 overwrite (BAD)

Code.

-- Spark / Trino procedure form
CALL system.rollback_to_snapshot('db.sales', 102);

-- Or rollback to the snapshot that was current at a wall-clock time
CALL system.rollback_to_timestamp('db.sales', TIMESTAMP '2026-06-12 05:00:00');
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The procedure writes a new snapshot s-104 whose state is identical to s-102 (the pointer is reset). The bad snapshot s-103 is still on disk but is no longer in the chain of "current" snapshots.
  2. The pointer swap is atomic — between the millisecond before and after, no reader can see a half-rolled-back table.
  3. Concurrent readers that started before the rollback continue against their pinned snapshot; new readers see the rolled-back state.
  4. The bad snapshot remains queryable via VERSION AS OF 103 for post-mortem analysis until you explicitly expire_snapshots past it.

Output.

step current snapshot state visible to readers
before s-103 bad EU data
after s-104 (= s-102 state) pre-bad EU data

Rule of thumb. Treat rollback_to_snapshot as your "undo button" — but always pair it with a snapshot inspection (SELECT * FROM <table>.snapshots) so you roll back to the correct prior snapshot. Rolling back to the wrong snapshot is its own outage.

SQL interview question on Iceberg atomicity and time travel

A senior interviewer often opens with: "Walk me through what happens at the catalog when Spark commits an append to an Iceberg table — what is atomic, what is not, and how does a concurrent reader see (or not see) the new rows?" It probes whether the candidate understands snapshot isolation, the pointer-swap commit, and the read-pinning behaviour.

Solution Using snapshot isolation and the manifest-pointer commit

-- 1) Inspect the current pointer and the in-flight commit candidate
SELECT
    name,
    snapshot_id,
    parent_id,
    is_current_ancestor
FROM db.sales.refs;          -- branches + tags + main

SELECT
    snapshot_id,
    parent_id,
    operation,
    committed_at,
    summary['added-data-files'] AS added,
    summary['spark.app.id']     AS writer
FROM db.sales.snapshots
ORDER BY committed_at DESC
LIMIT 3;

-- 2) Reader query that pins a snapshot at planning time
SELECT
    region,
    COUNT(*) AS row_count
FROM db.sales VERSION AS OF (
    SELECT MAX(snapshot_id) FROM db.sales.snapshots
    WHERE committed_at <= TIMESTAMP '2026-06-12 06:00:00'
)
GROUP BY region;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

step catalog state reader state invariant
1 current = s-102 new reader plans against s-102 snapshot pinned
2 writer prepares s-103 manifest s-102 still current no swap yet
3 writer calls catalog.commit(s-102→s-103) atomic conditional swap last-writer-wins or retry
4 current = s-103 original reader still sees s-102 snapshot isolation
5 second reader plans now sees s-103 no partial reads

The trace shows the central guarantee: between steps 1 and 4, the original reader's view is locked to s-102 — it cannot accidentally see half of s-103's writes. The new reader at step 5 sees the fully committed s-103. There is no in-between visibility state.

Output:

reader snapshot pinned row_count visible
reader A (started at t1) s-102 11_096_200
reader B (started at t5) s-103 11_995_400
reader C (TIMESTAMP AS OF '06:00') s-103 (was current at 06:00) 11_995_400

Why this works — concept by concept:

  • Snapshot — a manifest list pointing to a set of data files; immutable once written. Every write produces a new snapshot.
  • Catalog pointer swap — the table's current_snapshot_id lives in the catalog (REST, Glue, Nessie, Polaris, Unity, or a metadata file pointer in HDFS-catalog mode). Commit = conditional update of that single field.
  • Snapshot isolation — readers capture the current snapshot id at planning time and use it throughout the query. Concurrent commits never affect an in-flight read.
  • Last-writer-wins with retry — two concurrent writers race to swap the pointer; the loser re-bases its manifest against the winner's snapshot and retries. No locks, no deadlocks.
  • Cost — pointer swap is one catalog write; the manifest tree for the new snapshot is O(changed manifests), not O(table). Atomicity is essentially free.

SQL
Topic — database
Database design problems

Practice →


2. Iceberg branches — createBranch / fastForward / cherry-pick, isolated dev

iceberg branching makes the table look like a Git repo — branches are named snapshot refs, not table copies

The mental model in one line: an Iceberg branch is a named pointer to a snapshot — exactly like a Git branch is a named pointer to a commit — so creating one is O(1) metadata, not an O(table) copy. Once you say "branches are refs, not clones," the entire branching API (createBranch, fastForward, cherryPick) reads like Git porcelain over the same snapshot DAG.

Iconographic commit-tree of an Iceberg table — a thick blue 'main' trunk with five snapshot-nodes (s-101 to s-105), a green 'dev' branch forking up from s-102 with three of its own snapshots ending in a fast-forward arrow back into main, and an orange 'qa' branch forking down from s-103 with a cherry-pick arc back into main near s-104; a side card titled 'branch operations' with three chips (createBranch, fastForward, cherryPick); on a light PipeCode card.

The branch primitive in five bullets.

  • A branch is a named ref. The catalog stores one row per branch (and per tag) — name, snapshot_id, max_ref_age_ms, min_snapshots_to_keep, max_snapshot_age_ms. The data is shared with main; only the ref pointer is new.
  • main is the implicit branch every table starts with. Every commit on the table updates main unless you explicitly target a different branch.
  • Branch writes are isolated — a writer that targets dev does not touch the main pointer, so readers on main are completely unaffected by experimental work.
  • fastForward resolves a branch into main when it is strictly ahead. If dev was created from snapshot s-102 and has commits s-d1 → s-d2 → s-d3, and main is still at s-102, then fastForward('main', 'dev') moves the main pointer to s-d3 in a single atomic operation. The data files on dev simply become the files on main.
  • cherryPick brings a single snapshot from one branch into another as a new snapshot. The picked snapshot's data files are re-committed under main's lineage; the original branch is untouched.

Branch lifecycle API.

  • createBranch(name, snapshot_id?) — creates a new branch pointing to snapshot_id (or current if omitted). O(1) metadata write.
  • createBranch(name, snapshot_id, retention?) — optional retention controls protect old snapshots referenced by the branch from expire_snapshots.
  • fastForward(target, source) — fast-forward the target branch to source. Fails if target has commits not reachable from source (would require a real merge, which Iceberg does not implement — you must cherryPick or rewrite).
  • cherryPick(snapshot_id) — copy a single snapshot's changes onto the current branch as a new commit. Conflicts with concurrent writes resolve via re-application.
  • dropBranch(name) — removes the branch ref; snapshots become eligible for expiration if no other ref protects them.

When branches are the right answer.

  • Sandbox a backfill. Create backfill-2026-Q1 from main, run the rewrite against it, validate, fast-forward into main. Production reads of main never see in-progress backfill state.
  • Long-lived experiment. Data scientists join a new feature into a feat-customer360 branch and query it; main continues to serve production. If the experiment ships, fast-forward; if it does not, drop the branch.
  • Multi-tenant staging. Each downstream team gets a branch with its own data-correction overlays; main remains canonical.
  • WAP (next section). The audit branch is created per run, written to, validated, then fast-forwarded into main — the modern WAP pattern.

When branches are the wrong answer.

  • As a substitute for partitioning. A branch per tenant is a metadata explosion; use a tenant column and row-level access control instead.
  • As a replacement for snapshots. If you only need "the table as of yesterday," a snapshot id or a tag is cheaper than a branch.
  • As a soft DELETE. Iceberg supports DELETE FROM with merge-on-read or copy-on-write at the table level; branches are not for hiding rows.

Common interview probes on branches.

  • "What is the difference between an Iceberg branch and a Git branch?" — semantically nearly identical (both are named refs to a commit DAG), but Iceberg branches do not implement three-way merge — fast-forward and cherry-pick are the only merge primitives.
  • "Can two writers commit to the same branch concurrently?" — yes, same last-writer-wins retry as on main. Branches do not serialise writes; they isolate one branch's commits from another's.
  • "How are branches stored?" — as rows in the table's refs metadata sub-table. The catalog stores the current snapshot id for each ref; the data files are shared across branches.
  • "Does dropping a branch delete the data?" — no. Data files referenced by the dropped branch survive until the next expire_snapshots run determines they are unreachable from every remaining ref.

Worked example — create a dev branch, mutate, and fast-forward into main

Detailed explanation. A data engineer wants to test a partition rewrite against a copy of db.sales without affecting production readers. She creates a dev branch from the current main snapshot, runs the rewrite against dev, validates the row counts, then fast-forwards main to the new state in a single atomic call.

Question. Given db.sales at snapshot s-105, create a dev branch, write a rewrite snapshot to it that compacts small files in the eu partition, validate row counts, and fast-forward main to the new dev head.

Input.

step branch snapshot pointer
start main s-105
start dev (not created)

Code.

-- 1) Create the dev branch from current main
ALTER TABLE db.sales CREATE BRANCH `dev`;

-- 2) Write the rewrite on the dev branch
CALL system.rewrite_data_files(
    table         => 'db.sales',
    options       => map('target-file-size-bytes', '536870912'),
    where         => "region = 'EU'",
    branch        => 'dev'
);

-- 3) Validate row counts match between main and dev for EU
SELECT
    'main' AS branch, COUNT(*) AS row_count FROM db.sales VERSION AS OF (SELECT snapshot_id FROM db.sales.refs WHERE name='main')
WHERE region = 'EU'
UNION ALL
SELECT
    'dev', COUNT(*) FROM db.sales VERSION AS OF (SELECT snapshot_id FROM db.sales.refs WHERE name='dev')
WHERE region = 'EU';

-- 4) Fast-forward main to dev's head atomically
CALL system.fast_forward('db.sales', 'main', 'dev');

-- 5) Drop the dev branch (data files now live under main's lineage)
ALTER TABLE db.sales DROP BRANCH `dev`;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. CREATE BRANCH dev writes one new ref row to the catalog pointing to s-105. No data is copied. O(1).
  2. rewrite_data_files(..., branch => 'dev') runs the compaction targeting dev. The new compacted snapshot s-d1 lands on dev; main is still at s-105 and production readers see no change.
  3. The validation query reads both branches via VERSION AS OF against the ref-resolved snapshot ids. Row counts must match — compaction does not change semantics.
  4. fast_forward('main', 'dev') moves the main pointer to s-d1 atomically. The transition is a single catalog write; no reader sees an in-between state. The cost is zero new data files — the files are already there, only the pointer changes.
  5. DROP BRANCH dev removes the ref row. The files it referenced are still reachable from main (because of the fast-forward), so nothing is orphaned.

Output.

step main pointer dev pointer EU files EU rows
1 s-105 s-105 120 small 4.2M
2 s-105 s-d1 main:120, dev:18 large 4.2M / 4.2M
3 s-105 s-d1 unchanged counts match
4 s-d1 s-d1 18 large 4.2M
5 s-d1 (dropped) 18 large 4.2M

Rule of thumb. Every multi-step mutation that could be risky (rewrite, schema evolution, partition reshuffle) belongs on its own branch. The branch is your safety net — if validation fails, you drop the branch and nothing on main changed.

Worked example — cherry-pick a single fix snapshot from a hotfix branch

Detailed explanation. A nightly job ran a multi-step transformation on the feat-2026-Q2 branch. One of the snapshots in the chain is a legitimate bug fix that should be applied to main; the rest are still in progress. Cherry-pick brings just that one snapshot into main without merging the entire branch.

Question. Given feat-2026-Q2 with snapshots s-f1 (refactor), s-f2 (bug fix to NULL handling), s-f3 (unfinished new column), cherry-pick s-f2 into main without bringing s-f1 or s-f3.

Input.

branch snapshots
main s-101 → ... → s-105
feat-2026-Q2 s-105 → s-f1 → s-f2 → s-f3

Code.

-- Cherry-pick the bug-fix snapshot only
CALL system.cherrypick_snapshot('db.sales', '<s-f2-id>');

-- Verify the new main snapshot inherited the fix
SELECT
    snapshot_id, parent_id, operation,
    summary['source-snapshot-id'] AS cherry_picked_from
FROM db.sales.snapshots
ORDER BY committed_at DESC
LIMIT 2;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. cherrypick_snapshot reads the data-file delta of s-f2 (the bug fix) and re-applies it as a new commit s-106 on main. The data files referenced by s-f2 are linked under main's manifest tree.
  2. The new snapshot's summary records source-snapshot-id = s-f2 so the audit trail captures the cherry-pick origin.
  3. feat-2026-Q2 is untouched — s-f1, s-f2, s-f3 remain on the branch for the team's continued work.
  4. If s-f2 had touched the same files as a concurrent main commit, cherry-pick would fail with a conflict; the operator would then either rebase s-f2 against the new main head and retry, or rewrite the fix as a fresh commit on main.

Output.

snapshot_id parent_id operation cherry_picked_from
s-106 s-105 overwrite s-f2
s-105 s-104 append (none)

Rule of thumb. Use cherryPick when you need a single fix from a long-lived branch without merging everything else. Use fastForward when the entire branch is ready to land. Use neither — and reach for a fresh INSERT INTO on main — when the change is trivial enough that the branch overhead is not worth it.

Worked example — long-lived experiment branch with concurrent main writes

Detailed explanation. A data-science team works on a feat-customer360 branch for three weeks. During those three weeks, the production ETL keeps committing to main. The team can periodically rebase their branch against the new main head — or they can keep working on the old snapshot and reconcile at the end.

Question. Show the snapshot DAG state at three points in time: branch creation, two weeks in, and at fast-forward time. Explain why concurrent main writes do not block the branch.

Input.

time main feat-customer360
t0 s-105 (created at s-105)
t1 (week 1) s-110 s-105 → s-c1
t2 (week 2) s-115 s-105 → s-c1 → s-c2
t3 (week 3) s-120 s-105 → s-c1 → s-c2 → s-c3

Code.

-- At t0: create the branch
ALTER TABLE db.customers CREATE BRANCH `feat-customer360`;

-- At t1, t2, t3: append to the branch (writes do not touch main)
INSERT INTO db.customers.`branch_feat-customer360`
SELECT * FROM staging.customer360_t1;

-- At t3: rebase or cherry-pick to land the work
-- Option A: cherry-pick each branch snapshot in order onto main
CALL system.cherrypick_snapshot('db.customers', '<s-c1-id>');
CALL system.cherrypick_snapshot('db.customers', '<s-c2-id>');
CALL system.cherrypick_snapshot('db.customers', '<s-c3-id>');

-- Option B: validate and fast-forward only if the branch is a clean superset
-- (rare for a 3-week-old branch; usually some main commits conflict)
CALL system.fast_forward('db.customers', 'main', 'feat-customer360');
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. At t0, both refs point to s-105. No data is duplicated; only the ref row is new.
  2. Between t0 and t3, the ETL commits 15 new snapshots to main (s-106 → s-120). Each commit only touches the main ref; the feat-customer360 ref is unaffected.
  3. The DS team commits 3 snapshots to the branch (s-c1, s-c2, s-c3). Each touches only the branch ref; main is unaffected.
  4. At t3, the team wants to land the work. Because main has moved (s-105 → s-120) and the branch is based on s-105, a plain fast_forward will fail — the branch is not strictly ahead of main.
  5. The team's choice is cherry-pick each branch snapshot in order onto main, resolving conflicts where the branch touched a file that main also touched, or rebase by re-running the transformations against the current main head on a new short-lived branch and then fast-forwarding.

Output.

time main snapshot branch snapshot data files shared?
t0 s-105 s-105 yes (identical)
t1 s-110 s-c1 partly — s-c1's files unique to branch
t2 s-115 s-c2 partly
t3 s-120 s-c3 partly
after cherry-picks s-123 s-c3 branch files now also reachable from main

Rule of thumb. Long-lived branches accumulate divergence. If a branch lives longer than the typical main commit cadence, plan a rebase or cherry-pick strategy from day one. Treat the branch like a feature branch in Git: rebase early, rebase often, or accept that the final landing will be a multi-step cherry-pick.

SQL interview question on sandboxing a backfill

A senior interviewer might frame this as: "You need to backfill 90 days of bad partitions on a 50TB fact table. The downstream consumers cannot tolerate seeing partial backfill state. How would you structure the work using Iceberg branches, and what is your rollback plan?" It probes branch lifecycle, atomic publication, and operator hygiene.

Solution Using a backfill branch, validation gate, and fast-forward publish

-- 1) Create the backfill branch from the current production head
ALTER TABLE db.fact_orders CREATE BRANCH `backfill-2026-Q1`
  AS OF VERSION (SELECT snapshot_id FROM db.fact_orders.refs WHERE name='main');

-- 2) Run the backfill targeting the branch
INSERT OVERWRITE TABLE db.fact_orders.`branch_backfill-2026-Q1`
SELECT * FROM staging.corrected_q1_orders;

-- 3) Validation gate — row count, null rate, and per-partition delta vs main
WITH gate AS (
    SELECT
        COUNT(*)                                  AS row_count_branch,
        SUM(CASE WHEN amount IS NULL THEN 1 END)  AS null_amount_branch
    FROM db.fact_orders VERSION AS OF (
        SELECT snapshot_id FROM db.fact_orders.refs WHERE name='backfill-2026-Q1'
    )
    WHERE order_date BETWEEN DATE '2026-01-01' AND DATE '2026-03-31'
)
SELECT
    *,
    CASE
        WHEN row_count_branch   < 12000000      THEN 'FAIL — too few rows'
        WHEN null_amount_branch > 0             THEN 'FAIL — null amounts'
        ELSE                                         'PASS'
    END AS gate_decision
FROM gate;

-- 4) Publish: fast-forward main to the validated branch head
CALL system.fast_forward('db.fact_orders', 'main', 'backfill-2026-Q1');

-- 5) On failure: drop the branch; main is untouched
-- ALTER TABLE db.fact_orders DROP BRANCH `backfill-2026-Q1`;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

step main backfill branch gate state
1 s-200 s-200 not started
2 s-200 s-b1 (overwrite Q1) not started
3 s-200 s-b1 PASS (rows=12.4M, null=0)
4 s-b1 s-b1 published atomically
5 (fail path) s-200 dropped main untouched

The trace illustrates the central guarantee: between step 2 and step 4, production readers see the old Q1 data on main. Between step 4 (the moment of fast_forward) and step 4 + 1 microsecond, every reader transitions to the new Q1 state. There is no intermediate visibility — exactly the contract the consumers require.

Output:

metric value
main snapshot before s-200
main snapshot after s-b1
backfill rows written 12_400_000
readers that saw partial state 0
rollback path on gate failure drop branch, no further action

Why this works — concept by concept:

  • Branch isolation — every write targets the backfill-2026-Q1 ref; main is untouched until the explicit fast_forward. Concurrent main commits proceed normally.
  • Validation gate — runs entirely on the branch snapshot, so the gate is testing exactly the bytes that will be published. No staging-vs-prod skew.
  • Atomic publishfast_forward is a single catalog write that flips the main pointer. No reader sees a half-published state.
  • Rollback by dropping a branch — if the gate fails, the branch is dropped, main never moves, and the operator has zero cleanup. The bad data files are reclaimed by the next expire_snapshots run.
  • No table copy — the branch is O(1) metadata at creation; the data files are written incrementally as the backfill commits. The total cost is "one backfill" plus a handful of ref-row writes.
  • Cost — branch create / drop / fast_forward are all O(1) catalog writes. The backfill itself is O(corrected_data) regardless of whether you use a branch.

SQL
Topic — ETL
ETL pipeline problems

Practice →


3. Iceberg tags — immutable labels for compliance and audit

iceberg tagging pins a snapshot under a stable name — the audit-friendly equivalent of "the version regulators asked for"

The mental model in one line: a tag is an immutable named ref to a specific snapshot — you cannot move it, only create or drop it — which makes tags the right primitive for compliance pins, golden datasets, and "last known good" markers. Once you say "tags do not move; branches do," the entire tagging API and its retention semantics fall out.

Iconographic scroll of immutable wax-sealed labels — a horizontal snapshot timeline (s-090 to s-096) with four coloured wax-seal medallion tags pinned at specific snapshots (eom-2025-12 purple at s-091, gold-v1 blue at s-093, sox-quarter orange at s-094, prod-good green at s-096), each carrying a snapshot-ID pill and a small lock icon, with an 'expire snapshots' sweep arrow blocked by a shield glyph at each tagged snapshot; a side card titled 'tag rules' with three chips (immutable, point-in-time, audit-friendly); on a light PipeCode card.

The tag primitive in five bullets.

  • A tag is a named ref to a single snapshot. Same catalog row shape as a branch; the only difference is the API does not let you move the pointer.
  • Tags are immutable in the move sense, not in the existence sense. You can dropTag and recreate it pointing to a different snapshot, but you cannot atomically retarget an existing tag.
  • Tags protect snapshots from expiration. As long as a tag references a snapshot, that snapshot survives expire_snapshots — even when the retention window otherwise would have evicted it.
  • Tags carry optional retention metadatamax_ref_age_ms lets a tag self-expire after, say, 7 years for SOX compliance.
  • Tags are queryable like branchesSELECT * FROM db.sales VERSION AS OF 'eom-2025-12' reads the snapshot the tag points at.

Tag lifecycle API.

  • createTag(name, snapshot_id?) — pin snapshot_id (or current head) under name. Fails if name already exists.
  • createTag(name, snapshot_id, retention) — attach an explicit lifetime; common for compliance ("keep for 7 years").
  • dropTag(name) — remove the tag ref. The snapshot becomes eligible for expiration if no other ref protects it.
  • replaceTag(name, snapshot_id) (rarely available; check engine version) — atomically retarget. Most engines require drop + create.
  • AS OF '<tag-name>' — query syntax for tag-aware time travel.

When tags are the right answer.

  • Compliance and audit pins — "tag the last good snapshot before the quarterly close" — regulators can re-query the exact bytes that produced the filed numbers.
  • Golden datasetsgold-v1, gold-v2, ... tag releases of a curated dataset so downstream consumers can pin a known-good version even as you keep iterating on main.
  • "Last known good" markers — tag the last snapshot that passed all data-quality checks so a rollback is a one-line rollback_to_snapshot lookup against the tag.
  • Disaster-recovery anchors — tag every successful daily run; a corrupted commit can be undone by rolling back to yesterday's tag.

When tags are the wrong answer.

  • As a substitute for branches. Tags do not isolate writes; they label snapshots. Use a branch for in-progress work.
  • As a substitute for snapshot retention policies. Tags do not change the table-level retention; they pin individual snapshots. The right way to retain "all snapshots for 90 days" is the table retention config, not a tag per snapshot.
  • As a per-row label. Tags name table-level points in time, not row-level conditions. Use a column for row labels.

Tag-aware retention — the subtle part.

  • expire_snapshots walks every ref (branches + tags + main) and computes the set of reachable snapshots. Any snapshot reachable from any ref is kept; everything else is eligible for expiration.
  • A tag at snapshot s-090 protects s-090 and its parent chain up to the root, unless you also configure min_snapshots_to_keep or max_snapshot_age_ms at the tag level — in which case the tag protects only s-090 and snapshots within the configured retention.
  • Result. Tags are the safest way to keep a specific historical state queryable without ballooning the snapshot count. Pin the snapshots you need; let expire_snapshots reclaim the rest.

Compliance use case in one paragraph. Suppose a regulator asks: "Reproduce the revenue number you filed on 2026-01-31." Without tags, you must hope the snapshot from that date is still within your retention window. With tags, you ran createTag('filed-2026-01', <snapshot_at_eom>) the moment you filed, and the snapshot is guaranteed to survive for the configured retention (typically 7 years for SOX). SELECT ... FROM db.sales VERSION AS OF 'filed-2026-01' reproduces the number exactly. That is the entire audit reproducibility story in one tag.

Worked example — tag the last good snapshot before a known-bad release

Detailed explanation. A team is about to deploy a new ETL version that they expect to be safe but want a one-call rollback path. Before the deploy, they tag the current production snapshot as pre-deploy-2026-06-12. If the new ETL produces bad data, rollback is a single command using the tag.

Question. Given db.sales at snapshot s-105 (last good), tag the snapshot as pre-deploy-2026-06-12. Show how to roll back to the tag if the next commit corrupts the table.

Input.

snapshot_id committed_at
s-104 2026-06-12 08:00
s-105 2026-06-12 09:30 (last good)

Code.

-- 1) Tag the last good snapshot before deploy
ALTER TABLE db.sales CREATE TAG `pre-deploy-2026-06-12`
  AS OF VERSION (SELECT snapshot_id FROM db.sales.refs WHERE name='main');

-- 2) (Deploy happens, suppose s-106 corrupts the table)

-- 3) Roll back to the tagged snapshot atomically
CALL system.rollback_to_snapshot(
    'db.sales',
    (SELECT snapshot_id FROM db.sales.refs WHERE name='pre-deploy-2026-06-12')
);

-- 4) Inspect the refs table to confirm the rollback landed
SELECT name, snapshot_id, type FROM db.sales.refs ORDER BY name;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. CREATE TAG writes a ref row pointing to s-105. The catalog now has two refs to that snapshot: main and pre-deploy-2026-06-12.
  2. After the deploy commits s-106 (bad), main = s-106 and the tag = s-105.
  3. rollback_to_snapshot flips main back to s-105 atomically. The bad s-106 still exists but is no longer current; readers see the s-105 state immediately.
  4. The refs table confirms: main = s-105 (post rollback) and pre-deploy-2026-06-12 = s-105 (tag, unchanged).
  5. The tag remains pinned at s-105 forever (or until you drop it). Even if you later expire s-106 (the bad snapshot), s-105 is permanently protected by the tag.

Output.

name snapshot_id type
main s-105 BRANCH
pre-deploy-2026-06-12 s-105 TAG

Rule of thumb. Tag the production snapshot before every deploy, dbt release, or pipeline change. The cost is one ref row; the benefit is a one-command rollback that always lands on a known-good state.

Worked example — query "AS OF" a tag from Spark, Trino, and Snowflake

Detailed explanation. Iceberg tags are first-class in every major reader. The SQL syntax to query "as of" a tag varies slightly per engine but the semantics are identical: pin the table state to the tagged snapshot and run the query.

Question. Given db.sales with a tag gold-v1 pinned at snapshot s-093, write the "as of tag" query in Spark / Trino / Snowflake / Flink syntax. Show that all four return identical results.

Input.

tag name snapshot_id
gold-v1 s-093

Code.

-- Spark SQL
SELECT region, SUM(amount) AS revenue
FROM db.sales VERSION AS OF 'gold-v1'
WHERE order_date = DATE '2026-05-01'
GROUP BY region;

-- Trino / Presto
SELECT region, SUM(amount) AS revenue
FROM db.sales FOR VERSION AS OF 'gold-v1'
WHERE order_date = DATE '2026-05-01'
GROUP BY region;

-- Snowflake (Iceberg tables)
SELECT region, SUM(amount) AS revenue
FROM db.sales AT (BRANCH => 'gold-v1')   -- tag works through the BRANCH/TAG ref API
WHERE order_date = DATE '2026-05-01'
GROUP BY region;

-- Flink SQL
SELECT region, SUM(amount) AS revenue
FROM db.sales /*+ OPTIONS('snapshot-id'='<resolved-from-tag>') */
WHERE order_date = DATE '2026-05-01'
GROUP BY region;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Each engine has slightly different AS OF syntax, but all four resolve gold-v1 to snapshot s-093 by reading the refs metadata sub-table.
  2. The planner then walks the s-093 manifest tree to compute the file list, exactly as for any other snapshot read.
  3. The query returns the same revenue numbers in every engine because the underlying data files are identical — they are addressed by the snapshot id, not the engine.
  4. This is the multi-engine reader contract Iceberg promises: Spark, Trino, Snowflake, Flink, DuckDB, ClickHouse, StarRocks, Athena, BigQuery all see the same view of the same snapshot.

Output (identical across engines).

region revenue
EU 1_240_500
US 980_400
APAC 420_900

Rule of thumb. When you publish a "golden" dataset version, give it a tag. Consumers on any engine can pin their reads to the tag with one syntax change, and the engine's planner does the rest. Tags are the cross-engine cross-team contract.

Worked example — protect a tag from snapshot expiration

Detailed explanation. A common production failure: the snapshot retention is 7 days, an auditor asks for the month-end snapshot 30 days later, and expire_snapshots has already reclaimed it. Tags solve this by pinning the snapshot independently of the retention window.

Question. Configure db.sales so that the month-end snapshot is preserved for 7 years (SOX requirement) even though the default snapshot retention is 7 days.

Input.

config value
default snapshot retention 7 days
month-end snapshot id s-093
requirement keep s-093 for 7 years

Code.

-- 1) Create the tag with explicit retention
ALTER TABLE db.sales CREATE TAG `eom-2026-05`
  AS OF VERSION 93
  RETAIN 2557 DAYS;            -- ~7 years

-- 2) Configure the table-level retention defaults (apply to non-tagged snapshots)
ALTER TABLE db.sales SET TBLPROPERTIES (
  'history.expire.max-snapshot-age-ms' = '604800000',     -- 7 days
  'history.expire.min-snapshots-to-keep' = '5'
);

-- 3) Run the periodic expire job (typically Airflow daily)
CALL system.expire_snapshots(
    table             => 'db.sales',
    older_than        => TIMESTAMP '2026-06-05 00:00:00',
    retain_last       => 5
);

-- 4) Verify the tagged snapshot survives
SELECT snapshot_id, committed_at
FROM db.sales.snapshots
WHERE snapshot_id = 93;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The tag eom-2026-05 pins s-093 and carries its own RETAIN 2557 DAYS retention, overriding the table-level "expire after 7 days" rule for this specific snapshot.
  2. The table-level config is unchanged for other snapshots — appends and daily commits still age out after 7 days.
  3. When expire_snapshots runs, it walks every ref (main, tags, branches) and computes "reachable snapshots." s-093 is reachable from the tag, so it is kept.
  4. Seven years later, when the auditor queries db.sales VERSION AS OF 'eom-2026-05', the snapshot is still there. The query reproduces the month-end revenue numbers exactly.

Output.

snapshot_id committed_at retained because
s-093 2026-05-31 23:55 tag eom-2026-05 (RETAIN 7 years)

Rule of thumb. Compliance pins go in tags with explicit RETAIN clauses. Never rely on table-level retention for "must-survive-for-N-years" snapshots — tags are the contract.

SQL interview question on tag-based audit reproducibility

A senior interviewer often asks: "How would you guarantee that any historical revenue number you filed with the regulator is byte-reproducible for 7 years, even though your team runs OPTIMIZE and expire_snapshots daily?" It probes tag retention, the interaction between tags and expiration, and the audit-trail story.

Solution Using compliance tags with explicit retention

-- 1) Tag the filed snapshot the moment the filing is finalised
ALTER TABLE finance.revenue CREATE TAG `filed-2026-Q1`
  AS OF VERSION (SELECT snapshot_id FROM finance.revenue.refs WHERE name='main')
  RETAIN 2557 DAYS;           -- 7 years SOX retention

-- 2) Record the tag plus the snapshot id in the filings audit ledger
INSERT INTO finance.filings_audit (filing_id, table_name, snapshot_id, tag_name, filed_at)
SELECT
    'Q1-2026-revenue',
    'finance.revenue',
    r.snapshot_id,
    r.name,
    CURRENT_TIMESTAMP
FROM finance.revenue.refs r
WHERE name = 'filed-2026-Q1';

-- 3) Years later: reproduce the filed number
SELECT
    region,
    SUM(amount) AS filed_revenue
FROM finance.revenue VERSION AS OF 'filed-2026-Q1'
WHERE filing_period = 'Q1-2026'
GROUP BY region;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

step action catalog state
1 createTag filed-2026-Q1 RETAIN 7y new ref protecting s-093
2 insert into audit ledger external row capturing tag
3 (1 year later) expire_snapshots(older_than=300d, retain_last=5) s-093 protected; other s ≤ 300d gone
4 (7 years later) SELECT VERSION AS OF 'filed-2026-Q1' snapshot s-093 still queryable
5 (7y + 1d later) tag self-expires per RETAIN clause snapshot s-093 eligible to expire

The trace shows the contract: the tag's RETAIN 7 years clause makes s-093 immune to expire_snapshots for the full 7 years, after which the tag itself expires and the snapshot becomes a normal candidate for the table's default retention.

Output:

audit query result reproducible?
Q1-2026 revenue at filing time 4_205_300 yes
Q1-2026 revenue 1 year later 4_205_300 yes (tag intact)
Q1-2026 revenue 7 years later 4_205_300 yes (tag still intact)

Why this works — concept by concept:

  • Tag as an immutable pin — the tag binds s-093 to a stable name; the tag's RETAIN clause overrides table-level retention.
  • Audit ledger as external evidence — recording the tag name + snapshot id in an external ledger means even if the Iceberg metadata is somehow lost, the audit trail names the exact snapshot.
  • expire_snapshots respects refs — the expiration job walks every ref (branches + tags + main) and only reclaims snapshots not reachable from any ref. Tags guarantee reachability.
  • VERSION AS OF resolves the tag at read time — the engine reads the refs sub-table, finds the snapshot id, and walks the manifest tree. No filename lookups, no path hacks.
  • Multi-engine reproducibility — because the data files are immutable Parquet referenced by the manifest tree, Spark, Trino, Snowflake, and Flink all return identical results when reading the same tag.
  • Cost — one ref row plus the protected snapshot's files. The 7-year retention is a storage decision, not a compute decision; the read cost is identical to any other snapshot read.

SQL
Topic — design
Design problems

Practice →


4. WAP pattern — write to audit branch, validate, fast-forward to main

iceberg wap is the production-grade write audit publish pattern — branch isolation makes the staging table obsolete

The mental model in one line: WAP writes the new snapshot to an isolated audit branch, runs validation against that branch, and on pass fastForwards main to the validated snapshot — making "no consumer ever sees half-published data" a structural guarantee rather than a hope. Once you say "the audit branch IS the staging table — same files, isolated ref," WAP becomes a 5-line ritual instead of a 100-line copy-then-swap.

Iconographic three-station pipeline — a horizontal conveyor with a 'Write' station carrying a clean laptop-and-keyboard glyph (purple ribbon), an 'Audit' station carrying a tall gate-arch with a large green checkmark (orange ribbon), and a 'Publish' station carrying a stage-platform with a downward spotlight cone (green ribbon); a vertical 'audit branch lifecycle' ribbon on the right side with four phase-tiles (create, write, check, publish); on a light PipeCode card.

The WAP contract in five bullets.

  • Write. The transform writes a new snapshot to an audit branch (e.g. audit-<runid>), not to main. Main remains at the previous good state.
  • Audit. Validation checks run against the audit branch — row count vs expected, null rates, freshness, business-rule predicates. The audit reads the exact bytes that will publish.
  • Publish. If audit passes, fastForward('main', 'audit-<runid>') is the single atomic operation that makes the new snapshot live. Readers transition without seeing a partial state.
  • Drop on failure. If audit fails, the audit branch is dropped. Main never moves; the bad data files are reclaimed by the next expire_snapshots.
  • Multi-engine reader isolation. During the audit window, Spark / Trino / Snowflake readers on main see the prior snapshot; only the audit job (which explicitly reads the audit-<runid> branch) sees the new state.

Pre-branches WAP (the legacy pattern).

  • spark.wap.id — a Spark session property. When set, every commit attaches the WAP id as a staged snapshot rather than landing as the new current snapshot.
  • Publish — a separate procedure call (publish_changes) atomically promotes the staged snapshot to current.
  • Drawback. Only Spark wrote staged snapshots; Trino / Flink could not participate. Readers on other engines either saw the staged snapshot (depending on engine version) or were blocked from reading altogether.

Modern branch-based WAP.

  • Engine-agnostic. Any engine that supports branches (Spark, Trino, Flink, Snowflake, Polars + Iceberg) can write to and read from the audit branch.
  • Composable. The audit branch is a real branch — you can branch-from-branch, cherry-pick, or run multiple audits in parallel.
  • No special session property. The branch ref is the entire mechanism; no spark.wap.id needed.
  • The 2026 rule. All new WAP implementations should use branches. The spark.wap.id property is legacy.

Audit checks that belong in a WAP gate.

  • Row count vs expected(new_rows / prior_rows) within [0.8, 1.2] for daily appends; tighter for slowly-changing dimensions.
  • Null rates per critical column(null_count / row_count) below a threshold (e.g. < 1% for customer_id, < 5% for email).
  • FreshnessMAX(event_time) within the SLA window (typically >= now() - 2h for hourly tables).
  • Business rulesSUM(amount) per partition above a floor (catches "the integration silently zeroed out a column").
  • Schema — match the contract: every required column present, types unchanged. Use a JSON Schema or dbt contract for portability.
  • Referential integrity — every customer_id in the new snapshot exists in dim_customer. Detects orphans before they hit BI.

Tools that play well with branch-WAP.

  • dbt--target flag plus pre_hook / post_hook to wire branch creation, model build, and fast-forward. dbt-Iceberg adapter handles the branch syntax.
  • Great Expectations — runs as the audit gate; pointed at the audit branch via VERSION AS OF.
  • Soda Core — declarative checks; supports Iceberg via Spark/Trino connections.
  • Airflow / Dagster — orchestrate the create-branch → run-transform → audit → publish-or-drop sequence as discrete tasks. The publish task is a 1-line fast_forward call.

Common interview probes on WAP.

  • "What is WAP and why does it matter?" — write-audit-publish, the contract that no consumer ever sees a partial or invalid commit. Matters because object-storage tables would otherwise expose half-written state.
  • "How does branch-based WAP differ from spark.wap.id?" — branches are engine-agnostic, composable, and do not require a session property. spark.wap.id was Spark-only.
  • "What happens to the bad data on a WAP failure?" — the audit branch is dropped; the files are reclaimed by the next expire_snapshots. Main never moves.
  • "Can multiple WAP runs proceed in parallel?" — yes, one branch per run id. They publish in their respective fast-forward order. Conflicts between runs are detected by the catalog's last-writer-wins retry.

Worked example — Spark write to the audit branch

Detailed explanation. A daily ETL appends new orders to db.fact_orders. Under branch-WAP, the writer targets the audit-20260612 branch instead of main; the audit gate then reads from that branch.

Question. Show the Spark / SQL code that creates an audit branch from the current main head, appends today's orders to the branch, and registers the branch in a run-id ledger for later audit.

Input.

run_id source rows target table branch ref
20260612 124_000 db.fact_orders audit-20260612

Code.

-- 1) Create the audit branch from current main
ALTER TABLE db.fact_orders CREATE BRANCH `audit-20260612`
  AS OF VERSION (SELECT snapshot_id FROM db.fact_orders.refs WHERE name='main');

-- 2) Append today's orders to the branch
INSERT INTO db.fact_orders.`branch_audit-20260612`
SELECT
    order_id, customer_id, order_date, amount, region
FROM staging.orders_20260612;

-- 3) Register the audit run in the lineage ledger
INSERT INTO etl.runs (run_id, table_name, branch_name, started_at)
VALUES ('20260612', 'db.fact_orders', 'audit-20260612', CURRENT_TIMESTAMP);
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. CREATE BRANCH audit-20260612 writes one ref row pointing at main's current snapshot. No data is copied.
  2. INSERT INTO ...branch_audit-20260612 appends to the branch. The new files land in S3 under the table's data prefix and are referenced by the branch's manifest tree. Main is untouched.
  3. The ledger row captures the run id and the branch name so downstream auditors and the publish task can find the right branch by run id rather than guessing the name.
  4. At this point, production readers on main see the previous day's data; only readers that explicitly target branch_audit-20260612 see today's appends.

Output.

step main snapshot audit-20260612 snapshot rows visible on main
pre s-200 (not created) 12_000_000
1 s-200 s-200 12_000_000
2 s-200 s-a1 12_000_000
3 s-200 s-a1 12_000_000 (ledger row added)

Rule of thumb. Every WAP writer targets a branch named after the run id. The branch name is the link between the orchestrator's run id, the audit gate, the publish task, and the post-mortem audit trail. Never let the writer commit directly to main during WAP.

Worked example — dbt-style audit gate (row counts, null rate, freshness)

Detailed explanation. The audit gate reads the audit branch and verifies three invariants: row count is within tolerance, the critical column has a null rate below threshold, and freshness is within SLA. If any check fails, the gate returns FAIL and the publish task is skipped.

Question. Write the audit gate SQL for db.fact_orders on the audit-20260612 branch. Pass condition: rows in [80%, 120%] of yesterday's count, null_customer_id rate < 1%, MAX(order_date) = today.

Input.

metric expected
yesterday row count 120_000
min today rows 96_000
max today rows 144_000
max null_customer_id rate 0.01
freshness MAX(order_date) = today

Code.

WITH branch_state AS (
    SELECT snapshot_id FROM db.fact_orders.refs WHERE name = 'audit-20260612'
),
branch_metrics AS (
    SELECT
        COUNT(*)                                              AS row_count,
        SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END)  AS null_customer,
        MAX(order_date)                                       AS max_order_date
    FROM db.fact_orders VERSION AS OF (SELECT snapshot_id FROM branch_state)
    WHERE order_date = CURRENT_DATE
),
yesterday AS (
    SELECT COUNT(*) AS row_count
    FROM db.fact_orders
    WHERE order_date = CURRENT_DATE - INTERVAL '1' DAY
)
SELECT
    bm.row_count,
    bm.null_customer,
    bm.max_order_date,
    y.row_count AS y_row_count,
    CASE
        WHEN bm.row_count < 0.8 * y.row_count THEN 'FAIL — too few rows'
        WHEN bm.row_count > 1.2 * y.row_count THEN 'FAIL — too many rows'
        WHEN bm.null_customer > 0.01 * bm.row_count THEN 'FAIL — null_customer_id rate too high'
        WHEN bm.max_order_date <> CURRENT_DATE THEN 'FAIL — stale data'
        ELSE 'PASS'
    END AS gate_decision
FROM branch_metrics bm
CROSS JOIN yesterday y;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. branch_state resolves the run-id branch to its current snapshot id by reading the refs sub-table.
  2. branch_metrics reads the audit branch at that snapshot via VERSION AS OF — exactly the bytes that will publish if the gate passes.
  3. yesterday reads main (the unqualified table reference) for the comparison baseline. Note: it reads main, not the audit branch — that is the whole point of the isolation.
  4. The final SELECT computes the three pass conditions; the first failing check wins. The result is a single-row decision that the orchestrator inspects.
  5. On PASS, the orchestrator calls fast_forward. On FAIL, it calls DROP BRANCH and pages the on-call.

Output (pass case).

row_count null_customer max_order_date y_row_count gate_decision
124_000 80 2026-06-12 120_000 PASS

Output (fail case — too few rows).

row_count null_customer max_order_date y_row_count gate_decision
60_000 40 2026-06-12 120_000 FAIL — too few rows

Rule of thumb. The gate must read the audit branch (via VERSION AS OF), not main. If you accidentally read main, you are checking the previous day's data and the gate will pass on every run — a silent data outage.

Worked example — fast-forward publish vs rollback on failure

Detailed explanation. The publish step is a single API call. Branch-based WAP turns "did the publish succeed?" into "did the catalog write succeed?" — a binary, retriable signal.

Question. Show the publish-or-rollback orchestrator logic for db.fact_orders after the audit gate returns PASS or FAIL.

Input.

audit result action
PASS fast_forward main → audit-20260612
FAIL drop branch, alert

Code.

# Pseudocode for the orchestrator (Airflow / Dagster task)
def publish_or_rollback(table, branch, audit_result):
    if audit_result == "PASS":
        spark.sql(f"CALL system.fast_forward('{table}', 'main', '{branch}')")
        spark.sql(f"ALTER TABLE {table} DROP BRANCH `{branch}`")
        return "published"
    else:
        spark.sql(f"ALTER TABLE {table} DROP BRANCH `{branch}`")
        page_oncall(
            severity="P2",
            message=f"WAP gate failed on {table} branch={branch}: {audit_result}",
        )
        return "rolled_back"
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. On PASS, fast_forward flips main's pointer to the audit branch's head in one atomic catalog write. Every reader transitions to the new state.
  2. After fast-forward, the audit branch ref still points at the (now-shared) head snapshot. Dropping it cleans up the ref; the data files are already part of main's lineage.
  3. On FAIL, dropping the branch removes the only ref pointing to the bad snapshot. The files become unreachable and will be reclaimed by the next expire_snapshots job.
  4. The page-on-call call is what makes WAP a real production pattern — silent failures still need humans.

Output.

outcome main pointer branch ref data files
PASS moved to s-a1 dropped bad files become main files
FAIL unchanged at s-200 dropped bad files orphaned, reclaimed later

Rule of thumb. Treat the publish step as an idempotent operation — if the orchestrator crashes mid-step, retry the fast-forward. The catalog's atomic semantics make the retry safe even if the fast-forward partially succeeded.

Worked example — multi-engine reader isolation during the audit window

Detailed explanation. During the audit window, Spark / Trino / Snowflake readers all see the prior good main snapshot. Only the audit gate (which explicitly targets the branch) sees the in-flight data.

Question. Show what each of three readers sees during the audit window for db.fact_orders: a Trino dashboard query, a Snowflake batch job, and the Spark audit gate.

Input.

engine query target expects to see
Trino main (default) prior good snapshot
Snowflake main (default) prior good snapshot
Spark audit audit-20260612 branch in-flight snapshot

Code.

-- Trino — production dashboard (main, prior good)
SELECT region, SUM(amount) AS revenue
FROM iceberg.db.fact_orders
WHERE order_date = CURRENT_DATE - INTERVAL '1' DAY
GROUP BY region;

-- Snowflake — overnight finance batch (main, prior good)
SELECT region, SUM(amount) AS revenue
FROM db.fact_orders
WHERE order_date = CURRENT_DATE - INTERVAL '1 DAY'
GROUP BY region;

-- Spark — audit gate reads the audit branch only
SELECT region, SUM(amount) AS revenue
FROM db.fact_orders VERSION AS OF (
    SELECT snapshot_id FROM db.fact_orders.refs WHERE name='audit-20260612'
)
WHERE order_date = CURRENT_DATE
GROUP BY region;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Trino's planner resolves iceberg.db.fact_orders against its catalog, finds main's snapshot pointer at s-200, and reads that snapshot. The in-flight audit branch is invisible.
  2. Snowflake's planner does the same against its catalog binding. The audit branch is also invisible to Snowflake.
  3. Spark's audit gate explicitly resolves the audit branch from refs and pins the snapshot for the gate query. It sees the in-flight bytes — exactly what will publish.
  4. The isolation is provided by the catalog's ref model. No engine sees the audit branch unless it explicitly asks for it.

Output.

engine snapshot read data seen
Trino dashboard s-200 (main) prior good
Snowflake batch s-200 (main) prior good
Spark audit gate s-a1 (audit) in-flight

Rule of thumb. Branch-based WAP provides multi-engine reader isolation out of the box. You do not need to coordinate engines, freeze readers, or run a maintenance window — the catalog ref model does it for you.

SQL interview question on designing WAP for a 100TB fact table

A senior interviewer might say: "Design a WAP pattern for a 100TB fact table written every 15 minutes by 8 parallel Spark jobs, read by Trino dashboards every second, and audited by a Great Expectations suite. What does the catalog look like, and what is your failure-mode story?" It probes orchestrator design, branch naming, retention, and the multi-engine isolation guarantee.

Solution Using one audit branch per run-id with fast-forward publish

-- The contract:
--   1. Writer task creates branch audit-<run_id> from current main
--   2. Writer commits to the branch (parallel writers retry on last-writer-wins)
--   3. Audit task runs validations against the branch
--   4. Publish task fast-forwards main on PASS, drops branch on FAIL
--   5. Daily maintenance expires orphaned files from FAIL runs

-- 1) Writer (one of N parallel jobs)
ALTER TABLE db.fact_orders CREATE BRANCH IF NOT EXISTS `audit-${run_id}`
  AS OF VERSION (SELECT snapshot_id FROM db.fact_orders.refs WHERE name='main');

INSERT INTO db.fact_orders.`branch_audit-${run_id}`
SELECT * FROM staging.orders_partition_${shard_id};

-- 2) Audit gate (Great Expectations / Soda / dbt-test pointed at the branch)
-- (returns PASS / FAIL)

-- 3) Publish (orchestrator)
CALL system.fast_forward('db.fact_orders', 'main', 'audit-${run_id}');
ALTER TABLE db.fact_orders DROP BRANCH `audit-${run_id}`;

-- 4) Daily maintenance
CALL system.expire_snapshots(
    table       => 'db.fact_orders',
    older_than  => CURRENT_TIMESTAMP - INTERVAL '7' DAY,
    retain_last => 50
);

CALL system.remove_orphan_files(
    table       => 'db.fact_orders',
    older_than  => CURRENT_TIMESTAMP - INTERVAL '3' DAY
);
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

time event main pointer audit branch gate
t0 run starts s-200 (create at s-200) pending
t0 + 5m 8 parallel writers commit to branch s-200 s-a8 pending
t0 + 8m audit reads branch s-200 s-a8 running
t0 + 10m gate PASS s-200 s-a8 PASS
t0 + 10m + 1ms fast_forward s-a8 s-a8 published
t0 + 10m + 2ms drop branch s-a8 (dropped) done
next day 02:00 maintenance s-a14 (more runs) -- --

The trace shows the central guarantee: during the entire t0 → t0+10m window, the 1000 Trino dashboard readers per second see s-200 — completely consistent prior state. At t0+10m+1ms, every new reader sees s-a8. There is no transition state.

Output:

metric value
writes per day 96
audit-branch refs created per day 96
audit-branch refs alive at end of day 0 (all dropped)
storage overhead from WAP < 1% (only failed-audit files)
readers that saw partial state 0
catalog write per publish 1 (the fast_forward)

Why this works — concept by concept:

  • Branch per run-id — names the audit window uniquely and prevents collisions between parallel runs. The orchestrator's run id is the universal handle.
  • Multiple writers on one branch — the catalog's last-writer-wins retry handles parallel writers natively. Each writer commits its shard; the final branch snapshot is the union.
  • Audit reads the branch — exactly the bytes that will publish. No staging-vs-prod skew.
  • fast_forward is one catalog write — atomic publish. Every reader transitions at the same instant.
  • Drop branch on success — keeps the catalog clean; data files survive under main's lineage.
  • expire_snapshots reclaims orphans — failed-audit branches' files become unreachable and are reclaimed by the daily maintenance.
  • Cost — proportional to data (writes), not table size. The branch-create / drop / fast-forward are O(1) catalog writes. Maintenance is O(snapshots + files) and runs off-peak.

SQL
Topic — streaming
Streaming ingestion patterns

Practice →


5. Production ops — snapshot expiration, OPTIMIZE, schema evolution

Iceberg health is a maintenance cadence — expire_snapshots, rewrite_data_files, rewrite_manifests, and remove_orphan_files are the four sectors of the production wheel

The mental model in one line: a healthy Iceberg table needs four periodic ops — expire snapshots (control history), OPTIMIZE / rewrite_data_files (bin-pack small files), rewrite_manifests (keep planning fast), and remove_orphan_files (reclaim storage from failed writes) — together they keep planning latency low and S3 cost bounded. Once you say "Iceberg is a self-managing format only if you actually run the management," the production cadence becomes a daily, weekly, monthly checklist.

Iconographic ops-maintenance wheel — a centred four-quadrant pie diagram with a small 'maintenance' disc at the centre and a tiny 'tags protected' shield, each quadrant tinted a brand colour with a hero glyph and a concept chip (top-left purple expire_snapshots broom, top-right blue rewrite_manifests manifest-scroll, bottom-right green OPTIMIZE/bin-pack box, bottom-left orange remove_orphan_files lonely-file); a faint dashed rotating arrow around the rim labelled 'daily · weekly · monthly'; a side card titled 'ops cadence' with three chips; on a light PipeCode card.

The four maintenance sectors.

  • expire_snapshots — removes old snapshot manifests beyond the retention window, freeing the data files they uniquely referenced. Tags and branches protect their snapshots. Daily cadence on hot tables, weekly on slow ones.
  • rewrite_data_files (OPTIMIZE) — compacts small data files into bin-packed target-sized files. Cuts planning time and read latency. Weekly cadence on append-heavy tables; on-demand after large overwrite jobs.
  • rewrite_manifests — re-buckets the manifest list to keep manifests balanced by partition and size. Critical for tables with many commits per day. Monthly cadence usually suffices.
  • remove_orphan_files — finds files in the table's S3 prefix not referenced by any snapshot (results of failed writes, crashed jobs, dropped branches) and deletes them. Weekly cadence; run only after expire_snapshots has settled.

Schema evolution — the safe parts and the unsafe parts.

  • Add column. Always safe — new column appended, existing rows have NULL for the new column. No file rewrite.
  • Drop column. Safe — column is removed from the schema; underlying files keep the column data but readers stop seeing it. No file rewrite.
  • Rename column. Safe — Iceberg uses field-ids, not column names, so a rename is a metadata-only operation. No file rewrite. (Compare to Hive, where a rename was a full rewrite.)
  • Reorder columns. Safe — the schema is field-id-keyed, so column order is a presentation choice.
  • Widen type. Safe in many cases — INT → BIGINT, FLOAT → DOUBLE, DECIMAL(p, s) → DECIMAL(p', s) with p' > p and same s.
  • Narrow type. Unsafe — would lose precision or range. Iceberg refuses.
  • Change type incompatibly (STRING → INT). Unsafe — requires a backfill via a new column + cutover.
  • Promote nullable → required. Unsafe — would require validating every existing row. Iceberg refuses unless the table is empty.

Partition evolution.

  • Iceberg supports changing the partition spec without rewriting old data — a textbook feature no other format offers.
  • Example. A table starts partitioned by month(event_time); after a year, traffic grows and you want daily partitions. ALTER TABLE ... ADD PARTITION FIELD day(event_time) switches the spec. New data is partitioned daily; old data stays in monthly partitions.
  • Readers handle it transparently — the manifest tree records which partition spec each file was written under; the planner prunes per-spec.
  • Caveat. Partition evolution is most useful as a future-proofing escape hatch, not a routine knob. Switching specs every quarter is a sign the original spec was wrong; pick a defensive spec at design time.

Catalog choice — the production decision that shapes everything.

  • REST catalog (e.g. Tabular, Polaris, Lakekeeper) — engine-agnostic, vendor-neutral, audited via HTTP. The right default for a multi-engine shop.
  • AWS Glue — managed, low-friction in AWS, but locks you to AWS for catalog operations. Cheap and easy at small-to-medium scale.
  • Nessie — Git-style catalog with branching at the catalog level (not just the table level). Powerful for multi-table transactions; richer than the per-table branches.
  • Snowflake-managed — Snowflake owns the catalog; other engines read through Snowflake's Iceberg APIs. Great if Snowflake is your primary engine.
  • Unity Catalog — Databricks' catalog; Iceberg compatibility is improving. Right answer for Databricks-heavy shops moving toward open formats.
  • Polaris — Snowflake's open-source REST catalog implementation. Multi-engine friendly with first-class tag/branch support.

Multi-engine reader matrix (2026).

  • Spark — read + write + admin (full).
  • Trino / Presto — read + write + admin (full).
  • Flink — read + write (full streaming + batch).
  • Snowflake — read + write (Iceberg-managed or external).
  • DuckDB — read (via the iceberg extension); write support landing in 2026.
  • ClickHouse — read (via the Iceberg engine); write support nascent.
  • StarRocks — read (full); write support landing.
  • Athena (AWS) — read + write.
  • BigQuery — read (via BigLake / Iceberg external tables); write maturing.

Common interview probes on production ops.

  • "How often should I run expire_snapshots?" — daily on tables with frequent commits; weekly is fine on slow tables. Always after a large OPTIMIZE.
  • "What is the difference between rewrite_data_files and rewrite_manifests?" — the first compacts data files; the second compacts metadata. Both reduce planning time, but for different reasons.
  • "How do you know when OPTIMIZE is needed?" — average data-file size below ~128MB on a partition, or db.<table>.files showing > 1000 files per active partition. Both are signs of read-time scan amplification.
  • "Can I rename a column on a 100TB Iceberg table without rewriting data?" — yes. Rename is a metadata-only operation because Iceberg uses field-ids.

Worked example — daily snapshot expiration job protecting tags

Detailed explanation. A daily Airflow task expires snapshots older than 7 days while protecting every tagged snapshot. Tags marked for SOX (e.g. eom-*, filed-*) survive indefinitely until their tag is explicitly dropped.

Question. Write the expire_snapshots call for db.fact_orders with a 7-day retention, keeping the last 30 snapshots minimum, and verifying that tagged snapshots survive.

Input.

config value
older_than 7 days ago
retain_last 30
protected tags eom-2026-05, filed-2026-Q1

Code.

-- 1) Inspect refs to confirm tag protection
SELECT name, snapshot_id, type
FROM db.fact_orders.refs
WHERE type = 'TAG';

-- 2) Run the expire job
CALL system.expire_snapshots(
    table          => 'db.fact_orders',
    older_than     => CURRENT_TIMESTAMP - INTERVAL '7' DAY,
    retain_last    => 30
);

-- 3) Verify tagged snapshots still exist
SELECT
    r.name             AS tag_name,
    r.snapshot_id,
    s.committed_at,
    CASE WHEN s.snapshot_id IS NULL THEN 'MISSING' ELSE 'PRESENT' END AS status
FROM db.fact_orders.refs r
LEFT JOIN db.fact_orders.snapshots s ON r.snapshot_id = s.snapshot_id
WHERE r.type = 'TAG';
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The refs pre-check captures the tags that must survive — useful as a baseline for the post-check.
  2. expire_snapshots computes the set of reachable snapshots: every snapshot referenced (directly or via the parent chain) by any ref (main, branches, tags). Snapshots not reachable AND older than 7 days AND not in the last 30 are deleted.
  3. The post-check confirms that every tag still resolves to a present snapshot. If any tag shows MISSING, an alert fires (this should never happen in correctly configured tables).
  4. The data files freed by expired snapshots are not deleted immediately — they are simply unreferenced. The next remove_orphan_files call reclaims them.

Output.

tag_name snapshot_id committed_at status
eom-2026-05 93 2026-05-31 23:55 PRESENT
filed-2026-Q1 87 2026-04-15 09:00 PRESENT

Rule of thumb. Run expire_snapshots daily on hot tables. Tags are the only contract that guarantees a snapshot will survive — table-level retention can change, but tags are explicit.

Worked example — OPTIMIZE (rewrite_data_files) with bin-pack target

Detailed explanation. A table with hundreds of thousands of small Parquet files per partition plans queries slowly because the planner must open every file's footer. rewrite_data_files reads small files in a partition and writes them out as fewer, larger bin-packed files (typically 512MB target). Read latency drops by 5-10x on heavy partitions.

Question. Run rewrite_data_files on the EU partition of db.fact_orders, targeting 512MB output files, and verify that the data-file count dropped while the row count stayed identical.

Input.

partition files before avg size before rows before
EU 1240 12 MB 4_200_000

Code.

-- 1) Inspect data-file distribution for the EU partition
SELECT
    file_path,
    file_size_in_bytes,
    record_count
FROM db.fact_orders.files
WHERE partition.region = 'EU'
ORDER BY file_size_in_bytes ASC
LIMIT 5;

-- 2) Run bin-pack rewrite on the EU partition
CALL system.rewrite_data_files(
    table   => 'db.fact_orders',
    options => map(
        'target-file-size-bytes', '536870912',     -- 512 MB
        'min-input-files',        '5',
        'max-concurrent-file-group-rewrites', '8'
    ),
    where   => "region = 'EU'"
);

-- 3) Verify file count dropped and row count is identical
SELECT
    COUNT(*)                          AS file_count,
    SUM(file_size_in_bytes) / 1e9     AS total_gb,
    SUM(record_count)                 AS row_count
FROM db.fact_orders.files
WHERE partition.region = 'EU';
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The pre-check confirms the symptom: ~1240 files averaging 12MB — far below the 128MB-512MB sweet spot most engines prefer.
  2. rewrite_data_files reads the small files in where-matched partitions and writes new files of target-file-size-bytes. The procedure commits a new snapshot whose added_data_files are the new large files and deleted_data_files are the old small ones. Row counts are preserved by construction.
  3. The post-check shows the new file count (~12 files of ~500MB each), same total bytes (a tiny overhead from format headers), same row count.
  4. Planning time on the next query against the EU partition drops by an order of magnitude because the planner now opens 12 footers instead of 1240.

Output.

metric before after
file_count 1240 12
total_gb 14.5 14.5
row_count 4_200_000 4_200_000
planning latency (typical) 6.4 s 0.5 s

Rule of thumb. Run weekly rewrite_data_files on append-heavy tables. Target 256-512MB per file. Skip partitions whose average file size is already in range — the procedure's min-input-files option lets you do this automatically.

Worked example — schema evolution: add column, drop column, rename

Detailed explanation. Iceberg's field-id-keyed schema makes column ops metadata-only. Adding, dropping, or renaming a column does not rewrite any data files — it updates the schema metadata and changes how the readers map files to columns.

Question. On db.fact_orders, add a currency column (default NULL), rename amount to amount_local, drop the obsolete legacy_flag column. Verify zero file rewrites.

Input.

operation column
add currency STRING
rename amount → amount_local
drop legacy_flag

Code.

-- 1) Inspect current schema
DESCRIBE TABLE db.fact_orders;

-- 2) Schema evolution — all three operations are metadata-only
ALTER TABLE db.fact_orders ADD COLUMN currency STRING;
ALTER TABLE db.fact_orders RENAME COLUMN amount TO amount_local;
ALTER TABLE db.fact_orders DROP COLUMN legacy_flag;

-- 3) Verify no data files were rewritten by the schema change
SELECT
    snapshot_id, operation,
    summary['added-data-files']    AS added,
    summary['deleted-data-files']  AS deleted
FROM db.fact_orders.snapshots
ORDER BY committed_at DESC
LIMIT 4;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The schema pre-check captures the current columns. amount, legacy_flag, etc.
  2. Each ALTER TABLE writes a new snapshot whose operation is replace (schema update) and whose summary shows added_data_files = 0, deleted_data_files = 0 — the data is untouched.
  3. The RENAME works because Iceberg stores column field-ids in the manifest tree; the rename changes the schema's name → field-id mapping without touching the files.
  4. The DROP COLUMN removes the column from the schema; existing files still contain the data, but readers no longer expose it. The disk usage is unchanged until a future rewrite_data_files writes new files without the dropped column.
  5. The post-check confirms three new snapshots, each with zero data-file delta. Schema evolution at scale, for free.

Output.

snapshot_id operation added deleted
s-201 replace (add currency) 0 0
s-202 replace (rename amount) 0 0
s-203 replace (drop legacy_flag) 0 0
s-200 append (prior) 12 0

Rule of thumb. Schema evolution on Iceberg is a metadata operation, not a rewrite. The cost is one catalog write per change. Compared to Hive, where a column rename was a full table copy, this is the single most-quoted operational win of the format.

Worked example — multi-engine read of the same Iceberg table

Detailed explanation. A typical 2026 stack has Spark for batch transforms, Trino for ad-hoc analytics, Snowflake for finance reporting, and Flink for streaming. All four read the same Iceberg table through the catalog; none of them needs format-specific glue.

Question. Show the canonical SELECT against db.fact_orders in Spark / Trino / Snowflake / Flink. Confirm that the result is identical.

Input.

engine catalog binding
Spark spark_catalog → REST
Trino iceberg.db.fact_orders → REST
Snowflake db.fact_orders (Iceberg external)
Flink iceberg_catalog.db.fact_orders → REST

Code.

-- Spark
SELECT region, SUM(amount_local) AS revenue
FROM db.fact_orders
WHERE order_date = DATE '2026-06-12'
GROUP BY region;

-- Trino
SELECT region, SUM(amount_local) AS revenue
FROM iceberg.db.fact_orders
WHERE order_date = DATE '2026-06-12'
GROUP BY region;

-- Snowflake (Iceberg external table)
SELECT region, SUM(amount_local) AS revenue
FROM db.fact_orders
WHERE order_date = '2026-06-12'
GROUP BY region;

-- Flink SQL
SELECT region, SUM(amount_local) AS revenue
FROM iceberg_catalog.db.fact_orders
WHERE order_date = DATE '2026-06-12'
GROUP BY region;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Each engine resolves the table identifier against its catalog binding. All four catalogs point to the same Iceberg metadata location.
  2. Each engine's planner reads the current snapshot's manifest tree from the same S3 prefix and prunes files using column-level stats.
  3. Each engine executes the same SQL semantics over the same data files. The result is byte-identical because the data is the same Parquet.
  4. The only differences are engine-specific syntax (e.g. Trino's three-part identifier vs Spark's two-part) and connector configuration.

Output (identical across engines).

region revenue
EU 1_310_200
US 1_045_900
APAC 444_100

Rule of thumb. When designing a new lakehouse, list every engine that needs to read it; pick a catalog (REST, Polaris, Nessie, Glue) that all of them support; let Iceberg's format do the rest. The "one format, every engine" promise is only as strong as the catalog you pick.

SQL interview question on a production Iceberg maintenance cadence

A senior interviewer often opens with: "Design the weekly maintenance schedule for a 200TB Iceberg fact table with 96 commits per day and 12 active downstream readers. What runs daily, what runs weekly, what runs monthly, and how do you protect compliance tags?" It probes operational hygiene, retention design, and the interaction between expiration and tags.

Solution Using a tag-protected daily / weekly / monthly cadence

-- DAILY — runs at 02:00 in low-traffic window
CALL system.expire_snapshots(
    table       => 'db.fact_orders',
    older_than  => CURRENT_TIMESTAMP - INTERVAL '7' DAY,
    retain_last => 50
);

-- WEEKLY — runs Sunday 02:30
CALL system.rewrite_data_files(
    table   => 'db.fact_orders',
    options => map(
        'target-file-size-bytes', '536870912',
        'min-input-files',        '5'
    )
);

CALL system.remove_orphan_files(
    table       => 'db.fact_orders',
    older_than  => CURRENT_TIMESTAMP - INTERVAL '3' DAY
);

-- MONTHLY — runs first Sunday at 03:00
CALL system.rewrite_manifests(
    table => 'db.fact_orders'
);

-- ONGOING — tagging step in the publish task (already covered in WAP)
ALTER TABLE db.fact_orders CREATE TAG `eom-${YYYY-MM}`
  AS OF VERSION (SELECT snapshot_id FROM db.fact_orders.refs WHERE name='main')
  RETAIN 2557 DAYS;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

cadence op input output
daily expire_snapshots(7d, retain 50) ~700 snapshots, 50 tags snapshots > 7d AND not in last 50 AND not tagged → eligible to expire
weekly rewrite_data_files partitions with avg file < 128MB bin-packed to 512MB
weekly remove_orphan_files S3 files not in any snapshot, > 3d old deleted
monthly rewrite_manifests imbalanced manifests rebalanced
event createTag eom-YYYY-MM last good monthly snapshot tag protects for 7 years

The trace shows how the four operations layer: daily expiration controls snapshot count; weekly OPTIMIZE controls file count; weekly orphan cleanup controls S3 cost; monthly rewrite_manifests controls planning time; per-month tagging controls compliance pinning.

Output:

metric before after
avg planning latency 4.2 s 0.6 s
avg files per partition 800 28
S3 storage (table) 240 TB 200 TB
snapshot count ~1200 ~50 + tags
compliance tags preserved yes yes

Why this works — concept by concept:

  • expire_snapshots respects refs — tags and branches protect their snapshots. The retention window controls only un-referenced snapshots.
  • rewrite_data_files bin-packs — compacts small files into 256-512MB targets; reduces planning latency by an order of magnitude.
  • remove_orphan_files reclaims S3 — files left by failed writes or dropped branches; runs after expire_snapshots to avoid races.
  • rewrite_manifests rebalances metadata — keeps the manifest tree shallow and partition-aligned for fast planning at large commit counts.
  • Per-month tagging — the audit trail and compliance pin in one tag; the RETAIN 7 YEARS clause overrides the daily 7-day expiration for tagged snapshots.
  • Cost — daily expire is O(snapshots); weekly OPTIMIZE is O(small files); orphan cleanup is O(S3 listing); rewrite_manifests is O(manifest size). All four are bounded by table change rate, not table size — so the cadence scales sublinearly.

SQL
Topic — optimization
Optimization problems

Practice →


Cheat sheet — Iceberg branching, tagging, and WAP recipes

  • Inspect commit history. SELECT * FROM <table>.snapshots ORDER BY committed_at DESC LIMIT N — the audit log every other format had to bolt on.
  • List every named ref. SELECT name, snapshot_id, type FROM <table>.refs — branches plus tags plus main, in one query.
  • Time travel to a timestamp. SELECT ... FROM <table> TIMESTAMP AS OF '2026-06-12 06:00:00' — pin the planner to the snapshot that was current then.
  • Time travel to a snapshot id. SELECT ... FROM <table> VERSION AS OF 103 — unambiguous, audit-friendly.
  • Time travel to a tag. SELECT ... FROM <table> VERSION AS OF 'gold-v1' — engine resolves the tag from the refs table.
  • Atomic rollback. CALL system.rollback_to_snapshot('<table>', <prior_id>) — flip the main pointer back; bad snapshot stays on disk until expired.
  • Create a branch. ALTER TABLE <table> CREATE BRANCH '<name>' AS OF VERSION (SELECT snapshot_id FROM <table>.refs WHERE name='main').
  • Drop a branch. ALTER TABLE <table> DROP BRANCH '<name>' — ref removed; data files freed only if no other ref references them.
  • Fast-forward a branch into main. CALL system.fast_forward('<table>', 'main', '<branch>') — single atomic publish.
  • Cherry-pick a single snapshot. CALL system.cherrypick_snapshot('<table>', <source_snapshot_id>) — bring one fix from a branch into main.
  • Create a tag (with retention). ALTER TABLE <table> CREATE TAG '<name>' AS OF VERSION <id> RETAIN 2557 DAYS — 7-year SOX-style pin.
  • Drop a tag. ALTER TABLE <table> DROP TAG '<name>' — pinned snapshot becomes eligible to expire if no other ref protects it.
  • WAP write (branch-based). ALTER TABLE ... CREATE BRANCH 'audit-<runid>'; INSERT INTO ...branch_audit-<runid> SELECT ...; gate; fast_forward; drop branch.
  • Bin-pack OPTIMIZE. CALL system.rewrite_data_files(table => '<table>', options => map('target-file-size-bytes','536870912'), where => "<predicate>").
  • Daily snapshot expiration. CALL system.expire_snapshots(table => '<table>', older_than => CURRENT_TIMESTAMP - INTERVAL '7' DAY, retain_last => 50) — tags and branches protect their snapshots automatically.
  • Rewrite manifests (monthly). CALL system.rewrite_manifests(table => '<table>') — re-balance metadata to keep planning fast.
  • Reclaim orphan files. CALL system.remove_orphan_files(table => '<table>', older_than => CURRENT_TIMESTAMP - INTERVAL '3' DAY) — run after expire_snapshots.
  • Schema evolution (metadata only). ALTER TABLE ... ADD COLUMN ..., RENAME COLUMN ... TO ..., DROP COLUMN ... — all are zero data-file rewrites because Iceberg uses field-ids.
  • Partition evolution. ALTER TABLE ... ADD PARTITION FIELD day(event_time) — change spec without rewriting old data; readers handle both specs transparently.
  • Inspect file count per partition. SELECT partition, COUNT(*) FROM <table>.files GROUP BY partition ORDER BY COUNT(*) DESC — identify OPTIMIZE candidates.
  • Apache Iceberg vs Delta Lake quick rule. Multi-engine multi-vendor lakehouse: Iceberg. Databricks-first shop: Delta with UniForm. Migrating in 2 years: Iceberg.

Frequently asked questions

What is the difference between an Iceberg branch and a tag?

A branch is a movable named ref to a snapshot — writes targeting the branch advance its pointer to new snapshots. A tag is an immutable named ref — once created, its snapshot id does not change (you can only drop and recreate it). Branches are for in-progress isolated work (WAP, dev sandboxes, long-lived experiments); tags are for compliance pins, golden-dataset releases, and "last known good" markers. Both protect their snapshots from expire_snapshots, but tags carry an explicit RETAIN clause and are the right primitive for "must survive for 7 years" audit requirements.

Apache Iceberg vs Delta Lake — which should I pick for a new lakehouse?

For a multi-engine lakehouse with Snowflake, Databricks, Trino, Flink, and ad-hoc DuckDB readers, pick Iceberg — its catalog-pluggable design and first-class branch/tag APIs are universally supported. For a Databricks-first shop with limited multi-engine writers and a strong Spark / Photon investment, Delta with UniForm (which writes Iceberg-compatible metadata alongside Delta) is operationally simpler. For "I might switch engines in two years" or "I want to avoid vendor lock-in," Iceberg is the safer long-term bet. Both formats now ship snapshot atomicity, ACID, time travel, and schema evolution — the differences are catalog model, branching API, and the writer ecosystem outside Spark.

How does WAP differ from a staging table?

A staging table is a physical copy: you write to staging.fact_orders, validate, then INSERT INTO production.fact_orders SELECT * FROM staging.fact_orders — two copies of the data exist briefly and the publish is a multi-statement transaction (or worse, two separate transactions on object storage). Branch-based WAP writes the new snapshot to an isolated branch on the same table; the data files are shared; the publish is a single atomic catalog write that flips the main ref. WAP eliminates the copy, eliminates the two-step publish, and provides the same isolation guarantee with zero storage overhead beyond the new data itself.

Do Iceberg tags prevent snapshot expiration?

Yes. expire_snapshots walks every ref (main, branches, tags) and computes the set of reachable snapshots. Any snapshot reachable from any ref is kept regardless of the older_than clause. Tags can also carry an explicit RETAIN clause that self-expires the tag after a fixed duration (e.g. 7 years for SOX compliance), after which the protected snapshot becomes eligible for normal expiration. The combined contract — "tagged snapshots survive until the tag is dropped or self-expires" — is what makes tags the right primitive for compliance and audit reproducibility.

Can I read an Iceberg branch from Trino or Flink, not just Spark?

Yes. Branch-based WAP works because branches are catalog-level refs, not Spark-specific state. Trino reads a branch via SELECT ... FROM iceberg.db.table FOR VERSION AS OF '<branch>'; Flink via the OPTIONS('snapshot-id'='<resolved-from-branch>') hint; Snowflake via AT(BRANCH => '<branch>'). Every engine that has Iceberg catalog support resolves the branch's snapshot id through the refs metadata sub-table and reads it identically to any other snapshot. This multi-engine isolation is what made the legacy spark.wap.id property obsolete and why branch-based WAP is the 2026 default.

How do I roll back a bad production commit safely?

Two paths. First, if a tag protects the last-good snapshot, CALL system.rollback_to_snapshot('<table>', (SELECT snapshot_id FROM <table>.refs WHERE name='<tag>')) flips the main pointer back atomically. Every reader transitions immediately; the bad snapshot remains on disk for post-mortem until expired. Second, if no tag exists, query <table>.snapshots ORDER BY committed_at DESC LIMIT 5 to identify the prior good snapshot id, then call rollback_to_snapshot('<table>', <good_snapshot_id>). In both cases, the catalog's atomic pointer swap guarantees no reader sees a half-rolled-back state. The lesson: tag every production deploy beforehand — rollback then becomes a one-line lookup instead of a forensic exercise.

Practice on PipeCode

Pipecode.ai is Leetcode for Data Engineering — every Iceberg pattern above ships with hands-on practice rooms where you create the audit branch, run the validation gate, fast-forward main, and protect compliance tags against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your branch-based WAP actually behaves the same on Trino as it does on Spark — you can prove it on graded reps.

Practice ETL design now →
Database design drills →

Top comments (0)