Gowtham Potureddi

Posted on Jun 13

Apache Iceberg Branching, Tagging & WAP: Production Patterns

#sql #python #interview #dataengineering

apache iceberg rewrote what "a table" means on object storage. A decade of "files in S3 with a _SUCCESS marker" gave us cheap storage and unbounded scale, but every meaningful invariant a SQL engineer expects — atomicity, isolation, time travel, schema evolution, safe rollback — had to be re-invented on top of immutable blobs. Iceberg's answer is the snapshot manifest: a tiny pointer-swap that makes a multi-gigabyte write atomic, a 90-day rollback trivial, and a "compare last good prod table to today's run" diff a one-line query.

This guide is the production patterns playbook for senior data engineers who already know what Iceberg is and now have to ship it. It walks through the iceberg snapshots model, the apache iceberg vs delta lake decision, the iceberg branching lifecycle (createBranch, fastForward, cherryPick), the iceberg tagging API for immutable compliance pins, the iceberg wap (write audit publish) pattern as the modern replacement for spark.wap.id, and the iceberg time travel plus maintenance cadence (snapshot expiration, OPTIMIZE, rewrite_manifests, schema evolution) that keep a multi-petabyte lakehouse fast and cheap. Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.

When you want hands-on reps immediately after reading, drill the ETL practice library →, rehearse on database design problems →, and stack the streaming-side muscles with streaming problems →.

On this page

Why open table formats — snapshot atomicity, ACID, time travel
Iceberg branches — createBranch / fastForward / cherry-pick, isolated dev
Iceberg tags — immutable labels for compliance and audit
WAP pattern — write to audit branch, validate, fast-forward to main
Production ops — snapshot expiration, OPTIMIZE, schema evolution
Cheat sheet — Iceberg branching, tagging, and WAP recipes
Frequently asked questions
Practice on PipeCode

1. Why open table formats — snapshot atomicity, ACID, time travel

The "files in S3" era broke at scale — the snapshot manifest is the primitive that fixed every invariant a SQL engineer expects

The mental model in one line: an Iceberg table is a tiny JSON pointer that names the current snapshot; every write produces a new snapshot and the table commit is a single conditional pointer swap. Once you internalise "tables are pointer-swaps over immutable files," every Iceberg feature — atomic commits, time travel, branches, tags, WAP — falls out of the same primitive.

Why the pre-Iceberg world broke.

Hive-style partitioned directories told the reader "everything under /sales/dt=2026-06-12/ is the partition for that day." There was no commit record, so a job that wrote 12 of 24 files before crashing left partial data visible to every downstream reader.
_SUCCESS markers were a convention, not a contract — Spark wrote them, but Trino did not always honour them, and a successful marker plus a half-rewritten Parquet file could still corrupt a query.
No schema-evolution invariant — a column rename in Hive was a rewrite of every Parquet file. A column drop was a code review and a prayer.
No transactional semantics — concurrent writers raced on the same partition and the last-writer-wins resolution silently dropped commits.
Listing-bound planning — query planning required LIST against S3 for every partition, which was both slow and a wallet-burn at scale.

The Iceberg primitive.

The snapshot. Every commit produces one snapshot. A snapshot is a manifest list — a tiny file that points to a set of manifest files, which point to the actual Parquet/ORC data files. The current table state is the current snapshot.
The atomic swap. A commit succeeds when the catalog (REST / Glue / Nessie / Polaris / Unity) flips its "current snapshot id" pointer from snap-A to snap-B in a single conditional write. If the swap fails (because someone else committed first), the writer retries the manifest against the new base. No partial reads, no half-written tables.
ACID without locks. Atomicity comes from the pointer swap; consistency from the manifest validation; isolation from immutable snapshots (readers pin a snapshot id at planning time and never see writes that arrive after); durability from S3 itself.
Time travel for free. Every historical snapshot is queryable via AS OF VERSION or AS OF TIMESTAMP until you explicitly expire it. Rollback is a single API call.
No-list planning. The manifest tree carries column-level stats (min/max, null counts) per file, so the planner prunes 99% of files before any LIST or GET is issued. Planning latency drops from minutes to seconds on petabyte tables.

Apache Iceberg vs Delta Lake — the two-format reality.

Both ship snapshot-based ACID on object storage. Both support time travel, schema evolution, and atomic commits.
Iceberg is catalog-pluggable — the table identity lives in the catalog (REST, Glue, Nessie, Polaris, Snowflake-managed, Unity). Multi-engine reads are first-class: Spark, Trino, Flink, Snowflake, DuckDB, ClickHouse, StarRocks, Athena, BigQuery all read Iceberg natively.
Delta Lake has its own commit log living next to the data (_delta_log/). Engines like Trino and Flink can read Delta but the writer ecosystem outside Spark/Databricks is thinner. Delta UniForm shrinks this gap by writing Iceberg-compatible metadata alongside Delta.
Iceberg branches and tags ship as core API (Spark, Flink, Trino procedures). Delta's equivalent is "shallow clone" plus the Delta UniForm bridge — semantically similar but a different surface.
The 2026 rule of thumb. For a new lakehouse with Snowflake / Databricks / Trino / Flink all reading and writing: pick Iceberg. For a Databricks-first shop with limited multi-engine writers: Delta with UniForm is operationally simpler. For "I might switch engines in 2 years": Iceberg is the safer bet.

What interviewers listen for.

Do you say "a commit is a pointer swap" when asked what makes Iceberg atomic? — senior signal.
Do you reach for snapshots metadata table to inspect history before reaching for git log-style mental models? — required answer.
Do you mention catalog choice (REST vs Glue vs Nessie vs Polaris) as a first-class production decision? — senior signal.
Do you know that time travel reads pin the snapshot at planning time, so a long-running query is unaffected by concurrent commits? — required answer.

Worked example — list snapshots and inspect commit history

Detailed explanation. Every Iceberg table exposes a metadata sub-table called .snapshots that lists every commit. Reading it is the first move when a downstream report changed unexpectedly — you can see exactly which snapshot landed, who wrote it, and how many files were added or deleted.

Question. Given a production fact table db.sales, list the last five snapshots with commit timestamp, operation, and the file-level delta (added / deleted). How would you use the result to investigate a "row count jumped by 8% overnight" alert?

Input.

snapshot_id	committed_at	operation	added_data_files	deleted_data_files
s-101	2026-06-12 02:00	append	24	0
s-102	2026-06-12 04:00	append	12	0
s-103	2026-06-12 06:00	overwrite	240	36
s-104	2026-06-12 08:00	append	6	0
s-105	2026-06-12 09:30	append	4	0

Code.

-- Spark / Trino / Snowflake-Iceberg syntax
SELECT
    snapshot_id,
    committed_at,
    operation,
    summary['added-data-files']   AS added_data_files,
    summary['deleted-data-files'] AS deleted_data_files,
    summary['total-records']      AS total_records
FROM db.sales.snapshots
ORDER BY committed_at DESC
LIMIT 5;

Step-by-step explanation.

db.sales.snapshots is the metadata sub-table for the sales table — every Iceberg table exposes it automatically.
summary is a MAP<STRING,STRING> of commit-level counters: added/deleted data files, added/deleted records, total records, partitioning summary, and any caller-attached properties (spark.app.id, dbt run id, Airflow dag run id).
The overwrite row at s-103 is the suspicious commit — it deleted 36 files and added 240, which is the textbook signature of an OPTIMIZE / rewrite_data_files run or a backfill that replaced a partition. Cross-check the wap.id or spark.app.id in summary to know which.
With the suspect snapshot identified, an investigator can time travel (SELECT ... FROM db.sales VERSION AS OF s-102 vs s-103) and diff the row counts per partition to confirm whether the jump came from the overwrite.

Output.

snapshot_id	committed_at	operation	added	deleted	total_records
s-105	2026-06-12 09:30	append	4	0	12_004_200
s-104	2026-06-12 08:00	append	6	0	12_001_800
s-103	2026-06-12 06:00	overwrite	240	36	11_995_400
s-102	2026-06-12 04:00	append	12	0	11_096_200
s-101	2026-06-12 02:00	append	24	0	11_088_400

Rule of thumb. When a downstream metric changes unexpectedly, your first SQL should always be SELECT * FROM <table>.snapshots ORDER BY committed_at DESC LIMIT N. The snapshots table is the audit log every other format had to bolt on later.

Worked example — time travel to query the table at a past timestamp

Detailed explanation. Iceberg supports both VERSION AS OF <snapshot_id> (exact snapshot) and TIMESTAMP AS OF <ts> (the snapshot that was current at that wall-clock time). Time travel reads pin the snapshot at planning time, so a long-running analytical query is immune to concurrent writes.

Question. A daily revenue report ran at 06:00 and produced a number the business is disputing. Re-run the same query against the table as it existed at 06:00 and confirm the historical value.

Input.

snapshot_id	committed_at
s-102	2026-06-12 04:00
s-103	2026-06-12 06:00
s-104	2026-06-12 08:00

Code.

-- Reproduce the report exactly as it ran at 06:00
SELECT
    region,
    SUM(amount) AS revenue
FROM db.sales TIMESTAMP AS OF '2026-06-12 06:00:00'
WHERE order_date = DATE '2026-06-11'
GROUP BY region
ORDER BY region;

-- Or pin the exact snapshot id from the snapshots audit
SELECT
    region,
    SUM(amount) AS revenue
FROM db.sales VERSION AS OF 103
WHERE order_date = DATE '2026-06-11'
GROUP BY region
ORDER BY region;

Step-by-step explanation.

TIMESTAMP AS OF '2026-06-12 06:00:00' resolves to the snapshot that was current at exactly that moment — in this table, snapshot s-103 (the 06:00 overwrite). The planner uses the manifest tree from s-103, not the current pointer.
VERSION AS OF 103 is the unambiguous form — if downstream auditors need to reproduce a number, the snapshot id is the contract.
Both forms cost the same to plan and execute as a regular read; the only difference is which snapshot manifest the planner walks.
Because the snapshot is pinned at planning time, no concurrent writer can affect this query's result. The number you compute is the historical truth.

Output.

region	revenue (at 06:00)	revenue (now)
EU	1_240_500	1_310_200
US	980_400	1_045_900
APAC	420_900	444_100

Rule of thumb. Every disputed metric should be re-runnable with a TIMESTAMP AS OF or VERSION AS OF clause against the original snapshot. If your downstream pipeline forgets to capture the snapshot id, you lose audit reproducibility — capture it in every run's lineage metadata.

Worked example — atomic rollback to a previous snapshot

Detailed explanation. If a bad commit corrupts the table — wrong partition overwritten, accidental DELETE, schema-violating insert — rollback_to_snapshot flips the catalog pointer back to the prior snapshot in a single atomic operation. The bad snapshot still exists on disk (until expired) but is no longer the current state.

Question. A nightly job overwrote the eu partition with bad data in snapshot s-103. Roll the table back to snapshot s-102 atomically, so every reader returns the pre-s-103 state immediately.

Input.

snapshot_id	committed_at	operation
s-101	2026-06-12 02:00	append
s-102	2026-06-12 04:00	append
s-103	2026-06-12 06:00	overwrite (BAD)

Code.

-- Spark / Trino procedure form
CALL system.rollback_to_snapshot('db.sales', 102);

-- Or rollback to the snapshot that was current at a wall-clock time
CALL system.rollback_to_timestamp('db.sales', TIMESTAMP '2026-06-12 05:00:00');

Step-by-step explanation.

The procedure writes a new snapshot s-104 whose state is identical to s-102 (the pointer is reset). The bad snapshot s-103 is still on disk but is no longer in the chain of "current" snapshots.
The pointer swap is atomic — between the millisecond before and after, no reader can see a half-rolled-back table.
Concurrent readers that started before the rollback continue against their pinned snapshot; new readers see the rolled-back state.
The bad snapshot remains queryable via VERSION AS OF 103 for post-mortem analysis until you explicitly expire_snapshots past it.

Output.

step	current snapshot	state visible to readers
before	s-103	bad EU data
after	s-104 (= s-102 state)	pre-bad EU data

Rule of thumb. Treat rollback_to_snapshot as your "undo button" — but always pair it with a snapshot inspection (SELECT * FROM <table>.snapshots) so you roll back to the correct prior snapshot. Rolling back to the wrong snapshot is its own outage.

SQL interview question on Iceberg atomicity and time travel

A senior interviewer often opens with: "Walk me through what happens at the catalog when Spark commits an append to an Iceberg table — what is atomic, what is not, and how does a concurrent reader see (or not see) the new rows?" It probes whether the candidate understands snapshot isolation, the pointer-swap commit, and the read-pinning behaviour.

Solution Using snapshot isolation and the manifest-pointer commit

-- 1) Inspect the current pointer and the in-flight commit candidate
SELECT
    name,
    snapshot_id,
    parent_id,
    is_current_ancestor
FROM db.sales.refs;          -- branches + tags + main

SELECT
    snapshot_id,
    parent_id,
    operation,
    committed_at,
    summary['added-data-files'] AS added,
    summary['spark.app.id']     AS writer
FROM db.sales.snapshots
ORDER BY committed_at DESC
LIMIT 3;

-- 2) Reader query that pins a snapshot at planning time
SELECT
    region,
    COUNT(*) AS row_count
FROM db.sales VERSION AS OF (
    SELECT MAX(snapshot_id) FROM db.sales.snapshots
    WHERE committed_at <= TIMESTAMP '2026-06-12 06:00:00'
)
GROUP BY region;

Step-by-step trace.

step	catalog state	reader state	invariant
1	current = s-102	new reader plans against s-102	snapshot pinned
2	writer prepares s-103 manifest	s-102 still current	no swap yet
3	writer calls catalog.commit(s-102→s-103)	atomic conditional swap	last-writer-wins or retry
4	current = s-103	original reader still sees s-102	snapshot isolation
5	second reader plans now	sees s-103	no partial reads

The trace shows the central guarantee: between steps 1 and 4, the original reader's view is locked to s-102 — it cannot accidentally see half of s-103's writes. The new reader at step 5 sees the fully committed s-103. There is no in-between visibility state.

Output:

reader	snapshot pinned	row_count visible
reader A (started at t1)	s-102	11_096_200
reader B (started at t5)	s-103	11_995_400
reader C (TIMESTAMP AS OF '06:00')	s-103 (was current at 06:00)	11_995_400

Why this works — concept by concept:

Snapshot — a manifest list pointing to a set of data files; immutable once written. Every write produces a new snapshot.
Catalog pointer swap — the table's current_snapshot_id lives in the catalog (REST, Glue, Nessie, Polaris, Unity, or a metadata file pointer in HDFS-catalog mode). Commit = conditional update of that single field.
Snapshot isolation — readers capture the current snapshot id at planning time and use it throughout the query. Concurrent commits never affect an in-flight read.
Last-writer-wins with retry — two concurrent writers race to swap the pointer; the loser re-bases its manifest against the winner's snapshot and retries. No locks, no deadlocks.
Cost — pointer swap is one catalog write; the manifest tree for the new snapshot is O(changed manifests), not O(table). Atomicity is essentially free.

SQL
Topic — database
Database design problems

Practice →

2. Iceberg branches — createBranch / fastForward / cherry-pick, isolated dev

`iceberg branching` makes the table look like a Git repo — branches are named snapshot refs, not table copies

The mental model in one line: an Iceberg branch is a named pointer to a snapshot — exactly like a Git branch is a named pointer to a commit — so creating one is O(1) metadata, not an O(table) copy. Once you say "branches are refs, not clones," the entire branching API (createBranch, fastForward, cherryPick) reads like Git porcelain over the same snapshot DAG.

The branch primitive in five bullets.

A branch is a named ref. The catalog stores one row per branch (and per tag) — name, snapshot_id, max_ref_age_ms, min_snapshots_to_keep, max_snapshot_age_ms. The data is shared with main; only the ref pointer is new.
main is the implicit branch every table starts with. Every commit on the table updates main unless you explicitly target a different branch.
Branch writes are isolated — a writer that targets dev does not touch the main pointer, so readers on main are completely unaffected by experimental work.
fastForward resolves a branch into main when it is strictly ahead. If dev was created from snapshot s-102 and has commits s-d1 → s-d2 → s-d3, and main is still at s-102, then fastForward('main', 'dev') moves the main pointer to s-d3 in a single atomic operation. The data files on dev simply become the files on main.
cherryPick brings a single snapshot from one branch into another as a new snapshot. The picked snapshot's data files are re-committed under main's lineage; the original branch is untouched.

Branch lifecycle API.

createBranch(name, snapshot_id?) — creates a new branch pointing to snapshot_id (or current if omitted). O(1) metadata write.
createBranch(name, snapshot_id, retention?) — optional retention controls protect old snapshots referenced by the branch from expire_snapshots.
fastForward(target, source) — fast-forward the target branch to source. Fails if target has commits not reachable from source (would require a real merge, which Iceberg does not implement — you must cherryPick or rewrite).
cherryPick(snapshot_id) — copy a single snapshot's changes onto the current branch as a new commit. Conflicts with concurrent writes resolve via re-application.
dropBranch(name) — removes the branch ref; snapshots become eligible for expiration if no other ref protects them.

When branches are the right answer.

Sandbox a backfill. Create backfill-2026-Q1 from main, run the rewrite against it, validate, fast-forward into main. Production reads of main never see in-progress backfill state.
Long-lived experiment. Data scientists join a new feature into a feat-customer360 branch and query it; main continues to serve production. If the experiment ships, fast-forward; if it does not, drop the branch.
Multi-tenant staging. Each downstream team gets a branch with its own data-correction overlays; main remains canonical.
WAP (next section). The audit branch is created per run, written to, validated, then fast-forwarded into main — the modern WAP pattern.

When branches are the wrong answer.

As a substitute for partitioning. A branch per tenant is a metadata explosion; use a tenant column and row-level access control instead.
As a replacement for snapshots. If you only need "the table as of yesterday," a snapshot id or a tag is cheaper than a branch.
As a soft DELETE. Iceberg supports DELETE FROM with merge-on-read or copy-on-write at the table level; branches are not for hiding rows.

Common interview probes on branches.

"What is the difference between an Iceberg branch and a Git branch?" — semantically nearly identical (both are named refs to a commit DAG), but Iceberg branches do not implement three-way merge — fast-forward and cherry-pick are the only merge primitives.
"Can two writers commit to the same branch concurrently?" — yes, same last-writer-wins retry as on main. Branches do not serialise writes; they isolate one branch's commits from another's.
"How are branches stored?" — as rows in the table's refs metadata sub-table. The catalog stores the current snapshot id for each ref; the data files are shared across branches.
"Does dropping a branch delete the data?" — no. Data files referenced by the dropped branch survive until the next expire_snapshots run determines they are unreachable from every remaining ref.

Worked example — create a `dev` branch, mutate, and fast-forward into `main`

Detailed explanation. A data engineer wants to test a partition rewrite against a copy of db.sales without affecting production readers. She creates a dev branch from the current main snapshot, runs the rewrite against dev, validates the row counts, then fast-forwards main to the new state in a single atomic call.

Question. Given db.sales at snapshot s-105, create a dev branch, write a rewrite snapshot to it that compacts small files in the eu partition, validate row counts, and fast-forward main to the new dev head.

Input.

step	branch	snapshot pointer
start	main	s-105
start	dev	(not created)

Code.

-- 1) Create the dev branch from current main
ALTER TABLE db.sales CREATE BRANCH `dev`;

-- 2) Write the rewrite on the dev branch
CALL system.rewrite_data_files(
    table         => 'db.sales',
    options       => map('target-file-size-bytes', '536870912'),
    where         => "region = 'EU'",
    branch        => 'dev'
);

-- 3) Validate row counts match between main and dev for EU
SELECT
    'main' AS branch, COUNT(*) AS row_count FROM db.sales VERSION AS OF (SELECT snapshot_id FROM db.sales.refs WHERE name='main')
WHERE region = 'EU'
UNION ALL
SELECT
    'dev', COUNT(*) FROM db.sales VERSION AS OF (SELECT snapshot_id FROM db.sales.refs WHERE name='dev')
WHERE region = 'EU';

-- 4) Fast-forward main to dev's head atomically
CALL system.fast_forward('db.sales', 'main', 'dev');

-- 5) Drop the dev branch (data files now live under main's lineage)
ALTER TABLE db.sales DROP BRANCH `dev`;

Step-by-step explanation.

CREATE BRANCH dev writes one new ref row to the catalog pointing to s-105. No data is copied. O(1).
rewrite_data_files(..., branch => 'dev') runs the compaction targeting dev. The new compacted snapshot s-d1 lands on dev; main is still at s-105 and production readers see no change.
The validation query reads both branches via VERSION AS OF against the ref-resolved snapshot ids. Row counts must match — compaction does not change semantics.
fast_forward('main', 'dev') moves the main pointer to s-d1 atomically. The transition is a single catalog write; no reader sees an in-between state. The cost is zero new data files — the files are already there, only the pointer changes.
DROP BRANCH dev removes the ref row. The files it referenced are still reachable from main (because of the fast-forward), so nothing is orphaned.

Output.

step	main pointer	dev pointer	EU files	EU rows
1	s-105	s-105	120 small	4.2M
2	s-105	s-d1	main:120, dev:18 large	4.2M / 4.2M
3	s-105	s-d1	unchanged	counts match
4	s-d1	s-d1	18 large	4.2M
5	s-d1	(dropped)	18 large	4.2M

Rule of thumb. Every multi-step mutation that could be risky (rewrite, schema evolution, partition reshuffle) belongs on its own branch. The branch is your safety net — if validation fails, you drop the branch and nothing on main changed.

Worked example — cherry-pick a single fix snapshot from a hotfix branch

Detailed explanation. A nightly job ran a multi-step transformation on the feat-2026-Q2 branch. One of the snapshots in the chain is a legitimate bug fix that should be applied to main; the rest are still in progress. Cherry-pick brings just that one snapshot into main without merging the entire branch.

Question. Given feat-2026-Q2 with snapshots s-f1 (refactor), s-f2 (bug fix to NULL handling), s-f3 (unfinished new column), cherry-pick s-f2 into main without bringing s-f1 or s-f3.

Input.

branch	snapshots
main	s-101 → ... → s-105
feat-2026-Q2	s-105 → s-f1 → s-f2 → s-f3

Code.

-- Cherry-pick the bug-fix snapshot only
CALL system.cherrypick_snapshot('db.sales', '<s-f2-id>');

-- Verify the new main snapshot inherited the fix
SELECT
    snapshot_id, parent_id, operation,
    summary['source-snapshot-id'] AS cherry_picked_from
FROM db.sales.snapshots
ORDER BY committed_at DESC
LIMIT 2;

Step-by-step explanation.

cherrypick_snapshot reads the data-file delta of s-f2 (the bug fix) and re-applies it as a new commit s-106 on main. The data files referenced by s-f2 are linked under main's manifest tree.
The new snapshot's summary records source-snapshot-id = s-f2 so the audit trail captures the cherry-pick origin.
feat-2026-Q2 is untouched — s-f1, s-f2, s-f3 remain on the branch for the team's continued work.
If s-f2 had touched the same files as a concurrent main commit, cherry-pick would fail with a conflict; the operator would then either rebase s-f2 against the new main head and retry, or rewrite the fix as a fresh commit on main.

Output.

snapshot_id	parent_id	operation	cherry_picked_from
s-106	s-105	overwrite	s-f2
s-105	s-104	append	(none)

Rule of thumb. Use cherryPick when you need a single fix from a long-lived branch without merging everything else. Use fastForward when the entire branch is ready to land. Use neither — and reach for a fresh INSERT INTO on main — when the change is trivial enough that the branch overhead is not worth it.

Worked example — long-lived experiment branch with concurrent main writes

Detailed explanation. A data-science team works on a feat-customer360 branch for three weeks. During those three weeks, the production ETL keeps committing to main. The team can periodically rebase their branch against the new main head — or they can keep working on the old snapshot and reconcile at the end.

Question. Show the snapshot DAG state at three points in time: branch creation, two weeks in, and at fast-forward time. Explain why concurrent main writes do not block the branch.

Input.

time	main	feat-customer360
t0	s-105	(created at s-105)
t1 (week 1)	s-110	s-105 → s-c1
t2 (week 2)	s-115	s-105 → s-c1 → s-c2
t3 (week 3)	s-120	s-105 → s-c1 → s-c2 → s-c3

Code.

-- At t0: create the branch
ALTER TABLE db.customers CREATE BRANCH `feat-customer360`;

-- At t1, t2, t3: append to the branch (writes do not touch main)
INSERT INTO db.customers.`branch_feat-customer360`
SELECT * FROM staging.customer360_t1;

-- At t3: rebase or cherry-pick to land the work
-- Option A: cherry-pick each branch snapshot in order onto main
CALL system.cherrypick_snapshot('db.customers', '<s-c1-id>');
CALL system.cherrypick_snapshot('db.customers', '<s-c2-id>');
CALL system.cherrypick_snapshot('db.customers', '<s-c3-id>');

-- Option B: validate and fast-forward only if the branch is a clean superset
-- (rare for a 3-week-old branch; usually some main commits conflict)
CALL system.fast_forward('db.customers', 'main', 'feat-customer360');

Step-by-step explanation.

At t0, both refs point to s-105. No data is duplicated; only the ref row is new.
Between t0 and t3, the ETL commits 15 new snapshots to main (s-106 → s-120). Each commit only touches the main ref; the feat-customer360 ref is unaffected.
The DS team commits 3 snapshots to the branch (s-c1, s-c2, s-c3). Each touches only the branch ref; main is unaffected.
At t3, the team wants to land the work. Because main has moved (s-105 → s-120) and the branch is based on s-105, a plain fast_forward will fail — the branch is not strictly ahead of main.
The team's choice is cherry-pick each branch snapshot in order onto main, resolving conflicts where the branch touched a file that main also touched, or rebase by re-running the transformations against the current main head on a new short-lived branch and then fast-forwarding.

Output.

time	main snapshot	branch snapshot	data files shared?
t0	s-105	s-105	yes (identical)
t1	s-110	s-c1	partly — s-c1's files unique to branch
t2	s-115	s-c2	partly
t3	s-120	s-c3	partly
after cherry-picks	s-123	s-c3	branch files now also reachable from main

Rule of thumb. Long-lived branches accumulate divergence. If a branch lives longer than the typical main commit cadence, plan a rebase or cherry-pick strategy from day one. Treat the branch like a feature branch in Git: rebase early, rebase often, or accept that the final landing will be a multi-step cherry-pick.

SQL interview question on sandboxing a backfill

A senior interviewer might frame this as: "You need to backfill 90 days of bad partitions on a 50TB fact table. The downstream consumers cannot tolerate seeing partial backfill state. How would you structure the work using Iceberg branches, and what is your rollback plan?" It probes branch lifecycle, atomic publication, and operator hygiene.

Solution Using a `backfill` branch, validation gate, and fast-forward publish

-- 1) Create the backfill branch from the current production head
ALTER TABLE db.fact_orders CREATE BRANCH `backfill-2026-Q1`
  AS OF VERSION (SELECT snapshot_id FROM db.fact_orders.refs WHERE name='main');

-- 2) Run the backfill targeting the branch
INSERT OVERWRITE TABLE db.fact_orders.`branch_backfill-2026-Q1`
SELECT * FROM staging.corrected_q1_orders;

-- 3) Validation gate — row count, null rate, and per-partition delta vs main
WITH gate AS (
    SELECT
        COUNT(*)                                  AS row_count_branch,
        SUM(CASE WHEN amount IS NULL THEN 1 END)  AS null_amount_branch
    FROM db.fact_orders VERSION AS OF (
        SELECT snapshot_id FROM db.fact_orders.refs WHERE name='backfill-2026-Q1'
    )
    WHERE order_date BETWEEN DATE '2026-01-01' AND DATE '2026-03-31'
)
SELECT
    *,
    CASE
        WHEN row_count_branch   < 12000000      THEN 'FAIL — too few rows'
        WHEN null_amount_branch > 0             THEN 'FAIL — null amounts'
        ELSE                                         'PASS'
    END AS gate_decision
FROM gate;

-- 4) Publish: fast-forward main to the validated branch head
CALL system.fast_forward('db.fact_orders', 'main', 'backfill-2026-Q1');

-- 5) On failure: drop the branch; main is untouched
-- ALTER TABLE db.fact_orders DROP BRANCH `backfill-2026-Q1`;

Step-by-step trace.

step	main	backfill branch	gate state
1	s-200	s-200	not started
2	s-200	s-b1 (overwrite Q1)	not started
3	s-200	s-b1	PASS (rows=12.4M, null=0)
4	s-b1	s-b1	published atomically
5 (fail path)	s-200	dropped	main untouched

The trace illustrates the central guarantee: between step 2 and step 4, production readers see the old Q1 data on main. Between step 4 (the moment of fast_forward) and step 4 + 1 microsecond, every reader transitions to the new Q1 state. There is no intermediate visibility — exactly the contract the consumers require.

Output:

metric	value
main snapshot before	s-200
main snapshot after	s-b1
backfill rows written	12_400_000
readers that saw partial state	0
rollback path on gate failure	drop branch, no further action

Why this works — concept by concept:

Branch isolation — every write targets the backfill-2026-Q1 ref; main is untouched until the explicit fast_forward. Concurrent main commits proceed normally.
Validation gate — runs entirely on the branch snapshot, so the gate is testing exactly the bytes that will be published. No staging-vs-prod skew.
Atomic publish — fast_forward is a single catalog write that flips the main pointer. No reader sees a half-published state.
Rollback by dropping a branch — if the gate fails, the branch is dropped, main never moves, and the operator has zero cleanup. The bad data files are reclaimed by the next expire_snapshots run.
No table copy — the branch is O(1) metadata at creation; the data files are written incrementally as the backfill commits. The total cost is "one backfill" plus a handful of ref-row writes.
Cost — branch create / drop / fast_forward are all O(1) catalog writes. The backfill itself is O(corrected_data) regardless of whether you use a branch.

SQL
Topic — ETL
ETL pipeline problems

Practice →

3. Iceberg tags — immutable labels for compliance and audit

`iceberg tagging` pins a snapshot under a stable name — the audit-friendly equivalent of "the version regulators asked for"

The mental model in one line: a tag is an immutable named ref to a specific snapshot — you cannot move it, only create or drop it — which makes tags the right primitive for compliance pins, golden datasets, and "last known good" markers. Once you say "tags do not move; branches do," the entire tagging API and its retention semantics fall out.

The tag primitive in five bullets.

A tag is a named ref to a single snapshot. Same catalog row shape as a branch; the only difference is the API does not let you move the pointer.
Tags are immutable in the move sense, not in the existence sense. You can dropTag and recreate it pointing to a different snapshot, but you cannot atomically retarget an existing tag.
Tags protect snapshots from expiration. As long as a tag references a snapshot, that snapshot survives expire_snapshots — even when the retention window otherwise would have evicted it.
Tags carry optional retention metadata — max_ref_age_ms lets a tag self-expire after, say, 7 years for SOX compliance.
Tags are queryable like branches — SELECT * FROM db.sales VERSION AS OF 'eom-2025-12' reads the snapshot the tag points at.

Tag lifecycle API.

createTag(name, snapshot_id?) — pin snapshot_id (or current head) under name. Fails if name already exists.
createTag(name, snapshot_id, retention) — attach an explicit lifetime; common for compliance ("keep for 7 years").
dropTag(name) — remove the tag ref. The snapshot becomes eligible for expiration if no other ref protects it.
replaceTag(name, snapshot_id) (rarely available; check engine version) — atomically retarget. Most engines require drop + create.
AS OF '<tag-name>' — query syntax for tag-aware time travel.

When tags are the right answer.

Compliance and audit pins — "tag the last good snapshot before the quarterly close" — regulators can re-query the exact bytes that produced the filed numbers.
Golden datasets — gold-v1, gold-v2, ... tag releases of a curated dataset so downstream consumers can pin a known-good version even as you keep iterating on main.
"Last known good" markers — tag the last snapshot that passed all data-quality checks so a rollback is a one-line rollback_to_snapshot lookup against the tag.
Disaster-recovery anchors — tag every successful daily run; a corrupted commit can be undone by rolling back to yesterday's tag.

When tags are the wrong answer.

As a substitute for branches. Tags do not isolate writes; they label snapshots. Use a branch for in-progress work.
As a substitute for snapshot retention policies. Tags do not change the table-level retention; they pin individual snapshots. The right way to retain "all snapshots for 90 days" is the table retention config, not a tag per snapshot.
As a per-row label. Tags name table-level points in time, not row-level conditions. Use a column for row labels.

Tag-aware retention — the subtle part.

expire_snapshots walks every ref (branches + tags + main) and computes the set of reachable snapshots. Any snapshot reachable from any ref is kept; everything else is eligible for expiration.
A tag at snapshot s-090 protects s-090 and its parent chain up to the root, unless you also configure min_snapshots_to_keep or max_snapshot_age_ms at the tag level — in which case the tag protects only s-090 and snapshots within the configured retention.
Result. Tags are the safest way to keep a specific historical state queryable without ballooning the snapshot count. Pin the snapshots you need; let expire_snapshots reclaim the rest.

Compliance use case in one paragraph. Suppose a regulator asks: "Reproduce the revenue number you filed on 2026-01-31." Without tags, you must hope the snapshot from that date is still within your retention window. With tags, you ran createTag('filed-2026-01', <snapshot_at_eom>) the moment you filed, and the snapshot is guaranteed to survive for the configured retention (typically 7 years for SOX). SELECT ... FROM db.sales VERSION AS OF 'filed-2026-01' reproduces the number exactly. That is the entire audit reproducibility story in one tag.

Worked example — tag the last good snapshot before a known-bad release

Detailed explanation. A team is about to deploy a new ETL version that they expect to be safe but want a one-call rollback path. Before the deploy, they tag the current production snapshot as pre-deploy-2026-06-12. If the new ETL produces bad data, rollback is a single command using the tag.

Question. Given db.sales at snapshot s-105 (last good), tag the snapshot as pre-deploy-2026-06-12. Show how to roll back to the tag if the next commit corrupts the table.

Input.

snapshot_id	committed_at
s-104	2026-06-12 08:00
s-105	2026-06-12 09:30 (last good)

Code.

-- 1) Tag the last good snapshot before deploy
ALTER TABLE db.sales CREATE TAG `pre-deploy-2026-06-12`
  AS OF VERSION (SELECT snapshot_id FROM db.sales.refs WHERE name='main');

-- 2) (Deploy happens, suppose s-106 corrupts the table)

-- 3) Roll back to the tagged snapshot atomically
CALL system.rollback_to_snapshot(
    'db.sales',
    (SELECT snapshot_id FROM db.sales.refs WHERE name='pre-deploy-2026-06-12')
);

-- 4) Inspect the refs table to confirm the rollback landed
SELECT name, snapshot_id, type FROM db.sales.refs ORDER BY name;

Step-by-step explanation.

CREATE TAG writes a ref row pointing to s-105. The catalog now has two refs to that snapshot: main and pre-deploy-2026-06-12.
After the deploy commits s-106 (bad), main = s-106 and the tag = s-105.
rollback_to_snapshot flips main back to s-105 atomically. The bad s-106 still exists but is no longer current; readers see the s-105 state immediately.
The refs table confirms: main = s-105 (post rollback) and pre-deploy-2026-06-12 = s-105 (tag, unchanged).
The tag remains pinned at s-105 forever (or until you drop it). Even if you later expire s-106 (the bad snapshot), s-105 is permanently protected by the tag.

Output.

name	snapshot_id	type
main	s-105	BRANCH
pre-deploy-2026-06-12	s-105	TAG

Rule of thumb. Tag the production snapshot before every deploy, dbt release, or pipeline change. The cost is one ref row; the benefit is a one-command rollback that always lands on a known-good state.

Worked example — query "AS OF" a tag from Spark, Trino, and Snowflake

Detailed explanation. Iceberg tags are first-class in every major reader. The SQL syntax to query "as of" a tag varies slightly per engine but the semantics are identical: pin the table state to the tagged snapshot and run the query.

Question. Given db.sales with a tag gold-v1 pinned at snapshot s-093, write the "as of tag" query in Spark / Trino / Snowflake / Flink syntax. Show that all four return identical results.

Input.

tag name	snapshot_id
gold-v1	s-093

Code.

-- Spark SQL
SELECT region, SUM(amount) AS revenue
FROM db.sales VERSION AS OF 'gold-v1'
WHERE order_date = DATE '2026-05-01'
GROUP BY region;

-- Trino / Presto
SELECT region, SUM(amount) AS revenue
FROM db.sales FOR VERSION AS OF 'gold-v1'
WHERE order_date = DATE '2026-05-01'
GROUP BY region;

-- Snowflake (Iceberg tables)
SELECT region, SUM(amount) AS revenue
FROM db.sales AT (BRANCH => 'gold-v1')   -- tag works through the BRANCH/TAG ref API
WHERE order_date = DATE '2026-05-01'
GROUP BY region;

-- Flink SQL
SELECT region, SUM(amount) AS revenue
FROM db.sales /*+ OPTIONS('snapshot-id'='<resolved-from-tag>') */
WHERE order_date = DATE '2026-05-01'
GROUP BY region;

Step-by-step explanation.

Each engine has slightly different AS OF syntax, but all four resolve gold-v1 to snapshot s-093 by reading the refs metadata sub-table.
The planner then walks the s-093 manifest tree to compute the file list, exactly as for any other snapshot read.
The query returns the same revenue numbers in every engine because the underlying data files are identical — they are addressed by the snapshot id, not the engine.
This is the multi-engine reader contract Iceberg promises: Spark, Trino, Snowflake, Flink, DuckDB, ClickHouse, StarRocks, Athena, BigQuery all see the same view of the same snapshot.

Output (identical across engines).

region	revenue
EU	1_240_500
US	980_400
APAC	420_900

Rule of thumb. When you publish a "golden" dataset version, give it a tag. Consumers on any engine can pin their reads to the tag with one syntax change, and the engine's planner does the rest. Tags are the cross-engine cross-team contract.

Worked example — protect a tag from snapshot expiration

Detailed explanation. A common production failure: the snapshot retention is 7 days, an auditor asks for the month-end snapshot 30 days later, and expire_snapshots has already reclaimed it. Tags solve this by pinning the snapshot independently of the retention window.

Question. Configure db.sales so that the month-end snapshot is preserved for 7 years (SOX requirement) even though the default snapshot retention is 7 days.

Input.

config	value
default snapshot retention	7 days
month-end snapshot id	s-093
requirement	keep s-093 for 7 years

Code.

-- 1) Create the tag with explicit retention
ALTER TABLE db.sales CREATE TAG `eom-2026-05`
  AS OF VERSION 93
  RETAIN 2557 DAYS;            -- ~7 years

-- 2) Configure the table-level retention defaults (apply to non-tagged snapshots)
ALTER TABLE db.sales SET TBLPROPERTIES (
  'history.expire.max-snapshot-age-ms' = '604800000',     -- 7 days
  'history.expire.min-snapshots-to-keep' = '5'
);

-- 3) Run the periodic expire job (typically Airflow daily)
CALL system.expire_snapshots(
    table             => 'db.sales',
    older_than        => TIMESTAMP '2026-06-05 00:00:00',
    retain_last       => 5
);

-- 4) Verify the tagged snapshot survives
SELECT snapshot_id, committed_at
FROM db.sales.snapshots
WHERE snapshot_id = 93;

Step-by-step explanation.

The tag eom-2026-05 pins s-093 and carries its own RETAIN 2557 DAYS retention, overriding the table-level "expire after 7 days" rule for this specific snapshot.
The table-level config is unchanged for other snapshots — appends and daily commits still age out after 7 days.
When expire_snapshots runs, it walks every ref (main, tags, branches) and computes "reachable snapshots." s-093 is reachable from the tag, so it is kept.
Seven years later, when the auditor queries db.sales VERSION AS OF 'eom-2026-05', the snapshot is still there. The query reproduces the month-end revenue numbers exactly.

Output.

snapshot_id	committed_at	retained because
s-093	2026-05-31 23:55	tag `eom-2026-05` (RETAIN 7 years)

Rule of thumb. Compliance pins go in tags with explicit RETAIN clauses. Never rely on table-level retention for "must-survive-for-N-years" snapshots — tags are the contract.

SQL interview question on tag-based audit reproducibility

A senior interviewer often asks: "How would you guarantee that any historical revenue number you filed with the regulator is byte-reproducible for 7 years, even though your team runs OPTIMIZE and expire_snapshots daily?" It probes tag retention, the interaction between tags and expiration, and the audit-trail story.

Solution Using compliance tags with explicit retention

-- 1) Tag the filed snapshot the moment the filing is finalised
ALTER TABLE finance.revenue CREATE TAG `filed-2026-Q1`
  AS OF VERSION (SELECT snapshot_id FROM finance.revenue.refs WHERE name='main')
  RETAIN 2557 DAYS;           -- 7 years SOX retention

-- 2) Record the tag plus the snapshot id in the filings audit ledger
INSERT INTO finance.filings_audit (filing_id, table_name, snapshot_id, tag_name, filed_at)
SELECT
    'Q1-2026-revenue',
    'finance.revenue',
    r.snapshot_id,
    r.name,
    CURRENT_TIMESTAMP
FROM finance.revenue.refs r
WHERE name = 'filed-2026-Q1';

-- 3) Years later: reproduce the filed number
SELECT
    region,
    SUM(amount) AS filed_revenue
FROM finance.revenue VERSION AS OF 'filed-2026-Q1'
WHERE filing_period = 'Q1-2026'
GROUP BY region;

Step-by-step trace.

step	action	catalog state
1	createTag filed-2026-Q1 RETAIN 7y	new ref protecting s-093
2	insert into audit ledger	external row capturing tag
3 (1 year later)	expire_snapshots(older_than=300d, retain_last=5)	s-093 protected; other s ≤ 300d gone
4 (7 years later)	SELECT VERSION AS OF 'filed-2026-Q1'	snapshot s-093 still queryable
5 (7y + 1d later)	tag self-expires per RETAIN clause	snapshot s-093 eligible to expire

The trace shows the contract: the tag's RETAIN 7 years clause makes s-093 immune to expire_snapshots for the full 7 years, after which the tag itself expires and the snapshot becomes a normal candidate for the table's default retention.

Output:

audit query	result	reproducible?
Q1-2026 revenue at filing time	4_205_300	yes
Q1-2026 revenue 1 year later	4_205_300	yes (tag intact)
Q1-2026 revenue 7 years later	4_205_300	yes (tag still intact)

Why this works — concept by concept:

Tag as an immutable pin — the tag binds s-093 to a stable name; the tag's RETAIN clause overrides table-level retention.
Audit ledger as external evidence — recording the tag name + snapshot id in an external ledger means even if the Iceberg metadata is somehow lost, the audit trail names the exact snapshot.
expire_snapshots respects refs — the expiration job walks every ref (branches + tags + main) and only reclaims snapshots not reachable from any ref. Tags guarantee reachability.
VERSION AS OF resolves the tag at read time — the engine reads the refs sub-table, finds the snapshot id, and walks the manifest tree. No filename lookups, no path hacks.
Multi-engine reproducibility — because the data files are immutable Parquet referenced by the manifest tree, Spark, Trino, Snowflake, and Flink all return identical results when reading the same tag.
Cost — one ref row plus the protected snapshot's files. The 7-year retention is a storage decision, not a compute decision; the read cost is identical to any other snapshot read.

SQL
Topic — design
Design problems

Practice →

4. WAP pattern — write to audit branch, validate, fast-forward to main

`iceberg wap` is the production-grade `write audit publish` pattern — branch isolation makes the staging table obsolete

The mental model in one line: WAP writes the new snapshot to an isolated audit branch, runs validation against that branch, and on pass fastForwards main to the validated snapshot — making "no consumer ever sees half-published data" a structural guarantee rather than a hope. Once you say "the audit branch IS the staging table — same files, isolated ref," WAP becomes a 5-line ritual instead of a 100-line copy-then-swap.

The WAP contract in five bullets.

Write. The transform writes a new snapshot to an audit branch (e.g. audit-<runid>), not to main. Main remains at the previous good state.
Audit. Validation checks run against the audit branch — row count vs expected, null rates, freshness, business-rule predicates. The audit reads the exact bytes that will publish.
Publish. If audit passes, fastForward('main', 'audit-<runid>') is the single atomic operation that makes the new snapshot live. Readers transition without seeing a partial state.
Drop on failure. If audit fails, the audit branch is dropped. Main never moves; the bad data files are reclaimed by the next expire_snapshots.
Multi-engine reader isolation. During the audit window, Spark / Trino / Snowflake readers on main see the prior snapshot; only the audit job (which explicitly reads the audit-<runid> branch) sees the new state.

Pre-branches WAP (the legacy pattern).

spark.wap.id — a Spark session property. When set, every commit attaches the WAP id as a staged snapshot rather than landing as the new current snapshot.
Publish — a separate procedure call (publish_changes) atomically promotes the staged snapshot to current.
Drawback. Only Spark wrote staged snapshots; Trino / Flink could not participate. Readers on other engines either saw the staged snapshot (depending on engine version) or were blocked from reading altogether.

Modern branch-based WAP.

Engine-agnostic. Any engine that supports branches (Spark, Trino, Flink, Snowflake, Polars + Iceberg) can write to and read from the audit branch.
Composable. The audit branch is a real branch — you can branch-from-branch, cherry-pick, or run multiple audits in parallel.
No special session property. The branch ref is the entire mechanism; no spark.wap.id needed.
The 2026 rule. All new WAP implementations should use branches. The spark.wap.id property is legacy.

Audit checks that belong in a WAP gate.

Row count vs expected — (new_rows / prior_rows) within [0.8, 1.2] for daily appends; tighter for slowly-changing dimensions.
Null rates per critical column — (null_count / row_count) below a threshold (e.g. < 1% for customer_id, < 5% for email).
Freshness — MAX(event_time) within the SLA window (typically >= now() - 2h for hourly tables).
Business rules — SUM(amount) per partition above a floor (catches "the integration silently zeroed out a column").
Schema — match the contract: every required column present, types unchanged. Use a JSON Schema or dbt contract for portability.
Referential integrity — every customer_id in the new snapshot exists in dim_customer. Detects orphans before they hit BI.

Tools that play well with branch-WAP.

dbt — --target flag plus pre_hook / post_hook to wire branch creation, model build, and fast-forward. dbt-Iceberg adapter handles the branch syntax.
Great Expectations — runs as the audit gate; pointed at the audit branch via VERSION AS OF.
Soda Core — declarative checks; supports Iceberg via Spark/Trino connections.
Airflow / Dagster — orchestrate the create-branch → run-transform → audit → publish-or-drop sequence as discrete tasks. The publish task is a 1-line fast_forward call.

Common interview probes on WAP.

"What is WAP and why does it matter?" — write-audit-publish, the contract that no consumer ever sees a partial or invalid commit. Matters because object-storage tables would otherwise expose half-written state.
"How does branch-based WAP differ from spark.wap.id?" — branches are engine-agnostic, composable, and do not require a session property. spark.wap.id was Spark-only.
"What happens to the bad data on a WAP failure?" — the audit branch is dropped; the files are reclaimed by the next expire_snapshots. Main never moves.
"Can multiple WAP runs proceed in parallel?" — yes, one branch per run id. They publish in their respective fast-forward order. Conflicts between runs are detected by the catalog's last-writer-wins retry.

Worked example — Spark write to the audit branch

Detailed explanation. A daily ETL appends new orders to db.fact_orders. Under branch-WAP, the writer targets the audit-20260612 branch instead of main; the audit gate then reads from that branch.

Question. Show the Spark / SQL code that creates an audit branch from the current main head, appends today's orders to the branch, and registers the branch in a run-id ledger for later audit.

Input.

run_id	source rows	target table	branch ref
20260612	124_000	db.fact_orders	audit-20260612

Code.

-- 1) Create the audit branch from current main
ALTER TABLE db.fact_orders CREATE BRANCH `audit-20260612`
  AS OF VERSION (SELECT snapshot_id FROM db.fact_orders.refs WHERE name='main');

-- 2) Append today's orders to the branch
INSERT INTO db.fact_orders.`branch_audit-20260612`
SELECT
    order_id, customer_id, order_date, amount, region
FROM staging.orders_20260612;

-- 3) Register the audit run in the lineage ledger
INSERT INTO etl.runs (run_id, table_name, branch_name, started_at)
VALUES ('20260612', 'db.fact_orders', 'audit-20260612', CURRENT_TIMESTAMP);

Step-by-step explanation.

CREATE BRANCH audit-20260612 writes one ref row pointing at main's current snapshot. No data is copied.
INSERT INTO ...branch_audit-20260612 appends to the branch. The new files land in S3 under the table's data prefix and are referenced by the branch's manifest tree. Main is untouched.
The ledger row captures the run id and the branch name so downstream auditors and the publish task can find the right branch by run id rather than guessing the name.
At this point, production readers on main see the previous day's data; only readers that explicitly target branch_audit-20260612 see today's appends.

Output.

step	main snapshot	audit-20260612 snapshot	rows visible on main
pre	s-200	(not created)	12_000_000
1	s-200	s-200	12_000_000
2	s-200	s-a1	12_000_000
3	s-200	s-a1	12_000_000 (ledger row added)

Rule of thumb. Every WAP writer targets a branch named after the run id. The branch name is the link between the orchestrator's run id, the audit gate, the publish task, and the post-mortem audit trail. Never let the writer commit directly to main during WAP.

Worked example — dbt-style audit gate (row counts, null rate, freshness)

Detailed explanation. The audit gate reads the audit branch and verifies three invariants: row count is within tolerance, the critical column has a null rate below threshold, and freshness is within SLA. If any check fails, the gate returns FAIL and the publish task is skipped.

Question. Write the audit gate SQL for db.fact_orders on the audit-20260612 branch. Pass condition: rows in [80%, 120%] of yesterday's count, null_customer_id rate < 1%, MAX(order_date) = today.

Input.

metric	expected
yesterday row count	120_000
min today rows	96_000
max today rows	144_000
max null_customer_id rate	0.01
freshness	MAX(order_date) = today

Code.

WITH branch_state AS (
    SELECT snapshot_id FROM db.fact_orders.refs WHERE name = 'audit-20260612'
),
branch_metrics AS (
    SELECT
        COUNT(*)                                              AS row_count,
        SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END)  AS null_customer,
        MAX(order_date)                                       AS max_order_date
    FROM db.fact_orders VERSION AS OF (SELECT snapshot_id FROM branch_state)
    WHERE order_date = CURRENT_DATE
),
yesterday AS (
    SELECT COUNT(*) AS row_count
    FROM db.fact_orders
    WHERE order_date = CURRENT_DATE - INTERVAL '1' DAY
)
SELECT
    bm.row_count,
    bm.null_customer,
    bm.max_order_date,
    y.row_count AS y_row_count,
    CASE
        WHEN bm.row_count < 0.8 * y.row_count THEN 'FAIL — too few rows'
        WHEN bm.row_count > 1.2 * y.row_count THEN 'FAIL — too many rows'
        WHEN bm.null_customer > 0.01 * bm.row_count THEN 'FAIL — null_customer_id rate too high'
        WHEN bm.max_order_date <> CURRENT_DATE THEN 'FAIL — stale data'
        ELSE 'PASS'
    END AS gate_decision
FROM branch_metrics bm
CROSS JOIN yesterday y;

Step-by-step explanation.

branch_state resolves the run-id branch to its current snapshot id by reading the refs sub-table.
branch_metrics reads the audit branch at that snapshot via VERSION AS OF — exactly the bytes that will publish if the gate passes.
yesterday reads main (the unqualified table reference) for the comparison baseline. Note: it reads main, not the audit branch — that is the whole point of the isolation.
The final SELECT computes the three pass conditions; the first failing check wins. The result is a single-row decision that the orchestrator inspects.
On PASS, the orchestrator calls fast_forward. On FAIL, it calls DROP BRANCH and pages the on-call.

Output (pass case).

row_count	null_customer	max_order_date	y_row_count	gate_decision
124_000	80	2026-06-12	120_000	PASS

Output (fail case — too few rows).

row_count	null_customer	max_order_date	y_row_count	gate_decision
60_000	40	2026-06-12	120_000	FAIL — too few rows

Rule of thumb. The gate must read the audit branch (via VERSION AS OF), not main. If you accidentally read main, you are checking the previous day's data and the gate will pass on every run — a silent data outage.

Worked example — fast-forward publish vs rollback on failure

Detailed explanation. The publish step is a single API call. Branch-based WAP turns "did the publish succeed?" into "did the catalog write succeed?" — a binary, retriable signal.

Question. Show the publish-or-rollback orchestrator logic for db.fact_orders after the audit gate returns PASS or FAIL.

Input.

audit result	action
PASS	fast_forward main → audit-20260612
FAIL	drop branch, alert

Code.

# Pseudocode for the orchestrator (Airflow / Dagster task)
def publish_or_rollback(table, branch, audit_result):
    if audit_result == "PASS":
        spark.sql(f"CALL system.fast_forward('{table}', 'main', '{branch}')")
        spark.sql(f"ALTER TABLE {table} DROP BRANCH `{branch}`")
        return "published"
    else:
        spark.sql(f"ALTER TABLE {table} DROP BRANCH `{branch}`")
        page_oncall(
            severity="P2",
            message=f"WAP gate failed on {table} branch={branch}: {audit_result}",
        )
        return "rolled_back"

Step-by-step explanation.

On PASS, fast_forward flips main's pointer to the audit branch's head in one atomic catalog write. Every reader transitions to the new state.
After fast-forward, the audit branch ref still points at the (now-shared) head snapshot. Dropping it cleans up the ref; the data files are already part of main's lineage.
On FAIL, dropping the branch removes the only ref pointing to the bad snapshot. The files become unreachable and will be reclaimed by the next expire_snapshots job.
The page-on-call call is what makes WAP a real production pattern — silent failures still need humans.

Output.

outcome	main pointer	branch ref	data files
PASS	moved to s-a1	dropped	bad files become main files
FAIL	unchanged at s-200	dropped	bad files orphaned, reclaimed later

Rule of thumb. Treat the publish step as an idempotent operation — if the orchestrator crashes mid-step, retry the fast-forward. The catalog's atomic semantics make the retry safe even if the fast-forward partially succeeded.

Worked example — multi-engine reader isolation during the audit window

Detailed explanation. During the audit window, Spark / Trino / Snowflake readers all see the prior good main snapshot. Only the audit gate (which explicitly targets the branch) sees the in-flight data.

Question. Show what each of three readers sees during the audit window for db.fact_orders: a Trino dashboard query, a Snowflake batch job, and the Spark audit gate.

Input.

engine	query target	expects to see
Trino	main (default)	prior good snapshot
Snowflake	main (default)	prior good snapshot
Spark audit	audit-20260612 branch	in-flight snapshot

Code.

-- Trino — production dashboard (main, prior good)
SELECT region, SUM(amount) AS revenue
FROM iceberg.db.fact_orders
WHERE order_date = CURRENT_DATE - INTERVAL '1' DAY
GROUP BY region;

-- Snowflake — overnight finance batch (main, prior good)
SELECT region, SUM(amount) AS revenue
FROM db.fact_orders
WHERE order_date = CURRENT_DATE - INTERVAL '1 DAY'
GROUP BY region;

-- Spark — audit gate reads the audit branch only
SELECT region, SUM(amount) AS revenue
FROM db.fact_orders VERSION AS OF (
    SELECT snapshot_id FROM db.fact_orders.refs WHERE name='audit-20260612'
)
WHERE order_date = CURRENT_DATE
GROUP BY region;

Step-by-step explanation.

Trino's planner resolves iceberg.db.fact_orders against its catalog, finds main's snapshot pointer at s-200, and reads that snapshot. The in-flight audit branch is invisible.
Snowflake's planner does the same against its catalog binding. The audit branch is also invisible to Snowflake.
Spark's audit gate explicitly resolves the audit branch from refs and pins the snapshot for the gate query. It sees the in-flight bytes — exactly what will publish.
The isolation is provided by the catalog's ref model. No engine sees the audit branch unless it explicitly asks for it.

Output.

engine	snapshot read	data seen
Trino dashboard	s-200 (main)	prior good
Snowflake batch	s-200 (main)	prior good
Spark audit gate	s-a1 (audit)	in-flight

Rule of thumb. Branch-based WAP provides multi-engine reader isolation out of the box. You do not need to coordinate engines, freeze readers, or run a maintenance window — the catalog ref model does it for you.

SQL interview question on designing WAP for a 100TB fact table

A senior interviewer might say: "Design a WAP pattern for a 100TB fact table written every 15 minutes by 8 parallel Spark jobs, read by Trino dashboards every second, and audited by a Great Expectations suite. What does the catalog look like, and what is your failure-mode story?" It probes orchestrator design, branch naming, retention, and the multi-engine isolation guarantee.

Solution Using one audit branch per run-id with fast-forward publish

-- The contract:
--   1. Writer task creates branch audit-<run_id> from current main
--   2. Writer commits to the branch (parallel writers retry on last-writer-wins)
--   3. Audit task runs validations against the branch
--   4. Publish task fast-forwards main on PASS, drops branch on FAIL
--   5. Daily maintenance expires orphaned files from FAIL runs

-- 1) Writer (one of N parallel jobs)
ALTER TABLE db.fact_orders CREATE BRANCH IF NOT EXISTS `audit-${run_id}`
  AS OF VERSION (SELECT snapshot_id FROM db.fact_orders.refs WHERE name='main');

INSERT INTO db.fact_orders.`branch_audit-${run_id}`
SELECT * FROM staging.orders_partition_${shard_id};

-- 2) Audit gate (Great Expectations / Soda / dbt-test pointed at the branch)
-- (returns PASS / FAIL)

-- 3) Publish (orchestrator)
CALL system.fast_forward('db.fact_orders', 'main', 'audit-${run_id}');
ALTER TABLE db.fact_orders DROP BRANCH `audit-${run_id}`;

-- 4) Daily maintenance
CALL system.expire_snapshots(
    table       => 'db.fact_orders',
    older_than  => CURRENT_TIMESTAMP - INTERVAL '7' DAY,
    retain_last => 50
);

CALL system.remove_orphan_files(
    table       => 'db.fact_orders',
    older_than  => CURRENT_TIMESTAMP - INTERVAL '3' DAY
);

Step-by-step trace.

time	event	main pointer	audit branch	gate
t0	run starts	s-200	(create at s-200)	pending
t0 + 5m	8 parallel writers commit to branch	s-200	s-a8	pending
t0 + 8m	audit reads branch	s-200	s-a8	running
t0 + 10m	gate PASS	s-200	s-a8	PASS
t0 + 10m + 1ms	fast_forward	s-a8	s-a8	published
t0 + 10m + 2ms	drop branch	s-a8	(dropped)	done
next day 02:00	maintenance	s-a14 (more runs)	--	--

The trace shows the central guarantee: during the entire t0 → t0+10m window, the 1000 Trino dashboard readers per second see s-200 — completely consistent prior state. At t0+10m+1ms, every new reader sees s-a8. There is no transition state.

Output:

metric	value
writes per day	96
audit-branch refs created per day	96
audit-branch refs alive at end of day	0 (all dropped)
storage overhead from WAP	< 1% (only failed-audit files)
readers that saw partial state	0
catalog write per publish	1 (the fast_forward)

Why this works — concept by concept:

Branch per run-id — names the audit window uniquely and prevents collisions between parallel runs. The orchestrator's run id is the universal handle.
Multiple writers on one branch — the catalog's last-writer-wins retry handles parallel writers natively. Each writer commits its shard; the final branch snapshot is the union.
Audit reads the branch — exactly the bytes that will publish. No staging-vs-prod skew.
fast_forward is one catalog write — atomic publish. Every reader transitions at the same instant.
Drop branch on success — keeps the catalog clean; data files survive under main's lineage.
expire_snapshots reclaims orphans — failed-audit branches' files become unreachable and are reclaimed by the daily maintenance.
Cost — proportional to data (writes), not table size. The branch-create / drop / fast-forward are O(1) catalog writes. Maintenance is O(snapshots + files) and runs off-peak.

SQL
Topic — streaming
Streaming ingestion patterns

Practice →

5. Production ops — snapshot expiration, OPTIMIZE, schema evolution

Iceberg health is a maintenance cadence — `expire_snapshots`, `rewrite_data_files`, `rewrite_manifests`, and `remove_orphan_files` are the four sectors of the production wheel

The mental model in one line: a healthy Iceberg table needs four periodic ops — expire snapshots (control history), OPTIMIZE / rewrite_data_files (bin-pack small files), rewrite_manifests (keep planning fast), and remove_orphan_files (reclaim storage from failed writes) — together they keep planning latency low and S3 cost bounded. Once you say "Iceberg is a self-managing format only if you actually run the management," the production cadence becomes a daily, weekly, monthly checklist.

The four maintenance sectors.

expire_snapshots — removes old snapshot manifests beyond the retention window, freeing the data files they uniquely referenced. Tags and branches protect their snapshots. Daily cadence on hot tables, weekly on slow ones.
rewrite_data_files (OPTIMIZE) — compacts small data files into bin-packed target-sized files. Cuts planning time and read latency. Weekly cadence on append-heavy tables; on-demand after large overwrite jobs.
rewrite_manifests — re-buckets the manifest list to keep manifests balanced by partition and size. Critical for tables with many commits per day. Monthly cadence usually suffices.
remove_orphan_files — finds files in the table's S3 prefix not referenced by any snapshot (results of failed writes, crashed jobs, dropped branches) and deletes them. Weekly cadence; run only after expire_snapshots has settled.

Schema evolution — the safe parts and the unsafe parts.

Add column. Always safe — new column appended, existing rows have NULL for the new column. No file rewrite.
Drop column. Safe — column is removed from the schema; underlying files keep the column data but readers stop seeing it. No file rewrite.
Rename column. Safe — Iceberg uses field-ids, not column names, so a rename is a metadata-only operation. No file rewrite. (Compare to Hive, where a rename was a full rewrite.)
Reorder columns. Safe — the schema is field-id-keyed, so column order is a presentation choice.
Widen type. Safe in many cases — INT → BIGINT, FLOAT → DOUBLE, DECIMAL(p, s) → DECIMAL(p', s) with p' > p and same s.
Narrow type. Unsafe — would lose precision or range. Iceberg refuses.
Change type incompatibly (STRING → INT). Unsafe — requires a backfill via a new column + cutover.
Promote nullable → required. Unsafe — would require validating every existing row. Iceberg refuses unless the table is empty.

Partition evolution.

Iceberg supports changing the partition spec without rewriting old data — a textbook feature no other format offers.
Example. A table starts partitioned by month(event_time); after a year, traffic grows and you want daily partitions. ALTER TABLE ... ADD PARTITION FIELD day(event_time) switches the spec. New data is partitioned daily; old data stays in monthly partitions.
Readers handle it transparently — the manifest tree records which partition spec each file was written under; the planner prunes per-spec.
Caveat. Partition evolution is most useful as a future-proofing escape hatch, not a routine knob. Switching specs every quarter is a sign the original spec was wrong; pick a defensive spec at design time.

Catalog choice — the production decision that shapes everything.

REST catalog (e.g. Tabular, Polaris, Lakekeeper) — engine-agnostic, vendor-neutral, audited via HTTP. The right default for a multi-engine shop.
AWS Glue — managed, low-friction in AWS, but locks you to AWS for catalog operations. Cheap and easy at small-to-medium scale.
Nessie — Git-style catalog with branching at the catalog level (not just the table level). Powerful for multi-table transactions; richer than the per-table branches.
Snowflake-managed — Snowflake owns the catalog; other engines read through Snowflake's Iceberg APIs. Great if Snowflake is your primary engine.
Unity Catalog — Databricks' catalog; Iceberg compatibility is improving. Right answer for Databricks-heavy shops moving toward open formats.
Polaris — Snowflake's open-source REST catalog implementation. Multi-engine friendly with first-class tag/branch support.

Multi-engine reader matrix (2026).

Spark — read + write + admin (full).
Trino / Presto — read + write + admin (full).
Flink — read + write (full streaming + batch).
Snowflake — read + write (Iceberg-managed or external).
DuckDB — read (via the iceberg extension); write support landing in 2026.
ClickHouse — read (via the Iceberg engine); write support nascent.
StarRocks — read (full); write support landing.
Athena (AWS) — read + write.
BigQuery — read (via BigLake / Iceberg external tables); write maturing.

Common interview probes on production ops.

"How often should I run expire_snapshots?" — daily on tables with frequent commits; weekly is fine on slow tables. Always after a large OPTIMIZE.
"What is the difference between rewrite_data_files and rewrite_manifests?" — the first compacts data files; the second compacts metadata. Both reduce planning time, but for different reasons.
"How do you know when OPTIMIZE is needed?" — average data-file size below ~128MB on a partition, or db.<table>.files showing > 1000 files per active partition. Both are signs of read-time scan amplification.
"Can I rename a column on a 100TB Iceberg table without rewriting data?" — yes. Rename is a metadata-only operation because Iceberg uses field-ids.

Worked example — daily snapshot expiration job protecting tags

Detailed explanation. A daily Airflow task expires snapshots older than 7 days while protecting every tagged snapshot. Tags marked for SOX (e.g. eom-*, filed-*) survive indefinitely until their tag is explicitly dropped.

Question. Write the expire_snapshots call for db.fact_orders with a 7-day retention, keeping the last 30 snapshots minimum, and verifying that tagged snapshots survive.

Input.

config	value
older_than	7 days ago
retain_last	30
protected tags	eom-2026-05, filed-2026-Q1

Code.

-- 1) Inspect refs to confirm tag protection
SELECT name, snapshot_id, type
FROM db.fact_orders.refs
WHERE type = 'TAG';

-- 2) Run the expire job
CALL system.expire_snapshots(
    table          => 'db.fact_orders',
    older_than     => CURRENT_TIMESTAMP - INTERVAL '7' DAY,
    retain_last    => 30
);

-- 3) Verify tagged snapshots still exist
SELECT
    r.name             AS tag_name,
    r.snapshot_id,
    s.committed_at,
    CASE WHEN s.snapshot_id IS NULL THEN 'MISSING' ELSE 'PRESENT' END AS status
FROM db.fact_orders.refs r
LEFT JOIN db.fact_orders.snapshots s ON r.snapshot_id = s.snapshot_id
WHERE r.type = 'TAG';

Step-by-step explanation.

The refs pre-check captures the tags that must survive — useful as a baseline for the post-check.
expire_snapshots computes the set of reachable snapshots: every snapshot referenced (directly or via the parent chain) by any ref (main, branches, tags). Snapshots not reachable AND older than 7 days AND not in the last 30 are deleted.
The post-check confirms that every tag still resolves to a present snapshot. If any tag shows MISSING, an alert fires (this should never happen in correctly configured tables).
The data files freed by expired snapshots are not deleted immediately — they are simply unreferenced. The next remove_orphan_files call reclaims them.

Output.

tag_name	snapshot_id	committed_at	status
eom-2026-05	93	2026-05-31 23:55	PRESENT
filed-2026-Q1	87	2026-04-15 09:00	PRESENT

Rule of thumb. Run expire_snapshots daily on hot tables. Tags are the only contract that guarantees a snapshot will survive — table-level retention can change, but tags are explicit.

Worked example — OPTIMIZE (rewrite_data_files) with bin-pack target

Detailed explanation. A table with hundreds of thousands of small Parquet files per partition plans queries slowly because the planner must open every file's footer. rewrite_data_files reads small files in a partition and writes them out as fewer, larger bin-packed files (typically 512MB target). Read latency drops by 5-10x on heavy partitions.

Question. Run rewrite_data_files on the EU partition of db.fact_orders, targeting 512MB output files, and verify that the data-file count dropped while the row count stayed identical.

Input.

partition	files before	avg size before	rows before
EU	1240	12 MB	4_200_000

Code.

-- 1) Inspect data-file distribution for the EU partition
SELECT
    file_path,
    file_size_in_bytes,
    record_count
FROM db.fact_orders.files
WHERE partition.region = 'EU'
ORDER BY file_size_in_bytes ASC
LIMIT 5;

-- 2) Run bin-pack rewrite on the EU partition
CALL system.rewrite_data_files(
    table   => 'db.fact_orders',
    options => map(
        'target-file-size-bytes', '536870912',     -- 512 MB
        'min-input-files',        '5',
        'max-concurrent-file-group-rewrites', '8'
    ),
    where   => "region = 'EU'"
);

-- 3) Verify file count dropped and row count is identical
SELECT
    COUNT(*)                          AS file_count,
    SUM(file_size_in_bytes) / 1e9     AS total_gb,
    SUM(record_count)                 AS row_count
FROM db.fact_orders.files
WHERE partition.region = 'EU';

Step-by-step explanation.

The pre-check confirms the symptom: ~1240 files averaging 12MB — far below the 128MB-512MB sweet spot most engines prefer.
rewrite_data_files reads the small files in where-matched partitions and writes new files of target-file-size-bytes. The procedure commits a new snapshot whose added_data_files are the new large files and deleted_data_files are the old small ones. Row counts are preserved by construction.
The post-check shows the new file count (~12 files of ~500MB each), same total bytes (a tiny overhead from format headers), same row count.
Planning time on the next query against the EU partition drops by an order of magnitude because the planner now opens 12 footers instead of 1240.

Output.

metric	before	after
file_count	1240	12
total_gb	14.5	14.5
row_count	4_200_000	4_200_000
planning latency (typical)	6.4 s	0.5 s

Rule of thumb. Run weekly rewrite_data_files on append-heavy tables. Target 256-512MB per file. Skip partitions whose average file size is already in range — the procedure's min-input-files option lets you do this automatically.

Worked example — schema evolution: add column, drop column, rename

Detailed explanation. Iceberg's field-id-keyed schema makes column ops metadata-only. Adding, dropping, or renaming a column does not rewrite any data files — it updates the schema metadata and changes how the readers map files to columns.

Question. On db.fact_orders, add a currency column (default NULL), rename amount to amount_local, drop the obsolete legacy_flag column. Verify zero file rewrites.

Input.

operation	column
add	currency STRING
rename	amount → amount_local
drop	legacy_flag

Code.

-- 1) Inspect current schema
DESCRIBE TABLE db.fact_orders;

-- 2) Schema evolution — all three operations are metadata-only
ALTER TABLE db.fact_orders ADD COLUMN currency STRING;
ALTER TABLE db.fact_orders RENAME COLUMN amount TO amount_local;
ALTER TABLE db.fact_orders DROP COLUMN legacy_flag;

-- 3) Verify no data files were rewritten by the schema change
SELECT
    snapshot_id, operation,
    summary['added-data-files']    AS added,
    summary['deleted-data-files']  AS deleted
FROM db.fact_orders.snapshots
ORDER BY committed_at DESC
LIMIT 4;

Step-by-step explanation.

The schema pre-check captures the current columns. amount, legacy_flag, etc.
Each ALTER TABLE writes a new snapshot whose operation is replace (schema update) and whose summary shows added_data_files = 0, deleted_data_files = 0 — the data is untouched.
The RENAME works because Iceberg stores column field-ids in the manifest tree; the rename changes the schema's name → field-id mapping without touching the files.
The DROP COLUMN removes the column from the schema; existing files still contain the data, but readers no longer expose it. The disk usage is unchanged until a future rewrite_data_files writes new files without the dropped column.
The post-check confirms three new snapshots, each with zero data-file delta. Schema evolution at scale, for free.

Output.

snapshot_id	operation	added
s-201	replace (add currency)	0
s-202	replace (rename amount)	0
s-203	replace (drop legacy_flag)	0
s-200	append (prior)	12

Rule of thumb. Schema evolution on Iceberg is a metadata operation, not a rewrite. The cost is one catalog write per change. Compared to Hive, where a column rename was a full table copy, this is the single most-quoted operational win of the format.

Worked example — multi-engine read of the same Iceberg table

Detailed explanation. A typical 2026 stack has Spark for batch transforms, Trino for ad-hoc analytics, Snowflake for finance reporting, and Flink for streaming. All four read the same Iceberg table through the catalog; none of them needs format-specific glue.

Question. Show the canonical SELECT against db.fact_orders in Spark / Trino / Snowflake / Flink. Confirm that the result is identical.

Input.

engine	catalog binding
Spark	spark_catalog → REST
Trino	iceberg.db.fact_orders → REST
Snowflake	db.fact_orders (Iceberg external)
Flink	iceberg_catalog.db.fact_orders → REST

Code.

-- Spark
SELECT region, SUM(amount_local) AS revenue
FROM db.fact_orders
WHERE order_date = DATE '2026-06-12'
GROUP BY region;

-- Trino
SELECT region, SUM(amount_local) AS revenue
FROM iceberg.db.fact_orders
WHERE order_date = DATE '2026-06-12'
GROUP BY region;

-- Snowflake (Iceberg external table)
SELECT region, SUM(amount_local) AS revenue
FROM db.fact_orders
WHERE order_date = '2026-06-12'
GROUP BY region;

-- Flink SQL
SELECT region, SUM(amount_local) AS revenue
FROM iceberg_catalog.db.fact_orders
WHERE order_date = DATE '2026-06-12'
GROUP BY region;

Step-by-step explanation.

Each engine resolves the table identifier against its catalog binding. All four catalogs point to the same Iceberg metadata location.
Each engine's planner reads the current snapshot's manifest tree from the same S3 prefix and prunes files using column-level stats.
Each engine executes the same SQL semantics over the same data files. The result is byte-identical because the data is the same Parquet.
The only differences are engine-specific syntax (e.g. Trino's three-part identifier vs Spark's two-part) and connector configuration.

Output (identical across engines).

region	revenue
EU	1_310_200
US	1_045_900
APAC	444_100

Rule of thumb. When designing a new lakehouse, list every engine that needs to read it; pick a catalog (REST, Polaris, Nessie, Glue) that all of them support; let Iceberg's format do the rest. The "one format, every engine" promise is only as strong as the catalog you pick.

SQL interview question on a production Iceberg maintenance cadence

A senior interviewer often opens with: "Design the weekly maintenance schedule for a 200TB Iceberg fact table with 96 commits per day and 12 active downstream readers. What runs daily, what runs weekly, what runs monthly, and how do you protect compliance tags?" It probes operational hygiene, retention design, and the interaction between expiration and tags.

Solution Using a tag-protected daily / weekly / monthly cadence

-- DAILY — runs at 02:00 in low-traffic window
CALL system.expire_snapshots(
    table       => 'db.fact_orders',
    older_than  => CURRENT_TIMESTAMP - INTERVAL '7' DAY,
    retain_last => 50
);

-- WEEKLY — runs Sunday 02:30
CALL system.rewrite_data_files(
    table   => 'db.fact_orders',
    options => map(
        'target-file-size-bytes', '536870912',
        'min-input-files',        '5'
    )
);

CALL system.remove_orphan_files(
    table       => 'db.fact_orders',
    older_than  => CURRENT_TIMESTAMP - INTERVAL '3' DAY
);

-- MONTHLY — runs first Sunday at 03:00
CALL system.rewrite_manifests(
    table => 'db.fact_orders'
);

-- ONGOING — tagging step in the publish task (already covered in WAP)
ALTER TABLE db.fact_orders CREATE TAG `eom-${YYYY-MM}`
  AS OF VERSION (SELECT snapshot_id FROM db.fact_orders.refs WHERE name='main')
  RETAIN 2557 DAYS;

Step-by-step trace.

cadence	op	input	output
daily	expire_snapshots(7d, retain 50)	~700 snapshots, 50 tags	snapshots > 7d AND not in last 50 AND not tagged → eligible to expire
weekly	rewrite_data_files	partitions with avg file < 128MB	bin-packed to 512MB
weekly	remove_orphan_files	S3 files not in any snapshot, > 3d old	deleted
monthly	rewrite_manifests	imbalanced manifests	rebalanced
event	createTag eom-YYYY-MM	last good monthly snapshot	tag protects for 7 years

The trace shows how the four operations layer: daily expiration controls snapshot count; weekly OPTIMIZE controls file count; weekly orphan cleanup controls S3 cost; monthly rewrite_manifests controls planning time; per-month tagging controls compliance pinning.

Output:

metric	before	after
avg planning latency	4.2 s	0.6 s
avg files per partition	800	28
S3 storage (table)	240 TB	200 TB
snapshot count	~1200	~50 + tags
compliance tags preserved	yes	yes

Why this works — concept by concept:

expire_snapshots respects refs — tags and branches protect their snapshots. The retention window controls only un-referenced snapshots.
rewrite_data_files bin-packs — compacts small files into 256-512MB targets; reduces planning latency by an order of magnitude.
remove_orphan_files reclaims S3 — files left by failed writes or dropped branches; runs after expire_snapshots to avoid races.
rewrite_manifests rebalances metadata — keeps the manifest tree shallow and partition-aligned for fast planning at large commit counts.
Per-month tagging — the audit trail and compliance pin in one tag; the RETAIN 7 YEARS clause overrides the daily 7-day expiration for tagged snapshots.
Cost — daily expire is O(snapshots); weekly OPTIMIZE is O(small files); orphan cleanup is O(S3 listing); rewrite_manifests is O(manifest size). All four are bounded by table change rate, not table size — so the cadence scales sublinearly.

SQL
Topic — optimization
Optimization problems

Practice →

Cheat sheet — Iceberg branching, tagging, and WAP recipes

Inspect commit history. SELECT * FROM <table>.snapshots ORDER BY committed_at DESC LIMIT N — the audit log every other format had to bolt on.
List every named ref. SELECT name, snapshot_id, type FROM <table>.refs — branches plus tags plus main, in one query.
Time travel to a timestamp. SELECT ... FROM <table> TIMESTAMP AS OF '2026-06-12 06:00:00' — pin the planner to the snapshot that was current then.
Time travel to a snapshot id. SELECT ... FROM <table> VERSION AS OF 103 — unambiguous, audit-friendly.
Time travel to a tag. SELECT ... FROM <table> VERSION AS OF 'gold-v1' — engine resolves the tag from the refs table.
Atomic rollback. CALL system.rollback_to_snapshot('<table>', <prior_id>) — flip the main pointer back; bad snapshot stays on disk until expired.
Create a branch. ALTER TABLE <table> CREATE BRANCH '<name>' AS OF VERSION (SELECT snapshot_id FROM <table>.refs WHERE name='main').
Drop a branch. ALTER TABLE <table> DROP BRANCH '<name>' — ref removed; data files freed only if no other ref references them.
Fast-forward a branch into main. CALL system.fast_forward('<table>', 'main', '<branch>') — single atomic publish.
Cherry-pick a single snapshot. CALL system.cherrypick_snapshot('<table>', <source_snapshot_id>) — bring one fix from a branch into main.
Create a tag (with retention). ALTER TABLE <table> CREATE TAG '<name>' AS OF VERSION <id> RETAIN 2557 DAYS — 7-year SOX-style pin.
Drop a tag. ALTER TABLE <table> DROP TAG '<name>' — pinned snapshot becomes eligible to expire if no other ref protects it.
WAP write (branch-based). ALTER TABLE ... CREATE BRANCH 'audit-<runid>'; INSERT INTO ...branch_audit-<runid> SELECT ...; gate; fast_forward; drop branch.
Bin-pack OPTIMIZE. CALL system.rewrite_data_files(table => '<table>', options => map('target-file-size-bytes','536870912'), where => "<predicate>").
Daily snapshot expiration. CALL system.expire_snapshots(table => '<table>', older_than => CURRENT_TIMESTAMP - INTERVAL '7' DAY, retain_last => 50) — tags and branches protect their snapshots automatically.
Rewrite manifests (monthly). CALL system.rewrite_manifests(table => '<table>') — re-balance metadata to keep planning fast.
Reclaim orphan files. CALL system.remove_orphan_files(table => '<table>', older_than => CURRENT_TIMESTAMP - INTERVAL '3' DAY) — run after expire_snapshots.
Schema evolution (metadata only). ALTER TABLE ... ADD COLUMN ..., RENAME COLUMN ... TO ..., DROP COLUMN ... — all are zero data-file rewrites because Iceberg uses field-ids.
Partition evolution. ALTER TABLE ... ADD PARTITION FIELD day(event_time) — change spec without rewriting old data; readers handle both specs transparently.
Inspect file count per partition. SELECT partition, COUNT(*) FROM <table>.files GROUP BY partition ORDER BY COUNT(*) DESC — identify OPTIMIZE candidates.
Apache Iceberg vs Delta Lake quick rule. Multi-engine multi-vendor lakehouse: Iceberg. Databricks-first shop: Delta with UniForm. Migrating in 2 years: Iceberg.

Frequently asked questions

What is the difference between an Iceberg branch and a tag?

A branch is a movable named ref to a snapshot — writes targeting the branch advance its pointer to new snapshots. A tag is an immutable named ref — once created, its snapshot id does not change (you can only drop and recreate it). Branches are for in-progress isolated work (WAP, dev sandboxes, long-lived experiments); tags are for compliance pins, golden-dataset releases, and "last known good" markers. Both protect their snapshots from expire_snapshots, but tags carry an explicit RETAIN clause and are the right primitive for "must survive for 7 years" audit requirements.

Apache Iceberg vs Delta Lake — which should I pick for a new lakehouse?

For a multi-engine lakehouse with Snowflake, Databricks, Trino, Flink, and ad-hoc DuckDB readers, pick Iceberg — its catalog-pluggable design and first-class branch/tag APIs are universally supported. For a Databricks-first shop with limited multi-engine writers and a strong Spark / Photon investment, Delta with UniForm (which writes Iceberg-compatible metadata alongside Delta) is operationally simpler. For "I might switch engines in two years" or "I want to avoid vendor lock-in," Iceberg is the safer long-term bet. Both formats now ship snapshot atomicity, ACID, time travel, and schema evolution — the differences are catalog model, branching API, and the writer ecosystem outside Spark.

How does WAP differ from a staging table?

A staging table is a physical copy: you write to staging.fact_orders, validate, then INSERT INTO production.fact_orders SELECT * FROM staging.fact_orders — two copies of the data exist briefly and the publish is a multi-statement transaction (or worse, two separate transactions on object storage). Branch-based WAP writes the new snapshot to an isolated branch on the same table; the data files are shared; the publish is a single atomic catalog write that flips the main ref. WAP eliminates the copy, eliminates the two-step publish, and provides the same isolation guarantee with zero storage overhead beyond the new data itself.

Do Iceberg tags prevent snapshot expiration?

Yes. expire_snapshots walks every ref (main, branches, tags) and computes the set of reachable snapshots. Any snapshot reachable from any ref is kept regardless of the older_than clause. Tags can also carry an explicit RETAIN clause that self-expires the tag after a fixed duration (e.g. 7 years for SOX compliance), after which the protected snapshot becomes eligible for normal expiration. The combined contract — "tagged snapshots survive until the tag is dropped or self-expires" — is what makes tags the right primitive for compliance and audit reproducibility.

Can I read an Iceberg branch from Trino or Flink, not just Spark?

Yes. Branch-based WAP works because branches are catalog-level refs, not Spark-specific state. Trino reads a branch via SELECT ... FROM iceberg.db.table FOR VERSION AS OF '<branch>'; Flink via the OPTIONS('snapshot-id'='<resolved-from-branch>') hint; Snowflake via AT(BRANCH => '<branch>'). Every engine that has Iceberg catalog support resolves the branch's snapshot id through the refs metadata sub-table and reads it identically to any other snapshot. This multi-engine isolation is what made the legacy spark.wap.id property obsolete and why branch-based WAP is the 2026 default.

How do I roll back a bad production commit safely?

Two paths. First, if a tag protects the last-good snapshot, CALL system.rollback_to_snapshot('<table>', (SELECT snapshot_id FROM <table>.refs WHERE name='<tag>')) flips the main pointer back atomically. Every reader transitions immediately; the bad snapshot remains on disk for post-mortem until expired. Second, if no tag exists, query <table>.snapshots ORDER BY committed_at DESC LIMIT 5 to identify the prior good snapshot id, then call rollback_to_snapshot('<table>', <good_snapshot_id>). In both cases, the catalog's atomic pointer swap guarantees no reader sees a half-rolled-back state. The lesson: tag every production deploy beforehand — rollback then becomes a one-line lookup instead of a forensic exercise.

Practice on PipeCode

Drill the ETL practice library → for end-to-end pipeline design and the patterns Iceberg encodes (idempotent writes, atomic publish, schema evolution).
Rehearse database design problems → for the snapshot-isolation and ACID-on-object-storage reasoning interviewers want.
Sharpen optimization drills → when the interviewer wants planning-time and bin-pack reasoning.
Layer streaming problems → for CDC-into-Iceberg patterns and the WAP-on-streams design probe.
Stack joins practice → for the "ref-table vs fact-table" decisions every lakehouse forces.
Layer aggregation problems → for the read-side metrics that ride on top of Iceberg fact tables.
For the broader interview surface, read top data engineering interview questions →.
Stack the prerequisites with the only 5 skills you need to become a data engineer →.
Sharpen the system-design axis with the ETL system design course →.
For long-form schema craft, work through data modelling for DE interviews →.
For the engine internals that drive Iceberg writers and readers, study the Apache Spark internals course →.

Pipecode.ai is Leetcode for Data Engineering — every Iceberg pattern above ships with hands-on practice rooms where you create the audit branch, run the validation gate, fast-forward main, and protect compliance tags against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your branch-based WAP actually behaves the same on Trino as it does on Spark — you can prove it on graded reps.

Practice ETL design now →
Database design drills →

1. Why open table formats — snapshot atomicity, ACID, time travel

The "files in S3" era broke at scale — the snapshot manifest is the primitive that fixed every invariant a SQL engineer expects

Worked example — list snapshots and inspect commit history

Worked example — time travel to query the table at a past timestamp

Worked example — atomic rollback to a previous snapshot

SQL interview question on Iceberg atomicity and time travel

Solution Using snapshot isolation and the manifest-pointer commit

2. Iceberg branches — createBranch / fastForward / cherry-pick, isolated dev

iceberg branching makes the table look like a Git repo — branches are named snapshot refs, not table copies

Worked example — create a dev branch, mutate, and fast-forward into main

Worked example — cherry-pick a single fix snapshot from a hotfix branch

Worked example — long-lived experiment branch with concurrent main writes

SQL interview question on sandboxing a backfill

Solution Using a backfill branch, validation gate, and fast-forward publish

3. Iceberg tags — immutable labels for compliance and audit

iceberg tagging pins a snapshot under a stable name — the audit-friendly equivalent of "the version regulators asked for"

Worked example — tag the last good snapshot before a known-bad release

Worked example — query "AS OF" a tag from Spark, Trino, and Snowflake

Worked example — protect a tag from snapshot expiration

SQL interview question on tag-based audit reproducibility

Solution Using compliance tags with explicit retention

4. WAP pattern — write to audit branch, validate, fast-forward to main

iceberg wap is the production-grade write audit publish pattern — branch isolation makes the staging table obsolete

Worked example — Spark write to the audit branch

Worked example — dbt-style audit gate (row counts, null rate, freshness)

Worked example — fast-forward publish vs rollback on failure

Worked example — multi-engine reader isolation during the audit window

SQL interview question on designing WAP for a 100TB fact table

Solution Using one audit branch per run-id with fast-forward publish

5. Production ops — snapshot expiration, OPTIMIZE, schema evolution

Iceberg health is a maintenance cadence — expire_snapshots, rewrite_data_files, rewrite_manifests, and remove_orphan_files are the four sectors of the production wheel

Worked example — daily snapshot expiration job protecting tags

Worked example — OPTIMIZE (rewrite_data_files) with bin-pack target

Worked example — schema evolution: add column, drop column, rename

Worked example — multi-engine read of the same Iceberg table

SQL interview question on a production Iceberg maintenance cadence

Solution Using a tag-protected daily / weekly / monthly cadence

Cheat sheet — Iceberg branching, tagging, and WAP recipes

Frequently asked questions

What is the difference between an Iceberg branch and a tag?

Apache Iceberg vs Delta Lake — which should I pick for a new lakehouse?

How does WAP differ from a staging table?

Do Iceberg tags prevent snapshot expiration?

Can I read an Iceberg branch from Trino or Flink, not just Spark?

How do I roll back a bad production commit safely?

Practice on PipeCode

`iceberg branching` makes the table look like a Git repo — branches are named snapshot refs, not table copies

Worked example — create a `dev` branch, mutate, and fast-forward into `main`

Solution Using a `backfill` branch, validation gate, and fast-forward publish

`iceberg tagging` pins a snapshot under a stable name — the audit-friendly equivalent of "the version regulators asked for"

`iceberg wap` is the production-grade `write audit publish` pattern — branch isolation makes the staging table obsolete

Iceberg health is a maintenance cadence — `expire_snapshots`, `rewrite_data_files`, `rewrite_manifests`, and `remove_orphan_files` are the four sectors of the production wheel