The iceberg rest catalog has quietly become the lingua franca of the open lakehouse — the thin OpenAPI surface that lets Spark, Trino, Flink, DuckDB, ClickHouse, Snowflake, and BigQuery all read and write the same physical Iceberg tables without a Hive metastore in sight. Two other names sit alongside it on every 2026 architecture review: Project Nessie, the git-style catalog that adds branches, tags, and atomic multi-table commits to the lakehouse; and Apache Polaris, Snowflake's open-source multi-tenant catalog donated to the ASF and now the default federation surface for cross-engine Iceberg deployments.
This guide is the senior-engineer reference for picking one. It walks through the lakehouse catalog contract (federation, governance, atomic CAS), unpacks the Iceberg REST spec endpoint by endpoint, contrasts project nessie branches/tags with vanilla snapshot-only versioning, lays out apache polaris principals / roles / grants for multi tenant catalog isolation, and closes with a four-axis decision matrix for engine fit, multi-tenancy, data versioning catalog semantics, and governance. Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.
When you want hands-on reps immediately after reading, drill the dimensional modelling practice library →, rehearse on ETL design problems →, and stack the platform muscles with system design drills →.
On this page
- Why a catalog — federation, governance, multi-engine access
- Iceberg REST Catalog spec — open protocol, vended credentials, engine support
- Project Nessie — git-like catalog, branches and tags on data
- Apache Polaris — multi-tenant, Iceberg + RBAC
- Picking a catalog — engine support, multi-tenancy, governance trade-offs
- Catalog cheat sheet — REST, Nessie, Polaris recipes
- Frequently asked questions
- Practice on PipeCode
1. Why a catalog — federation, governance, multi-engine access
A catalog is the contract between the table format and every engine that reads it
The one-sentence invariant: an Iceberg catalog is the only thing that lets multiple writers and multiple engines agree on which metadata.json is the current head of a table — every other property (atomicity, federation, governance) falls out of that one rule. Once you accept that the catalog is the lakehouse control plane, the choice between iceberg catalog implementations stops being about features and starts being about which contract you can defend in a code review.
The four properties a real catalog must provide.
-
Federation. A single namespace tree (
catalog → namespace → table) that Spark, Trino, Flink, DuckDB, ClickHouse, Snowflake, and BigQuery can all list and resolve. Without federation you are back to per-engine metastores and you cannot do a cross-engine join without copying data. - Governance. Centralised authentication, authorisation, audit trail, and storage credential vending — engines authenticate to the catalog, the catalog vends short-lived tokens to the storage layer, and no engine ever sees the root S3 / GCS / ABFS key.
- Discovery. A searchable, hierarchical namespace tree that data consumers can browse without joining an internal Slack channel and asking which database the gold layer lives in.
-
Atomicity. The catalog holds the pointer to the current
metadata.json. Every commit is a compare-and-set on that pointer — first writer wins, second writer 409s and retries — which is the only way to make a lakehouse table safely writable by two processes at the same time.
The three places a missing catalog bites.
- Discovery. Without a catalog, "which table holds the gold-tier orders?" is answered by grep on an S3 bucket. Once you have ten domains, this becomes a full-time job for one engineer.
-
Atomicity. Two writers race a commit. Without a CAS-capable catalog, both append a new
metadata.jsonto the same prefix and rely on S3 last-writer-wins. The result is a silently dropped commit and the senior on-call gets paged at 03:00. - Governance. Without a catalog vending credentials, every engine needs a long-lived IAM key with read/write to the whole bucket. The blast radius of a single leaked key is the entire lake.
The three-engine reality.
Once a platform has more than one query engine, the catalog is the only safe way to share Iceberg tables. Spark writes a fact table from an ETL job, Trino reads it for ad-hoc analytics, Flink streams a CDC sink into a sister table, DuckDB pulls the same table for a notebook. All four engines must agree on the current snapshot or one of them will silently read stale data. That agreement is the catalog's job.
What interviewers listen for.
- Do you say "the catalog holds the pointer to the current
metadata.jsonand commits are an atomic CAS on that pointer"? — required answer. - Do you mention vended credentials when asked how engines authenticate to storage? — senior signal.
- Do you reach for RBAC at the namespace / table level the moment multi-tenant comes up? — required answer.
- Do you recognise that the Hive metastore is now deprecated for lakehouse use because it lacks atomic CAS and short-lived credential vending? — senior signal.
The 2026 reality.
- Hive Metastore is officially deprecated for new lakehouse builds — every major engine ships an Iceberg REST connector and treats HMS as legacy.
- AWS Glue offers a REST-compatible mode alongside its native API, so existing Glue users can move to the open spec without re-importing tables.
- Snowflake and Databricks both publish Iceberg via catalog — Snowflake through Polaris (GA mid-2025), Databricks through Unity Catalog's Iceberg REST endpoint (GA late-2025).
- Apache Polaris is an Apache Software Foundation incubator project donated by Snowflake mid-2024, with community drivers from Spark, Trino, Flink, and Dremio.
- Project Nessie ships in Dremio Sonar and as a stand-alone OSS server, and added Iceberg REST compatibility in version 0.80, so a single catalog can serve both git-style and REST clients.
Worked example — what breaks without a catalog
Detailed explanation. New lakehouse builders sometimes try to skip the catalog and let writers manage metadata.json on object storage themselves — "we'll just list the directory and pick the latest." This works exactly long enough for the second writer to show up. The moment two processes commit concurrently, one of them silently loses its append because S3 has no atomic compare-and-set on object names.
Question. Two Spark jobs commit to the same Iceberg table at the same time. The "catalog" is a directory listing on S3 that picks the alphabetically latest metadata-vN.json. Walk me through how a commit gets dropped and why a real catalog fixes it.
Input.
| Time | Writer A | Writer B | S3 state |
|---|---|---|---|
| t0 | reads v3 as current |
reads v3 as current |
v3.json present |
| t1 | writes v4-A.json (new snapshot) |
— |
v3.json, v4-A.json
|
| t2 | — | writes v4-B.json (new snapshot) |
v3.json, v4-A.json, v4-B.json
|
| t3 | next reader lists prefix | — | sees v4-B.json last |
| t4 | — | next reader lists prefix | sees v4-B.json last |
Code.
# BROKEN — no catalog, two writers race
def commit_table(snapshot, table_prefix):
versions = list_s3(f"{table_prefix}/metadata/")
latest = max(versions) # e.g. metadata-v3.json
next_v = bump(latest) # metadata-v4-{uuid}.json
write_s3(f"{table_prefix}/metadata/{next_v}", snapshot)
return next_v
# REAL — catalog mediates the commit via atomic CAS
def commit_table_real(snapshot, table_id, catalog):
current = catalog.load_table(table_id)
new_meta = build_new_metadata(current, snapshot)
# POST /v1/.../commits with requirements.assertRefSnapshotId
catalog.commit_table(
table_id,
requirements=[{"type": "assert-ref-snapshot-id",
"ref": "main",
"snapshot-id": current.snapshot_id}],
updates=new_meta.diff(),
)
Step-by-step explanation.
- Writer A reads the directory at
t0and seesv3.jsonas current. So does Writer B. - Writer A writes
v4-A.jsonatt1with snapshot-idS_Awhose parent isv3. - Writer B writes
v4-B.jsonatt2with snapshot-idS_Bwhose parent isv3. Both children point at the same parent, so the timeline forks but neither writer knows. - A reader at
t3lists the prefix and picksv4-B.json(alphabetically last). Writer A's commit is now invisible — the snapshot is on disk but is not the current head. - The "real" commit through the catalog includes an
assertRefSnapshotIdrequirement: "I claimmainpoints atS_3— please CAS it toS_A." The catalog sees the requirement, processes Writer A first, advancesmaintoS_A, then sees Writer B's claim, finds thatmainis already atS_AnotS_3, returns a 409 Conflict, and Writer B retries against the new head.
Output.
| Path | Without catalog | With catalog |
|---|---|---|
| Writer A snapshot | silently dropped | committed |
| Writer B snapshot | wins by coincidence | retries → committed second |
| Lineage | forked, one branch invisible | linear |
Rule of thumb. Never let writers pick metadata.json themselves. The catalog's compare-and-set on the current snapshot pointer is the only lakehouse primitive that turns S3 / GCS / ABFS into a safe shared table store.
Worked example — namespace + multi-engine read
Detailed explanation. A namespace tree (catalog → namespace → table) is the searchable surface for every Iceberg consumer. Once a catalog publishes it, Spark, Trino, Flink, and DuckDB all enumerate the same tree using their own SQL dialects — SHOW NAMESPACES, SHOW SCHEMAS, SHOW DATABASES, iceberg_scan('...') — all backed by the same REST endpoint.
Question. A table lives at prod.sales.orders. Show how Spark, Trino, and DuckDB each list the namespace and read the table, and explain why the answers are identical.
Input.
| Engine | List command | Read command |
|---|---|---|
| Spark | SHOW NAMESPACES IN prod |
SELECT * FROM prod.sales.orders |
| Trino | SHOW SCHEMAS FROM prod |
SELECT * FROM prod.sales.orders |
| DuckDB | SELECT * FROM duckdb_iceberg.namespaces |
SELECT * FROM iceberg_scan('s3://.../orders') |
Code.
-- Spark — REST catalog config:
-- spark.sql.catalog.prod = org.apache.iceberg.spark.SparkCatalog
-- spark.sql.catalog.prod.catalog-impl = org.apache.iceberg.rest.RESTCatalog
-- spark.sql.catalog.prod.uri = https://catalog.example.com
SHOW NAMESPACES IN prod;
SELECT order_id, total FROM prod.sales.orders WHERE order_date = current_date;
-- Trino — REST connector:
-- iceberg.catalog.type=rest
-- iceberg.rest-catalog.uri=https://catalog.example.com
SHOW SCHEMAS FROM prod;
SELECT order_id, total FROM prod.sales.orders WHERE order_date = current_date;
-- DuckDB — Iceberg extension v1.x
INSTALL iceberg; LOAD iceberg;
CREATE SECRET (TYPE iceberg, ENDPOINT 'https://catalog.example.com', TOKEN '<bearer>');
SELECT order_id, total FROM iceberg_scan('prod.sales.orders')
WHERE order_date = current_date;
Step-by-step explanation.
- Each engine's config points at the same REST catalog URL. The HTTP client inside each engine speaks the same OpenAPI spec — only the surrounding SQL surface differs.
-
SHOW NAMESPACES IN prodtranslates toGET /v1/{prefix}/namespaces?parent=prodover HTTP. The catalog returns the JSON list — Spark, Trino, and DuckDB all receive the same payload. - Reading a table translates to
GET /v1/{prefix}/namespaces/{ns}/tables/{tbl}(load-table). The response includes the currentmetadata-location, vended storage credentials, and the snapshot manifest list. The engine then reads the parquet files directly from object storage using those credentials. - The "answer" is the same because the catalog is the source of truth. The engines are just three different consumers of the same JSON API.
Output.
| Engine | Namespace list | First three orders read |
|---|---|---|
| Spark | [default, sales, marketing] |
1, 2, 3 |
| Trino | [default, sales, marketing] |
1, 2, 3 |
| DuckDB | [default, sales, marketing] |
1, 2, 3 |
Rule of thumb. Pick a catalog whose protocol is open (REST, gRPC, or both). Vendor-specific protocols (HMS / Glue native / Unity proprietary) lock you to one engine flavour; open protocols let you swap engines without re-importing tables.
Worked example — atomic commit via catalog CAS
Detailed explanation. The catalog stores a single pointer per table: "this table's main reference points at snapshot S_n." Every commit is a CAS on that pointer: the writer claims the current value and proposes the new value. The catalog server validates the claim atomically — exactly one writer can succeed for any given parent.
Question. Show how the Iceberg REST commitTable endpoint prevents the lost-update problem, using a worked CAS sequence for two concurrent writers.
Input — two writers, same parent snapshot.
| Writer | Parent snapshot | Proposed child |
|---|---|---|
| A | S_3 |
S_A (append 10k rows) |
| B | S_3 |
S_B (append 8k rows) |
Code.
POST /v1/prod/namespaces/sales/tables/orders/commits
Content-Type: application/json
Authorization: Bearer <token>
{
"requirements": [
{
"type": "assert-ref-snapshot-id",
"ref": "main",
"snapshot-id": 3
},
{
"type": "assert-table-uuid",
"uuid": "550e8400-e29b-41d4-a716-446655440000"
}
],
"updates": [
{ "action": "add-snapshot", "snapshot": { "snapshot-id": 1001, ... } },
{ "action": "set-current-snapshot", "snapshot-id": 1001 }
]
}
Step-by-step explanation.
- Writer A and Writer B both call
loadTableat the same time. Both receivemain → S_3as the current head. Each writes its parquet files to S3 and stages a new snapshot. - Writer A POSTs
commitTablewithassert-ref-snapshot-id: 3 → 1001. The catalog acquires its row-level lock, verifies thatmainisS_3, advancesmaintoS_A(id 1001), releases the lock, and returns200 OK. - Writer B POSTs
commitTablewithassert-ref-snapshot-id: 3 → 1002. The catalog acquires its row-level lock, findsmainisS_A(id 1001), notS_3, and returns409 Conflictwith the current snapshot id. - Writer B retries: re-runs
loadTable, re-bases its append on top ofS_A, stages a new snapshotS_B'with parentS_A, and POSTscommitTablewithassert-ref-snapshot-id: 1001 → 1002. This time the assertion holds and the catalog advancesmaintoS_B'.
Output.
| Path | Without CAS | With CAS |
|---|---|---|
| Snapshot history | forked (S_3 → S_A, S_3 → S_B) |
linear (S_3 → S_A → S_B') |
| Rows lost | 8k or 10k (whoever lost) | 0 |
| Operator wake-ups | 1 (at 03:00) | 0 |
Rule of thumb. Every catalog you evaluate must offer atomic CAS on the current snapshot pointer. If the answer is "we rely on object store consistency," walk away — that is not a lakehouse catalog, that is a directory.
Senior engineering interview question on safe concurrent writes
A platform interviewer often opens with: "Two Spark jobs are appending to the same Iceberg fact table every minute. Walk me through the catalog primitives that make that safe, and the failure mode you would expect without them." It blends federation, atomic CAS, and retry semantics into a single architecture answer.
Solution Using catalog CAS for safe concurrent writes
# Senior-engineer answer — catalog CAS + bounded retry
import time
from random import uniform
def append_with_retry(client, table_id, files, max_retries=8):
"""Commit a snapshot via REST commitTable with optimistic CAS."""
for attempt in range(max_retries):
current = client.load_table(table_id)
new_snapshot = build_snapshot(current, files)
try:
client.commit_table(
table_id,
requirements=[
{"type": "assert-table-uuid",
"uuid": current.uuid},
{"type": "assert-ref-snapshot-id",
"ref": "main",
"snapshot-id": current.snapshot_id},
],
updates=[
{"action": "add-snapshot",
"snapshot": new_snapshot},
{"action": "set-current-snapshot",
"snapshot-id": new_snapshot["snapshot-id"]},
],
)
return new_snapshot["snapshot-id"]
except CommitFailedException: # HTTP 409 from catalog
# someone else committed; back off and retry
time.sleep(uniform(0.05, 0.5) * (2 ** attempt))
raise RuntimeError("commit failed after max retries")
Step-by-step trace.
| Step | Writer A | Writer B | Catalog main ref |
|---|---|---|---|
| 1 |
loadTable → S_3
|
loadTable → S_3
|
S_3 |
| 2 | stage files for S_A
|
stage files for S_B
|
S_3 |
| 3 |
commitTable assert S_3 → S_A
|
— | CAS succeeds → S_A
|
| 4 | — |
commitTable assert S_3 → S_B
|
409 (current is S_A) |
| 5 | — | re-loadTable → S_A; rebase to S_B' parent S_A
|
S_A |
| 6 | — |
commitTable assert S_A → S_B'
|
CAS succeeds → S_B'
|
After six steps both appends land linearly in the snapshot log. The contention cost is one extra round-trip plus one rebase per loser per conflict.
Output:
| Result | Without CAS | With CAS |
|---|---|---|
| Writer A snapshot | dropped | committed (S_A) |
| Writer B snapshot | wins by accident | committed (S_B') |
Final main ref |
inconsistent | S_B' |
| Lost rows | up to 18k | 0 |
| Cost of contention | 1 page | 1 retry per loser |
Why this works — concept by concept:
-
catalog CAS — every commit asserts "I think
mainisS_3" and the catalog atomically advances it toS_Aonly if the claim holds. This is the same compare-and-set primitive every distributed system uses; the catalog server is the single place that serialises the writes. -
assertRefSnapshotId — the REST requirement encodes which reference (
main,dev, a Nessie branch) and which snapshot id the writer thinks is current. Both fields are checked; a mismatch on either is a 409. -
assertTableUUID — guards against the table being dropped and re-created with the same name between
loadTableandcommitTable. Cheap insurance against a class of subtle bugs. -
retry semantics — losers re-
loadTable, rebase their snapshot on the new parent, and re-submit. Exponential backoff with jitter prevents thundering-herd retries. - no S3 lock needed — because the CAS lives in the catalog, the object store can stay last-writer-wins. This is the architectural insight that unlocks lakehouse on commodity blob storage.
-
Cost —
O(1)catalog round-trip in the happy path,O(R)round-trips underRretries,O(parquet bytes)for the file write. Atomicity is bounded by the catalog server's per-table lock, not by the size of the snapshot.
SQL
Topic — dimensional-modeling
Dimensional modelling problems (SQL)
2. Iceberg REST Catalog spec — open protocol, vended credentials, engine support
REST catalog is the OpenAPI of Iceberg — one spec, every engine
The mental model in one line: the Iceberg REST catalog is an OpenAPI-3 contract anyone can implement (Tabular, Polaris, Nessie, Glue, Snowflake, Databricks Unity) and anyone can call (Spark, Trino, Flink, DuckDB, ClickHouse, BigQuery, Athena), so the protocol — not any one vendor — is the moat. Once the spec exists, "switching catalogs" becomes a configuration change, not a re-import.
The endpoint surface in one table.
| Endpoint | HTTP | Purpose | Idempotent |
|---|---|---|---|
/v1/config |
GET | catalog capability discovery + auth hint | yes |
/v1/{prefix}/namespaces |
POST | createNamespace | no |
/v1/{prefix}/namespaces |
GET | listNamespaces | yes |
/v1/{prefix}/namespaces/{ns}/tables |
POST | createTable | no |
/v1/{prefix}/namespaces/{ns}/tables |
GET | listTables | yes |
/v1/{prefix}/namespaces/{ns}/tables/{tbl} |
GET | loadTable (+ vended creds) | yes |
/v1/{prefix}/namespaces/{ns}/tables/{tbl}/commits |
POST | commitTable (atomic CAS) | no |
/v1/{prefix}/namespaces/{ns}/tables/{tbl} |
DELETE | dropTable | no |
/v1/{prefix}/namespaces/{ns}/views |
POST | createView | no |
/v1/{prefix}/oauth/tokens |
POST | OAuth2 client-credentials | yes |
Vended credentials explained.
When an engine calls loadTable, the response body includes a credentials block containing a short-lived storage token (STS, GCS HMAC, Azure SAS) scoped to the table's storage prefix. The engine never sees the root bucket key — it only sees a token that can read or write one table's files, expires in roughly an hour, and is automatically refreshed by another loadTable call.
This is the single feature that turns the catalog from a metadata server into a governance plane. Without vended credentials, a leaked engine credential exposes the whole lake. With vended credentials, the blast radius is one table for one hour.
Engine support matrix (2026).
| Engine | REST catalog | Notes |
|---|---|---|
| Spark | yes (built-in) | org.apache.iceberg.rest.RESTCatalog |
| Trino | yes (built-in) | iceberg.catalog.type=rest |
| Flink | yes (built-in) | CREATE CATALOG ... 'type'='iceberg' 'catalog-type'='rest' |
| DuckDB | yes (extension) |
iceberg extension v1+ |
| ClickHouse | yes |
Iceberg table engine + REST URL |
| Snowflake | yes | catalog-integration → REST URL |
| BigQuery | yes (preview) | external Iceberg metadata source |
| Athena | yes | REST endpoint + IAM token |
OAuth2 auth in one paragraph.
The catalog publishes an oauth/tokens endpoint that accepts client_credentials grants. Engines exchange a client-id / client-secret pair for a short-lived bearer token, then send Authorization: Bearer <token> on every other call. Scopes are catalog-defined (catalog, data:read, data:write), and the catalog enforces them per principal.
The 2026 reality.
- Snowflake's Polaris Catalog is REST-native — there is no proprietary protocol underneath; Polaris simply implements the spec.
- Databricks Unity announces an Iceberg REST endpoint (GA late-2025) — Unity Catalog can now serve any REST-compatible engine, not just Databricks runtime.
- AWS Glue exposes a REST shim alongside its native API — existing Glue users can flip a switch and gain REST clients.
- Tabular (the original commercial REST catalog) was acquired by Databricks in 2024; its engineers now drive the Unity REST surface.
- The spec lives at github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml and is governed by the Apache Iceberg PMC.
Worked example — list namespaces with curl
Detailed explanation. Because the REST catalog is just HTTP + JSON, anything that can speak HTTP can talk to it — curl, Postman, a Python script, an engine. The same call works the same way regardless of the client. This is the dimension that makes the protocol portable.
Question. Show how Trino, Spark, and a raw curl all hit the same listNamespaces endpoint and receive the same response. Explain why this matters for debugging in production.
Input.
| Client | Command |
|---|---|
| curl | curl -H "Authorization: Bearer $T" .../v1/prod/namespaces |
| Spark | SHOW NAMESPACES IN prod |
| Trino | SHOW SCHEMAS FROM prod |
Code.
# Raw HTTP — works from any laptop on the network
curl -H "Authorization: Bearer ${TOKEN}" \
-H "Accept: application/json" \
https://catalog.example.com/v1/prod/namespaces
# Response (same payload regardless of caller):
# {
# "namespaces": [
# ["sales"],
# ["marketing"],
# ["finance"]
# ]
# }
// Spark — same HTTP call under the hood
spark.conf.set("spark.sql.catalog.prod",
"org.apache.iceberg.spark.SparkCatalog")
spark.conf.set("spark.sql.catalog.prod.catalog-impl",
"org.apache.iceberg.rest.RESTCatalog")
spark.conf.set("spark.sql.catalog.prod.uri",
"https://catalog.example.com")
spark.sql("SHOW NAMESPACES IN prod").show()
-- Trino — same HTTP call under the hood
-- etc/catalog/prod.properties:
-- connector.name=iceberg
-- iceberg.catalog.type=rest
-- iceberg.rest-catalog.uri=https://catalog.example.com
SHOW SCHEMAS FROM prod;
Step-by-step explanation.
- Each client builds the same HTTP request:
GET /v1/prod/namespaceswith a bearer token and an Accept header. - The catalog server validates the token, checks that the principal has
catalog:listpermission onprod, and returns the namespace JSON. - Each client parses the response into its native representation (a Spark DataFrame row, a Trino result set row, a JSON object) — but the bytes on the wire are identical.
- When a Trino user complains "I can't see the sales schema," the platform engineer reproduces the failure with
curlin five seconds. No need to spin up a Trino cluster.
Output.
| Client | Output |
|---|---|
| curl | {"namespaces":[["sales"],["marketing"],["finance"]]} |
| Spark | [sales, marketing, finance] |
| Trino | [sales, marketing, finance] |
Rule of thumb. REST means everything is HTTP — and HTTP debuggability is one of the best operational properties a control plane can have. The day you investigate a production catalog issue, you'll be grateful you can curl it.
Worked example — commitTable with optimistic CAS
Detailed explanation. The commitTable endpoint is the heart of the spec. Every other endpoint is glorified discovery; commitTable is the one that turns the catalog into a transactional system. Its anatomy — requirements (CAS assertions) + updates (the change to apply) — is worth memorising for senior interviews.
Question. Given an Iceberg table currently at snapshot S_3, write a commitTable request body that adds a new snapshot S_4 and advances main to it. Explain each requirement field.
Input.
| Field | Value |
|---|---|
current main
|
S_3 (snapshot-id 3) |
| new snapshot id | 1001 |
| new snapshot operation | append |
| new manifest count | 4 |
Code.
POST /v1/prod/namespaces/sales/tables/orders/commits
{
"requirements": [
{
"type": "assert-table-uuid",
"uuid": "550e8400-e29b-41d4-a716-446655440000"
},
{
"type": "assert-ref-snapshot-id",
"ref": "main",
"snapshot-id": 3
},
{
"type": "assert-last-assigned-field-id",
"last-assigned-field-id": 12
}
],
"updates": [
{
"action": "add-snapshot",
"snapshot": {
"snapshot-id": 1001,
"parent-snapshot-id": 3,
"sequence-number": 4,
"timestamp-ms": 1718265600000,
"operation": "append",
"manifest-list": "s3://lake/orders/metadata/snap-1001-...avro",
"summary": {"added-records": "10000"}
}
},
{
"action": "set-current-snapshot",
"snapshot-id": 1001,
"ref": "main"
}
]
}
Step-by-step explanation.
-
assert-table-uuidguards against the table being dropped and re-created under the same name in the milliseconds betweenloadTableandcommitTable. If the UUID changed, the commit aborts with409. -
assert-ref-snapshot-idis the CAS itself: "I believemainis at snapshot 3." If the catalog findsmainis at any other snapshot, the commit aborts with409and returns the actual current snapshot id so the client can rebase. -
assert-last-assigned-field-idguards against schema drift — if the schema gained a column betweenloadTableandcommitTable, the commit aborts and the client must re-derive the new manifest with the new column. -
add-snapshotregisters the new snapshot in the table metadata.set-current-snapshotadvances themainreference. Both updates apply atomically — either all updates land or none do.
Output.
| State | Before | After |
|---|---|---|
main ref |
S_3 |
S_4 |
| snapshots known | [S_1, S_2, S_3] |
[S_1, S_2, S_3, S_4] |
parent of S_4
|
— | S_3 |
| HTTP status | — | 200 OK |
Rule of thumb. Every senior data engineer should know the three-field anatomy of commitTable — UUID, ref snapshot id, last field id. Those three assertions are the whole transactional contract.
Worked example — vended credentials end-to-end
Detailed explanation. loadTable is the endpoint engines call before every scan. When the catalog supports credential vending, the response includes a credentials block with short-lived storage tokens. The engine uses those tokens to read parquet files directly from S3 / GCS / ABFS — the catalog is never in the data path.
Question. Walk me through the entire chain of trust: Spark asks the catalog to load a table, the catalog returns vended credentials, Spark reads parquet directly from S3. What expires when? Who renews?
Input.
| Actor | Identity | Lifetime |
|---|---|---|
| Spark driver | OAuth client_id / secret | long-lived |
| Spark bearer token | from /oauth/tokens
|
1 hour |
| Vended S3 token | from loadTable credentials block |
1 hour |
Code.
# Step 1 — Spark exchanges client-creds for a catalog bearer token
POST /v1/oauth/tokens
grant_type=client_credentials&client_id=spark-prod&client_secret=...
# Response:
# { "access_token": "eyJ...", "expires_in": 3600, "token_type": "Bearer" }
# Step 2 — Spark calls loadTable; catalog returns vended S3 creds
GET /v1/prod/namespaces/sales/tables/orders
Authorization: Bearer eyJ...
# Response:
# {
# "metadata-location": "s3://lake/sales/orders/metadata/v42.json",
# "metadata": { ... full table metadata ... },
# "config": {
# "s3.access-key-id": "ASIA...",
# "s3.secret-access-key": "wJal...",
# "s3.session-token": "FQoG...",
# "s3.session-expires": "1718269200000"
# }
# }
# Step 3 — Spark reads parquet directly from S3 using those vended creds
# (no catalog round-trip per file)
Step-by-step explanation.
- Spark starts up and exchanges its long-lived client-id / client-secret for a 1-hour bearer token. The client-id / client-secret never leaves the driver — only the short-lived bearer does.
- Spark calls
loadTablewith the bearer token in theAuthorizationheader. The catalog server authenticates the principal, authorises read onsales.orders, then asks its storage broker (STS / GCS / Azure) for a 1-hour token scoped only to thes3://lake/sales/orders/prefix. - The catalog returns the table metadata + the vended storage token in one response. Spark caches both for the lifetime of the query.
- Spark reads the manifest list, the manifests, and the data files directly from S3 using the vended token. The catalog is not in the data path — it has touched zero parquet bytes.
- When the vended token expires, Spark calls
loadTableagain to refresh. The catalog re-checks the principal's authorisation each time, so revoking a principal in the catalog takes effect at the next refresh — typically within an hour.
Output.
| Property | Without vending | With vending |
|---|---|---|
| Engine credential | long-lived IAM key | 1-hour token |
| Blast radius of leak | whole bucket forever | one table prefix for one hour |
| Revocation latency | hours to days | <= 1 hour |
| Catalog in data path | no | no |
Rule of thumb. Always insist on vended credentials when picking a catalog. The lakehouse story falls apart if every engine needs a long-lived IAM key — the only way to give Spark, Trino, Flink, DuckDB safe access to the same lake is for the catalog to hand each of them a scoped, expiring token.
Senior engineering interview question on REST catalog wiring across engines
A senior interviewer might frame this as: "Your platform has Spark for ETL, Trino for ad-hoc analytics, and Flink for streaming. Walk me through wiring all three to a single Iceberg REST catalog and explain what configuration each engine needs."
Solution Using a single REST catalog config across three engines
# Spark — set as session-level catalog
spark.sql.catalog.prod = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.prod.catalog-impl = org.apache.iceberg.rest.RESTCatalog
spark.sql.catalog.prod.uri = https://catalog.example.com
spark.sql.catalog.prod.credential = <oauth-client-id>:<oauth-client-secret>
spark.sql.catalog.prod.warehouse = s3://lake/prod
spark.sql.catalog.prod.io-impl = org.apache.iceberg.aws.s3.S3FileIO
# Trino — etc/catalog/prod.properties
connector.name = iceberg
iceberg.catalog.type = rest
iceberg.rest-catalog.uri = https://catalog.example.com
iceberg.rest-catalog.security = OAUTH2
iceberg.rest-catalog.oauth2.credential = <client-id>:<client-secret>
iceberg.rest-catalog.warehouse = s3://lake/prod
fs.native-s3.enabled = true
-- Flink — CREATE CATALOG via SQL
CREATE CATALOG prod WITH (
'type' = 'iceberg',
'catalog-type' = 'rest',
'uri' = 'https://catalog.example.com',
'credential' = '<client-id>:<client-secret>',
'warehouse' = 's3://lake/prod',
'io-impl' = 'org.apache.iceberg.aws.s3.S3FileIO'
);
USE CATALOG prod;
Step-by-step trace.
| Step | Spark | Trino | Flink |
|---|---|---|---|
| 1 | exchanges client-creds for bearer | exchanges client-creds for bearer | exchanges client-creds for bearer |
| 2 | calls listNamespaces → [sales, marketing, finance]
|
calls listNamespaces → identical |
calls listNamespaces → identical |
| 3 | calls loadTable(sales.orders) → metadata + vended creds |
calls loadTable(sales.orders) → metadata + vended creds |
calls loadTable(sales.orders) → metadata + vended creds |
| 4 | reads parquet from S3 directly | reads parquet from S3 directly | reads parquet from S3 directly |
| 5 | atomic commitTable for ETL writes |
(read-only) | atomic commitTable for sink writes |
After five steps all three engines have the same view of the lake, all three are authenticated via OAuth, and none of them holds a long-lived S3 key.
Output:
| Engine | Auth | Storage credential | Sees same data? |
|---|---|---|---|
| Spark | OAuth bearer | vended STS (1h) | yes |
| Trino | OAuth bearer | vended STS (1h) | yes |
| Flink | OAuth bearer | vended STS (1h) | yes |
Why this works — concept by concept:
- one URI — every engine points at the same REST endpoint, so the catalog is the single source of truth. Changing the warehouse location is a configuration change in one place, not three.
- OAuth token — every engine uses the same client-credentials flow. Revoking a client in the catalog kills its access across all three engines simultaneously.
-
vended credentials — none of the engines holds a long-lived S3 key. Each
loadTablerefreshes the storage token, and the catalog re-checks authorisation each refresh. -
capability discovery — the
GET /v1/configendpoint returns the catalog's capabilities (e.g.view-versions,multi-table-commits). Engines can adapt their behaviour to what the catalog supports without hard-coded vendor checks. -
atomic CAS — Spark and Flink both write via
commitTable. Concurrent commits from both engines serialise through the catalog's per-table lock; neither loses a snapshot. -
Cost —
O(1)catalog round-trip per scan plus one OAuth refresh per hour. Data-path bandwidth is unchanged from non-REST architectures because the catalog is never on the file read path.
SQL
Topic — etl
ETL design problems (SQL)
3. Project Nessie — git-like catalog, branches and tags on data
Nessie is git for the data lakehouse
The mental model in one line: Nessie turns the lakehouse catalog into a version-controlled reference store — every commit is a hash, every table version is reachable by branch or tag, and git merge semantics apply to data tables across the entire catalog. Once you have used a Write-Audit-Publish branch in anger, the snapshot-only history of vanilla REST catalogs feels like working without version control.
The four primitives.
-
Branches. A branch is a named, mutable reference (like git
main) that points at a commit hash. New branches are copy-on-write at the reference level — branching offmainisO(1)and consumes no data bytes. - Tags. A tag is a named, immutable reference. Tags exist for releases, regulatory snapshots, and "the state of the lake at month-end" anchors. A tag cannot be force-moved.
- Commits. A commit is a hash that records a snapshot of every table's state. A single Nessie commit can advance multiple tables at once — the atomic multi-table commit feature that pure REST catalogs lack.
- Merges and cherry-picks. Carry curated commits from one branch to another with full git-style merge semantics, including conflict resolution.
Architecture.
-
Server. Java service, gRPC + REST surface, ships in Dremio and as a stand-alone OSS server (
projectnessie/nessie). - Backend. RocksDB for embedded, JDBC (Postgres / MySQL) or DynamoDB for production. The backend is a key-value store of commit hashes, not the data itself.
- Storage. Data files live on S3 / GCS / ABFS exactly as with any other Iceberg catalog. Nessie is purely the reference store — the parquet bytes are untouched.
The Write-Audit-Publish (WAP) pattern.
WAP is the canonical use of Nessie:
-
Write. Branch off
mainintoetl_2026_06_13. Run the ETL job against that branch. The writes land in normal parquet files but the references advance only on the dev branch. -
Audit. Run dbt tests, Great Expectations, or any quality check against the dev branch. Production consumers are still on
main; they see none of the in-flight changes. -
Publish. If audit passes,
MERGEthe dev branch intomain. If audit fails, drop the dev branch —O(1)rollback with zero data movement.
Every regulated industry on earth invents WAP independently; Nessie ships it as a built-in.
Multi-table atomic commits.
The Iceberg REST spec has a commitMultiTable extension, but it is optional and not all catalogs implement it. Nessie is multi-table atomic by construction — every Nessie commit can advance N tables in one transaction. A MERGE across two tables (e.g. orders + order_items) lands as a single Nessie commit; either both tables advance or neither does.
Engine support.
| Engine | Nessie support | Notes |
|---|---|---|
| Spark | yes (built-in) | org.projectnessie.spark.NessieCatalog |
| Flink | yes | nessie catalog plugin |
| Trino | yes (via Iceberg REST shim) | Nessie 0.80+ |
| Dremio Sonar | yes (native) | Dremio markets Nessie as the OSS Polaris alternative |
| Presto | yes | nessie plugin |
| DuckDB | yes (via Iceberg REST shim) | Nessie 0.80+ |
The 2026 reality.
- Nessie 0.80+ ships Iceberg REST-spec compatibility — a single Nessie server can serve both git-style clients and vanilla REST clients.
- Dremio Sonar markets Nessie as the open-source alternative to Polaris for organisations that need WAP and multi-table atomicity.
- Multi-table commit is unmatched by any other open catalog — Polaris is single-table-atomic, Glue is single-table-atomic, Unity is single-table-atomic.
- Snowflake and BigQuery do not (yet) speak Nessie's git API, only its REST surface; the git semantics are most powerful from Spark, Flink, Trino, and Dremio.
Worked example — branch / commit / merge a WAP flow
Detailed explanation. WAP turns "did I just corrupt prod?" into "did I just corrupt my branch?" — and a corrupt branch costs nothing because dropping it is one CLI command. The reads on main see none of the in-flight writes until the publish step.
Question. Walk me through a daily ETL run that ingests orders into the gold layer using a Nessie WAP pattern. Show the CLI / SQL commands at each step.
Input.
| Step | Branch state |
main state |
|---|---|---|
| 0 | (none) |
gold.orders at v100 |
| 1 (branch) |
etl_2026_06_13 at v100 |
v100 |
| 2 (write) |
etl_2026_06_13 at v101 (10k new rows) |
v100 |
| 3 (audit) | run dbt tests on dev branch | v100 |
| 4 (merge) |
etl_2026_06_13 at v101 (merged) |
v101 |
Code.
-- Step 1 — branch off main
CREATE BRANCH etl_2026_06_13 IN nessie FROM main;
USE REFERENCE etl_2026_06_13 IN nessie;
-- Step 2 — write to the dev branch only
INSERT INTO gold.orders
SELECT * FROM silver.orders_landing
WHERE ingestion_date = DATE '2026-06-13';
-- Step 3 — audit (dbt tests, GE checks) against the dev branch
SELECT COUNT(*) AS new_rows FROM gold.orders@etl_2026_06_13;
SELECT COUNT(*) AS dupe_keys
FROM (
SELECT order_id, COUNT(*) c
FROM gold.orders@etl_2026_06_13
GROUP BY 1
HAVING COUNT(*) > 1
);
-- Step 4 — publish (merge into main) or drop on failure
-- if audit passed:
MERGE BRANCH etl_2026_06_13 INTO main IN nessie;
-- if audit failed:
DROP BRANCH etl_2026_06_13 IN nessie;
Step-by-step explanation.
-
CREATE BRANCHis a metadata-only operation — no data is copied. The new branch starts as a pointer atmain's current commit hash.O(1)time,O(1)space. - The INSERT on the dev branch writes parquet files to S3 (those bytes are physical) but advances only the dev branch reference. Production readers on
mainsee exactly the same data they saw before. - Audit queries use the
@etl_2026_06_13reference syntax to read from the dev branch. The same SQL runs against the dev state — no special tooling required. - If audit passes,
MERGE BRANCHfast-forwardsmainto the dev commit. The merge itself is a metadata-only reference update — no data movement. - If audit fails,
DROP BRANCHdeletes the dev reference. The parquet files become orphaned and are cleaned up by the next Iceberg expire-snapshots run. Zero downtime, zero risk of half-applied state onmain.
Output.
| State | After step 1 | After step 2 | After step 4 (success) | After step 4 (failure) |
|---|---|---|---|---|
main ref |
v100 | v100 | v101 | v100 |
etl_2026_06_13 ref |
v100 | v101 | merged | dropped |
Rows on main
|
1M | 1M | 1.01M | 1M |
| Audit time | — | — | seconds | seconds |
Rule of thumb. Branch every ETL job. The cost is one metadata operation; the payoff is bulletproof rollback semantics for the entire pipeline.
Worked example — multi-table atomic commit
Detailed explanation. A fact-and-dimension append (e.g. orders + order_items) must land atomically — either both tables get the new rows or neither does. Vanilla REST catalogs cannot guarantee this across two tables. Nessie can, because a single Nessie commit can advance many tables at once.
Question. Walk me through inserting into orders and order_items as a single atomic operation across both tables using Nessie. Show the failure mode under a single-table-atomic catalog.
Input.
| Table | Before commit | After commit |
|---|---|---|
orders |
1,000,000 rows | 1,010,000 rows |
order_items |
3,500,000 rows | 3,536,000 rows |
Code.
-- Spark + Nessie — multi-table commit on a single branch
USE REFERENCE etl_orders_2026_06_13 IN nessie;
INSERT INTO orders SELECT * FROM landing.orders_new;
INSERT INTO order_items SELECT * FROM landing.order_items_new;
-- One Nessie commit covers both INSERTs because they share a single branch.
-- When we merge the branch into main:
MERGE BRANCH etl_orders_2026_06_13 INTO main IN nessie;
-- → main advances atomically: both tables update or neither does.
Step-by-step explanation.
- Both INSERTs run on the same branch (
etl_orders_2026_06_13). Each INSERT advances the branch reference; under the hood, the second INSERT's commit has the first as parent. -
MERGE BRANCHfast-forwardsmainto the head of the dev branch. The fast-forward is a single metadata operation on a single reference, but it carries the entire chain of dev commits with it — both INSERTs land as one atomic step frommain's perspective. - Readers on
mainsee either neither INSERT (before merge) or both INSERTs (after merge). There is no window in whichordershas the new rows butorder_itemsdoes not. - Under a single-table-atomic catalog (vanilla REST, Glue, Polaris), the same workflow would land each INSERT as a separate commit. A reader between the two commits would see
ordersadvance withoutorder_items— a foreign-key-inconsistent view.
Output.
| Reader timing | Nessie multi-table | Single-table REST |
|---|---|---|
| Before either INSERT | old, old | old, old |
| Between INSERTs | (not possible) | old + new, old |
| After both INSERTs | new, new | new, new |
Rule of thumb. If your tables have cross-table invariants (FK, parent-child counts, sum-equal-sum), you need a multi-table-atomic catalog. Nessie is the only mainstream open catalog that ships this; the alternative is to model both tables as a single wide table and lose the dimensional model.
Worked example — tag a daily release
Detailed explanation. Tags are immutable references — the snapshot equivalent of a git release tag. They are how you say "the lake as of midnight" for regulatory purposes, end-of-month reconciliation, or downstream consumers that want a frozen view.
Question. Show how to create an immutable daily release tag from main and how to query the lake at that tag from Spark.
Input.
| Reference | Type | Mutability |
|---|---|---|
main |
branch | mutable |
release_2026_06_12 |
tag | immutable |
Code.
-- Create the tag (typically scheduled at 00:00 UTC)
CREATE TAG release_2026_06_12 IN nessie FROM main;
-- Optional protection — make the tag undroppable except by admin
ALTER TAG release_2026_06_12 SET access_rules = ('drop' -> 'admin-only');
-- Query the lake at the tag
USE REFERENCE release_2026_06_12 IN nessie;
SELECT COUNT(*) FROM gold.orders;
-- → returns the count as of midnight 2026-06-12
-- Or inline:
SELECT COUNT(*) FROM gold.orders@release_2026_06_12;
Step-by-step explanation.
-
CREATE TAGrecords the current commit hash ofmainunder the new immutable reference name. The operation is metadata-only andO(1). -
USE REFERENCEswitches the session's read context to the tag. Every subsequent query reads tables as of the tagged commit. - The
@release_2026_06_12inline syntax lets a single query read from the tag without switching session context — useful for ad-hoc reconciliation. - Tags are immutable by default — attempting to move or overwrite a tag returns an error. Access rules can further restrict who may drop the tag at all.
Output.
| Query | Result |
|---|---|
SELECT COUNT(*) FROM gold.orders on main
|
varies (live) |
SELECT COUNT(*) FROM gold.orders@release_2026_06_12 |
fixed (frozen at tag time) |
Rule of thumb. Tag every meaningful snapshot — end-of-day, end-of-month, regulatory cut-offs. The cost is one reference write; the payoff is an immutable, reproducible view of the lake that survives any subsequent ETL run.
Senior engineering interview question on rollback via Nessie branches
A senior interviewer might frame this as: "A bad ETL run corrupted the silver layer this morning. The corrupt data has already been read by three downstream jobs. Walk me through rolling back to the previous day's state with zero data movement and zero downtime."
Solution Using Nessie ASSIGN BRANCH for instant rollback
-- Senior-engineer answer — atomic reference reassignment
-- 1) Identify the last good tag (created at midnight yesterday)
SHOW REFERENCES IN nessie;
-- → main, release_2026_06_12, etl_2026_06_13 (corrupt)
-- 2) (Optional) tag the corrupt state for forensics
CREATE TAG forensics_2026_06_13_bad IN nessie FROM main;
-- 3) Reassign main to the previous day's release tag
ASSIGN BRANCH main IN nessie TO release_2026_06_12;
-- 4) Verify
USE REFERENCE main IN nessie;
SELECT MAX(ingestion_date) FROM silver.orders;
-- → 2026-06-12 (rollback confirmed)
Step-by-step trace.
| Step | Timeline |
main ref |
Reader view |
|---|---|---|---|
| 1 | 09:00 ETL writes bad data | S_etl_bad |
corrupt |
| 2 | 09:30 alert fires | S_etl_bad |
corrupt |
| 3 | 09:35 tag forensics | S_etl_bad |
corrupt |
| 4 | 09:36 ASSIGN BRANCH main TO release_2026_06_12
|
S_release_06_12 |
clean |
| 5 | 09:36 downstream reads | S_release_06_12 |
clean |
After five steps, the entire lake is back to yesterday's state. The bad commit S_etl_bad still exists on disk (and under the forensics tag) — nothing has been deleted. The rollback was a single metadata reference update.
Output:
| Property | Before rollback | After rollback |
|---|---|---|
main ref |
S_etl_bad |
S_release_06_12 |
| Reader view of silver | corrupt | yesterday's clean state |
| Data files on S3 | unchanged | unchanged |
| Downtime | — | 0 |
| Forensics tag | none | forensics_2026_06_13_bad |
Why this works — concept by concept:
-
reference, not snapshot — Nessie rolls back by re-pointing the
mainbranch reference at an earlier commit hash. No snapshots are deleted, no manifests are rewritten, no parquet files are touched. -
atomic reassign —
ASSIGN BRANCHis a single transaction at the Nessie level. Either the reference moves or it does not; there is no intermediate state where half the catalog points old and half points new. -
no data movement — the rollback is
O(1)in bytes. Compare with a snapshot-based catalog where rolling back across multiple tables requires running anexpire_snapshotsper table and possibly rewriting manifests. -
multi-table consistency — because Nessie commits are catalog-wide, reassigning
mainrolls back every table in the catalog to the tag's state. There is no risk of one table reverting while a sibling stays at the bad commit. - forensics preserved — tagging the corrupt state before rollback keeps the bad commit reachable for post-mortem. The parquet files survive until the next expire-snapshots run.
-
Cost —
O(1)metadata writes. The whole rollback is two reference updates (forensics tag + ASSIGN BRANCH) and completes in single-digit milliseconds. The slowest part is the on-call engineer typing the command.
SQL
Topic — time-series
Time-series analysis problems (SQL)
4. Apache Polaris — multi-tenant, Iceberg + RBAC
Polaris is Snowflake's open Iceberg catalog — multi-tenant from day one
The mental model in one line: Apache Polaris is an Iceberg REST catalog designed around a SaaS multi-tenant control plane — every entity is a CATALOG / PRINCIPAL / ROLE / GRANT, RBAC is enforced at the namespace and table level, and short-lived storage credentials are vended for every load. If your platform serves multiple product teams, lines of business, or customer tenants on one lake, Polaris is the catalog you can defend to your security team.
The Polaris object model.
-
Catalog. The top-level container. A catalog has a name, a default storage location, and a set of grants. Catalogs are usually one-per-tenant or one-per-environment (
prod,staging,dev). - Namespace. A nested folder inside a catalog. Maps cleanly to schemas in Spark / Trino.
- Table / View. The standard Iceberg objects. Grants apply at table and view level.
- Principal. A user or service-account identity. Principals authenticate via OAuth2 client-credentials and own zero permissions until granted.
- Role. A bundle of grants. Roles are assigned to principals. Multiple roles per principal allowed; effective permissions are the union.
-
Grant. A
(privilege, securable)pair attached to a role. Privileges are SQL-flavoured (USAGE,SELECT,INSERT,UPDATE,MERGE,CREATE_NAMESPACE, etc.).
The federation knob.
A single Polaris server can host many catalogs. One Polaris instance can simultaneously serve:
- A Snowflake account (Snowflake reads Polaris as its native Iceberg catalog).
- A Spark cluster doing ETL.
- A Trino cluster doing analytics.
- A Flink streaming sink.
- A Dremio reader.
The principals + roles + grants model isolates each consumer without splitting the catalog.
Storage credential vending.
Polaris supports STS (AWS), Workload Identity Federation (GCP), and Azure SAS for storage. On every loadTable, the catalog generates a short-lived storage token scoped to the table's prefix and returns it in the response. Engines never see the root storage credential.
The 2026 reality.
- Apache Polaris is an Apache Software Foundation incubator project donated by Snowflake mid-2024. The Apache name lends governance neutrality versus a Snowflake-owned project.
- GA on Snowflake mid-2025. Snowflake customers can spin up a Polaris catalog without managing it themselves.
- Self-hosted Polaris runs as a Java service backed by Postgres for metadata. Single-node and HA topologies are documented.
- Spark, Trino, Flink, and Dremio drivers all speak Polaris through the standard Iceberg REST spec — no Polaris-specific code paths.
- Databricks Unity Catalog now also publishes Iceberg via REST, competing directly. Polaris distinguishes itself with the multi-tenant catalog model and a fully open governance story.
Worked example — create a tenant + role + grant in Polaris
Detailed explanation. Onboarding a new team to a multi-tenant lakehouse is a four-step ritual: create the catalog, create the role, create the principal, grant the role's privileges. Polaris models this with familiar SQL DDL (and equivalent REST calls).
Question. Walk me through onboarding a new "ml-platform" team to a Polaris-managed lakehouse. They need full control of their own catalog plus read-only access to the shared "dim" catalog.
Input.
| Object | Name | Type |
|---|---|---|
| catalog | ml_platform |
new |
| catalog | dim |
existing (shared) |
| role | ml_platform_owner |
new |
| principal | svc-ml-platform-prod |
new |
Code.
-- 1) Create the tenant catalog
CREATE CATALOG ml_platform
TYPE = ICEBERG
STORAGE = 's3://lake/ml-platform/'
STORAGE_CREDENTIAL_KEY = 'aws-role-ml-platform';
-- 2) Create the role for that catalog
CREATE ROLE ml_platform_owner;
-- 3) Grant full control of the new catalog to the role
GRANT USAGE ON CATALOG ml_platform TO ROLE ml_platform_owner;
GRANT CREATE_NAMESPACE
ON CATALOG ml_platform TO ROLE ml_platform_owner;
GRANT MANAGE ON CATALOG ml_platform TO ROLE ml_platform_owner;
-- 4) Grant read-only access to the shared dim catalog
GRANT USAGE ON CATALOG dim TO ROLE ml_platform_owner;
GRANT SELECT ON ALL TABLES IN CATALOG dim TO ROLE ml_platform_owner;
-- 5) Create a service-account principal and bind the role
CREATE PRINCIPAL svc_ml_platform_prod
CLIENT_ID = 'ml-platform-prod'
CLIENT_SECRET = '<generated>';
GRANT ROLE ml_platform_owner TO PRINCIPAL svc_ml_platform_prod;
Step-by-step explanation.
- The new catalog gets its own storage prefix and IAM role. Polaris remembers the IAM role and will use it to mint vended STS tokens for any reader of this catalog.
-
ml_platform_owneris the role bundle — every member of the team gets this role on their principal. Adding a new team member is a one-line role grant. - The four
GRANT ... ON CATALOG ml_platformstatements give the role full control of its own catalog: usage, namespace creation, and full manage. - The two
GRANT ... ON CATALOG dimstatements add read-only access to the shared dim catalog — the team can read shared dimensions but cannot modify them. -
CREATE PRINCIPALmints OAuth client credentials.GRANT ROLE ... TO PRINCIPALbinds the role to the identity. The team can now point their Spark / Trino / Flink clients at Polaris with these credentials.
Output.
| Resource | Access by svc_ml_platform_prod
|
|---|---|
ml_platform catalog |
full control |
ml_platform.*.* tables |
full CRUD |
dim catalog |
usage + read |
dim.*.* tables |
SELECT only |
| any other catalog | no access |
Rule of thumb. Model one catalog per tenant. Cross-tenant sharing happens through explicit GRANT USAGE + GRANT SELECT. This keeps the blast radius of any misconfiguration to one tenant and makes per-tenant billing trivial (catalog == cost-allocation unit).
Worked example — Spark reads via Polaris with OAuth client-creds
Detailed explanation. Once a principal exists, every engine logs in with the same OAuth2 client-credentials flow. Spark sees Polaris as a vanilla REST catalog — the multi-tenant model is invisible at the wire level.
Question. Configure a Spark session to read from the ml_platform.features.user_features table using the svc_ml_platform_prod principal. Walk through the request sequence.
Input.
| Item | Value |
|---|---|
| Polaris URL | https://polaris.example.com |
| Catalog | ml_platform |
| Client id | ml-platform-prod |
| Client secret | <vault-managed> |
Code.
from pyspark.sql import SparkSession
spark = (
SparkSession.builder
.appName("ml-features-read")
.config(
"spark.sql.catalog.ml",
"org.apache.iceberg.spark.SparkCatalog",
)
.config(
"spark.sql.catalog.ml.catalog-impl",
"org.apache.iceberg.rest.RESTCatalog",
)
.config("spark.sql.catalog.ml.uri",
"https://polaris.example.com")
.config("spark.sql.catalog.ml.warehouse", "ml_platform")
.config("spark.sql.catalog.ml.credential",
"ml-platform-prod:${env:ML_PLATFORM_SECRET}")
.config("spark.sql.catalog.ml.scope", "PRINCIPAL_ROLE:ALL")
.config("spark.sql.catalog.ml.io-impl",
"org.apache.iceberg.aws.s3.S3FileIO")
.getOrCreate()
)
df = spark.sql("""
SELECT user_id, feature_v1, feature_v2
FROM ml.features.user_features
WHERE region = 'EU'
""")
df.show()
Step-by-step explanation.
- Spark starts up. The driver exchanges the OAuth client-id / secret for a bearer token via
POST /v1/oauth/tokens. - Spark calls
GET /v1/configfor capability discovery andGET /v1/ml_platform/namespacesto list the catalog. Polaris authorises the principal: it hasUSAGEonml_platform, so the list succeeds. - Spark calls
GET /v1/ml_platform/namespaces/features/tables/user_features. Polaris authorisesSELECTon the table (granted via the role), mints a vended STS token scoped to the table prefix, and returns the metadata + creds. - Spark reads parquet from S3 using the vended STS token. The catalog is not in the data path.
- The bearer token + vended STS both expire in 1 hour. The next scan triggers refresh on both.
Output.
| Stage | Polaris response |
|---|---|
| OAuth | 200 + bearer (1h) |
| listNamespaces | 200 + [features, models, serving]
|
| loadTable | 200 + metadata + vended STS |
| S3 read | 200 + parquet bytes |
| Aggregate query | rows from EU users |
Rule of thumb. Polaris from an engine's perspective looks like any other REST catalog. The multi-tenant model is invisible at the wire — it shows up only in authorisation decisions on the server side.
Worked example — storage credential vending on read
Detailed explanation. The single biggest operational reason to pick Polaris is that no engine ever holds a long-lived S3 key. Every read mints a fresh, scoped, expiring STS token. Revoking a principal in Polaris kills its lake access within an hour, regardless of how many engines were running.
Question. Walk through what an attacker would see if they stole the bearer token from a Spark driver running against Polaris. Compare with a non-vending catalog.
Input.
| Asset | Stolen | Useful for |
|---|---|---|
| Spark bearer (OAuth) | 1h | calling Polaris APIs as the principal |
| Vended S3 STS | 1h | reading one table prefix |
| Long-lived S3 key (counter-factual) | forever | reading the whole bucket |
Code.
# Attack surface with Polaris vended creds
attacker has bearer → can call loadTable for ANY table the principal is
allowed to access, for the next ~1h
attacker has STS → can read s3://lake/ml-platform/features/user_features/
for the next ~1h, then 403
# Attack surface with a non-vending catalog
attacker has S3 key → can read s3://lake/ recursively
forever (until key is manually rotated, days to weeks)
Step-by-step explanation.
- The Polaris bearer token is scoped to the principal's role bundle. An attacker can read tables the principal can read, but cannot escalate.
- The vended S3 STS is scoped to a single table prefix. An attacker who somehow lifts the STS token cannot read sibling tables, only the one that loaded.
- Both tokens expire in roughly an hour. Revoking the principal in Polaris kills future minting; in-flight tokens still live to their natural expiry. The whole lake is safe again within an hour.
- Under a long-lived-key model, the equivalent recovery is: rotate the S3 key, redeploy every engine, audit every read since the leak. Hours to days, not minutes.
Output.
| Threat | With Polaris vended creds | With long-lived S3 key |
|---|---|---|
| Blast radius | one principal, one table | the whole bucket |
| Time to revoke | <= 1 hour automatic | hours to days manual |
| Lateral movement | none | cross-table, cross-tenant |
| Audit trail | per-load entry in catalog | none |
Rule of thumb. If your security team has ever asked "what happens if a service account leaks?" — Polaris's answer is "an hour of one table." That answer alone is often enough to win the catalog decision.
Senior engineering interview question on multi-tenant catalog design
A senior interviewer might frame this as: "You're picking a single Iceberg catalog to serve 12 product teams, each with their own data and one shared dim layer. Walk me through how Polaris's catalogs / principals / roles / grants solve isolation, billing, and discoverability — and the trade-off versus per-team catalog instances."
Solution Using Polaris catalogs-as-tenants pattern
-- One catalog per team, plus one shared dim catalog
CREATE CATALOG dim STORAGE = 's3://lake/dim/';
CREATE CATALOG sales STORAGE = 's3://lake/sales/';
CREATE CATALOG marketing STORAGE = 's3://lake/marketing/';
CREATE CATALOG product STORAGE = 's3://lake/product/';
-- ... etc for 12 teams
-- Per-team owner role
CREATE ROLE sales_owner;
GRANT MANAGE ON CATALOG sales TO ROLE sales_owner;
GRANT USAGE ON CATALOG dim TO ROLE sales_owner;
GRANT SELECT ON ALL TABLES IN CATALOG dim TO ROLE sales_owner;
-- A global "data-platform" role for the platform team
CREATE ROLE data_platform;
GRANT MANAGE ON CATALOG dim TO ROLE data_platform;
-- (the platform team can manage shared dims, not tenant catalogs)
-- Bind principals
CREATE PRINCIPAL svc_sales_prod CLIENT_ID = 'sales-prod';
CREATE PRINCIPAL svc_platform_prod CLIENT_ID = 'platform-prod';
GRANT ROLE sales_owner TO PRINCIPAL svc_sales_prod;
GRANT ROLE data_platform TO PRINCIPAL svc_platform_prod;
Step-by-step trace.
| Step | What | Effect |
|---|---|---|
| 1 | one catalog per team + one shared dim
|
isolation by name + by storage prefix |
| 2 | per-team owner role with MANAGE on team catalog |
full self-service inside the team's catalog |
| 3 | per-team owner also has SELECT on shared dim
|
discoverable cross-tenant joins without copies |
| 4 | platform team has MANAGE only on dim, not tenant catalogs |
governance separation: platform manages shared, tenants own their own |
| 5 | each principal sees only its catalogs in listCatalogs
|
discoverability + isolation in one model |
After five steps, 12 teams each have their own catalog, can self-serve namespaces and tables, can read the shared dim layer without copying, cannot read each other's data, and the platform team controls the shared dim without intruding on tenant catalogs.
Output:
| Principal | Catalogs visible |
MANAGE on |
SELECT on shared dim |
|---|---|---|---|
svc_sales_prod |
sales, dim
|
sales |
yes |
svc_marketing_prod |
marketing, dim
|
marketing |
yes |
svc_platform_prod |
every catalog |
dim only |
yes (own) |
| any other | only own catalog + dim
|
own only | yes |
Why this works — concept by concept:
-
catalog-as-tenant — one catalog per team gives a clean cost-allocation unit (storage prefix == billing key), a clean blast-radius unit (a misconfiguration affects one catalog), and a clean discovery unit (the team's catalog is what they see when they
listCatalogs). -
role inheritance — roles bundle grants. Adding a new team member is one
GRANT ROLEstatement; offboarding is oneREVOKE. No per-table fiddling. -
principal granularity — separate service-account principals per environment (
svc_sales_prod,svc_sales_dev) keep secrets and audit trails distinct. The principal is the unit of secret rotation. -
cross-catalog grants —
GRANT SELECT ON CATALOG dimlets every team join against shared dimensions without copying them into per-tenant catalogs. This is the killer pattern for the "single shared dim, many fact catalogs" architecture. - governance separation — platform team has manage rights only on the shared dim catalog. Tenant teams have manage rights only on their own catalog. Neither can step on the other's toes.
-
Cost —
O(catalogs)storage prefixes,O(catalogs)IAM roles,O(roles + grants)rows in the Polaris metadata DB. No per-table overhead — even 200 tables per team is comfortable for a single Polaris instance.
SQL
Topic — design
System design problems (SQL)
5. Picking a catalog — engine support, multi-tenancy, governance trade-offs
Three catalogs, four axes — pick by the one that bites first
The mental model in one line: the three open Iceberg catalogs (vanilla REST, Nessie, Polaris) score differently on four axes — engine fit, multi-tenancy, versioning, governance — and the right answer is the one that nails whichever axis is most painful in your platform today. Once the axes are explicit, the choice stops being about marketing fit and starts being about which trade-off you can live with.
The feature matrix.
| Catalog | Engine fit | Multi-tenancy | Versioning | Governance |
|---|---|---|---|---|
| REST (vanilla) — Tabular / Glue / DIY | strong (built into every engine) | partial (single-catalog, requires app-layer logic) | snapshot only | partial (depends on impl) |
| Nessie | strong (Iceberg + REST shim) | basic (single-server, branch as isolation) | killer (branches / tags / merge / multi-table atomic) | basic RBAC |
| Polaris | strong (REST-native) | killer (catalogs + principals + roles + grants) | snapshot only (no branches) | strong (ANSI-flavour GRANT / REVOKE) |
When to pick which.
- Pick vanilla REST when you have one team, one tenant, no version-control needs, and you want the lightest possible catalog. Glue's REST mode, Tabular, or a small Polaris instance all qualify.
- Pick Nessie when WAP is non-negotiable, when multi-table atomicity matters (FK invariants, parent-child counts), or when you want git semantics on top of the lake.
- Pick Polaris when multi-tenancy is non-negotiable, when you need ANSI-style RBAC, when you serve multiple product teams or external customers, or when you want a Snowflake-native Iceberg catalog.
The hybrid pattern.
Most large enterprises in 2026 end up running Polaris in front of Nessie-backed branches via a federation layer (Gravitino, custom proxy). The line between the two is blurring as REST becomes the lingua franca — Nessie 0.80+ already speaks the REST spec, and the Polaris community is openly discussing branch / tag semantics for future versions.
The 2026 reality.
- REST is the lingua franca. Every serious catalog speaks it; the choice between catalogs is now mostly about features, not protocol.
- Polaris vs Unity is the new vendor decision. Snowflake-aligned shops pick Polaris; Databricks-aligned shops pick Unity. Both publish via REST, so the engine layer is identical.
- Nessie owns the WAP + multi-table niche. No other open catalog ships true multi-table atomic commits.
- Glue's REST mode is the easy migration. AWS-native shops with existing Glue tables flip a config flag and gain REST clients without re-importing.
Worked example — three-tenant analytics platform
Detailed explanation. A data platform team supports three product teams — sales analytics, marketing analytics, finance — on one lake. Each team has 30-80 tables. The shared dim layer holds customer / product / calendar. Tenant isolation and per-team billing are explicit requirements.
Question. Pick a catalog and justify the choice on each of the four axes.
Input.
| Requirement | Priority |
|---|---|
| 3 teams + shared dim | high |
| isolation between teams | high |
| ANSI-style RBAC | high |
| Spark + Trino + Snowflake reads | high |
| WAP or multi-table atomic | low |
| branches / tags | low |
Code.
Decision: Polaris
Justification by axis:
engine fit — strong; Spark + Trino + Snowflake all speak Polaris's REST surface
multi-tenancy — killer; one catalog per team + a shared dim catalog
versioning — sufficient; snapshot-level rollback is acceptable, no WAP needed
governance — strong; ANSI-style GRANT / REVOKE, per-catalog billing, vended creds
Step-by-step explanation.
- Multi-tenancy is the highest priority and Polaris's catalogs / principals / roles / grants model is the cleanest fit. Per-tenant catalogs give per-tenant billing and per-tenant blast-radius isolation for free.
- ANSI-style RBAC is required because the security team wants
GRANT SELECT ON ...audit trails. Polaris ships exactly that surface; Nessie's RBAC is more basic. - WAP and multi-table atomicity are low-priority — the ETL team is happy with snapshot-level rollback and single-table commits. Nessie's killer features are wasted on this team.
- Engine fit is uniform across all three options for the engines listed.
Output.
| Axis | Polaris | Nessie | REST (plain) |
|---|---|---|---|
| Multi-tenancy | win | partial | partial |
| Governance | win | basic | partial |
| Versioning | OK | overkill | OK |
| Engine fit | tied | tied | tied |
Rule of thumb. Multi-tenancy + RBAC → Polaris. If the requirement list opens with "isolate N teams on one lake," Polaris is the default.
Worked example — heavy ETL team needing WAP
Detailed explanation. A data team runs 200 ETL jobs per day. The CFO has asked for bulletproof rollback semantics after a previous quarter's reporting blew up due to a half-applied ETL. The data team owns one catalog; there is no multi-tenancy requirement.
Question. Pick a catalog and justify the choice on each of the four axes.
Input.
| Requirement | Priority |
|---|---|
| WAP per ETL job | high |
| multi-table atomic commits | high |
| instant rollback (no data movement) | high |
| single-team usage | high |
| ANSI-style RBAC | low |
| 12+ tenants | low |
Code.
Decision: Nessie
Justification by axis:
engine fit — strong; Spark + Flink + Trino all speak Nessie + REST shim
multi-tenancy — basic; not needed for one team
versioning — killer; branches + tags + multi-table atomic commits
governance — basic RBAC, sufficient for one team
Step-by-step explanation.
- Versioning is the highest priority. Nessie's branch / tag / merge / multi-table atomic commit model is the only open catalog that ships this fully.
- Multi-tenancy is irrelevant — one team, no per-tenant billing required.
- RBAC requirements are basic — Nessie's branch-level access rules are sufficient for one team.
- Engine fit is uniform; Spark / Flink / Trino all work fine.
Output.
| Axis | Polaris | Nessie | REST (plain) |
|---|---|---|---|
| Versioning | snapshot only | win | snapshot only |
| Multi-table atomic | no | win | no |
| Multi-tenancy | overkill | OK | OK |
| Governance | overkill | OK | OK |
Rule of thumb. WAP or multi-table atomic → Nessie. If the requirement list opens with "we can never have a half-applied ETL," Nessie is the default.
Worked example — small team standardising on Spark + Trino
Detailed explanation. A small data team standardises on Spark for ETL and Trino for analytics. There is one tenant, no WAP requirement, and they want the lowest-operational-cost catalog they can get. AWS-native shop.
Question. Pick a catalog and justify the choice on each of the four axes.
Input.
| Requirement | Priority |
|---|---|
| lowest operational cost | high |
| Spark + Trino read/write | high |
| AWS-native | high |
| 1 tenant | high |
| WAP | low |
| multi-tenancy | low |
Code.
Decision: AWS Glue REST mode (vanilla REST)
Justification by axis:
engine fit — strong; Spark + Trino out of the box
multi-tenancy — N/A; single tenant
versioning — snapshot only, acceptable
governance — IAM-backed; vended creds via Glue + STS
Step-by-step explanation.
- Lowest operational cost is the highest priority. Glue is fully managed by AWS; there is nothing to run.
- Single tenant means Polaris's multi-tenant model is overkill.
- No WAP requirement means Nessie's git features are unused.
- AWS-native shop already trusts IAM for everything else; Glue's IAM-backed authorisation aligns with that.
Output.
| Axis | Polaris | Nessie | Glue REST |
|---|---|---|---|
| Ops cost | medium | medium | win (managed) |
| Engine fit | tied | tied | tied |
| Multi-tenancy | overkill | OK | OK |
| Versioning | OK | overkill | OK |
Rule of thumb. If the team is one tenant, AWS-native, and snapshot-level rollback is enough, Glue REST mode is the cheapest path. Polaris is overkill; Nessie's features are wasted.
Senior engineering interview question on catalog decision
A senior interviewer might frame this as: "You inherit a Hive-metastore-based lakehouse with 4 teams and 200 tables. The CTO wants you to pick a target catalog this quarter. Walk me through the decision matrix and your final recommendation."
Solution Using a tiered axis-first decision
Decision rubric — score each axis against the platform's needs
1. ENGINE FIT
- Spark, Trino, Snowflake (read-only) → all speak REST → tied
- Pick: any REST-spec catalog
2. MULTI-TENANCY
- 4 teams + 1 shared dim → per-tenant catalogs + cross-tenant SELECT
- Pick: Polaris (catalogs as tenants)
3. VERSIONING
- WAP per ETL job desired but not yet mandated
- Nice-to-have: branches / tags
- Pick: Polaris (snapshot rollback) — escalate to Nessie federation if
mandate arrives
4. GOVERNANCE
- Security team wants ANSI-style GRANT / REVOKE
- Pick: Polaris
FINAL DECISION: Polaris on the control plane, with the option to layer
Nessie behind a subset of catalogs if WAP becomes a hard requirement.
Hive metastore migrates table-by-table via Iceberg's snapshot import.
Step-by-step trace.
| Axis | Need | REST plain | Nessie | Polaris |
|---|---|---|---|---|
| engine fit | Spark / Trino / Snowflake | tied | tied | tied |
| multi-tenancy | 4 teams + shared dim | partial | basic | win |
| versioning | WAP nice-to-have | snapshot | overkill | snapshot |
| governance | ANSI RBAC | partial | basic | win |
| ops cost | run one service | low | medium | medium |
| final | — | OK | mismatched priority | best fit |
Output:
| Axis | Winner | Rationale |
|---|---|---|
| Multi-tenancy | Polaris | catalogs-as-tenants gives clean isolation + billing |
| Governance | Polaris | ANSI-style GRANT / REVOKE matches security team's mental model |
| Versioning | Polaris | snapshot rollback is enough today; WAP can be added later |
| Engine fit | tied | REST spec means every engine works |
| Final | Polaris | wins on the axes that bite |
Why this works — concept by concept:
- engine-first axis — start by enumerating every engine that will touch the lake. If they all speak REST, every modern catalog is on the table; if one only speaks HMS, you have a migration problem regardless of catalog choice.
- tenant-first axis — count the teams + the cost-allocation requirement. More than 2 tenants + per-tenant billing → Polaris. One tenant → Glue / Tabular is enough.
- versioning-first axis — if WAP is contractual (regulatory, audit, end-of-month reconciliation), Nessie wins. If "snapshot rollback within 24h" is enough, the snapshot-only catalogs are fine.
- governance-first axis — does the security team want ANSI GRANT / REVOKE audit trails? Polaris ships them. Does the security team only care about IAM? Glue is enough.
- hybrid escape hatch — even after picking a primary, you can layer Nessie behind a subset of catalogs via federation if WAP becomes mandatory later. The decision is reversible because REST is the lingua franca.
-
Cost —
O(4)axes to score,O(3)catalogs to score, plus one operational-cost tie-breaker. The whole decision fits on one whiteboard and on one slide.
SQL
Topic — dimensional-modeling
Lakehouse modelling problems (SQL)
Catalog cheat sheet — REST, Nessie, Polaris recipes
-
List namespaces over HTTP.
GET /v1/{prefix}/namespaceswithAuthorization: Bearer <token>. Same call from curl, Spark, Trino, DuckDB. -
Load a table + get vended creds.
GET /v1/{prefix}/namespaces/{ns}/tables/{tbl}. Response includesmetadata,metadata-location, and acredentialsblock. -
Atomic commit.
POST /v1/{prefix}/namespaces/{ns}/tables/{tbl}/commitswithrequirements: [assertTableUUID, assertRefSnapshotId]andupdates: [add-snapshot, set-current-snapshot]. CAS failure returns409. -
OAuth2 token exchange.
POST /v1/oauth/tokenswithgrant_type=client_credentials. Bearer expires in 1h; refresh by re-posting. -
Nessie WAP.
CREATE BRANCH dev FROM main→ write → audit ondev→MERGE BRANCH dev INTO main. Drop on failure forO(1)rollback. -
Nessie tag.
CREATE TAG release_YYYY_MM_DD IN nessie FROM main. Immutable; protect with access rules. -
Nessie rollback.
ASSIGN BRANCH main IN nessie TO TAG release_YYYY_MM_DD.O(1)metadata write. - Nessie multi-table commit. Multiple INSERTs on one branch + one MERGE. Either all tables advance or none do.
-
Polaris tenant onboarding.
CREATE CATALOG t_team+CREATE ROLE team_owner+GRANT MANAGE ON CATALOG t_team TO ROLE team_owner+CREATE PRINCIPAL svc_team_prod+GRANT ROLE team_owner TO PRINCIPAL svc_team_prod. -
Polaris cross-catalog grant.
GRANT SELECT ON ALL TABLES IN CATALOG dim TO ROLE team_owner. Lets the team read shared dims without copies. -
Glue REST mode. Set
iceberg.rest-catalog.uri=https://glue.{region}.amazonaws.com/icebergand use IAM SigV4 auth. No re-import required. - Pick the catalog by the axis that bites. Multi-tenancy → Polaris. WAP / multi-table → Nessie. Single team + AWS-native → Glue REST.
Frequently asked questions
What is the Iceberg REST catalog?
The Iceberg REST catalog is an OpenAPI-3 specification published alongside the Apache Iceberg project that defines the HTTP surface for an Iceberg-aware metadata store. It standardises endpoints for createNamespace, listTables, loadTable, commitTable (with optimistic compare-and-set), vendCredentials, and OAuth2 token exchange. Every modern catalog — vanilla Tabular, AWS Glue REST mode, Snowflake's Polaris, Project Nessie 0.80+, Databricks Unity — implements this spec, and every modern engine (Spark, Trino, Flink, DuckDB, ClickHouse, Snowflake, BigQuery) speaks it as a client. The point of the spec is that picking a catalog becomes a configuration change rather than a re-import.
How is Project Nessie different from a vanilla REST catalog?
Nessie adds git semantics on top of the Iceberg catalog surface. The four primitives — branches, tags, atomic multi-table commits, and merge / cherry-pick — turn the catalog into a version-controlled reference store rather than a snapshot log. A Nessie commit can advance many tables at once (multi-table atomic), which a vanilla REST catalog cannot, and any reference can be reassigned in O(1) for instant rollback. Nessie 0.80+ also speaks the Iceberg REST spec, so the same server can serve git-style clients (Spark + Nessie SQL, Dremio Sonar) and vanilla REST clients (Trino, DuckDB, Snowflake) simultaneously.
Is Apache Polaris open source?
Yes. Apache Polaris is an Apache Software Foundation incubator project, donated by Snowflake in mid-2024 and now governed by the ASF rather than by a single vendor. The server is Java, the metadata store is Postgres, and the wire protocol is the standard Iceberg REST spec. Snowflake offers a fully managed Polaris service for customers who do not want to run it themselves, but self-hosted Polaris is identical to the managed version. Community drivers from Spark, Trino, Flink, and Dremio all use the open REST spec — there is no Polaris-specific protocol underneath.
Can one engine talk to multiple catalogs?
Yes — every modern engine supports multiple catalog registrations and can join across them. Spark uses session-level catalog configs (spark.sql.catalog.prod, spark.sql.catalog.dim, etc.). Trino registers one connector instance per catalog. Flink supports CREATE CATALOG for each. The engine then resolves prod.sales.orders and dim.customer against their respective catalogs and joins them in a single query. This is how the hybrid pattern works in practice: register Polaris and Nessie side-by-side, route tenant tables through Polaris and WAP-critical tables through Nessie, and let the engine join across them transparently.
Do I still need a Hive metastore?
In 2026, no — for new lakehouse builds. Every major engine treats HMS as a legacy connector and ships an Iceberg REST connector as the default. HMS lacks atomic CAS on the current snapshot pointer and cannot vend short-lived credentials, so it fails the two foundational tests of a real lakehouse catalog. Existing HMS deployments can migrate table-by-table via Iceberg's snapshot import: register the HMS table in the REST catalog (Polaris / Nessie / Glue REST), point engines at the new catalog name, and decommission the HMS row once readers have moved.
Which catalog should I pick for a multi-tenant analytics platform?
Apache Polaris — its catalogs / principals / roles / grants model is purpose-built for multi-tenant Iceberg. Onboarding a new team is CREATE CATALOG t_team + CREATE ROLE team_owner + GRANT MANAGE ON CATALOG t_team TO ROLE team_owner + CREATE PRINCIPAL svc_team_prod + GRANT ROLE team_owner TO PRINCIPAL svc_team_prod. Cross-tenant sharing is one GRANT SELECT ON ALL TABLES IN CATALOG dim per tenant. Per-tenant billing is trivially clean because each catalog has its own S3 prefix. Vended credentials kill the long-lived-IAM-key blast-radius problem. If you also need WAP or multi-table atomic commits, layer Nessie behind a subset of catalogs via federation — the REST spec makes this hybrid feasible.
Practice on PipeCode
- Drill the dimensional modelling practice library → for the fact / dimension shapes that catalogs serve.
- Rehearse on ETL design problems → for WAP, atomic commits, and the cross-engine ingest patterns.
- Sharpen system design drills → for multi-tenant catalog architecture.
- Layer the event modelling library → for the streaming sinks that Flink writes into Iceberg.
- Stack the slowly changing data practice → for the SCD-2 patterns that benefit most from multi-table atomic commits.
- For the broader surface, read top data engineering interview questions →.
- Stack the prerequisites with the only 5 skills you need to become a data engineer →.
- Sharpen the design axis with the ETL system design for data engineering interviews course →.
- For long-form schema craft, work through data modelling for DE interviews →.
Pipecode.ai is Leetcode for Data Engineering — every catalog pattern above ships with hands-on practice rooms where you write the multi-table SCD-2, the WAP branch, and the per-tenant Polaris GRANT against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your Iceberg REST catalog answer actually holds up under a senior interviewer's "what happens if two writers race?" probe.





Top comments (0)