databricks unity catalog looks, to a junior reading the marketing page, like "the new metastore." That undersells it by roughly an order of magnitude. UC is the first governance layer in the Databricks ecosystem that treats identity, metadata, lineage, audit, and external sharing as one problem solved at the account layer rather than five problems solved per-workspace, per-cluster, per-notebook. The result is the difference between writing a single ABAC policy that protects every PII column in every catalog and writing the same table-ACL three times — once for each cluster's local Hive metastore — and hoping the team that spins up workspace number four remembers to copy it.
This guide is the one you wish existed the day your security team handed you a "who can read which column" matrix and a deadline. It walks through the unity catalog metastore mental model, the three-level namespace (catalog.schema.table) that replaces the legacy database.table form, automatic data lineage at both table and column grain, the four security primitives (GRANT / REVOKE, row-level security filters, column masks), and the delta sharing topology that lets a producer expose a vetted "data products" catalog without copying bytes. Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.
When you want hands-on reps the moment you finish reading, drill the SQL practice library for data engineers →, rehearse on joins problems →, and stack the metadata-aware muscles with database design problems →.
On this page
- Why Unity Catalog exists — the pre-UC governance gap
- The three-level namespace + metastore architecture
- Automatic data lineage — table-level and column-level
- Row filters, column masks + GRANT/REVOKE
- Delta Sharing topology + production rollout
- Cheat sheet — Unity Catalog recipes
- Frequently asked questions
- Practice on PipeCode
1. Why Unity Catalog exists — the pre-UC governance gap
The legacy Hive metastore was workspace-scoped — that single design decision created every "who can read what" headache UC was built to delete
The one-sentence invariant: before Unity Catalog, every Databricks workspace shipped with its own Hive metastore, its own table ACLs, and its own copy of every identity the security team had to maintain — and there was no way to ask "what does the Finance group see in all workspaces?" without a manual cross-tabulation. Once you internalise that the legacy world's governance unit was the workspace, the entire UC architecture becomes obvious: pull every identity-and-metadata concern up to the account, leave the compute concerns down at the workspace, and write each policy exactly once.
The five things UC unifies.
-
Identity. SCIM users and groups live at the account level, not per workspace. One source of truth for "who is a
pii_reader." - Metastore. One regional metastore replaces N per-workspace Hive metastores. Tables register once, not once per workspace.
- Lineage. UC captures table-level and column-level lineage automatically across notebooks, jobs, dashboards, and DLT — no instrumentation.
-
Audit. Every
GRANT,REVOKE, query, share creation, and policy evaluation lands in thesystem.access.auditsystem table. - Sharing. Delta Sharing — an open REST + parquet protocol — lets you expose tables and volumes to internal teams or external recipients without copying bytes.
Hive metastore vs Unity Catalog in five contrasts.
| Concern | Hive metastore (legacy) | Unity Catalog |
|---|---|---|
| Scope | One per workspace | One per region, many workspaces |
| Identity | Workspace-local groups | Account-level SCIM |
| Naming |
database.table (2 parts) |
catalog.schema.table (3 parts) |
| ACLs | Per-cluster, table-grain | Account-wide, table + row + column |
| External storage | Notebook-mounted with secrets |
EXTERNAL LOCATION + STORAGE CREDENTIAL
|
Why "account-level" matters.
-
One ACL surface. A
GRANT SELECT ON CATALOG mainto theanalystsgroup applies in every workspace attached to the metastore. No copy-paste between admin consoles. - Identity federation. SCIM-sync from Okta / Entra / Workday once at the account, propagate everywhere. Removing a leaver becomes a one-step revoke.
- Cross-workspace queries. A BI workspace can read prod tables registered in a different workspace, because the metastore — not the workspace — owns the table.
-
Audit consolidation.
system.access.auditcollects events from every workspace into one log, queryable like any Delta table.
Cost / latency story.
- Metadata only. UC does not sit on the query hot path. Photon and DBSQL still scan parquet from S3/ADLS/GCS directly; UC's role is to authorize the open, not to broker the bytes.
- No extra hop. The credential-vending call happens at job start, not per row. After authorization, the cluster has a temporary credential and reads at parquet speed.
- The free upgrade. Migration to UC is one of the few "more capability, same cost" moves in modern data platforms. The price is engineering time, not new infrastructure.
What interviewers listen for.
- Do you say "account-level identity, regional metastore, workspace-local compute" when asked to draw UC? — senior signal.
- Do you mention
system.access.auditandsystem.access.table_lineagewhen asked how to audit a metric? — required answer. - Do you call the legacy thing "Hive metastore" and not just "the old metastore"? — required answer.
- Do you mention that UC needs an external location + storage credential for managed-table writes? — senior signal.
Worked example — the four-workspace governance headache UC deletes
Detailed explanation. A financial services platform runs four Databricks workspaces — prod-east, prod-west, analytics, and sandbox. In the legacy world every workspace had its own Hive metastore, so the pii_readers group had to be created four times, the customer_pii table ACL granted four times, and rotated four times whenever the audit cycle demanded it. After UC, every concern shifts up to the account.
Question. Given the four-workspace topology, how many places does the security team need to maintain the pii_readers group, the customer_pii SELECT grant, and the rotation of access tokens — before UC and after UC?
Input.
| Concern | Pre-UC (per workspace) | UC (per metastore) |
|---|---|---|
Group pii_readers
|
4 copies | 1 copy |
GRANT SELECT ON customer_pii |
4 grants | 1 grant |
| Token rotation for ETL service | 4 rotations | 1 rotation |
Audit log for SELECT customer_pii
|
4 separate logs | 1 system.access.audit table |
Code.
-- Pre-UC — repeated in every workspace's local Hive metastore
GRANT SELECT ON TABLE finance.customer_pii TO `pii_readers`;
-- ... copy to prod-west, analytics, sandbox
-- UC — written once at the account-attached metastore
USE CATALOG main;
GRANT USAGE ON CATALOG main TO `pii_readers`;
GRANT USAGE ON SCHEMA main.finance TO `pii_readers`;
GRANT SELECT ON TABLE main.finance.customer_pii TO `pii_readers`;
Step-by-step explanation.
- In the legacy world the metastore lives inside each workspace, so any grant only authorises that workspace's clusters. Four workspaces means four GRANTs and four future REVOKEs.
- UC promotes the metastore to the regional account layer. A single
GRANTreaches every workspace attached to the metastore — including new workspaces created tomorrow. - UC also requires the
USAGEchain: the principal needsUSAGEon the catalog and the schema before theSELECTon the table will resolve. This chain replaces the legacy "workspace ACL + table ACL" model. - Removing the
pii_readersgroup is one SCIM action at the account; in the legacy world it was four workspace edits plus a check that nobody left a stale cluster-level ACL behind.
Output.
| Phase | Group copies | Grant copies | Audit logs |
|---|---|---|---|
| Pre-UC (4 workspaces) | 4 | 4 | 4 |
| UC (1 metastore) | 1 | 1 | 1 |
Rule of thumb. If you are still defending workspace-scoped grants in 2026, draw the four-workspace cost table to whoever owns the migration backlog. The savings are literally O(workspaces) per policy edit.
Worked example — discovering what is workspace-scoped before migration
Detailed explanation. Before turning UC on, every team needs an inventory of what is still living in the legacy Hive metastore. The standard query reads from hive_metastore (the alias UC exposes for the legacy metastore) and counts tables by database. That output drives the migration backlog — which databases get a SYNC upgrade, which need a manual rebuild, which can be retired.
Question. Inventory every legacy database, count its tables, and flag any that have managed tables (which need data movement vs reference-only migration).
Input. A typical legacy workspace has 3–10 Hive databases, each with mixed managed and external tables.
| database | table_type | row_count |
|---|---|---|
| finance | MANAGED | 1.2M |
| finance | EXTERNAL | 8.0M |
| sandbox | MANAGED | 200 |
| sandbox | MANAGED | 50 |
| legacy_etl | EXTERNAL | 12.5M |
Code.
SELECT
database_name,
COUNT(*) AS total_tables,
COUNT(*) FILTER (WHERE table_type = 'MANAGED') AS managed_tables,
COUNT(*) FILTER (WHERE table_type = 'EXTERNAL') AS external_tables,
CASE
WHEN COUNT(*) FILTER (WHERE table_type = 'MANAGED') > 0
THEN 'needs data movement'
ELSE 'reference only — sync metadata'
END AS migration_path
FROM hive_metastore.information_schema.tables
WHERE database_name NOT IN ('default', 'information_schema')
GROUP BY database_name
ORDER BY total_tables DESC;
Step-by-step explanation.
- The query reads from
hive_metastore.information_schema.tables— UC surfaces the legacy metastore as if it were a normal catalog. That means every legacy table is queryable through 3LN ashive_metastore.db.table. -
COUNT(*) FILTER (WHERE ...)is ANSI-standard conditional counting; it lets us tally managed and external tables in a single pass. - The
CASEflags the migration path. Managed tables in the Hive metastore live in DBFS — UC requires them to be re-homed to a UC external location. External tables only need a metadata sync because the bytes already live in your bucket. - Anything in
defaultorinformation_schemais metadata noise — filter it out so the migration backlog reflects real engineering work.
Output.
| database_name | total_tables | managed_tables | external_tables | migration_path |
|---|---|---|---|---|
| legacy_etl | 12 | 0 | 12 | reference only — sync metadata |
| finance | 7 | 2 | 5 | needs data movement |
| sandbox | 4 | 4 | 0 | needs data movement |
Rule of thumb. Run the inventory before drawing the catalog tree. The output tells you which databases become UC schemas in main, which become their own dev/sandbox catalogs, and which get retired entirely. Skipping this step is the #1 reason migrations stall.
SQL interview question on legacy-vs-UC governance modelling
A senior interviewer often opens with: "Your team owns three Databricks workspaces. The CISO asks 'who can read the customer_pii table in every workspace today, and how would you change that answer to one query under Unity Catalog?' Walk me through the SQL and the architectural shift."
Solution Using account-level GRANT + system.access.audit cross-check
-- Pre-UC — three queries, one per workspace's local metastore
-- (executed sequentially in each workspace)
SHOW GRANTS ON TABLE finance.customer_pii;
-- UC — one query reaches every workspace because the metastore
-- and the audit table are account-scoped.
SELECT
principal,
privilege_type,
object_type,
object_full_name,
granted_at
FROM system.information_schema.table_privileges
WHERE object_full_name = 'main.finance.customer_pii'
AND privilege_type = 'SELECT'
ORDER BY granted_at DESC;
-- Cross-check against actual reads from system.access.audit
SELECT
user_identity.email AS user_email,
COUNT(*) AS read_count,
MAX(event_time) AS last_read_at
FROM system.access.audit
WHERE service_name = 'unityCatalog'
AND action_name = 'getTable'
AND request_params.full_name_arg = 'main.finance.customer_pii'
AND event_time >= CURRENT_DATE - INTERVAL 30 DAYS
GROUP BY user_identity.email
ORDER BY read_count DESC;
Step-by-step trace.
| Step | Pre-UC | UC |
|---|---|---|
| 1. List grants | run SHOW GRANTS once per workspace | one information_schema query |
| 2. Collect into spreadsheet | manual cross-tab in Excel | already cross-tab |
| 3. Cross-check vs actual reads | scrape per-workspace audit logs | join with system.access.audit
|
| 4. Produce report | days | minutes |
The two UC queries form the canonical "policy intent vs policy reality" check. The first lists who is allowed to read; the second lists who actually read. The gap between the two is the principle-of-least-privilege backlog.
Output:
| Metric | Pre-UC | UC |
|---|---|---|
| Queries required | 3 (one per workspace) | 1 (account-wide) |
| Time to report | hours | minutes |
| Stale ACL detection | manual |
LEFT ANTI JOIN between privileges and audit |
| Audit window | per-workspace, variable | account, 365 days |
Why this works — concept by concept:
- One metastore per region — every UC catalog is reachable from every workspace attached to the metastore, so a single grant covers all of them. The unit of isolation moves from workspace to catalog.
-
Account-level SCIM —
pii_readersis a single group object that SCIM-syncs from your IdP. Adding or removing a member is one action that propagates to every catalog the group is granted on. -
system.access.auditis itself a Delta table — you can JOIN it toinformation_schema.table_privilegesto spot "granted but never used" permissions and revoke them. That is the audit pattern your CISO is actually asking for. -
USAGE chain is non-negotiable — even an account-admin needs
USAGEon the catalog and schema beforeSELECTon the table resolves. New engineers hit this on day one; document it in the onboarding runbook. - Cost — one extra metadata fetch at job start (O(1)); zero ongoing cost. The audit query is a normal Delta scan — bring your own DBSQL warehouse and pay for one query.
SQL
Topic — SQL fundamentals
SQL practice for data engineers
SQL
Topic — design
Database design problems (governance, ACLs)
2. The three-level namespace + metastore architecture
catalog.schema.table is not just longer naming — it is the unit of isolation, delegation, and policy attachment
The mental model in one line: the legacy database.table form has one boundary; UC's catalog.schema.table form has two — and that extra boundary is where every modern governance pattern (per-LOB isolation, per-environment promotion, certified data products) lives. Once you say "the catalog is the isolation unit; the schema is the domain unit," every UC design question becomes a tree-drawing exercise.
The five tiers, top down.
- Account. One per organisation. Owns SCIM identities, audit logs, billing, and the metastores. The CFO of your data estate.
- Metastore. One per cloud region. Attached to N workspaces in that region. Owns catalogs and the external-storage credentials.
-
Catalog. The unit of isolation. Typical names:
main,dev,sandbox,finance,growth. A catalog can hold a single LOB or a whole environment. - Schema. Logical grouping inside a catalog. Maps to a domain (sales, hr, marketing) or a product area (loyalty, fraud, attribution).
- Object. What lives in a schema: tables, views, materialized views, volumes (governed files — a UC-managed wrapper for unstructured bytes), functions (SQL UDFs), models.
External locations + storage credentials.
-
STORAGE CREDENTIAL— a UC object that wraps an IAM role / managed identity / service account. The cloud-side principal that the cluster assumes to read your bucket. -
EXTERNAL LOCATION— a UC object that maps a URL prefix (e.g.s3://acme-prod-bronze/) to a storage credential and a set of access grants. -
The pattern. Notebooks never carry secrets. They reference
EXTERNAL LOCATION raw_zoneand UC vends a short-lived credential to the cluster at job start. -
READ FILES,WRITE FILES,CREATE EXTERNAL TABLE— the three privileges on an external location that determine what the principal can do under that prefix.
Foreign catalogs — Lakehouse Federation.
- What it is. A UC catalog whose backing store is not Delta — it is Snowflake, BigQuery, Postgres, MySQL, Redshift, or SQL Server.
-
What you get. Three-part naming and GRANT semantics over the foreign engine —
SELECT * FROM snowflake_prod.public.ordersfrom a Databricks notebook resolves to a pushdown query in Snowflake. - What you do not get. Write-back to the foreign system (currently read-only for most connectors), and lineage at column grain depends on the connector.
- The interview probe. "How would you join a Delta table to a Snowflake table without copying either?" — Lakehouse Federation foreign catalog + a normal SQL JOIN.
Two CATALOG flavours.
- Standard catalog. Backed by managed Delta tables in a UC-managed storage location. The default for new catalogs.
-
Foreign catalog (federation). Backed by an external engine. Defined with
CREATE FOREIGN CATALOG ... USING CONNECTION ....
Naming tradeoffs.
-
mainas the prod catalog — short, ANSI-friendly, easy to type. Default for greenfield deployments. -
prod/dev/sandboxper environment — clear environment isolation; promotion is "rename or copy across catalogs." -
Per-LOB catalogs (
finance,growth,infra) — clear ownership boundary; group grants land at the catalog root. The pattern most large enterprises end up with after a year.
Common interview probes on the namespace.
- "Why does Postgres / dbt always need a schema and a database?" — because the ANSI standard 3LN was always
catalog.schema.table; legacy engines collapsed the first two. - "Can two catalogs in the same metastore share a schema name?" — yes.
main.salesanddev.salesare distinct schemas. - "What is a volume?" — a UC-governed wrapper for unstructured data (PDFs, images, logs). You GRANT
READ VOLUMEthe same way you GRANTSELECT. - "What is the difference between a managed and an external Delta table?" — managed tables live in the catalog's managed storage location and UC owns their lifecycle; external tables live in a path you control and UC only owns the metadata.
Worked example — the GRANT chain you must walk to read a table
Detailed explanation. A new analyst joins the growth team. They try SELECT * FROM main.marketing.touches and get PERMISSION_DENIED on TABLE. The bug is not the SELECT grant — that was already in place. The missing piece is the USAGE chain: UC requires the principal to have USAGE on the catalog and USAGE on the schema before any SELECT resolves. This is the most common day-one ticket in any UC rollout.
Question. Given the growth_analysts group, write the three GRANT statements that allow them to SELECT * FROM main.marketing.touches. Explain why a single SELECT grant is insufficient.
Input.
| Principal | Object | Current grants |
|---|---|---|
growth_analysts |
main (catalog) |
— |
growth_analysts |
main.marketing (schema) |
— |
growth_analysts |
main.marketing.touches (table) |
SELECT (granted yesterday) |
Code.
-- The three-step USAGE chain
GRANT USAGE ON CATALOG main TO `growth_analysts`;
GRANT USAGE ON SCHEMA main.marketing TO `growth_analysts`;
GRANT SELECT ON TABLE main.marketing.touches TO `growth_analysts`;
-- Verify in one query
SELECT grantee, privilege_type, table_catalog, table_schema, table_name
FROM system.information_schema.table_privileges
WHERE grantee = 'growth_analysts';
Step-by-step explanation.
- UC's authorization check walks the 3LN tree top-down: catalog → schema → table. At each level the principal must have at least
USAGE(the "I can see this exists" privilege). - Without
USAGE ON CATALOG main, the planner cannot even resolve the namemain— the error is "catalog not found" or "permission denied at catalog level." - Without
USAGE ON SCHEMA main.marketing, the planner cannot list tables in the schema — same error at the schema level. - Only after both USAGEs are in place does the
SELECTgrant become operative. The table grant on its own is necessary but not sufficient. - The verification query reads from
information_schema.table_privileges, which is the canonical UC view for "show me every privilege this principal holds."
Output.
| grantee | privilege_type | table_catalog | table_schema | table_name |
|---|---|---|---|---|
| growth_analysts | SELECT | main | marketing | touches |
Rule of thumb. Bake the USAGE chain into your provisioning automation. Whenever a group gets a new table grant, the script must also issue (or confirm) the parent catalog and schema USAGE grants. Eight out of ten "UC is broken" tickets dissolve once this is automated.
Worked example — registering an external location for a multi-team raw zone
Detailed explanation. A platform team owns an S3 prefix s3://acme-prod-bronze/ that holds raw landings from twelve ingest pipelines. Pre-UC, every pipeline mounted the bucket and embedded an IAM access key in cluster-init. Post-UC, the platform team registers the prefix once as an EXTERNAL LOCATION bound to a STORAGE CREDENTIAL, then grants WRITE FILES to the ingest groups. Notebooks never see a secret again.
Question. Register the credential, the external location, and the WRITE FILES grant for the ingest_pipelines group. Show the verification query.
Input.
| Object | Cloud-side | UC-side |
|---|---|---|
| IAM role | arn:aws:iam::123456789012:role/databricks-bronze-writer |
STORAGE CREDENTIAL cred_bronze_writer |
| S3 prefix | s3://acme-prod-bronze/ |
EXTERNAL LOCATION raw_zone |
| Principal | n/a |
ingest_pipelines group |
Code.
-- 1) Bind the cloud role to a UC storage credential
CREATE STORAGE CREDENTIAL cred_bronze_writer
WITH AWS_IAM_ROLE 'arn:aws:iam::123456789012:role/databricks-bronze-writer'
COMMENT 'Bronze-zone writer role for ingest pipelines';
-- 2) Bind a URL prefix to that credential
CREATE EXTERNAL LOCATION raw_zone
URL 's3://acme-prod-bronze/'
WITH (STORAGE CREDENTIAL cred_bronze_writer)
COMMENT 'Multi-team bronze landing zone';
-- 3) Grant write to the ingest group
GRANT READ FILES, WRITE FILES, CREATE EXTERNAL TABLE
ON EXTERNAL LOCATION raw_zone
TO `ingest_pipelines`;
-- 4) Verify
SELECT name, url, credential_name, owner, created_at
FROM system.information_schema.external_locations
WHERE name = 'raw_zone';
Step-by-step explanation.
- The
STORAGE CREDENTIALis the UC wrapper for the IAM role. It is a single object; many external locations can reference the same credential. - The
EXTERNAL LOCATIONis the URL-prefix → credential binding. UC checks that no two external locations overlap —s3://acme-prod-bronze/ands3://acme-prod-bronze/team-a/cannot both exist. - The grant chain on an external location is
READ FILES/WRITE FILES/CREATE EXTERNAL TABLE— these are URL-prefix privileges, not table privileges. - After the grant, an ingest notebook can run
COPY INTO main.raw.events FROM 's3://acme-prod-bronze/events/'and UC will vend a short-lived credential bound tocred_bronze_writer. No secret hits the notebook. - The verification query confirms the location is registered, owned by the right group, and bound to the expected credential.
Output.
| name | url | credential_name | owner | created_at |
|---|---|---|---|---|
| raw_zone | s3://acme-prod-bronze/ | cred_bronze_writer | data_platform | 2026-06-06 09:14:00 |
Rule of thumb. One credential per cloud principal (role / managed identity / service account); one external location per logical zone (bronze, silver, gold, archive). Resist the urge to create one external location per bucket prefix per team — it creates a thicket UC has to keep coherent.
Worked example — federating a Snowflake catalog as a foreign catalog
Detailed explanation. A team has a Snowflake warehouse holding production CRM data. They do not want to copy it into Delta, but they need to JOIN it to event tables in main.events. Lakehouse Federation lets them register the Snowflake account as a CONNECTION, then create a FOREIGN CATALOG that surfaces every Snowflake schema and table as if it were native Delta — three-part naming and all.
Question. Register the Snowflake connection, create the foreign catalog, and write a JOIN that combines snowflake_crm.public.contacts with main.events.signups.
Input.
| Component | Value |
|---|---|
| Snowflake host | acme.snowflakecomputing.com |
| Snowflake user | db_federation_user |
| Snowflake warehouse | COMPUTE_WH |
| Foreign catalog name | snowflake_crm |
| Delta catalog | main |
Code.
-- 1) Create a UC connection to Snowflake
CREATE CONNECTION snowflake_prod TYPE SNOWFLAKE
OPTIONS (
host 'acme.snowflakecomputing.com',
port '443',
sfWarehouse 'COMPUTE_WH',
user secret('snowflake', 'user'),
password secret('snowflake', 'password')
);
-- 2) Create the foreign catalog
CREATE FOREIGN CATALOG snowflake_crm
USING CONNECTION snowflake_prod
OPTIONS (database 'CRM_PROD');
-- 3) Grant USAGE for analysts
GRANT USAGE ON CATALOG snowflake_crm TO `growth_analysts`;
GRANT USAGE ON SCHEMA snowflake_crm.public TO `growth_analysts`;
-- 4) The query — joins across two engines, no copy
SELECT
c.contact_id,
c.email,
s.signup_at
FROM snowflake_crm.public.contacts c
JOIN main.events.signups s ON s.email = c.email
WHERE s.signup_at >= CURRENT_DATE - INTERVAL 7 DAYS;
Step-by-step explanation.
- The
CONNECTIONobject captures the how to reach Snowflake — host, warehouse, credential. Secrets live in the Databricks secret scope, not inline. - The
FOREIGN CATALOGis the what to expose — one Snowflake database becomes one UC catalog. Schemas and tables auto-discover. - The
USAGEchain is identical to native catalogs. The analyst experience is "another catalog in the explorer" — no special Snowflake awareness required. - The JOIN plan pushes down the contact-side predicates to Snowflake, fetches only the necessary columns into Databricks, and finishes the JOIN locally. The optimiser owns the split.
- Lineage captures the foreign catalog as a source — you can see "this gold table reads from Snowflake" in the Catalog Explorer graph.
Output.
| contact_id | signup_at | |
|---|---|---|
| 8821 | alice@example.com | 2026-06-04 14:22:00 |
| 8822 | bob@example.com | 2026-06-05 09:01:00 |
Rule of thumb. Federate before you copy. If the data lives somewhere already, a foreign catalog is one DDL away and gives you correctness, lineage, and grants for free. Only physically copy when latency, cost, or transformation complexity makes federation unworkable.
SQL interview question on namespace design
A senior interviewer might frame this as: "Your platform team is launching UC on a region with three lines of business (finance, growth, infra), four environments (prod, stage, dev, sandbox), and a 'data products' notion. Sketch the catalog tree and the GRANT model."
Solution Using a {env}_{lob} catalog matrix + a certified dataproducts catalog
-- The catalog matrix — twelve catalogs for the {env, lob} cross-product
CREATE CATALOG prod_finance;
CREATE CATALOG prod_growth;
CREATE CATALOG prod_infra;
CREATE CATALOG stage_finance;
CREATE CATALOG stage_growth;
-- ... (and so on)
CREATE CATALOG sandbox_finance;
-- The certified shareable surface — one cross-LOB catalog
CREATE CATALOG dataproducts
COMMENT 'Certified data products. Read-only to consumers. Owned by data_platform.';
-- The grant model — three levels of audience
GRANT USAGE ON CATALOG prod_finance TO `finance_engineers`;
GRANT USAGE ON CATALOG dataproducts TO `account_users`; -- everyone reads certified
GRANT CREATE SCHEMA ON CATALOG sandbox_finance TO `finance_engineers`;
Step-by-step trace.
| Catalog | Audience | Privileges granted |
|---|---|---|
| prod_finance | finance_engineers | USAGE; SELECT per schema |
| stage_finance | finance_engineers | USAGE; CREATE SCHEMA |
| sandbox_finance | finance_engineers | USAGE; CREATE SCHEMA; OWNER |
| dataproducts | account_users (everyone) | USAGE; SELECT (read-only) |
| dataproducts | data_platform | OWNER; CREATE SCHEMA; CREATE TABLE |
The matrix gives each LOB a private playground (sandbox), a staging area, and a prod zone — and one cross-LOB catalog (dataproducts) where certified, governed datasets are shared without giving every consumer USAGE on the producer's prod catalog.
Output:
| Question | Answer |
|---|---|
| How many catalogs total? | 13 (12 matrix + 1 dataproducts) |
| Who can write to prod? | LOB engineers via CI/CD service principal |
| How does finance share with growth? | Promote table into dataproducts
|
| Where does sandbox data live? |
sandbox_{lob} — short retention, no production grants |
Why this works — concept by concept:
-
Catalog is the isolation unit — by making catalogs per
{env, lob}, every group grant lands at the catalog root and inherits down. New schemas inside the catalog automatically get the same audience. -
dataproductsas a curated surface — separating "raw / staged / prod tables" (inside each LOB catalog) from "shareable, contracted datasets" (insidedataproducts) lets you ship certified views without exposing the internal data model. -
USAGE chain enforces the tree — a finance engineer never needs USAGE on
prod_growth; the catalogs are physically separate and the grant model says so. -
Sandbox catalogs — every engineer gets
CREATE SCHEMAin their LOB's sandbox catalog. Experimentation never pollutes prod, and there is one obvious place to clean up. - Cost — catalog creation is metadata-only. There is no per-catalog license or compute cost. The audit and lineage system tables remain a single account-scoped store.
SQL
Topic — design
Schema and catalog design problems
SQL
Topic — dimensional modeling
Dimensional modelling drills
3. Automatic data lineage — table-level and column-level
Lineage that "just appears" is the single most under-appreciated UC feature — and the basis of every modern impact-analysis answer
The mental model in one line: UC infers lineage from the query plan — every notebook, job, DLT pipeline, and dashboard that reads or writes via Photon or DBSQL emits a node into the lineage graph automatically, at both table grain (which table writes which table) and column grain (which source columns feed which target column). The engineer's job is to read the graph, not build it.
What UC captures automatically.
-
Producer → table edges. Every notebook, scheduled job, dashboard, DLT pipeline, and SQL warehouse query writes an edge into
system.access.table_lineage. -
Table → table edges. When a notebook reads
main.raw.ordersand writesmain.silver.orders_clean, UC records the read-write dependency. -
Column → column edges. When a transformation maps
orders.amount+customers.regiontodaily_revenue.gmv, UC stores the column-level mapping insystem.access.column_lineage. - Cross-workload. SQL warehouses, Photon clusters, DLT pipelines, scheduled jobs (notebook and script tasks), and AI agent runs all contribute.
Two grains, one graph.
-
Table grain is the right currency for impact analysis ("if I drop
main.raw.orders, what breaks downstream?"). Faster to query, fewer edges, smaller graph. -
Column grain is the right currency for PII tracing and data-contract validation ("does any downstream column depend on
customers.ssn?"). More edges, more cost, slower query. - Both grains are populated automatically — there is no opt-in flag for column lineage on supported runtimes.
Where to read the lineage.
- Catalog Explorer UI. Click a table → "Lineage" tab → see the upstream/downstream DAG. The default audience for analysts and stewards.
-
REST API.
/api/2.0/lineage-tracking/table-lineageand/api/2.0/lineage-tracking/column-lineage. Programmatic access for impact-analysis automation. -
System tables.
system.access.table_lineageandsystem.access.column_lineage. SQL queryable, joinable to anything else.
The 90-day retention window.
- Default. UC retains lineage events for 90 days. Older events drop off.
- Long-term storage. Stream the system tables into your own gold tables for retention beyond 90 days — useful for regulated industries.
-
Audit pairing. Lineage +
system.access.audit= "who read which table at which time and where the result flowed."
What lineage does not capture (limits).
-
Structured Streaming
foreachBatchsinks. The arbitrary code inside the lambda is opaque to UC. - Non-Databricks consumers. A Power BI dashboard that uses Delta Sharing as an open-protocol consumer is captured as a share event but not as a downstream Power BI graph node.
- External writes from non-UC clusters. If a cluster bypasses UC entirely, its writes are invisible to the lineage graph.
Common interview probes on lineage.
- "How would you find every downstream table impacted by deleting a column?" — query
system.access.column_lineagefor every edge whosesource_column_name = 'ssn'. - "What is the retention of the lineage graph?" — 90 days; stream to a gold table for longer.
- "Is lineage captured for DLT?" — yes, at both grains. DLT is one of the best lineage producers because the framework knows the transformation declaratively.
- "Does Lakehouse Federation contribute lineage?" — yes, at table grain. The federated query is captured as a read from the foreign catalog.
Worked example — impact analysis "what breaks if I drop this column?"
Detailed explanation. A data engineer needs to drop customers.ssn (the company is moving SSN to a tokenized vault). Before doing it, they need to know every downstream table whose columns depend on customers.ssn. The answer is one query against system.access.column_lineage.
Question. Find every downstream column that has main.crm.customers.ssn as a source.
Input. The column lineage table has one row per (source_column, target_column) pair, with the producer notebook/job and the event time.
| source_table_full_name | source_column_name | target_table_full_name | target_column_name |
|---|---|---|---|
| main.crm.customers | ssn | main.compliance.ssn_lookup | ssn_hash |
| main.crm.customers | ssn | main.compliance.ssn_lookup | masked_ssn |
| main.crm.customers | ssn | main.gold.kyc_dashboard | ssn_last4 |
| main.crm.customers | main.gold.kyc_dashboard |
Code.
SELECT
target_table_full_name,
target_column_name,
source_column_name,
MAX(event_time) AS last_seen
FROM system.access.column_lineage
WHERE source_table_full_name = 'main.crm.customers'
AND source_column_name = 'ssn'
AND event_time >= CURRENT_DATE - INTERVAL 90 DAYS
GROUP BY
target_table_full_name,
target_column_name,
source_column_name
ORDER BY target_table_full_name, target_column_name;
Step-by-step explanation.
- The query filters
column_lineageto a single source column. Every row in the result is a column that depends — directly or transitively in the recorded edge — oncustomers.ssn. -
event_time >= CURRENT_DATE - INTERVAL 90 DAYSmatches the UC retention window; anything older was already aged out of the lineage table. - The
GROUP BY ... MAX(event_time)collapses multiple edge events into one row per target column with the most recent observation. Useful for distinguishing "actively read" vs "historic edge." - The result is the impact list: every downstream target that needs a rewrite (or a deprecation plan) before the source column can be dropped.
- To go one more hop downstream (transitive impact), feed the target tables back into the same query as source candidates.
Output.
| target_table_full_name | target_column_name | source_column_name | last_seen |
|---|---|---|---|
| main.compliance.ssn_lookup | masked_ssn | ssn | 2026-06-05 |
| main.compliance.ssn_lookup | ssn_hash | ssn | 2026-06-05 |
| main.gold.kyc_dashboard | ssn_last4 | ssn | 2026-06-04 |
Rule of thumb. Make this query a saved view (vw_column_impact). Every "can I change this column?" ticket becomes a one-parameter call. Same query, different source_column_name — instant impact list.
Worked example — finding orphan tables nobody reads
Detailed explanation. Over time, every lakehouse accumulates tables no consumer reads. They cost storage, pollute Catalog Explorer, and dilute trust. The lineage table tells you exactly which tables have zero read edges in the last N days — i.e. orphans.
Question. Find every table in main.gold that has no read activity in the last 60 days.
Input. A registry of all tables in main.gold plus the table_lineage event log.
Code.
WITH gold_tables AS (
SELECT
CONCAT_WS('.', table_catalog, table_schema, table_name) AS full_name
FROM system.information_schema.tables
WHERE table_catalog = 'main'
AND table_schema = 'gold'
AND table_type = 'MANAGED'
),
recent_reads AS (
SELECT DISTINCT source_table_full_name AS full_name
FROM system.access.table_lineage
WHERE event_time >= CURRENT_DATE - INTERVAL 60 DAYS
AND source_table_full_name LIKE 'main.gold.%'
)
SELECT g.full_name AS orphan_table
FROM gold_tables g
LEFT JOIN recent_reads r ON r.full_name = g.full_name
WHERE r.full_name IS NULL
ORDER BY g.full_name;
Step-by-step explanation.
-
gold_tableslists every managed table in themain.goldschema — the candidate population. -
recent_readslists everymain.gold.*table that appears as a source (i.e. was read by something) in the last 60 days.DISTINCTcollapses the per-event rows into per-table rows. - The
LEFT JOIN ... WHERE r.full_name IS NULLis the canonical anti-join pattern — keeps every gold table that does not appear in the recent-reads list. - The result is the orphan list. Each row is a candidate for deletion, archival, or — if it is a quarterly report — better discovery metadata so consumers find it.
- Add a
created_at >= CURRENT_DATE - INTERVAL 14 DAYSexclusion if you want to skip brand-new tables that have not had time to acquire readers yet.
Output.
| orphan_table |
|---|
| main.gold.legacy_revenue_q1 |
| main.gold.test_aggregate |
| main.gold.unused_rollup |
Rule of thumb. Run the orphan scan monthly. Tag each orphan with a last_read annotation, give owners 30 days to claim, then archive. The lineage table makes this a 20-line query — pre-UC it was a manual archaeology project.
Worked example — auditing PII reads with lineage + audit join
Detailed explanation. Compliance asks: "show me every user who read any column derived from customers.ssn in the last 30 days." The answer joins column_lineage (find downstream columns) → table_lineage (find tables) → system.access.audit (find readers).
Question. List every user_email that ran a query against a table downstream of main.crm.customers.ssn in the last 30 days.
Input. Three system tables joined.
Code.
WITH ssn_targets AS (
SELECT DISTINCT target_table_full_name AS table_name
FROM system.access.column_lineage
WHERE source_table_full_name = 'main.crm.customers'
AND source_column_name = 'ssn'
AND event_time >= CURRENT_DATE - INTERVAL 90 DAYS
)
SELECT
a.user_identity.email AS user_email,
a.request_params.full_name_arg AS table_read,
COUNT(*) AS read_count,
MAX(a.event_time) AS last_read_at
FROM system.access.audit a
JOIN ssn_targets t
ON a.request_params.full_name_arg = t.table_name
WHERE a.service_name = 'unityCatalog'
AND a.action_name = 'getTable'
AND a.event_time >= CURRENT_DATE - INTERVAL 30 DAYS
GROUP BY
a.user_identity.email,
a.request_params.full_name_arg
ORDER BY read_count DESC;
Step-by-step explanation.
-
ssn_targetscollapses the column lineage into a table list — every table that has at least one column derived fromcustomers.ssn. - The audit table records one row per UC
getTableevent, which corresponds to a user opening / scanning a table. Join the two on the table name. - The aggregate produces "user → table → read_count" — the compliance answer in one page.
- Filter on
service_name = 'unityCatalog'andaction_name = 'getTable'to ignore unrelated audit events (cluster start, secret access, etc). - The 30-day window for the audit join is narrower than the 90-day lineage retention — the two are independent retention horizons.
Output.
| user_email | table_read | read_count | last_read_at |
|---|---|---|---|
| alice@acme.com | main.compliance.ssn_lookup | 12 | 2026-06-05 14:00 |
| bob@acme.com | main.gold.kyc_dashboard | 4 | 2026-06-03 09:22 |
| carol@acme.com | main.compliance.ssn_lookup | 1 | 2026-05-28 11:10 |
Rule of thumb. Bake this query into a quarterly compliance report. The lineage hop expands PII auditing from "what was read directly" to "what was read transitively" — which is the question regulators actually ask.
SQL interview question on lineage-based impact analysis
A senior interviewer often frames this as: "You are dropping main.raw.events.user_ip to comply with a regional privacy law. Walk me through the queries you would run to (a) find every downstream table impacted, (b) find every notebook/job that needs an edit, and (c) the rollout order."
Solution Using a two-step lineage join (column → table → producer)
-- 1) Downstream tables impacted
WITH impacted_tables AS (
SELECT DISTINCT target_table_full_name AS table_full_name
FROM system.access.column_lineage
WHERE source_table_full_name = 'main.raw.events'
AND source_column_name = 'user_ip'
AND event_time >= CURRENT_DATE - INTERVAL 90 DAYS
),
-- 2) Producers (notebooks / jobs / DLT) that write each impacted table
producers AS (
SELECT
tl.target_table_full_name,
tl.entity_type,
tl.entity_id,
tl.entity_name,
MAX(tl.event_time) AS last_write
FROM system.access.table_lineage tl
JOIN impacted_tables it
ON it.table_full_name = tl.target_table_full_name
WHERE tl.event_time >= CURRENT_DATE - INTERVAL 90 DAYS
GROUP BY
tl.target_table_full_name,
tl.entity_type,
tl.entity_id,
tl.entity_name
)
SELECT * FROM producers
ORDER BY target_table_full_name, last_write DESC;
Step-by-step trace.
| Step | Reads from | Output cardinality | Purpose |
|---|---|---|---|
| 1. impacted_tables | column_lineage | small (5–50) | "what depends on user_ip" |
| 2. producers | table_lineage | medium (10–200) | "what writes each impacted table" |
| 3. final select | producers | medium | the rollout backlog |
The two-step pattern — column lineage to find impacted tables, then table lineage to find the producers — is the canonical "what do I need to edit" answer. The output rows are the engineering backlog, ordered by table.
Output:
| target_table_full_name | entity_type | entity_name | last_write |
|---|---|---|---|
| main.silver.events_geo | NOTEBOOK | nb_geo_enrichment | 2026-06-05 |
| main.gold.daily_traffic | JOB | daily_traffic_job | 2026-06-05 |
| main.gold.daily_traffic | DLT_PIPELINE | gold_traffic_dlt | 2026-06-05 |
Why this works — concept by concept:
- column_lineage drives table_lineage — start at the column you want to change, walk up the column graph to find target columns, then up again via the table graph to find producers. The hop count is finite (usually 1–3).
-
entity_type tells the editor what to edit —
NOTEBOOK,JOB,DLT_PIPELINE,SQL_QUERY. Each demands a different change-management workflow. - 90-day retention is enough for impact analysis — anything that has not written in 90 days is either dormant or deprecated. Treat it as a candidate for archival, not a blocker.
- One query, no instrumentation — pre-UC, this answer required reading every notebook in source control. UC reduces it to a join.
- Cost — two LEFT JOIN scans over the lineage system tables. Optimised by Databricks for sub-second response on typical metastore sizes. Bring your own DBSQL warehouse.
SQL
Topic — joins
JOIN problems for lineage analysis
SQL
Topic — ctes
CTE-based multi-step queries
4. Row filters, column masks + GRANT/REVOKE
row-level security filters and column masks are functions you attach to a table — and they beat dynamic views for maintenance, cross-tool consistency, and audit
The mental model in one line: a row filter is a SQL UDF that returns BOOLEAN and decides whether each row is visible; a column mask is a SQL UDF that returns the (possibly rewritten) column value — and once attached to a table, both fire automatically for every reader, every tool, every workspace. The win versus the older "build a dynamic view per audience" pattern is that the table stays one object — there is no view tree to keep in sync.
The four primitives.
-
GRANT— give a privilege (SELECT, MODIFY, USAGE, EXECUTE, ...) on an object (table, schema, catalog, function, volume) to a principal (user, group, service principal). -
REVOKE— the symmetric opposite. Identical syntax, opposite effect. -
Row filter function. A SQL UDF that returns
BOOLEANand is attached to a table viaALTER TABLE ... SET ROW FILTER. The function fires per row; rows where it returnsFALSEorNULLare dropped from the result. -
Column mask function. A SQL UDF that takes the column value as input and returns the (possibly rewritten) value. Attached via
ALTER TABLE ... ALTER COLUMN ... SET MASK. Fires per row, per column.
Two helper functions every policy uses.
-
is_account_group_member('group_name')— returnsTRUEif the current principal is a member of the named account-level group. The cleanest way to write "if the reader is a PII reader." -
current_user()— returns the email of the current principal. Useful for per-user filters ("show only rows owned by the current user").
Row filters in detail.
-
Signature. A SQL UDF whose return type is
BOOLEAN. Arguments correspond to one or more columns of the protected table. -
Application.
ALTER TABLE main.sales.orders SET ROW FILTER us_only_filter ON (region)— UC passes theregioncolumn intous_only_filterfor every row. -
Combinable with group membership. Use
is_account_group_member()inside the filter so administrators see all rows, regional analysts see their region only. -
Detach.
ALTER TABLE main.sales.orders DROP ROW FILTER— removes the filter without altering the function.
Column masks in detail.
- Signature. A SQL UDF whose argument is the column value and whose return type matches. Optional additional arguments for context (e.g. the row's region).
-
Application.
ALTER TABLE main.hr.employees ALTER COLUMN ssn SET MASK ssn_mask— UC routes the originalssnvalue throughssn_maskfor every reader. -
Conditional unmasking. Put
CASE WHEN is_account_group_member('pii_readers') THEN ssn ELSE ... ENDinside the mask body — same physical column, different visible value per audience. -
Detach.
ALTER TABLE main.hr.employees ALTER COLUMN ssn DROP MASK.
Dynamic views vs filters/masks.
-
Dynamic view. A view defined with
CREATE VIEW ... AS SELECT ... CASE WHEN is_account_group_member('pii') THEN ssn ELSE '***' END .... The audience is encoded in the view body. - Why filters/masks win. One table, one policy. No view tree to maintain. Same policy applies whether the reader hits DBSQL, a notebook, a federated tool, or a Delta Sharing recipient.
- When dynamic views still help. When the policy logic depends on a JOIN to another table (e.g. an entitlements table); row filters and masks are scalar UDFs by design.
Attribute-Based Access Control (ABAC).
-
What. Tag-driven policies — apply a
pii=truetag to a column once, and a tenant-wide rule "mask everypii=truecolumn for non-pii_readers" fires automatically. - Status (2026). Preview / GA depending on cloud and region. The direction of travel for the next two years.
-
Why care. It replaces per-table policy admin with per-tag policy admin. The maintenance graph collapses from
O(tables × audiences)toO(tags × audiences).
The GRANT chain (always required).
-
GRANT USAGE ON CATALOG main TO group— see the catalog. -
GRANT USAGE ON SCHEMA main.sales TO group— see the schema. -
GRANT SELECT ON TABLE main.sales.orders TO group— read the table (subject to filter + mask).
Audit — where every check lands.
-
system.access.audit. One row per privileged action. Includes the action name (getTable,grantPrivilege, ...), the principal, the object, the request params, the source IP. -
Cross-tabulation with privileges. Join
auditwithinformation_schema.table_privilegesto surface "granted but never used" privileges — the principle-of-least-privilege backlog.
Worked example — a row filter that limits orders to the reader's region
Detailed explanation. A sales.orders table holds orders from every region. EU analysts should see only EU rows; APAC only APAC; global admins all rows. A single row filter function — looking at the row's region column and the reader's group membership — encodes the policy.
Question. Define a row filter region_filter(region) that returns TRUE when the reader is in the corresponding regional group OR the global readers group. Attach it to main.sales.orders.
Input.
| order_id | region | amount |
|---|---|---|
| 1 | EU | 100 |
| 2 | US | 80 |
| 3 | APAC | 50 |
| 4 | EU | 200 |
Code.
-- 1) The row filter function
CREATE OR REPLACE FUNCTION main.sec.region_filter(region STRING)
RETURN
is_account_group_member('global_readers')
OR (region = 'EU' AND is_account_group_member('eu_analysts'))
OR (region = 'US' AND is_account_group_member('us_analysts'))
OR (region = 'APAC' AND is_account_group_member('apac_analysts'));
-- 2) Attach to the table
ALTER TABLE main.sales.orders
SET ROW FILTER main.sec.region_filter ON (region);
-- 3) Grant EXECUTE on the function to readers
GRANT EXECUTE ON FUNCTION main.sec.region_filter TO `account_users`;
Step-by-step explanation.
- The function returns
BOOLEAN. Its first branch —is_account_group_member('global_readers')— short-circuits TRUE for admins, so the filter never excludes them. - Each regional branch tests both the row's region and the reader's group membership. The pattern scales to N regions by adding N branches.
-
ALTER TABLE ... SET ROW FILTER ... ON (region)registers the filter and tells UC which column to pass in. UC injects the function into every read plan automatically. -
GRANT EXECUTE ON FUNCTIONis not the same asSELECTon the table — the principal needs both. Without EXECUTE on the filter function, the read fails with "function not accessible." - The filter fires for every read: DBSQL, notebooks, federated tools, Delta Sharing recipients. There is one policy, applied everywhere.
Output (what an EU analyst sees).
| order_id | region | amount |
|---|---|---|
| 1 | EU | 100 |
| 4 | EU | 200 |
Rule of thumb. Put one boolean "is_admin" branch at the very top of every row filter. It makes admin debugging trivial and prevents the filter from accidentally hiding rows during a policy change.
Worked example — a column mask for SSN with conditional unmasking
Detailed explanation. The hr.employees table stores a raw ssn column. Only the pii_readers group is allowed to see the full value; everyone else sees the last-4 form. A column mask attached to the table enforces this without rewriting any reader's queries.
Question. Define a mask ssn_mask(ssn) that returns the full SSN for pii_readers and XXX-XX-#### (last 4 digits) for everyone else. Attach it to main.hr.employees.ssn.
Input.
| name | ssn |
|---|---|
| alice | 123-45-6789 |
| bob | 987-65-4321 |
| carol | 555-12-3456 |
Code.
-- 1) The mask function — input ssn, output (possibly rewritten) ssn
CREATE OR REPLACE FUNCTION main.sec.ssn_mask(ssn STRING)
RETURN
CASE
WHEN is_account_group_member('pii_readers')
THEN ssn
ELSE
'XXX-XX-' || RIGHT(ssn, 4)
END;
-- 2) Attach to the column
ALTER TABLE main.hr.employees
ALTER COLUMN ssn SET MASK main.sec.ssn_mask;
-- 3) Grant EXECUTE on the function
GRANT EXECUTE ON FUNCTION main.sec.ssn_mask TO `account_users`;
Step-by-step explanation.
- The mask takes the column value as input and returns the visible value. The function is called once per row, per read.
- For a member of
pii_readers, the mask returns the input unchanged — the full SSN is visible. - For any other reader, the mask returns
'XXX-XX-' || RIGHT(ssn, 4)— the masked form. The same physical row produces different visible values depending on the audience. - Like row filters, column masks fire across every read path: notebooks, DBSQL, Delta Sharing recipients (if shared). Replacing a per-tool ACL with a single mask.
-
RIGHT(ssn, 4)is the Databricks SQL idiom for the last four characters. The mask intentionally leaks no more than four digits — the standard PII-tolerable disclosure.
Output — what a non-PII reader sees.
| name | ssn |
|---|---|
| alice | XXX-XX-6789 |
| bob | XXX-XX-4321 |
| carol | XXX-XX-3456 |
Rule of thumb. Define mask functions in a dedicated sec schema (main.sec.ssn_mask, main.sec.email_mask, ...). One place to audit, one place to revoke EXECUTE if a function becomes deprecated.
Worked example — combining filter + mask + USAGE on one table
Detailed explanation. A single main.sales.orders table can carry both a row filter (region) and a column mask (customer_email). The two compose: the filter runs first and drops rows; the mask runs on surviving rows and rewrites values. The reader needs the full USAGE chain plus EXECUTE on both functions.
Question. Show the complete policy bundle — row filter, column mask, USAGE chain, function EXECUTE — for an EU analyst reading main.sales.orders.
Input.
| order_id | region | customer_email | amount |
|---|---|---|---|
| 1 | EU | alice@example.com | 100 |
| 2 | US | bob@example.com | 80 |
| 3 | EU | carol@example.com | 50 |
Code.
-- Functions
CREATE OR REPLACE FUNCTION main.sec.region_filter(region STRING)
RETURN is_account_group_member('global_readers')
OR (region = 'EU' AND is_account_group_member('eu_analysts'));
CREATE OR REPLACE FUNCTION main.sec.email_mask(email STRING)
RETURN CASE
WHEN is_account_group_member('pii_readers') THEN email
ELSE CONCAT(LEFT(email, 2), '***@', SPLIT_PART(email, '@', 2))
END;
-- Attach
ALTER TABLE main.sales.orders SET ROW FILTER main.sec.region_filter ON (region);
ALTER TABLE main.sales.orders ALTER COLUMN customer_email SET MASK main.sec.email_mask;
-- Grants — full bundle for eu_analysts
GRANT USAGE ON CATALOG main TO `eu_analysts`;
GRANT USAGE ON SCHEMA main.sales TO `eu_analysts`;
GRANT SELECT ON TABLE main.sales.orders TO `eu_analysts`;
GRANT EXECUTE ON FUNCTION main.sec.region_filter TO `eu_analysts`;
GRANT EXECUTE ON FUNCTION main.sec.email_mask TO `eu_analysts`;
Step-by-step explanation.
- The filter runs first. For an EU analyst, only
region = 'EU'rows survive — orders 1 and 3. - The mask runs second. The EU analyst is not in
pii_readers, socustomer_emailis rewritten asal***@example.com,ca***@example.com. - Both functions must be granted EXECUTE to the reader. The error if either is missing is "function not accessible" — a frequent first-day mistake.
- The full USAGE chain (catalog + schema + table SELECT) is additional — function EXECUTE alone is not sufficient to read the table.
- Bundle the seven grants into one script and check it into source control. Re-running the script is idempotent; UC silently no-ops if the privilege already exists.
Output — what the EU analyst sees.
| order_id | region | customer_email | amount |
|---|---|---|---|
| 1 | EU | al***@example.com | 100 |
| 3 | EU | ca***@example.com | 50 |
Rule of thumb. Treat row filter + column mask as one policy bundle. Bundle them into a single SQL script per table, grant function EXECUTE in the same script, and tag the script with the table's full name. When the policy changes, you edit one file.
SQL interview question on row-level + column-level security composition
A senior interviewer often opens with: "Design the row filter + column mask + GRANT model so that EU analysts see EU rows with masked emails, US analysts see US rows with masked emails, and compliance officers see all rows with full emails. Show the SQL."
Solution Using one row filter + one column mask + three groups
-- One filter for region-based row visibility
CREATE OR REPLACE FUNCTION main.sec.region_filter(region STRING)
RETURN is_account_group_member('compliance_officers')
OR (region = 'EU' AND is_account_group_member('eu_analysts'))
OR (region = 'US' AND is_account_group_member('us_analysts'));
-- One mask for email visibility
CREATE OR REPLACE FUNCTION main.sec.email_mask(email STRING)
RETURN CASE
WHEN is_account_group_member('compliance_officers') THEN email
ELSE CONCAT(LEFT(email, 2), '***@', SPLIT_PART(email, '@', 2))
END;
-- Attach both
ALTER TABLE main.sales.orders SET ROW FILTER main.sec.region_filter ON (region);
ALTER TABLE main.sales.orders ALTER COLUMN customer_email SET MASK main.sec.email_mask;
-- Three audiences, one policy
GRANT USAGE ON CATALOG main TO `eu_analysts`, `us_analysts`, `compliance_officers`;
GRANT USAGE ON SCHEMA main.sales TO `eu_analysts`, `us_analysts`, `compliance_officers`;
GRANT SELECT ON TABLE main.sales.orders TO `eu_analysts`, `us_analysts`, `compliance_officers`;
GRANT EXECUTE ON FUNCTION main.sec.region_filter TO `eu_analysts`, `us_analysts`, `compliance_officers`;
GRANT EXECUTE ON FUNCTION main.sec.email_mask TO `eu_analysts`, `us_analysts`, `compliance_officers`;
Step-by-step trace.
| Reader group | Rows visible | Email visibility |
|---|---|---|
| eu_analysts | EU only | masked (al***@example.com) |
| us_analysts | US only | masked |
| compliance_officers | all regions | full (alice@example.com) |
| (no group above) | none | n/a — SELECT denied at table level |
The filter+mask combination keeps the same physical table while presenting three audience-specific views. There is no view tree to maintain and the audit trail is clean: every read shows the same main.sales.orders target with the principal's group on the row.
Output:
| audience | orders rows | email column visible |
|---|---|---|
| eu_analysts | 2 (EU only) | masked |
| us_analysts | 1 (US only) | masked |
| compliance_officers | 3 (all) | full |
Why this works — concept by concept:
- One filter function, three branches — the function encodes every audience's row visibility in one place. Adding APAC is a one-line edit, not a per-table change.
-
is_account_group_member()is the policy language — it reads the principal's current effective groups. No application-side identity propagation required. - Mask + filter compose — filter runs first (drops rows), mask runs second (rewrites columns). The optimiser inlines both into the read plan.
- Function EXECUTE is independent of table SELECT — both grants are required. Missing EXECUTE produces a "function not accessible" error that looks like a table permission bug but is not. Catch it in onboarding.
- Cost — both functions are scalar UDFs. Their cost is per-row CPU only; no I/O, no extra scan. The optimiser usually pushes the filter predicate into the file-scan stage so unread rows never leave storage.
SQL
Topic — case expression
CASE expression problems (mask logic)
SQL
Topic — filtering
Filtering problems (row filter analogues)
5. Delta Sharing topology + production rollout
delta sharing is the open REST + parquet protocol that turns "send me a CSV every Monday" into "subscribe to this share and read it from Power BI"
The mental model in one line: Delta Sharing is one protocol (REST + parquet) with two activation flavours (open-protocol token-based for non-Databricks recipients; metastore-based for Databricks-to-Databricks) — and the unit of sharing is a SHARE object that bundles tables, schemas, partitions, volumes, and AI models on the producer side. Once you say "one share, many recipients, zero copy," every cross-team data-product conversation becomes a configuration exercise rather than a pipeline-building exercise.
Three producer-side objects.
-
PROVIDER(recipient-facing) — the producer side automatically becomes a provider when it creates its first share. The recipient sees a "provider" handle that bundles all the producer's shares to that recipient. -
SHARE— a bundle of objects.CREATE SHARE analytics_sharecreates an empty share; subsequentALTER SHARE ... ADD TABLEadds objects. -
RECIPIENT— the consumer registration. Two activation modes: open-protocol (recipient downloads a credential file) or metastore-based (Databricks-to-Databricks identity federation, no token leak).
Two recipient activation modes.
-
Token-based (open protocol). UC mints a
bearer_tokenand a profile file. The recipient downloads the profile and uses any open-protocol client (Pandas, Power BI, Tableau, Spark). The token is the secret; rotate periodically. - Metastore-based (Databricks-to-Databricks). Both producer and recipient are Databricks metastores; UC handles identity federation. No token file changes hands. The "do this when both sides are Databricks" preferred mode.
Five object kinds you can share.
- Tables. The canonical case. Delta tables (managed or external) share at parquet-file granularity.
-
Partitions. Share only a subset of a partitioned table — e.g. only the
region = 'EU'partitions. - Schemas. Share an entire schema; new tables in the schema auto-flow into the share.
- Volumes. Share governed file data (PDFs, logs, ML artefacts).
-
AI models. Share registered models in UC's model registry; the recipient can
model.load_predictdirectly.
Cross-cloud, cross-region.
- The protocol is cloud-agnostic. A producer on AWS can share to a recipient on Azure or GCP — UC vends short-lived cloud credentials and the recipient reads the parquet directly.
- Egress. The producer pays cross-cloud / cross-region egress on the parquet read. Factor that into pricing models.
- Latency. Cross-region adds the obvious propagation latency on the first parquet open; subsequent reads benefit from any client-side cache.
Open-protocol consumers — first-party libraries.
- Power BI. Native Delta Sharing connector since 2024. Point at the profile file, pick a share, get a model.
- Tableau. Native connector via the Delta Sharing client.
-
Pandas.
pip install delta-sharing→delta_sharing.load_as_pandas('profile.share#share.schema.table'). -
Spark.
df = spark.read.format('deltaSharing').load('share.schema.table')after the open-protocol JAR is on the cluster.
Production rollout — the discovery-first sequence.
- Step 1 — Discovery. Identify candidate tables via Catalog Explorer + lineage + business-glossary tags. Producing teams propose shares; consuming teams request them.
-
Step 2 — Tagging. Apply governance tags (
pii=true,tier=gold,sla=1h) to every shareable column. ABAC policies later attach to tags rather than per-table. - Step 3 — Policies. Attach row filters + column masks. The share inherits the policies — recipients see filtered, masked data without any per-share rewrite.
-
Step 4 — Lineage SLA. Confirm the producer pipeline emits lineage so consumers can answer "is this current?" via
system.access.table_lineage. - Step 5 — Sharing. Create the share, add the objects, register the recipient, send the activation link. Smoke-test with a known-good query.
The "data products" catalog pattern.
-
One certified catalog (
dataproducts) holds shareable, contracted tables — not the producer's raw or staging tables. -
Each share points only into
dataproducts.*. Producers promote their tables intodataproductsafter the data contract, governance tags, and SLAs are in place. -
Discovery is centralised — every consumer browses
dataproductsin Catalog Explorer and requests a share rather than spelunking through prod catalogs.
Common interview probes on Delta Sharing.
- "Difference between token-based and metastore-based activation?" — token file is mailed out-of-band; metastore-based uses UC identity federation, no token.
- "Can you share a table that has a row filter?" — yes. The recipient sees filtered rows; the filter runs on the producer side.
- "Can the recipient write back to the share?" — no, sharing is read-only by design. For two-way, both sides set up shares in opposite directions.
- "What does the producer pay?" — egress cost on parquet reads (cross-cloud / cross-region only) and the metadata-call cost. Both small relative to ETL.
Worked example — share three tables to a Databricks-to-Databricks recipient
Detailed explanation. A producer on AWS wants to share main.sales.orders, main.sales.customers, and main.analytics.daily_revenue to an analytics team running its own Databricks metastore on Azure. Both sides have UC, so the right mode is metastore-based — no token file, no out-of-band secret.
Question. Create the share, add the three tables, register the Azure metastore as the recipient, and grant the share to it.
Input.
| Object | Producer-side full name |
|---|---|
| Orders fact | main.sales.orders |
| Customer dim | main.sales.customers |
| Gold revenue | main.analytics.daily_revenue |
| Recipient metastore id | azure-metastore-east-id |
Code.
-- 1) Create an empty share
CREATE SHARE IF NOT EXISTS analytics_share
COMMENT 'Certified analytics tables for Azure consumer team';
-- 2) Add objects
ALTER SHARE analytics_share ADD TABLE main.sales.orders;
ALTER SHARE analytics_share ADD TABLE main.sales.customers;
ALTER SHARE analytics_share ADD TABLE main.analytics.daily_revenue
PARTITION (region = 'EU'); -- share only the EU partitions
-- 3) Register the Azure metastore as a recipient
CREATE RECIPIENT azure_analytics_team
USING ID 'aws:us-east-1:azure-metastore-east-id'
COMMENT 'Azure analytics team — D2D mode, no token file';
-- 4) Grant the share to the recipient
GRANT SELECT ON SHARE analytics_share TO RECIPIENT azure_analytics_team;
Step-by-step explanation.
-
CREATE SHAREis metadata-only. No data moves; the share is a logical bundle. -
ALTER SHARE ... ADD TABLEregisters each object. ThePARTITION (region = 'EU')clause shows partition-level sharing — only EU rows are exposed. - The
CREATE RECIPIENT ... USING IDform is the metastore-based activation. The ID encodes the cloud, the region, and the recipient's UC metastore identifier. No secret file is generated. - The
GRANT SELECT ON SHAREstep is what makes the share visible to the recipient. On the recipient side, the share appears as a "provider" in Catalog Explorer. - The recipient mounts the share by creating a catalog from it:
CREATE CATALOG analytics_from_aws USING SHARE 'aws-provider.analytics_share'. Three-part naming then resolves on the consumer side.
Output (recipient view).
| catalog | schema | table |
|---|---|---|
| analytics_from_aws | sales | orders |
| analytics_from_aws | sales | customers |
| analytics_from_aws | analytics | daily_revenue |
Rule of thumb. Default to metastore-based when both sides are Databricks. Tokens are a 2024 compatibility tool; the 2026 best practice is D2D when available. Reserve open-protocol tokens for Power BI / Pandas / Tableau consumers.
Worked example — share to an open-protocol Power BI consumer
Detailed explanation. An external partner team uses Power BI on a non-Databricks stack. They want to read main.gold.daily_revenue from inside Power BI's Delta Sharing connector. The right activation mode is token-based: produce a credential file, send it via secure channel, partner imports it into Power BI Desktop.
Question. Create the share, register a token-based recipient, and produce the credential file.
Input.
| Component | Value |
|---|---|
| Share name | partner_revenue_share |
| Table | main.gold.daily_revenue |
| Recipient | partner_powerbi |
Code.
-- 1) Share
CREATE SHARE partner_revenue_share;
ALTER SHARE partner_revenue_share ADD TABLE main.gold.daily_revenue;
-- 2) Token-based recipient — UC generates a profile file
CREATE RECIPIENT partner_powerbi
USING SHARING_CODE
COMMENT 'External partner — open-protocol, Power BI Desktop';
-- 3) Issue the activation URL (one-time)
SELECT activation_link
FROM system.information_schema.recipient_activations
WHERE recipient_name = 'partner_powerbi'
ORDER BY created_at DESC
LIMIT 1;
-- 4) Grant
GRANT SELECT ON SHARE partner_revenue_share TO RECIPIENT partner_powerbi;
Step-by-step explanation.
- The share is created the same way as a D2D share. The difference is recipient-side, not producer-side.
-
CREATE RECIPIENT ... USING SHARING_CODEtells UC to mint abearer_tokenand an activation URL. The partner clicks the URL once to download a.shareprofile file. - The activation URL is single-use and time-bounded. The recipient must click it before it expires; you can rotate it from
system.information_schema.recipient_activations. - The partner opens Power BI Desktop → Get Data → Delta Sharing → load the
.sharefile → pickpartner_revenue_share→ import the table. - Token rotation is a
ROTATEcommand on the recipient — issue a new activation URL, send to the partner, revoke the old one after grace period.
Output (partner's Power BI view).
| share | schema | table | columns |
|---|---|---|---|
| partner_revenue_share | gold | daily_revenue | date, region, gmv |
Rule of thumb. Send the activation URL via your secure-comms channel (Slack DM, encrypted email). The .share file itself can be sent through normal channels — it points back to UC's REST endpoint. The activation URL is the secret; the .share file is the address.
Worked example — production rollout: discovery → tagging → policies → share
Detailed explanation. A platform team is rolling Delta Sharing to ten internal consumer teams. The "just create shares" anti-pattern leaks raw bronze tables and gets policy exceptions filed within a week. The discovery-first sequence catches every governance concern before the share goes live.
Question. Walk through the five-step rollout for a new "marketing attribution" data product.
Input. A candidate gold table main.gold.attribution_daily with PII columns customer_email and customer_ip.
Code.
-- Step 1 — Discovery: confirm consumer demand, certify the table
SELECT
target_table_full_name,
COUNT(DISTINCT entity_name) AS distinct_consumers,
MAX(event_time) AS last_read
FROM system.access.table_lineage
WHERE target_table_full_name = 'main.gold.attribution_daily'
GROUP BY target_table_full_name;
-- Step 2 — Tagging: mark PII and tier
ALTER TABLE main.gold.attribution_daily
SET TAGS ('tier' = 'gold', 'product' = 'marketing-attribution', 'sla' = '6h');
ALTER TABLE main.gold.attribution_daily ALTER COLUMN customer_email
SET TAGS ('pii' = 'true', 'classification' = 'restricted');
ALTER TABLE main.gold.attribution_daily ALTER COLUMN customer_ip
SET TAGS ('pii' = 'true', 'classification' = 'restricted');
-- Step 3 — Policies: attach a mask to PII columns
ALTER TABLE main.gold.attribution_daily
ALTER COLUMN customer_email SET MASK main.sec.email_mask;
ALTER TABLE main.gold.attribution_daily
ALTER COLUMN customer_ip SET MASK main.sec.ip_mask;
-- Step 4 — Promote to the dataproducts catalog
CREATE OR REPLACE TABLE dataproducts.marketing.attribution_daily
DEEP CLONE main.gold.attribution_daily;
-- Step 5 — Share
CREATE SHARE attribution_share;
ALTER SHARE attribution_share ADD TABLE dataproducts.marketing.attribution_daily;
CREATE RECIPIENT marketing_team_share USING SHARING_CODE;
GRANT SELECT ON SHARE attribution_share TO RECIPIENT marketing_team_share;
Step-by-step explanation.
- Discovery confirms that there are real consumers — the lineage query shows ten distinct entities reading the table over the last 30 days. Without consumer demand, do not share.
-
Tagging registers
pii=trueon the sensitive columns. The platform-wide ABAC policy ("maskpii=truefor non-pii_readers") then attaches automatically — no per-share rule needed. - Policies apply masks. Even if a downstream consumer accidentally promotes the share to a wider audience, the mask travels with the table.
-
Promotion to
dataproductsdecouples the share from the producer's prod catalog. Producers can iterate onmain.gold.attribution_daily(e.g. add columns) without breaking the share contract — only theDEEP CLONErefresh promotes the change. - Sharing is the last step, not the first. By the time the recipient receives the activation URL, every PII concern has already been addressed.
Output — the rollout artefacts.
| Step | Artefact | Owner |
|---|---|---|
| 1 | Consumer demand confirmed | platform |
| 2 | Tags applied to columns | platform + steward |
| 3 | Masks attached | platform |
| 4 | dataproducts table created | producer |
| 5 | Share + recipient + grant | platform |
Rule of thumb. The five-step sequence is non-negotiable for any externally-facing share. Skipping step 2 (tagging) is the leading cause of PII leakage in Delta Sharing rollouts — the table looks fine to the producer, the consumer sees raw PII, and the audit lands in the wrong inbox.
SQL interview question on Delta Sharing topology and rollout
A senior interviewer often frames this as: "Design the share + policy + catalog promotion for a sensitive table that needs to go to one Databricks-to-Databricks recipient (analytics team) and one open-protocol recipient (Power BI consumer). Walk me through every object you would create."
Solution Using a dataproducts catalog + two recipients on one share
-- 0) Promote into the certified catalog (already governance-tagged + masked upstream)
CREATE OR REPLACE TABLE dataproducts.marketing.attribution_daily
DEEP CLONE main.gold.attribution_daily;
-- 1) Share
CREATE SHARE attribution_share
COMMENT 'Certified attribution table; PII masked';
ALTER SHARE attribution_share
ADD TABLE dataproducts.marketing.attribution_daily;
-- 2) Two recipients on the same share
-- D2D for analytics team
CREATE RECIPIENT analytics_team_d2d
USING ID 'aws:us-east-1:analytics-metastore-id';
-- Open-protocol for Power BI partner
CREATE RECIPIENT partner_powerbi
USING SHARING_CODE;
-- 3) Grant the same share to both
GRANT SELECT ON SHARE attribution_share TO RECIPIENT analytics_team_d2d;
GRANT SELECT ON SHARE attribution_share TO RECIPIENT partner_powerbi;
-- 4) Inspect — who is sharing what to whom
SELECT
s.share_name,
sr.recipient_name,
r.authentication_type,
s.created_at
FROM system.information_schema.shares s
JOIN system.information_schema.share_recipients sr
ON sr.share_name = s.share_name
JOIN system.information_schema.recipients r
ON r.recipient_name = sr.recipient_name
WHERE s.share_name = 'attribution_share';
Step-by-step trace.
| Step | Object created | Purpose |
|---|---|---|
| 0 | dataproducts.marketing.attribution_daily |
Certified, contract-bound table |
| 1 |
attribution_share + ADD TABLE |
The shareable bundle |
| 2a |
analytics_team_d2d recipient |
Metastore-based, no token |
| 2b |
partner_powerbi recipient |
Token-based via SHARING_CODE |
| 3 | Two GRANT SELECT ON SHARE calls |
One share, two audiences |
| 4 | Verification query | Cross-tab of share x recipient x mode |
The pattern fans two recipient flavours off a single share. The producer manages one share object; UC handles the two activation modes underneath. PII masks travel with the table — the open-protocol consumer never sees raw values.
Output:
| share_name | recipient_name | authentication_type | created_at |
|---|---|---|---|
| attribution_share | analytics_team_d2d | DATABRICKS | 2026-06-06 10:00 |
| attribution_share | partner_powerbi | TOKEN | 2026-06-06 10:01 |
Why this works — concept by concept:
- dataproducts catalog as the share surface — promotes the producer's volatile gold table into a contracted, tagged, masked snapshot. Recipients depend on the snapshot, not the producer's day-to-day edits.
-
One share, two recipients, two modes — UC decouples "what is shared" (the share object) from "how it is consumed" (the recipient mode). Adding a third recipient is a single
CREATE RECIPIENT+GRANTpair. - Masks inherit — because the masks are attached on the source side, the consumer sees masked values regardless of whether they connect via D2D or open-protocol. No per-share rewrite.
- system.information_schema as the audit surface — every share/recipient relationship is queryable as Delta. The same JOIN powers your data-products dashboard and your compliance report.
- Cost — share metadata is free. The producer pays cross-cloud / cross-region egress on parquet reads only. D2D within the same cloud and region is effectively free.
SQL
Topic — joins
JOIN problems (share / recipient cross-tabs)
SQL
Topic — etl
ETL problems (data product promotion)
Cheat sheet — Unity Catalog recipes
-
Fully qualified name.
SELECT * FROM main.sales.orders— three-part name resolves catalog → schema → table. The legacydatabase.tableform is silently rewritten tohive_metastore.database.table. -
GRANT chain — copy this for every new group.
GRANT USAGE ON CATALOG main TO group; GRANT USAGE ON SCHEMA main.sales TO group; GRANT SELECT ON TABLE main.sales.orders TO group. Skipping the USAGE rungs is the #1 day-one ticket. -
Row filter.
ALTER TABLE main.sales.orders SET ROW FILTER region_filter ON (region). Function must returnBOOLEAN; grantEXECUTE ON FUNCTIONto readers. -
Column mask.
ALTER TABLE main.hr.employees ALTER COLUMN ssn SET MASK ssn_mask. Function takes the column and returns the visible value; grantEXECUTE ON FUNCTIONto readers. -
External location.
CREATE EXTERNAL LOCATION raw_zone URL 's3://bucket/raw' WITH (STORAGE CREDENTIAL cred_iam). Notebooks reference the location, never the secret. GrantREAD FILES/WRITE FILESto the ingest groups. -
Delta Share — D2D.
CREATE SHARE analytics; ALTER SHARE analytics ADD TABLE main.sales.orders; CREATE RECIPIENT azure_team USING ID '...'; GRANT SELECT ON SHARE analytics TO RECIPIENT azure_team. -
Delta Share — open-protocol. Replace
USING IDwithUSING SHARING_CODE; UC mints a one-time activation URL. Send via secure channel, never inline. -
Lineage — column impact.
SELECT * FROM system.access.column_lineage WHERE source_table_full_name = 'main.crm.customers' AND source_column_name = 'ssn'— the "what depends on this column" answer in one query. -
Lineage — producer audit.
SELECT entity_type, entity_name, MAX(event_time) FROM system.access.table_lineage WHERE target_table_full_name = 'main.gold.daily_revenue' GROUP BY 1, 2 ORDER BY 3 DESC. -
Audit — who read PII.
SELECT user_identity.email, COUNT(*) FROM system.access.audit WHERE service_name = 'unityCatalog' AND action_name = 'getTable' AND request_params.full_name_arg = 'main.crm.customers' GROUP BY 1. -
Foreign catalog (Snowflake).
CREATE CONNECTION snowflake_prod TYPE SNOWFLAKE OPTIONS (...); CREATE FOREIGN CATALOG snowflake_crm USING CONNECTION snowflake_prod OPTIONS (database 'CRM_PROD'). Three-part naming works across engines. -
Tag a column.
ALTER TABLE main.gold.attribution ALTER COLUMN customer_email SET TAGS ('pii' = 'true', 'classification' = 'restricted')— feeds future ABAC policies. -
Promote into
dataproducts.CREATE OR REPLACE TABLE dataproducts.marketing.attribution_daily DEEP CLONE main.gold.attribution_daily. The promoted snapshot becomes the share target. -
Inspect grants for a principal.
SELECT * FROM system.information_schema.table_privileges WHERE grantee = 'finance_engineers'. -
Detach a filter or mask.
ALTER TABLE ... DROP ROW FILTERandALTER TABLE ... ALTER COLUMN ... DROP MASK. Both are metadata-only and instant.
Frequently asked questions
What is the difference between Unity Catalog and the Hive metastore?
The Hive metastore is workspace-scoped, lives inside each Databricks workspace, and uses two-part database.table naming. Unity Catalog is account-scoped, lives once per region, and uses three-part catalog.schema.table naming. Beyond the namespace, UC adds account-level SCIM identity, automatic lineage at table and column grain, the system.access.audit log, row filter and column mask primitives, and Delta Sharing — all of which the legacy Hive metastore lacks. The migration path keeps Hive metastore tables reachable as hive_metastore.db.table so consumers can adopt UC at their own pace.
Can one Unity Catalog metastore span multiple Databricks workspaces?
Yes — and that is the entire point. One UC metastore is attached to N workspaces in the same cloud region, so a SCIM group like pii_readers and a grant like GRANT SELECT ON TABLE main.finance.customer_pii TO pii_readers is written once and effective everywhere. Pre-UC, the same policy had to be copied per workspace and frequently drifted out of sync. The metastore is a per-region object, not a per-workspace one; if you span two cloud regions you have two metastores and operate them independently (Delta Sharing handles cross-region data flow).
How does Unity Catalog capture column-level lineage automatically?
UC inspects the query plan of every notebook, scheduled job, DLT pipeline, and SQL warehouse statement that runs on a UC-enabled cluster. From the plan, UC infers which source columns feed each target column and emits a row into system.access.column_lineage. There is no instrumentation flag, no decorator, no opt-in — column lineage is on by default for SQL warehouses, Photon clusters, and DLT pipelines. The 90-day retention window is the default; teams that need longer history stream the system table into a long-term gold table.
Row filter functions vs dynamic views — which should I use?
Prefer row filters and column masks whenever the policy is scalar — that is, the visibility of a row or a column value depends only on the row itself and the reader's group membership. Row filters and masks attach to the table, so every reader (notebook, DBSQL, federated tool, Delta Sharing recipient) sees the same policy and the audit trail is clean. Dynamic views are still the right tool when the policy logic requires a JOIN to another table — e.g. an entitlements table that lists which customer_id values each user can see. As ABAC matures over 2026, tag-driven policies will further reduce the number of cases where dynamic views are needed.
Does Delta Sharing work with non-Databricks consumers like Power BI or Pandas?
Yes — Delta Sharing is an open REST + parquet protocol with first-party client libraries for Power BI, Tableau, Pandas, and Spark. The producer creates a recipient with USING SHARING_CODE, UC issues a one-time activation URL, the consumer downloads a .share profile file, and reads parquet directly via cloud credentials that UC vends on the fly. Producers can apply row filters and column masks on the producer side so non-Databricks consumers see filtered, masked data without any consumer-side rewrite. For two Databricks metastores, prefer metastore-based activation (USING ID '...') — it skips the token file entirely.
How do I migrate existing Hive metastore tables into Unity Catalog?
Run an inventory query against hive_metastore.information_schema.tables to flag managed vs external tables per database. External tables migrate in metadata only — point UC at the existing cloud path with CREATE EXTERNAL LOCATION plus CREATE TABLE ... LOCATION '...'. Managed tables hold data in DBFS and need to be re-homed: either CTAS (CREATE TABLE AS SELECT) into a UC managed table in a new external location, or use the SYNC SCHEMA / SYNC TABLE commands for in-place upgrade where supported. After migration, dual-publish for a transition period — writers update both hive_metastore.x and main.y — then retire the Hive side once all consumers cut over. Run the impact-analysis query against system.access.table_lineage to confirm zero traffic on the Hive copy before deleting it.
Practice on PipeCode
- Drill the SQL practice library for data engineers → to keep your dialect-portable muscles warm.
- Rehearse on joins problems → for the lineage and audit cross-tab patterns.
- Stack database design problems → to internalise the catalog → schema → table tree.
- Sharpen with CTE-based problems → for the multi-step lineage and orphan queries.
- Layer the filtering library → for the row-filter-style predicates.
- Practise the conditional muscles with CASE expression problems → — the same shape used inside column masks.
- For the broader interview surface, read top data engineering interview questions →.
- Stack the prerequisites with the only 5 skills you need to become a data engineer →.
- Sharpen the SQL axis with the SQL for data engineering interviews course →.
- For long-form schema craft, work through data modelling for DE interviews →.
- Build the company-context muscles with Databricks-tagged practice problems →.
Pipecode.ai is Leetcode for Data Engineering — every Unity Catalog pattern above ships with hands-on practice rooms where you write the GRANT chain, the row filter UDF, and the lineage join against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your row filter actually behaves the same on a Photon notebook as on a DBSQL warehouse or a Delta Sharing recipient.





Top comments (0)