Gowtham Potureddi

Posted on Jun 6

Databricks Unity Catalog: Governance, Lineage, Row/Column Security & Delta Sharing

#python #sql #dataengineering #interview

databricks unity catalog looks, to a junior reading the marketing page, like "the new metastore." That undersells it by roughly an order of magnitude. UC is the first governance layer in the Databricks ecosystem that treats identity, metadata, lineage, audit, and external sharing as one problem solved at the account layer rather than five problems solved per-workspace, per-cluster, per-notebook. The result is the difference between writing a single ABAC policy that protects every PII column in every catalog and writing the same table-ACL three times — once for each cluster's local Hive metastore — and hoping the team that spins up workspace number four remembers to copy it.

This guide is the one you wish existed the day your security team handed you a "who can read which column" matrix and a deadline. It walks through the unity catalog metastore mental model, the three-level namespace (catalog.schema.table) that replaces the legacy database.table form, automatic data lineage at both table and column grain, the four security primitives (GRANT / REVOKE, row-level security filters, column masks), and the delta sharing topology that lets a producer expose a vetted "data products" catalog without copying bytes. Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.

When you want hands-on reps the moment you finish reading, drill the SQL practice library for data engineers →, rehearse on joins problems →, and stack the metadata-aware muscles with database design problems →.

On this page

Why Unity Catalog exists — the pre-UC governance gap
The three-level namespace + metastore architecture
Automatic data lineage — table-level and column-level
Row filters, column masks + GRANT/REVOKE
Delta Sharing topology + production rollout
Cheat sheet — Unity Catalog recipes
Frequently asked questions
Practice on PipeCode

1. Why Unity Catalog exists — the pre-UC governance gap

The legacy Hive metastore was workspace-scoped — that single design decision created every "who can read what" headache UC was built to delete

The one-sentence invariant: before Unity Catalog, every Databricks workspace shipped with its own Hive metastore, its own table ACLs, and its own copy of every identity the security team had to maintain — and there was no way to ask "what does the Finance group see in all workspaces?" without a manual cross-tabulation. Once you internalise that the legacy world's governance unit was the workspace, the entire UC architecture becomes obvious: pull every identity-and-metadata concern up to the account, leave the compute concerns down at the workspace, and write each policy exactly once.

The five things UC unifies.

Identity. SCIM users and groups live at the account level, not per workspace. One source of truth for "who is a pii_reader."
Metastore. One regional metastore replaces N per-workspace Hive metastores. Tables register once, not once per workspace.
Lineage. UC captures table-level and column-level lineage automatically across notebooks, jobs, dashboards, and DLT — no instrumentation.
Audit. Every GRANT, REVOKE, query, share creation, and policy evaluation lands in the system.access.audit system table.
Sharing. Delta Sharing — an open REST + parquet protocol — lets you expose tables and volumes to internal teams or external recipients without copying bytes.

Hive metastore vs Unity Catalog in five contrasts.

Concern	Hive metastore (legacy)	Unity Catalog
Scope	One per workspace	One per region, many workspaces
Identity	Workspace-local groups	Account-level SCIM
Naming	`database.table` (2 parts)	`catalog.schema.table` (3 parts)
ACLs	Per-cluster, table-grain	Account-wide, table + row + column
External storage	Notebook-mounted with secrets	`EXTERNAL LOCATION` + `STORAGE CREDENTIAL`

Why "account-level" matters.

One ACL surface. A GRANT SELECT ON CATALOG main to the analysts group applies in every workspace attached to the metastore. No copy-paste between admin consoles.
Identity federation. SCIM-sync from Okta / Entra / Workday once at the account, propagate everywhere. Removing a leaver becomes a one-step revoke.
Cross-workspace queries. A BI workspace can read prod tables registered in a different workspace, because the metastore — not the workspace — owns the table.
Audit consolidation. system.access.audit collects events from every workspace into one log, queryable like any Delta table.

Cost / latency story.

Metadata only. UC does not sit on the query hot path. Photon and DBSQL still scan parquet from S3/ADLS/GCS directly; UC's role is to authorize the open, not to broker the bytes.
No extra hop. The credential-vending call happens at job start, not per row. After authorization, the cluster has a temporary credential and reads at parquet speed.
The free upgrade. Migration to UC is one of the few "more capability, same cost" moves in modern data platforms. The price is engineering time, not new infrastructure.

What interviewers listen for.

Do you say "account-level identity, regional metastore, workspace-local compute" when asked to draw UC? — senior signal.
Do you mention system.access.audit and system.access.table_lineage when asked how to audit a metric? — required answer.
Do you call the legacy thing "Hive metastore" and not just "the old metastore"? — required answer.
Do you mention that UC needs an external location + storage credential for managed-table writes? — senior signal.

Worked example — the four-workspace governance headache UC deletes

Detailed explanation. A financial services platform runs four Databricks workspaces — prod-east, prod-west, analytics, and sandbox. In the legacy world every workspace had its own Hive metastore, so the pii_readers group had to be created four times, the customer_pii table ACL granted four times, and rotated four times whenever the audit cycle demanded it. After UC, every concern shifts up to the account.

Question. Given the four-workspace topology, how many places does the security team need to maintain the pii_readers group, the customer_pii SELECT grant, and the rotation of access tokens — before UC and after UC?

Input.

Concern	Pre-UC (per workspace)	UC (per metastore)
Group `pii_readers`	4 copies	1 copy
`GRANT SELECT ON customer_pii`	4 grants	1 grant
Token rotation for ETL service	4 rotations	1 rotation
Audit log for `SELECT customer_pii`	4 separate logs	1 `system.access.audit` table

Code.

-- Pre-UC — repeated in every workspace's local Hive metastore
GRANT SELECT ON TABLE finance.customer_pii TO `pii_readers`;
-- ... copy to prod-west, analytics, sandbox

-- UC — written once at the account-attached metastore
USE CATALOG main;
GRANT USAGE  ON CATALOG main             TO `pii_readers`;
GRANT USAGE  ON SCHEMA  main.finance     TO `pii_readers`;
GRANT SELECT ON TABLE   main.finance.customer_pii TO `pii_readers`;

Step-by-step explanation.

In the legacy world the metastore lives inside each workspace, so any grant only authorises that workspace's clusters. Four workspaces means four GRANTs and four future REVOKEs.
UC promotes the metastore to the regional account layer. A single GRANT reaches every workspace attached to the metastore — including new workspaces created tomorrow.
UC also requires the USAGE chain: the principal needs USAGE on the catalog and the schema before the SELECT on the table will resolve. This chain replaces the legacy "workspace ACL + table ACL" model.
Removing the pii_readers group is one SCIM action at the account; in the legacy world it was four workspace edits plus a check that nobody left a stale cluster-level ACL behind.

Output.

Phase	Group copies	Grant copies	Audit logs
Pre-UC (4 workspaces)	4	4	4
UC (1 metastore)	1	1	1

Rule of thumb. If you are still defending workspace-scoped grants in 2026, draw the four-workspace cost table to whoever owns the migration backlog. The savings are literally O(workspaces) per policy edit.

Worked example — discovering what is workspace-scoped before migration

Detailed explanation. Before turning UC on, every team needs an inventory of what is still living in the legacy Hive metastore. The standard query reads from hive_metastore (the alias UC exposes for the legacy metastore) and counts tables by database. That output drives the migration backlog — which databases get a SYNC upgrade, which need a manual rebuild, which can be retired.

Question. Inventory every legacy database, count its tables, and flag any that have managed tables (which need data movement vs reference-only migration).

Input. A typical legacy workspace has 3–10 Hive databases, each with mixed managed and external tables.

database	table_type	row_count
finance	MANAGED	1.2M
finance	EXTERNAL	8.0M
sandbox	MANAGED	200
sandbox	MANAGED	50
legacy_etl	EXTERNAL	12.5M

Code.

SELECT
    database_name,
    COUNT(*)                                         AS total_tables,
    COUNT(*) FILTER (WHERE table_type = 'MANAGED')   AS managed_tables,
    COUNT(*) FILTER (WHERE table_type = 'EXTERNAL')  AS external_tables,
    CASE
        WHEN COUNT(*) FILTER (WHERE table_type = 'MANAGED') > 0
             THEN 'needs data movement'
        ELSE 'reference only — sync metadata'
    END AS migration_path
FROM hive_metastore.information_schema.tables
WHERE database_name NOT IN ('default', 'information_schema')
GROUP BY database_name
ORDER BY total_tables DESC;

Step-by-step explanation.

The query reads from hive_metastore.information_schema.tables — UC surfaces the legacy metastore as if it were a normal catalog. That means every legacy table is queryable through 3LN as hive_metastore.db.table.
COUNT(*) FILTER (WHERE ...) is ANSI-standard conditional counting; it lets us tally managed and external tables in a single pass.
The CASE flags the migration path. Managed tables in the Hive metastore live in DBFS — UC requires them to be re-homed to a UC external location. External tables only need a metadata sync because the bytes already live in your bucket.
Anything in default or information_schema is metadata noise — filter it out so the migration backlog reflects real engineering work.

Output.

database_name	total_tables	managed_tables	external_tables	migration_path
legacy_etl	12	0	12	reference only — sync metadata
finance	7	2	5	needs data movement
sandbox	4	4	0	needs data movement

Rule of thumb. Run the inventory before drawing the catalog tree. The output tells you which databases become UC schemas in main, which become their own dev/sandbox catalogs, and which get retired entirely. Skipping this step is the #1 reason migrations stall.

SQL interview question on legacy-vs-UC governance modelling

A senior interviewer often opens with: "Your team owns three Databricks workspaces. The CISO asks 'who can read the customer_pii table in every workspace today, and how would you change that answer to one query under Unity Catalog?' Walk me through the SQL and the architectural shift."

Solution Using account-level GRANT + `system.access.audit` cross-check

-- Pre-UC — three queries, one per workspace's local metastore
-- (executed sequentially in each workspace)
SHOW GRANTS ON TABLE finance.customer_pii;

-- UC — one query reaches every workspace because the metastore
-- and the audit table are account-scoped.
SELECT
    principal,
    privilege_type,
    object_type,
    object_full_name,
    granted_at
FROM system.information_schema.table_privileges
WHERE object_full_name = 'main.finance.customer_pii'
  AND privilege_type   = 'SELECT'
ORDER BY granted_at DESC;

-- Cross-check against actual reads from system.access.audit
SELECT
    user_identity.email                          AS user_email,
    COUNT(*)                                     AS read_count,
    MAX(event_time)                              AS last_read_at
FROM system.access.audit
WHERE service_name = 'unityCatalog'
  AND action_name  = 'getTable'
  AND request_params.full_name_arg = 'main.finance.customer_pii'
  AND event_time   >= CURRENT_DATE - INTERVAL 30 DAYS
GROUP BY user_identity.email
ORDER BY read_count DESC;

Step-by-step trace.

Step	Pre-UC	UC
1. List grants	run SHOW GRANTS once per workspace	one `information_schema` query
2. Collect into spreadsheet	manual cross-tab in Excel	already cross-tab
3. Cross-check vs actual reads	scrape per-workspace audit logs	join with `system.access.audit`
4. Produce report	days	minutes

The two UC queries form the canonical "policy intent vs policy reality" check. The first lists who is allowed to read; the second lists who actually read. The gap between the two is the principle-of-least-privilege backlog.

Output:

Metric	Pre-UC	UC
Queries required	3 (one per workspace)	1 (account-wide)
Time to report	hours	minutes
Stale ACL detection	manual	`LEFT ANTI JOIN` between privileges and audit
Audit window	per-workspace, variable	account, 365 days

Why this works — concept by concept:

One metastore per region — every UC catalog is reachable from every workspace attached to the metastore, so a single grant covers all of them. The unit of isolation moves from workspace to catalog.
Account-level SCIM — pii_readers is a single group object that SCIM-syncs from your IdP. Adding or removing a member is one action that propagates to every catalog the group is granted on.
system.access.audit is itself a Delta table — you can JOIN it to information_schema.table_privileges to spot "granted but never used" permissions and revoke them. That is the audit pattern your CISO is actually asking for.
USAGE chain is non-negotiable — even an account-admin needs USAGE on the catalog and schema before SELECT on the table resolves. New engineers hit this on day one; document it in the onboarding runbook.
Cost — one extra metadata fetch at job start (O(1)); zero ongoing cost. The audit query is a normal Delta scan — bring your own DBSQL warehouse and pay for one query.

SQL
Topic — SQL fundamentals
SQL practice for data engineers

Practice →

SQL
Topic — design
Database design problems (governance, ACLs)

Practice →

2. The three-level namespace + metastore architecture

`catalog.schema.table` is not just longer naming — it is the unit of isolation, delegation, and policy attachment

The mental model in one line: the legacy database.table form has one boundary; UC's catalog.schema.table form has two — and that extra boundary is where every modern governance pattern (per-LOB isolation, per-environment promotion, certified data products) lives. Once you say "the catalog is the isolation unit; the schema is the domain unit," every UC design question becomes a tree-drawing exercise.

The five tiers, top down.

Account. One per organisation. Owns SCIM identities, audit logs, billing, and the metastores. The CFO of your data estate.
Metastore. One per cloud region. Attached to N workspaces in that region. Owns catalogs and the external-storage credentials.
Catalog. The unit of isolation. Typical names: main, dev, sandbox, finance, growth. A catalog can hold a single LOB or a whole environment.
Schema. Logical grouping inside a catalog. Maps to a domain (sales, hr, marketing) or a product area (loyalty, fraud, attribution).
Object. What lives in a schema: tables, views, materialized views, volumes (governed files — a UC-managed wrapper for unstructured bytes), functions (SQL UDFs), models.

External locations + storage credentials.

STORAGE CREDENTIAL — a UC object that wraps an IAM role / managed identity / service account. The cloud-side principal that the cluster assumes to read your bucket.
EXTERNAL LOCATION — a UC object that maps a URL prefix (e.g. s3://acme-prod-bronze/) to a storage credential and a set of access grants.
The pattern. Notebooks never carry secrets. They reference EXTERNAL LOCATION raw_zone and UC vends a short-lived credential to the cluster at job start.
READ FILES, WRITE FILES, CREATE EXTERNAL TABLE — the three privileges on an external location that determine what the principal can do under that prefix.

Foreign catalogs — Lakehouse Federation.

What it is. A UC catalog whose backing store is not Delta — it is Snowflake, BigQuery, Postgres, MySQL, Redshift, or SQL Server.
What you get. Three-part naming and GRANT semantics over the foreign engine — SELECT * FROM snowflake_prod.public.orders from a Databricks notebook resolves to a pushdown query in Snowflake.
What you do not get. Write-back to the foreign system (currently read-only for most connectors), and lineage at column grain depends on the connector.
The interview probe. "How would you join a Delta table to a Snowflake table without copying either?" — Lakehouse Federation foreign catalog + a normal SQL JOIN.

Two CATALOG flavours.

Standard catalog. Backed by managed Delta tables in a UC-managed storage location. The default for new catalogs.
Foreign catalog (federation). Backed by an external engine. Defined with CREATE FOREIGN CATALOG ... USING CONNECTION ....

Naming tradeoffs.

main as the prod catalog — short, ANSI-friendly, easy to type. Default for greenfield deployments.
prod / dev / sandbox per environment — clear environment isolation; promotion is "rename or copy across catalogs."
Per-LOB catalogs (finance, growth, infra) — clear ownership boundary; group grants land at the catalog root. The pattern most large enterprises end up with after a year.

Common interview probes on the namespace.

"Why does Postgres / dbt always need a schema and a database?" — because the ANSI standard 3LN was always catalog.schema.table; legacy engines collapsed the first two.
"Can two catalogs in the same metastore share a schema name?" — yes. main.sales and dev.sales are distinct schemas.
"What is a volume?" — a UC-governed wrapper for unstructured data (PDFs, images, logs). You GRANT READ VOLUME the same way you GRANT SELECT.
"What is the difference between a managed and an external Delta table?" — managed tables live in the catalog's managed storage location and UC owns their lifecycle; external tables live in a path you control and UC only owns the metadata.

Worked example — the GRANT chain you must walk to read a table

Detailed explanation. A new analyst joins the growth team. They try SELECT * FROM main.marketing.touches and get PERMISSION_DENIED on TABLE. The bug is not the SELECT grant — that was already in place. The missing piece is the USAGE chain: UC requires the principal to have USAGE on the catalog and USAGE on the schema before any SELECT resolves. This is the most common day-one ticket in any UC rollout.

Question. Given the growth_analysts group, write the three GRANT statements that allow them to SELECT * FROM main.marketing.touches. Explain why a single SELECT grant is insufficient.

Input.

Principal	Object	Current grants
`growth_analysts`	`main` (catalog)	—
`growth_analysts`	`main.marketing` (schema)	—
`growth_analysts`	`main.marketing.touches` (table)	`SELECT` (granted yesterday)

Code.

-- The three-step USAGE chain
GRANT USAGE  ON CATALOG main                        TO `growth_analysts`;
GRANT USAGE  ON SCHEMA  main.marketing              TO `growth_analysts`;
GRANT SELECT ON TABLE   main.marketing.touches      TO `growth_analysts`;

-- Verify in one query
SELECT grantee, privilege_type, table_catalog, table_schema, table_name
FROM system.information_schema.table_privileges
WHERE grantee = 'growth_analysts';

Step-by-step explanation.

UC's authorization check walks the 3LN tree top-down: catalog → schema → table. At each level the principal must have at least USAGE (the "I can see this exists" privilege).
Without USAGE ON CATALOG main, the planner cannot even resolve the name main — the error is "catalog not found" or "permission denied at catalog level."
Without USAGE ON SCHEMA main.marketing, the planner cannot list tables in the schema — same error at the schema level.
Only after both USAGEs are in place does the SELECT grant become operative. The table grant on its own is necessary but not sufficient.
The verification query reads from information_schema.table_privileges, which is the canonical UC view for "show me every privilege this principal holds."

Output.

grantee	privilege_type	table_catalog	table_schema	table_name
growth_analysts	SELECT	main	marketing	touches

Rule of thumb. Bake the USAGE chain into your provisioning automation. Whenever a group gets a new table grant, the script must also issue (or confirm) the parent catalog and schema USAGE grants. Eight out of ten "UC is broken" tickets dissolve once this is automated.

Worked example — registering an external location for a multi-team raw zone

Detailed explanation. A platform team owns an S3 prefix s3://acme-prod-bronze/ that holds raw landings from twelve ingest pipelines. Pre-UC, every pipeline mounted the bucket and embedded an IAM access key in cluster-init. Post-UC, the platform team registers the prefix once as an EXTERNAL LOCATION bound to a STORAGE CREDENTIAL, then grants WRITE FILES to the ingest groups. Notebooks never see a secret again.

Question. Register the credential, the external location, and the WRITE FILES grant for the ingest_pipelines group. Show the verification query.

Input.

Object	Cloud-side	UC-side
IAM role	`arn:aws:iam::123456789012:role/databricks-bronze-writer`	`STORAGE CREDENTIAL cred_bronze_writer`
S3 prefix	`s3://acme-prod-bronze/`	`EXTERNAL LOCATION raw_zone`
Principal	n/a	`ingest_pipelines` group

Code.

-- 1) Bind the cloud role to a UC storage credential
CREATE STORAGE CREDENTIAL cred_bronze_writer
WITH AWS_IAM_ROLE 'arn:aws:iam::123456789012:role/databricks-bronze-writer'
COMMENT 'Bronze-zone writer role for ingest pipelines';

-- 2) Bind a URL prefix to that credential
CREATE EXTERNAL LOCATION raw_zone
    URL 's3://acme-prod-bronze/'
    WITH (STORAGE CREDENTIAL cred_bronze_writer)
    COMMENT 'Multi-team bronze landing zone';

-- 3) Grant write to the ingest group
GRANT READ FILES, WRITE FILES, CREATE EXTERNAL TABLE
    ON EXTERNAL LOCATION raw_zone
    TO `ingest_pipelines`;

-- 4) Verify
SELECT name, url, credential_name, owner, created_at
FROM system.information_schema.external_locations
WHERE name = 'raw_zone';

Step-by-step explanation.

The STORAGE CREDENTIAL is the UC wrapper for the IAM role. It is a single object; many external locations can reference the same credential.
The EXTERNAL LOCATION is the URL-prefix → credential binding. UC checks that no two external locations overlap — s3://acme-prod-bronze/ and s3://acme-prod-bronze/team-a/ cannot both exist.
The grant chain on an external location is READ FILES / WRITE FILES / CREATE EXTERNAL TABLE — these are URL-prefix privileges, not table privileges.
After the grant, an ingest notebook can run COPY INTO main.raw.events FROM 's3://acme-prod-bronze/events/' and UC will vend a short-lived credential bound to cred_bronze_writer. No secret hits the notebook.
The verification query confirms the location is registered, owned by the right group, and bound to the expected credential.

Output.

name	url	credential_name	owner	created_at
raw_zone	s3://acme-prod-bronze/	cred_bronze_writer	data_platform	2026-06-06 09:14:00

Rule of thumb. One credential per cloud principal (role / managed identity / service account); one external location per logical zone (bronze, silver, gold, archive). Resist the urge to create one external location per bucket prefix per team — it creates a thicket UC has to keep coherent.

Worked example — federating a Snowflake catalog as a foreign catalog

Detailed explanation. A team has a Snowflake warehouse holding production CRM data. They do not want to copy it into Delta, but they need to JOIN it to event tables in main.events. Lakehouse Federation lets them register the Snowflake account as a CONNECTION, then create a FOREIGN CATALOG that surfaces every Snowflake schema and table as if it were native Delta — three-part naming and all.

Question. Register the Snowflake connection, create the foreign catalog, and write a JOIN that combines snowflake_crm.public.contacts with main.events.signups.

Input.

Component	Value
Snowflake host	`acme.snowflakecomputing.com`
Snowflake user	`db_federation_user`
Snowflake warehouse	`COMPUTE_WH`
Foreign catalog name	`snowflake_crm`
Delta catalog	`main`

Code.

-- 1) Create a UC connection to Snowflake
CREATE CONNECTION snowflake_prod TYPE SNOWFLAKE
OPTIONS (
    host        'acme.snowflakecomputing.com',
    port        '443',
    sfWarehouse 'COMPUTE_WH',
    user        secret('snowflake', 'user'),
    password    secret('snowflake', 'password')
);

-- 2) Create the foreign catalog
CREATE FOREIGN CATALOG snowflake_crm
    USING CONNECTION snowflake_prod
    OPTIONS (database 'CRM_PROD');

-- 3) Grant USAGE for analysts
GRANT USAGE ON CATALOG snowflake_crm TO `growth_analysts`;
GRANT USAGE ON SCHEMA  snowflake_crm.public TO `growth_analysts`;

-- 4) The query — joins across two engines, no copy
SELECT
    c.contact_id,
    c.email,
    s.signup_at
FROM snowflake_crm.public.contacts c
JOIN main.events.signups            s ON s.email = c.email
WHERE s.signup_at >= CURRENT_DATE - INTERVAL 7 DAYS;

Step-by-step explanation.

The CONNECTION object captures the how to reach Snowflake — host, warehouse, credential. Secrets live in the Databricks secret scope, not inline.
The FOREIGN CATALOG is the what to expose — one Snowflake database becomes one UC catalog. Schemas and tables auto-discover.
The USAGE chain is identical to native catalogs. The analyst experience is "another catalog in the explorer" — no special Snowflake awareness required.
The JOIN plan pushes down the contact-side predicates to Snowflake, fetches only the necessary columns into Databricks, and finishes the JOIN locally. The optimiser owns the split.
Lineage captures the foreign catalog as a source — you can see "this gold table reads from Snowflake" in the Catalog Explorer graph.

Output.

contact_id	email	signup_at
8821	alice@example.com	2026-06-04 14:22:00
8822	bob@example.com	2026-06-05 09:01:00

Rule of thumb. Federate before you copy. If the data lives somewhere already, a foreign catalog is one DDL away and gives you correctness, lineage, and grants for free. Only physically copy when latency, cost, or transformation complexity makes federation unworkable.

SQL interview question on namespace design

A senior interviewer might frame this as: "Your platform team is launching UC on a region with three lines of business (finance, growth, infra), four environments (prod, stage, dev, sandbox), and a 'data products' notion. Sketch the catalog tree and the GRANT model."

Solution Using a `{env}_{lob}` catalog matrix + a certified `dataproducts` catalog

-- The catalog matrix — twelve catalogs for the {env, lob} cross-product
CREATE CATALOG prod_finance;
CREATE CATALOG prod_growth;
CREATE CATALOG prod_infra;
CREATE CATALOG stage_finance;
CREATE CATALOG stage_growth;
-- ... (and so on)
CREATE CATALOG sandbox_finance;

-- The certified shareable surface — one cross-LOB catalog
CREATE CATALOG dataproducts
COMMENT 'Certified data products. Read-only to consumers. Owned by data_platform.';

-- The grant model — three levels of audience
GRANT USAGE  ON CATALOG prod_finance    TO `finance_engineers`;
GRANT USAGE  ON CATALOG dataproducts    TO `account_users`;     -- everyone reads certified
GRANT CREATE SCHEMA ON CATALOG sandbox_finance TO `finance_engineers`;

Step-by-step trace.

Catalog	Audience	Privileges granted
prod_finance	finance_engineers	USAGE; SELECT per schema
stage_finance	finance_engineers	USAGE; CREATE SCHEMA
sandbox_finance	finance_engineers	USAGE; CREATE SCHEMA; OWNER
dataproducts	account_users (everyone)	USAGE; SELECT (read-only)
dataproducts	data_platform	OWNER; CREATE SCHEMA; CREATE TABLE

The matrix gives each LOB a private playground (sandbox), a staging area, and a prod zone — and one cross-LOB catalog (dataproducts) where certified, governed datasets are shared without giving every consumer USAGE on the producer's prod catalog.

Output:

Question	Answer
How many catalogs total?	13 (12 matrix + 1 dataproducts)
Who can write to prod?	LOB engineers via CI/CD service principal
How does finance share with growth?	Promote table into `dataproducts`
Where does sandbox data live?	`sandbox_{lob}` — short retention, no production grants

Why this works — concept by concept:

Catalog is the isolation unit — by making catalogs per {env, lob}, every group grant lands at the catalog root and inherits down. New schemas inside the catalog automatically get the same audience.
dataproducts as a curated surface — separating "raw / staged / prod tables" (inside each LOB catalog) from "shareable, contracted datasets" (inside dataproducts) lets you ship certified views without exposing the internal data model.
USAGE chain enforces the tree — a finance engineer never needs USAGE on prod_growth; the catalogs are physically separate and the grant model says so.
Sandbox catalogs — every engineer gets CREATE SCHEMA in their LOB's sandbox catalog. Experimentation never pollutes prod, and there is one obvious place to clean up.
Cost — catalog creation is metadata-only. There is no per-catalog license or compute cost. The audit and lineage system tables remain a single account-scoped store.

SQL
Topic — design
Schema and catalog design problems

Practice →

SQL
Topic — dimensional modeling
Dimensional modelling drills

Practice →

3. Automatic data lineage — table-level and column-level

Lineage that "just appears" is the single most under-appreciated UC feature — and the basis of every modern impact-analysis answer

The mental model in one line: UC infers lineage from the query plan — every notebook, job, DLT pipeline, and dashboard that reads or writes via Photon or DBSQL emits a node into the lineage graph automatically, at both table grain (which table writes which table) and column grain (which source columns feed which target column). The engineer's job is to read the graph, not build it.

What UC captures automatically.

Producer → table edges. Every notebook, scheduled job, dashboard, DLT pipeline, and SQL warehouse query writes an edge into system.access.table_lineage.
Table → table edges. When a notebook reads main.raw.orders and writes main.silver.orders_clean, UC records the read-write dependency.
Column → column edges. When a transformation maps orders.amount + customers.region to daily_revenue.gmv, UC stores the column-level mapping in system.access.column_lineage.
Cross-workload. SQL warehouses, Photon clusters, DLT pipelines, scheduled jobs (notebook and script tasks), and AI agent runs all contribute.

Two grains, one graph.

Table grain is the right currency for impact analysis ("if I drop main.raw.orders, what breaks downstream?"). Faster to query, fewer edges, smaller graph.
Column grain is the right currency for PII tracing and data-contract validation ("does any downstream column depend on customers.ssn?"). More edges, more cost, slower query.
Both grains are populated automatically — there is no opt-in flag for column lineage on supported runtimes.

Where to read the lineage.

Catalog Explorer UI. Click a table → "Lineage" tab → see the upstream/downstream DAG. The default audience for analysts and stewards.
REST API. /api/2.0/lineage-tracking/table-lineage and /api/2.0/lineage-tracking/column-lineage. Programmatic access for impact-analysis automation.
System tables. system.access.table_lineage and system.access.column_lineage. SQL queryable, joinable to anything else.

The 90-day retention window.

Default. UC retains lineage events for 90 days. Older events drop off.
Long-term storage. Stream the system tables into your own gold tables for retention beyond 90 days — useful for regulated industries.
Audit pairing. Lineage + system.access.audit = "who read which table at which time and where the result flowed."

What lineage does not capture (limits).

Structured Streaming foreachBatch sinks. The arbitrary code inside the lambda is opaque to UC.
Non-Databricks consumers. A Power BI dashboard that uses Delta Sharing as an open-protocol consumer is captured as a share event but not as a downstream Power BI graph node.
External writes from non-UC clusters. If a cluster bypasses UC entirely, its writes are invisible to the lineage graph.

Common interview probes on lineage.

"How would you find every downstream table impacted by deleting a column?" — query system.access.column_lineage for every edge whose source_column_name = 'ssn'.
"What is the retention of the lineage graph?" — 90 days; stream to a gold table for longer.
"Is lineage captured for DLT?" — yes, at both grains. DLT is one of the best lineage producers because the framework knows the transformation declaratively.
"Does Lakehouse Federation contribute lineage?" — yes, at table grain. The federated query is captured as a read from the foreign catalog.

Worked example — impact analysis "what breaks if I drop this column?"

Detailed explanation. A data engineer needs to drop customers.ssn (the company is moving SSN to a tokenized vault). Before doing it, they need to know every downstream table whose columns depend on customers.ssn. The answer is one query against system.access.column_lineage.

Question. Find every downstream column that has main.crm.customers.ssn as a source.

Input. The column lineage table has one row per (source_column, target_column) pair, with the producer notebook/job and the event time.

source_table_full_name	source_column_name	target_table_full_name	target_column_name
main.crm.customers	ssn	main.compliance.ssn_lookup	ssn_hash
main.crm.customers	ssn	main.compliance.ssn_lookup	masked_ssn
main.crm.customers	ssn	main.gold.kyc_dashboard	ssn_last4
main.crm.customers	email	main.gold.kyc_dashboard	email

Code.

SELECT
    target_table_full_name,
    target_column_name,
    source_column_name,
    MAX(event_time) AS last_seen
FROM system.access.column_lineage
WHERE source_table_full_name = 'main.crm.customers'
  AND source_column_name     = 'ssn'
  AND event_time >= CURRENT_DATE - INTERVAL 90 DAYS
GROUP BY
    target_table_full_name,
    target_column_name,
    source_column_name
ORDER BY target_table_full_name, target_column_name;

Step-by-step explanation.

The query filters column_lineage to a single source column. Every row in the result is a column that depends — directly or transitively in the recorded edge — on customers.ssn.
event_time >= CURRENT_DATE - INTERVAL 90 DAYS matches the UC retention window; anything older was already aged out of the lineage table.
The GROUP BY ... MAX(event_time) collapses multiple edge events into one row per target column with the most recent observation. Useful for distinguishing "actively read" vs "historic edge."
The result is the impact list: every downstream target that needs a rewrite (or a deprecation plan) before the source column can be dropped.
To go one more hop downstream (transitive impact), feed the target tables back into the same query as source candidates.

Output.

target_table_full_name	target_column_name	source_column_name	last_seen
main.compliance.ssn_lookup	masked_ssn	ssn	2026-06-05
main.compliance.ssn_lookup	ssn_hash	ssn	2026-06-05
main.gold.kyc_dashboard	ssn_last4	ssn	2026-06-04

Rule of thumb. Make this query a saved view (vw_column_impact). Every "can I change this column?" ticket becomes a one-parameter call. Same query, different source_column_name — instant impact list.

Worked example — finding orphan tables nobody reads

Detailed explanation. Over time, every lakehouse accumulates tables no consumer reads. They cost storage, pollute Catalog Explorer, and dilute trust. The lineage table tells you exactly which tables have zero read edges in the last N days — i.e. orphans.

Question. Find every table in main.gold that has no read activity in the last 60 days.

Input. A registry of all tables in main.gold plus the table_lineage event log.

Code.

WITH gold_tables AS (
    SELECT
        CONCAT_WS('.', table_catalog, table_schema, table_name) AS full_name
    FROM system.information_schema.tables
    WHERE table_catalog = 'main'
      AND table_schema  = 'gold'
      AND table_type    = 'MANAGED'
),
recent_reads AS (
    SELECT DISTINCT source_table_full_name AS full_name
    FROM system.access.table_lineage
    WHERE event_time >= CURRENT_DATE - INTERVAL 60 DAYS
      AND source_table_full_name LIKE 'main.gold.%'
)
SELECT g.full_name AS orphan_table
FROM   gold_tables g
LEFT   JOIN recent_reads r ON r.full_name = g.full_name
WHERE  r.full_name IS NULL
ORDER  BY g.full_name;

Step-by-step explanation.

gold_tables lists every managed table in the main.gold schema — the candidate population.
recent_reads lists every main.gold.* table that appears as a source (i.e. was read by something) in the last 60 days. DISTINCT collapses the per-event rows into per-table rows.
The LEFT JOIN ... WHERE r.full_name IS NULL is the canonical anti-join pattern — keeps every gold table that does not appear in the recent-reads list.
The result is the orphan list. Each row is a candidate for deletion, archival, or — if it is a quarterly report — better discovery metadata so consumers find it.
Add a created_at >= CURRENT_DATE - INTERVAL 14 DAYS exclusion if you want to skip brand-new tables that have not had time to acquire readers yet.

Output.

orphan_table
main.gold.legacy_revenue_q1
main.gold.test_aggregate
main.gold.unused_rollup

Rule of thumb. Run the orphan scan monthly. Tag each orphan with a last_read annotation, give owners 30 days to claim, then archive. The lineage table makes this a 20-line query — pre-UC it was a manual archaeology project.

Worked example — auditing PII reads with lineage + audit join

Detailed explanation. Compliance asks: "show me every user who read any column derived from customers.ssn in the last 30 days." The answer joins column_lineage (find downstream columns) → table_lineage (find tables) → system.access.audit (find readers).

Question. List every user_email that ran a query against a table downstream of main.crm.customers.ssn in the last 30 days.

Input. Three system tables joined.

Code.

WITH ssn_targets AS (
    SELECT DISTINCT target_table_full_name AS table_name
    FROM system.access.column_lineage
    WHERE source_table_full_name = 'main.crm.customers'
      AND source_column_name     = 'ssn'
      AND event_time >= CURRENT_DATE - INTERVAL 90 DAYS
)
SELECT
    a.user_identity.email                  AS user_email,
    a.request_params.full_name_arg         AS table_read,
    COUNT(*)                               AS read_count,
    MAX(a.event_time)                      AS last_read_at
FROM system.access.audit a
JOIN ssn_targets         t
  ON a.request_params.full_name_arg = t.table_name
WHERE a.service_name = 'unityCatalog'
  AND a.action_name  = 'getTable'
  AND a.event_time  >= CURRENT_DATE - INTERVAL 30 DAYS
GROUP BY
    a.user_identity.email,
    a.request_params.full_name_arg
ORDER BY read_count DESC;

Step-by-step explanation.

ssn_targets collapses the column lineage into a table list — every table that has at least one column derived from customers.ssn.
The audit table records one row per UC getTable event, which corresponds to a user opening / scanning a table. Join the two on the table name.
The aggregate produces "user → table → read_count" — the compliance answer in one page.
Filter on service_name = 'unityCatalog' and action_name = 'getTable' to ignore unrelated audit events (cluster start, secret access, etc).
The 30-day window for the audit join is narrower than the 90-day lineage retention — the two are independent retention horizons.

Output.

user_email	table_read	read_count	last_read_at
alice@acme.com	main.compliance.ssn_lookup	12	2026-06-05 14:00
bob@acme.com	main.gold.kyc_dashboard	4	2026-06-03 09:22
carol@acme.com	main.compliance.ssn_lookup	1	2026-05-28 11:10

Rule of thumb. Bake this query into a quarterly compliance report. The lineage hop expands PII auditing from "what was read directly" to "what was read transitively" — which is the question regulators actually ask.

SQL interview question on lineage-based impact analysis

A senior interviewer often frames this as: "You are dropping main.raw.events.user_ip to comply with a regional privacy law. Walk me through the queries you would run to (a) find every downstream table impacted, (b) find every notebook/job that needs an edit, and (c) the rollout order."

Solution Using a two-step lineage join (column → table → producer)

-- 1) Downstream tables impacted
WITH impacted_tables AS (
    SELECT DISTINCT target_table_full_name AS table_full_name
    FROM   system.access.column_lineage
    WHERE  source_table_full_name = 'main.raw.events'
      AND  source_column_name     = 'user_ip'
      AND  event_time >= CURRENT_DATE - INTERVAL 90 DAYS
),
-- 2) Producers (notebooks / jobs / DLT) that write each impacted table
producers AS (
    SELECT
        tl.target_table_full_name,
        tl.entity_type,
        tl.entity_id,
        tl.entity_name,
        MAX(tl.event_time) AS last_write
    FROM   system.access.table_lineage tl
    JOIN   impacted_tables it
      ON   it.table_full_name = tl.target_table_full_name
    WHERE  tl.event_time >= CURRENT_DATE - INTERVAL 90 DAYS
    GROUP  BY
        tl.target_table_full_name,
        tl.entity_type,
        tl.entity_id,
        tl.entity_name
)
SELECT * FROM producers
ORDER BY target_table_full_name, last_write DESC;

Step-by-step trace.

Step	Reads from	Output cardinality	Purpose
1. impacted_tables	column_lineage	small (5–50)	"what depends on user_ip"
2. producers	table_lineage	medium (10–200)	"what writes each impacted table"
3. final select	producers	medium	the rollout backlog

The two-step pattern — column lineage to find impacted tables, then table lineage to find the producers — is the canonical "what do I need to edit" answer. The output rows are the engineering backlog, ordered by table.

Output:

target_table_full_name	entity_type	entity_name	last_write
main.silver.events_geo	NOTEBOOK	nb_geo_enrichment	2026-06-05
main.gold.daily_traffic	JOB	daily_traffic_job	2026-06-05
main.gold.daily_traffic	DLT_PIPELINE	gold_traffic_dlt	2026-06-05

Why this works — concept by concept:

column_lineage drives table_lineage — start at the column you want to change, walk up the column graph to find target columns, then up again via the table graph to find producers. The hop count is finite (usually 1–3).
entity_type tells the editor what to edit — NOTEBOOK, JOB, DLT_PIPELINE, SQL_QUERY. Each demands a different change-management workflow.
90-day retention is enough for impact analysis — anything that has not written in 90 days is either dormant or deprecated. Treat it as a candidate for archival, not a blocker.
One query, no instrumentation — pre-UC, this answer required reading every notebook in source control. UC reduces it to a join.
Cost — two LEFT JOIN scans over the lineage system tables. Optimised by Databricks for sub-second response on typical metastore sizes. Bring your own DBSQL warehouse.

SQL
Topic — joins
JOIN problems for lineage analysis

Practice →

SQL
Topic — ctes
CTE-based multi-step queries

Practice →

4. Row filters, column masks + GRANT/REVOKE

`row-level security` filters and `column masks` are functions you attach to a table — and they beat dynamic views for maintenance, cross-tool consistency, and audit

The mental model in one line: a row filter is a SQL UDF that returns BOOLEAN and decides whether each row is visible; a column mask is a SQL UDF that returns the (possibly rewritten) column value — and once attached to a table, both fire automatically for every reader, every tool, every workspace. The win versus the older "build a dynamic view per audience" pattern is that the table stays one object — there is no view tree to keep in sync.

The four primitives.

GRANT — give a privilege (SELECT, MODIFY, USAGE, EXECUTE, ...) on an object (table, schema, catalog, function, volume) to a principal (user, group, service principal).
REVOKE — the symmetric opposite. Identical syntax, opposite effect.
Row filter function. A SQL UDF that returns BOOLEAN and is attached to a table via ALTER TABLE ... SET ROW FILTER. The function fires per row; rows where it returns FALSE or NULL are dropped from the result.
Column mask function. A SQL UDF that takes the column value as input and returns the (possibly rewritten) value. Attached via ALTER TABLE ... ALTER COLUMN ... SET MASK. Fires per row, per column.

Two helper functions every policy uses.

is_account_group_member('group_name') — returns TRUE if the current principal is a member of the named account-level group. The cleanest way to write "if the reader is a PII reader."
current_user() — returns the email of the current principal. Useful for per-user filters ("show only rows owned by the current user").

Row filters in detail.

Signature. A SQL UDF whose return type is BOOLEAN. Arguments correspond to one or more columns of the protected table.
Application. ALTER TABLE main.sales.orders SET ROW FILTER us_only_filter ON (region) — UC passes the region column into us_only_filter for every row.
Combinable with group membership. Use is_account_group_member() inside the filter so administrators see all rows, regional analysts see their region only.
Detach. ALTER TABLE main.sales.orders DROP ROW FILTER — removes the filter without altering the function.

Column masks in detail.

Signature. A SQL UDF whose argument is the column value and whose return type matches. Optional additional arguments for context (e.g. the row's region).
Application. ALTER TABLE main.hr.employees ALTER COLUMN ssn SET MASK ssn_mask — UC routes the original ssn value through ssn_mask for every reader.
Conditional unmasking. Put CASE WHEN is_account_group_member('pii_readers') THEN ssn ELSE ... END inside the mask body — same physical column, different visible value per audience.
Detach. ALTER TABLE main.hr.employees ALTER COLUMN ssn DROP MASK.

Dynamic views vs filters/masks.

Dynamic view. A view defined with CREATE VIEW ... AS SELECT ... CASE WHEN is_account_group_member('pii') THEN ssn ELSE '***' END .... The audience is encoded in the view body.
Why filters/masks win. One table, one policy. No view tree to maintain. Same policy applies whether the reader hits DBSQL, a notebook, a federated tool, or a Delta Sharing recipient.
When dynamic views still help. When the policy logic depends on a JOIN to another table (e.g. an entitlements table); row filters and masks are scalar UDFs by design.

Attribute-Based Access Control (ABAC).

What. Tag-driven policies — apply a pii=true tag to a column once, and a tenant-wide rule "mask every pii=true column for non-pii_readers" fires automatically.
Status (2026). Preview / GA depending on cloud and region. The direction of travel for the next two years.
Why care. It replaces per-table policy admin with per-tag policy admin. The maintenance graph collapses from O(tables × audiences) to O(tags × audiences).

The GRANT chain (always required).

GRANT USAGE ON CATALOG main TO group — see the catalog.
GRANT USAGE ON SCHEMA main.sales TO group — see the schema.
GRANT SELECT ON TABLE main.sales.orders TO group — read the table (subject to filter + mask).

Audit — where every check lands.

system.access.audit. One row per privileged action. Includes the action name (getTable, grantPrivilege, ...), the principal, the object, the request params, the source IP.
Cross-tabulation with privileges. Join audit with information_schema.table_privileges to surface "granted but never used" privileges — the principle-of-least-privilege backlog.

Worked example — a row filter that limits orders to the reader's region

Detailed explanation. A sales.orders table holds orders from every region. EU analysts should see only EU rows; APAC only APAC; global admins all rows. A single row filter function — looking at the row's region column and the reader's group membership — encodes the policy.

Question. Define a row filter region_filter(region) that returns TRUE when the reader is in the corresponding regional group OR the global readers group. Attach it to main.sales.orders.

Input.

order_id	region	amount
1	EU	100
2	US	80
3	APAC	50
4	EU	200

Code.

-- 1) The row filter function
CREATE OR REPLACE FUNCTION main.sec.region_filter(region STRING)
RETURN
    is_account_group_member('global_readers')
    OR (region = 'EU'   AND is_account_group_member('eu_analysts'))
    OR (region = 'US'   AND is_account_group_member('us_analysts'))
    OR (region = 'APAC' AND is_account_group_member('apac_analysts'));

-- 2) Attach to the table
ALTER TABLE main.sales.orders
    SET ROW FILTER main.sec.region_filter ON (region);

-- 3) Grant EXECUTE on the function to readers
GRANT EXECUTE ON FUNCTION main.sec.region_filter TO `account_users`;

Step-by-step explanation.

The function returns BOOLEAN. Its first branch — is_account_group_member('global_readers') — short-circuits TRUE for admins, so the filter never excludes them.
Each regional branch tests both the row's region and the reader's group membership. The pattern scales to N regions by adding N branches.
ALTER TABLE ... SET ROW FILTER ... ON (region) registers the filter and tells UC which column to pass in. UC injects the function into every read plan automatically.
GRANT EXECUTE ON FUNCTION is not the same as SELECT on the table — the principal needs both. Without EXECUTE on the filter function, the read fails with "function not accessible."
The filter fires for every read: DBSQL, notebooks, federated tools, Delta Sharing recipients. There is one policy, applied everywhere.

Output (what an EU analyst sees).

order_id	region	amount
1	EU	100
4	EU	200

Rule of thumb. Put one boolean "is_admin" branch at the very top of every row filter. It makes admin debugging trivial and prevents the filter from accidentally hiding rows during a policy change.

Worked example — a column mask for SSN with conditional unmasking

Detailed explanation. The hr.employees table stores a raw ssn column. Only the pii_readers group is allowed to see the full value; everyone else sees the last-4 form. A column mask attached to the table enforces this without rewriting any reader's queries.

Question. Define a mask ssn_mask(ssn) that returns the full SSN for pii_readers and XXX-XX-#### (last 4 digits) for everyone else. Attach it to main.hr.employees.ssn.

Input.

name	ssn
alice	123-45-6789
bob	987-65-4321
carol	555-12-3456

Code.

-- 1) The mask function — input ssn, output (possibly rewritten) ssn
CREATE OR REPLACE FUNCTION main.sec.ssn_mask(ssn STRING)
RETURN
    CASE
        WHEN is_account_group_member('pii_readers')
            THEN ssn
        ELSE
            'XXX-XX-' || RIGHT(ssn, 4)
    END;

-- 2) Attach to the column
ALTER TABLE main.hr.employees
    ALTER COLUMN ssn SET MASK main.sec.ssn_mask;

-- 3) Grant EXECUTE on the function
GRANT EXECUTE ON FUNCTION main.sec.ssn_mask TO `account_users`;

Step-by-step explanation.

The mask takes the column value as input and returns the visible value. The function is called once per row, per read.
For a member of pii_readers, the mask returns the input unchanged — the full SSN is visible.
For any other reader, the mask returns 'XXX-XX-' || RIGHT(ssn, 4) — the masked form. The same physical row produces different visible values depending on the audience.
Like row filters, column masks fire across every read path: notebooks, DBSQL, Delta Sharing recipients (if shared). Replacing a per-tool ACL with a single mask.
RIGHT(ssn, 4) is the Databricks SQL idiom for the last four characters. The mask intentionally leaks no more than four digits — the standard PII-tolerable disclosure.

Output — what a non-PII reader sees.

name	ssn
alice	XXX-XX-6789
bob	XXX-XX-4321
carol	XXX-XX-3456

Rule of thumb. Define mask functions in a dedicated sec schema (main.sec.ssn_mask, main.sec.email_mask, ...). One place to audit, one place to revoke EXECUTE if a function becomes deprecated.

Worked example — combining filter + mask + USAGE on one table

Detailed explanation. A single main.sales.orders table can carry both a row filter (region) and a column mask (customer_email). The two compose: the filter runs first and drops rows; the mask runs on surviving rows and rewrites values. The reader needs the full USAGE chain plus EXECUTE on both functions.

Question. Show the complete policy bundle — row filter, column mask, USAGE chain, function EXECUTE — for an EU analyst reading main.sales.orders.

Input.

order_id	region	customer_email	amount
1	EU	alice@example.com	100
2	US	bob@example.com	80
3	EU	carol@example.com	50

Code.

-- Functions
CREATE OR REPLACE FUNCTION main.sec.region_filter(region STRING)
RETURN is_account_group_member('global_readers')
    OR (region = 'EU' AND is_account_group_member('eu_analysts'));

CREATE OR REPLACE FUNCTION main.sec.email_mask(email STRING)
RETURN CASE
    WHEN is_account_group_member('pii_readers') THEN email
    ELSE CONCAT(LEFT(email, 2), '***@', SPLIT_PART(email, '@', 2))
END;

-- Attach
ALTER TABLE main.sales.orders SET ROW FILTER main.sec.region_filter ON (region);
ALTER TABLE main.sales.orders ALTER COLUMN customer_email SET MASK main.sec.email_mask;

-- Grants — full bundle for eu_analysts
GRANT USAGE   ON CATALOG  main                              TO `eu_analysts`;
GRANT USAGE   ON SCHEMA   main.sales                        TO `eu_analysts`;
GRANT SELECT  ON TABLE    main.sales.orders                 TO `eu_analysts`;
GRANT EXECUTE ON FUNCTION main.sec.region_filter            TO `eu_analysts`;
GRANT EXECUTE ON FUNCTION main.sec.email_mask               TO `eu_analysts`;

Step-by-step explanation.

The filter runs first. For an EU analyst, only region = 'EU' rows survive — orders 1 and 3.
The mask runs second. The EU analyst is not in pii_readers, so customer_email is rewritten as al***@example.com, ca***@example.com.
Both functions must be granted EXECUTE to the reader. The error if either is missing is "function not accessible" — a frequent first-day mistake.
The full USAGE chain (catalog + schema + table SELECT) is additional — function EXECUTE alone is not sufficient to read the table.
Bundle the seven grants into one script and check it into source control. Re-running the script is idempotent; UC silently no-ops if the privilege already exists.

Output — what the EU analyst sees.

order_id	region	customer_email	amount
1	EU	al***@example.com	100
3	EU	ca***@example.com	50

Rule of thumb. Treat row filter + column mask as one policy bundle. Bundle them into a single SQL script per table, grant function EXECUTE in the same script, and tag the script with the table's full name. When the policy changes, you edit one file.

SQL interview question on row-level + column-level security composition

A senior interviewer often opens with: "Design the row filter + column mask + GRANT model so that EU analysts see EU rows with masked emails, US analysts see US rows with masked emails, and compliance officers see all rows with full emails. Show the SQL."

Solution Using one row filter + one column mask + three groups

-- One filter for region-based row visibility
CREATE OR REPLACE FUNCTION main.sec.region_filter(region STRING)
RETURN is_account_group_member('compliance_officers')
    OR (region = 'EU' AND is_account_group_member('eu_analysts'))
    OR (region = 'US' AND is_account_group_member('us_analysts'));

-- One mask for email visibility
CREATE OR REPLACE FUNCTION main.sec.email_mask(email STRING)
RETURN CASE
    WHEN is_account_group_member('compliance_officers') THEN email
    ELSE CONCAT(LEFT(email, 2), '***@', SPLIT_PART(email, '@', 2))
END;

-- Attach both
ALTER TABLE main.sales.orders SET ROW FILTER main.sec.region_filter ON (region);
ALTER TABLE main.sales.orders ALTER COLUMN customer_email SET MASK main.sec.email_mask;

-- Three audiences, one policy
GRANT USAGE   ON CATALOG  main                       TO `eu_analysts`, `us_analysts`, `compliance_officers`;
GRANT USAGE   ON SCHEMA   main.sales                 TO `eu_analysts`, `us_analysts`, `compliance_officers`;
GRANT SELECT  ON TABLE    main.sales.orders          TO `eu_analysts`, `us_analysts`, `compliance_officers`;
GRANT EXECUTE ON FUNCTION main.sec.region_filter     TO `eu_analysts`, `us_analysts`, `compliance_officers`;
GRANT EXECUTE ON FUNCTION main.sec.email_mask        TO `eu_analysts`, `us_analysts`, `compliance_officers`;

Step-by-step trace.

Reader group	Rows visible	Email visibility
eu_analysts	EU only	masked (`al***@example.com`)
us_analysts	US only	masked
compliance_officers	all regions	full (`alice@example.com`)
(no group above)	none	n/a — SELECT denied at table level

The filter+mask combination keeps the same physical table while presenting three audience-specific views. There is no view tree to maintain and the audit trail is clean: every read shows the same main.sales.orders target with the principal's group on the row.

Output:

audience	orders rows	email column visible
eu_analysts	2 (EU only)	masked
us_analysts	1 (US only)	masked
compliance_officers	3 (all)	full

Why this works — concept by concept:

One filter function, three branches — the function encodes every audience's row visibility in one place. Adding APAC is a one-line edit, not a per-table change.
is_account_group_member() is the policy language — it reads the principal's current effective groups. No application-side identity propagation required.
Mask + filter compose — filter runs first (drops rows), mask runs second (rewrites columns). The optimiser inlines both into the read plan.
Function EXECUTE is independent of table SELECT — both grants are required. Missing EXECUTE produces a "function not accessible" error that looks like a table permission bug but is not. Catch it in onboarding.
Cost — both functions are scalar UDFs. Their cost is per-row CPU only; no I/O, no extra scan. The optimiser usually pushes the filter predicate into the file-scan stage so unread rows never leave storage.

SQL
Topic — case expression
CASE expression problems (mask logic)

Practice →

SQL
Topic — filtering
Filtering problems (row filter analogues)

Practice →

5. Delta Sharing topology + production rollout

`delta sharing` is the open REST + parquet protocol that turns "send me a CSV every Monday" into "subscribe to this share and read it from Power BI"

The mental model in one line: Delta Sharing is one protocol (REST + parquet) with two activation flavours (open-protocol token-based for non-Databricks recipients; metastore-based for Databricks-to-Databricks) — and the unit of sharing is a SHARE object that bundles tables, schemas, partitions, volumes, and AI models on the producer side. Once you say "one share, many recipients, zero copy," every cross-team data-product conversation becomes a configuration exercise rather than a pipeline-building exercise.

Three producer-side objects.

PROVIDER (recipient-facing) — the producer side automatically becomes a provider when it creates its first share. The recipient sees a "provider" handle that bundles all the producer's shares to that recipient.
SHARE — a bundle of objects. CREATE SHARE analytics_share creates an empty share; subsequent ALTER SHARE ... ADD TABLE adds objects.
RECIPIENT — the consumer registration. Two activation modes: open-protocol (recipient downloads a credential file) or metastore-based (Databricks-to-Databricks identity federation, no token leak).

Two recipient activation modes.

Token-based (open protocol). UC mints a bearer_token and a profile file. The recipient downloads the profile and uses any open-protocol client (Pandas, Power BI, Tableau, Spark). The token is the secret; rotate periodically.
Metastore-based (Databricks-to-Databricks). Both producer and recipient are Databricks metastores; UC handles identity federation. No token file changes hands. The "do this when both sides are Databricks" preferred mode.

Five object kinds you can share.

Tables. The canonical case. Delta tables (managed or external) share at parquet-file granularity.
Partitions. Share only a subset of a partitioned table — e.g. only the region = 'EU' partitions.
Schemas. Share an entire schema; new tables in the schema auto-flow into the share.
Volumes. Share governed file data (PDFs, logs, ML artefacts).
AI models. Share registered models in UC's model registry; the recipient can model.load_predict directly.

Cross-cloud, cross-region.

The protocol is cloud-agnostic. A producer on AWS can share to a recipient on Azure or GCP — UC vends short-lived cloud credentials and the recipient reads the parquet directly.
Egress. The producer pays cross-cloud / cross-region egress on the parquet read. Factor that into pricing models.
Latency. Cross-region adds the obvious propagation latency on the first parquet open; subsequent reads benefit from any client-side cache.

Open-protocol consumers — first-party libraries.

Power BI. Native Delta Sharing connector since 2024. Point at the profile file, pick a share, get a model.
Tableau. Native connector via the Delta Sharing client.
Pandas. pip install delta-sharing → delta_sharing.load_as_pandas('profile.share#share.schema.table').
Spark. df = spark.read.format('deltaSharing').load('share.schema.table') after the open-protocol JAR is on the cluster.

Production rollout — the discovery-first sequence.

Step 1 — Discovery. Identify candidate tables via Catalog Explorer + lineage + business-glossary tags. Producing teams propose shares; consuming teams request them.
Step 2 — Tagging. Apply governance tags (pii=true, tier=gold, sla=1h) to every shareable column. ABAC policies later attach to tags rather than per-table.
Step 3 — Policies. Attach row filters + column masks. The share inherits the policies — recipients see filtered, masked data without any per-share rewrite.
Step 4 — Lineage SLA. Confirm the producer pipeline emits lineage so consumers can answer "is this current?" via system.access.table_lineage.
Step 5 — Sharing. Create the share, add the objects, register the recipient, send the activation link. Smoke-test with a known-good query.

The "data products" catalog pattern.

One certified catalog (dataproducts) holds shareable, contracted tables — not the producer's raw or staging tables.
Each share points only into dataproducts.*. Producers promote their tables into dataproducts after the data contract, governance tags, and SLAs are in place.
Discovery is centralised — every consumer browses dataproducts in Catalog Explorer and requests a share rather than spelunking through prod catalogs.

Common interview probes on Delta Sharing.

"Difference between token-based and metastore-based activation?" — token file is mailed out-of-band; metastore-based uses UC identity federation, no token.
"Can you share a table that has a row filter?" — yes. The recipient sees filtered rows; the filter runs on the producer side.
"Can the recipient write back to the share?" — no, sharing is read-only by design. For two-way, both sides set up shares in opposite directions.
"What does the producer pay?" — egress cost on parquet reads (cross-cloud / cross-region only) and the metadata-call cost. Both small relative to ETL.

Worked example — share three tables to a Databricks-to-Databricks recipient

Detailed explanation. A producer on AWS wants to share main.sales.orders, main.sales.customers, and main.analytics.daily_revenue to an analytics team running its own Databricks metastore on Azure. Both sides have UC, so the right mode is metastore-based — no token file, no out-of-band secret.

Question. Create the share, add the three tables, register the Azure metastore as the recipient, and grant the share to it.

Input.

Object	Producer-side full name
Orders fact	main.sales.orders
Customer dim	main.sales.customers
Gold revenue	main.analytics.daily_revenue
Recipient metastore id	`azure-metastore-east-id`

Code.

-- 1) Create an empty share
CREATE SHARE IF NOT EXISTS analytics_share
COMMENT 'Certified analytics tables for Azure consumer team';

-- 2) Add objects
ALTER SHARE analytics_share ADD TABLE main.sales.orders;
ALTER SHARE analytics_share ADD TABLE main.sales.customers;
ALTER SHARE analytics_share ADD TABLE main.analytics.daily_revenue
    PARTITION (region = 'EU');   -- share only the EU partitions

-- 3) Register the Azure metastore as a recipient
CREATE RECIPIENT azure_analytics_team
USING ID 'aws:us-east-1:azure-metastore-east-id'
COMMENT 'Azure analytics team — D2D mode, no token file';

-- 4) Grant the share to the recipient
GRANT SELECT ON SHARE analytics_share TO RECIPIENT azure_analytics_team;

Step-by-step explanation.

CREATE SHARE is metadata-only. No data moves; the share is a logical bundle.
ALTER SHARE ... ADD TABLE registers each object. The PARTITION (region = 'EU') clause shows partition-level sharing — only EU rows are exposed.
The CREATE RECIPIENT ... USING ID form is the metastore-based activation. The ID encodes the cloud, the region, and the recipient's UC metastore identifier. No secret file is generated.
The GRANT SELECT ON SHARE step is what makes the share visible to the recipient. On the recipient side, the share appears as a "provider" in Catalog Explorer.
The recipient mounts the share by creating a catalog from it: CREATE CATALOG analytics_from_aws USING SHARE 'aws-provider.analytics_share'. Three-part naming then resolves on the consumer side.

Output (recipient view).

catalog	schema	table
analytics_from_aws	sales	orders
analytics_from_aws	sales	customers
analytics_from_aws	analytics	daily_revenue

Rule of thumb. Default to metastore-based when both sides are Databricks. Tokens are a 2024 compatibility tool; the 2026 best practice is D2D when available. Reserve open-protocol tokens for Power BI / Pandas / Tableau consumers.

Worked example — share to an open-protocol Power BI consumer

Detailed explanation. An external partner team uses Power BI on a non-Databricks stack. They want to read main.gold.daily_revenue from inside Power BI's Delta Sharing connector. The right activation mode is token-based: produce a credential file, send it via secure channel, partner imports it into Power BI Desktop.

Question. Create the share, register a token-based recipient, and produce the credential file.

Input.

Component	Value
Share name	`partner_revenue_share`
Table	`main.gold.daily_revenue`
Recipient	`partner_powerbi`

Code.

-- 1) Share
CREATE SHARE partner_revenue_share;
ALTER SHARE partner_revenue_share ADD TABLE main.gold.daily_revenue;

-- 2) Token-based recipient — UC generates a profile file
CREATE RECIPIENT partner_powerbi
USING SHARING_CODE
COMMENT 'External partner — open-protocol, Power BI Desktop';

-- 3) Issue the activation URL (one-time)
SELECT activation_link
FROM   system.information_schema.recipient_activations
WHERE  recipient_name = 'partner_powerbi'
ORDER  BY created_at DESC
LIMIT  1;

-- 4) Grant
GRANT SELECT ON SHARE partner_revenue_share TO RECIPIENT partner_powerbi;

Step-by-step explanation.

The share is created the same way as a D2D share. The difference is recipient-side, not producer-side.
CREATE RECIPIENT ... USING SHARING_CODE tells UC to mint a bearer_token and an activation URL. The partner clicks the URL once to download a .share profile file.
The activation URL is single-use and time-bounded. The recipient must click it before it expires; you can rotate it from system.information_schema.recipient_activations.
The partner opens Power BI Desktop → Get Data → Delta Sharing → load the .share file → pick partner_revenue_share → import the table.
Token rotation is a ROTATE command on the recipient — issue a new activation URL, send to the partner, revoke the old one after grace period.

Output (partner's Power BI view).

share	schema	table	columns
partner_revenue_share	gold	daily_revenue	date, region, gmv

Rule of thumb. Send the activation URL via your secure-comms channel (Slack DM, encrypted email). The .share file itself can be sent through normal channels — it points back to UC's REST endpoint. The activation URL is the secret; the .share file is the address.

Worked example — production rollout: discovery → tagging → policies → share

Detailed explanation. A platform team is rolling Delta Sharing to ten internal consumer teams. The "just create shares" anti-pattern leaks raw bronze tables and gets policy exceptions filed within a week. The discovery-first sequence catches every governance concern before the share goes live.

Question. Walk through the five-step rollout for a new "marketing attribution" data product.

Input. A candidate gold table main.gold.attribution_daily with PII columns customer_email and customer_ip.

Code.

-- Step 1 — Discovery: confirm consumer demand, certify the table
SELECT
    target_table_full_name,
    COUNT(DISTINCT entity_name) AS distinct_consumers,
    MAX(event_time) AS last_read
FROM   system.access.table_lineage
WHERE  target_table_full_name = 'main.gold.attribution_daily'
GROUP  BY target_table_full_name;

-- Step 2 — Tagging: mark PII and tier
ALTER TABLE main.gold.attribution_daily
    SET TAGS ('tier' = 'gold', 'product' = 'marketing-attribution', 'sla' = '6h');
ALTER TABLE main.gold.attribution_daily ALTER COLUMN customer_email
    SET TAGS ('pii' = 'true', 'classification' = 'restricted');
ALTER TABLE main.gold.attribution_daily ALTER COLUMN customer_ip
    SET TAGS ('pii' = 'true', 'classification' = 'restricted');

-- Step 3 — Policies: attach a mask to PII columns
ALTER TABLE main.gold.attribution_daily
    ALTER COLUMN customer_email SET MASK main.sec.email_mask;
ALTER TABLE main.gold.attribution_daily
    ALTER COLUMN customer_ip SET MASK main.sec.ip_mask;

-- Step 4 — Promote to the dataproducts catalog
CREATE OR REPLACE TABLE dataproducts.marketing.attribution_daily
    DEEP CLONE main.gold.attribution_daily;

-- Step 5 — Share
CREATE SHARE attribution_share;
ALTER SHARE attribution_share ADD TABLE dataproducts.marketing.attribution_daily;
CREATE RECIPIENT marketing_team_share USING SHARING_CODE;
GRANT SELECT ON SHARE attribution_share TO RECIPIENT marketing_team_share;

Step-by-step explanation.

Discovery confirms that there are real consumers — the lineage query shows ten distinct entities reading the table over the last 30 days. Without consumer demand, do not share.
Tagging registers pii=true on the sensitive columns. The platform-wide ABAC policy ("mask pii=true for non-pii_readers") then attaches automatically — no per-share rule needed.
Policies apply masks. Even if a downstream consumer accidentally promotes the share to a wider audience, the mask travels with the table.
Promotion to dataproducts decouples the share from the producer's prod catalog. Producers can iterate on main.gold.attribution_daily (e.g. add columns) without breaking the share contract — only the DEEP CLONE refresh promotes the change.
Sharing is the last step, not the first. By the time the recipient receives the activation URL, every PII concern has already been addressed.

Output — the rollout artefacts.

Step	Artefact	Owner
1	Consumer demand confirmed	platform
2	Tags applied to columns	platform + steward
3	Masks attached	platform
4	dataproducts table created	producer
5	Share + recipient + grant	platform

Rule of thumb. The five-step sequence is non-negotiable for any externally-facing share. Skipping step 2 (tagging) is the leading cause of PII leakage in Delta Sharing rollouts — the table looks fine to the producer, the consumer sees raw PII, and the audit lands in the wrong inbox.

SQL interview question on Delta Sharing topology and rollout

A senior interviewer often frames this as: "Design the share + policy + catalog promotion for a sensitive table that needs to go to one Databricks-to-Databricks recipient (analytics team) and one open-protocol recipient (Power BI consumer). Walk me through every object you would create."

Solution Using a `dataproducts` catalog + two recipients on one share

-- 0) Promote into the certified catalog (already governance-tagged + masked upstream)
CREATE OR REPLACE TABLE dataproducts.marketing.attribution_daily
    DEEP CLONE main.gold.attribution_daily;

-- 1) Share
CREATE SHARE attribution_share
COMMENT 'Certified attribution table; PII masked';
ALTER SHARE attribution_share
    ADD TABLE dataproducts.marketing.attribution_daily;

-- 2) Two recipients on the same share
-- D2D for analytics team
CREATE RECIPIENT analytics_team_d2d
USING ID 'aws:us-east-1:analytics-metastore-id';

-- Open-protocol for Power BI partner
CREATE RECIPIENT partner_powerbi
USING SHARING_CODE;

-- 3) Grant the same share to both
GRANT SELECT ON SHARE attribution_share TO RECIPIENT analytics_team_d2d;
GRANT SELECT ON SHARE attribution_share TO RECIPIENT partner_powerbi;

-- 4) Inspect — who is sharing what to whom
SELECT
    s.share_name,
    sr.recipient_name,
    r.authentication_type,
    s.created_at
FROM   system.information_schema.shares          s
JOIN   system.information_schema.share_recipients sr
  ON   sr.share_name = s.share_name
JOIN   system.information_schema.recipients      r
  ON   r.recipient_name = sr.recipient_name
WHERE  s.share_name = 'attribution_share';

Step-by-step trace.

Step	Object created	Purpose
0	`dataproducts.marketing.attribution_daily`	Certified, contract-bound table
1	`attribution_share` + ADD TABLE	The shareable bundle
2a	`analytics_team_d2d` recipient	Metastore-based, no token
2b	`partner_powerbi` recipient	Token-based via SHARING_CODE
3	Two `GRANT SELECT ON SHARE` calls	One share, two audiences
4	Verification query	Cross-tab of share x recipient x mode

The pattern fans two recipient flavours off a single share. The producer manages one share object; UC handles the two activation modes underneath. PII masks travel with the table — the open-protocol consumer never sees raw values.

Output:

share_name	recipient_name	authentication_type	created_at
attribution_share	analytics_team_d2d	DATABRICKS	2026-06-06 10:00
attribution_share	partner_powerbi	TOKEN	2026-06-06 10:01

Why this works — concept by concept:

dataproducts catalog as the share surface — promotes the producer's volatile gold table into a contracted, tagged, masked snapshot. Recipients depend on the snapshot, not the producer's day-to-day edits.
One share, two recipients, two modes — UC decouples "what is shared" (the share object) from "how it is consumed" (the recipient mode). Adding a third recipient is a single CREATE RECIPIENT + GRANT pair.
Masks inherit — because the masks are attached on the source side, the consumer sees masked values regardless of whether they connect via D2D or open-protocol. No per-share rewrite.
system.information_schema as the audit surface — every share/recipient relationship is queryable as Delta. The same JOIN powers your data-products dashboard and your compliance report.
Cost — share metadata is free. The producer pays cross-cloud / cross-region egress on parquet reads only. D2D within the same cloud and region is effectively free.

SQL
Topic — joins
JOIN problems (share / recipient cross-tabs)

Practice →

SQL
Topic — etl
ETL problems (data product promotion)

Practice →

Cheat sheet — Unity Catalog recipes

Fully qualified name. SELECT * FROM main.sales.orders — three-part name resolves catalog → schema → table. The legacy database.table form is silently rewritten to hive_metastore.database.table.
GRANT chain — copy this for every new group. GRANT USAGE ON CATALOG main TO group; GRANT USAGE ON SCHEMA main.sales TO group; GRANT SELECT ON TABLE main.sales.orders TO group. Skipping the USAGE rungs is the #1 day-one ticket.
Row filter. ALTER TABLE main.sales.orders SET ROW FILTER region_filter ON (region). Function must return BOOLEAN; grant EXECUTE ON FUNCTION to readers.
Column mask. ALTER TABLE main.hr.employees ALTER COLUMN ssn SET MASK ssn_mask. Function takes the column and returns the visible value; grant EXECUTE ON FUNCTION to readers.
External location. CREATE EXTERNAL LOCATION raw_zone URL 's3://bucket/raw' WITH (STORAGE CREDENTIAL cred_iam). Notebooks reference the location, never the secret. Grant READ FILES / WRITE FILES to the ingest groups.
Delta Share — D2D. CREATE SHARE analytics; ALTER SHARE analytics ADD TABLE main.sales.orders; CREATE RECIPIENT azure_team USING ID '...'; GRANT SELECT ON SHARE analytics TO RECIPIENT azure_team.
Delta Share — open-protocol. Replace USING ID with USING SHARING_CODE; UC mints a one-time activation URL. Send via secure channel, never inline.
Lineage — column impact. SELECT * FROM system.access.column_lineage WHERE source_table_full_name = 'main.crm.customers' AND source_column_name = 'ssn' — the "what depends on this column" answer in one query.
Lineage — producer audit. SELECT entity_type, entity_name, MAX(event_time) FROM system.access.table_lineage WHERE target_table_full_name = 'main.gold.daily_revenue' GROUP BY 1, 2 ORDER BY 3 DESC.
Audit — who read PII. SELECT user_identity.email, COUNT(*) FROM system.access.audit WHERE service_name = 'unityCatalog' AND action_name = 'getTable' AND request_params.full_name_arg = 'main.crm.customers' GROUP BY 1.
Foreign catalog (Snowflake). CREATE CONNECTION snowflake_prod TYPE SNOWFLAKE OPTIONS (...); CREATE FOREIGN CATALOG snowflake_crm USING CONNECTION snowflake_prod OPTIONS (database 'CRM_PROD'). Three-part naming works across engines.
Tag a column. ALTER TABLE main.gold.attribution ALTER COLUMN customer_email SET TAGS ('pii' = 'true', 'classification' = 'restricted') — feeds future ABAC policies.
Promote into dataproducts. CREATE OR REPLACE TABLE dataproducts.marketing.attribution_daily DEEP CLONE main.gold.attribution_daily. The promoted snapshot becomes the share target.
Inspect grants for a principal. SELECT * FROM system.information_schema.table_privileges WHERE grantee = 'finance_engineers'.
Detach a filter or mask. ALTER TABLE ... DROP ROW FILTER and ALTER TABLE ... ALTER COLUMN ... DROP MASK. Both are metadata-only and instant.

Frequently asked questions

What is the difference between Unity Catalog and the Hive metastore?

The Hive metastore is workspace-scoped, lives inside each Databricks workspace, and uses two-part database.table naming. Unity Catalog is account-scoped, lives once per region, and uses three-part catalog.schema.table naming. Beyond the namespace, UC adds account-level SCIM identity, automatic lineage at table and column grain, the system.access.audit log, row filter and column mask primitives, and Delta Sharing — all of which the legacy Hive metastore lacks. The migration path keeps Hive metastore tables reachable as hive_metastore.db.table so consumers can adopt UC at their own pace.

Can one Unity Catalog metastore span multiple Databricks workspaces?

Yes — and that is the entire point. One UC metastore is attached to N workspaces in the same cloud region, so a SCIM group like pii_readers and a grant like GRANT SELECT ON TABLE main.finance.customer_pii TO pii_readers is written once and effective everywhere. Pre-UC, the same policy had to be copied per workspace and frequently drifted out of sync. The metastore is a per-region object, not a per-workspace one; if you span two cloud regions you have two metastores and operate them independently (Delta Sharing handles cross-region data flow).

How does Unity Catalog capture column-level lineage automatically?

UC inspects the query plan of every notebook, scheduled job, DLT pipeline, and SQL warehouse statement that runs on a UC-enabled cluster. From the plan, UC infers which source columns feed each target column and emits a row into system.access.column_lineage. There is no instrumentation flag, no decorator, no opt-in — column lineage is on by default for SQL warehouses, Photon clusters, and DLT pipelines. The 90-day retention window is the default; teams that need longer history stream the system table into a long-term gold table.

Row filter functions vs dynamic views — which should I use?

Prefer row filters and column masks whenever the policy is scalar — that is, the visibility of a row or a column value depends only on the row itself and the reader's group membership. Row filters and masks attach to the table, so every reader (notebook, DBSQL, federated tool, Delta Sharing recipient) sees the same policy and the audit trail is clean. Dynamic views are still the right tool when the policy logic requires a JOIN to another table — e.g. an entitlements table that lists which customer_id values each user can see. As ABAC matures over 2026, tag-driven policies will further reduce the number of cases where dynamic views are needed.

Does Delta Sharing work with non-Databricks consumers like Power BI or Pandas?

Yes — Delta Sharing is an open REST + parquet protocol with first-party client libraries for Power BI, Tableau, Pandas, and Spark. The producer creates a recipient with USING SHARING_CODE, UC issues a one-time activation URL, the consumer downloads a .share profile file, and reads parquet directly via cloud credentials that UC vends on the fly. Producers can apply row filters and column masks on the producer side so non-Databricks consumers see filtered, masked data without any consumer-side rewrite. For two Databricks metastores, prefer metastore-based activation (USING ID '...') — it skips the token file entirely.

How do I migrate existing Hive metastore tables into Unity Catalog?

Run an inventory query against hive_metastore.information_schema.tables to flag managed vs external tables per database. External tables migrate in metadata only — point UC at the existing cloud path with CREATE EXTERNAL LOCATION plus CREATE TABLE ... LOCATION '...'. Managed tables hold data in DBFS and need to be re-homed: either CTAS (CREATE TABLE AS SELECT) into a UC managed table in a new external location, or use the SYNC SCHEMA / SYNC TABLE commands for in-place upgrade where supported. After migration, dual-publish for a transition period — writers update both hive_metastore.x and main.y — then retire the Hive side once all consumers cut over. Run the impact-analysis query against system.access.table_lineage to confirm zero traffic on the Hive copy before deleting it.

Practice on PipeCode

Drill the SQL practice library for data engineers → to keep your dialect-portable muscles warm.
Rehearse on joins problems → for the lineage and audit cross-tab patterns.
Stack database design problems → to internalise the catalog → schema → table tree.
Sharpen with CTE-based problems → for the multi-step lineage and orphan queries.
Layer the filtering library → for the row-filter-style predicates.
Practise the conditional muscles with CASE expression problems → — the same shape used inside column masks.
For the broader interview surface, read top data engineering interview questions →.
Stack the prerequisites with the only 5 skills you need to become a data engineer →.
Sharpen the SQL axis with the SQL for data engineering interviews course →.
For long-form schema craft, work through data modelling for DE interviews →.
Build the company-context muscles with Databricks-tagged practice problems →.

Pipecode.ai is Leetcode for Data Engineering — every Unity Catalog pattern above ships with hands-on practice rooms where you write the GRANT chain, the row filter UDF, and the lineage join against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your row filter actually behaves the same on a Photon notebook as on a DBSQL warehouse or a Delta Sharing recipient.

Practice SQL now →
Databricks-tagged drills →

1. Why Unity Catalog exists — the pre-UC governance gap

The legacy Hive metastore was workspace-scoped — that single design decision created every "who can read what" headache UC was built to delete

Worked example — the four-workspace governance headache UC deletes

Worked example — discovering what is workspace-scoped before migration

SQL interview question on legacy-vs-UC governance modelling

Solution Using account-level GRANT + system.access.audit cross-check

2. The three-level namespace + metastore architecture

catalog.schema.table is not just longer naming — it is the unit of isolation, delegation, and policy attachment

Worked example — the GRANT chain you must walk to read a table

Worked example — registering an external location for a multi-team raw zone

Worked example — federating a Snowflake catalog as a foreign catalog

SQL interview question on namespace design

Solution Using a {env}_{lob} catalog matrix + a certified dataproducts catalog

3. Automatic data lineage — table-level and column-level

Lineage that "just appears" is the single most under-appreciated UC feature — and the basis of every modern impact-analysis answer

Worked example — impact analysis "what breaks if I drop this column?"

Worked example — finding orphan tables nobody reads

Worked example — auditing PII reads with lineage + audit join

SQL interview question on lineage-based impact analysis

Solution Using a two-step lineage join (column → table → producer)

4. Row filters, column masks + GRANT/REVOKE

row-level security filters and column masks are functions you attach to a table — and they beat dynamic views for maintenance, cross-tool consistency, and audit

Worked example — a row filter that limits orders to the reader's region

Worked example — a column mask for SSN with conditional unmasking

Worked example — combining filter + mask + USAGE on one table

SQL interview question on row-level + column-level security composition

Solution Using one row filter + one column mask + three groups

5. Delta Sharing topology + production rollout

delta sharing is the open REST + parquet protocol that turns "send me a CSV every Monday" into "subscribe to this share and read it from Power BI"

Worked example — share three tables to a Databricks-to-Databricks recipient

Worked example — share to an open-protocol Power BI consumer

Worked example — production rollout: discovery → tagging → policies → share

SQL interview question on Delta Sharing topology and rollout

Solution Using a dataproducts catalog + two recipients on one share

Cheat sheet — Unity Catalog recipes

Frequently asked questions

What is the difference between Unity Catalog and the Hive metastore?

Can one Unity Catalog metastore span multiple Databricks workspaces?

How does Unity Catalog capture column-level lineage automatically?

Row filter functions vs dynamic views — which should I use?

Does Delta Sharing work with non-Databricks consumers like Power BI or Pandas?

How do I migrate existing Hive metastore tables into Unity Catalog?

Practice on PipeCode

Solution Using account-level GRANT + `system.access.audit` cross-check

`catalog.schema.table` is not just longer naming — it is the unit of isolation, delegation, and policy attachment

Solution Using a `{env}_{lob}` catalog matrix + a certified `dataproducts` catalog

`row-level security` filters and `column masks` are functions you attach to a table — and they beat dynamic views for maintenance, cross-tool consistency, and audit

`delta sharing` is the open REST + parquet protocol that turns "send me a CSV every Monday" into "subscribe to this share and read it from Power BI"

Solution Using a `dataproducts` catalog + two recipients on one share