[Databricks on AWS #0] The Target Architecture: Isolating Prod, Dev, and Sandbox with Unity Catalog

#databricks #lakehouse #dataengineering #governance

📚 Series: Databricks on AWS (Part 0, prologue)

The Target Architecture ← you are here

Building a Databricks AI Platform on AWS

RBAC with Function-Role Groups

Compute Governance: Pools, Policies, Clusters

The BOOTSTRAP_TIMEOUT Mystery

Fixing It with AWS PrivateLink

How We Structure the Terraform

Before the build story, here's the destination. This is the target-state data architecture we designed the whole platform toward — the three principles that shaped every later decision, and the Unity Catalog governance model that keeps production data safe from human hands.

The rest of this series is a build log: workspaces, RBAC, compute, the networking rabbit hole, the Terraform layout. But every one of those decisions was made in service of a target picture we drew first. This post is that picture — the "to-be" architecture, not the scaffolding we happened to have up on any given week.

It's built on three things Databricks basically hands you if you lean into them: the Lakehouse (one store, ACID tables, no separate warehouse to sync), the Medallion architecture (raw → cleaned → integrated → business, each layer a promotion), and Unity Catalog as the single governance plane across all of it. The interesting part isn't reciting those three buzzwords — it's the specific way we wire them so that prod, dev, and analyst sandboxes never step on each other.

Three principles, and everything follows

Almost every concrete rule later in this series is a consequence of one of these three.

1. Nobody touches production by hand. Create, update, delete in prod data happens only through an automated, code-reviewed pipeline running as a service principal. Human accounts don't get write on prod — not analysts, not engineers, not admins. The blast radius of a bad afternoon is capped at whatever a person can do with read-only. This one principle is why the whole "promote" flow later exists.

2. Never copy production to look at it. If an analyst wants to explore the gold layer, they read it in place. Within one metastore that's just Unity Catalog namespace permissions; across metastores or orgs it's Delta Sharing. Either way the bytes don't move. No nightly "analytics copy" job, no storage bill for the same data three times, no stale replica that quietly drifts from the source of truth.

3. Give analysts a room they can trash. Read-only-on-prod sounds clean until an analyst runs the same 200-line WITH query forty times in an afternoon, full-scanning a fact table each run. So we give them a sandbox — a physically isolated catalog where they can write — and encourage them to materialize heavy intermediate results there once. Freedom to write, but walled off from anything that matters.

Hold those three in your head. The zone model below is just their logical consequence.

The zone model

Inside one Unity Catalog metastore we carve out three zones with sharply different permission profiles.

              ┌──────────────── Unity Catalog metastore ────────────────┐
              │                                                          │
 pipeline-sp ─┼─▶ ┌──── PROD (Medallion) ────┐      ┌─── SANDBOX ───┐    │
 (write only) │   │ landing→cleaned→         │ zero │  per-user     │    │
              │   │ integrated→business      │─copy▶│  schemas      │    │
              │   │ read-only for humans     │ read │  free write   │    │
              │   └──────────────────────────┘      └───────────────┘    │
              │                 ▲                          │             │
 humans ──────┼─── read (SELECT)│              free write (ALL)│         │
 ai_analyst   │                 │                                        │
 ai_engineer  │   ┌──── DEV (Medallion) ─────┐                           │
              │   │ same layers, engineers   │  build & validate         │
              │   │ can write                 │  pipelines here          │
              │   └──────────────────────────┘                           │
              └──────────────────────────────────────────────────────────┘

Zone	Catalogs	Who	Privileges	Role
Prod	`landing`, `cleaned`, `integrated`, `business`	pipeline service principal (write), human groups (read)	SP: `USE` `SELECT` `CREATE` `MODIFY` (all); humans: `USE` `SELECT` only	Single source of truth. Writes only via automation.
Dev	same four Medallion catalogs, dev env	`ai_engineer`, `ai_admin`	`USE` `SELECT` `CREATE` `MODIFY`	Where pipeline logic is built and validated before it's promoted.
Sandbox	`sandbox` (per-user schemas)	`ai_analyst`, `ai_engineer`	`ALL` on your own schema; read-only on prod/dev	Free-write experiment space + cost-saving materialization.

The four Medallion catalogs — landing (bronze/raw), cleaned (silver, incremental + cleansing), integrated (silver, dimensions + facts), business (gold, marts) — exist in both prod and dev with an environment prefix. Same shape, different blast radius: in dev engineers write freely; in prod only the pipeline principal does.

The three human groups (ai_admin, ai_engineer, ai_analyst) come straight from the RBAC post — account-level function-role groups. Nothing here invents a new permission system; it just points Unity Catalog's built-in privileges at those groups, per zone.

Pattern 1: zero-copy reads

Production data exists exactly once, and everyone reads that one copy without cloning it.

Inside a single metastore, an analyst in the sandbox querying prod_business.gold.sales_fact is just a permission grant — SELECT, no replication, no extra storage. Cross a metastore or org boundary (say a separate prod org sharing down to an analytics org) and it becomes Delta Sharing: still read-only, still metadata-only, the bytes stay put.

And because human accounts only ever get SELECT on prod, principle #1 holds automatically — there's no code path where a person's typo mutates the gold layer.

Pattern 2: materialize once, in the sandbox

Read-only prod plus a curious analyst equals a specific failure mode: the same expensive query, run over and over, full-scanning a fact table every single time. The fix isn't to lock the analyst down — it's to give them somewhere to land the expensive part once.

-- Materialize the heavy part once, in your own sandbox schema
CREATE TABLE dev_sandbox.sandbox_user_a.monthly_summary AS
SELECT ...
FROM   prod_business.gold.sales_fact   -- zero-copy read (UC grant / Delta Sharing)
WHERE  base_date >= '2026-01-01';

Now every follow-up query hits monthly_summary — a small, physical Delta table the analyst owns — instead of re-scanning the production fact table. The repeated full scans collapse to a single one. Per-user schemas keep everyone's scratch tables from colliding.

Pattern 3: the promote path

So an analyst builds something great in their sandbox. How does it become a real production asset? Not by anyone copying it into prod — principle #1 forbids that. It goes through a promote flow that structurally keeps humans out of the production write:

[Sandbox]  analyst experiments & validates  (sandbox.<user>)
    │
    ▼  hand off — analysis code goes to the data-engineering team
    │
    ▼  code review — query optimization (Z-Order, partition pruning), naming/schema standards
    │
    ▼  pipeline intake — validated in dev, then registered in the prod pipeline
    │
    ▼  productionize — the pipeline service principal writes it into the prod Medallion layers

The last writer is always the pipeline principal. The sandbox is where ideas are born; the promote path is the airlock they pass through to become production data.

The infra standard underneath

Two platform-wide defaults make the cost story actually hold up (more in the compute post):

Serverless SQL Warehouse as the default query engine. Analyst load is spiky and unpredictable, so we don't leave fixed interactive clusters running. Serverless bills per-second while a query runs and auto-terminates within a minute or two of idle — the "someone forgot to shut it down" bill is designed out at the architecture level, not left to discipline.
Cost tracking from Unity Catalog system tables. system.billing.usage and system.query.history feed a live dashboard, so we can see exactly who's running the expensive full scans in the sandbox and which queries burn the most. That same data backs the query-optimization review in the promote path — the decision to materialize or Z-order isn't a guess, it's a number.

The takeaway

Three zones, three permission profiles. Prod is read-only for humans and written only by automation; dev is where engineers build; sandbox is a walled room analysts can write in freely.
Never copy prod. Zero-copy reads (UC grants or Delta Sharing) mean one source of truth, one storage bill, zero drift.
Materialize in the sandbox, promote through review. Heavy intermediates land once in a per-user schema; anything worth keeping goes back to prod only through a code-reviewed pipeline running as a service principal.
Serverless + system tables make the cost model self-enforcing instead of aspirational.

That's the destination. The next five posts are how we actually built toward it — starting with the workspaces and the metastore.