[Databricks on AWS #1] Building a Databricks AI Platform on AWS: Two Workspaces, One Unity Catalog Metastore

#databricks #aws #terraform #dataengineering

📚 Series: Databricks on AWS (Part 1)

1. Building a Databricks AI Platform on AWS ← you are here
2. RBAC with Function-Role Groups
3. Compute Governance: Pools, Policies, Clusters
4. The BOOTSTRAP_TIMEOUT Mystery
5. Fixing It with AWS PrivateLink
6. How We Structure the Terraform

We stood up two workspaces, one shared Unity Catalog metastore, a customer-managed VPC, and wired the whole thing through Terraform + Atlantis so nobody clicks in a console. This post is the map for everything that follows.

Every "just spin up Databricks" tutorial stops at the point where the workspace turns green. Real platforms don't stop there. You have to decide how data flows, where it lives, who governs it, and how a change gets from a pull request to running infrastructure without someone SSH-ing into a jump box at 2am.

This series is the honest version of that build — a Databricks AI platform on AWS, provisioned entirely as code, with all the sharp edges we caught our shins on. Part 1 is the architecture and the ground rules. Later parts go deep on RBAC, compute governance, and the networking rabbit hole that ate a week of my life (spoiler: a firewall silently dropping SYN packets).

Let me lay out the whole thing.

The shape of the platform

Two workspaces. One metastore. One VPC. That's the headline.

                 ┌─────────────────────────────────────┐
                 │       Unity Catalog metastore       │
                 │          (one per region)           │
                 └──────────────────┬──────────────────┘
                            assigned to both
            ┌───────────────────────┴───────────────────────┐
            │                                               │
┌───────────▼───────────┐                       ┌───────────▼───────────┐
│   landing workspace   │                       │  pipeline workspace   │
│ ingest + interactive  │                       │ transform + batch/job │
│       analytics       │                       │   model train/serve   │
└───────────┬───────────┘                       └───────────┬───────────┘
            │                                               │
            └───────────────────────┬───────────────────────┘
                                    │
                      customer-managed VPC (shared)
                   shared root S3 + cross-account IAM

Why two workspaces instead of one? Because the two halves of the platform have genuinely different personalities.

The landing workspace is where raw data arrives and where analysts poke at it interactively. It's bursty, human-driven, notebook-heavy. The pipeline workspace is where the scheduled jobs live — transformation, feature engineering, model training and serving. It's automated, batch-shaped, and you do not want an analyst's runaway all-purpose cluster competing with a production training job for pool capacity.

Splitting by data flow (ingest/interactive vs. transform/batch) instead of by team gives you a clean blast radius. A misbehaving interactive cluster can't starve the batch plane. Compute policies, instance pools, and cluster budgets get tuned per personality instead of averaged into mush.

But — and this is the part people get wrong — splitting the workspaces does not mean splitting the data governance.

One metastore to govern them all

Unity Catalog's metastore is regional: one metastore per region, per account. That's not a suggestion, it's a hard limit, and it turns out to be exactly what you want here.

Both workspaces attach to the same metastore. That means a table registered by an ingest job in the landing workspace is immediately governable, grantable, and queryable (subject to permissions) from the pipeline workspace — same three-level namespace (catalog.schema.table), same lineage graph, same audit trail. No copying, no federation, no "which workspace has the real version" archaeology.

metastore (region-wide)
 ├── catalog: landing_*     ← raw / bronze
 ├── catalog: pipeline_*    ← silver / gold
 └── grants + lineage span BOTH workspaces

The workspaces are the compute boundary. The metastore is the governance boundary. Keeping those two concepts separate is the single most important design decision in the whole platform.

Customer-managed VPC, shared storage

The data plane runs in a customer-managed VPC — our network, our subnets, our security groups — not the Databricks-managed default. On a real platform behind a security team, that's non-negotiable: you need the compute to live where your egress controls, flow logs, and inspection already are.

In our case the VPC and subnets were created by the infra team and reused, not provisioned by this repo. Terraform just registers them with Databricks via databricks_mws_networks. That distinction bites later (Part 4 is a whole post about a cluster that wouldn't boot because a firewall in that borrowed network was eating packets), so it's worth stating up front: owning the workspace config is not the same as owning the network.

Storage is shared too. Both workspaces sit on one root S3 bucket, reached through a single cross-account IAM role and one Databricks credential config. The metastore gets its own storage root plus a dedicated data-access role. Standard stuff — S3 gateway endpoint, STS and Kinesis interface endpoints, a security group allowing 443 — but "standard" only if someone actually built it, which is again the infra team, not this repo.

Everything through Terraform + Atlantis

No console clicking. The entire platform — workspaces, networks, storage registration, the metastore, catalogs, schemas, groups, grants — is Terraform, wrapped in Terragrunt, applied through Atlantis off GitLab merge requests.

The flow is boring in the best way:

open MR  →  atlantis plan (comment on the MR)  →  review the plan
        →  merge  →  atlantis apply  →  infra changes

Two things make this actually work at platform scale:

1. A single automation service principal owns every apply. Humans don't hold the keys; the SP does. (More on the permissions that SP needs below; it's a gotcha.)
2. Ordering is explicit. Unity Catalog resources have real dependencies, so we apply in a fixed sequence: metastore → external locations → catalogs → schemas → groups → grants. Skip ahead and apply grants before the groups exist and you get a Group not found error. Ask me how I know.
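
To make the ordering rule concrete, here is a minimal Python sketch of a pre-apply check. It is not part of our Atlantis setup; the stage identifiers and the first_violation helper are hypothetical names invented for illustration, and the only thing taken from the platform is the sequence itself.

```python
from typing import Optional, Sequence, Tuple

# The fixed apply sequence described above. The identifiers are illustrative
# labels, not real Terragrunt module names.
APPLY_ORDER = [
    "metastore",
    "external_locations",
    "catalogs",
    "schemas",
    "groups",
    "grants",
]


def first_violation(plan: Sequence[str]) -> Optional[Tuple[str, str]]:
    """Return the first adjacent pair (earlier, later) that is out of order.

    `plan` is the order in which stages are about to be applied. A plan is
    valid when every stage comes no later than the stages that follow it in
    APPLY_ORDER, so checking adjacent pairs is enough. Returns None if the
    plan respects the sequence.
    """
    rank = {stage: position for position, stage in enumerate(APPLY_ORDER)}
    for earlier, later in zip(plan, plan[1:]):
        if rank[earlier] > rank[later]:
            return (earlier, later)
    return None
```

Calling first_violation(["metastore", "grants", "groups"]) returns ("grants", "groups"), which is exactly the grants-before-groups mistake that produces the Group not found error.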

The payoff is that "what's deployed" always equals "what's in main". Drift is a review comment, not a mystery.

Gotchas we hit

Here's the stuff no tutorial mentioned. Every one of these cost real time.

1. The deployment-name prefix has to be registered by Databricks — you can't do it yourself.
Your workspace URL prefix (the leading part of something.cloud.databricks.com) must be pre-registered on your Databricks account by Databricks. It's not a console setting and it's not an API call: you file a request with your Databricks contact. Try to create the workspace before it's registered and you get a flat Deployment name cannot be used... error with no hint that the fix lives in someone else's ticket queue.

2. The automation service principal needs Account admin — not just workspace admin.
All the databricks_mws_* resources (workspaces, networks, storage configurations, credentials) are account-level, and so is creating a Unity Catalog metastore and assigning it to workspaces. A service principal with only workspace-level rights can't create any of them. The SP that runs your Atlantis applies has to be an Account admin, or every one of those resources fails when Atlantis tries to apply it.

3. Databricks auto-creates a default metastore in the region, and it collides with yours.
Create your first workspace in a fresh region and Databricks helpfully auto-provisions a default metastore for that region and attaches it to the workspace. Since a region holds only one metastore, that default is now the region's metastore, and it's not the one you want to manage in code. The fix: detach the auto-created default from the workspace, delete it, then create your own *-metastore via Terraform and assign it. Bring your own metastore; evict the squatter first.

4. The deploy box needs egress to the account console.
Atlantis runs somewhere, and that somewhere has to reach accounts.cloud.databricks.com on 443 outbound. Behind a locked-down egress firewall, that's a rule someone has to add — and until they do, your applies just hang. Worth flagging clearly: this is the deploy server's egress. The cluster nodes' egress in the data plane is an entirely separate problem (and a much nastier one — that's Part 4).
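
A quick way to tell "the firewall rule is missing" apart from "Terraform is slow" is to test the TCP path from the deploy box directly. The snippet below is a minimal sketch using only Python's standard library; the can_reach helper and its default timeout are my own assumptions, not something Atlantis or Databricks ships.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the hostname and completes the TCP
        # handshake; the context manager closes the socket again right away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts. A firewall
        # that silently drops packets shows up as a timeout, not a refusal.
        return False
```

Run can_reach("accounts.cloud.databricks.com") on the host where Atlantis lives. A False after the full timeout points at dropped packets rather than a rejected connection. Note that this only proves the TCP path is open; it says nothing about TLS inspection or authentication.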

Where this leaves us

At the end of Part 1 we have the skeleton: two workspaces split by data flow, one shared regional Unity Catalog metastore governing both, a customer-managed VPC and shared S3 underneath, and Atlantis turning merge requests into infrastructure. Green workspaces, a clean namespace, and a deploy pipeline nobody has to babysit.

What we don't have yet is people. A platform with no access model is just an expensive sandbox. Next up: how we built a three-tier RBAC system — users mapped to function roles, function roles to access roles, access roles to actual Unity Catalog grants — so that "give the ML team read on the gold catalog" is a one-line change instead of a permissions spelunking expedition.
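
As a preview of that model, here is a toy Python sketch of how the mapping resolves. Every user, group, and catalog name in it is a made-up placeholder (the real groups arrive in Part 2), and the real thing is Terraform-managed groups and grants rather than Python dictionaries; this only shows why the indirection keeps changes small.

```python
from typing import Dict, List, Set, Tuple

# All names below are hypothetical placeholders, not the platform's real groups.
USER_TO_FUNCTION_ROLES: Dict[str, List[str]] = {
    "alice@example.com": ["fr-ml-engineer"],
}
FUNCTION_TO_ACCESS_ROLES: Dict[str, List[str]] = {
    "fr-ml-engineer": ["ar-gold-read"],
}
# Access roles are the only layer that names securables and privileges.
ACCESS_ROLE_GRANTS: Dict[str, List[Tuple[str, str]]] = {
    "ar-gold-read": [("pipeline_gold", "USE CATALOG"), ("pipeline_gold", "SELECT")],
}


def effective_grants(user: str) -> Set[Tuple[str, str]]:
    """Resolve user -> function roles -> access roles -> (securable, privilege)."""
    grants: Set[Tuple[str, str]] = set()
    for function_role in USER_TO_FUNCTION_ROLES.get(user, []):
        for access_role in FUNCTION_TO_ACCESS_ROLES.get(function_role, []):
            grants.update(ACCESS_ROLE_GRANTS.get(access_role, []))
    return grants
```

In this sketch, giving the ML team read on the gold catalog is the single "fr-ml-engineer": ["ar-gold-read"] line; no per-user grant is ever written.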

Next: Part 2 — RBAC done right: function-role groups, access-role groups, and why we have two layers instead of one.