DEV Community: duke

[Databricks on AWS #6] How We Structure the Terraform: Terragrunt, YAML-Driven Modules, and Atlantis GitOps

duke — Thu, 09 Jul 2026 01:08:38 +0000

📚 Series: Databricks on AWS (Part 6)

Building a Databricks AI Platform on AWS

RBAC with Function-Role Groups

Compute Governance: Pools, Policies, Clusters

The BOOTSTRAP_TIMEOUT Mystery

Fixing It with AWS PrivateLink

How We Structure the Terraform ← you are here

Five parts of what we built — the workspaces, the RBAC, the compute, the network mystery, the PrivateLink fix. This one is the how: the repo layout and Terragrunt patterns that hold it all together, plus the two footguns that will bite you on day one.

Over five parts we stood up an AI platform on Databricks + AWS. Along the way I kept saying "we apply this," "the module for that," "the provider override" — and mostly waved my hands at the plumbing. This finale is the plumbing: how the Terraform is actually organized, why humans never touch a .tf file, and the two ordering gotchas that turn a clean plan into a red one.

If you're standing up a multi-workspace Databricks estate and want a layout that scales past the first workspace without copy-paste sprawl, this is the shape that worked for us.

The repo, from the top

Two ideas do all the work: reusable modules and per-environment trees.

infra/
├── _modules/               # reusable Terraform modules (the "what")
├── <project>/
│   └── databricks/
│       ├── dev/            # per-env tree (the "where")
│       └── prd/
├── common_vars.yaml        # shared config: hosts, account IDs, CIDRs
└── provider.tmpl           # provider template

_modules/ is a flat library of single-purpose Terraform modules — one per resource concept:

workspace, unity-catalog, catalog, schema, grants, group, group-ar,
instance-pool, cluster-policy, compute, sql-warehouse, entitlements,
external-location, workspace-assignment, acls, iamrole, iampolicy,
s3, kms, vpc, service-principal, audit-log-delivery, workspace-conf

None of these know about dev or prd. They take inputs and create resources. The environment tree decides which modules run, with what values, in what order. A new workspace is a new folder under dev/, not a new module.

Humans write YAML, not HCL

Here's the rule we enforce: a person editing this repo edits YAML. They don't write Terraform, and mostly don't write Terragrunt either.

The flow is three hops:

human-authored *.yaml
  → terragrunt.hcl: yamldecode() + env-prefix
  → Terraform module inputs

Each leaf directory has a small *.yaml (the desired state, in human terms) and a terragrunt.hcl that reads it. The terragrunt.hcl does yamldecode() on the YAML, then rewrites the keys with an environment prefix before handing them to the module. Concretely, in the dev tree an instance-pool key like ip_cpu_small becomes dev_ip_cpu_small on the way in.
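To make the middle hop concrete, here is a minimal sketch of what a leaf terragrunt.hcl could look like. It is an illustration under assumptions, not the repo's actual file: the YAML file name, the module path depth, the hard-coded env value, and the instance_pools input name are all hypothetical.

```hcl
# terragrunt.hcl in a hypothetical dev/<workspace>/instance-pool/ leaf

terraform {
  # Hypothetical relative path up to the shared module library.
  source = "../../../../../_modules/instance-pool"
}

locals {
  # Assumed to be derived elsewhere (for example from the folder path);
  # hard-coded here to keep the sketch short.
  env = "dev"

  # Human-authored desired state, e.g. { ip_cpu_small = { ... } }
  pools = yamldecode(file("${get_terragrunt_dir()}/instance-pool.yaml"))

  # Inject the env prefix once: ip_cpu_small -> dev_ip_cpu_small
  prefixed_pools = { for key, cfg in local.pools : "${local.env}_${key}" => cfg }
}

# Hand the prefixed map to the module as an input variable.
inputs = {
  instance_pools = local.prefixed_pools
}
```

The for expression is the only place the prefix is applied, which is why the same YAML shape can be reused in prd/ by changing nothing but the env value.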

Why the prefix? Because the same YAML shape lives in dev/ and prd/, but the resource names, pool IDs, and group names must be globally distinct. The env prefix is injected once, in one place, by Terragrunt — so no human ever hand-types dev_ or prd_ and nobody fat-fingers a cross-environment collision. You change a number in YAML; the naming discipline is automatic.

This is the single highest-leverage decision in the whole layout. The reviewer of a merge request reads a YAML diff — "pool max went 8 → 16" — not a wall of HCL.

One provider override per workspace

Databricks has a wrinkle that AWS doesn't: every workspace is its own API host. A databricks_cluster in workspace A and one in workspace B are created against different host URLs, even though it's the same account, same provider, same code.

We solve this with generated provider overrides. Each workspace layer knows its workspace_key (from a small workspace.hcl), and common_vars.yaml maps that key to the workspace host. Terragrunt looks up the host and generates a provider_override.tf on the fly:

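locals {
  # Look up this workspace's API host: project -> env -> workspace_key.
  # Kept on one line because HCL does not allow a line break before ".projects".
  ws_host = local.common_vars.databricks.projects.<project>.workspace_hosts[local.env][local.ws_vars.locals.workspace_key]
}

# Write a provider override next to the module code so the
# workspace-level provider targets this workspace's host.
generate "workspace_provider_override" {
  path      = "provider_override.tf"
  if_exists = "overwrite" # Terragrunt requires if_exists; the value shown is an assumption
  contents  = <<-EOF
    provider "databricks" {
      host = "${local.ws_host}"
    }
  EOF
}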

The module code is workspace-agnostic. The host is injected at plan time based on which folder you're in. Add a workspace, add its host to common_vars.yaml, drop a workspace.hcl with the key — every layer underneath now targets the right host with zero code changes.
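For context, the two locals that the host lookup leans on could be loaded roughly like this. This is a sketch, assuming common_vars.yaml and workspace.hcl both sit somewhere above the leaf directory; the workspace key value and the way env is set are illustrative only.

```hcl
# --- workspace.hcl (one per workspace folder) ---
locals {
  workspace_key = "ws-landing" # illustrative key
}

# --- terragrunt.hcl in a layer underneath that folder ---
locals {
  # Shared config, found by walking up the directory tree.
  common_vars = yamldecode(file(find_in_parent_folders("common_vars.yaml")))

  # read_terragrunt_config exposes the other file's locals under ".locals",
  # which is why the lookup reads local.ws_vars.locals.workspace_key.
  ws_vars = read_terragrunt_config(find_in_parent_folders("workspace.hcl"))

  # Assumed; could equally be parsed from the directory path.
  env = "dev"
}
```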

The two Databricks providers, and why

You'll notice the code uses the Databricks provider two different ways, and it's not an accident:

provider	scope	used for
`databricks.mws` (account-level)	the Databricks account	creating workspaces, account-level groups, metastore assignment
`databricks` (workspace-level)	a single workspace host	clusters, pools, policies, catalogs, grants, ACLs

Account-level resources (the workspace itself, account groups, wiring a metastore) don't belong to any one workspace — they're configured against the account API, which is why they use the mws alias. Everything inside a workspace uses the workspace-scoped provider with the host we just injected. Mixing these up is a classic early mistake: try to create a cluster with the mws provider and you'll get authentication errors that make no sense until you realize you're pointed at the account, not the workspace.
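As a rough picture of the split, the two provider configurations and one resource on each side might look like the following. This is a sketch under assumptions: authentication is omitted, and the variable names, group name, and policy name are hypothetical.

```hcl
variable "databricks_account_id" { type = string }
variable "workspace_host" { type = string }

# Account-level provider: talks to the Databricks account API.
provider "databricks" {
  alias      = "mws"
  host       = "https://accounts.cloud.databricks.com"
  account_id = var.databricks_account_id
}

# Workspace-level provider: in this layout the host is supplied
# by the generated provider_override.tf rather than a variable.
provider "databricks" {
  host = var.workspace_host
}

# Account-level resource: must opt in to the mws alias explicitly.
resource "databricks_group" "platform_engineers" {
  provider     = databricks.mws
  display_name = "example-platform-engineers"
}

# Workspace-level resource: uses the default (workspace) provider.
resource "databricks_cluster_policy" "example" {
  name = "example-policy"
  definition = jsonencode({
    "autotermination_minutes" = { type = "fixed", value = 10 }
  })
}
```

A resource with no provider argument always lands on the workspace provider, so forgetting the alias on an account-level resource is the mirror image of the mistake described above.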

dependency + mock_outputs (and the "init" gotcha)

Layers depend on each other's outputs. cluster-policy needs the pool IDs that instance-pool produced; compute needs the policy IDs. Terragrunt expresses this with a dependency block:

dependency "instance_pool" {
  config_path = "../instance-pool"
  mock_outputs = {
    instance_pool_ids = { "dev_ip_cpu_small" = "mock-pool-id-1" }
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

The mock_outputs let you plan a layer before its dependency has ever been applied — Terragrunt feeds fake IDs so the plan can render. Handy. But look closely at that allow-list: ["validate", "plan"]. No "init".

That omission is the gotcha, and it ties straight back to the timeout saga from Parts 4 and 5. On a brand-new tree, the very first plan runs an init first. Because init isn't in the mock allow-list, Terragrunt refuses to mock the dependency and instead tries to read real outputs from a state that doesn't exist yet. Result:

Error: detected no outputs from dependency ".../instance-pool"

Your clean-looking layer fails to plan not because it's wrong, but because the thing upstream of it was never applied. Some layers in the repo (workspace-assignment, unity-catalog) deliberately include "init" in the allow-list to dodge this; the compute layers don't. The practical consequence is the same either way, and it's the whole reason for the next section.
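For comparison, the variant with "init" in the allow-list would look something like the sketch below. It reuses the block shown earlier; the inputs wiring at the bottom is an assumption about how the layer consumes the dependency.

```hcl
dependency "instance_pool" {
  config_path = "../instance-pool"

  # Fake IDs used only while the upstream layer has no real outputs.
  mock_outputs = {
    instance_pool_ids = { "dev_ip_cpu_small" = "mock-pool-id-1" }
  }

  # "init" added so the first plan on an unapplied tree can fall back to mocks.
  mock_outputs_allowed_terraform_commands = ["init", "validate", "plan"]
}

inputs = {
  # Real IDs once instance-pool has been applied, mocks before that.
  instance_pool_ids = dependency.instance_pool.outputs.instance_pool_ids
}
```

Note that "apply" is still absent from the list, so a mock can never leak into a real apply; the upstream layer has to exist by then.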

Apply order is not a suggestion

Because of the dependency chain above, you apply in order. Not "usually." Always, the first time through:

Workspace line:  workspace → instance-pool → cluster-policy → compute
                 → workspace-assignment → acls

Unity Catalog:   unity-catalog → external-location → catalogs
                 → schemas → groups → grants

The rule underneath both lines is the same: apply a layer only after everything it depends on is already applied. Mocks let you plan out of order, but the first real apply has to walk the chain, because a mock pool ID doesn't create a cluster and a mock catalog doesn't hold a grant.

Two failure modes we hit repeatedly, both just ordering:

cluster-policy / compute plan fails with "detected no outputs" → the pool/policy above it wasn't applied. Apply upstream first.
workspace-assignment plans an empty assignments = {} → the groups layer wasn't applied, so there's nothing to assign. Apply groups first.

Neither is a bug. Both are the dependency graph telling you that you skipped a step.

Atlantis ties the bow

We don't run terragrunt apply from laptops. Every change is a merge request; Atlantis runs atlantis plan on the MR and atlantis apply once it's approved. The plan output lands in the MR as a comment, so the diff and its consequences are reviewable in one place. Account credentials live server-side on the Atlantis host, not in anyone's shell — which also means local plan is optional (and honestly, usually skipped) since you get the real plan from the MR anyway.

Put together, the daily loop is small: edit a YAML value, open an MR, read the Atlantis plan, apply. The modules, the env-prefixing, the provider injection, the dependency graph — all of that is machinery you set up once and then mostly forget.

The series, in one breath

That's the whole platform:

Building the platform — workspaces, account, Unity Catalog metastore.
RBAC — function-role groups and workspace assignment.
Compute governance — instance pools → cluster policies → clusters, and the guardrails around them.
The BOOTSTRAP_TIMEOUT mystery — a healthy EC2 node that couldn't phone home, traced across three accounts to a firewall.
PrivateLink — taking the control-plane traffic off the internet without touching the VPC.
This — the Terraform structure holding all of it together.

If there's one throughline, it's that the boring parts — naming discipline, apply order, who-writes-what — are exactly the parts that decide whether a platform stays sane at the tenth workspace. Get the structure right and the interesting problems (like a packet dying in a firewall) are the only ones left to solve.

Thanks for reading all six. If you're just arriving, Part 1 is where the estate gets built from nothing.

[Databricks on AWS #5] Fixing Databricks BOOTSTRAP_TIMEOUT with AWS PrivateLink: Control Plane Over the Backbone, Zero New Subnets

duke — Thu, 02 Jul 2026 01:50:20 +0000

📚 Series: Databricks on AWS (Part 5)

In Part 4 we traced a BOOTSTRAP_TIMEOUT all the way to a centralized egress firewall that silently dropped our new workspace CIDR. Here's the clean fix — take the control-plane traffic off the internet entirely, without touching the existing VPC.

In Part 4 we found the culprit: cluster nodes with no public IP were trying to phone home to the Databricks control plane and secure cluster connectivity (SCC) relay, and a centralized inspection firewall was dropping the packets because our new workspace CIDR wasn't in its allow-list. We could have just added a firewall rule and moved on. Instead we did the thing our security team actually wanted from the start: get that traffic off the public internet completely.

The tool for that is AWS PrivateLink. And the interesting part isn't PrivateLink itself — it's that we landed it with zero new subnets, zero new CIDR, and zero routing changes, in a VPC that was already full.

Why VPC endpoints alone don't fix it (the trap from Part 4)

The obvious instinct is: "no internet — just put everything on VPC endpoints." That instinct is correct, but only for AWS services. S3, STS, and Kinesis all publish com.amazonaws... endpoint services, so you can ride the AWS backbone to them with gateway or interface endpoints.

The thing that was actually blocked in Part 4 is different: it's the Databricks control plane and the SCC relay. That's Databricks-owned infrastructure, not an AWS service. There is no com.amazonaws... endpoint for it that you get for free. Databricks does, however, publish its own PrivateLink endpoint services — you just have to explicitly wire them up.

So the real fix splits in two:

Traffic	Solution
S3 / STS / Kinesis	AWS VPC endpoints (optional, separate)
Control plane REST API + SCC relay	Databricks PrivateLink — this is the one that unblocks bootstrap

This post is about that second row.

Back-end vs front-end PrivateLink

Databricks PrivateLink comes in two flavors, and you need to know which one you're solving for:

Type	Connection	For us
Back-end (compute plane)	cluster → workspace core services: REST API + SCC relay	✅ required
Front-end (inbound)	users → workspace over a private path	optional (we skipped it)

The BOOTSTRAP_TIMEOUT is purely a back-end problem: the compute plane can't reach core services. Front-end PrivateLink is about how users reach the workspace UI — a different concern, and one we deliberately left alone so that people can still log in normally over the internet.

Back-end PrivateLink means two interface VPC endpoints:

one to the workspace / REST API endpoint service
one to the SCC relay endpoint service

Both are required. Skip one and you break either the REST channel or the relay channel — the cluster still won't come up.

The elegant part: no new subnets

Here's the constraint we walked into. The VPC was a /24 and it was full — no room to carve out a fresh /28 for endpoints. On most PrivateLink walkthroughs step one is "create a subnet for your endpoints." We couldn't.

Then the realization: an interface VPC endpoint is just an ENI with a private IP. It doesn't need its own subnet. You can drop its ENI straight into an existing subnet — including the very cluster subnets your compute already runs in. Each endpoint consumes a handful of private IPs (roughly 4, one per AZ ENI plus overhead) and that's it.

So the plan became: put both endpoints' ENIs in the existing cluster subnets. No new subnet, no new CIDR. Just ~4 IPs out of the dozens the subnets had free.

Why there's no routing change either

This is the part that made the security review easy, so it's worth being precise about why it's true.

An interface endpoint is reachable at a private IP inside your VPC. When a cluster node talks to that IP, it's talking to something in its own VPC — which every route table already handles via the implicit local route. There is nothing to add. No 0.0.0.0/0 change, no new prefix, no target rewrite.

Contrast this with an S3 gateway endpoint, which does modify routing — a gateway endpoint installs a prefix-list route (pl-xxxx → vpce-xxxx) into your route tables. That's a real routing change you have to review. Interface endpoints don't work that way. They work via DNS plus the local route, not route-table entries.

Which brings us to the mechanism that actually redirects the traffic:

Private DNS is what redirects the traffic

When you enable private DNS on the interface endpoints, AWS overrides DNS resolution inside your VPC so that the Databricks domains resolve to the endpoints' private ENI IPs instead of public addresses.

So tunnel.<region>.cloud.databricks.com — the SCC relay host that was getting dropped at the firewall in Part 4 — now resolves to a private IP in your own cluster subnet. The node connects to that private IP, hits the endpoint ENI, and the traffic rides the AWS backbone straight to Databricks' endpoint service. It never leaves AWS's network. It never touches the firewall.

And critically: private DNS only rewrites the Databricks domains. Every other service in the VPC keeps resolving exactly as before, because they don't use those hostnames. That's why the blast radius here is essentially zero — the existing (non-Databricks) workloads are completely untouched.

To summarize the "why nothing else breaks" story:

Thing	Change?	Why
Route tables	none	interface endpoints use DNS + `local`, not routes
Existing non-Databricks services	none	private DNS only rewrites Databricks domains
Subnets / CIDR	none	ENIs go into existing subnets
Existing SGs / NACLs	none	a new SG is created just for the endpoints
The only real change	~4 private IPs consumed + workspaces flipped to PrivateLink	non-disruptive

The AWS side

Concretely, the infra team does three small things — and notably, no new subnet, no new CIDR, no route change:

1. One security group for the endpoints, allowing inbound 443 (and the relay port) from the cluster security group.

2. Two interface VPC endpoints, both placed in the existing cluster subnets (one per AZ), with private DNS enabled, pointed at the two Databricks endpoint services. In <region> those services are:

| Endpoint | Service name |
|----------|--------------|
| Workspace (REST API) | com.amazonaws.vpce.<region>.vpce-svc-... |
| SCC relay | com.amazonaws.vpce.<region>.vpce-svc-... |

(The exact vpce-svc-... values come from the Databricks region docs. Always re-verify them right before applying, and use the console's "Verify service" to confirm they resolve.)

3. Hand back the two VPC endpoint IDs (vpce-...) for the Databricks-side registration.

That's the whole AWS footprint. A security group and two ENIs.
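If the infra team also works in Terraform, that footprint could be sketched roughly as follows. Everything here is an assumption for illustration: the variable names, the resource names, and the choice to pass the relay port and service names in as variables (take the real values from the Databricks PrivateLink docs for your region).

```hcl
variable "vpc_id" { type = string }
variable "cluster_subnet_ids" { type = list(string) }
variable "cluster_security_group_id" { type = string }
variable "relay_port" { type = number }             # from the Databricks docs
variable "workspace_service_name" { type = string } # com.amazonaws.vpce.<region>.vpce-svc-...
variable "relay_service_name" { type = string }     # com.amazonaws.vpce.<region>.vpce-svc-...

# One security group just for the endpoints; existing SGs stay untouched.
resource "aws_security_group" "databricks_endpoints" {
  name   = "databricks-privatelink-endpoints"
  vpc_id = var.vpc_id

  ingress {
    description     = "REST API from cluster nodes"
    from_port       = 443
    to_port         = 443
    protocol        = "tcp"
    security_groups = [var.cluster_security_group_id]
  }

  ingress {
    description     = "SCC relay from cluster nodes"
    from_port       = var.relay_port
    to_port         = var.relay_port
    protocol        = "tcp"
    security_groups = [var.cluster_security_group_id]
  }
}

# Two interface endpoints, ENIs placed in the existing cluster subnets.
resource "aws_vpc_endpoint" "databricks" {
  for_each = {
    workspace = var.workspace_service_name
    relay     = var.relay_service_name
  }

  vpc_id              = var.vpc_id
  service_name        = each.value
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.cluster_subnet_ids
  security_group_ids  = [aws_security_group.databricks_endpoints.id]
  private_dns_enabled = true
}

# The vpce-... IDs to hand over for the Databricks-side registration.
output "vpc_endpoint_ids" {
  value = { for name, ep in aws_vpc_endpoint.databricks : name => ep.id }
}
```

There is no aws_subnet and no aws_route in the sketch, which is the point: the change is a security group plus two endpoints.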

The Databricks side (Terraform)

Now Databricks needs to know these endpoints exist and be told to route the workspace through them. Three resource types:

1. Register each endpoint with the Databricks account, twice — once for the relay, once for the REST API:

resource "databricks_mws_vpc_endpoint" "relay" {
  provider            = databricks.mws
  account_id          = "<databricks-account-id>"
  aws_vpc_endpoint_id = "vpce-...relay"
  vpc_endpoint_name   = "our-dev-relay-vpce"
  region              = "<region>"
}

resource "databricks_mws_vpc_endpoint" "rest" {
  provider            = databricks.mws
  account_id          = "<databricks-account-id>"
  aws_vpc_endpoint_id = "vpce-...workspace"
  vpc_endpoint_name   = "our-dev-workspace-vpce"
  region              = "<region>"
}

2. Create private access settings — and this is the one setting people get wrong:

resource "databricks_mws_private_access_settings" "pas" {
  provider                     = databricks.mws
  private_access_settings_name = "our-dev-pas"
  region                       = "<region>"
  public_access_enabled        = true
}

public_access_enabled = true is deliberate. We're doing back-end PrivateLink only — the compute plane goes private, but we want users to keep logging in over the internet. Set this to false and you've quietly switched on front-end lockdown too, and your users get locked out of the UI. Keep it true unless you're also doing front-end PrivateLink on purpose.

3. Attach the endpoints to the network and the workspaces. The network config gets the two endpoints wired into vpc_endpoints { dataplane_relay = [...], rest_api = [...] }, and the workspaces get private_access_settings_id and the network. Note: subnet_ids stays exactly as it was — we are not changing the cluster subnets, only adding the endpoint references.
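A sketch of that third step, assuming the resource names from the two snippets above. The network and workspace names, the variables, and the credentials and storage references are hypothetical placeholders; only the endpoint and private-access wiring is the part being illustrated.

```hcl
variable "vpc_id" { type = string }
variable "cluster_subnet_ids" { type = list(string) }
variable "cluster_security_group_id" { type = string }
variable "credentials_id" { type = string }
variable "storage_configuration_id" { type = string }

resource "databricks_mws_networks" "this" {
  provider           = databricks.mws
  account_id         = "<databricks-account-id>"
  network_name       = "our-dev-network"
  vpc_id             = var.vpc_id
  subnet_ids         = var.cluster_subnet_ids # unchanged cluster subnets
  security_group_ids = [var.cluster_security_group_id]

  # The only new part: references to the two registered endpoints.
  vpc_endpoints {
    dataplane_relay = [databricks_mws_vpc_endpoint.relay.vpc_endpoint_id]
    rest_api        = [databricks_mws_vpc_endpoint.rest.vpc_endpoint_id]
  }
}

resource "databricks_mws_workspaces" "this" {
  provider       = databricks.mws
  account_id     = "<databricks-account-id>"
  workspace_name = "our-dev-workspace"
  aws_region     = "<region>"

  credentials_id           = var.credentials_id
  storage_configuration_id = var.storage_configuration_id

  # Flip the workspace onto the PrivateLink-enabled network and settings.
  network_id                 = databricks_mws_networks.this.network_id
  private_access_settings_id = databricks_mws_private_access_settings.pas.private_access_settings_id
}
```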

Gotchas (the ones that cost real time)

You need both endpoints. Workspace/REST and relay. One alone silently half-works and the cluster still fails.
Private DNS is mandatory. Forget to enable it and the Databricks domains keep resolving to public IPs — the endpoints exist but nothing uses them. This is the single most common "I set it up but it still times out" cause.
Keep public_access_enabled = true for back-end-only PrivateLink, or you'll lock users out of the console.
Wait ~20 minutes after the workspace flips to PrivateLink before you launch a test cluster. The registration and DNS propagation aren't instant, and an early test will look like a failure that isn't one.
Do not SSL-decrypt the relay. Same certificate-pinning gotcha from Part 4 — if any inspection sits in front of it and forward-proxy decrypts, the relay breaks. With PrivateLink the traffic bypasses the firewall entirely, but if you still have belt-and-suspenders inspection anywhere in the path, exclude the Databricks domains.

Verification

The proof is three checks, in order:

Workspace shows RUNNING, then wait ~20 minutes.
From inside the VPC, nslookup tunnel.<region>.cloud.databricks.com → it should resolve to a private endpoint IP in your cluster subnet, not a public address.
Launch a cluster. It reaches RUNNING. BOOTSTRAP_TIMEOUT is gone.

That's it. The same cluster that spent 11 minutes stuck in INSTANCE_INITIALIZING in Part 4 now comes up cleanly — over the backbone, off the internet, with the rest of the VPC none the wiser.

Takeaways

VPC endpoints solve S3/STS/Kinesis. The control plane + relay are Databricks-owned — for those you need PrivateLink, and that's specifically what fixes BOOTSTRAP_TIMEOUT.
Back-end PrivateLink = two interface endpoints (workspace/REST + SCC relay). Both required.
An interface endpoint is just an ENI with a private IP — it fits into an existing subnet, so no new subnet, no new CIDR, and no routing change. (Contrast S3 gateway endpoints, which do add a route.)
Private DNS is the actual redirect mechanism, and it only touches the Databricks domains — so existing services are untouched.
Keep public_access_enabled = true for back-end-only, wait 20 minutes, and never decrypt the relay.

Next up: how all of this is actually structured in Terraform — the module layout, the dual Databricks providers (mws vs workspace), and the appendix that ties the whole series together.

Next: Part 6 — The Terraform structure behind it all (and a series appendix).

[Databricks on AWS #4] The BOOTSTRAP_TIMEOUT Mystery: Tracing a Databricks Cluster from Data Plane to Control Plane (Transit Gateway + Firewall)

duke — Thu, 02 Jul 2026 01:30:12 +0000

📚 Series: Databricks on AWS (Part 4)

The EC2 nodes were healthy — 3/3 status checks. The cluster still never started. Here's the 11-minute timeout that sent us tracing packets across three AWS accounts.

Most "my Databricks cluster won't start" posts end with "open port 443." This one didn't — because the firewall did allow 443. The traffic was dying somewhere else, and the only way to find it was to follow a single packet from the cluster node all the way to the Databricks control plane.

If you run Databricks classic compute inside a customer-managed VPC behind a centralized egress (Transit Gateway → inspection firewall), this is the failure mode nobody warns you about.

The setup

A Databricks workspace deployment on AWS, classic compute, customer-managed VPC, secure cluster connectivity (no public IPs on cluster nodes). The new workspace reused an existing "spoke" VPC that egresses through a shared network hub:

cluster node (no public IP)
  → spoke VPC route table: 0.0.0.0/0 → Transit Gateway
  → TGW (shared network-hub account)
  → DMZ VPC → GWLB inspection firewall → NAT → IGW → internet → Databricks control plane

Instance pools applied. Cluster policies applied. Then the cluster itself:

$ atlantis apply -d .../ws-landing/compute
databricks_cluster.this["shared_small"]: Still creating... [10m20s elapsed]
...
Error: cannot create cluster: failed to reach RUNNING, got TERMINATED:
  Self-bootstrap timed out during launch ... BOOTSTRAP_TIMEOUT

~11 minutes of "Still creating", then TERMINATED.

The symptom that rules out half the internet

I pulled the cluster event log. The line that mattered:

BOOTSTRAP_TIMEOUT: [id: InstanceId(i-xxxx), status: INSTANCE_INITIALIZING, ...]
timed out after 704524 milliseconds. AWS bootstrap diagnostic output could not be fetched.
Please check network connectivity from the data plane to the control plane.

Two things narrowed it down fast:

The EC2 nodes were up and healthy — 3/3 status checks in the AWS console, no public IP (as expected for secure cluster connectivity).
INSTANCE_INITIALIZING + "diagnostic output could not be fetched" — the control plane couldn't even reach the node to pull logs.

Healthy EC2 + cluster never reaches RUNNING = the node booted but couldn't phone home. Not capacity. Not IAM. Not Terraform. Pure egress: the node never opened its outbound tunnel to the control plane's secure cluster connectivity relay.

Databricks' own error literally says "check network connectivity from the data plane to the control plane." Believe it.

Why "just use VPC endpoints" doesn't fix this

The first instinct (and our security team's) was: route everything through VPC endpoints, no internet. That's correct for AWS services — S3, STS, Kinesis can all ride the AWS backbone via gateway/interface endpoints.

But the thing that's actually blocked is the Databricks control plane and the secure cluster connectivity relay. Those are Databricks-owned infrastructure, not an AWS service — there is no com.amazonaws... endpoint for them. Your options are exactly two:

allow outbound to the control plane (egress firewall), or
AWS PrivateLink to Databricks (backbone, but you still explicitly wire it).

There is no configuration where the node simply doesn't talk to the control plane. With secure cluster connectivity, the node has no public IP and no open inbound ports — so the control plane cannot initiate inward. The node has to go outbound — block that path and the cluster simply can't come up.

Tracing the path

Routing first. I walked every hop and confirmed each route table — including the new workspace CIDR — was correct end to end:

hop	what to check	result
spoke subnet RT	`0.0.0.0/0 → TGW`	✅
TGW route table	default → DMZ VPC; return → spoke (propagated)	✅
DMZ ingress subnet	`0.0.0.0/0 → GWLB endpoint` (inspection)	✅
DMZ GWLB subnet	`0.0.0.0/0 → NAT`; spoke CIDR → TGW (return)	✅
DMZ public subnet	`0.0.0.0/0 → IGW`; spoke CIDR → GWLB (return)	✅

Every route — outbound and return — was present, and the new CIDR was wired identically to the existing workspaces that worked. So routing wasn't it.

That left exactly one thing in the path that isn't a route: the inspection firewall behind the Gateway Load Balancer. The traffic reached the firewall and got dropped there.

What actually broke

The DMZ runs a centralized egress firewall (think Palo Alto / appliance behind a GWLB). It allow-lists by source CIDR and destination. The existing workspaces lived in older CIDR ranges that were already in the allow policy. Our new workspace CIDR was not — so its packets to the Databricks control plane hit the firewall and were silently dropped.

Healthy EC2, perfect routing, and a firewall that quietly eats the SYN. That's why the node sat in INSTANCE_INITIALIZING for 11 minutes and the control plane never heard from it.

The confirmation is in the firewall logs: filter by the new source CIDR and you see DENY entries to the Databricks relay on 443.

The Databricks outbound you actually need

For the data plane to bootstrap (region <region> here; check the docs for yours), the node needs to reach:

destination	port	why
`tunnel.<region>.cloud.databricks.com` (control plane CIDR, e.g. `3.x.x.x/28`)	443	secure cluster connectivity relay (the one that was blocked)
control plane / web app	443	registration
regional metastore (RDS)	3306	cluster launch
S3 / STS / Kinesis (regional)	443	runtime, credentials, logs — put these on VPC endpoints

One Palo Alto gotcha worth its own line: if you do SSL forward-proxy decryption, exclude the Databricks domains. The relay uses certificate pinning; decrypt it and it breaks even with the allow rule in place.

Takeaways

BOOTSTRAP_TIMEOUT + healthy EC2 + "diagnostic output could not be fetched" = data-plane → control-plane egress is blocked, full stop. Don't go hunting IAM or Terraform.
Secure cluster connectivity means the node has no inbound path — it must egress to the relay. That egress is not optional.
VPC endpoints solve S3/STS/Kinesis, not the control plane/relay. Those are Databricks-owned; allow them or use PrivateLink.
When you add a new workspace CIDR behind a centralized firewall, the firewall policy is the thing everyone forgets. New CIDR ≠ automatically allowed.

The clean long-term fix is to take the control-plane traffic off the internet entirely with AWS PrivateLink — without touching the existing VPC. That's Part 5.

Next: Fixing it with AWS PrivateLink — control-plane over the backbone, zero new subnets, zero routing changes.

[Databricks on AWS #3] Compute Governance on Databricks: Instance Pools, Cluster Policies, and Shared Clusters

duke — Thu, 02 Jul 2026 01:29:50 +0000

📚 Series: Databricks on AWS (Part 3)

RBAC decides who a user is. Compute governance decides what hardware they're allowed to spin up. Here's how instance pools, cluster policies, and an entitlement gate turn "anyone can launch a 128-core GPU box" into "you get exactly what your role permits."

In Part 2 we built the RBAC model: users map to function roles, function roles map to access roles, access roles get grants. That controls which data someone can touch.

But there's a second axis nobody talks about until the cloud bill arrives: compute. Left ungoverned, a single curious analyst can launch a cluster of r6i.4xlarge nodes (16 vCPUs and 128 GB each) at 2 a.m., forget to turn it off, and greet you with a five-figure surprise on Monday. Governance is the answer, and on Databricks it comes in three layers.

The three layers, top to bottom

Think of it as a funnel. Each layer narrows what a user can actually do:

Layer	What it does	Who feels it
Instance pool	Pre-warmed VMs waiting to be claimed — faster cluster starts	Everyone (transparently)
Cluster policy	The rules: which instance types, sizes, autotermination, runtime	Engineers creating clusters
Entitlement gate	`allow_cluster_create` on/off per group	Analysts (blocked from creating)

And on top of all that: shared, pre-made clusters for the people who shouldn't be creating anything at all.

Let's go layer by layer.

Layer 1: Instance Pools — pre-warmed VMs

A cold cluster start on AWS means: request EC2 capacity → wait for the instances → install the Databricks runtime → join the cluster. That's several minutes of a user staring at a spinner.

An instance pool keeps a set of VMs pre-acquired (or ready to be acquired fast) so clusters attach to them instead of provisioning from scratch. It's purely about speed and cost predictability — a pool doesn't restrict anything on its own.

For our workspace we defined six pools, split by CPU/GPU and by size:

Pool	Instance type	Capacity	Tag
`ip_cpu_small`	m6g.large (2 vCPU / 8 GB)	10	cpu/small
`ip_cpu_medium`	m6g.xlarge (4 vCPU / 16 GB)	15	cpu/medium
`ip_cpu_large`	r6i.2xlarge (8 vCPU / 64 GB)	15	cpu/large
`ip_cpu_xlarge`	r6i.4xlarge (16 vCPU / 128 GB)	20	cpu/xlarge
`ip_gpu_small`	g5.xlarge (1× A10G)	10	gpu/small
`ip_gpu_large`	g5.2xlarge (1× A10G)	20	gpu/large

The important knobs are shared across all of them:

min_idle_instances = 0 — we don't pay to keep VMs warm 24/7; the pool spins up on demand and the first start after idle pays the cold-start tax. A tradeoff we accepted for cost.
idle_instance_autotermination_minutes = 10 — idle VMs release themselves after 10 minutes.
availability = ON_DEMAND — no spot reclaim surprises for interactive work.

The custom_tags matter more than they look: they flow into AWS cost allocation, so you can answer "how much did the GPU pools cost this month?" without guessing.
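Put together, one of those pools might be declared roughly like this. It is a sketch, not the series' actual module: it assumes the Capacity column maps to max_capacity, and the tag keys are invented for illustration.

```hcl
resource "databricks_instance_pool" "cpu_small" {
  instance_pool_name = "dev_ip_cpu_small"
  node_type_id       = "m6g.large"

  # Nothing kept warm; idle VMs are released after 10 minutes.
  min_idle_instances                    = 0
  max_capacity                          = 10 # assumed meaning of "Capacity"
  idle_instance_autotermination_minutes = 10

  aws_attributes {
    availability = "ON_DEMAND"
  }

  # Hypothetical tag keys; these are what surface in AWS cost allocation.
  custom_tags = {
    compute_class = "cpu"
    compute_size  = "small"
  }
}
```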

Layer 2: Cluster Policies — the actual governance

Instance pools make clusters fast. Cluster policies make clusters legal.

A policy is a template that constrains what a cluster can be: which pool it draws from, which Spark runtime, min/max workers, autotermination, and how many clusters one user can create. A user with a policy can only create clusters that fit inside it — they can't override the instance type, can't disable autotermination, can't ask for 200 workers.

Our analytics workspace has eight policies. A representative slice:

Policy	Pool (worker / driver)	Runtime	Workers	Autoterm (min)	Max clusters/user
`cp_cpu_small`	ip_cpu_small / ip_cpu_small	14.3.x-scala2.12	1–4	10	2
`cp_cpu_medium`	ip_cpu_medium / ip_cpu_medium	14.3.x-scala2.12	0–8	10	2
`cp_cpu_large`	ip_cpu_large / ip_cpu_medium	14.3.x-scala2.12	0–12	10	1
`cp_gpu_small`	ip_gpu_small / ip_cpu_small	14.3.x-gpu-ml-scala2.12	0–8	10	1
`cp_job_standard`	ip_cpu_medium / ip_cpu_medium	14.3.x-scala2.12	0–16	30	—

A few design choices worth calling out:

Autotermination is not optional. The policy sets it (10 min for interactive, 30 for jobs) and the user can't remove it. This single rule kills the "forgot to shut it down over the weekend" bill.
data_security_mode = USER_ISOLATION across the board — Unity Catalog enforcement stays on. Governance from Part 2 doesn't get bypassed by a cleverly configured cluster.
Driver often runs on a smaller pool than workers (e.g. cp_cpu_large uses ip_cpu_large workers but an ip_cpu_medium driver). The driver rarely needs to match worker muscle, and this trims cost.
max_clusters_per_user caps sprawl. The big policies are limited to 1 cluster per person.
A shared cost_center tag on every policy feeds the same cost-allocation story as the pools.

Job policies (higher worker ceilings, longer autotermination) live alongside the interactive ones, so batch pipelines get their own lane without borrowing the interactive budget.
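To make the policy table concrete, here is a minimal sketch of what cp_cpu_large could look like. The pool_ids and cost_center variables are hypothetical, and expressing "0–12 workers" through the autoscale fields is my guess at the mapping; the policy definition keys themselves are standard cluster-policy attributes.

```hcl
variable "pool_ids" {
  description = "Pool name => pool ID (hypothetical wiring from the pool module)."
  type        = map(string)
}

variable "cost_center" {
  type    = string
  default = "ai-platform" # placeholder value
}

resource "databricks_cluster_policy" "cp_cpu_large" {
  name                  = "cp_cpu_large"
  max_clusters_per_user = 1

  definition = jsonencode({
    # Workers on the large pool, driver on the medium pool.
    "instance_pool_id"        = { type = "fixed", value = var.pool_ids["ip_cpu_large"] }
    "driver_instance_pool_id" = { type = "fixed", value = var.pool_ids["ip_cpu_medium"] }

    "spark_version" = { type = "fixed", value = "14.3.x-scala2.12" }

    # Assumed mapping of the table's 0-12 worker range.
    "autoscale.min_workers" = { type = "range", minValue = 0 }
    "autoscale.max_workers" = { type = "range", maxValue = 12 }

    # Fixed and hidden, so a user can neither change nor remove it.
    "autotermination_minutes" = { type = "fixed", value = 10, hidden = true }

    "data_security_mode"      = { type = "fixed", value = "USER_ISOLATION" }
    "custom_tags.cost_center" = { type = "fixed", value = var.cost_center }
  })
}
```

A job policy would presumably follow the same shape with a higher max_workers ceiling and a 30-minute autotermination value.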

Layer 3: The Entitlement Gate — `allow_cluster_create`

Policies govern how someone creates a cluster. But some people shouldn't create clusters at all.

Databricks has a workspace-level entitlement for exactly this: allow_cluster_create. Turn it off for a group, grant that group no cluster policies, and its members cannot create clusters at all: the button is gone, the API call is rejected. Both halves matter, because a principal with CAN_USE on a policy can still create clusters inside that policy even without the entitlement.

This is the gate that makes the whole role model coherent:

Role	`allow_cluster_create`	Cluster policies	What they can do
Admin	on	all / unrestricted	Create anything. Break glass.
Engineer	on	assigned policies only	Create clusters — but only inside their policy's box
Analyst	off	none	No creation at all — only attach to pre-made shared clusters
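The engineer and analyst rows of that table might translate to something like the sketch below. The group_ids and policy_ids variables are hypothetical inputs; databricks_entitlements and databricks_permissions are real provider resources.

```hcl
variable "group_ids" {
  description = "Function-role group name => group ID (hypothetical input)."
  type        = map(string)
}

variable "policy_ids" {
  description = "Policy name => policy ID (hypothetical input)."
  type        = map(string)
}

# Engineer row: entitlement on, as in the table.
resource "databricks_entitlements" "engineer" {
  group_id              = var.group_ids["ai_engineer"]
  workspace_access      = true
  databricks_sql_access = true
  allow_cluster_create  = true
}

# Analyst row: entitlement off, and no policy grants anywhere below.
resource "databricks_entitlements" "analyst" {
  group_id              = var.group_ids["ai_analyst"]
  workspace_access      = true
  databricks_sql_access = true
  allow_cluster_create  = false
}

# Engineers may use every policy passed in; analysts are deliberately absent.
resource "databricks_permissions" "policy_use" {
  for_each          = var.policy_ids
  cluster_policy_id = each.value

  access_control {
    group_name       = "ai_engineer"
    permission_level = "CAN_USE"
  }
}
```

One thing worth verifying in your own workspace: the provider documentation describes allow_cluster_create as general cluster-create privilege, and notes that principals without it can still create clusters inside any policy they have CAN_USE on. If that holds for you, the policy grant is what scopes an engineer, and the analyst tier depends on having neither.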

An engineer gets freedom within guardrails. An analyst gets no guardrails because they get no steering wheel — they use what's already there. And that "what's already there" is the last piece.

Shared clusters — for the people who can't create

If analysts can't create clusters, they need clusters waiting for them. So the analytics workspace runs a couple of always-available shared, all-purpose clusters:

Cluster	Policy	Runtime	Shape
`cp_shared_small`	cp_cpu_small	14.3.x-scala2.12	m6g.large, 1 driver + 1 worker
`cp_shared_medium`	cp_cpu_medium	14.3.x-scala2.12	m6g.xlarge, 1 driver + 0 min workers

These are built from the same policies engineers use, so they inherit the same autotermination and isolation rules. Access is granted per group via ACLs (CAN_USE, CAN_ATTACH_TO) — an analyst attaches, runs their notebook, and never touches a provisioning decision.
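The attach-only grant could be sketched like this. The shared_cluster_ids variable is a hypothetical input from whatever module creates the clusters.

```hcl
variable "shared_cluster_ids" {
  description = "Shared cluster name => cluster ID (hypothetical input)."
  type        = map(string)
}

resource "databricks_permissions" "shared_cluster" {
  for_each   = var.shared_cluster_ids
  cluster_id = each.value

  # Analysts attach notebooks; they never edit or resize the cluster.
  access_control {
    group_name       = "ai_analyst"
    permission_level = "CAN_ATTACH_TO"
  }
}
```

On clusters the permission levels are CAN_ATTACH_TO, CAN_RESTART, and CAN_MANAGE, and attach-only access does not include starting a terminated cluster. With a 10-minute autotermination it is worth checking who is expected to bring a shared cluster back up.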

The pipeline workspace, by contrast, has no shared clusters — it's job-driven, so clusters are ephemeral and spun up per run from job policies. Different workspace, different compute personality, same governance primitives.

Apply order: pool → policy → compute (and the Terragrunt trap)

Here's where Infrastructure-as-Code bites. The three layers have a hard dependency chain:

instance-pool  →  cluster-policy  →  compute (clusters)

Policies reference pool IDs. Clusters reference policy IDs. So you must apply in order:

# 1. pools first
atlantis apply -d .../ws-landing/instance-pool
# 2. re-plan, then policies (they now see real pool IDs)
atlantis apply -d .../ws-landing/cluster-policy
# 3. re-plan, then clusters
atlantis apply -d .../ws-landing/compute

And now the gotcha that cost us a confused afternoon. On the very first plan — before pools are applied — the policy and compute plans fail with:

Error: ... instance-pool ... is a dependency of ... cluster-policy ...
       but detected no outputs. ...

This looks like a broken config. It isn't. Terragrunt lets a dependency return mock outputs so downstream modules can plan before the upstream is applied — but only for the Terraform commands you allowlist:

mock_outputs_allowed_terraform_commands = ["validate", "plan"]

The failure happens during terragrunt init, and init isn't in that list. So on a cold bootstrap, init tries to fetch the real output of an unapplied pool, finds nothing, and dies. On our older workspaces this never surfaced — those pools were already applied, so real outputs existed.

Two ways out:

Just apply in order (pool → re-plan → policy → re-plan → compute). Once the upstream is applied, real outputs exist and the mock is never needed. This is what we did.
Add "init" to mock_outputs_allowed_terraform_commands on the policy/compute modules. The mock now covers the init phase too — but you still apply in order, because a cluster genuinely can't be created against a pool that doesn't exist yet.

The lesson: on a greenfield deployment, "detected no outputs" almost always means "you haven't applied the thing upstream yet," not "your dependency block is wrong."
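For reference, the second option might look like this in the policy module's terragrunt.hcl. The relative path, the pool_ids output name, and the mock values are assumptions for illustration; the dependency block attributes are Terragrunt's own.

```hcl
# cluster-policy/terragrunt.hcl (sketch)
dependency "instance_pool" {
  config_path = "../instance-pool" # assumed sibling directory

  # Placeholders used only while the pool module has no real outputs yet.
  mock_outputs = {
    pool_ids = {
      ip_cpu_small  = "mock-pool-id"
      ip_cpu_medium = "mock-pool-id"
      ip_cpu_large  = "mock-pool-id"
      ip_gpu_small  = "mock-pool-id"
    }
  }

  # "init" added so a cold bootstrap gets as far as a plan.
  mock_outputs_allowed_terraform_commands = ["init", "validate", "plan"]
}

inputs = {
  pool_ids = dependency.instance_pool.outputs.pool_ids
}
```

Leaving "apply" out of the allowlist is the point: a plan may run against placeholders, but an apply can only proceed with real pool IDs.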

Takeaways

Three layers, one funnel. Pools = speed. Policies = the real governance (instance type, size, autotermination — non-negotiable). Entitlement gate = who's even allowed to create.
Roles map cleanly onto the layers. Admin is unrestricted, engineer creates within a policy, analyst creates nothing and uses shared clusters. allow_cluster_create = off is what makes the analyst tier real.
Autotermination baked into policy is the single highest-ROI rule you can ship. It ends surprise weekend bills.
Apply order is load-bearing. pool → policy → compute, and the first cold plan failing with "detected no outputs" is a Terragrunt mock/init quirk, not a bug in your code.

So we applied it. Pools came up — real IDs issued. Policies came up — real IDs issued. Then we ran the compute apply to create those two shared clusters, and...

databricks_cluster.this["shared_small"]: Still creating... [10m20s elapsed]
...
Error: cannot create cluster: failed to reach RUNNING, got TERMINATED:
  Self-bootstrap timed out during launch ... BOOTSTRAP_TIMEOUT

The EC2 nodes booted. Status checks passed, 3/3. And the clusters refused to start anyway — 25 minutes of "Still creating," then TERMINATED. Not a policy problem. Not an IAM problem. Something in the network was quietly eating our packets.

Next: The BOOTSTRAP_TIMEOUT Mystery — tracing a Databricks cluster from data plane to control plane, across three AWS accounts and one very quiet firewall.

[Databricks on AWS #2] RBAC on Databricks: Function-Role Groups, Workspace Assignment, and Why USER/ADMIN Isn't the Whole Story

duke — Thu, 02 Jul 2026 01:25:08 +0000

📚 Series: Databricks on AWS (Part 2)

Part 1 built the environment. Now we hand out the keys — three account-level groups, two workspaces, and a permission model that's mostly not something you invent.

Here's the trap most Databricks RBAC posts fall into: they treat access control like a thing you design from scratch. You don't. Databricks already hands you USER and ADMIN at the workspace level, entitlements, object ACLs, and Unity Catalog grants — all built in. The only piece you actually create is the groups. Get that mental split right and RBAC stops feeling like a maze.

If you're standing up a fresh Databricks account and wondering where to draw the first lines, this is the layer to get right before you touch a single catalog.

The model in one line

Everything flows through groups:

User ──▶ Function-Role group ──▶ workspace (USER / ADMIN) + (later) data grants

A user gets nothing directly. They land in a function-role group, and the group carries the permissions. That indirection is the whole point — it means "who is this person" and "what can this role do" are two separate problems you can solve independently.

We started with the smallest set that still maps to real jobs:

Group	Intent	Workspace level
`ai_admin`	Platform admins — run the place	ADMIN
`ai_engineer`	ML / data engineers — build things	USER
`ai_analyst`	Analysts — read and query	USER

Three groups. Not thirty. You can add ai_scientist, ai_guest, whatever later — each is one line of YAML plus an assignment. Resist the urge to pre-build a role for every hypothetical persona; churn kills that plan fast.

These are account-level groups, not workspace-local ones. That matters: one group definition can be assigned to many workspaces, which is exactly what you want when you have more than one.

Groups are the only thing you create

This is the part worth internalizing. Line up the permission layers and mark who owns each:

Layer	Values	Who defines it
Workspace assignment	`USER` / `ADMIN`	Databricks built-in
Entitlements	`workspace_access`, `databricks_sql_access`, `allow_cluster_create`, ...	Databricks built-in
Object ACLs	`CAN_MANAGE` / `CAN_USE` / `CAN_ATTACH_TO` / ...	Databricks built-in
Unity Catalog grants	`USE CATALOG` / `SELECT` / `MODIFY` / `ALL PRIVILEGES`	Databricks built-in
Function-role groups	`ai_admin`, `ai_engineer`, ...	You

Four of the five rows are Databricks primitives. You don't design SELECT or CAN_MANAGE — they exist. What you design is the subjects: the groups those permissions attach to. Everything else in this series (entitlements in Part 3, Unity Catalog grants later) is you handing built-in permissions to the groups you made here.

So your IaC surface for RBAC is genuinely small. You define the groups in Terraform, and you wire them to workspaces. Membership — the actual humans — lives somewhere else. More on that in a second.
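A minimal sketch of that surface, assuming the Databricks provider is configured at the account level. The variable name and the output name are mine.

```hcl
variable "function_role_groups" {
  type    = set(string)
  default = ["ai_admin", "ai_engineer", "ai_analyst"]
}

# Account-level groups: definitions only, no members.
resource "databricks_group" "function_role" {
  for_each     = var.function_role_groups
  display_name = each.key
}

# Group name => group ID, consumed by the workspace-assignment stack.
output "group_ids" {
  value = { for name, g in databricks_group.function_role : name => g.id }
}
```

Note what is absent: there is no databricks_group_member resource anywhere, which is what keeps people out of the code.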

USER vs ADMIN: built-in, and you just pick

Once the groups exist, you assign each one to a workspace at a permission level. Databricks gives you exactly two at this layer:

ADMIN — full workspace admin. Manage users, clusters, settings, everything.
USER — can log in and work, no admin surface.

That's it. You're not inventing a permission scheme; you're choosing which of two built-in levels each group gets, per workspace. In Terraform this is databricks_mws_permission_assignment — an account-level resource (multi-workspace mode) that maps a group to a workspace at a level.

With two workspaces — call them landing and pipeline — the matrix is small and readable:

Group	landing	pipeline
`ai_admin`	ADMIN	ADMIN
`ai_engineer`	USER	USER
`ai_analyst`	USER	USER

Admins are admins everywhere; engineers and analysts are users everywhere. The point isn't the specific grid — it's that a whole workspace access policy is this compact when the subjects are groups instead of individuals.
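The matrix above could be expressed roughly as follows. The two input variables and the local names are hypothetical, and the sketch assumes every group gets the same level in every workspace, as in the grid.

```hcl
variable "group_ids" {
  description = "Group name => account-level group ID (hypothetical input)."
  type        = map(string)
}

variable "workspace_ids" {
  description = "Workspace name (landing, pipeline) => workspace ID (hypothetical input)."
  type        = map(string)
}

locals {
  # One level per group, applied to every workspace.
  levels = {
    ai_admin    = "ADMIN"
    ai_engineer = "USER"
    ai_analyst  = "USER"
  }

  # Cross product: one entry per (workspace, group) cell of the matrix.
  assignments = {
    for pair in setproduct(keys(var.workspace_ids), keys(local.levels)) :
    "${pair[0]}/${pair[1]}" => { workspace = pair[0], group = pair[1] }
  }
}

resource "databricks_mws_permission_assignment" "this" {
  for_each = local.assignments

  workspace_id = var.workspace_ids[each.value.workspace]
  principal_id = var.group_ids[each.value.group]
  permissions  = [local.levels[each.value.group]]
}
```

If the grid ever stops being uniform, the levels map would need a workspace dimension, but the resource itself would not change.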

Where membership actually lives (hint: not IaC)

Here's the decision that surprises people: user-to-group membership does not go in Terraform.

It's tempting. You've got groups in code, you've got assignments in code — why not list the members in code too? Because joiners and leavers churn constantly, and every one of them would be a pull request, a plan, an apply. You'd be running infrastructure deploys to onboard an intern. That's the wrong tool.

So membership lives in the Account Console / SCIM, managed by ops (or synced from your IdP):

Account Console → User management → Users → Add user (company email = login ID).
Open the target group → Members tab → Add members.

One gotcha worth calling out: add people on the Members tab, not the Roles tab. The Roles tab is account-level roles (account admin, etc.) — a completely different thing, and easy to click by accident. And if you have SSO/SCIM provisioning, let the IdP own membership; manual adds will fight the sync.

The clean split, then:

In IaC (rarely changes): group definitions, workspace assignments, and later the grants.
Out of IaC (changes daily): who's in each group.

Structure in code, people in the console. That line is the single most useful RBAC decision in this whole post.

The apply-order gotcha that eats an afternoon

Terraform (via Terragrunt) resolves this in two stacks: one that creates the groups, one that creates the workspace assignments. The assignment stack depends on the group stack's output — it needs the group IDs to bind them to workspaces.

Here's the trap. If you apply the assignment stack before the groups exist, it doesn't error. It quietly resolves the dependency to a mock (empty) output and produces:

assignments = {}

An empty result. No groups, no bindings, no complaint. You think you shipped RBAC; you shipped nothing. Then you spend an hour wondering why nobody can log in.

The order is non-negotiable:

# 1. groups FIRST
atlantis apply -d .../groups

# 2. THEN the assignment (references group IDs)
atlantis apply -d .../workspace/workspace-assignment

If you ever see assignments = {} on a plan you expected to be full, this is why: the group output wasn't there yet, the dependency fell back to its mock, and the plan built against thin air. Apply groups, then re-plan. It's the RBAC cousin of "create the table before you grant on it."
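One possible guard, which is my suggestion rather than something described in this setup: a validation rule on the assignment module's input, so an empty map fails the plan loudly instead of producing an empty one. The variable name is hypothetical.

```hcl
variable "group_ids" {
  description = "Group name => group ID, passed in from the groups stack."
  type        = map(string)

  validation {
    condition     = length(var.group_ids) > 0
    error_message = "No group IDs were passed in. Apply the groups stack first, then re-plan."
  }
}
```

This only catches the empty-mock case. A mock that returns placeholder IDs would still pass, so the apply order remains the real fix.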

Takeaways

Groups are the only thing you invent. USER/ADMIN, entitlements, ACLs, and Unity Catalog grants are all Databricks built-ins — you attach them, you don't design them.
Function-role groups are account-level, so one definition assigns cleanly to many workspaces. Start with three (admin/engineer/analyst); add more as one-liners when you actually need them.
USER vs ADMIN is a built-in binary at the workspace layer — pick per group, per workspace, and let the group indirection keep the matrix tiny.
Membership belongs in the console/SCIM, not IaC. Joiner/leaver churn would turn onboarding into infrastructure deploys. Structure in code, people out of code.
Create groups before workspace assignment. Do it backwards and the dependency resolves to a mock, assignments = {} ships silently, and nobody can log in.

With the subjects in place and the workspaces handed out, the next question is what those users are actually allowed to do with compute — who can spin up a cluster, which entitlements gate SQL, and how you keep costs from getting out of hand. That's Part 3.

Next: Compute governance — entitlements, cluster policies, and keeping a self-serve platform from becoming a self-serve bill.

[Databricks on AWS #1] Building a Databricks AI Platform on AWS: Two Workspaces, One Unity Catalog Metastore

duke — Thu, 02 Jul 2026 01:24:21 +0000

📚 Series: Databricks on AWS (Part 1)

We stood up two workspaces, one shared Unity Catalog metastore, a customer-managed VPC, and wired the whole thing through Terraform + Atlantis so nobody clicks in a console. This post is the map for everything that follows.

Every "just spin up Databricks" tutorial stops at the point where the workspace turns green. Real platforms don't stop there. You have to decide how data flows, where it lives, who governs it, and how a change gets from a pull request to running infrastructure without someone SSH-ing into a jump box at 2am.

This series is the honest version of that build — a Databricks AI platform on AWS, provisioned entirely as code, with all the sharp edges we caught our shins on. Part 1 is the architecture and the ground rules. Later parts go deep on RBAC, compute governance, and the networking rabbit hole that ate a week of my life (spoiler: a firewall silently dropping SYN packets).

Let me lay out the whole thing.

The shape of the platform

Two workspaces. One metastore. One VPC. That's the headline.

                   ┌─────────────────────────────────────┐
                   │       Unity Catalog metastore       │
                   │          (one per region)           │
                   └──────────────────┬──────────────────┘
                              assigned to both
             ┌────────────────────────┴────────────────────────┐
             │                                                 │
  ┌──────────▼───────────┐                        ┌────────────▼───────────┐
  │ landing workspace    │                        │ pipeline workspace     │
  │ ingest + interactive │                        │ transform + batch/job  │
  │ analytics            │                        │ model train/serve      │
  └──────────┬───────────┘                        └────────────┬───────────┘
             │                                                 │
             └────────────────────────┬────────────────────────┘
                                      │
                        customer-managed VPC (shared)
                        shared root S3 + cross-account IAM

Why two workspaces instead of one? Because the two halves of the platform have genuinely different personalities.

The landing workspace is where raw data arrives and where analysts poke at it interactively. It's bursty, human-driven, notebook-heavy. The pipeline workspace is where the scheduled jobs live — transformation, feature engineering, model training and serving. It's automated, batch-shaped, and you do not want an analyst's runaway all-purpose cluster competing with a production training job for pool capacity.

Splitting by data flow (ingest/interactive vs. transform/batch) instead of by team gives you a clean blast radius. A misbehaving interactive cluster can't starve the batch plane. Compute policies, instance pools, and cluster budgets get tuned per personality instead of averaged into mush.

But — and this is the part people get wrong — splitting the workspaces does not mean splitting the data governance.

One metastore to govern them all

Unity Catalog's metastore is regional: one metastore per region, per account. That's not a suggestion, it's a hard limit, and it turns out to be exactly what you want here.

Both workspaces attach to the same metastore. That means a table registered by an ingest job in the landing workspace is immediately governable, grantable, and queryable (subject to permissions) from the pipeline workspace — same three-level namespace (catalog.schema.table), same lineage graph, same audit trail. No copying, no federation, no "which workspace has the real version" archaeology.

metastore (region-wide)
 ├── catalog: landing_*     ← raw / bronze
 ├── catalog: pipeline_*    ← silver / gold
 └── grants + lineage span BOTH workspaces

The workspaces are the compute boundary. The metastore is the governance boundary. Keeping those two concepts separate is the single most important design decision in the whole platform.
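In Terraform terms, "one metastore, two workspaces" might be sketched like this. All three variables and the metastore name are placeholders; the two resources are from the Databricks provider.

```hcl
variable "region" {
  type = string
}

variable "metastore_storage_root" {
  description = "S3 URI for the metastore's managed storage (hypothetical input)."
  type        = string
}

variable "workspace_ids" {
  description = "Workspace name => workspace ID (hypothetical input)."
  type        = map(string)
}

resource "databricks_metastore" "this" {
  name          = "ai-platform-metastore" # placeholder name
  region        = var.region
  storage_root  = var.metastore_storage_root
  force_destroy = false
}

# The same metastore ID is attached to every workspace in the map.
resource "databricks_metastore_assignment" "this" {
  for_each     = var.workspace_ids
  workspace_id = each.value
  metastore_id = databricks_metastore.this.id
}
```

The for_each over workspaces is the whole trick: adding a third workspace would add one assignment, not a second governance plane.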

Customer-managed VPC, shared storage

The data plane runs in a customer-managed VPC — our network, our subnets, our security groups — not the Databricks-managed default. On a real platform behind a security team, that's non-negotiable: you need the compute to live where your egress controls, flow logs, and inspection already are.

In our case the VPC and subnets were created by the infra team and reused, not provisioned by this repo. Terraform just registers them with Databricks via databricks_mws_networks. That distinction bites later (Part 4 is a whole post about a cluster that wouldn't boot because a firewall in that borrowed network was eating packets), so it's worth stating up front: owning the workspace config is not the same as owning the network.
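Registering a borrowed network is a small resource. A sketch, assuming the IDs arrive as plain inputs from the infra team; the variable names and the network name are placeholders.

```hcl
variable "databricks_account_id" {
  type = string
}

variable "vpc_id" {
  type = string
}

variable "subnet_ids" {
  type = list(string)
}

variable "security_group_ids" {
  type = list(string)
}

# Nothing here creates AWS networking; it only tells Databricks what exists.
resource "databricks_mws_networks" "this" {
  account_id         = var.databricks_account_id
  network_name       = "ai-platform-network" # placeholder name
  vpc_id             = var.vpc_id
  subnet_ids         = var.subnet_ids
  security_group_ids = var.security_group_ids
}
```

There are no aws_vpc or aws_subnet resources in the sketch, which mirrors the ownership split: this repo holds references to the network, not the network.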

Storage is shared too. Both workspaces sit on one root S3 bucket, reached through a single cross-account IAM role and one Databricks credential config. The metastore gets its own storage root plus a dedicated data-access role. Standard stuff — S3 gateway endpoint, STS and Kinesis interface endpoints, a security group allowing 443 — but "standard" only if someone actually built it, which is again the infra team, not this repo.

Everything through Terraform + Atlantis

No console clicking. The entire platform — workspaces, networks, storage registration, the metastore, catalogs, schemas, groups, grants — is Terraform, wrapped in Terragrunt, applied through Atlantis off GitLab merge requests.

The flow is boring in the best way:

open MR  →  atlantis plan (comment on the MR)  →  review the plan
        →  merge  →  atlantis apply  →  infra changes

Two things make this actually work at platform scale:

A single automation service principal owns every apply. Humans don't hold the keys; the SP does. (More on the permissions that SP needs below — it's a gotcha.)
Ordering is explicit. Unity Catalog resources have real dependencies, so we apply in a fixed sequence: metastore → external locations → catalogs → schemas → groups → grants. Skip ahead and apply grants before the groups exist and you get a Group not found error. Ask me how I know.

The payoff is that "what's deployed" always equals "what's in main". Drift is a review comment, not a mystery.

Gotchas we hit

Here's the stuff no tutorial mentioned. Every one of these cost real time.

1. The deployment-name prefix has to be registered by Databricks — you can't do it yourself.
Your workspace URL prefix (the something in something.cloud.databricks.com) must be pre-registered on your Databricks account by Databricks. It's not a console setting and it's not an API call: you file a request with your Databricks contact. Try to create the workspace before it's registered and you get a flat Deployment name cannot be used... error with no hint that the fix lives in someone else's ticket queue.

2. The automation service principal needs Account admin — not just workspace admin.
All the databricks_mws_* resources (workspaces, networks, storage) and the Unity Catalog metastore are account-level. A service principal with only workspace-level rights can't create any of them. The SP that runs your Atlantis applies has to be an Account admin, or every one of those resources fails at plan or apply time.

3. The region auto-creates a default metastore that collides with yours.
Create your first workspace in a fresh region and Databricks helpfully auto-provisions a default metastore for that region and attaches it. Since there's only one metastore per region, that default is now the region's metastore, and it isn't the one you define and manage in code. The fix: detach the auto-created default, delete it, then create your own *-metastore via Terraform and assign it. Bring your own metastore; evict the squatter first.

4. The deploy box needs egress to the account console.
Atlantis runs somewhere, and that somewhere has to reach accounts.cloud.databricks.com on 443 outbound. Behind a locked-down egress firewall, that's a rule someone has to add — and until they do, your applies just hang. Worth flagging clearly: this is the deploy server's egress. The cluster nodes' egress in the data plane is an entirely separate problem (and a much nastier one — that's Part 4).

Where this leaves us

At the end of Part 1 we have the skeleton: two workspaces split by data flow, one shared regional Unity Catalog metastore governing both, a customer-managed VPC and shared S3 underneath, and Atlantis turning merge requests into infrastructure. Green workspaces, a clean namespace, and a deploy pipeline nobody has to babysit.

What we don't have yet is people. A platform with no access model is just an expensive sandbox. Next up: how we built a three-tier RBAC system — users mapped to function roles, function roles to access roles, access roles to actual Unity Catalog grants — so that "give the ML team read on the gold catalog" is a one-line change instead of a permissions spelunking expedition.

Next: Part 2 — RBAC done right: function-role groups, access-role groups, and why we have two layers instead of one.

[Databricks on AWS #0] The Target Architecture: Isolating Prod, Dev, and Sandbox with Unity Catalog

duke — Thu, 02 Jul 2026 00:34:09 +0000

📚 Series: Databricks on AWS (Part 0, prologue)

Before the build story, here's the destination. This is the target-state data architecture we designed the whole platform toward — the three principles that shaped every later decision, and the Unity Catalog governance model that keeps production data safe from human hands.

The rest of this series is a build log: workspaces, RBAC, compute, the networking rabbit hole, the Terraform layout. But every one of those decisions was made in service of a target picture we drew first. This post is that picture — the "to-be" architecture, not the scaffolding we happened to have up on any given week.

It's built on three things Databricks basically hands you if you lean into them: the Lakehouse (one store, ACID tables, no separate warehouse to sync), the Medallion architecture (raw → cleaned → integrated → business, each layer a promotion), and Unity Catalog as the single governance plane across all of it. The interesting part isn't reciting those three buzzwords — it's the specific way we wire them so that prod, dev, and analyst sandboxes never step on each other.

Three principles, and everything follows

Almost every concrete rule later in this series is a consequence of one of these three.

1. Nobody touches production by hand. Creates, updates, and deletes on prod data happen only through an automated, code-reviewed pipeline running as a service principal. Human accounts don't get write on prod: not analysts, not engineers, not admins. The blast radius of a bad afternoon is capped at whatever a person can do with read-only. This one principle is why the whole "promote" flow later exists.

2. Never copy production to look at it. If an analyst wants to explore the gold layer, they read it in place. Within one metastore that's just Unity Catalog namespace permissions; across metastores or orgs it's Delta Sharing. Either way the bytes don't move. No nightly "analytics copy" job, no storage bill for the same data three times, no stale replica that quietly drifts from the source of truth.

3. Give analysts a room they can trash. Read-only-on-prod sounds clean until an analyst runs the same 200-line WITH query forty times in an afternoon, full-scanning a fact table each run. So we give them a sandbox — a physically isolated catalog where they can write — and encourage them to materialize heavy intermediate results there once. Freedom to write, but walled off from anything that matters.

Hold those three in your head. The zone model below is just their logical consequence.

The zone model

Inside one Unity Catalog metastore we carve out three zones with sharply different permission profiles.

              ┌──────────────── Unity Catalog metastore ────────────────┐
              │                                                          │
 pipeline-sp ─┼─▶ ┌──── PROD (Medallion) ────┐      ┌─── SANDBOX ───┐    │
 (write only) │   │ landing→cleaned→         │ zero │  per-user     │    │
              │   │ integrated→business      │─copy▶│  schemas      │    │
              │   │ read-only for humans     │ read │  free write   │    │
              │   └──────────────────────────┘      └───────────────┘    │
              │                 ▲                          │             │
 humans ──────┼─── read (SELECT)│              free write (ALL)│         │
 ai_analyst   │                 │                                        │
 ai_engineer  │   ┌──── DEV (Medallion) ─────┐                           │
              │   │ same layers, engineers   │  build & validate         │
              │   │ can write                 │  pipelines here          │
              │   └──────────────────────────┘                           │
              └──────────────────────────────────────────────────────────┘

Zone	Catalogs	Who	Privileges	Role
Prod	`landing`, `cleaned`, `integrated`, `business`	pipeline service principal (write), human groups (read)	SP: `USE` `SELECT` `CREATE` `MODIFY` (all); humans: `USE` `SELECT` only	Single source of truth. Writes only via automation.
Dev	same four Medallion catalogs, dev env	`ai_engineer`, `ai_admin`	`USE` `SELECT` `CREATE` `MODIFY`	Where pipeline logic is built and validated before it's promoted.
Sandbox	`sandbox` (per-user schemas)	`ai_analyst`, `ai_engineer`	`ALL` on your own schema; read-only on prod/dev	Free-write experiment space + cost-saving materialization.

The four Medallion catalogs — landing (bronze/raw), cleaned (silver, incremental + cleansing), integrated (silver, dimensions + facts), business (gold, marts) — exist in both prod and dev with an environment prefix. Same shape, different blast radius: in dev engineers write freely; in prod only the pipeline principal does.

The three human groups (ai_admin, ai_engineer, ai_analyst) come straight from the RBAC post — account-level function-role groups. Nothing here invents a new permission system; it just points Unity Catalog's built-in privileges at those groups, per zone.
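As an illustration of the prod row, here is how the grants on one prod catalog might be written. The service principal variable is hypothetical, and translating the table's USE / SELECT / CREATE / MODIFY shorthand into these specific privilege names is my reading of it.

```hcl
variable "pipeline_sp_application_id" {
  description = "Application ID of the pipeline service principal (hypothetical input)."
  type        = string
}

locals {
  read_only  = ["USE_CATALOG", "USE_SCHEMA", "SELECT"]
  read_write = concat(local.read_only, ["MODIFY", "CREATE_SCHEMA", "CREATE_TABLE"])

  # Principal => privileges on the prod catalog.
  prod_grants = {
    (var.pipeline_sp_application_id) = local.read_write
    ai_admin                         = local.read_only
    ai_engineer                      = local.read_only
    ai_analyst                       = local.read_only
  }
}

resource "databricks_grants" "prod_business" {
  catalog = "prod_business"

  dynamic "grant" {
    for_each = local.prod_grants
    content {
      principal  = grant.key
      privileges = grant.value
    }
  }
}
```

The only principal holding a write privilege is the service principal, so the "no human writes to prod" rule is visible in one map.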

Pattern 1: zero-copy reads

Production data exists exactly once, and everyone reads that one copy without cloning it.

Inside a single metastore, an analyst in the sandbox querying prod_business.gold.sales_fact needs nothing more than a permission grant: SELECT, no replication, no extra storage. Cross a metastore or org boundary (say a separate prod org sharing down to an analytics org) and it becomes Delta Sharing: still read-only, still metadata-only, the bytes stay put.

And because human accounts only ever get SELECT on prod, principle #1 holds automatically — there's no code path where a person's typo mutates the gold layer.

Pattern 2: materialize once, in the sandbox

Read-only prod plus a curious analyst equals a specific failure mode: the same expensive query, run over and over, full-scanning a fact table every single time. The fix isn't to lock the analyst down — it's to give them somewhere to land the expensive part once.

-- Materialize the heavy part once, in your own sandbox schema
CREATE TABLE dev_sandbox.sandbox_user_a.monthly_summary AS
SELECT ...
FROM   prod_business.gold.sales_fact   -- zero-copy read (UC grant / Delta Sharing)
WHERE  base_date >= '2026-01-01';

Now every follow-up query hits monthly_summary — a small, physical Delta table the analyst owns — instead of re-scanning the production fact table. The repeated full scans collapse to a single one. Per-user schemas keep everyone's scratch tables from colliding.
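Per-user schemas could be provisioned along these lines. This is a sketch under assumptions: the sandbox_users map and its example login are hypothetical, and whether schemas are created through Terraform at all, rather than by some lighter onboarding step, is not something this post specifies.

```hcl
variable "sandbox_users" {
  description = "Schema name => owning user's login (hypothetical input)."
  type        = map(string)
  default = {
    sandbox_user_a = "user.a@example.com"
  }
}

resource "databricks_schema" "sandbox" {
  for_each     = var.sandbox_users
  catalog_name = "dev_sandbox"
  name         = each.key
}

# Each user gets everything on their own schema and nothing on anyone else's.
resource "databricks_grants" "sandbox" {
  for_each = var.sandbox_users
  schema   = databricks_schema.sandbox[each.key].id

  grant {
    principal  = each.value
    privileges = ["ALL_PRIVILEGES"]
  }
}
```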

Pattern 3: the promote path

So an analyst builds something great in their sandbox. How does it become a real production asset? Not by anyone copying it into prod — principle #1 forbids that. It goes through a promote flow that structurally keeps humans out of the production write:

[Sandbox]  analyst experiments & validates  (sandbox.<user>)
    │
    ▼  hand off — analysis code goes to the data-engineering team
    │
    ▼  code review — query optimization (Z-Order, partition pruning), naming/schema standards
    │
    ▼  pipeline intake — validated in dev, then registered in the prod pipeline
    │
    ▼  productionize — the pipeline service principal writes it into the prod Medallion layers

The last writer is always the pipeline principal. The sandbox is where ideas are born; the promote path is the airlock they pass through to become production data.

The infra standard underneath

Two platform-wide defaults make the cost story actually hold up (more in the compute post):

Serverless SQL Warehouse as the default query engine. Analyst load is spiky and unpredictable, so we don't leave fixed interactive clusters running. Serverless bills per second while the warehouse is running and auto-stops within a minute or two of going idle, so the "someone forgot to shut it down" bill is designed out at the architecture level, not left to discipline.
Cost tracking from Unity Catalog system tables. system.billing.usage and system.query.history feed a live dashboard, so we can see exactly who's running the expensive full scans in the sandbox and which queries burn the most. That same data backs the query-optimization review in the promote path — the decision to materialize or Z-order isn't a guess, it's a number.
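The serverless default might be declared roughly like this. The warehouse name and size are placeholders, and the auto_stop_mins value is only my reading of "a minute or two"; the minimum a workspace accepts may differ, so treat it as something to verify.

```hcl
resource "databricks_sql_endpoint" "analyst_default" {
  name         = "analyst-serverless" # placeholder name
  cluster_size = "Small"              # placeholder size

  # Serverless compute runs on the PRO warehouse type.
  warehouse_type            = "PRO"
  enable_serverless_compute = true

  # Assumed value; check the minimum your workspace allows.
  auto_stop_mins = 2
}
```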

The takeaway

Three zones, three permission profiles. Prod is read-only for humans and written only by automation; dev is where engineers build; sandbox is a walled room analysts can write in freely.
Never copy prod. Zero-copy reads (UC grants or Delta Sharing) mean one source of truth, one storage bill, zero drift.
Materialize in the sandbox, promote through review. Heavy intermediates land once in a per-user schema; anything worth keeping goes back to prod only through a code-reviewed pipeline running as a service principal.
Serverless + system tables make the cost model self-enforcing instead of aspirational.

That's the destination. The next five posts are how we actually built toward it — starting with the workspaces and the metastore.

[Open-Source LLM Agent #3] Running a Whole RAG Agent Offline: LangGraph + Ollama + Embedded Qdrant (Zero API Keys)

duke — Mon, 29 Jun 2026 01:22:31 +0000

Most RAG tutorials open with "set your OPENAI_API_KEY." This one doesn't need it. In Part 1 I claimed the LLM and embeddings are behind a swappable boundary — "switch providers via config, not code." Part 3 is me cashing that claim: running the entire RAG agent — ingestion, retrieval, the ReAct loop, source citations — on a laptop with zero API keys and no Docker, just Ollama and an embedded Qdrant.

Everything below is real output from an actual run. Including the one thing that broke.

What "offline" actually requires

Three pieces, all local:

Ollama running two models — one for chat, one for embeddings:

  ollama pull qwen3.5:9b   # chat / reasoning
  ollama pull bge-m3       # embeddings (1024-dim, multilingual)

Embedded Qdrant — no server, no container. The vector store writes to a local directory.
A one-line config flip so chat goes to Ollama instead of the gateway:

  CHAT_PROVIDER=ollama

That's it. No OPENAI_API_KEY, no docker compose up. The reason this is a flip and not a rewrite is the provider-swap design from Part 1 — let's look at the three factories that make it work.

The embeddings factory — swap by config

# app/llm/embeddings.py
from functools import lru_cache

from langchain_core.embeddings import Embeddings

@lru_cache
def get_embeddings() -> Embeddings:
    s = get_settings()
    provider = s.embedding_provider.lower()

    if provider == "ollama":
        from langchain_ollama import OllamaEmbeddings
        return OllamaEmbeddings(model=s.embedding_model, base_url=s.ollama_url)

    if provider == "openai":
        from langchain_openai import OpenAIEmbeddings
        return OpenAIEmbeddings(base_url=f"{s.litellm_url}/v1",
                                api_key=s.litellm_key, model=s.embedding_model)

    raise ValueError(f"unknown embedding_provider: {s.embedding_provider!r}")

Both branches return the same LangChain Embeddings interface, so the ingestion and retrieval code never knows which one it got. Local dev → Ollama (offline). Production → OpenAI via the gateway. One caveat that matters later: the two providers produce different vector dimensions, so you can't mix vectors ingested with one and queried with the other. More on that in the gotchas.

The vector store — embedded vs. remote, also by config

# app/rag/store.py
from functools import lru_cache

from qdrant_client import QdrantClient

@lru_cache
def get_client() -> QdrantClient:
    s = get_settings()
    if s.qdrant_url:
        return QdrantClient(url=s.qdrant_url, api_key=s.qdrant_api_key)  # remote (prod)
    return QdrantClient(path=s.qdrant_path)                             # embedded (local)

No QDRANT_URL? You get an embedded client that persists to s.qdrant_path — a plain directory. Set QDRANT_URL in prod and the same code talks to a real Qdrant service. The trade-off of embedded mode: it locks the directory to a single process, which becomes gotcha #2.

Ingestion: docs → chunks → vectors

The ingest script is the whole pipeline in ~30 lines: load files, split them, probe the embedding dimension, create the collection, upsert.

# scripts/ingest.py (trimmed)
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
chunks = splitter.split_documents(documents)

# probe the embedding dimension so the collection matches the provider
dim = len(get_embeddings().embed_query("probe"))
ensure_collection(dim)
get_vector_store().add_documents(chunks)

The embed_query("probe") trick is worth pausing on: instead of hard-coding 1024 for bge-m3 (or 1536 for OpenAI), it asks the active embedder for one vector and measures it. Swap the provider and the collection is created with the right size automatically.

Running it for real:

$ python scripts/ingest.py --reset
[ingest] source=docs  collection=docs  embed=ollama:bge-m3
[ingest] 5 documents → 53 chunks
[ingest] embedding dim = 1024
[ingest] done — 53 points in collection

Five markdown files, 53 chunks, 1024-dim vectors from bge-m3, written to the local Qdrant directory. No network calls left the machine.

Running the agent — no server needed

You can hit the FastAPI endpoint, but to see the graph think you can also invoke it directly. Here's a real run, asking about something that lives in the docs:

res = await graph.ainvoke({"messages": [HumanMessage(content=
    "How is short-term vs long-term memory implemented in this project?")]})

print([type(m).__name__ for m in res["messages"]])
# ['HumanMessage', 'AIMessage', 'ToolMessage', 'AIMessage']

That message sequence is the ReAct loop, visible in the state:

HumanMessage — the question
AIMessage with tool_calls=[search_docs(...)] — the model decides to retrieve
ToolMessage — the retrieved chunks come back
AIMessage — the final synthesized answer

And the answer itself, generated entirely by a 9B model on the laptop:

Short-term memory: PostgreSQL (PostgresSaver) stores per-thread
  conversation state; swappable to Redis (RedisSaver) if needed.
Long-term memory: Zep manages the user's persistent knowledge,
  recalled by the app on later turns.

Sources: <doc-a>.md, <doc-b>.md

Grounded in the actual docs, with source attribution, zero API keys. That's the win. Now the part the tutorials skip.

Gotchas (the part that's actually worth reading)

1. The empty synthesis turn — the local model, not the pipeline

On one run, the exact same question produced this:

[1] AIMessage   content=''   tool_calls=[search_docs(...)]   finish_reason='tool_calls'
[2] ToolMessage content='[1] (source: ...) ## memory layers ...'   ← retrieval worked
[3] AIMessage   content=''   tool_calls=[]   finish_reason='stop'  ← empty answer

Retrieval succeeded. The chunks were right there in step 2. But step 3 — the model's job to read the chunks and answer — came back empty. finish_reason='stop', no tokens, no error. Re-running the same question gave a perfectly good 280-character answer with citations. So it's intermittent: a small local model occasionally produces an empty turn after a tool call.

Two things to take away:

It's the model, not your graph. The pipeline (routing → retrieval → state) was flawless; the synthesis step just whiffed.
The saw_token fallback from Part 2 won't save you here — that fallback calls ainvoke when no tokens stream, but here ainvoke is the empty result. The real mitigations are a larger/better tool-tuned local model, or accepting some flakiness as the price of fully offline. Worth knowing before you demo it live.
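Since a plain re-run of the same question produced a good answer, one extra mitigation worth sketching is a bounded retry around the graph call. This is not something the project does; it is a minimal sketch under the assumption that the graph result is a dict with a "messages" list whose last item has a string content. The names final_text and invoke_with_retry are hypothetical.

```python
import asyncio
from typing import Any, Awaitable, Callable


def final_text(result: dict[str, Any]) -> str:
    """Return the stripped content of the last message, or '' if there is none."""
    messages = result.get("messages", [])
    if not messages:
        return ""
    content = getattr(messages[-1], "content", "")
    return content.strip() if isinstance(content, str) else ""


async def invoke_with_retry(
    invoke: Callable[[dict[str, Any]], Awaitable[dict[str, Any]]],
    inputs: dict[str, Any],
    max_attempts: int = 3,
) -> dict[str, Any]:
    """Call `invoke` until the final message has text or attempts run out."""
    result: dict[str, Any] = {}
    for _ in range(max_attempts):
        result = await invoke(inputs)
        if final_text(result):
            break  # got a non-empty synthesis turn
    return result  # may still be empty after max_attempts
```

With the real graph this would be called as await invoke_with_retry(graph.ainvoke, inputs). Each retry repeats the whole run, retrieval included, so it trades latency for fewer empty answers and does nothing for a model that fails consistently.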

2. Embedded Qdrant locks the directory

Embedded mode lets only one process open the storage directory at a time. Run the ingest script while the server is up and you'll get a lock error. Order matters: ingest first → let it exit → then start the server. The ingest script even closes the client explicitly to avoid a noisy shutdown traceback.

3. Embedding dimensions must match end to end

bge-m3 is 1024-dim; OpenAI's text-embedding-3-small is 1536. If you ingest with one provider and query with another, the dimensions don't line up and search breaks. Switching embedding_provider means re-ingesting (--reset). The embed_query("probe") dimension check is exactly what keeps the collection honest per provider.
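To make the mismatch fail loudly instead of surfacing as a broken search, the probe can double as a guard. The sketch below is an illustration, not the project's ensure_collection: FakeEmbedder, probe_dim, and check_collection_dim are hypothetical names, and the existing collection size is passed in as a plain integer rather than read from Qdrant.

```python
from __future__ import annotations

from typing import Protocol


class Embedder(Protocol):
    def embed_query(self, text: str) -> list[float]: ...


class FakeEmbedder:
    """Hypothetical stand-in for a real embedder; returns a fixed-size vector."""

    def __init__(self, dim: int) -> None:
        self.dim = dim

    def embed_query(self, text: str) -> list[float]:
        return [0.0] * self.dim


def probe_dim(embedder: Embedder) -> int:
    # Ask the active embedder for one vector and measure it.
    return len(embedder.embed_query("probe"))


def check_collection_dim(existing_dim: int | None, embedder: Embedder) -> int:
    """Return the size to create the collection with, or raise on a mismatch."""
    dim = probe_dim(embedder)
    if existing_dim is not None and existing_dim != dim:
        raise ValueError(
            f"collection is {existing_dim}-dim but the embedder produces "
            f"{dim}-dim vectors; re-ingest with --reset"
        )
    return dim
```

Calling check_collection_dim(1024, FakeEmbedder(1536)) raises, which is the situation you get after switching embedding_provider without re-ingesting.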

4. The first call is slow

Ollama loads the model into memory on first use. The first request eats that cost; subsequent ones are fast. Don't benchmark the cold start.

Why this matters

You can build, debug, and demo the entire RAG agent — graph, retrieval, citations — on a plane with no wifi. Then, for production, you flip two config values (CHAT_PROVIDER, QDRANT_URL) and the same code talks to a hosted model and a real Qdrant cluster. Part 1 claimed the provider boundary; Part 3 ran on both sides of it.

The flip side is honesty about local models: retrieval is rock-solid, but a 9B model's synthesis step is the weak link, and it'll occasionally hand you an empty answer. Know that going in.

Next: persisting conversation threads with a checkpointer — so the agent remembers across requests — and what that adds to the message log you just saw.

Part 3 of a series on running LangGraph in production. Part 1 · Part 2.

[Open-Source LLM Agent #2] Streaming a LangGraph Agent as OpenAI-Compatible SSE (with a Thinking Panel)

duke — Wed, 24 Jun 2026 01:00:27 +0000

In Part 1 I built a LangGraph ReAct agent behind an OpenAI-compatible API and waved at one line:

return StreamingResponse(graph_to_openai_sse(graph, inputs, model_name, config=config),
                         media_type="text/event-stream")

That graph_to_openai_sse is where the real work hides. An OpenAI client like Open WebUI doesn't want "a LangGraph run" — it wants a very specific stream of chat.completion.chunk JSON objects over Server-Sent Events, terminated by a [DONE] sentinel. LangGraph, meanwhile, emits its own rich event stream. This post is the adapter between the two — about 90 lines that also give you a free "thinking" panel showing the agent's tool calls as they happen.

The two formats

What the client expects — each token arrives as an SSE line: data: {json}\n\n, where the JSON is an OpenAI chunk:

# app/api/openai_compat.py
import time

def make_chunk(delta, model_name, completion_id, finish_reason=None):
    return {
        "id": completion_id,                       # "chatcmpl-..."
        "object": "chat.completion.chunk",
        "created": int(time.time()),
        "model": model_name,
        "choices": [{"index": 0, "delta": delta, "finish_reason": finish_reason}],
    }

The stream has a strict shape:

a first chunk with delta = {"role": "assistant"},
many chunks with delta = {"content": "..."} — one per token,
a final chunk with empty delta and finish_reason = "stop",
the literal line data: [DONE]\n\n.

Miss the [DONE] and the client spins forever. Skip the role chunk and some clients drop the first token. The contract is small but unforgiving.
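Seen from the client's side, the contract looks roughly like this. The parser below is a minimal sketch of what an OpenAI-compatible client does with the stream, not code from Open WebUI or the openai SDK; collect_content is a hypothetical name, and it assumes one data: line per event.

```python
import json
from typing import Iterable


def collect_content(lines: Iterable[str]) -> tuple[str, bool]:
    """Reassemble the message text from SSE lines and report whether [DONE] arrived."""
    parts: list[str] = []
    done = False
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separator lines between events
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            done = True
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        # The role chunk and the final stop chunk carry no "content" key.
        parts.append(delta.get("content") or "")
    return "".join(parts), done
```

If the second return value is False, the stream ended without the sentinel, which is the case where a client keeps waiting.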

What LangGraph emits — astream_events is a single async stream of typed events for everything happening inside the graph: model tokens, tool calls, node transitions. We subscribe once and translate each event we care about into chunks.

The core loop

# app/api/streaming.py
from langchain_core.messages import AIMessageChunk

async def graph_to_openai_sse(graph, inputs, model_name, config=None):
    completion_id = new_completion_id()
    yield _sse(make_chunk({"role": "assistant"}, model_name, completion_id))  # (1) role

    def emit(text):
        return _sse(make_chunk({"content": text}, model_name, completion_id))

    async for event in graph.astream_events(inputs, config=config, version="v2"):
        kind = event.get("event")

        if kind == "on_chat_model_stream":
            chunk = event["data"]["chunk"]
            if isinstance(chunk, AIMessageChunk) and isinstance(chunk.content, str):
                yield emit(chunk.content)                                     # (2) tokens

    yield _sse(make_chunk({}, model_name, completion_id, finish_reason="stop"))  # (3) stop
    yield b"data: [DONE]\n\n"                                                     # (4) done

Three things to notice:

version="v2" pins the event schema. The event stream format has changed across LangChain releases; pinning it means your metadata.langgraph_node and data.chunk keys don't silently move under you.
on_chat_model_stream is the token event. Its data.chunk is an AIMessageChunk, emitted only when the LLM is actually streaming. The isinstance(...) guards skip chunks whose content is not a plain string (some providers send a list of content blocks), so only text tokens are forwarded.
One completion_id for the whole response. Every chunk in a single completion shares it; that's how the client stitches tokens into one message.

_sse is just the wire framing — and note ensure_ascii=False, which matters the moment your tokens are Korean, Japanese, or emoji:

def _sse(payload):
    return f"data: {json.dumps(payload, ensure_ascii=False)}\n\n".encode("utf-8")
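A quick way to see what the flag changes is to serialize the same payload both ways. Both outputs are valid JSON and decode to the same object; the difference is that the default escapes every non-ASCII character, which inflates the frame and makes raw streams hard to read when debugging.

```python
import json

payload = {"content": "안녕"}

escaped = json.dumps(payload)                        # default: ensure_ascii=True
readable = json.dumps(payload, ensure_ascii=False)   # keeps the characters as-is

print(escaped)   # {"content": "\uc548\ub155"}
print(readable)  # {"content": "안녕"}

# Either form round-trips to the same Python object.
same = json.loads(escaped) == json.loads(readable)
```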

Surfacing the agent's thinking

Streaming the final answer is table stakes. The interesting part of a ReAct agent is what it did before answering — which document it searched, what came back. Open WebUI renders any text wrapped in <think>...</think> as a collapsible reasoning panel. So we narrate tool activity into that panel.

First, label the nodes worth announcing:

NODE_LABELS = {
    "tools": "🔍 Searching the docs…",
}

Then open a <think> block, and on the relevant events, emit human-readable progress instead of raw tokens:

    show_thinking = bool(NODE_LABELS)
    think_open = False
    prev_node = None

    if show_thinking:
        yield emit("<think>\n")
        think_open = True

    async for event in graph.astream_events(inputs, config=config, version="v2"):
        kind = event.get("event")
        node = (event.get("metadata") or {}).get("langgraph_node", "")

        # node entry → status line
        if node and node != prev_node and node in NODE_LABELS:
            yield emit(f"\n{NODE_LABELS[node]}\n")
            prev_node = node

        if kind == "on_tool_start":
            yield emit(f"  • `{event.get('name', 'tool')}` running…")
            continue

        if kind == "on_tool_end":
            output = event.get("data", {}).get("output")
            text = output.content if hasattr(output, "content") else str(output)
            snippet = " ".join(str(text).split())[:90]          # collapse whitespace, clip
            yield emit(f" ✓ `{snippet}…`\n" if snippet else " ✓\n")
            continue
        # ... on_chat_model_stream handled as before

The on_tool_end output is a ToolMessage, so its text lives on .content — hence the hasattr(output, "content") check before falling back to str(). Collapsing whitespace and clipping to ~90 chars keeps the panel readable instead of dumping a wall of retrieved text.

Closing the panel has to happen no matter how the stream ends — success, exception, or early return — so it goes in a finally:

    finally:
        if think_open:
            yield _sse(make_chunk({"content": "\n</think>\n"}, model_name, completion_id))

The result in the UI: a collapsible "🔍 Searching the docs… ✓" panel, then the streamed answer below it. The user sees the agent reach for RAG in real time.

Two production details that bite

1. Errors belong in the stream, not in a 500. Once you've started streaming, the HTTP status is already 200 and headers are flushed — you can't switch to an error response. So catch inside the generator and emit the error as content:

    except Exception as exc:
        log.exception("stream failed")
        yield _sse(make_chunk({"content": f"\n[error] {exc}"}, model_name, completion_id))

The user sees [error] ... in the chat instead of a frozen, half-rendered message.
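Here is a stripped-down, runnable sketch of that ordering, with the thinking panel left out. It is an illustration rather than the service's code: to_sse, _chunk, and broken_source are hypothetical names, the chunk omits id, created, and model, and the token source is a fake async generator that fails midway.

```python
from __future__ import annotations

import asyncio
import json
from typing import AsyncIterator


def _sse(payload: dict) -> bytes:
    return f"data: {json.dumps(payload, ensure_ascii=False)}\n\n".encode("utf-8")


def _chunk(delta: dict, finish_reason: str | None = None) -> dict:
    # Trimmed-down chunk; a real one also carries id, created, and model.
    return {
        "object": "chat.completion.chunk",
        "choices": [{"index": 0, "delta": delta, "finish_reason": finish_reason}],
    }


async def to_sse(tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
    yield _sse(_chunk({"role": "assistant"}))
    try:
        async for token in tokens:
            yield _sse(_chunk({"content": token}))
    except Exception as exc:
        # Headers are already sent, so the error travels as content.
        yield _sse(_chunk({"content": f"\n[error] {exc}"}))
    # The stop chunk and the sentinel go out on both the success and error paths.
    yield _sse(_chunk({}, finish_reason="stop"))
    yield b"data: [DONE]\n\n"


async def broken_source() -> AsyncIterator[str]:
    yield "Hel"
    yield "lo"
    raise RuntimeError("model went away")


async def _collect() -> list[bytes]:
    return [frame async for frame in to_sse(broken_source())]


frames = asyncio.run(_collect())
```

The detail that matters is that the stop chunk and data: [DONE] sit after the try/except, so a failed run still terminates cleanly from the client's point of view.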

2. Not every model streams. Some gateways/models return a single batched response with no on_chat_model_stream events at all. If you only ever forwarded tokens, those models would yield an empty answer. Track whether any token was seen, and if not, fall back to a plain ainvoke:

    if not saw_token:
        result = await graph.ainvoke(inputs, config=config)
        final = extract_final_text(result.get("messages", []))
        yield emit(final)

extract_final_text walks the message log backwards for the last non-empty AIMessage — handling both plain-string content and the list-of-blocks shape some providers return. This one guard is the difference between "streaming works on my dev model" and "works on every model behind the gateway."
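The post doesn't show extract_final_text, so here is one plausible shape for it, written against stand-in message classes to stay self-contained. Treat it as a sketch: the AIMessage and ToolMessage dataclasses below only mimic the LangChain ones, and the assumption that text blocks look like {"type": "text", "text": ...} is mine.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class AIMessage:  # stand-in for langchain_core.messages.AIMessage
    content: Any


@dataclass
class ToolMessage:  # stand-in for langchain_core.messages.ToolMessage
    content: Any


def _to_text(content: Any) -> str:
    """Flatten plain-string or list-of-blocks content into one string."""
    if isinstance(content, str):
        return content
    if isinstance(content, list):
        parts = []
        for block in content:
            if isinstance(block, str):
                parts.append(block)
            elif isinstance(block, dict) and block.get("type") == "text":
                parts.append(block.get("text", ""))
        return "".join(parts)
    return ""


def extract_final_text(messages: list) -> str:
    """Walk the log backwards and return the last non-empty AI message text."""
    for msg in reversed(messages):
        if isinstance(msg, AIMessage):
            text = _to_text(msg.content).strip()
            if text:
                return text
    return ""
```

Walking backwards and skipping empty AI messages means an empty final turn falls through to an earlier answer if one exists, and to an empty string otherwise.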

The shape of the whole thing

graph.astream_events(version="v2")
        │
        ├─ on_chat_model_stream → emit({"content": token})
        ├─ node entry           → emit("🔍 status line")   ┐
        ├─ on_tool_start        → emit("• tool running…")  ├─ inside <think>…</think>
        ├─ on_tool_end          → emit("✓ snippet…")       ┘
        └─ (exception)          → emit("[error] …")
        ▼
 first chunk {role}  →  …content chunks…  →  {finish_reason: stop}  →  data: [DONE]

The payoff from Part 1 compounds here: because the boundary is just OpenAI SSE, this thinking-panel UX shows up in any OpenAI-compatible client with zero client code. You wrote a translator, and every frontend in that ecosystem speaks it for free.

Next up: persisting conversation threads with a checkpointer so the agent remembers across requests — and what that does to the streaming loop.

Built with LangGraph, LangChain, and FastAPI. Part 2 of a series on running LangGraph in production — Part 1 here.

[Open-Source LLM Agent #1] Running a LangGraph ReAct Agent in Production: OpenAI-Compatible API + Multi-Model Gateway + One-Line Tracing

duke — Tue, 23 Jun 2026 22:57:54 +0000

Most LangGraph content stops at the notebook. You build a cute ReAct loop, it answers one question, and the article ends before the hard part: how do you actually serve this thing, swap models without a rewrite, and see what it's doing when it misbehaves?

This post walks through a small but production-shaped LangGraph deployment: a RAG ReAct agent that

exposes an OpenAI-compatible HTTP API, so any OpenAI client (Open WebUI, the openai SDK, LibreChat) can talk to it unchanged,
routes every model call through a gateway so switching from a hosted API to self-hosted vLLM is a config change, not a code change, and
gets full tracing — node transitions, tool calls, and LLM calls in one trace — by adding a single callback.

Every snippet below is real code from a working service. Roughly 150 lines of Python is all it takes.

The shape of the thing

OpenAI client (Open WebUI, openai SDK)
        │  POST /v1/chat/completions
        ▼
FastAPI router ──► LangGraph StateGraph ──► LLM Gateway ──► model (hosted API today, vLLM tomorrow)
        │                   │
        │                   └──► ToolNode ──► Qdrant (RAG)
        │
        └──► Langfuse callback (one trace per request)

The contract with the outside world is just the OpenAI API. Everything interesting — the graph, RAG, tracing — lives behind that boundary. That single decision is what lets an off-the-shelf chat UI drive a custom agent with zero adapter code.

1. The ReAct graph

The graph is deliberately tiny: one agent node that reasons, one tools node that retrieves, and a conditional edge that loops between them until the model stops asking for tools.

# app/graph/builder.py
from langgraph.graph import END, StateGraph
from langgraph.prebuilt import ToolNode, tools_condition

def build_graph():
    g = StateGraph(AgentState)
    g.add_node("agent", agent_node)
    g.set_entry_point("agent")

    # ReAct: if the model emits tool_calls, go to `tools`; otherwise END.
    g.add_node("tools", ToolNode(TOOLS))
    g.add_conditional_edges("agent", tools_condition)
    g.add_edge("tools", "agent")
    return g.compile()

tools_condition and ToolNode are LangGraph prebuilts that do the unglamorous work: inspect the last message for tool_calls, route accordingly, execute the tools, and append ToolMessages back into state. You wire the loop; they run it.

State is a single shared message log with a reducer that appends rather than replaces:

# app/graph/state.py
from typing import Annotated, TypedDict
from langchain_core.messages import BaseMessage
from langgraph.graph.message import add_messages

class AgentState(TypedDict, total=False):
    messages: Annotated[list[BaseMessage], add_messages]

add_messages is the reducer. Every node returns {"messages": [...]} and LangGraph merges it into the running log — no manual list-shuffling, and it's what makes the agent⇄tools loop accumulate context correctly.
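If the reducer idea is new, this simplified model may help. It is not LangGraph's implementation: append_messages is a hypothetical plain-Python function over dicts that mirrors the two behaviours I rely on, appending new messages and replacing one that arrives with an id already in the log.

```python
def append_messages(existing: list[dict], new: list[dict]) -> list[dict]:
    """Merge `new` into `existing`: append, or replace in place when the id matches."""
    merged = list(existing)  # never mutate the previous state
    index_by_id = {m["id"]: i for i, m in enumerate(merged) if m.get("id")}
    for msg in new:
        msg_id = msg.get("id")
        if msg_id and msg_id in index_by_id:
            merged[index_by_id[msg_id]] = msg  # same id: replace
        else:
            if msg_id:
                index_by_id[msg_id] = len(merged)
            merged.append(msg)  # new message: append
    return merged


log = [{"id": "1", "role": "human", "content": "hi"}]
log = append_messages(log, [{"id": "2", "role": "ai", "content": "hello"}])
log = append_messages(log, [{"id": "3", "role": "tool", "content": "chunks"}])
```

Each node only returns its own new messages, and the reducer is what turns those partial returns into the growing log the next node reads.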

The agent node binds the tools and calls the model. Note bind_tools is conditional — flip RAG off and the exact same node degrades to a plain single-shot chat call:

# app/graph/nodes/agent.py
async def agent_node(state: AgentState) -> dict:
    llm = get_llm()
    if get_settings().rag_enabled:
        llm = llm.bind_tools(TOOLS)
    messages = [SystemMessage(content=SYSTEM_PROMPT), *state["messages"]]
    response = await llm.ainvoke(messages)
    return {"messages": [response]}

And the tool itself is an ordinary @tool-decorated function. The docstring is not documentation — it's the prompt the model reads to decide when to call it:

# app/graph/tools.py
@tool
def search_docs(query: str) -> str:
    """Search internal docs for content relevant to the question.
    When the user asks about the project/system/docs, call this first."""
    hits = get_vector_store().similarity_search(query, k=get_settings().rag_top_k)
    blocks = [
        f"[{i}] (source: {doc.metadata.get('source', 'unknown')})\n{doc.page_content.strip()}"
        for i, doc in enumerate(hits, 1)
    ]
    return "\n\n".join(blocks) or "No relevant documents found."

Returning a [1] (source: ...) structure isn't cosmetic — it's how the model can cite sources in its final answer, which is the difference between a demo and something people trust.

2. The OpenAI-compatible surface

Here's the lever that makes everything else cheap: the agent speaks OpenAI's wire format. The router turns an incoming /v1/chat/completions request into graph input and the graph's output back into an OpenAI response.

# app/api/router.py
@router.post("/v1/chat/completions")
async def chat_completions(req: ChatCompletionRequest):
    graph = get_graph()
    inputs = {"messages": to_langchain_messages(req.messages)}
    config: dict = {}

    if not req.stream:
        result = await graph.ainvoke(inputs, config=config)
        text = extract_final_text(result.get("messages", []))
        return make_completion(text, settings.served_model_name)

    return StreamingResponse(
        graph_to_openai_sse(graph, inputs, settings.served_model_name, config=config),
        media_type="text/event-stream",
    )

Because the response matches OpenAI's schema (including SSE streaming chunks), Open WebUI thinks it's talking to OpenAI. You point its openaiBaseUrl at this service and your custom RAG agent shows up as a selectable model. No frontend work.
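For the non-streaming branch, the response body has to look like an OpenAI chat.completion object. The router calls make_completion without showing it, so the version below is a guess at its shape rather than the service's code; in particular the zeroed usage block is a placeholder assumption, since the graph result here carries no token counts.

```python
import time
import uuid


def make_completion(text: str, model_name: str) -> dict:
    """Wrap the agent's final text in an OpenAI-style chat.completion payload."""
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model_name,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": text},
                "finish_reason": "stop",
            }
        ],
        # Placeholder: real token counts are not available in this sketch.
        "usage": {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0},
    }
```

Note the non-streaming object uses a message field where the streaming chunks use delta; clients read the two shapes differently.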

3. One gateway, many models

LangGraph nodes never name a provider. They call one factory:

# app/llm/client.py
from langchain_openai import ChatOpenAI

def get_llm(model=None, temperature=None, streaming=True) -> ChatOpenAI:
    s = get_settings()
    return ChatOpenAI(
        base_url=f"{s.litellm_url}/v1",   # gateway, not a provider
        api_key=s.litellm_key,
        model=model or s.default_model,
        temperature=s.default_temperature if temperature is None else temperature,
        streaming=streaming,
    )

The base_url points at a LiteLLM gateway, not at any specific vendor. LiteLLM exposes an OpenAI-compatible endpoint and fans out to whatever its model_list says — a hosted API today, self-hosted vLLM tomorrow. Migrating off a paid API to an in-cluster GPU model becomes a gateway config edit; this Python file never changes.

There's one deliberate escape hatch — when the gateway is down locally, point straight at Ollama's OpenAI-compatible endpoint:

    if s.chat_provider.lower() == "ollama":
        return ChatOpenAI(base_url=f"{s.ollama_url}/v1", api_key="ollama",
                          model=model or s.ollama_chat_model, ...)

Same ChatOpenAI class, different base_url. The OpenAI-compatible interface shows up three times in this architecture — inbound API, gateway, and local fallback — and that consistency is the whole trick.

4. Tracing in one line

A multi-node graph with a tool loop is opaque when it goes wrong. Did the model skip the tool? Retrieve garbage? Loop twice? Langfuse's LangChain callback captures the entire run — every node transition, tool call, and LLM call — as a single nested trace.

The integration is genuinely one object:

# app/obs/langfuse.py
from functools import lru_cache

@lru_cache
def get_langfuse_handler():
    s = get_settings()
    if not (s.langfuse_public_key and s.langfuse_secret_key):
        return None  # no keys → tracing silently disabled (safe for local/POC)
    from langfuse.langchain import CallbackHandler
    return CallbackHandler()

Heads-up for the SDK version churn: on Langfuse SDK v3+ the import is from langfuse.langchain import CallbackHandler, and the handler reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST from the environment — you don't pass keys to the constructor anymore. This tripped up a lot of v2 tutorials.

Then attach it per request via the graph config — which is also where you stamp user/session metadata so traces are filterable in the Langfuse UI:

# app/api/router.py
handler = get_langfuse_handler()
if handler is not None:
    config["callbacks"] = [handler]
    config["metadata"] = {
        "langfuse_user_id": req.user or "anonymous",
        "langfuse_session_id": getattr(req, "chat_id", None) or "no-session",
        "langfuse_tags": ["my-agent", settings.served_model_name],
    }

Passing the handler through config["callbacks"] (rather than baking it into the LLM client) means it propagates down the entire graph automatically. One request → one trace → every step visible.

What this buys you

Concern	How it's handled	Why it scales
Frontend integration	OpenAI-compatible API	Any OpenAI client works unchanged
Model choice	LiteLLM gateway behind `ChatOpenAI`	Swap providers via config, not code
Agent logic	LangGraph `StateGraph` + prebuilts	ReAct loop in ~10 lines, extensible to multi-agent
Observability	Langfuse callback via graph `config`	One trace per request, zero per-node wiring
Local dev	Ollama fallback through same interface	No gateway needed to hack offline

None of these pieces is exotic. The point is the seams: an OpenAI boundary on the outside, a gateway boundary on the model side, and a callback boundary for observability. Get the seams right and the agent in the middle stays small and swappable.

The same skeleton extends cleanly to a supervisor/worker multi-agent graph, a Postgres checkpointer for persistent threads, and an in-cluster vLLM model — each is an additive change behind one of those seams. But that's a follow-up post.

Built with LangGraph, LangChain, LiteLLM, Qdrant, and Langfuse. If you're running LangGraph in production and want to compare notes on deployment patterns, reach out.
