duke

Posted on Jul 2

[Databricks on AWS #3] Compute Governance on Databricks: Instance Pools, Cluster Policies, and Shared Clusters

#databricks #aws #terraform #governance

📚 Series: Databricks on AWS (Part 3)

1. Building a Databricks AI Platform on AWS
2. RBAC with Function-Role Groups
3. Compute Governance: Pools, Policies, Clusters ← you are here
4. The BOOTSTRAP_TIMEOUT Mystery
5. Fixing It with AWS PrivateLink
6. How We Structure the Terraform

RBAC decides who a user is. Compute governance decides what hardware they're allowed to spin up. Here's how instance pools, cluster policies, and an entitlement gate turn "anyone can launch a 128-core GPU box" into "you get exactly what your role permits."

In Part 2 we built the RBAC model: users map to function roles, function roles map to access roles, access roles get grants. That controls which data someone can touch.

But there's a second axis nobody talks about until the cloud bill arrives: compute. Left ungoverned, a single curious analyst can launch a cluster of r6i.4xlarge nodes (16 vCPUs and 128 GB of memory each) at 2 a.m., forget to turn it off, and greet you with a five-figure surprise on Monday. Governance is the answer, and on Databricks it comes in three layers.

The three layers, top to bottom

Think of it as a funnel. Each layer narrows what a user can actually do:

| Layer | What it does | Who feels it |
| --- | --- | --- |
| Instance pool | Pre-warmed VMs waiting to be claimed, for faster cluster starts | Everyone (transparently) |
| Cluster policy | The rules: which instance types, sizes, autotermination, runtime | Engineers creating clusters |
| Entitlement gate | `allow_cluster_create` on/off per group | Analysts (blocked) |

And on top of all that: shared, pre-made clusters for the people who shouldn't be creating anything at all.

Let's go layer by layer.

Layer 1: Instance Pools — pre-warmed VMs

A cold cluster start on AWS means: request EC2 capacity → wait for the instances → install the Databricks runtime → join the cluster. That's several minutes of a user staring at a spinner.

An instance pool keeps a set of VMs pre-acquired (or ready to be acquired fast) so clusters attach to them instead of provisioning from scratch. It's purely about speed and cost predictability — a pool doesn't restrict anything on its own.

For our workspace we defined six pools, split by CPU/GPU and by size:

| Pool | Instance type | Capacity | Tag |
| --- | --- | --- | --- |
| `ip_cpu_small` | m6g.large (2 vCPU / 8 GB) | 10 | cpu/small |
| `ip_cpu_medium` | m6g.xlarge (4 vCPU / 16 GB) | 15 | cpu/medium |
| `ip_cpu_large` | r6i.2xlarge (8 vCPU / 64 GB) | 15 | cpu/large |
| `ip_cpu_xlarge` | r6i.4xlarge (16 vCPU / 128 GB) | 20 | cpu/xlarge |
| `ip_gpu_small` | g5.xlarge (1× A10G) | 10 | gpu/small |
| `ip_gpu_large` | g5.2xlarge (1× A10G) | 20 | gpu/large |

The important knobs are shared across all of them:

- `min_idle_instances = 0`: we don't pay to keep VMs warm 24/7; the pool spins up on demand and the first start after idle pays the cold-start tax. A tradeoff we accepted for cost.
- `idle_instance_autotermination_minutes = 10`: idle VMs release themselves after 10 minutes.
- `availability = ON_DEMAND`: no spot reclaim surprises for interactive work.

The custom_tags matter more than they look: they flow into AWS cost allocation, so you can answer "how much did the GPU pools cost this month?" without guessing.
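To make the knobs above concrete, here is a minimal sketch of how the six pools could be declared with the Databricks Terraform provider's `databricks_instance_pool` resource. It is not the module from this series: the `local.pools` map, the `pool_class` tag key, the `pool_ids` output name, and the mapping of the table's "Capacity" column onto `max_capacity` are all assumptions made for illustration.

```hcl
# Sketch only. Names marked "hypothetical" do not come from the article.
locals {
  # One entry per row of the pool table; "Capacity" is assumed to mean max_capacity.
  pools = {
    cpu_small  = { node_type = "m6g.large", max_capacity = 10, tag = "cpu/small" }
    cpu_medium = { node_type = "m6g.xlarge", max_capacity = 15, tag = "cpu/medium" }
    cpu_large  = { node_type = "r6i.2xlarge", max_capacity = 15, tag = "cpu/large" }
    cpu_xlarge = { node_type = "r6i.4xlarge", max_capacity = 20, tag = "cpu/xlarge" }
    gpu_small  = { node_type = "g5.xlarge", max_capacity = 10, tag = "gpu/small" }
    gpu_large  = { node_type = "g5.2xlarge", max_capacity = 20, tag = "gpu/large" }
  }
}

resource "databricks_instance_pool" "this" {
  for_each = local.pools

  instance_pool_name = "ip_${each.key}"
  node_type_id       = each.value.node_type
  max_capacity       = each.value.max_capacity

  # No warm VMs kept around; idle ones are released after 10 minutes.
  min_idle_instances                    = 0
  idle_instance_autotermination_minutes = 10

  aws_attributes {
    availability = "ON_DEMAND"
  }

  # Hypothetical tag key; the value mirrors the "Tag" column.
  custom_tags = {
    pool_class = each.value.tag
  }
}

# Hypothetical output consumed by downstream modules (policies, clusters).
output "pool_ids" {
  value = { for name, pool in databricks_instance_pool.this : name => pool.id }
}
```

Driving the resource from a single map keeps the shared knobs in one place, so a change to the idle timeout cannot drift between pools.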

Layer 2: Cluster Policies — the actual governance

Instance pools make clusters fast. Cluster policies make clusters legal.

A policy is a template that constrains what a cluster can be: which pool it draws from, which Spark runtime, min/max workers, autotermination, and how many clusters one user can create. A user with a policy can only create clusters that fit inside it — they can't override the instance type, can't disable autotermination, can't ask for 200 workers.

Our analytics workspace has eight policies. A representative slice:

| Policy | Pool (worker / driver) | Runtime | Workers | Autoterm (min) | Max clusters/user |
| --- | --- | --- | --- | --- | --- |
| `cp_cpu_small` | ip_cpu_small / ip_cpu_small | 14.3.x-scala2.12 | 1–4 | 10 | 2 |
| `cp_cpu_medium` | ip_cpu_medium / ip_cpu_medium | 14.3.x-scala2.12 | 0–8 | 10 | 2 |
| `cp_cpu_large` | ip_cpu_large / ip_cpu_medium | 14.3.x-scala2.12 | 0–12 | 10 | 1 |
| `cp_gpu_small` | ip_gpu_small / ip_cpu_small | 14.3.x-gpu-ml-scala2.12 | 0–8 | 10 | 1 |
| `cp_job_standard` | ip_cpu_medium / ip_cpu_medium | 14.3.x-scala2.12 | 0–16 | 30 | n/a |

A few design choices worth calling out:

- Autotermination is not optional. The policy sets it (10 min for interactive, 30 for jobs) and the user can't remove it. This single rule kills the "forgot to shut it down over the weekend" bill.
- `data_security_mode = USER_ISOLATION` across the board, so Unity Catalog enforcement stays on. Governance from Part 2 doesn't get bypassed by a cleverly configured cluster.
- The driver often runs on a smaller pool than the workers (e.g. `cp_cpu_large` uses `ip_cpu_large` workers but an `ip_cpu_medium` driver). The driver rarely needs to match worker muscle, and this trims cost.
- `max_clusters_per_user` caps sprawl. The big policies are limited to 1 cluster per person.
- A shared `cost_center` tag on every policy feeds the same cost-allocation story as the pools.
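As an illustration of how these choices translate into a policy definition, the sketch below expresses the `cp_cpu_large` row with the provider's `databricks_cluster_policy` resource. Treat it as one plausible encoding, not the actual module: the `pool_ids` input variable, the `cost_center` value, the default worker counts, and the decision to hide the fixed fields are assumptions.

```hcl
# Sketch only. Pool IDs are assumed to arrive as a map from an upstream module.
variable "pool_ids" {
  type        = map(string)
  description = "Hypothetical input: pool name => instance pool ID"
}

resource "databricks_cluster_policy" "cpu_large" {
  name                  = "cp_cpu_large"
  max_clusters_per_user = 1

  definition = jsonencode({
    # Workers and driver are pinned to pools, so users cannot pick an instance type.
    "instance_pool_id"        = { type = "fixed", value = var.pool_ids["cpu_large"], hidden = true }
    "driver_instance_pool_id" = { type = "fixed", value = var.pool_ids["cpu_medium"], hidden = true }

    "spark_version" = { type = "fixed", value = "14.3.x-scala2.12" }

    # 0 to 12 workers, per the policy table; the defaults are illustrative.
    "autoscale.min_workers" = { type = "range", minValue = 0, maxValue = 12, defaultValue = 0 }
    "autoscale.max_workers" = { type = "range", minValue = 1, maxValue = 12, defaultValue = 4 }

    # Fixed, so it cannot be removed or changed by the cluster creator.
    "autotermination_minutes" = { type = "fixed", value = 10, hidden = true }

    # Keeps Unity Catalog enforcement on.
    "data_security_mode" = { type = "fixed", value = "USER_ISOLATION" }

    # Hypothetical tag value.
    "custom_tags.cost_center" = { type = "fixed", value = "analytics" }
  })
}
```

A `fixed` attribute is what turns a preference into a rule: a cluster request that sets a different value is rejected rather than silently corrected.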

Job policies (higher worker ceilings, longer autotermination) live alongside the interactive ones, so batch pipelines get their own lane without borrowing the interactive budget.

Layer 3: The Entitlement Gate — `allow_cluster_create`

Policies govern how someone creates a cluster. But some people shouldn't create clusters at all.

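Databricks has a workspace-level entitlement for exactly this: `allow_cluster_create`. Turn it off for a group and grant that group no cluster policy, and its members cannot create clusters: the button is gone from the UI and the API call is rejected. The second half matters, because per the Databricks Terraform provider documentation a principal without the entitlement but with permission to use a policy can still create clusters within that policy. With both removed, the door is locked before they reach it.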

This is the gate that makes the whole role model coherent:

| Role | `allow_cluster_create` | Cluster policies | What they can do |
| --- | --- | --- | --- |
| Admin | on | all / unrestricted | Create anything. Break glass. |
| Engineer | on | assigned policies only | Create clusters, but only inside their policy's box |
| Analyst | off | none | No creation at all; only attach to pre-made shared clusters |

An engineer gets freedom within guardrails. An analyst gets no guardrails because they get no steering wheel — they use what's already there. And that "what's already there" is the last piece.
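For the analyst tier, the gate could be expressed roughly as follows with the provider's `databricks_entitlements` resource. This is a sketch under assumptions: the group display name `analysts` is hypothetical, and the group is assumed to already exist in the workspace.

```hcl
# Sketch only. The group name is hypothetical.
data "databricks_group" "analysts" {
  display_name = "analysts"
}

resource "databricks_entitlements" "analysts" {
  group_id = data.databricks_group.analysts.id

  # Analysts may log in and use the workspace...
  workspace_access = true

  # ...but may not create clusters. For this to be a real lock, the group
  # must also hold no CAN_USE permission on any cluster policy.
  allow_cluster_create = false
}
```

Managing the entitlement per group rather than per user keeps the gate aligned with the function-role groups from Part 2: moving someone between groups changes what they can launch with no compute-side edits.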

Shared clusters — for the people who can't create

If analysts can't create clusters, they need clusters waiting for them. So the analytics workspace runs a couple of always-available shared, all-purpose clusters:

| Cluster | Policy | Runtime | Shape |
| --- | --- | --- | --- |
| `cp_shared_small` | cp_cpu_small | 14.3.x-scala2.12 | m6g.large, 1 driver + 1 worker |
| `cp_shared_medium` | cp_cpu_medium | 14.3.x-scala2.12 | m6g.xlarge, 1 driver + 0 min workers |

These are built from the same policies engineers use, so they inherit the same autotermination and isolation rules. Access is granted per group via ACLs (CAN_USE, CAN_ATTACH_TO) — an analyst attaches, runs their notebook, and never touches a provisioning decision.
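A shared cluster plus its ACL might look something like the sketch below, using `databricks_cluster` and `databricks_permissions`. The input variables, the `analysts` group name, and the assumption that the policy accepts a fixed single worker are illustrative. Only `CAN_ATTACH_TO` appears here because it is the cluster-level permission; `CAN_USE` is the permission level that applies to cluster policies.

```hcl
# Sketch only. Inputs and the group name are hypothetical.
variable "pool_ids" {
  type = map(string)
}

variable "policy_ids" {
  type = map(string)
}

resource "databricks_cluster" "shared_small" {
  cluster_name  = "cp_shared_small"
  spark_version = "14.3.x-scala2.12"

  # Built from the same policy engineers use, so the same rules apply.
  policy_id               = var.policy_ids["cpu_small"]
  instance_pool_id        = var.pool_ids["cpu_small"]
  driver_instance_pool_id = var.pool_ids["cpu_small"]

  data_security_mode      = "USER_ISOLATION"
  autotermination_minutes = 10

  # 1 driver + 1 worker, per the table (assumes the policy allows a fixed size).
  num_workers = 1
}

resource "databricks_permissions" "shared_small" {
  cluster_id = databricks_cluster.shared_small.id

  access_control {
    group_name       = "analysts"
    permission_level = "CAN_ATTACH_TO"
  }
}
```

One thing worth checking in a setup like this: `CAN_ATTACH_TO` does not include restarting a terminated cluster, so with a 10-minute autotermination someone or something with `CAN_RESTART` or higher has to bring the cluster back.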

The pipeline workspace, by contrast, has no shared clusters — it's job-driven, so clusters are ephemeral and spun up per run from job policies. Different workspace, different compute personality, same governance primitives.

Apply order: pool → policy → compute (and the Terragrunt trap)

Here's where Infrastructure-as-Code bites. The three layers have a hard dependency chain:

instance-pool  →  cluster-policy  →  compute (clusters)

Policies reference pool IDs. Clusters reference policy IDs. So you must apply in order:

# 1. pools first
atlantis apply -d .../ws-landing/instance-pool
# 2. re-plan, then policies (they now see real pool IDs)
atlantis apply -d .../ws-landing/cluster-policy
# 3. re-plan, then clusters
atlantis apply -d .../ws-landing/compute

And now the gotcha that cost us a confused afternoon. On the very first plan — before pools are applied — the policy and compute plans fail with:

Error: ... instance-pool ... is a dependency of ... cluster-policy ...
       but detected no outputs. ...

This looks like a broken config. It isn't. Terragrunt lets a dependency return mock outputs so downstream modules can plan before the upstream is applied — but only for the Terraform commands you allowlist:

mock_outputs_allowed_terraform_commands = ["validate", "plan"]

The failure happens during terragrunt init, and init isn't in that list. So on a cold bootstrap, init tries to fetch the real output of an unapplied pool, finds nothing, and dies. On our older workspaces this never surfaced — those pools were already applied, so real outputs existed.

Two ways out:

1. Just apply in order (pool → re-plan → policy → re-plan → compute). Once the upstream is applied, real outputs exist and the mock is never needed. This is what we did.
2. Add "init" to `mock_outputs_allowed_terraform_commands` on the policy/compute modules. The mock now covers the init phase too, but you still apply in order, because a cluster genuinely can't be created against a pool that doesn't exist yet.
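For reference, the second option corresponds to a Terragrunt `dependency` block along these lines in the cluster-policy module. The relative `config_path`, the `pool_ids` output name, and the mock values are assumptions for illustration; only the `mock_outputs_allowed_terraform_commands` setting comes from the text above.

```hcl
# cluster-policy/terragrunt.hcl (sketch only; paths and output names are hypothetical)
dependency "instance_pool" {
  config_path = "../instance-pool"

  # Placeholder values used only while the pools have no real outputs yet.
  mock_outputs = {
    pool_ids = {
      cpu_small  = "mock-pool-id"
      cpu_medium = "mock-pool-id"
      cpu_large  = "mock-pool-id"
    }
  }

  # Adding "init" lets the mock cover the init phase as well as validate and plan.
  mock_outputs_allowed_terraform_commands = ["init", "validate", "plan"]
}

inputs = {
  # Resolves to the mock on a cold bootstrap and to real IDs once pools are applied.
  pool_ids = dependency.instance_pool.outputs.pool_ids
}
```

Because `apply` stays off the allowlist, a mocked pool ID can never leak into a real policy; the apply still fails loudly until the upstream module has been applied.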

The lesson: on a greenfield deployment, "detected no outputs" almost always means "you haven't applied the thing upstream yet," not "your dependency block is wrong."

Takeaways

- Three layers, one funnel. Pools = speed. Policies = the real governance (instance type, size, autotermination: non-negotiable). Entitlement gate = who's even allowed to create.
- Roles map cleanly onto the layers. Admin is unrestricted, engineer creates within a policy, analyst creates nothing and uses shared clusters. `allow_cluster_create` = off is what makes the analyst tier real.
- Autotermination baked into policy is the single highest-ROI rule you can ship. It ends surprise weekend bills.
- Apply order is load-bearing: pool → policy → compute. The first cold plan failing with "detected no outputs" is a Terragrunt mock/init quirk, not a bug in your code.

So we applied it. Pools came up — real IDs issued. Policies came up — real IDs issued. Then we ran the compute apply to create those two shared clusters, and...

databricks_cluster.this["shared_small"]: Still creating... [10m20s elapsed]
...
Error: cannot create cluster: failed to reach RUNNING, got TERMINATED:
  Self-bootstrap timed out during launch ... BOOTSTRAP_TIMEOUT

The EC2 nodes booted. Status checks passed, 3/3. And the clusters refused to start anyway — 25 minutes of "Still creating," then TERMINATED. Not a policy problem. Not an IAM problem. Something in the network was quietly eating our packets.

Next: The BOOTSTRAP_TIMEOUT Mystery — tracing a Databricks cluster from data plane to control plane, across three AWS accounts and one very quiet firewall.
