📚 Series: Databricks on AWS (Part 3)
- Building a Databricks AI Platform on AWS
- RBAC with Function-Role Groups
- Compute Governance: Pools, Policies, Clusters ← you are here
- The BOOTSTRAP_TIMEOUT Mystery
- Fixing It with AWS PrivateLink
- How We Structure the Terraform
RBAC decides what data a user can touch. Compute governance decides what hardware they're allowed to spin up. Here's how instance pools, cluster policies, and an entitlement gate turn "anyone can launch a 128-core GPU box" into "you get exactly what your role permits."
In Part 2 we built the RBAC model: users map to function roles, function roles map to access roles, access roles get grants. That controls which data someone can touch.
But there's a second axis nobody talks about until the cloud bill arrives: compute. Left ungoverned, a single curious analyst can launch a cluster of r6i.4xlarge nodes (16 vCPUs and 128 GB of memory each) at 2 a.m., forget to turn it off, and greet you with a five-figure surprise on Monday. Governance is the answer, and on Databricks it comes in three layers.
The three layers, top to bottom
Think of it as a funnel. Each layer narrows what a user can actually do:
| Layer | What it does | Who feels it |
|---|---|---|
| Instance pool | Pre-warmed VMs waiting to be claimed — faster cluster starts | Everyone (transparently) |
| Cluster policy | The rules: which instance types, sizes, autotermination, runtime | Engineers creating clusters |
| Entitlement gate | `allow_cluster_create` on/off per group | Non-admins (blocked) |
And on top of all that: shared, pre-made clusters for the people who shouldn't be creating anything at all.
Let's go layer by layer.
Layer 1: Instance Pools — pre-warmed VMs
A cold cluster start on AWS means: request EC2 capacity → wait for the instances → install the Databricks runtime → join the cluster. That's several minutes of a user staring at a spinner.
An instance pool keeps a set of VMs pre-acquired (or ready to be acquired fast) so clusters attach to them instead of provisioning from scratch. It's purely about speed and cost predictability — a pool doesn't restrict anything on its own.
For our workspace we defined six pools, split by CPU/GPU and by size:
| Pool | Instance type | Capacity | Tag |
|---|---|---|---|
| `ip_cpu_small` | m6g.large (2 vCPU / 8 GB) | 10 | cpu/small |
| `ip_cpu_medium` | m6g.xlarge (4 vCPU / 16 GB) | 15 | cpu/medium |
| `ip_cpu_large` | r6i.2xlarge (8 vCPU / 64 GB) | 15 | cpu/large |
| `ip_cpu_xlarge` | r6i.4xlarge (16 vCPU / 128 GB) | 20 | cpu/xlarge |
| `ip_gpu_small` | g5.xlarge (1× A10G) | 10 | gpu/small |
| `ip_gpu_large` | g5.2xlarge (1× A10G) | 20 | gpu/large |
The important knobs are shared across all of them:
- `min_idle_instances = 0`: we don't pay to keep VMs warm 24/7; the pool spins up on demand and the first start after idle pays the cold-start tax. A tradeoff we accepted for cost.
- `idle_instance_autotermination_minutes = 10`: idle VMs release themselves after 10 minutes.
- `availability = ON_DEMAND`: no spot reclaim surprises for interactive work.
The custom_tags matter more than they look: they flow into AWS cost allocation, so you can answer "how much did the GPU pools cost this month?" without guessing.
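As a rough illustration, one of these pools might look like the following with the Databricks Terraform provider. This is a minimal sketch, not our actual module: the resource label and the `pool_class` tag key are hypothetical, and it assumes the "Capacity" column maps to `max_capacity`.

```hcl
# Sketch of a single pool. The resource label and tag key are illustrative.
resource "databricks_instance_pool" "cpu_small" {
  instance_pool_name = "ip_cpu_small"
  node_type_id       = "m6g.large"

  min_idle_instances                    = 0  # nothing kept warm while idle
  max_capacity                          = 10 # assumed meaning of "Capacity"
  idle_instance_autotermination_minutes = 10 # idle VMs are released after 10 min

  aws_attributes {
    availability = "ON_DEMAND" # no spot reclaim for interactive work
  }

  # Hypothetical tag key; the table above only shows the value.
  custom_tags = {
    pool_class = "cpu/small"
  }
}
```

The other five pools would differ only in name, `node_type_id`, capacity, and tag value, which makes them a natural fit for a `for_each` over a map.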
Layer 2: Cluster Policies — the actual governance
Instance pools make clusters fast. Cluster policies make clusters legal.
A policy is a template that constrains what a cluster can be: which pool it draws from, which Spark runtime, min/max workers, autotermination, and how many clusters one user can create. A user with a policy can only create clusters that fit inside it — they can't override the instance type, can't disable autotermination, can't ask for 200 workers.
Our analytics workspace has eight policies. A representative slice:
| Policy | Pool (worker / driver) | Runtime | Workers | Autoterm (min) | Max clusters/user |
|---|---|---|---|---|---|
| `cp_cpu_small` | `ip_cpu_small` / `ip_cpu_small` | 14.3.x-scala2.12 | 1–4 | 10 | 2 |
| `cp_cpu_medium` | `ip_cpu_medium` / `ip_cpu_medium` | 14.3.x-scala2.12 | 0–8 | 10 | 2 |
| `cp_cpu_large` | `ip_cpu_large` / `ip_cpu_medium` | 14.3.x-scala2.12 | 0–12 | 10 | 1 |
| `cp_gpu_small` | `ip_gpu_small` / `ip_cpu_small` | 14.3.x-gpu-ml-scala2.12 | 0–8 | 10 | 1 |
| `cp_job_standard` | `ip_cpu_medium` / `ip_cpu_medium` | 14.3.x-scala2.12 | 0–16 | 30 | - |
A few design choices worth calling out:
- Autotermination is not optional. The policy sets it (10 min for interactive, 30 for jobs) and the user can't remove it. This single rule kills the "forgot to shut it down over the weekend" bill.
- `data_security_mode = USER_ISOLATION` across the board: Unity Catalog enforcement stays on. Governance from Part 2 doesn't get bypassed by a cleverly configured cluster.
- The driver often runs on a smaller pool than the workers (e.g. `cp_cpu_large` uses `ip_cpu_large` workers but an `ip_cpu_medium` driver). The driver rarely needs to match worker muscle, and this trims cost.
- `max_clusters_per_user` caps sprawl. The big policies are limited to 1 cluster per person.
- A shared `cost_center` tag on every policy feeds the same cost-allocation story as the pools.
Job policies (higher worker ceilings, longer autotermination) live alongside the interactive ones, so batch pipelines get their own lane without borrowing the interactive budget.
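To make the table concrete, here is a hedged sketch of what `cp_cpu_large` could look like as a `databricks_cluster_policy`. The two input variables, the `cost_center` value, and the choice to express the worker range through autoscale attributes are assumptions made for illustration; the real policies may be written differently.

```hcl
# Hypothetical inputs: pool IDs produced by the instance-pool module.
variable "worker_pool_id" { type = string } # e.g. the ip_cpu_large pool
variable "driver_pool_id" { type = string } # e.g. the ip_cpu_medium pool

resource "databricks_cluster_policy" "cpu_large" {
  name                  = "cp_cpu_large"
  max_clusters_per_user = 1

  definition = jsonencode({
    # "fixed" pins a value; "hidden" removes the field from the UI form.
    "instance_pool_id"        = { type = "fixed", value = var.worker_pool_id, hidden = true }
    "driver_instance_pool_id" = { type = "fixed", value = var.driver_pool_id, hidden = true }
    "spark_version"           = { type = "fixed", value = "14.3.x-scala2.12" }
    "autotermination_minutes" = { type = "fixed", value = 10, hidden = true }
    "data_security_mode"      = { type = "fixed", value = "USER_ISOLATION" }

    # Assumption: the 0-12 worker range is enforced on autoscale bounds.
    "autoscale.min_workers" = { type = "range", minValue = 0, maxValue = 12 }
    "autoscale.max_workers" = { type = "range", maxValue = 12 }

    # Hypothetical tag value.
    "custom_tags.cost_center" = { type = "fixed", value = "analytics" }
  })
}
```

Because the pool IDs and autotermination are `fixed`, a cluster request that tries to change them is rejected rather than silently corrected.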
Layer 3: The Entitlement Gate — allow_cluster_create
Policies govern how someone creates a cluster. But some people shouldn't create clusters at all.
Databricks has a workspace-level entitlement for exactly this: allow_cluster_create. Turn it off for a group, and members of that group physically cannot create clusters — the button is gone, the API call is rejected. It doesn't matter what policies exist; the door is locked before they reach it.
This is the gate that makes the whole role model coherent:
| Role | `allow_cluster_create` | Cluster policies | What they can do |
|---|---|---|---|
| Admin | on | all / unrestricted | Create anything. Break glass. |
| Engineer | on | assigned policies only | Create clusters — but only inside their policy's box |
| Analyst | off | none | No creation at all — only attach to pre-made shared clusters |
An engineer gets freedom within guardrails. An analyst gets no guardrails because they get no steering wheel — they use what's already there. And that "what's already there" is the last piece.
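In Terraform, the gate and the policy assignment can be sketched roughly as below. The group names `analysts` and `engineers` and the `policy_id` variable are hypothetical, and this assumes the groups already exist in the workspace.

```hcl
# Hypothetical group names; assumes both groups already exist.
data "databricks_group" "analysts" {
  display_name = "analysts"
}

data "databricks_group" "engineers" {
  display_name = "engineers"
}

# Hypothetical input: ID of a policy such as cp_cpu_large.
variable "policy_id" { type = string }

# Analysts: the gate is closed.
# This resource manages the group's entitlements as a set,
# so workspace_access is spelled out as well.
resource "databricks_entitlements" "analysts" {
  group_id             = data.databricks_group.analysts.id
  allow_cluster_create = false
  workspace_access     = true
}

# Engineers: the gate is open.
resource "databricks_entitlements" "engineers" {
  group_id             = data.databricks_group.engineers.id
  allow_cluster_create = true
  workspace_access     = true
}

# Engineers are additionally granted use of a specific policy.
resource "databricks_permissions" "policy_use" {
  cluster_policy_id = var.policy_id

  access_control {
    group_name       = data.databricks_group.engineers.display_name
    permission_level = "CAN_USE"
  }
}
```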
Shared clusters — for the people who can't create
If analysts can't create clusters, they need clusters waiting for them. So the analytics workspace runs a couple of always-available shared, all-purpose clusters:
| Cluster | Policy | Runtime | Shape |
|---|---|---|---|
| `cp_shared_small` | `cp_cpu_small` | 14.3.x-scala2.12 | m6g.large, 1 driver + 1 worker |
| `cp_shared_medium` | `cp_cpu_medium` | 14.3.x-scala2.12 | m6g.xlarge, 1 driver + 0 min workers |
These are built from the same policies engineers use, so they inherit the same autotermination and isolation rules. Access is granted per group via ACLs (CAN_USE, CAN_ATTACH_TO) — an analyst attaches, runs their notebook, and never touches a provisioning decision.
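A shared cluster plus its attach permission might be declared along these lines. Treat it as a sketch under assumptions: the variable names and the `analysts` group are hypothetical, and a fixed `num_workers = 1` is just one way to express "1 driver + 1 worker"; whichever form is used has to agree with what the policy constrains.

```hcl
# Hypothetical inputs from the cluster-policy and instance-pool modules.
variable "cp_cpu_small_policy_id" { type = string }
variable "ip_cpu_small_pool_id" { type = string }

resource "databricks_cluster" "shared_small" {
  cluster_name  = "cp_shared_small"
  policy_id     = var.cp_cpu_small_policy_id
  spark_version = "14.3.x-scala2.12"

  # Driver and worker both draw from the small CPU pool.
  instance_pool_id        = var.ip_cpu_small_pool_id
  driver_instance_pool_id = var.ip_cpu_small_pool_id

  num_workers             = 1 # assumed: fixed size, 1 driver + 1 worker
  autotermination_minutes = 10
  data_security_mode      = "USER_ISOLATION"
}

# Analysts may attach notebooks but not restart or reconfigure the cluster.
resource "databricks_permissions" "shared_small_attach" {
  cluster_id = databricks_cluster.shared_small.id

  access_control {
    group_name       = "analysts" # hypothetical group name
    permission_level = "CAN_ATTACH_TO"
  }
}
```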
The pipeline workspace, by contrast, has no shared clusters — it's job-driven, so clusters are ephemeral and spun up per run from job policies. Different workspace, different compute personality, same governance primitives.
Apply order: pool → policy → compute (and the Terragrunt trap)
Here's where Infrastructure-as-Code bites. The three layers have a hard dependency chain:
instance-pool → cluster-policy → compute (clusters)
Policies reference pool IDs. Clusters reference policy IDs. So you must apply in order:
# 1. pools first
atlantis apply -d .../ws-landing/instance-pool
# 2. re-plan, then policies (they now see real pool IDs)
atlantis apply -d .../ws-landing/cluster-policy
# 3. re-plan, then clusters
atlantis apply -d .../ws-landing/compute
And now the gotcha that cost us a confused afternoon. On the very first plan — before pools are applied — the policy and compute plans fail with:
Error: ... instance-pool ... is a dependency of ... cluster-policy ...
but detected no outputs. ...
This looks like a broken config. It isn't. Terragrunt lets a dependency return mock outputs so downstream modules can plan before the upstream is applied — but only for the Terraform commands you allowlist:
mock_outputs_allowed_terraform_commands = ["validate", "plan"]
The failure happens during terragrunt init, and init isn't in that list. So on a cold bootstrap, init tries to fetch the real output of an unapplied pool, finds nothing, and dies. On our older workspaces this never surfaced — those pools were already applied, so real outputs existed.
Two ways out:
- Just apply in order (pool → re-plan → policy → re-plan → compute). Once the upstream is applied, real outputs exist and the mock is never needed. This is what we did.
- Add `"init"` to `mock_outputs_allowed_terraform_commands` on the policy/compute modules. The mock now covers the init phase too, but you still apply in order, because a cluster genuinely can't be created against a pool that doesn't exist yet.
The lesson: on a greenfield deployment, "detected no outputs" almost always means "you haven't applied the thing upstream yet," not "your dependency block is wrong."
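For reference, the second option could look roughly like this in the cluster-policy module's `terragrunt.hcl`. The relative `config_path`, the `instance_pool_ids` output name, and the mock value are assumptions for illustration; only `mock_outputs_allowed_terraform_commands` comes from our actual config.

```hcl
# terragrunt.hcl for the cluster-policy module (sketch).
dependency "instance_pool" {
  # Assumed layout: instance-pool is a sibling directory.
  config_path = "../instance-pool"

  # Placeholder values, used only when the upstream has no real outputs yet.
  # The output name "instance_pool_ids" is hypothetical.
  mock_outputs = {
    instance_pool_ids = {
      ip_cpu_small = "mock-pool-id"
    }
  }

  # Adding "init" lets a cold bootstrap get past the init phase.
  mock_outputs_allowed_terraform_commands = ["init", "validate", "plan"]
}

inputs = {
  instance_pool_ids = dependency.instance_pool.outputs.instance_pool_ids
}
```

Keeping `apply` out of the allowlist is deliberate: a mock pool ID should never reach a real policy.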
Takeaways
- Three layers, one funnel. Pools = speed. Policies = the real governance (instance type, size, autotermination — non-negotiable). Entitlement gate = who's even allowed to create.
- Roles map cleanly onto the layers. Admin is unrestricted, engineer creates within a policy, analyst creates nothing and uses shared clusters. `allow_cluster_create = off` is what makes the analyst tier real.
- Autotermination baked into policy is the single highest-ROI rule you can ship. It ends surprise weekend bills.
- Apply order is load-bearing. pool → policy → compute, and the first cold plan failing with "detected no outputs" is a Terragrunt mock/`init` quirk, not a bug in your code.
So we applied it. Pools came up — real IDs issued. Policies came up — real IDs issued. Then we ran the compute apply to create those two shared clusters, and...
databricks_cluster.this["shared_small"]: Still creating... [10m20s elapsed]
...
Error: cannot create cluster: failed to reach RUNNING, got TERMINATED:
Self-bootstrap timed out during launch ... BOOTSTRAP_TIMEOUT
The EC2 nodes booted. Status checks passed, 3/3. And the clusters refused to start anyway — 25 minutes of "Still creating," then TERMINATED. Not a policy problem. Not an IAM problem. Something in the network was quietly eating our packets.
Next: The BOOTSTRAP_TIMEOUT Mystery — tracing a Databricks cluster from data plane to control plane, across three AWS accounts and one very quiet firewall.