DEV Community

duke

[Databricks on AWS #3] Compute Governance on Databricks: Instance Pools, Cluster Policies, and Shared Clusters

📚 Series: Databricks on AWS (Part 3)

  1. Building a Databricks AI Platform on AWS
  2. RBAC with Function-Role Groups
  3. Compute Governance: Pools, Policies, Clusters ← you are here
  4. The BOOTSTRAP_TIMEOUT Mystery
  5. Fixing It with AWS PrivateLink
  6. How We Structure the Terraform

RBAC decides who a user is. Compute governance decides what hardware they're allowed to spin up. Here's how instance pools, cluster policies, and an entitlement gate turn "anyone can launch a 128-core GPU box" into "you get exactly what your role permits."

In Part 2 we built the RBAC model: users map to function roles, function roles map to access roles, access roles get grants. That controls which data someone can touch.

But there's a second axis nobody talks about until the cloud bill arrives: compute. Left ungoverned, a single curious analyst can launch a cluster of r6i.4xlarge nodes (16 vCPUs and 128 GB each) at 2 a.m., forget to turn it off, and greet you with a five-figure surprise on Monday. Governance is the answer, and on Databricks it comes in three layers.


The three layers, top to bottom

Think of it as a funnel. Each layer narrows what a user can actually do:

| Layer | What it does | Who feels it |
| --- | --- | --- |
| Instance pool | Pre-warmed VMs waiting to be claimed, for faster cluster starts | Everyone (transparently) |
| Cluster policy | The rules: which instance types, sizes, autotermination, runtime | Engineers creating clusters |
| Entitlement gate | allow_cluster_create on/off per group | Non-admins (blocked) |

And on top of all that: shared, pre-made clusters for the people who shouldn't be creating anything at all.

Let's go layer by layer.


Layer 1: Instance Pools — pre-warmed VMs

A cold cluster start on AWS means: request EC2 capacity → wait for the instances → install the Databricks runtime → join the cluster. That's several minutes of a user staring at a spinner.

An instance pool keeps a set of VMs pre-acquired (or ready to be acquired fast) so clusters attach to them instead of provisioning from scratch. It's purely about speed and cost predictability — a pool doesn't restrict anything on its own.

For our workspace we defined six pools, split by CPU/GPU and by size:

| Pool | Instance type | Capacity | Tag |
| --- | --- | --- | --- |
| ip_cpu_small | m6g.large (2 vCPU / 8 GB) | 10 | cpu/small |
| ip_cpu_medium | m6g.xlarge (4 vCPU / 16 GB) | 15 | cpu/medium |
| ip_cpu_large | r6i.2xlarge (8 vCPU / 64 GB) | 15 | cpu/large |
| ip_cpu_xlarge | r6i.4xlarge (16 vCPU / 128 GB) | 20 | cpu/xlarge |
| ip_gpu_small | g5.xlarge (1× A10G) | 10 | gpu/small |
| ip_gpu_large | g5.2xlarge (1× A10G) | 20 | gpu/large |

The important knobs are shared across all of them:

  • min_idle_instances = 0 — we don't pay to keep VMs warm 24/7; the pool spins up on demand and the first start after idle pays the cold-start tax. A tradeoff we accepted for cost.
  • idle_instance_autotermination_minutes = 10 — idle VMs release themselves after 10 minutes.
  • availability = ON_DEMAND — no spot reclaim surprises for interactive work.

The custom_tags matter more than they look: they flow into AWS cost allocation, so you can answer "how much did the GPU pools cost this month?" without guessing.
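
As a rough illustration, the ip_cpu_small row and the shared knobs above might look like this with the Databricks Terraform provider. This is a sketch, not our actual module: the resource label and the tag keys are hypothetical, and mapping the "Capacity" column to max_capacity is an assumption.

```hcl
# Sketch only. Resource label and tag keys are illustrative assumptions.
resource "databricks_instance_pool" "cpu_small" {
  instance_pool_name = "ip_cpu_small"
  node_type_id       = "m6g.large"

  # Nothing is kept warm while the pool is idle; the first start pays the cold-start tax.
  min_idle_instances = 0

  # Assumed to correspond to the "Capacity" column in the table above.
  max_capacity = 10

  # Idle VMs are released back to AWS after 10 minutes.
  idle_instance_autotermination_minutes = 10

  aws_attributes {
    availability = "ON_DEMAND" # no spot reclaim for interactive work
  }

  # Hypothetical key/value split of the "cpu/small" tag; these propagate to the EC2 instances.
  custom_tags = {
    compute_class = "cpu"
    compute_size  = "small"
  }
}
```

For the tags to show up in AWS billing reports, they also have to be activated as cost allocation tags on the AWS side.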

Layer 2: Cluster Policies — the actual governance

Instance pools make clusters fast. Cluster policies make clusters legal.

A policy is a template that constrains what a cluster can be: which pool it draws from, which Spark runtime, min/max workers, autotermination, and how many clusters one user can create. A user with a policy can only create clusters that fit inside it — they can't override the instance type, can't disable autotermination, can't ask for 200 workers.

Our analytics workspace has eight policies. A representative slice:

| Policy | Pool (worker / driver) | Runtime | Workers | Autoterm (min) | Max clusters/user |
| --- | --- | --- | --- | --- | --- |
| cp_cpu_small | ip_cpu_small / ip_cpu_small | 14.3.x-scala2.12 | 1–4 | 10 | 2 |
| cp_cpu_medium | ip_cpu_medium / ip_cpu_medium | 14.3.x-scala2.12 | 0–8 | 10 | 2 |
| cp_cpu_large | ip_cpu_large / ip_cpu_medium | 14.3.x-scala2.12 | 0–12 | 10 | 1 |
| cp_gpu_small | ip_gpu_small / ip_cpu_small | 14.3.x-gpu-ml-scala2.12 | 0–8 | 10 | 1 |
| cp_job_standard | ip_cpu_medium / ip_cpu_medium | 14.3.x-scala2.12 | 0–16 | 30 | |

A few design choices worth calling out:

  • Autotermination is not optional. The policy sets it (10 min for interactive, 30 for jobs) and the user can't remove it. This single rule kills the "forgot to shut it down over the weekend" bill.
  • data_security_mode = USER_ISOLATION across the board — Unity Catalog enforcement stays on. Governance from Part 2 doesn't get bypassed by a cleverly configured cluster.
  • Driver often runs on a smaller pool than workers (e.g. cp_cpu_large uses ip_cpu_large workers but an ip_cpu_medium driver). The driver rarely needs to match worker muscle, and this trims cost.
  • max_clusters_per_user caps sprawl. The big policies are limited to 1 cluster per person.
  • A shared cost_center tag on every policy feeds the same cost-allocation story as the pools.

Job policies (higher worker ceilings, longer autotermination) live alongside the interactive ones, so batch pipelines get their own lane without borrowing the interactive budget.
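
To make the shape of a policy concrete, here is a minimal sketch of what the cp_cpu_large row could look like as a databricks_cluster_policy. The variable names and the cost_center value are hypothetical, and expressing "0–12 workers" through the autoscale attributes is an assumption about how the range is enforced.

```hcl
# Sketch only. Variable names are hypothetical; in practice the pool IDs
# would come from the instance-pool module's outputs.
variable "ip_cpu_large_id" { type = string }
variable "ip_cpu_medium_id" { type = string }
variable "cost_center" { type = string }

resource "databricks_cluster_policy" "cpu_large" {
  name                  = "cp_cpu_large"
  max_clusters_per_user = 1

  definition = jsonencode({
    # Workers come from the large pool, the driver from the medium pool.
    "instance_pool_id"        = { type = "fixed", value = var.ip_cpu_large_id }
    "driver_instance_pool_id" = { type = "fixed", value = var.ip_cpu_medium_id }
    "spark_version"           = { type = "fixed", value = "14.3.x-scala2.12" }

    # Assumed encoding of the "0–12 workers" column.
    "autoscale.min_workers" = { type = "fixed", value = 0 }
    "autoscale.max_workers" = { type = "range", maxValue = 12 }

    # "fixed" means the user can neither change nor remove the value.
    "autotermination_minutes" = { type = "fixed", value = 10 }
    "data_security_mode"      = { type = "fixed", value = "USER_ISOLATION" }
    "custom_tags.cost_center" = { type = "fixed", value = var.cost_center }
  })
}
```

The "fixed" type is what makes autotermination non-optional: any create request that tries to override the attribute is rejected by the policy.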

Layer 3: The Entitlement Gate — allow_cluster_create

Policies govern how someone creates a cluster. But some people shouldn't create clusters at all.

Databricks has a workspace-level entitlement for exactly this: allow_cluster_create. Turn it off for a group, and members of that group cannot create unrestricted clusters. Keep that group off every policy's CAN_USE list as well, and they cannot create clusters at all: the button is gone, the API call is rejected, and the door is locked before they reach it.

This is the gate that makes the whole role model coherent:

| Role | allow_cluster_create | Cluster policies | What they can do |
| --- | --- | --- | --- |
| Admin | on | all / unrestricted | Create anything. Break glass. |
| Engineer | on | assigned policies only | Create clusters, but only inside their policy's box |
| Analyst | off | none | No creation at all, only attach to pre-made shared clusters |
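
In Terraform terms, the analyst and engineer rows could be sketched roughly as below. The group names "analysts" and "engineers" and the policy ID variable are hypothetical; only the resource types and attributes come from the Databricks provider.

```hcl
# Sketch only. Group names and the policy ID variable are hypothetical.
variable "cp_cpu_small_id" { type = string }

data "databricks_group" "analysts" {
  display_name = "analysts"
}

# databricks_entitlements is authoritative for the principal, so every
# entitlement the group should keep is listed explicitly.
resource "databricks_entitlements" "analysts" {
  group_id             = data.databricks_group.analysts.id
  workspace_access     = true
  allow_cluster_create = false # the gate: no cluster creation for analysts
}

# Engineers reach a policy through a CAN_USE grant on it; analysts get no such grant.
resource "databricks_permissions" "cp_cpu_small_use" {
  cluster_policy_id = var.cp_cpu_small_id

  access_control {
    group_name       = "engineers"
    permission_level = "CAN_USE"
  }
}
```

One thing to check in your own workspace: a principal's effective entitlements are the union of all groups it belongs to, so the gate only holds if no other group the analyst is in grants cluster creation.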

An engineer gets freedom within guardrails. An analyst gets no guardrails because they get no steering wheel — they use what's already there. And that "what's already there" is the last piece.

Shared clusters — for the people who can't create

If analysts can't create clusters, they need clusters waiting for them. So the analytics workspace runs a couple of always-available shared, all-purpose clusters:

| Cluster | Policy | Runtime | Shape |
| --- | --- | --- | --- |
| cp_shared_small | cp_cpu_small | 14.3.x-scala2.12 | m6g.large, 1 driver + 1 worker |
| cp_shared_medium | cp_cpu_medium | 14.3.x-scala2.12 | m6g.xlarge, 1 driver + 0 min workers |

These are built from the same policies engineers use, so they inherit the same autotermination and isolation rules. Access is granted per group via ACLs: CAN_ATTACH_TO on the cluster itself (CAN_USE is the corresponding permission level on policies). An analyst attaches, runs their notebook, and never touches a provisioning decision.
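
A minimal sketch of one shared cluster plus its ACL, assuming the pool and policy IDs arrive as inputs. The variable names and the "analysts" group name are hypothetical, and using a fixed num_workers for the "1 driver + 1 worker" shape is an assumption.

```hcl
# Sketch only. Variable names and the group name are hypothetical.
variable "cp_cpu_small_id" { type = string }
variable "ip_cpu_small_id" { type = string }

resource "databricks_cluster" "shared_small" {
  cluster_name  = "cp_shared_small"
  policy_id     = var.cp_cpu_small_id
  spark_version = "14.3.x-scala2.12"

  # Driver and worker both draw from the small CPU pool.
  instance_pool_id        = var.ip_cpu_small_id
  driver_instance_pool_id = var.ip_cpu_small_id
  num_workers             = 1

  # These must match what the policy fixes, or the create call is rejected.
  autotermination_minutes = 10
  data_security_mode      = "USER_ISOLATION"
}

resource "databricks_permissions" "shared_small" {
  cluster_id = databricks_cluster.shared_small.id

  access_control {
    group_name       = "analysts"
    permission_level = "CAN_ATTACH_TO" # attach only; no restart or edit rights
  }
}
```

Note that CAN_ATTACH_TO does not let a user start a terminated cluster; that requires CAN_RESTART, so the choice of level interacts with the 10-minute autotermination.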

The pipeline workspace, by contrast, has no shared clusters — it's job-driven, so clusters are ephemeral and spun up per run from job policies. Different workspace, different compute personality, same governance primitives.


Apply order: pool → policy → compute (and the Terragrunt trap)

Here's where Infrastructure-as-Code bites. The three layers have a hard dependency chain:

instance-pool  →  cluster-policy  →  compute (clusters)

Policies reference pool IDs. Clusters reference policy IDs. So you must apply in order:

# 1. pools first
atlantis apply -d .../ws-landing/instance-pool
# 2. re-plan, then policies (they now see real pool IDs)
atlantis apply -d .../ws-landing/cluster-policy
# 3. re-plan, then clusters
atlantis apply -d .../ws-landing/compute

And now the gotcha that cost us a confused afternoon. On the very first plan — before pools are applied — the policy and compute plans fail with:

Error: ... instance-pool ... is a dependency of ... cluster-policy ...
       but detected no outputs. ...

This looks like a broken config. It isn't. Terragrunt lets a dependency return mock outputs so downstream modules can plan before the upstream is applied — but only for the Terraform commands you allowlist:

mock_outputs_allowed_terraform_commands = ["validate", "plan"]

The failure happens during terragrunt init, and init isn't in that list. So on a cold bootstrap, init tries to fetch the real output of an unapplied pool, finds nothing, and dies. On our older workspaces this never surfaced — those pools were already applied, so real outputs existed.

Two ways out:

  1. Just apply in order (pool → re-plan → policy → re-plan → compute). Once the upstream is applied, real outputs exist and the mock is never needed. This is what we did.
  2. Add "init" to mock_outputs_allowed_terraform_commands on the policy/compute modules. The mock now covers the init phase too — but you still apply in order, because a cluster genuinely can't be created against a pool that doesn't exist yet.
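
For reference, option 2 would look something like this in the cluster-policy module's terragrunt.hcl. The relative path, the pool_ids output name, and the mock value are hypothetical; only the dependency block attributes are Terragrunt's own.

```hcl
# cluster-policy/terragrunt.hcl (sketch; path and output name are assumptions)
dependency "instance_pool" {
  config_path = "../instance-pool"

  # Placeholder outputs used only while the upstream module has no state yet.
  mock_outputs = {
    pool_ids = {
      ip_cpu_small = "mock-pool-id"
    }
  }

  # Adding "init" lets the cold-bootstrap init succeed with mocks.
  # "apply" is deliberately absent: a real apply must see real pool IDs.
  mock_outputs_allowed_terraform_commands = ["init", "validate", "plan"]
}

inputs = {
  pool_ids = dependency.instance_pool.outputs.pool_ids
}
```

Keeping apply out of the allowlist is the safety net: if someone applies out of order, Terragrunt fails on the missing outputs instead of trying to create policies that reference a mock pool ID.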

The lesson: on a greenfield deployment, "detected no outputs" almost always means "you haven't applied the thing upstream yet," not "your dependency block is wrong."


Takeaways

  • Three layers, one funnel. Pools = speed. Policies = the real governance (instance type, size, autotermination — non-negotiable). Entitlement gate = who's even allowed to create.
  • Roles map cleanly onto the layers. Admin is unrestricted, engineer creates within a policy, analyst creates nothing and uses shared clusters. allow_cluster_create = off is what makes the analyst tier real.
  • Autotermination baked into policy is the single highest-ROI rule you can ship. It ends surprise weekend bills.
  • Apply order is load-bearing. pool → policy → compute, and the first cold plan failing with "detected no outputs" is a Terragrunt mock/init quirk, not a bug in your code.

So we applied it. Pools came up — real IDs issued. Policies came up — real IDs issued. Then we ran the compute apply to create those two shared clusters, and...

databricks_cluster.this["shared_small"]: Still creating... [10m20s elapsed]
...
Error: cannot create cluster: failed to reach RUNNING, got TERMINATED:
  Self-bootstrap timed out during launch ... BOOTSTRAP_TIMEOUT

The EC2 nodes booted. Status checks passed, 3/3. And the clusters refused to start anyway — 25 minutes of "Still creating," then TERMINATED. Not a policy problem. Not an IAM problem. Something in the network was quietly eating our packets.

Next: The BOOTSTRAP_TIMEOUT Mystery — tracing a Databricks cluster from data plane to control plane, across three AWS accounts and one very quiet firewall.
